How to create a pipeline with multiple inputs?

I want to apply feature selection by using a binary mask. The binary mask is created by using a template file derived from a prior meta-analysis. I use a threshold so that ones and zeros in the mask file correspond with z-values in the template object above or below this threshold.

In the next step I want to use this mask to ‘cut out’ features in the data set and pass this ‘feature selected subset’ over to other following pipeline steps. Finally I want to run an estimator such as SVM to fit a model to the data and predict the outcome.

Both the mask building procedure and other following pipeline steps use parameters which can be treated as hyperparameters and thus could be optimized using nested cross-validation. For example one could vary the threshold value I just mentioned above.

The problem: scikit-learn’s built-in functions such as cross_validate or GridSearchCV only accept two data arguments (a feature set X and a label list y), but I want a model-building procedure which accepts three arguments, namely X, y and the template file.

How can I (or better: is it possible to) implement both the optimization of the mask building procedure and the following pipeline steps in one pipeline? In other words: Does scikit-learn contain built-in options for building a pipeline which takes more than just the feature set X and the label list y?

You might need to create a custom transformer object that respects the scikit-learn API, is initialized with a z-map and a threshold, and computes the mask in its fit.
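
To make the suggestion concrete, here is a minimal sketch that works on plain NumPy arrays instead of Nifti images. The class name TemplateMaskSelector, the toy data and the parameter grid are all made up for illustration; the point is only that the template is handed to the constructor, so Pipeline and GridSearchCV still only ever see X and y, while the threshold can be tuned like any other hyperparameter.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC


class TemplateMaskSelector(TransformerMixin, BaseEstimator):
    """Hypothetical transformer: keep features whose template value exceeds a threshold."""

    def __init__(self, template, threshold=0.0):
        # Only store the parameters; all computation happens in fit.
        self.template = template
        self.threshold = threshold

    def fit(self, X, y=None):
        # The mask depends only on the template and the threshold, not on X or y.
        self.mask_ = np.asarray(self.template) > self.threshold
        return self

    def transform(self, X):
        # Keep only the columns selected by the mask.
        return np.asarray(X)[:, self.mask_]


rng = np.random.RandomState(0)
X = rng.randn(40, 20)              # 40 samples, 20 features (stand-in for voxels)
y = np.repeat([0, 1], 20)          # balanced toy labels
template = np.linspace(-2, 2, 20)  # stand-in for the meta-analysis z-map

pipe = Pipeline([
    ("select", TemplateMaskSelector(template)),
    ("svm", SVC(kernel="linear")),
])
# The mask threshold is tuned together with the SVM's C.
param_grid = {"select__threshold": [-1.0, 0.0, 1.0], "svm__C": [0.1, 1.0]}
search = GridSearchCV(pipe, param_grid, cv=3).fit(X, y)
print(search.best_params_)
```

For nested cross-validation, the fitted-on-the-fly search object could itself be passed to cross_validate, since GridSearchCV behaves like an estimator.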

Thanks @jeromedockes for the answer, I took a look at your link and started to build a custom transformer class that inherits from BaseEstimator and TransformerMixin. Then I realized that I face a second problem. The instance of that custom class itself has to have all the properties that a nibabel.nifti1.Nifti1Image object has (e.g. get_data(), shape, etc.) so that I can pass it over as mask to NiftiMasker.

I started to build a class that inherits not only from BaseEstimator and TransformerMixin but also from Nifti1Image, but I couldn’t get it to work. In theory these are the needed steps:

  1. Initialize instance of the above described custom nibabel.nifti1.Nifti1Image class. For that take a template niimg-like object as argument in __init__. The custom class can also optionally be initialized with a threshold or otherwise the threshold can be set during nested cross validation by set_params.
  2. Call fit on that instance to set the right threshold (this is only needed in case threshold is provided as percentile string in format such as “80%”)
  3. Call transform on that instance to binarize all voxels according to the given threshold and set all needed attributes of the instance equal to the binarized niimg-object (so the instance is a modified copy of the provided image).

Here’s my first attempt (without inheritance from Nifti1Image):

class NiftiBinarizer(BaseEstimator,TransformerMixin):

    def __init__(self, img, threshold=None): 
        self.img = img
        self.threshold = threshold
    def fit(self):
        # if threshold is provided as percentile calculate corresponding
        # percentile rank based on image data
        if isinstance(self.threshold,str):
            percentile = float(self.threshold.strip('%'))
            img_data = self.img.get_data()
            rank = np.percentile(img_data,percentile)
    def transform(self):
        # binarize image using threshold
        binary_img = math_img(f'img > {self.threshold}', img=self.img)
        return binary_img

your NiftiBinarizer will replace NiftiMasker in the pipeline, not a nifti image, so it doesn’t need to implement the Nifti1Image interface. I think you’re on the right track, but transform needs to receive as an argument the data that needs to be transformed (ie the time series) so it would look like

class NiftiBinarizer(BaseEstimator, TransformerMixin):
    def __init__(self, img, threshold=None):
        self.img = img
        self.threshold = threshold

    def fit(self, *args):
        # if you want to compute the threshold explicitly use
        # nilearn._utils.param_validation.check_threshold(self.threshold)
        # otherwise you can call threshold_img directly with the
        # str/float/NoneType threshold
        # also, to respect scikit-learn API, don't change the `threshold`
        # provided by the user, store the computed threshold in `threshold_`
        self.mask_img_ = nilearn.image.threshold_img(self.img, self.threshold)
        self.masker_ = nilearn.input_data.NiftiMasker(self.mask_img_).fit()
        return self

    def transform(self, img):
        if not hasattr(self, 'masker_'):
            raise ValueError('transformer not fitted yet.')
        return self.masker_.transform(img)

@jeromedockes thank you very much, you helped me for the second time!

I managed to set up a working example which is slightly different from your suggestion (it uses the function binarize_img to create a mask image):

import numpy as np
from nilearn.image import math_img

def binarize_img(img, threshold, ignore_zeros=True):
    """Binarize an image depending on a provided threshold.

    Parameters
    ----------
    img: Niimg-like object
    threshold: float or str
        If float, threshold is interpreted as absolute voxel intensity value.
        If provided as string, threshold is calculated based on percentile
        score corresponding to the provided percentile.
    ignore_zeros: boolean
        This parameter is intended to be used in combination with a string
        threshold. If True, voxels containing zeros will be ignored when
        calculating corresponding percentile rank. This is useful when the
        provided image already contains lots of zero voxels (so zeros will not
        be taken into account when calculating percentile rank and thus there
        is no bias in percentile calculation).
        Default: True

    Returns
    -------
    mask_img: Niimg-like object
        A binarized version of the provided image.
    """
    if isinstance(threshold, str):
        percentile = float(threshold.strip('%'))
        img_data = img.get_data()
        if ignore_zeros:
            img_data = img_data[np.nonzero(img_data)]
        threshold = np.percentile(img_data, percentile)
    # FIXME: For reasons of readability replace this with .get_data,
    # calculate binary img and then new_img_like?
    mask_img = math_img(f'img > {threshold}', img=img)
    return mask_img

class NiftiProcessor(BaseEstimator,TransformerMixin):
    def __init__(self, tpl_img, threshold=None):
        self.tpl_img = tpl_img
        self.threshold = threshold

    def fit(self, X, y=None):
        self.mask_img_ = binarize_img(self.tpl_img, self.threshold)
        self.masker_ = NiftiMasker(self.mask_img_).fit()

        return self

    def transform(self,X,y= None):
        if not hasattr(self, 'mask_img_'):
            raise ValueError('transformer not fitted yet.')
        return self.masker_.transform(X)

I am not sure if the function threshold_img does the right job here, since it doesn’t output a binary image object but only sets values below the threshold to zero. I tried your code and NiftiMasker complained that it wasn’t provided with a binary image.

I also wonder about why you set up fit and transform like this?

Thanks again, you really helped me get on the right track. Any comments or suggestions concerning my code are greatly appreciated.

sorry about that, you’re quite right it needs to be binarized. if you don’t want to reimplement the thresholding logic you could still use nilearn._utils.param_validation.check_threshold, otherwise your solution is also a good one.

I’m not sure I understand this question. the idea is that to be compatible with the scikit-learn API, and for example be usable in a Pipeline or GridSearchCV, the object must do nothing in its __init__ except store the parameters without changing their name. then all the preparation is done in the fit, which can be called as fit(X, y) (here you don’t care about X and y because only the template and threshold are used). fit also must return self. finally, transform transforms the given image (after checking that the object is fitted to get a more helpful error message if it isn’t).
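
A small sketch of why these rules matter, using a made-up ThresholdHolder class (not part of nilearn or scikit-learn): because __init__ only stores its arguments under the same names, the get_params, set_params and clone machinery that GridSearchCV relies on works without any extra code.

```python
from sklearn.base import BaseEstimator, TransformerMixin, clone


class ThresholdHolder(TransformerMixin, BaseEstimator):
    """Hypothetical minimal transformer used only to illustrate the API rules."""

    def __init__(self, template=None, threshold=None):
        # Store parameters under the same names, without modifying them.
        self.template = template
        self.threshold = threshold

    def fit(self, X=None, y=None):
        # Derived state goes into attributes with a trailing underscore.
        self.threshold_ = 0.0 if self.threshold is None else float(self.threshold)
        return self  # returning self allows chaining: est.fit(X).transform(X)

    def transform(self, X):
        return X


est = ThresholdHolder(template="zmap", threshold=2)
print(est.get_params())            # parameters are discovered from __init__'s signature
est.set_params(threshold=3.5)      # this is how a grid search changes the threshold
fresh = clone(est.fit())           # clone copies the parameters but not the fitted state
print(hasattr(fresh, "threshold_"))
```

If __init__ renamed or converted its arguments, clone would no longer be able to rebuild an equivalent unfitted copy, which is what a grid search does for every parameter combination.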

don’t hesitate if you have more questions or if this is not what you were asking.

in the line that calls math_img I would maybe pass threshold as an argument to math_img, or use str.format explicitly, because f’’ strings don’t exist in python 3.5 AFAIK, which is still widely used.

Thanks for the hint, I looked at check_threshold and in principle it does what I need, but I also need an option ignore_zeros because I don’t want the zeros to be included when calculating the percentile rank from the provided percentile string. That’s why I built my own function. I don’t know if this could be useful for somebody else in the future. If yes, one could think about making a pull request in nilearn.

Yes, good idea. I am also not really happy with my solution. It is not much code (which I wanted in the beginning) but now I think it’s not very good in terms of readability.

This is good to know and another reason to change it.

I tried to do that but I don’t know how (see my comment on this post). Maybe I just switch back to functions like .get_data() -> binarize data -> .new_img_like.

I think in this case it would be as simple as passing data[data != 0] instead of data to check_threshold. but your own function also works :)
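
To see why dropping the zeros matters, here is a toy NumPy example (the array is invented and stands in for the voxel values of a mostly empty template image):

```python
import numpy as np

# Toy stand-in for image data: mostly zeros (background) plus a few z-values.
data = np.array([0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 2.0, 3.0, 4.0])

with_zeros = np.percentile(data, 50)                # zeros dominate the ranking
without_zeros = np.percentile(data[data != 0], 50)  # only non-zero values are ranked
print(with_zeros, without_zeros)  # 0.0 2.5

# Binarize with the zero-free threshold: only values above 2.5 survive.
mask = data > without_zeros
print(int(mask.sum()))  # 2
```

With the zeros included, the 50th percentile is 0 and every non-zero voxel would end up in the mask; with the zeros excluded, only the upper half of the actual z-values is kept.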

ah yes you’re right. I guess new_img_like(img, img.get_data() > 0) would have been my reflex too, but the math_img solution is good, I just would write '...{}'.format(threshold) rather than f'...{threshold}'
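
For illustration, the str.format variant of the formula string looks like this (threshold is just an example value; the resulting string is what would be handed to math_img):

```python
threshold = 2.5

# str.format works in every Python 3 release, while f-strings need Python 3.6+.
formula = 'img > {}'.format(threshold)
print(formula)  # img > 2.5
```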

I’m not sure I understand this question. the idea is that to be compatible with the scikit-learn API […]

And I think I still struggle a little bit with understanding the setup of a scikit-learn pipeline. When do you explicitly have to put in X and y as arguments to fit and transform? I understand that *args acts like a ‘wildcard’ so it will accept any arguments that you provide to it including X and y. When I look at examples of fit and transform it seems that everybody writes it as fit(X,y=None) and transform(X,y=None) which made me think that including y is sort of mandatory here.

I also don’t understand how scikit-learn internally passes X and y through the pipeline. Will every new object get the output from the former object or will every pipeline object be provided with the same X and y? The second option doesn’t make sense to me because we want to transform our objects and pass the transformed X to the next pipeline object. But if the first option is true why don’t we also have to return y?

all steps of the pipeline but the last must be transformers; they are used as fit(X, y, **fit_params).transform(X), where fit_params can be empty. then the transformed X gets passed on to the next step in the pipeline. y is passed to the fit methods but doesn’t get transformed (and transformers typically don’t use it).

How does the next transformer know what it received from the transformer before? Will it automatically assume that the output from the former transformer is the transformed X? What if I also wanted to transform y (for example for correction of noise in labels), will it automatically assume that the first returned object of the former transformer is X and the second object is y?

the next transformer receives transformed X and original y. they don’t transform y
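
The data flow described above can be mimicked in a few lines of plain Python. This is a simplified illustration, not scikit-learn's actual implementation; fit_pipeline and the three toy classes are invented for the example.

```python
class AddOne:
    """Toy transformer: ignores y, adds 1 to every value of X."""
    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return [x + 1 for x in X]


class Double:
    """Toy transformer: ignores y, doubles every value of X."""
    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return [x * 2 for x in X]


class RecordingEstimator:
    """Toy final step: remembers what it was fitted on."""
    def fit(self, X, y):
        self.seen_X_, self.seen_y_ = list(X), list(y)
        return self


def fit_pipeline(steps, X, y):
    """Simplified picture of Pipeline.fit: X is chained through, y is not."""
    *transformers, final = steps
    for step in transformers:
        # Each transformer's output replaces X; the original y is passed along unchanged.
        X = step.fit(X, y).transform(X)
    return final.fit(X, y)


final = fit_pipeline([AddOne(), Double(), RecordingEstimator()],
                     [1, 2, 3], ["a", "b", "c"])
print(final.seen_X_)  # [4, 6, 8]
print(final.seen_y_)  # ['a', 'b', 'c']
```

Since transform only returns the new X, a step has no channel through which it could hand a modified y to the next step.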