I want to apply feature selection by using a binary mask. The binary mask is created from a template file derived from a prior meta-analysis. I use a threshold so that the ones and zeros in the mask file correspond to z-values in the template object above or below this threshold.
In the next step I want to use this mask to ‘cut out’ features in the data set and pass this feature-selected subset on to the subsequent pipeline steps. Finally, I want to run an estimator such as an SVM to fit a model to the data and predict the outcome.
Both the mask-building procedure and the subsequent pipeline steps use parameters that can be treated as hyperparameters and could therefore be optimized using nested cross-validation. For example, one could vary the threshold value I just mentioned above.
The problem: scikit-learn’s built-in functions such as cross_validate or GridSearchCV only accept two data arguments (a feature set X and a label list y), but I want a model-building procedure that accepts three arguments, namely X, y and the template file.
How can I (or better: is it even possible to) implement both the optimization of the mask-building procedure and the subsequent pipeline steps in one pipeline? In other words: does scikit-learn contain built-in options for building a pipeline that takes more than just the feature set X and the label list y?
Thanks @jeromedockes for the answer. I took a look at your link and started to build a custom transformer class that inherits from BaseEstimator and TransformerMixin. Then I realized that I face a second problem: the instance of that custom class itself has to have all the properties that a nibabel.nifti1.Nifti1Image object has (e.g. get_data(), shape, etc.) so that I can pass it over as a mask to NiftiMasker.
I started to build a class that inherits not only from BaseEstimator and TransformerMixin but also from Nifti1Image, but I couldn’t get it to work. In theory these are the needed steps:
1. Initialize an instance of the custom Nifti1Image-like class described above. For that, take a template niimg-like object as an argument in __init__. The class can also optionally be initialized with a threshold; otherwise the threshold can be set during nested cross-validation via set_params.
2. Call fit on that instance to set the right threshold (this is only needed in case the threshold is provided as a percentile string such as “80%”).
3. Call transform on that instance to binarize all voxels according to the given threshold and set all needed attributes of the instance equal to the binarized niimg object (so the instance becomes a modified copy of the provided image).
Here’s my first attempt (without inheritance from Nifti1Image):
import numpy as np
from nilearn.image import math_img
from sklearn.base import BaseEstimator, TransformerMixin

class NiftiBinarizer(BaseEstimator, TransformerMixin):

    def __init__(self, img, threshold=None):
        self.img = img
        self.threshold = threshold

    def fit(self):
        # if the threshold is provided as a percentile string, calculate
        # the corresponding percentile rank based on the image data
        if isinstance(self.threshold, str):
            percentile = float(self.threshold.strip('%'))
            img_data = self.img.get_data()
            self.threshold = np.percentile(img_data, percentile)

    def transform(self):
        # binarize the image using the threshold
        binary_img = math_img(f'img > {self.threshold}', img=self.img)
        return binary_img
your NiftiBinarizer will replace the NiftiMasker in the pipeline, not a nifti image, so it doesn’t need to implement the Nifti1Image interface. I think you’re on the right track, but transform needs to receive as an argument the data that needs to be transformed (i.e. the time series), so it would look like:
import nilearn.image
import nilearn.input_data
from sklearn.base import BaseEstimator, TransformerMixin

class NiftiBinarizer(BaseEstimator, TransformerMixin):

    def __init__(self, img, threshold=None):
        self.img = img
        self.threshold = threshold

    def fit(self, *args):
        # if you want to compute the threshold explicitly use
        # nilearn._utils.param_validation.check_threshold(self.threshold),
        # otherwise you can call threshold_img directly with the
        # str/float/NoneType threshold.
        # also, to respect the scikit-learn API, don't change the `threshold`
        # provided by the user; store the computed threshold in `threshold_`
        self.mask_img_ = nilearn.image.threshold_img(self.img, self.threshold)
        self.masker_ = nilearn.input_data.NiftiMasker(self.mask_img_).fit()
        return self

    def transform(self, img):
        if not hasattr(self, 'masker_'):
            raise ValueError('transformer not fitted yet.')
        return self.masker_.transform(img)
@jeromedockes thank you very much, you helped me for the second time!
I managed to set up a working example which is slightly different from your suggestion (it uses the function binarize_img to create a mask image):
import numpy as np
from nilearn.image import math_img
from nilearn.input_data import NiftiMasker
from sklearn.base import BaseEstimator, TransformerMixin

def binarize_img(img, threshold, ignore_zeros=True):
    """Binarize an image depending on a provided threshold.

    Parameters
    ----------
    img : Niimg-like object
    threshold : float or str
        If float, the threshold is interpreted as an absolute voxel
        intensity value. If provided as a string (e.g. "80%"), the
        threshold is calculated as the percentile score corresponding
        to the provided percentile.
    ignore_zeros : boolean
        This parameter is intended to be used in combination with a string
        threshold. If True, voxels containing zeros will be ignored when
        calculating the corresponding percentile rank. This is useful when
        the provided image already contains lots of zero voxels (zeros will
        then not be taken into account when calculating the percentile rank,
        so there is no bias in the percentile calculation).
        Default: True

    Returns
    -------
    mask_img : Niimg-like object
        A binarized version of the provided image.
    """
    if isinstance(threshold, str):
        percentile = float(threshold.strip('%'))
        img_data = img.get_data()
        if ignore_zeros:
            img_data = img_data[np.nonzero(img_data)]
        threshold = np.percentile(img_data, percentile)
    # FIXME: for reasons of readability, replace this with .get_data(),
    # calculate the binary image and then use new_img_like?
    mask_img = math_img(f'img > {threshold}', img=img)
    return mask_img
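For example, with a percentile threshold the function would be called like this (a quick sketch; tpl_img stands in for the meta-analysis template, e.g. loaded with nibabel.load):

# keep only the top 20% of nonzero voxels of the template
mask_img = binarize_img(tpl_img, threshold='80%', ignore_zeros=True)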
class NiftiProcessor(BaseEstimator, TransformerMixin):

    def __init__(self, tpl_img, threshold=None):
        self.tpl_img = tpl_img
        self.threshold = threshold

    def fit(self, X, y=None):
        self.mask_img_ = binarize_img(self.tpl_img, self.threshold)
        self.masker_ = NiftiMasker(self.mask_img_).fit()
        return self

    def transform(self, X, y=None):
        if not hasattr(self, 'mask_img_'):
            raise ValueError('transformer not fitted yet.')
        return self.masker_.transform(X)
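With this transformer, the threshold from my original question can finally be treated as a hyperparameter, roughly like this (a sketch; tpl_img, imgs and labels are placeholders for the template, the input images and the outcome labels):

from sklearn.model_selection import GridSearchCV, cross_validate
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

pipe = Pipeline([
    ('processor', NiftiProcessor(tpl_img)),  # masks the images
    ('svm', SVC(kernel='linear')),           # final estimator
])
# the inner loop searches over the threshold (note the step__param syntax)
grid = GridSearchCV(pipe, {'processor__threshold': ['70%', '80%', '90%']}, cv=5)
# the outer loop gives an unbiased performance estimate (nested cross-validation)
scores = cross_validate(grid, imgs, labels, cv=5)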
I am not sure the function threshold_img does the right job here, since it doesn’t output a binary image object but only sets values below the threshold to zero. I tried your code and NiftiMasker complained that it wasn’t provided with a binary image.
I also wonder why you set up fit and transform the way you did?
Thanks again, you really helped me get on the right track. Any comments or suggestions concerning my code are greatly appreciated.
sorry about that, you’re quite right, it needs to be binarized. if you don’t want to reimplement the thresholding logic you could still use nilearn._utils.param_validation.check_threshold; otherwise your solution is also a good one.
I’m not sure I understand this question. the idea is that to be compatible with the scikit-learn API, and for example be usable in a Pipeline or GridSearchCV, the object must do nothing in its __init__ except store the parameters without changing their names. then all the preparation is done in fit, which can be called as estimator.fit(X, y) (here you don’t care about X and y because only the template and threshold are used). fit must also return self. finally, transform transforms the given image (after checking that the object is fitted, to give a more helpful error message if it isn’t).
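for instance, this is why grid search can clone and reconfigure your estimator between fits (a small sketch using your NiftiProcessor from above; tpl_img is a placeholder):

from sklearn.base import clone

proc = NiftiProcessor(tpl_img, threshold='80%')
# clone() rebuilds the estimator from its __init__ parameters alone, so
# anything computed in fit (mask_img_, masker_) must not be created there
proc2 = clone(proc)
# set_params is how GridSearchCV varies hyperparameters between fits
proc2.set_params(threshold='90%')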
don’t hesitate if you have more questions or if this is not what you were asking.
in this line I would maybe pass threshold as an argument to math_img, or use str.format explicitly, because f-strings don’t exist in python 3.5 AFAIK, which is still widely used.
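concretely, that line would then read:

mask_img = math_img('img > {}'.format(threshold), img=img)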
Thanks for the hint. I looked at check_threshold and in principle it does what I need, but I also need an ignore_zeros option because I don’t want the zeros to be included when calculating the percentile rank from the provided percentile string. That’s why I built my own function. I don’t know if this could be useful for somebody else in the future; if so, one could think about making a pull request to nilearn.
Yes, good idea. I am also not really happy with my solution. It is not much code (which is what I wanted in the beginning), but now I think it’s not very good in terms of readability.
This is good to know and another reason to change it.
I tried to do that but I don’t know how (see my comment on this post). Maybe I’ll just switch back to a chain of functions like .get_data() -> binarize the data -> new_img_like.
ah yes, you’re right. I guess new_img_like(img, img.get_data() > 0) would have been my reflex too, but the math_img solution is good; I would just write '...{}'.format(threshold) rather than f'...{threshold}'.
I’m not sure I understand this question. the idea is that to be compatible with the scikit-learn API […]
And I think I still struggle a little bit with understanding the setup of a scikit-learn pipeline. When do you explicitly have to put in X and y as arguments to fit and transform? I understand that *args acts like a ‘wildcard’, so it will accept any arguments you provide, including X and y. But when I look at examples of fit and transform, it seems that everybody writes them as fit(X, y=None) and transform(X, y=None), which made me think that including y is sort of mandatory here.
I also don’t understand how scikit-learn internally passes X and y through the pipeline. Does every new object get the output of the former object, or is every pipeline object provided with the same X and y? The second option doesn’t make sense to me because we want to transform our objects and pass the transformed X to the next pipeline object. But if the first option is true, why don’t we also have to return y?
all steps of the pipeline but the last must be transformers; they are used as est.fit(X, y, **fit_params).transform(X), where fit_params can be empty. the transformed X then gets passed on to the next step in the pipeline. y is passed to the fit methods but doesn’t get transformed (and transformers typically don’t use it).
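schematically, Pipeline.fit does something like this (a simplified sketch that ignores the fit_transform shortcut and parameter routing):

Xt = X
for name, step in pipeline.steps[:-1]:
    # each transformer is fitted on the current X, which is then replaced
    # by its transformed output; y is passed along unchanged
    Xt = step.fit(Xt, y).transform(Xt)
pipeline.steps[-1][1].fit(Xt, y)  # the final estimator sees the transformed X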
How does the next transformer know what it received from the transformer before? Does it automatically assume that the output of the former transformer is the transformed X? And what if I also wanted to transform y (for example to correct for noise in the labels): would it automatically assume that the first returned object of the former transformer is X and the second one is y?