Stratification using Imbalanced-Learn?

scikit-learn
imblearn
#1

I have a dataset with imbalanced sample sizes for my two classes. Moreover, each subject belongs to a different site. I know I can use the imblearn module to resample my whole sample so that I get equal sample sizes for both classes. But is it also possible to apply stratified sampling using imblearn, i.e. ensuring that each site contains a roughly equal number of subjects with the minority and majority labels?

#2

The resamplers provided in imblearn only use the target (i.e. the classes) to resample; there is no option such as a GroupResampler. The easiest way to achieve what you want is to use imblearn.FunctionSampler, where you can pass a function that implements your desired group resampling using additional information. You can refer to the following example to see how it works. You can use this sampler in a pipeline as well.

#3

@glemaitre I wrote a class that can handle this stratified random undersampling. It is provided with a dataframe and the names of a one-dimensional feature variable (in my case a vector of nifti paths), a class label variable, and a grouping variable. This class is not very dynamic, but it works. Maybe I will implement a future version where the class inherits from FunctionSampler so that it can be used in a pipeline as well. Also, it shouldn't be necessary to provide a dataframe. It would be better to provide a feature variable, a label variable, and a grouping variable. The resampled indices can still be used for resampling the data frame those variables come from.

from imblearn.under_sampling import RandomUnderSampler


class StratifiedRandomUnderSampler:
    """Randomly undersample the majority class within each group of a dataframe."""

    def __init__(self, df, X_name, y_name, groupvar_name,
                 sampling_strategy='auto', random_state=None, replacement=False):
        # NOTE: the deprecated `return_indices` and `ratio` arguments were
        # removed from RandomUnderSampler in imblearn 0.6; the
        # `sample_indices_` attribute replaces them.
        self.df = df
        self.X_name = X_name
        self.y_name = y_name
        self.groupvar_name = groupvar_name
        self.sampling_strategy = sampling_strategy
        self.random_state = random_state
        self.replacement = replacement

    def resample(self):
        self.df_res_ = self.df.groupby(self.groupvar_name).apply(
            self._random_undersample,
            X_name=self.X_name,
            y_name=self.y_name,
        )

        # drop the group level that groupby.apply adds, restoring the
        # original sample-level index
        self.df_res_.set_index(self.df_res_.index.get_level_values(1), inplace=True)

        self.X_ = self.df_res_[self.X_name]
        self.y_ = self.df_res_[self.y_name]

        return self.X_, self.y_, self.df_res_

    def _random_undersample(self, group_df, X_name, y_name):
        # resampling is only possible when there is more than one class label
        # NOTE: this means that groups which contain samples with only one
        # class label will be dropped.
        if group_df[y_name].nunique() > 1:

            # convert the feature column into an (n_samples, 1) numpy array
            # NOTE: this reshaping is required by imblearn when dealing with
            # one-dimensional feature matrices.
            X = group_df[X_name].values.reshape(-1, 1)

            rus = RandomUnderSampler(sampling_strategy=self.sampling_strategy,
                                     random_state=self.random_state,
                                     replacement=self.replacement)

            rus.fit_resample(X, group_df[y_name])

            return group_df.iloc[rus.sample_indices_]