I have a dataset in which my two classes have imbalanced sample sizes. Moreover, each subject belongs to one of several sites. I know I can use the imblearn module to resample my whole sample so that I get equal sample sizes for both classes. But is it also possible to apply stratified sampling using imblearn, i.e. ensuring that each site contains a roughly equal number of subjects with the minority and majority labels?
The resamplers provided in imblearn only use the target (i.e. the classes) to resample. There is no option such as a GroupResampling. The easiest way to achieve what you want is to use imblearn.FunctionSampler, to which you can pass a function that implements your desired group resampling using additional information. You can refer to the FunctionSampler example in the documentation to see how it works. You can use this sampler in a pipeline as well.
@glemaitre I wrote a class that handles this stratified random undersampling. It takes a dataframe plus the names of a one-dimensional feature variable (in my case a vector of NIfTI paths), a class label variable and a grouping variable. This class is not very flexible, but it works. Maybe I will implement a future version in which the class inherits from FunctionSampler so that it can be used in a pipeline as well. Also, it shouldn't be necessary to provide a dataframe. It would be better to provide a feature variable, a label variable and a grouping variable. The resampled indices could still be used to resample the dataframe those variables come from.
```python
from imblearn.under_sampling import RandomUnderSampler


class StratifiedRandomUnderSampler:
    """Randomly undersample the majority class separately within each group."""

    def __init__(self, df, X_name, y_name, groupvar_name, sampling_strategy='auto',
                 random_state=None, replacement=False):
        # NOTE: `return_indices` and `ratio` are not forwarded because recent
        # imbalanced-learn releases no longer accept them.
        self.df = df
        self.X_name = X_name
        self.y_name = y_name
        self.groupvar_name = groupvar_name
        self.sampling_strategy = sampling_strategy
        self.random_state = random_state
        self.replacement = replacement

    def resample(self):
        # undersample each group on its own and concatenate the results
        self.df_res_ = self.df.groupby(self.groupvar_name).apply(
            self._random_undersample,
            X_name=self.X_name,
            y_name=self.y_name,
        )
        # drop the group level that groupby added and restore the original index
        self.df_res_.set_index(self.df_res_.index.get_level_values(1), inplace=True)
        self.X_ = self.df_res_[self.X_name]
        self.y_ = self.df_res_[self.y_name]
        return self.X_, self.y_, self.df_res_

    def _random_undersample(self, group_df, X_name, y_name):
        # resampling is only possible when there is more than one class label
        # NOTE: this means that groups which contain samples with only one class
        # label will be dropped (the method returns None for them).
        if group_df[y_name].nunique() > 1:
            # convert feature column into (n_samples, 1) dimensional numpy array
            # NOTE: this reshaping is required by imblearn when dealing with one
            # dimensional feature matrices.
            X = group_df[X_name].values.reshape(-1, 1)
            rus = RandomUnderSampler(sampling_strategy=self.sampling_strategy,
                                     random_state=self.random_state,
                                     replacement=self.replacement)
            rus.fit_resample(X, group_df[y_name])
            # positions (within this group) of the samples that were kept
            indices = rus.sample_indices_
            return group_df.iloc[indices]
```
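The dataframe-free interface mentioned in the comment above (pass a label variable and a grouping variable, get indices back) could look roughly like the sketch below. It is only an illustration under stated assumptions: the function name `group_undersample_indices` is hypothetical, and it uses plain NumPy sampling instead of RandomUnderSampler, always shrinking every class in a group to the size of that group's smallest class.

```python
import numpy as np
import pandas as pd


def group_undersample_indices(y, groups, random_state=None):
    """Return sorted positional indices that balance the classes per group.

    Hypothetical helper: within every group, each class is randomly
    undersampled (without replacement) to the size of the group's smallest
    class. Groups that contain a single class are dropped.
    """
    rng = np.random.default_rng(random_state)
    # The default RangeIndex makes the frame's index equal to the positions.
    frame = pd.DataFrame({"y": np.asarray(y), "group": np.asarray(groups)})
    kept = []
    for _, group_frame in frame.groupby("group"):
        counts = group_frame["y"].value_counts()
        if len(counts) < 2:
            continue
        n_keep = int(counts.min())
        for label in counts.index:
            mask = (group_frame["y"] == label).to_numpy()
            positions = group_frame.index[mask].to_numpy()
            kept.extend(rng.choice(positions, size=n_keep, replace=False))
    return np.sort(np.asarray(kept, dtype=int))


# Toy data (assumed): three sites, the last one containing a single class.
y = np.array([0, 0, 0, 1, 0, 1, 1, 1, 0, 0])
groups = np.array(["a", "a", "a", "a", "b", "b", "b", "b", "c", "c"])
indices = group_undersample_indices(y, groups, random_state=0)
print(indices, y[indices], groups[indices])
```

The returned indices are positional, so they can be applied to the original dataframe with `df.iloc[indices]` or to any array that is aligned with `y`.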