Hello everyone,
I want to run a 10-fold cross-validation on a regression task,
so I need something like StratifiedKFold, which exists for classification.
I want to split my data, which contains 262 subjects, into two parts (train_index and test_index) so that the train and test sets have the same (uniform) distribution of the target, as StratifiedKFold does for classes. I want to do this 100 times, each time with 10 folds, and keep the best repetition (the one whose distribution is closest to uniform).
X has shape (170000, 262)
Y has shape (262, 1), with y in [-3, 8]
Can you help me do this, please? Thank you very much.
Hello,
Stratifying helps keep a classification problem balanced (the class proportions in each fold match those of the complete dataset), but in regression every target value is usually equivalent a priori, so you can't stratify unless you build custom categories or have additional information on your subjects.
So you can simply use either KFold (with shuffle=True) or ShuffleSplit, which will split your dataset randomly into train and test sets as many times as you need.
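A minimal sketch of both splitters, on random placeholder data rather than the real dataset (the 5 features, the seeds and the 10% test size are assumptions made for illustration). Note that scikit-learn expects one row per subject, so an X stored as (170000, 262) would need to be transposed first:

```python
import numpy as np
from sklearn.model_selection import KFold, ShuffleSplit

rng = np.random.default_rng(0)
n_subjects = 262

# Placeholder data with one row per subject. An X stored as
# (n_features, n_subjects) would have to be transposed first: X = X.T
X = rng.normal(size=(n_subjects, 5))
y = rng.uniform(-3, 8, size=n_subjects)

# 10-fold CV: every subject ends up in exactly one test fold.
kfold = KFold(n_splits=10, shuffle=True, random_state=0)
kfold_splits = list(kfold.split(X))

# ShuffleSplit: independent random train/test splits, as many as needed.
shuffle = ShuffleSplit(n_splits=100, test_size=0.1, random_state=0)
shuffle_splits = list(shuffle.split(X))

for train_index, test_index in kfold_splits:
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
    print(len(train_index), len(test_index))
```

With KFold the test folds partition the subjects, whereas the test sets drawn by ShuffleSplit may overlap from one split to the next.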
- Use a bootstrap to estimate the 10%, 20%, …, 90% quantiles of the outcome/response.
- Use the quantile intervals to create your strata (10 strata).
- Use these strata to create the folds with StratifiedKFold.
- Repeat for each cross-validation repetition (ideally, 5-fold or 10-fold CV should be repeated ~100 times).
The bootstrap in the first step is important because a sample size of around 250 probably does not give perfectly accurate quantile estimates.
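The steps above could be sketched in Python roughly as follows. This is only one possible reading of them: averaging the bootstrap quantiles to get the stratum edges, the 1000 bootstrap resamples, and the placeholder data are all assumptions, not part of the original suggestion.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

rng = np.random.default_rng(0)
n_subjects = 262
y = rng.uniform(-3, 8, size=n_subjects)   # placeholder for the real outcome
X = rng.normal(size=(n_subjects, 5))      # placeholder, one row per subject

# Step 1: bootstrap the 10%, 20%, ..., 90% quantiles of y.
# Averaging over resamples is an assumed way to combine them.
probs = np.linspace(0.1, 0.9, 9)
n_boot = 1000
boot_quantiles = np.empty((n_boot, probs.size))
for b in range(n_boot):
    resample = rng.choice(y, size=y.size, replace=True)
    boot_quantiles[b] = np.quantile(resample, probs)
edges = boot_quantiles.mean(axis=0)

# Step 2: the 9 edges define 10 strata, labelled 0 to 9.
strata = np.digitize(y, edges)

# Steps 3 and 4: stratified 10-fold CV on the strata, repeated 100 times
# with a different shuffling seed for each repetition.
n_repeats = 100
all_splits = []
for rep in range(n_repeats):
    skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=rep)
    all_splits.append(list(skf.split(X, strata)))

train_index, test_index = all_splits[0][0]
print(len(train_index), len(test_index))
```

The strata are only used to build the folds; the model itself is still fitted on the continuous y.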
Thank you very much.
How can I do this in Python, please? I'm a beginner with Python.
Thank you very, very much.
@mnarayan
Use the sklearn.model_selection module of scikit-learn (just take a look at its documentation; it's easy to use).
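For instance, a small end-to-end sketch with that module could look like this. The Ridge model, the R² score, the 10 repetitions and the random placeholder data are assumptions chosen only to keep the example short:

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import RepeatedKFold, cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(262, 5))                    # placeholder features
y = X[:, 0] + rng.normal(scale=0.5, size=262)    # placeholder target

# 10-fold CV repeated 10 times, so 100 train/test splits in total.
cv = RepeatedKFold(n_splits=10, n_repeats=10, random_state=0)

# One R^2 score per split; the model is refitted on each training set.
scores = cross_val_score(Ridge(alpha=1.0), X, y, cv=cv, scoring="r2")
print(scores.shape, scores.mean())
```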
Best,
Bertrand