TDT: Combine unbalanced data approaches with crossnobis?

Hi Martin and all,

I want to classify using crossnobis and leave-one-run-out cross-validation, but my data are unbalanced. I’d like to deal with that ‘properly’, so I’ve looked at your unbalanced_data_template, which suggests using either (1) AUC, (2) repeated subsampling, or (3) ensemble balance. I’ve tried to combine crossnobis with each of these approaches, but have failed so far.

Would it be possible to make any of these work with crossnobis, and if so, how? Could I, for example, use something like combine_designs(cfg, cfg2)? For the combination of crossnobis with repeated subsampling, I can understand why it doesn’t work: each uses a different call to create the design, make_design_similarity_cv and make_design_boot_cv, respectively.

Or is it really impossible to combine any of these approaches with crossnobis? And if so, should I then switch to, for example, liblinear with repeated subsampling, or compare the results of crossnobis against liblinear + repeated subsampling as a robustness check? But what if they give rather different results?

Furthermore, I would be very grateful for a bit more information on how each method deals with unbalanced data, because I don’t understand the descriptions given in the template very well.

For example, for AUC, what does the (area under the) curve represent, exactly? How does it not depend on a classifier’s bias? And how exactly do repeated subsampling and ensemble balance differ from each other? They both subsample across many iterations and run a classification each time, but is the accuracy for repeated subsampling then simply averaged across iterations, or is something else done? And what is meant by “use combined decision values to create a majority vote” in the ensemble balance approach?
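
To make the AUC part of my question concrete, here is a small Python sketch of how I currently picture it. This is not TDT code: the function names, the labels, and the decision values are all made up for illustration, and I am assuming that AUC can be read as the probability that a randomly chosen sample of one class gets a higher decision value than a randomly chosen sample of the other class. If that reading is right, shifting all decision values by a constant (a biased classifier) would change accuracy but not AUC.

```python
import numpy as np

def auc(labels, decision_values):
    """Fraction of (positive, negative) pairs in which the positive sample
    has the higher decision value; ties count as half."""
    labels = np.asarray(labels)
    dv = np.asarray(decision_values, dtype=float)
    pos = dv[labels == 1]
    neg = dv[labels == -1]
    diff = pos[:, None] - neg[None, :]  # all pairwise differences
    return (np.sum(diff > 0) + 0.5 * np.sum(diff == 0)) / (pos.size * neg.size)

def accuracy(labels, decision_values, threshold=0.0):
    """Accuracy when predicting +1 above the threshold and -1 otherwise."""
    pred = np.where(np.asarray(decision_values) > threshold, 1, -1)
    return float(np.mean(pred == np.asarray(labels)))

# Made-up toy data: 2 samples of class +1 and 6 samples of class -1
labels = np.array([1, 1, -1, -1, -1, -1, -1, -1])
dv = np.array([0.9, 0.2, 0.4, -0.3, -0.2, -0.5, -0.7, -1.0])

# Simulate a classifier that is biased towards the majority class (-1)
biased_dv = dv - 1.0

print("accuracy:", accuracy(labels, dv), "->", accuracy(labels, biased_dv))
print("AUC:     ", auc(labels, dv), "->", auc(labels, biased_dv))
```

In this toy case the biased classifier labels every sample as the majority class, so accuracy changes, while the ranking of the decision values (and hence the AUC as defined here) stays the same. Is that the sense in which AUC does not depend on the bias?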

And do I understand correctly that using the ensemble balance approach would be considered ‘best practice’ and AUC the ‘least good practice’ to deal with unbalanced data – apart from not taking it into account at all? Or is it not that simple?
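
Regarding the difference between repeated subsampling and ensemble balance, this is a Python sketch of my current guess, so that you can correct it where it is wrong. It is not TDT code and may not match what the toolbox does: the simple mean-difference classifier, all function names, and the toy data are hypothetical, and I am assuming that only the training data are subsampled and that the "majority vote" is taken over the signs of the decision values.

```python
import numpy as np

def train_mean_classifier(X, y):
    """Hypothetical stand-in classifier: linear boundary halfway between class means."""
    mu_pos = X[y == 1].mean(axis=0)
    mu_neg = X[y == -1].mean(axis=0)
    w = mu_pos - mu_neg
    b = -(w @ (mu_pos + mu_neg)) / 2.0
    return w, b

def decision_values(model, X):
    w, b = model
    return X @ w + b  # > 0 means class +1

def balanced_subsample(y, rng):
    """Indices of a random subsample with equally many samples per class."""
    idx_pos = np.flatnonzero(y == 1)
    idx_neg = np.flatnonzero(y == -1)
    n = min(idx_pos.size, idx_neg.size)
    return np.concatenate([rng.choice(idx_pos, n, replace=False),
                           rng.choice(idx_neg, n, replace=False)])

def repeated_subsampling(X_train, y_train, X_test, y_test, n_iter, rng):
    """My guess: one accuracy per iteration, then average the accuracies."""
    accs = []
    for _ in range(n_iter):
        idx = balanced_subsample(y_train, rng)
        model = train_mean_classifier(X_train[idx], y_train[idx])
        pred = np.where(decision_values(model, X_test) > 0, 1, -1)
        accs.append(np.mean(pred == y_test))
    return float(np.mean(accs))

def ensemble_balance(X_train, y_train, X_test, y_test, n_iter, rng):
    """My guess: collect one vote per iteration and test sample,
    then make a single prediction per test sample from the combined votes."""
    votes = np.zeros(len(y_test))
    for _ in range(n_iter):
        idx = balanced_subsample(y_train, rng)
        model = train_mean_classifier(X_train[idx], y_train[idx])
        votes += np.sign(decision_values(model, X_test))
    pred = np.where(votes > 0, 1, -1)
    return float(np.mean(pred == y_test))

# Made-up unbalanced training data: 10 samples of class +1, 40 of class -1
rng = np.random.default_rng(0)
X_train = np.vstack([rng.normal(+1.0, 1.0, size=(10, 2)),
                     rng.normal(-1.0, 1.0, size=(40, 2))])
y_train = np.array([1] * 10 + [-1] * 40)
X_test = np.vstack([rng.normal(+1.0, 1.0, size=(5, 2)),
                    rng.normal(-1.0, 1.0, size=(5, 2))])
y_test = np.array([1] * 5 + [-1] * 5)

# An odd number of iterations avoids tied votes in the ensemble
acc_sub = repeated_subsampling(X_train, y_train, X_test, y_test, 101, rng)
acc_ens = ensemble_balance(X_train, y_train, X_test, y_test, 101, rng)
print("repeated subsampling:", acc_sub, "ensemble balance:", acc_ens)
```

So in my reading, repeated subsampling averages many accuracies, whereas ensemble balance produces one accuracy from a combined prediction. Is that roughly the distinction, or does TDT combine the decision values in a different way?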

Lastly, I haven’t been able to run the ensemble balance approach at all (using libsvm, as set in the template). I got this error:
Unable to perform assignment because dot indexing is not supported for variables of this type.
Error in decoding_template_UnbalData_ensemble_balance (line 221) = ‘libsvm’;

It seems that specifically the last part (.software) is not recognized as valid input. The same holds for cfg.decoding.train.classification_kernel.model_parameters.model_parameters. How can I fix this? What should I change so that it accepts the second .software and .model_parameters?

Thanks in advance for answering my (again) many questions!