I’m currently training an SVM decoder on one subject’s fMRI data, using several consecutive TRs as input samples.
My questions are:
Since the SVM treats every TR as an independent sample, would it be acceptable to shuffle samples (for example using StratifiedKFold as the cv) to get better classification results?
If shuffling samples is not reasonable because TRs have strong temporal correlation, why do people keep using SVMs to analyze time-series data?
Oh, I think I’ve figured out why shuffling gave me better results: it splits TRs belonging to the same trial across the train and test sets, so the two sets are not independent.
So in my case the answer to my first question would be no.
However, what if I intentionally split TRs from the same trial across the cross-validation folds to improve model accuracy, while the real test set is completely independent of the CV sets? Would this shuffling approach be reasonable?
By the way, I find that if I use only 1 TR per trial, then shuffling would still be acceptable. And many papers actually average the TRs within each trial, so the samples are no longer temporally correlated, which makes them fine for an SVM.
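To be concrete, this is roughly what I mean by averaging (just a sketch with made-up array names: `X` holds one row per TR, `trial_ids` says which trial each TR belongs to):

```python
import numpy as np

def average_trs_within_trials(X, trial_ids):
    """Average all TRs belonging to the same trial, yielding one sample per trial."""
    unique_trials = np.unique(trial_ids)
    return np.vstack([X[trial_ids == t].mean(axis=0) for t in unique_trials])

# Toy example: 6 TRs x 10 voxels, two trials of 3 TRs each
X = np.random.randn(6, 10)
trial_ids = np.array([0, 0, 0, 1, 1, 1])
X_trials = average_trs_within_trials(X, trial_ids)  # shape (2, 10)
```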
Regarding 2: Correlation problems are distinct from those related to the choice of a classifier. The point is simply that you cannot measure accuracy on samples that are not independent of your training data. This is a general statistical principle.
Concretely, to use machine-learning solutions, you need your data organized into several runs that can be considered independent.
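For example, with scikit-learn you can pass the run labels as groups so that each fold is tested on a run that was never used for training (a minimal sketch, assuming you have one run label per sample; the arrays here are just toy stand-ins for your data):

```python
import numpy as np
from sklearn.model_selection import LeaveOneGroupOut, cross_val_score
from sklearn.svm import LinearSVC

# Toy stand-ins: X (n_samples, n_voxels), y condition labels,
# runs = run index of each sample (8 runs, 10 samples each)
rng = np.random.RandomState(0)
X = rng.randn(80, 50)
y = np.tile([0, 1], 40)
runs = np.repeat(np.arange(8), 10)

# Leave-one-run-out: every fold is tested on a run never seen during training
scores = cross_val_score(LinearSVC(), X, y, cv=LeaveOneGroupOut(), groups=runs)
print(scores.mean())
```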
However, what if I intentionally split TRs from the same trial across the cross-validation folds to improve model accuracy, while the real test set is completely independent of the CV sets? Would this shuffling approach be reasonable?
I don’t understand what you would gain by shuffling then?
By the way, I find that if I use only 1 TR per trial, then shuffling would still be acceptable. And many papers actually average the TRs within each trial, so the samples are no longer temporally correlated, which makes them fine for an SVM.
You can train an SVM on correlated data, no dispute about that. What has to be avoided in all cases is measuring accuracy on non-independent samples.
Best,
Bertrand
I was hoping to find papers addressing this cross-validation problem in fMRI.
I have 4 conditions in my experiment. I’m training on 2 of the conditions to decode the other 2 (the test set). If I use leave-one-group-out (1 or 2 runs as a group, 8 runs in total), the trained SVM only gives around 60% accuracy in cross-validation (on the test set it’s even worse).
However, if I shuffle the samples during cross-validation (which may split samples from the same trial between the train and validation sets), I get 70-80% accuracy both in cross-validation and on the test set (which is independent of the train and validation sets). This is what I’m facing right now.
PS: by the validation set I mean the fold being tested during cross-validation, not the data to be decoded at the end. Not sure if that’s the right term.
Hi,
You’re referring to a train / validation / test split, which is fine.
What is unclear is that you’re trying to generalize both across conditions and across runs.
What is absolutely needed is for the validation/test data to come from different runs than the training data.
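For instance (a sketch with hypothetical run labels and toy arrays), you could keep a couple of whole runs aside as the test set and cross-validate run-wise on the remaining runs:

```python
import numpy as np
from sklearn.model_selection import LeaveOneGroupOut, cross_val_score
from sklearn.svm import LinearSVC

rng = np.random.RandomState(0)
X = rng.randn(80, 50)                       # toy samples
y = np.tile([0, 1], 40)                     # condition labels
runs = np.repeat(np.arange(8), 10)          # 8 runs, 10 samples each

test_mask = np.isin(runs, [6, 7])           # hold out two whole runs as the test set

# Run-wise cross-validation on the remaining runs only
cv_scores = cross_val_score(LinearSVC(), X[~test_mask], y[~test_mask],
                            cv=LeaveOneGroupOut(), groups=runs[~test_mask])

# Final fit on all non-test runs, single evaluation on the held-out runs
clf = LinearSVC().fit(X[~test_mask], y[~test_mask])
test_accuracy = clf.score(X[test_mask], y[test_mask])
```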
HTH,
Bertrand