Leave One Run Out CV across exemplars

I’m running into difficulties trying to set up cross-validation for a surface-based search_light SVM in nilearn. In each fold of a leave-one-run-out CV, I’d like to train on one set of conditions (A1 vs B1) from the training runs, then test on a different set of conditions (A2 vs B2) from the held-out run. Can anyone provide advice, or an example from nilearn/sklearn that works with search_light?

I’ve tried making my own CV (see below), which works when I run it once, but if I call it a second time (e.g. in a loop, or when the search_light moves) I get a ‘list index out of range’ error:

import numpy as np
from sklearn import svm
from sklearn.model_selection import cross_val_score

# groups 1-5 are dataset A, runs 1-5; groups 6-10 are dataset B, runs 1-5 (4 samples per run)
group = np.repeat(np.arange(1, 11), 4)
tmpX = np.random.sample((40, 100))
# two classes per run, alternating in pairs
y = np.tile([1, 1, 2, 2], 10)
   
class CustomCrossValidation:
    @classmethod
    def split(cls,
              X: np.ndarray = None,
              y: np.ndarray = None,
              groups: np.ndarray = None):
        assert len(X) == len(groups), (
            "Length of the predictors does not "
            "match the length of the groups.")

        # dataset A occupies the first half of the samples, dataset B the second half
        half = np.floor_divide(X.shape[0], 2)
        for group_idx in range(groups.min(), groups.max() + 1):
            if group_idx <= groups.max() / 2:
                # train on the other dataset A runs, test on the matching dataset B run
                training_indices = np.where(
                    groups[groups <= groups.max() / 2] != group_idx)[0]
                test_indices = np.where(groups == group_idx)[0] + half
            else:
                # train on the other dataset B runs, test on the matching dataset A run
                training_indices = np.where(
                    groups[groups > groups.max() / 2] != group_idx)[0] + half
                test_indices = np.where(groups == group_idx)[0] - half
            if len(test_indices) > 0:
                yield training_indices, test_indices

## this cv gives the correct train/test splits                
for train_index, test_index in CustomCrossValidation.split(tmpX, y, group):
    print("TRAIN:", train_index, "TEST:", test_index)

## example using cross_val_score fails, as does search_light, with the same error
cv = CustomCrossValidation.split(tmpX, y, group)
for i in range(2):
    clfa = svm.SVC(kernel='linear', C=1)
    scores = cross_val_score(clfa, tmpX, y, cv=cv)  # 'list index out of range' on the second pass
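The failure seems to come from split() returning a one-shot generator: the first cross_val_score call exhausts it, so the second call sees no splits. A minimal sketch of a workaround, since sklearn also accepts a list of precomputed (train, test) index arrays as cv, and a list, unlike a generator, can be re-iterated:

cv_splits = list(CustomCrossValidation.split(tmpX, y, group))  # materialize the generator once
for i in range(2):
    clfa = svm.SVC(kernel='linear', C=1)
    scores = cross_val_score(clfa, tmpX, y, cv=cv_splits)  # works on every pass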

EDIT/UPDATE

(NB: I fixed a couple of typos.) If I do the following in a loop, I don’t get the ‘list index out of range’ error:

cv = CustomCrossValidation()
acc_parcel = np.zeros(2)
for i in range(2):
    clfa = svm.SVC(kernel='linear', C=1)
    scores = cross_val_score(clfa, tmpX, y, cv=cv.split(tmpX, y, group), groups=group)
    acc_parcel[i] = scores.mean()
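(I think this works because split() is called afresh on every iteration, so each cross_val_score call receives its own new generator of splits.)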

but it is not working with searchlight yet:

scores = search_light(X, y, estimator, adjacency, cv=cv.split(tmpX, y, group), groups=group, n_jobs=1)
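Presumably this is the same one-shot generator problem: search_light runs cross_val_score once per vertex, so the generator is already consumed when the searchlight moves to the second vertex. Materializing the splits into a list first might get around it (untested sketch):

cv_splits = list(cv.split(tmpX, y, group))  # a list can be re-iterated at every vertex
scores = search_light(X, y, estimator, adjacency, cv=cv_splits, groups=group, n_jobs=1)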

Hi @JimT, this link might be helpful as it provides some info on Leave-one-run-out cross-validation for searchlight: Nilearn: Statistical Analysis for NeuroImaging in Python — Machine learning for NeuroImaging

Regarding that information, have you tried using sklearn.model_selection.LeaveOneGroupOut from sklearn (see the scikit-learn documentation)?

Following this example, you should be able to set cv with:

from sklearn.model_selection import LeaveOneGroupOut
cv = LeaveOneGroupOut()

instead of using KFold.
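A minimal sketch of how that looks with cross_val_score, using the arrays from your example (the run labels here are hypothetical, one label per sample):

from sklearn.model_selection import LeaveOneGroupOut, cross_val_score
from sklearn import svm
import numpy as np

run = np.repeat(np.arange(1, 6), 8)  # hypothetical run labels: 5 runs x 8 samples
cv = LeaveOneGroupOut()
clf = svm.SVC(kernel='linear', C=1)
# each run is held out exactly once; the labels go in via groups=
scores = cross_val_score(clf, tmpX, y, cv=cv, groups=run)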

Hi @ymzayek, yes, thanks, both of those links have been super helpful for setting up a CV in general. I can use a standard LeaveOneGroupOut with no problems; it is training on one set of data and testing on a different set of data in the left-out group that has proven tricky (training on dataset A from runs 1-4, testing on dataset B from run 5, etc.).

I did find this very helpful: inter_subject_pattern_analysis/inter_subject_searchlight_InterTVA.py at master · SylvainTakerkart/inter_subject_pattern_analysis · GitHub
(thanks @SylvainTakerkart), so I think I can just train each fold separately using the CV in the OP, then average. Probably not very efficient, but it might just work. If it does, I’ll post.

Now it is just a matter of working out how to write the output back to a cifti2 file correctly…

Maybe there is a much easier way of doing this, but if, like me, you’ve struggled to work this out, the following works pretty well using the CustomCrossValidation from the OP and “borrowing” code from @SylvainTakerkart. Make sure you structure your data as dataset A followed by dataset B.

cv = CustomCrossValidation()
searchlight = []
for split_ind, (train_inds, test_inds) in enumerate(cv.split(tmpX, y, group)):
    # run the searchlight one fold at a time by passing a single-split cv
    single_split = [(train_inds, test_inds)]
    print("...split_ind", split_ind + 1)
    searchlight.append(search_light(X, y, estimator, adjacency, cv=single_split, groups=group, n_jobs=4))

acc = np.array(searchlight) - 0.5  # accuracy relative to chance (0.5)
mean_acc = np.mean(acc, axis=0)    # average over folds
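Passing single_split works because sklearn accepts any iterable of precomputed (train, test) index arrays as cv, and a one-element list survives re-iteration at every vertex.

For the CIFTI-2 write-out mentioned above, here is a rough, untested sketch of one way it might look with nibabel, assuming the surface data came from a dscalar file whose brain-model axis can be reused (file names are hypothetical):

import numpy as np
import nibabel as nib
from nibabel.cifti2 import Cifti2Image
from nibabel.cifti2.cifti2_axes import ScalarAxis

template = nib.load("sub-01_task.dscalar.nii")  # hypothetical source file
scalar_ax = ScalarAxis(["accuracy"])            # one named map for the mean accuracy
brain_ax = template.header.get_axis(1)          # reuse the vertex (brain-model) axis
img = Cifti2Image(mean_acc[np.newaxis, :], header=(scalar_ax, brain_ax))
img.to_filename("searchlight_mean_acc.dscalar.nii")  # hypothetical output name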

Great, thanks for the extra information and posting a solution!

thanks! always happy if my code can help :wink:

this is the only solution I’d found when we coded this; the good thing is that it allows you to do whatever you want!

to propose a cleaner solution, I believe the Searchlight object should be reworked, but 1. that would clearly need more work, and 2. I’m not even sure it’s doable without losing some of the underlying computational efficiency of the Searchlight object…