Hyperalignment training and test format with fMRIprep outputs

liaok · December 11, 2018, 8:36am

In the context of using fMRIprep outputs to run searchlight hyperalignment in PyMVPA, how should the training and testing datasets be set up?

To my understanding, the Hyperalignment tutorial dataset is a list of subjects in which each individual’s tasks/runs (target/chunk attributes) are all within the same Dataset object. We then split the data into training and testing sets by holding out one run (chunk) for testing; specifically, I am referring to these lines:
ds_train = [sd[sd.sa.chunks != test_run, :] for sd in ds_all]
ds_test = [sd[sd.sa.chunks == test_run, :] for sd in ds_all]

However, with fMRIprep outputs, the data from one subject are split across separate Dataset objects, one per task or run. In other words, there is one Dataset for subject 1 with the “rest” task and another for subject 1 with the “stop signal” task. As a result, we can run hyperalignment training with each subject-and-task combination as a separate Dataset, but what would be the appropriate test set to apply the hyperalignment mappers to? Is there a way to combine different runs of the same subject into one Dataset, keeping in mind that their dimensions differ?

Would very much appreciate any insights!

feilong · December 12, 2018, 4:38pm

Typically, the more training data, the better hyperalignment works. Thus, in most scenarios, I would use all data available, combine them with vstack, train hyperalignment with the combined dataset and obtain a transformation matrix per subject, and use the transformation matrix to transform (i.e., hyperalign) the data.
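As a rough illustration of the stacking step, here is a minimal sketch that uses plain NumPy arrays as stand-ins for PyMVPA Dataset objects. The run names, sizes, and random data are made up for the example, and it assumes every run already has the same number of voxels (features). The chunk labels play the role of `sa.chunks`, so the same leave-one-run-out split from the tutorial still works after stacking.

```python
import numpy as np

rng = np.random.default_rng(0)
n_voxels = 50  # hypothetical: all runs share the same feature set

# One (timepoints x voxels) array per task for a single subject (fake data).
runs = {
    "rest": rng.standard_normal((152, n_voxels)),
    "bart": rng.standard_normal((267, n_voxels)),
}

# Stack runs along the sample (time) axis and remember which run each row came from.
samples = np.vstack(list(runs.values()))
chunks = np.concatenate(
    [np.full(len(data), i) for i, data in enumerate(runs.values())]
)

# Leave-one-run-out split, analogous to indexing on sd.sa.chunks.
test_run = 1
train = samples[chunks != test_run]
test = samples[chunks == test_run]
```

In PyMVPA itself, the stacked Dataset would carry the chunk labels as a sample attribute, so no separate `chunks` array would be needed.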

However, this heavily depends on the design of your analysis, especially the cross-validation scheme. For example, for the analyses we performed in the hyperalignment papers (Haxby et al., 2011; Guntupalli et al., 2016, 2018), we need to ensure the increases in inter-subject correlation (ISC) and between-subject multivariate pattern classification (bsMVPC) accuracy are generalizable and not just overfitting. Because the algorithm guarantees an increase in ISC in training data, we split the data into training and testing runs, and showed that the transformation obtained from one half of the movie worked well on the other half of the movie. Similarly, in a recent study combining hyperalignment and encoding models (Van Uden et al., 2018), hyperalignment was also trained with only the training runs, and then applied to the testing runs to evaluate generalizability. In other words, if you want to generalize your encoding/decoding model to independent runs, hyperalignment should also be trained with only the training runs (i.e., without touching the hold-out data that would be later used for evaluation).
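To make the train/test logic concrete, here is a small sketch with simulated data. It is not PyMVPA's Hyperalignment: a single pairwise orthogonal Procrustes rotation stands in for the full algorithm, and all names, sizes, and noise levels are assumptions chosen for the example. The point is only that the transformation is estimated from the training run and then evaluated on a run it has never seen.

```python
import numpy as np

def procrustes_rotation(source, target):
    """Orthogonal matrix R minimizing ||source @ R - target|| (Frobenius norm)."""
    u, _, vt = np.linalg.svd(source.T @ target)
    return u @ vt

def mean_voxelwise_corr(a, b):
    """Mean Pearson correlation between matching columns of a and b."""
    a = (a - a.mean(axis=0)) / a.std(axis=0)
    b = (b - b.mean(axis=0)) / b.std(axis=0)
    return float((a * b).mean())

rng = np.random.default_rng(0)
n_train, n_test, n_voxels = 200, 100, 20

# Simulated data: subject 2 is a rotated, noisy copy of subject 1.
true_rot, _ = np.linalg.qr(rng.standard_normal((n_voxels, n_voxels)))
sub1_train = rng.standard_normal((n_train, n_voxels))
sub1_test = rng.standard_normal((n_test, n_voxels))
sub2_train = sub1_train @ true_rot + 0.1 * rng.standard_normal((n_train, n_voxels))
sub2_test = sub1_test @ true_rot + 0.1 * rng.standard_normal((n_test, n_voxels))

# Estimate the transformation from the training run only.
rotation = procrustes_rotation(sub2_train, sub1_train)

# Apply it to the held-out run and compare inter-subject correlation.
isc_before = mean_voxelwise_corr(sub2_test, sub1_test)
isc_after = mean_voxelwise_corr(sub2_test @ rotation, sub1_test)
```

Because `rotation` never saw the test run, any gain from `isc_before` to `isc_after` reflects generalization rather than a fit to the evaluation data.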

liaok · December 12, 2018, 7:08pm

Hi @feilong,

Thank you for the valuable details regarding the analysis, will certainly keep that in mind!

Regarding vstack, I have tried that but encountered a ValueError because each task run has different dimensions after masking. That is because with fMRIprep outputs, we have different masks for separate tasks, i.e.:

We have a mask for “rest” (sub-10171_task-rest_space-MNI152NLin2009cAsym_desc-brain_mask.nii.gz)
and another mask for “bart” task (sub-10171_task-bart_space-MNI152NLin2009cAsym_desc-brain_mask.nii.gz)

By masking the respective volumetric data (MNI2009cAsym) with their mask as shown above, we obtain different dimensions of masked fMRI data. How may I address this issue? A possible solution could be to use the same mask file for all tasks under the same subject, but this is probably suboptimal?

feilong · December 13, 2018, 7:43pm

If you are using mvpa2.datasets.mri.fmri_dataset, there is an add_fa parameter that allows you to load additional feature attributes, which can later be used as masks as well. You can use it to add all these masks to your Dataset.

If all your data have been anatomically aligned to the MNI152NLin2009cAsym template, using the brain mask from the template is probably a good choice. However, if the mismatch of masks is caused by a lack of coverage in some of the functional runs, I think it’s better to use the intersection of these masks (i.e., only voxels covered by all tasks and runs).
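Here is a minimal sketch of the intersection approach, using small toy arrays in place of the real NIfTI files. The shapes and mask extents are made up; in practice the boolean masks would come from the `desc-brain_mask` images (for example, loaded with nibabel), and all images are assumed to be on the same grid.

```python
import numpy as np

rng = np.random.default_rng(0)
grid = (4, 4, 4)  # toy volume; real data would share the template grid

# Toy per-task brain masks with different coverage.
rest_mask = np.zeros(grid, dtype=bool)
rest_mask[1:4, 1:4, 1:4] = True   # 27 voxels
bart_mask = np.zeros(grid, dtype=bool)
bart_mask[0:3, 1:4, 0:4] = True   # 36 voxels

# Keep only voxels covered by every task.
common_mask = np.logical_and(rest_mask, bart_mask)

# Toy 4D runs stored as (time, x, y, z).
rest_run = rng.standard_normal((5,) + grid)
bart_run = rng.standard_normal((7,) + grid)

# Masking with each run's own mask gives mismatched feature counts ...
own_shapes = (rest_run[:, rest_mask].shape, bart_run[:, bart_mask].shape)

# ... while the common mask gives matching columns, so stacking succeeds.
rest_masked = rest_run[:, common_mask]
bart_masked = bart_run[:, common_mask]
combined = np.vstack([rest_masked, bart_masked])
```

The same common mask should then be applied to every subject that enters hyperalignment if the analysis expects a shared voxel grid across subjects.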