How to create ML labels from behavioral data?

Dear Friends of Research,

I am working my way through the (excellent) workshop by Michael Notter and Peer Herholz. Now I want to apply the machine learning tutorial (workshop_pybrain/05b_machine_learning_nilearn.ipynb at master · miykael/workshop_pybrain · GitHub) to a dataset of my own, but I am failing at creating the labels.

In Michael Notter’s example, the eyes are alternately closed for 4 volumes and open for 4 volumes. A NumPy array is created that looks like this: array(['closed', 'closed', 'closed', 'closed', 'open', 'open', 'open', 'closed', 'closed', 'closed', 'closed', ...])

My experiment, on the other hand, consists of 4 runs with 211 volumes each and a TR of 2 seconds. Each trial lasts a total of 2.5 seconds (2.25 seconds stimulus presentation, 0.25 seconds ITI). 167 trials were completed per run. There are 6 different conditions, which were presented in random order.

Based on the tutorial, it seems to me that the labels must be assigned specifically for each volume. Now would the assumption be correct that ~ 1.25 volumes need to be assigned to each trial? If so, does the overlap between the trials and the volumes pose a problem? Is my experiment even analyzable in this form?
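To make the timing concrete, here is a minimal sketch of which volumes each trial overlaps. The TR, volume count, trial count and trial duration are taken from the description above; that the first trial starts at t = 0 and that trials follow each other without gaps are assumptions, and the function name `volumes_overlapping` is made up for illustration.

```python
# Timing from the post: TR = 2 s, 211 volumes, 167 trials of 2.5 s per run.
# Assumption: the first trial starts at t = 0 and trials run back to back.
TR = 2.0
N_VOLUMES = 211
N_TRIALS = 167
TRIAL_DURATION = 2.5  # 2.25 s stimulus + 0.25 s ITI


def volumes_overlapping(trial_index):
    """Return the indices of volumes whose acquisition window overlaps a trial."""
    start = trial_index * TRIAL_DURATION
    end = start + TRIAL_DURATION
    first = int(start // TR)
    # Small epsilon so a trial ending exactly on a volume boundary
    # is not counted as overlapping the following volume.
    last = int((end - 1e-9) // TR)
    return list(range(first, min(last, N_VOLUMES - 1) + 1))


overlaps = [volumes_overlapping(i) for i in range(N_TRIALS)]
for i in range(5):
    start = i * TRIAL_DURATION
    print(f"trial {i}: {start:5.1f}-{start + TRIAL_DURATION:5.1f} s -> volumes {overlaps[i]}")

# Count how many trials touch each volume.
trials_per_volume = [0] * N_VOLUMES
for vols in overlaps:
    for v in vols:
        trials_per_volume[v] += 1
print("max trials overlapping one volume:", max(trials_per_volume))
```

Under these assumptions every trial touches two volumes and many volumes are shared by two trials, so a one-label-per-volume scheme is ambiguous even before the delay of the hemodynamic response is considered.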

Thanks for the help, and if any information is missing, I’ll be happy to provide it.

Hi @nira,

welcome to neurostars and thank you for your post, it’s great to have you here!
I’m very sorry for the late reply, but very happy to hear that you found our tutorials helpful!

That being said, the ML tutorial might actually not be the best guide to common ML workflows for neuroimaging data, especially with regard to how labels are generated. We picked the tutorial dataset because it’s rather small and allows you to explore different aspects. However, it’s rather unusual in that volumes and labels are precisely matched, which makes it possible to assign one label per volume. You’re right that this won’t work for your (and most other) designs/paradigms, as conditions/trials will almost certainly be distributed across several volumes.

What folks do in these (and other) cases is obtain beta images per/across trials/runs via a GLM and submit those to subsequent ML analyses. For example, if your experiment consisted of 8 runs within which 4 different auditory categories were presented, and you want to evaluate whether the voxel patterns of certain ROIs carry information about these conditions, you could submit your preprocessed data to a GLM to compute run-wise beta images for each condition (4 beta images per run, 32 in total). You would then train a classifier to differentiate them (e.g. based on voxels within the auditory cortex), cross-validating based on runs (e.g. leave-n-out). Nilearn has a lot of great tutorials on this. Since the nistats merger, nilearn also supports GLMs, so beta images no longer have to be obtained via other software packages.
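As a rough sketch of the cross-validation step in the 8-run, 4-condition example, the snippet below runs leave-one-run-out decoding with scikit-learn. The data are random numbers standing in for ROI-masked beta images, so the accuracies mean nothing; the voxel count, the noise level and the choice of a linear SVM are assumptions for illustration only. In practice the rows of `X` would come from your own beta maps (for instance estimated with nilearn's `FirstLevelModel` and masked with a `NiftiMasker`).

```python
import numpy as np
from sklearn.model_selection import LeaveOneGroupOut, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
n_runs, n_conditions, n_voxels = 8, 4, 200  # n_voxels is an arbitrary ROI size

# Synthetic stand-in for run-wise beta images: one row per (run, condition).
condition_patterns = rng.normal(size=(n_conditions, n_voxels))
X, y, runs = [], [], []
for run in range(n_runs):
    for cond in range(n_conditions):
        X.append(condition_patterns[cond] + rng.normal(scale=2.0, size=n_voxels))
        y.append(cond)    # label = condition of this beta image
        runs.append(run)  # group = run the beta image came from
X, y, runs = np.asarray(X), np.asarray(y), np.asarray(runs)

# Scaling is fit on the training runs only because it sits inside the pipeline.
clf = make_pipeline(StandardScaler(), LinearSVC())

# Each fold holds out all beta images of one run.
scores = cross_val_score(clf, X, y, groups=runs, cv=LeaveOneGroupOut())
print("beta images:", X.shape[0], "folds:", len(scores))
print("mean accuracy:", scores.mean(), "(chance = 0.25)")
```

Passing the run index as `groups` is what keeps beta images from the same run out of both the training and the test set of a fold.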

If you have further questions, please don’t hesitate to ask!

HTH, cheers, Peer

Hi @PeerHerholz!

It is almost a year later and I found your post really helpful. I’m new to MVPA and trying to decode without perfect labels (as most tutorials have) and I am getting tripped up.

Would you have a suggestion for how to do this with a limited number of runs? Our task has only two runs, so cross-validating based on runs doesn’t make sense, although I understand that this (together with having more than 2 runs of data) would be the ideal approach.

Would a between-subjects approach seem appropriate? To cross-validate across subjects? The problem here is that there would be repeated measures (multiple images for the same subject) in the dataset… perhaps I could run ML analyses on two datasets that don’t have more than one image per subject? Dataset 1 would have an image for condition 1 for subject X and an image for condition 2 for subject Y. Dataset 2 would have an image for condition 2 for subject X and an image for condition 1 for subject Y.
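To illustrate what I mean by cross-validating across subjects, here is a small sketch of a split that keeps both images of a subject in the same fold, using scikit-learn's `GroupKFold`. The number of subjects, the number of folds and the placeholder feature matrix are made up for illustration; only the splitting logic is shown, not a classifier.

```python
import numpy as np
from sklearn.model_selection import GroupKFold

n_subjects = 10  # hypothetical sample size

# One image per condition per subject, i.e. two rows per subject.
subjects = np.repeat(np.arange(n_subjects), 2)
conditions = np.tile([1, 2], n_subjects)
X = np.zeros((len(subjects), 1))  # placeholder features; only the split matters here

cv = GroupKFold(n_splits=5)
leaks = []
for fold, (train_idx, test_idx) in enumerate(cv.split(X, conditions, groups=subjects)):
    test_subjects = sorted(set(subjects[test_idx].tolist()))
    # Subjects that appear in both training and test set (should be none).
    shared = set(subjects[train_idx].tolist()) & set(test_subjects)
    leaks.append(len(shared))
    print(f"fold {fold}: test subjects {test_subjects}, shared with train: {len(shared)}")
```

With the subject ID as the grouping variable, no subject contributes images to both the training and the test set of a fold, which is the repeated-measures concern described above.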

Hopefully that makes sense! If I am completely misunderstanding how this works, please let me know!

Thanks :slight_smile: