Run searchlight analysis on within-subjects data

I have 18 .nii.gz files (after fmriprep).
Each file is from a different subject.
I have 2 conditions, and each subject belongs to exactly one of them (9 subjects are labeled False and 9 are labeled True).
I want to run a searchlight analysis to classify these files (meaning, have a classifier that predicts a subject's condition based on their data).
Is this possible? In all the examples I saw, the searchlight is always within-subject.

Hi @orko, if I understand correctly, I think your question may be related to this other post: Haxby searchlight example for multiple subjects?

If not, can you give more detail about your data and what you want to do?

@ymzayek Thanks, I’m not sure that this is related to my problem.
To my understanding, in the Haxby dataset, each subject can have various labels per session - house, face, etc.
In my case, the classification is per-subject.
Meaning, subject #1 only saw faces and subject #2 only saw houses. So I have two types of subjects: houses or faces.
I want to run a searchlight analysis with a classifier that predicts the type of subject, given the data.

Ok I see. I am not sure this is implemented, but perhaps there is a more appropriate approach for your analysis in nilearn. I will let someone else chime in. Perhaps @bthirion can better answer this?

Formally, the problem of inter-subject searchlight analysis is the same as the intra-subject analysis: you have a bunch of fMRI files associated with labels, and you want to measure the statistical associations between images and labels. So you can essentially use the same code.
This being said, there is an outstanding problem: performing such a statistical analysis on n=18 images will not yield any meaningful result, especially because there is a multiple comparisons issue (across all searchlights).

@bthirion Thanks for the response!

  1. Actually, in all the examples I saw, the Searchlight object gets one subject's data and fits the classifier on it. So I am not sure how to use existing code when there is only a single label per subject. Do you maybe have an example?
  2. Can you please elaborate on why the results will be meaningless?
  1. Assuming that you have performed standard preprocessing of your data, the images are samples in the same space, so you can formally act as if they were data from one subject. To put it differently, you can concatenate the individual images into a 4D image using e.g. Nilearn: Statistical Analysis for NeuroImaging in Python — Machine learning for NeuroImaging, get the corresponding vector of labels, and give both to the searchlight function. I can try and craft an example, but it won't be much different from what I described here.

  2. The confidence intervals around the searchlight statistics you obtain will be huge. Indeed, these confidence intervals decrease as 1/sqrt(n_samples). You can e.g. take a look at
    Cross-validation failure: Small sample sizes lead to large error bars - ScienceDirect


@bthirion Thanks!
So just to make sure I understand:
You suggest that I take each of the 18 4D images (one per subject) and concatenate them into an 18-item list of 4D data.
This list, together with an 18-item label vector, should be fed into the searchlight. Thus, each subject's data (~6 minutes) will be “treated” by the searchlight model the way a single TR normally is?

However, this makes the dataset much smaller: instead of the usual hundreds of samples, I will actually have only 18 samples. Therefore, for each sphere, the error bars will be roughly 1/sqrt(18) ≈ 0.24, making the result meaningless?

I thought that you had one volume per subject, but it looks like that is not the case. What are the 6 minutes of data you have per subject? Are these time courses?
It is unclear to me what per-subject information you actually want to use in the searchlight analysis.

@bthirion I have 250 TRs per subject.
I want to run the searchlight to discover which voxels are relevant to the classification of a subject's condition. It is possible to average a subject's data to a 1 x n_voxels vector to get “one volume” per subject, if that helps, so that the data from each subject is a vector of mean activity per voxel.

I don’t think that you can discover much if you don’t have more structure in your data. Do the 250 scans correspond to a synchronized stimulation condition?
The average activity across time is usually not considered a meaningful feature. What matters are synchronized variations in brain activity.

Yes, the 250 TRs all correspond to the same song that the subjects are hearing. I understand that this is sub-optimal, but we were asked to run this searchlight.
Is there a way to do so intra-subject?

Hi @orko,

could you maybe outline the task/paradigm a bit further?
Depending on whether participants performed a passive listening paradigm (“just” hearing the song) or conducted a certain task (e.g. “press a button whenever you hear a guitar”, etc.), different modelling/analysis approaches might be feasible.
The same holds true for the spatial and temporal alignment of participants:

  • did you perform a spatial transformation into template space?
  • did participants hear the same part of the song at the “same time” (in terms of TR and stimulus synchronization)?

Cheers, Peer

P.S.: as outlined by @bthirion: independent of your answers to the above questions, whatever results you obtain won’t be meaningful or even statistically valid (worst case), as the low n will prevent obtaining any reliable outcome measure and furthermore won’t allow any assessment of generalization. Isn’t there any room for discussion re the analysis of this dataset? For example, going a bit more in the direction of connectivity analyses?

@PeerHerholz Thank you!
regarding your questions:

  1. It was a passive paradigm, no task was given

  2. Yes, I transformed them to MNI152

  3. Yes, it was synchronised.

  4. There might be some room for discussion, but I want to make sure I understand the problem first in order to communicate it correctly going forward - I will basically have 1 label per subject, resulting in 18 samples, meaning very low predictive power and a high susceptibility to false positives considering the high number of comparisons in the searchlight - right?

  5. Would love to hear any ideas you may have re connectivity or any other type of analysis. Overall, the goal is to check whether a certain area (as defined using a specific existing mask we have) is indicative of condition (group of subjects).

Hi @orko,

cool, thanks so much for the information!

Gotcha! So this makes “classic analysis approaches” a bit difficult as e.g. for a GLM you would need somewhat distinct regressors/events (e.g. music onset, tempo change, etc.). Furthermore, depending on the regressors/events and GLM approach you would most likely get a very limited amount of estimated responses, e.g. 1 beta/z map per regressor/event. Other analysis approaches might be more feasible/suitable, please see the response to 5.

Alrighty. In this case, and without any further steps, you would assume “anatomical feature correspondence” regarding the searchlight, cross-validation, etc., i.e. that a given voxel/signal location in one participant is identical or comparable to that in another participant. However, while often “ok”, spatial transformation into a reference/template space is of course never 100% perfect, and additionally there’s a prominent inter-participant variability regarding voxel/signal and thus feature location. Thus, it’s always a bit “sub-optimal”. Other approaches, e.g. functional alignment, could be interesting to look into here, but please see the response to 5.

This might allow for other analysis approaches, for example also spatio-temporal searchlights but please see 5. re this.

Yeah, it’s a rather holistic problem: few participants and little data per participant. If your goal is to evaluate whether the response to music can “predict” certain participant groups, then having an appropriate amount of data (including SNR, etc.) and participants (inter- vs. intra-variability) is crucial re obtaining reliable estimates, certainty, generalization, etc. As mentioned by @bthirion, the confidence intervals you would most likely obtain will be huge, and together with the other outlined problems they render the performance and interpretation of your model and results drastically limited at best. Are there potentially other, comparable datasets out there you could maybe utilize for training a model and then test it on your data? Then again, the question would be if this is somewhat meaningful/feasible/suitable.

If you already have an ROI that you want to evaluate, a searchlight analysis might not be the approach you want to take, as it tries to obtain insights on “where” information re a given classification task (and thus potentially a cognitive process) is located. It’s thus rather “explorative” in spatial terms. However, if your ROI is rather “big”, i.e. the temporal lobe, you could utilize it as a searchlight mask, i.e. an ROI within which the searchlight is run, to evaluate “where” in the ROI certain information is entailed. Otherwise, you might just want to use your ROI in a rather “classic” decoding approach, i.e. extract signal/patterns (whatever that means at this point) and evaluate whether they entail information re your question, that is, whether they can distinguish the groups. Then again, the same problems as for the searchlight re the amount of data are present.

As mentioned before, there might be other analyses you could look into, for example connectivity, encoding and/or functional alignment/SRM.
Re the first, you could employ a functional connectivity analysis, computing the correlation of ROI time series and then comparing that between groups via a non-parametric test, maybe with permutations, to address the amount and distribution of the data.
Re the second, you could extract stimulus features of the song, i.e. chroma, pitch, timbre, valence, etc., and utilize them in a regression analysis, predicting brain responses, either for certain ROIs or voxels. The resulting maps, or the performance of the model, could then be compared between groups.
Re the third, you could treat your data as originating from a (somewhat) naturalistic paradigm and go in the direction of functional alignment/shared response modelling/intersubject correlation/etc.
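For the first suggestion, the group comparison step could look roughly like this, assuming you have one ROI-pair correlation value per subject (the values below are simulated stand-ins):

```python
# Sketch: non-parametric group comparison of per-subject connectivity values.
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
# Simulated ROI-pair correlations, one per subject (9 per group)
group_a = rng.normal(0.3, 0.1, size=9)
group_b = rng.normal(0.4, 0.1, size=9)

def mean_diff(x, y, axis):
    return np.mean(x, axis=axis) - np.mean(y, axis=axis)

# Permutation test on the difference of group means
res = stats.permutation_test(
    (group_a, group_b),
    mean_diff,
    permutation_type="independent",
    vectorized=True,
    n_resamples=5000,
)
print(res.pvalue)  # two-sided p-value from the permutation distribution
```

The permutation approach sidesteps normality assumptions, which matters with n=9 per group, though the small n still limits power.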

HTH, cheers, Peer

@PeerHerholz Thank you very much for this very detailed answer!
My ROI is pretty big (the reward system), and using a “classic” decoding approach we did find it to be significant. However, we were advised to run a whole-brain searchlight analysis to ensure these results are unique to this area and not a fluke. Do you have any other suggestions as to how to do this?

Re your suggestions:
A) Do you maybe have an example tutorial for the connectivity analysis you propose?
B) We did try an encoding model, but the results were inconclusive.
C) I did see the option of ISC and we will probably try it. It should be done within my ROI, and then I should assess statistical significance by comparing to a null distribution? I read about the option of functional alignment and SRM, but to my understanding these are more about reducing data dimensionality than about reaching a certain conclusion about the data? Maybe I missed something?


Ok, thanks for the information! It’s not my research field, thus please excuse the questions but does this “reward system” ROI entail one big cohesive ROI or multiple ROIs at different locations that are not connected? Also: would you mind sharing how this ROI was derived (sorry if I missed this)?
Instead of running a whole-brain searchlight you could also utilize a whole-brain atlas based on functional markers/aspects, for example DiFuMo, and run one ROI decoding analysis per ROI. In its “highest form” DiFuMo contains 1024 ROIs (IIRC), which would result in 1024 decoding analyses, notably fewer than e.g. a whole-brain searchlight over 70-100k voxels, concerning both computational resources and the number of comparisons you have to correct for. Additionally, since it is a whole-brain atlas/parcellation with a high number of ROIs, you won’t lose a lot of spatial information, you already have some labeling, and statistical inference is eased. However, all the shortcomings mentioned above for the searchlight apply here as well, i.e. the small number of participants, etc.

There are multiple options to do this, I think. Most commonly it might be beta series correlations or DCM. Re the first, you could have a look at Nibetaseries and a respective tutorial (nb: I’m biased here because I worked on this package, sorry). Re the latter, this resource might be helpful. Please note that both traditionally assume a different type of experimental design/paradigm, and thus some modifications might be required.


Re ISC and comparable approaches, I would suggest checking out Brainiak, e.g. this tutorial.
I don’t think I would agree with these approaches being used mainly for data dimensionality reduction rather than analysis. In fact, the opposite might be the case: you might drastically increase the dimensions of shared information and actually enable certain analysis approaches that wouldn’t be possible otherwise, as well as enhance existing ones. However, all of that of course heavily depends on the data one has and the hypotheses/ideas one wants to test. That being said: do you have additional data from your participants other than the 6-min songs? I’m asking because that might not be enough data to implement the above-mentioned approaches in a feasible/suitable manner.
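For intuition, a bare-bones leave-one-out ISC can be written in plain numpy (Brainiak provides a proper implementation; the signals below are simulated, with an artificial shared stimulus-driven component):

```python
# Sketch: leave-one-out intersubject correlation for one voxel/ROI signal.
import numpy as np

rng = np.random.default_rng(5)
shared = rng.standard_normal(250)  # shared stimulus-driven component
# 18 subjects: shared signal plus subject-specific noise
data = np.array([shared + rng.standard_normal(250) for _ in range(18)])

isc = []
for s in range(18):
    others = np.delete(data, s, axis=0).mean(axis=0)  # average of the rest
    isc.append(np.corrcoef(data[s], others)[0, 1])
isc = np.array(isc)
print(isc.mean())  # clearly positive here, since a shared signal is present
```

Significance would then typically be assessed against a null distribution built by time-shifting or phase-randomizing the signals, as in the Brainiak tutorials.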

Cheers, Peer

@PeerHerholz Thank you very much for the elaborated response!

  1. OK, I think I will just skip the whole searchlight issue.
  2. Is it common to run intra-subject beta series analysis?
  3. No other data. Should 6 minutes be enough data for ISC? Is there any reference for this, so that I can justify the decision if I don’t run it?