MVPA searchlight-based decoding

Hi Martin and all,

I am trying to perform a searchlight-based 8-way classification. I have 8 conditions and overall 512 beta estimates (64 betas for each condition). The estimated time to perform MVPA in the gray matter (~28,000 searchlights) is approximately four months.

RSA performed on the same dataset takes 40-60s depending on the number of searchlights.

Do MVPAs (8-way classification with 512 betas) usually take that long?


I think this definitely sounds unusually long. How large is your searchlight? If it’s really, really small it might take a lot longer. Internally, this is running 28 pairwise classification analyses (one for each pair of your 8 conditions), but it still shouldn’t take much longer than a few minutes.
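As a quick check on where the 28 comes from, here is a minimal MATLAB sketch. The number of cross-validation folds is a made-up value for illustration only; it is not taken from the thread.

```matlab
n_classes = 8;                             % conditions in the 8-way problem
pairs = nchoosek(1:n_classes, 2);          % every unordered pair of class labels
n_pairwise = size(pairs, 1);               % one binary classifier per pair

n_searchlights = 28000;                    % approximate figure from the question
n_folds = 8;                               % HYPOTHETICAL number of cross-validation folds
n_fits = n_pairwise * n_searchlights * n_folds;

fprintf('%d classes -> %d pairwise classifiers\n', n_classes, n_pairwise);
fprintf('roughly %d binary classifier fits in total\n', n_fits);
```

Even a small per-fit slowdown is multiplied by that total, which is why the settings of the classifier matter so much here.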

I think it would help if you could provide more information about the settings you used for the analysis (regarding the searchlight, the classifier, and the expected output). What you could also do is run “profile on” in the command window, start the analysis, cancel it after about 5 min, then run “profile off” and finally “profile viewer”. That should tell you where the bottleneck is. Feel free to report back!
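The profiling steps could look roughly like the sketch below. The matrix loop is only a stand-in workload so the snippet runs on its own; in practice the commented-out analysis call would take its place and be cancelled by hand after a few minutes.

```matlab
profile on                        % start collecting timing data

% results = decoding(cfg);        % the real analysis would go here (cancel with Ctrl+C)
A = rand(200);                    % stand-in workload so this sketch is self-contained
for k = 1:20
    B = A * A';
end

profile off                       % stop collecting
p = profile('info');              % struct holding the collected timing table
% profile viewer                  % opens the interactive report in MATLAB
```

In the report, sorting by total time shows which functions dominate the run.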

One thing that helps if you are dealing with unusually large betas, or betas with very different scales, is to scale the data. Check out the scaling options in decoding_scale_data. This can often speed things up a lot.

Hope this helps!

Thank you Martin for the prompt reply.

I was using the following settings for my analyses:
Searchlight radius: 3 voxels
Classifier: 8-way classifier (to classify 8 graphemes)
Expected output: accuracy minus chance

Profile viewer showed ‘libsvm_train.m’ and ‘libsvm_test.m’ to be taking up the most time.

The analysis time dropped to 40-90 min once I increased the searchlight radius to 15 voxels (while keeping all the other settings the same). Nevertheless, we do not intend to pursue such a large radius.

Is there a principled way to choose the searchlight radius given the number of classifications?


Hi Vinodh,

Apologies for the delay, I was on vacation and only just returned. Regarding the searchlight radius: there is no principled approach for choosing a radius; it just depends on the size you would find appropriate. 12 mm seems quite common (which is 4 voxels for 3x3x3 mm voxels and 6 voxels for 2x2x2 mm voxels). Please make sure you really choose voxels and not mm!
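To make the mm-to-voxel conversion concrete, here is a small MATLAB sketch. The voxel count is a rough geometric estimate (voxel centers within the radius on an isotropic grid) and not necessarily how the toolbox builds its searchlights.

```matlab
radius_mm     = 12;                        % radius quoted above
voxel_size_mm = [3 2];                     % isotropic voxel edge lengths to compare
radius_vox    = radius_mm ./ voxel_size_mm;

n_vox = zeros(size(radius_vox));
for i = 1:numel(radius_vox)
    r = radius_vox(i);
    [x, y, z] = ndgrid(-r:r);              % integer voxel offsets around the center
    n_vox(i) = nnz(x.^2 + y.^2 + z.^2 <= r^2);
    fprintf('%d mm voxels: radius %d voxels, about %d voxels per searchlight\n', ...
            voxel_size_mm(i), r, n_vox(i));
end
```

The number of features per searchlight grows roughly with the cube of the radius, so a radius given in the wrong unit changes the problem size considerably.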

Glad you found the bottleneck. It seems you are running the analysis on single trials. I would try scaling the data first to see if that improves the speed, using decoding_scale_data. I would also consider not using single-trial estimates (64 betas per condition) but really only one beta per condition per run. If you really want to use single-trial estimates and if scaling doesn’t help, then I’d encourage you to reduce the classifier cost c to 0.01 or 0.001. This can be done by setting the following:

for classification_kernel (which is an internal trick for speeding up the computation by precomputing the linear kernel)
cfg.decoding.train.classification_kernel.model_parameters = '-s 0 -t 4 -c 0.01 -b 0 -q';
and if you don’t want to use that trick:
cfg.decoding.train.classification.model_parameters = '-s 0 -t 0 -c 0.01 -b 0 -q';

If that is still not satisfactory, then you should probably use a different classifier. I would recommend crossnobis (see our template) since it’s comparatively fast, generally performs pretty well, and yields nice continuous results rather than binary accuracies.
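For readers unfamiliar with the LIBSVM option strings above: -s 0 selects C-SVC, -t 0 a linear kernel, -t 4 a precomputed kernel, -c the cost, -b 0 turns off probability estimates and -q suppresses output. The idea behind precomputing a linear kernel can be sketched in plain MATLAB as follows. This is a toy illustration with made-up data and fold indices, not the toolbox's internal code.

```matlab
n_samples = 16;
n_voxels  = 50;
X = randn(n_samples, n_voxels);            % rows = samples (betas), columns = voxels

train_idx = 1:12;                          % HYPOTHETICAL training fold
test_idx  = 13:16;                         % HYPOTHETICAL test fold

% Compute the linear kernel (all pairwise dot products) once ...
K = X * X';

% ... so that each cross-validation fold only has to index into it.
K_train = K(train_idx, train_idx);         % training vs. training samples
K_test  = K(test_idx,  train_idx);         % test vs. training samples

% Same values as recomputing the dot products for this fold:
max_diff = max(max(abs(K_test - X(test_idx, :) * X(train_idx, :)')));
```

The kernel has one row and column per sample, so its size depends on the number of betas rather than on the number of voxels in the searchlight.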


Thank you, Martin!

Choosing a larger searchlight radius and a reduced classifier cost reduced the computation time significantly.


Hi Martin,
I’m also using searchlight analysis on single trials. I am exploring the different parameters of the analysis in order to find the best fit, and I was wondering what you mean by decoding_scale_data?

Hi Tamir,

Great you like our toolbox! Scaling is used mostly to make the range of the data more “normal” for the classifier and can speed up things significantly depending on what classifier you use. I’d check out the help file for decoding_scale_data.

Also, the excerpt below is from our paper on TDT and goes into a bit more detail. Have a read, and in case you have other questions, perhaps we have covered them!

Hope this helps,


Scaling is the process of adjusting the range of data which enters the classifier. This can be done to bring data to a range which improves the computational efficiency of the classifier (for example LIBSVM recommends scaling all data to be between 0 and 1). It can, however, also be used to change the relative contribution of individual features or individual samples or to remove the influence of the mean spatial pattern (Misaki et al., 2010; but see Garrido et al., 2013) which might affect classification performance. Scaling is also known as normalization, but we prefer the term scaling to distinguish it from another meaning of the term “normalization” which is commonly used in the MRI community to refer to spatial warping of images.

Typically, row scaling is used, i.e., scaling across samples within a given feature. Although scaling can theoretically improve decoding performance, for some data sets it may not have any influence (Misaki et al., 2010). Practically, scaling often has little or no influence on decoding performance when beta images or z-transformed data are passed, because these data already represent a scaled form of the raw images, scaled relative to each run rather than to all training data. However, scaling may still speed up classification.

TDT allows a number of different settings: Either all data are scaled in advance (in TDT: “all”), which is only valid when scaling carries no information about class membership that influences test data, or scaling is carried out on training data only and these estimated scaling parameters are then applied to the test data (in TDT: “across”). The typically used scaling methods, which have also been implemented in TDT, are min0-max1 scaling and z-transformation. Min-max scaling scales all data to a range between 0 and 1, while z-transformation transforms data by removing the mean and dividing by the standard deviation. In addition to scaling data to a specified range, cut-off values can be provided for outlier reduction (Seymour et al., 2009). With this setting, all values larger than the upper cut-off are reduced to this limit, and all values smaller than the lower cut-off are set to the lower limit. In TDT, these approaches can be combined with outlier reduction.
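What these operations do can be sketched in a few lines of plain MATLAB. This is a toolbox-independent toy example with made-up numbers, not the code of decoding_scale_data; applying the cut-off after the z-transformation is an assumption made for illustration.

```matlab
% Toy data: rows = samples, columns = features (voxels).
X_train = [1 10; 2 20; 3 30; 4 100];
X_test  = [2.5 25; 0 500];

% "across"-style scaling: estimate the parameters on the training data only ...
mu    = mean(X_train, 1);
sigma = std(X_train, 0, 1);

% ... and apply them to training and test data alike (z-transformation).
Z_train = (X_train - mu) ./ sigma;
Z_test  = (X_test  - mu) ./ sigma;

% Outlier reduction: values beyond the cut-offs are set to the cut-offs.
cutoff = [-3 3];
Z_test_clipped = min(max(Z_test, cutoff(1)), cutoff(2));

% Min0-max1 alternative, again with parameters from the training data.
lo = min(X_train, [], 1);
hi = max(X_train, [], 1);
M_train = (X_train - lo) ./ (hi - lo);
```

Test values can fall outside the training range after scaling, which is where the cut-offs come in. In TDT itself these choices are made through the cfg.scale settings.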

Example call:

cfg.scale.method = 'across'; % scaling estimated on training data and applied to test data
cfg.scale.estimation = 'z'; % z-transformation as scaling approach
cfg.scale.cutoff = [-3 3]; % all values > 3 are set = 3 (here: 3 standard deviations, because data is