In my experiment I presented an ambiguous auditory stimulus (a syllable intermediate between /da/ and /ga/) to one ear and a disambiguating acoustic cue (a high or low third-formant frequency) to the other ear. In ~70% of the trials, participants gave a response consistent with the presented cue, i.e., they integrated the cue.

Now, I would like to distinguish neural representations that are driven by the stimulus acoustics (the presented cue) from representations that are driven by the participants' response (ga or da).

I built two cross-validation models: in one, I train a classifier to decode the presented acoustic cue; in the other, the responses given by the participants. As a dependent variable, I used area under the curve (AUC) minus chance (as suggested for second-level statistics in SPM). The simplest approach I can imagine would be to normalize and smooth the AUC-minus-chance maps and to run a second-level paired-sample t-test.
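To make the setup concrete, here is a minimal sketch of the two first-level decoding analyses described above, using scikit-learn as a stand-in (the thread otherwise uses SPM/TDT); all variable names and data are placeholders, and the cross-validation scheme here is schematic (run structure matters, as discussed later in the thread):

```python
# Sketch of the first-level step described above: the same trial patterns are
# decoded twice, once with the presented acoustic cue as labels and once with
# the participants' responses as labels, keeping AUC minus chance (0.5) as
# the dependent variable. X, cue_labels, response_labels are placeholders.
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.model_selection import StratifiedKFold, cross_val_score

rng = np.random.default_rng(0)
n_trials, n_voxels = 240, 50                    # e.g. 4 runs x 60 trials
X = rng.normal(size=(n_trials, n_voxels))       # trial-wise patterns (placeholder)
cue_labels = rng.integers(0, 2, n_trials)       # presented cue (high/low F3)
response_labels = rng.integers(0, 2, n_trials)  # response given (da/ga)

cv = StratifiedKFold(n_splits=4)  # schematic CV; see run-wise CV discussion below

def auc_minus_chance(labels):
    # cross-validated AUC with chance level (0.5) subtracted
    aucs = cross_val_score(LinearSVC(), X, labels, cv=cv, scoring="roc_auc")
    return aucs.mean() - 0.5

auc_cue = auc_minus_chance(cue_labels)        # one value per subject/searchlight
auc_resp = auc_minus_chance(response_labels)
```

These subject-wise values (computed per searchlight or ROI) are what would then enter the second-level comparison.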

I think there are some problems related to this, as described in your article. Moreover, I would be ignoring whether the derived patterns are significant at the first level.

Hebart, Martin N., and Chris I. Baker. "Deconstructing Multivariate Decoding for the Study of Brain Function." NeuroImage, New Advances in Encoding and Decoding of Brain Signals, 180 (October 15, 2018): 4–18. https://doi.org/10.1016/j.neuroimage.2017.08.005.

For a binary classification problem with accuracy as the dependent variable, you would probably recommend first-level permutations and second-level prevalence statistics, as implemented in TDT.
I was wondering what analysis procedure you would recommend in the described case.

The significance at the first level is not so much of a problem when you assume that the errors at the subject level can be ignored (which is what a random-effects test at the second level does). As you pointed out, there is a problem with running a t-test at the second level, and there is currently no good solution to it. I can think of two ways of dealing with this:

Approach 1: Run prevalence inference at the second level, so that you can make claims about the population.
Apart from this taking a while to run with searchlight analyses, current prevalence tests have low statistical power: currently, at least 50% of participants should be showing an effect. The problem is that not showing an effect can have two causes: the participant doesn't have an effect (which is what the test assumes), or the data are too noisy to reveal one. You could use a manipulation check (e.g. an independent decoding analysis, such as ring finger vs. middle finger decoding) where you would have to find an effect in every participant. Then, say, you find it to be significant in 80% of participants. You could then assume that not 50% but only 40% of participants need to carry an effect, and use that as your estimate. However, this is, statistically speaking, not quite correct, as we are chaining several probabilities, some of which are unknown (and this is without even considering false-positive or false-negative effects). It would also be region-specific and wouldn't apply easily to searchlight analysis. So I'd rather stick to the default of 50% for now.

Approach 2: Just run the second-level t-test and make clear that you are not making claims about the population.
This is fine as long as people know that you cannot generalize to the population but that it ends up being a fixed-effects test (with reference to e.g. Allefeld et al. and perhaps our paper that you cited).
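A quick arithmetic sketch of the prevalence-threshold adjustment mentioned under Approach 1 (the 50% and 80% figures are the examples from above; as noted there, this chains partly unknown probabilities and is a heuristic, not a formally correct test):

```python
# Back-of-the-envelope adjustment described for Approach 1: if an independent
# manipulation check reaches significance in only 80% of participants, the
# default prevalence null of 50% could be scaled down accordingly.
# Heuristic only, as discussed above; not a formally correct procedure.
default_null = 0.50      # default prevalence threshold: majority of participants
detection_rate = 0.80    # fraction of participants with a detectable effect
adjusted_null = default_null * detection_rate
print(adjusted_null)     # 0.4
```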

Many thanks for your comprehensive response. I have a follow-up question on Approach 1. As far as I understood Allefeld et al. (https://www.sciencedirect.com/science/article/pii/S1053811916303470#bb0210), prevalence statistics is a better way to address questions usually addressed with a second-level one-sample t-test against chance. Is there also something similar to a paired-sample t-test or an ANOVA, so that you could say above-chance classification in area X is significantly more prevalent for the participants' responses (e.g. 80%) than for the stimulus acoustics (e.g. 70%)? Or would you just run second-level prevalence statistics for each classification (decoding of the stimulus acoustics and of the response) separately, and then compare the results at a given prevalence threshold, e.g. 70%, to compare the differences across the two classifications?

Ah, yes, I forgot about this, and thanks for bringing this up. A couple of years ago, I came up with this same thought and wanted to discuss it with Carsten but then forgot.
Indeed, when you are not testing against chance, in theory this would mean you should be able to do random-effects (RFX) testing again, since true negative accuracy differences are again possible. I just don't know whether this works in practice. The idea behind RFX testing is that the variability within individual participants can be ignored. Carsten made a good point that it cannot be ignored, but the degree to which intraindividual variability dominates interindividual variability in these cases is unclear. I'd say it's ok to go for an RFX test in this case when you think that the assumptions of the test hold. I'm not sure how you would go about testing that; perhaps Carsten has simulated it.

Hi Martin,
I have a follow-up question on Approach 2. Could this approach be improved by using the SnPM (Statistical NonParametric Mapping) toolbox, with permutation testing at the second level?
Many thanks for your advice!
Basil

Approach 2 is definitely not bad, especially when you work with unsmoothed data or ROI data. The background is that without smoothing, or in ROIs, you end up with non-normally distributed data. With smoothing, there shouldn't be much of a difference, but it can never hurt to go non-parametric. However, this approach does not resolve the issues of Approach 1. At the same time, I have seen a lot of reviewers ask for permutation tests at the group level, possibly assuming that they would resolve issues that came about at the subject level (which they don't), possibly for other reasons I'm not aware of. Even though I do not see a huge statistical benefit (but again, it should only be better with SnPM, not worse), it may help convince reviewers that your statistics are in fact ok, without necessarily resolving issues with the tested null hypothesis.
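For illustration, here is a minimal one-value sketch (in Python/NumPy, not SnPM itself) of the kind of second-level sign-permutation test SnPM performs for a one-sample design; the subject values are made-up numbers, and SnPM additionally works voxel-wise with FWE control:

```python
# Second-level sign-flip permutation test: under the null, subject-wise
# AUC-minus-chance values are assumed symmetric around zero, so their signs
# can be flipped at random to build a null distribution of the group mean.
import numpy as np

rng = np.random.default_rng(42)
subject_values = np.array([0.08, 0.12, 0.05, 0.15, 0.02, 0.09, 0.11, 0.07,
                           0.04, 0.10, 0.13, 0.06])   # AUC - 0.5, one per subject
observed = subject_values.mean()

n_perm = 10000
signs = rng.choice([-1.0, 1.0], size=(n_perm, subject_values.size))
null_means = (signs * subject_values).mean(axis=1)

# one-sided p-value: fraction of sign-flipped means at least as large as observed
p = (np.sum(null_means >= observed) + 1) / (n_perm + 1)
```

Note that, as discussed above, this only randomizes at the group level; it does not address anything that went wrong at the subject level.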

I have a follow-up question. As discussed below in the thread, RFX testing is currently not implemented in prevalence statistics (Approach 1). I am currently pursuing Approach 2. I would like to test whether a certain area contributes more strongly to the representation of the acoustic stimulus or of the perception (i.e., the response of the participant). I was wondering whether it would be valid to contrast the AUC maps derived from acoustic-stimulus decoding and response decoding?

I think whether it’s allowed or not depends on whether you believe the assumptions of your test hold. Are the differences in AUC maps normally distributed around 0 and can you ignore the subject level effect? If yes, then it should be fine.

We constrained our searchlight analysis to voxels that show a response to our auditory stimuli (based on a group-level mask, sound > baseline, p < .001 uncorrected).

If I observed AUC maps that are not normally distributed around zero, but skewed towards positive values, does this suggest overfitting, or could that simply be explained by using a subsample of voxels?

The other question I have: do I ignore the subject-level effect if I compare the AUC maps with a paired t-test in SnPM?

I think it’s possible given your constraint that things are above zero, but I don’t know your data, so it’s hard to tell if you introduced some circularity in your analysis. Perhaps try your selection on a region outside of the brain or a region where you would definitely not expect to find anything (e.g. a white matter ROI) and see if you still find your effects the same way.

Yes, you ignore it. It's just like a non-parametric t-test, which is a random-effects test.

Indeed, I find that some participants show a positively skewed AUC-minus-chance distribution for an ROI of no interest in the white matter.

My dataset includes 4 fMRI runs, each including 60 trials (30 of each stimulus). For the decoding of the participants' responses, I balanced the number of sets for each response, subsampling the more frequent response option. So far, I used a leave-4-out cross-validation design, and the AUC maps were quite convincing, showing, for example, good decoding of the response in the hand area of the motor cortex.

However, according to Varoquaux et al. (see reference below), my decoding design is not recommended: they advocate leaving out entire runs, or random splits. I remember that I used 10-fold cross-validation with random splits in the past, testing on 10% of the data, and that this analysis was not sensitive enough. I think I have an optimization problem here. What strategy would you suggest to optimize my decoding parameters?
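For reference, the run-wise cross-validation that Varoquaux et al. advocate can be sketched in scikit-learn as follows (X, y, and the run labels are placeholders standing in for the 4-run, 60-trials-per-run design described above):

```python
# Leave-one-run-out cross-validation: entire runs are held out, so that train
# and test data never share a run. Data here are random placeholders.
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.model_selection import LeaveOneGroupOut, cross_val_score

rng = np.random.default_rng(1)
n_runs, trials_per_run, n_voxels = 4, 60, 50
X = rng.normal(size=(n_runs * trials_per_run, n_voxels))
y = np.tile(np.repeat([0, 1], trials_per_run // 2), n_runs)  # 30 per class per run
runs = np.repeat(np.arange(n_runs), trials_per_run)          # run label per trial

# 4 folds, each testing on one held-out run
aucs = cross_val_score(LinearSVC(), X, y, cv=LeaveOneGroupOut(),
                       groups=runs, scoring="roc_auc")
print(aucs.shape)  # (4,) -> one AUC per held-out run
```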

Many thanks for your advice.

Cheers,

Basil

Varoquaux, Gaël, Pradeep Reddy Raamana, Denis A. Engemann, Andrés Hoyos-Idrobo, Yannick Schwartz, and Bertrand Thirion. “Assessing and Tuning Brain Decoders: Cross-Validation, Caveats, and Guidelines.” NeuroImage, Individual Subject Prediction, 145 (January 15, 2017): 166–79. https://doi.org/10.1016/j.neuroimage.2016.10.038.

Not sure what you mean by balancing the number of sets for each response, but if you mix runs in your decoding analysis, this will likely mess things up, and if you use subsampling, this can also mess things up. This is the reason why we usually throw an error in TDT when you try mixing runs. It all depends on whether the patterns that go into your analysis can be assumed to be independent, and, if they aren't, whether you can make sure that you stratify classification, i.e. make sure that any imbalances in the design cannot explain your results. For example, the effect of run is massive: if you have a different number of trials in one condition coming from one run, you are almost guaranteed to get skewed results.

My recommendation is: if you aren't entirely sure that you have non-independence or trial-wise / run-wise confounds under control, then do leave-one-run-out. I think 4 runs is ok for that. You can keep your conditions unbalanced in training if you use AUC; you can even model each condition as one regressor. The big issue is that the resulting AUC values will come in quite large steps, rather than being continuous. You could still do trial-wise prediction using AUC to overcome this. An alternative suggestion would be to use crossnobis instead, which offers a continuous measure of discriminability and can be used nicely even with run-wise betas.
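One possible reading of the trial-wise-prediction suggestion, sketched in scikit-learn (an interpretation, not TDT code): pool the continuous decision values from all leave-one-run-out folds and compute a single AUC over all trials, avoiding the coarse steps of per-fold AUCs.

```python
# Trial-wise pooled AUC across leave-one-run-out folds. Data and variable
# names are placeholders for the 4-run design discussed in the thread.
import numpy as np
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import LeaveOneGroupOut
from sklearn.svm import LinearSVC

rng = np.random.default_rng(2)
n_runs, trials_per_run, n_voxels = 4, 60, 50
X = rng.normal(size=(n_runs * trials_per_run, n_voxels))
y = np.tile(np.repeat([0, 1], trials_per_run // 2), n_runs)
runs = np.repeat(np.arange(n_runs), trials_per_run)

scores = np.empty(len(y))
for train, test in LeaveOneGroupOut().split(X, y, groups=runs):
    clf = LinearSVC().fit(X[train], y[train])
    scores[test] = clf.decision_function(X[test])  # continuous trial-wise scores

pooled_auc = roc_auc_score(y, scores)  # single AUC over all trials
```

One caveat: pooling decision values assumes their scales are comparable across folds; if that is doubtful, per-fold AUCs or crossnobis (as suggested above) may be safer.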