Comparing neural representations derived from different cross-validation models

Dear Martin,

In my experiment I presented an ambiguous auditory stimulus (a syllable intermediate between /da/ and /ga/) to one ear and a disambiguating acoustic cue (a high or low third-formant frequency) to the other ear. In ~70% of trials, participants gave a response consistent with the presented cue, i.e., they integrated the cue.

Now, I would like to distinguish neural representations that are driven by the stimulus acoustics (the presented cue) from representations that are driven by the participants' response (ga or da).

I built two cross-validation models: in one, I train a classifier to decode the presented acoustic cue; in the other, the responses given by the participants. As the dependent variable, I used area under the curve (AUC) minus chance (as suggested for second-level statistics in SPM). The simplest approach I can imagine would be to normalize and smooth the AUC-minus-chance maps and run a second-level paired-sample t-test.
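
To make the intended comparison concrete, here is a minimal sketch of that second-level paired test, for a single voxel/ROI rather than whole-brain maps. The arrays `auc_cue` and `auc_resp` are simulated placeholders for per-subject AUC-minus-chance values; in practice each subject contributes one map per decoding model.

```python
# Sketch: second-level tests on AUC-minus-chance values (one voxel/ROI).
# All numbers are illustrative assumptions, not real data.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_subjects = 20

# Simulated per-subject AUC minus chance (chance AUC of 0.5 already subtracted)
auc_cue = rng.normal(loc=0.05, scale=0.04, size=n_subjects)
auc_resp = rng.normal(loc=0.10, scale=0.04, size=n_subjects)

# One-sample tests against chance, one per decoding model
t_cue, p_cue = stats.ttest_1samp(auc_cue, 0.0)
t_resp, p_resp = stats.ttest_1samp(auc_resp, 0.0)

# Paired test on the difference between the two decoding models
t_diff, p_diff = stats.ttest_rel(auc_resp, auc_cue)
print(f"paired t = {t_diff:.2f}, p = {p_diff:.4f}")
```

Whether this paired t-test licenses population-level claims is exactly the issue discussed below.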

I think there are some problems with this, as described in your article. Moreover, I would be ignoring whether the derived patterns are significant at the first level.

Hebart, Martin N., and Chris I. Baker. "Deconstructing Multivariate Decoding for the Study of Brain Function." NeuroImage 180 (special issue: New Advances in Encoding and Decoding of Brain Signals, 15 October 2018): 4–18.

For a binary classification problem with accuracy as the dependent variable, you would probably recommend first-level permutations and second-level prevalence statistics, as implemented in TDT.
I was wondering what analysis procedure you would recommend in the described case.

Many thanks for your advice.

Kind regards,


Dear @Martin, please excuse my persistence.

Hi Basil,

The significance at the first level is not so much of a problem when you assume that the errors at the subject level can be ignored (which is what a random-effects test at the second level does). As you pointed out, there is a problem with running a t-test at the second level. There is currently no good solution to this problem. I can think of two ways of dealing with this:
Approach 1: Run prevalence inference at the second level, which lets you make claims about the population.
Apart from this taking a while to run with searchlight analyses, current prevalence tests have low statistical power: at least 50% of participants need to show an effect. The problem is that not showing an effect can have two causes: the participant truly has no effect (which is what the test assumes), or the data are too noisy to reveal one. You could use a manipulation check (e.g., an independent decoding analysis where you would expect an effect in every participant, such as ring-finger vs. middle-finger decoding). Say you find it significant in 80% of participants; you could then argue that a true prevalence of 50% corresponds to only 40% detectable effects (0.5 × 0.8) and use that as your threshold. However, this is statistically speaking not quite correct, since we are chaining several probabilities, some of which are unknown (and it ignores false positives and false negatives in the manipulation check). It would also be region-specific and wouldn't apply easily to searchlight analyses. So I'd rather stick to the default of 50% for now.
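
For readers unfamiliar with how such a prevalence test works, here is a simplified sketch following the minimum-statistic logic of Allefeld et al., for a single ROI/searchlight. The data are simulated placeholders, and the implementation in TDT differs in its details; this only illustrates the mechanics (second-level permutation of the minimum accuracy across subjects, then a prevalence bound).

```python
# Simplified minimum-statistic prevalence test (after Allefeld et al., 2016).
# `accs` and `perm_accs` are simulated stand-ins for one ROI's first-level
# results; the actual TDT implementation differs in details.
import numpy as np

rng = np.random.default_rng(1)
n_subjects, n_first_level_perms, n_second_level = 20, 100, 10000

# Simulated first-level results: permutation accuracies around chance (0.5),
# observed accuracies slightly above chance in most subjects
perm_accs = rng.normal(0.5, 0.05, size=(n_subjects, n_first_level_perms))
accs = rng.normal(0.58, 0.05, size=n_subjects)

# Observed minimum statistic across subjects
m_obs = accs.min()

# Second-level permutations: draw one first-level permutation per subject
idx = rng.integers(0, n_first_level_perms, size=(n_second_level, n_subjects))
m_perm = perm_accs[np.arange(n_subjects), idx].min(axis=1)

# Global-null p-value (count the neutral permutation as well)
p_gn = (1 + np.sum(m_perm >= m_obs)) / (1 + n_second_level)

# Largest prevalence bound gamma0 rejectable at level alpha
alpha = 0.05
p_u = p_gn ** (1 / n_subjects)  # implied per-subject uncorrected p-value
gamma0 = (alpha ** (1 / n_subjects) - p_u) / (1 - p_u)
print(f"global-null p = {p_gn:.4f}, prevalence bound gamma0 = {gamma0:.2f}")
```

A significant result then supports the claim that the effect is present in more than `gamma0` of the population, rather than merely that the population mean is above chance.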
Approach 2: Just run the second-level t-test and make clear that you are not making claims about the population.
This is fine as long as readers know that you cannot generalize to the population and that the analysis ends up being a fixed-effects test (with reference to, e.g., Allefeld et al. and perhaps our paper that you cited).

Hope this helps!

Hi @Martin

Many thanks for your comprehensive response. I have a follow-up question on approach 1. As far as I understood Allefeld et al., prevalence statistics are a better way to address questions usually addressed with a second-level one-sample t-test against chance. Is there something analogous to a paired-sample t-test or an ANOVA, so that you could say above-chance classification in area X is significantly more prevalent for the participants' responses (e.g., 80%) than for the stimulus acoustics (e.g., 70%)? Or would you just run second-level prevalence statistics for each classification (decoding of the stimulus acoustics and of the response) separately, and then compare the results at a given prevalence threshold (e.g., 70%) to contrast the two classifications?

Many thanks for your advice!

Hi Basil,

Ah yes, thanks for bringing this up. A couple of years ago I had the same thought and wanted to discuss it with Carsten, but then forgot.
Indeed, when you are not testing against chance, in theory you should be able to do random-effects (RFX) testing again, since truly negative accuracy differences are possible. I just don't know whether this works in practice. The idea behind RFX testing is that the variability within individual participants can be ignored. Carsten made a good point that it cannot be ignored, but the degree to which intraindividual variability dominates interindividual variability in these cases is unclear. I'd say it's OK to run an RFX test here if you think the assumptions of the test hold. I'm not sure how you would go about testing that; perhaps Carsten has simulated it.
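
One way to probe this in practice is a small simulation, in the spirit of what Carsten may have done. The sketch below (all parameters are illustrative assumptions) checks the false-positive rate of a paired RFX t-test on accuracy differences when both decoders have the same true accuracy in every subject, so the null of no difference holds but within-subject trial noise is present.

```python
# Sketch: empirical false-positive rate of an RFX t-test on accuracy
# differences under a true null, with binomial within-subject trial noise.
# All parameters are illustrative assumptions.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
n_subjects, n_trials, n_sims, alpha = 20, 100, 2000, 0.05

false_positives = 0
for _ in range(n_sims):
    # Per-subject true accuracy, identical for both decoders (null holds)
    true_acc = rng.uniform(0.55, 0.75, size=n_subjects)
    # Observed accuracies: binomial trial noise within each subject
    acc_a = rng.binomial(n_trials, true_acc) / n_trials
    acc_b = rng.binomial(n_trials, true_acc) / n_trials
    _, p = stats.ttest_rel(acc_a, acc_b)
    false_positives += p < alpha

fp_rate = false_positives / n_sims
print(f"empirical false-positive rate: {fp_rate:.3f}")  # should sit near alpha
```

Scenarios where intraindividual variability is asymmetric or dominates interindividual variability could then be explored by varying the noise model.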

@Kai, what’s your take on this?