I have been using the decoding toolbox to conduct ROI-based analyses within one group and opted for permutation testing as my statistical method of choice. My design per participant consists of 5 different runs with 2 betas each (one for each label) which allows for 252 permutations in total. Within each participant, I have used each possible permutation to generate a 5-step leave-one-run-out cross-validation, as automatically implemented by TDT. As a result, I receive 252 prediction accuracies per participant which I can average and from which I can, in turn, create a group distribution that can be tested against the original prediction accuracies obtained with the real labels. However, the mean prediction accuracies that result from my permutations are consistently negative and thus below chance. I was wondering whether you have any idea where this bias might stem from or any advice on how to improve my statistical testing or my general design?

Below-chance decoding accuracies typically indicate that something is wrong (although not always, sometimes they happen by chance - cross-validated null distributions have a long tail in the negative range). Below-chance accuracies can happen when there is non-stationarity in the data - normally not the case - or if there is another variable that correlates with the variable you are interested in and you don't control for it - normally the case. If you don't have below-chance accuracies in your original data to start off with, but only in the permutations, then that is really weird (it can happen but it indicates there already is something wrong with your data).

I have to admit I am not sure how you end up with 252 permutations. Even if you are permuting your labels between runs, I think there should only be (10 choose 2) possible permutations, i.e. 45. However, permuting between runs is not valid because betas from the same run are quite similar to each other, i.e. there is a dependence that would be broken by permuting across runs. That means you can only permute within run, which gives you 2^5 = 32 possible permutations. Since those are symmetric (you can exchange all labels completely, which gives you the same result) you actually only have 16 different permutations. If you want to run a real permutation test, that is unfortunately not enough (the best possible p-value is 1/16 = 0.0625 uncorrected). One solution for getting more permutations is to do a trial-wise analysis, but there the problem is that trials within a run are often not exchangeable - unless they are far apart in time. If they are, then that would be a solution: model your trials separately at the first level and then permute using TDT's permutation scheme. If they are not independent: that's where we currently have a problem with decoding analyses and permutation testing. Sigh. So in this case going back to good old second-level t-tests and ignoring some of the inferential statistical issues* is probably the best course of action.
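To make the counting concrete, here is a minimal Python sketch (not TDT code, and not how TDT builds its permutations) that enumerates the within-run label swaps for 5 runs with 2 betas each. The variable names are made up for illustration. The last line only notes that 252 happens to equal the number of ways to give one label to 5 of the 10 betas when labels are shuffled freely across runs, which is possibly where that figure comes from.

```python
from itertools import product
from math import comb

n_runs = 5  # 5 runs, 2 betas per run (one per label)

# Within each run we can either keep the two labels (0) or swap them (1).
flips = list(product([0, 1], repeat=n_runs))

# Swapping the labels in every run only renames the two classes, so a
# flip pattern and its complement are equivalent. Keep one of each pair.
unique = {min(f, tuple(1 - x for x in f)) for f in flips}

n_within = len(flips)    # all within-run permutations
n_unique = len(unique)   # distinct permutations after removing mirror images
best_p = 1 / n_unique    # smallest p-value an exhaustive test can return

# For comparison: unrestricted assignments of one label to 5 of 10 betas.
n_across = comb(10, 5)

print(n_within, n_unique, best_p, n_across)
```

The set of unique patterns includes the unpermuted labeling itself, which is why the smallest achievable p-value is one over the number of unique permutations.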

Now for running a permutation test, under most circumstances it is not valid to calculate the significance relative to some sort of "empirical baseline" on the same data, since (1) unless your classifier is broken it should give you something very close to chance and (2) the permutation test is a test against a null distribution, i.e. you can still be positively biased even when you correct for your "empirical chance" due to the shape of the null distribution. You want to test if your result is unlikely to come from the null distribution, not compare it to some empirical chance level. I know this is quite common, but I don't think it is valid. What I think you can do is test whether your results lead to higher accuracy than a region that shouldn't carry any information. This also weirdly may address some of the issues I point out below*.
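As a rough illustration of what testing against the null distribution means, here is a small Python sketch. It assumes an exhaustive set of permutations that includes the true labeling, and the accuracies are invented numbers, not real results. The function name is hypothetical.

```python
import numpy as np

def permutation_p_value(observed, null_values):
    """One-sided p-value: share of null accuracies at least as large as
    the observed one. Assumes null_values comes from an exhaustive set of
    permutations that contains the unpermuted labeling."""
    null_values = np.asarray(null_values, dtype=float)
    return float(np.mean(null_values >= observed))

# Invented accuracies (in %) for 16 unique permutations; the last entry
# stands in for the result obtained with the true labels.
null_acc = [50, 40, 60, 50, 30, 70, 50, 40, 60, 50, 45, 55, 50, 35, 65, 80]
observed_acc = 80

p = permutation_p_value(observed_acc, null_acc)
print(p)
```

Note that the observed accuracy is compared with the whole distribution, not with its mean. Subtracting the mean of the permuted accuracies from the observed value would be the "empirical chance" correction described above, which is a different thing.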

Hope that helps a bit!
Martin

*The problem is that true accuracies can never be below chance, whereas a group-level test against a null distribution with chance as its mean assumes that some true values fall above chance and some below. Since in reality no true value can ever be below chance, the null hypothesis can only hold if every subject is exactly at chance, which collapses the between-subject variance to 0, i.e. any variance you measure in your model cannot reflect the random effect but just the effect of some individual subjects (i.e. the fixed effect). In other words, finding a significant effect is like finding that at least one subject carries an effect, not that the distribution of your population means is different from chance. See this for a detailed explanation.