I have three experimental conditions with different numbers of classes (2, 3, and 4) and hence different chance levels (50%, 33.33%, 25%). What would be the correct approach to make accuracies comparable across these chance levels, assuming that above-chance accuracy scales with the chance level?
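Since multiclass classification is often built from multiple binary classifiers anyway, I would instead run all pairwise (two-class) comparisons in the 3- and 4-class conditions and then average the resulting accuracies. Every condition is then scored against the same 50% chance level (assuming balanced classes). Otherwise, a direct comparison is non-trivial.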
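A minimal sketch of the pairwise idea, in plain Python. Everything named here is an assumption for illustration: the nearest-centroid classifier is only a stand-in for whatever classifier you actually use, leave-one-out is one possible cross-validation scheme, and the toy data are made up. The point is just the structure: restrict the data to each pair of classes, get a binary accuracy, and average over pairs.

```python
from itertools import combinations
from statistics import mean


def nearest_centroid_predict(train_X, train_y, test_X):
    """Stand-in classifier (assumption): pick the class with the closest mean."""
    centroids = {}
    for label in sorted(set(train_y)):
        rows = [x for x, y in zip(train_X, train_y) if y == label]
        centroids[label] = [mean(col) for col in zip(*rows)]

    def sq_dist(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

    return [min(centroids, key=lambda c: sq_dist(x, centroids[c])) for x in test_X]


def leave_one_out_accuracy(X, y, predict=nearest_centroid_predict):
    """Hold out each sample once, train on the rest, count correct predictions."""
    hits = 0
    for i in range(len(X)):
        train_X = X[:i] + X[i + 1:]
        train_y = y[:i] + y[i + 1:]
        hits += int(predict(train_X, train_y, [X[i]])[0] == y[i])
    return hits / len(X)


def mean_pairwise_accuracy(X, y, predict=nearest_centroid_predict):
    """Binary accuracy for every pair of classes, plus the average over pairs."""
    per_pair = {}
    for a, b in combinations(sorted(set(y)), 2):
        # Keep only the samples belonging to the two classes in this pair.
        idx = [i for i, label in enumerate(y) if label in (a, b)]
        per_pair[(a, b)] = leave_one_out_accuracy(
            [X[i] for i in idx], [y[i] for i in idx], predict
        )
    return mean(list(per_pair.values())), per_pair


# Made-up, well-separated 3-class toy data (3 samples per class).
X = [[0.0, 0.1], [0.2, 0.0], [0.1, 0.2],
     [5.0, 5.1], [5.2, 4.9], [4.9, 5.0],
     [10.0, 0.1], [10.2, 0.0], [9.9, 0.2]]
y = [0, 0, 0, 1, 1, 1, 2, 2, 2]

overall, per_pair = mean_pairwise_accuracy(X, y)
print(overall, per_pair)
```

For the 2-class condition there is only one pair, so the function returns the ordinary binary accuracy; for 3 and 4 classes it averages over 3 and 6 pairs respectively. The 50% reference only holds if the two classes in each pair have equal numbers of samples.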