Nilearn Predict Probability Multiclass

Hi folks,

I am stuck on a multi-class classification problem using nilearn.
My fMRI data has 5 classes, which I would like to decode via nilearn.

The first question I have is: Does the off-the-shelf Decoder actually perform multi-class classification when presented with multiple labels? I ask this because the Decoder yields markedly higher per-class accuracies compared to the “OneVsRest” sklearn pipeline using the same data and estimator setup.

Second, why does the standard Decoder not include “predict_proba”? So far, I can only get predicted probabilities for each class when I build a custom sklearn pipeline where the classes do contain “predict_proba”, but again these yield much, much lower accuracies. Does anyone know a way to get probabilistic scores after fitting the standard Decoder?

Many thanks in advance,

Indeed it is advised to resort to sklearn for most advanced usage. Nilearn’s Decoder is simply meant to make things easier for beginners. In particular, Nilearn’s Decoder does not have a predict_proba method.
Am I surprised that you get strongly different btw a Decoder object (that basically wraps sklearn’s function) and sklearn’s object (I’d rather expect sklearn’s method to outperform). If there is a way to share the data/algorithm or to reproduce the behavior on a public dataset, that would be great…

Hi Bertrand,

absolutely. I put together a minimum working example by tweaking the Haxby multiclass tutorial.

Please see my questions (#?#) in the code…

### Imports
from nilearn import datasets
import numpy as np
import pandas as pd
import seaborn as sns
from nilearn.image import load_img, index_img
from nilearn.decoding import Decoder
from sklearn.model_selection import LeaveOneGroupOut
from nilearn.input_data import NiftiMasker
from sklearn.svm import SVC
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import Pipeline
from sklearn.metrics import confusion_matrix
from joblib import Parallel, delayed

### Load and prep haxby data
haxby_dataset = datasets.fetch_haxby()

func_filename = haxby_dataset.func[0]
mask_filename = haxby_dataset.mask

labels = pd.read_csv(haxby_dataset.session_target[0], sep=" ")
y = labels['labels']
session = labels['chunks']

non_rest = (y != 'rest')
y = y[non_rest]
session = session[non_rest]

func_file = index_img(load_img(func_filename), non_rest)

unique_conditions, order = np.unique(y, return_index=True)
unique_conditions = unique_conditions[np.argsort(order)]

### Nilearn Decoder
cv = LeaveOneGroupOut()
decoder = Decoder(estimator='svc', mask=mask_filename, standardize=True, cv=cv, n_jobs=-1, verbose=3), y, groups=session)

#?# Does the above (decoder.cv_scores_) actually compute MULTICLASS (OvR) scores? 
# Visualize results 
decoder_scores = pd.DataFrame(decoder.cv_scores_)
decoder_scores['Average'] = decoder_scores.mean(axis=1)
decoder_scores = decoder_scores.reset_index(drop=True)
sns.barplot(x='variable', y='value', data=decoder_scores.melt())

### Sklearn
nifti_masker = NiftiMasker(mask_img=mask_filename, standardize=True)
X = nifti_masker.fit_transform(func_filename)

# Remove the "rest" condition
X = X[non_rest]
session = session[non_rest]

svc_ova = OneVsRestClassifier(Pipeline([
    ('svc', SVC(kernel='linear'))

### Function to fit and predict on fold with Sklearn Model, returns multiclass accuracies

def cv_sklearn(train, test):
    X_train, y_train = X[train], y.iloc[train]
    print(train, test), y_train)
    X_test, y_test = X[test], y.iloc[test]
    y_pred = svc_ova.predict(X_test)
    cm = confusion_matrix(y_test, y_pred, labels=unique_conditions, normalize='true').diagonal()
    return cm

cms = Parallel(n_jobs=12)(delayed(cv_sklearn)(train, test) for (train, test) in cv.split(X,y,session))

### Visualize Sklearn Multiclass scores
sklearn_accs = pd.DataFrame(cms)
sklearn_accs.columns = unique_conditions
sklearn_accs['Average'] = sklearn_accs.mean(axis=1)
sklearn_accs = sklearn_accs.reset_index(drop=True)
sns.barplot(x='variable', y='value', data=sklearn_accs.melt())

#?# Why are these scores much lower than those from the Decoder object? 

### Likewise:
cv_scores_ova = cross_val_score(svc_ova, X, y, session, cv=cv, verbose=3, n_jobs=-1)
print('OvA:', cv_scores_ova.mean())

#?# Why is this value much lower than the "average" from the Decoder?

Thanks much for any help here. I mainly want to know which pipeline to use/trust when doing multiclass prediction.

If I’m not mistaken, the decoder uses roc_auc as a metric, while the sklearn pipeline returns prediction accuracy IIUC.
This might explain differences in the values reported.

Thanks for your swift response! Would it be possible to get a minimum-working example that directly replicates the decoder object’s prediction with an sklearn pipeline? I think this would add a lot of educational value for both beginners and more advanced users. Even when changing the scoring method, I still cannot reproduce the results from the Decoder object.

1 Like

Something like (with code taken from python - Calculate sklearn.roc_auc_score for multi-class - Stack Overflow):

from sklearn.metrics import roc_auc_score

def roc_auc_score_multiclass(actual_class, pred_class, average = "macro"):

  #creating a set of all the unique classes using the actual class list
  unique_class = set(actual_class)
  roc_auc_dict = {}
  for per_class in unique_class:
    #creating a list of all the classes except the current class 
    other_class = [x for x in unique_class if x != per_class]

    #marking the current class as 1 and all other classes as 0
    new_actual_class = [0 if x in other_class else 1 for x in actual_class]
    new_pred_class = [0 if x in other_class else 1 for x in pred_class]

    #using the sklearn metrics method to calculate the roc_auc_score
    roc_auc = roc_auc_score(new_actual_class, new_pred_class, average = average)
    roc_auc_dict[per_class] = roc_auc

  return roc_auc_dict

def cv_sklearn(train, test):
    X_train, y_train = X[train], y.iloc[train]
    print(train, test), y_train)
    X_test, y_test = X[test], y.iloc[test]
    y_pred = svc_ova.predict(X_test)
    return roc_auc_score_multiclass(y_test, y_pred, average="micro")


1 Like

Excellent, this definitely made the process more transparent.
For the sake of reliability, this still does not replicate what the Decoder produces (almost 6% in average classification accuracy difference):

bottle 0.482069
cat 0.821576
chair 0.738830
face 0.703410
house 0.784098
scissors 0.566285
scrambledpix 0.714580
shoe 0.606261
Average 0.677138

bottle 0.558862
cat 0.708333
chair 0.563492
face 0.611772
house 0.794974
scissors 0.554233
scrambledpix 0.555556
shoe 0.573413
Average 0.615079

In the description of the Decoder class it is stated that “It implements a model selection scheme that averages the best models within a cross validation loop. The resulting average model is the one used as a classifier.”

I guess the remaining difference can be attributed to this “model selection” process? Could you perhaps make this part explicit as well? Otherwise, I fear using the opaque Decoder class will be difficult to openly replicate.

Many thanks in advance.

1 Like

“Model selection” is a bit misleading, as the procedure “selects and averages” models, in a process analogous to bagging. This brings robustness, and this seems to pay off.
You mean that it should be better documented, right ? If yes, can you open an issue on Nilearn, with documentation tag ?