Leakage effects when using tangent space representations for predictive modelling?

Benedikt_Sundermann · November 20, 2020, 10:13am

Dear all,

Tangent space representations, as an alternative to correlations, seem to offer more powerful features for individual-level predictive models based on functional connectomes (see https://doi.org/10.1016/j.neuroimage.2020.116604 and https://hal.inria.fr/hal-01824205v1).

As far as I understand, the calculation of tangent space representations (for example in nilearn.connectome.ConnectivityMeasure) relies in part on group information. Is this correct? A previous topic ("Difference between Nilearn ConnectivityMeasure fit_transform & transform") suggested a technical solution: within each cross-validation fold, fit the tangent space on the training data only and then transform the test data. In general, leakage of any information from the test data into the training data and the models must be avoided at all costs, even if there is no obvious association with the prediction target. However, the potential for leakage when connectome features are estimated with tangent space representations before cross-validation has received little attention. (For functional connectivity based on individual correlations, estimating the features before cross-validation would be straightforward, because each subject's features depend only on that subject's data.)
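To make the concern concrete, here is a minimal NumPy sketch of where group information enters. It assumes a simplified tangent space in which the group reference is the arithmetic mean of the covariance matrices; nilearn's implementation may use a different reference (for example a geometric mean), and the helper names `fit_reference` and `tangent_transform` are hypothetical, not nilearn functions. The data are random and serve only to show that the test subjects' features change depending on whether the reference was fit with or without them.

```python
import numpy as np


def _sym_apply(mat, func):
    """Apply a scalar function to the eigenvalues of a symmetric matrix."""
    vals, vecs = np.linalg.eigh(mat)
    return (vecs * func(vals)) @ vecs.T


def fit_reference(covariances):
    """Hypothetical helper: group reference matrix.

    Simplifying assumption: arithmetic mean of the covariances.
    """
    return np.mean(covariances, axis=0)


def tangent_transform(covariances, reference):
    """Hypothetical helper: whiten each covariance by the reference,
    then take the matrix logarithm."""
    whitening = _sym_apply(reference, lambda v: 1.0 / np.sqrt(v))
    return np.array(
        [_sym_apply(whitening @ c @ whitening, np.log) for c in covariances]
    )


# Random stand-in data: 20 subjects, 100 time points, 5 regions.
rng = np.random.default_rng(0)
time_series = rng.standard_normal((20, 100, 5))
covariances = np.array([np.cov(ts, rowvar=False) for ts in time_series])

train, test = np.arange(15), np.arange(15, 20)

# Reference fit on the training subjects only (no leakage).
ref_train = fit_reference(covariances[train])
# Reference fit on all subjects (test subjects contribute to it).
ref_all = fit_reference(covariances)

test_clean = tangent_transform(covariances[test], ref_train)
test_leaky = tangent_transform(covariances[test], ref_all)

# The same test subjects get different features in the two settings.
max_diff = np.abs(test_clean - test_leaky).max()
print("Largest difference in test features:", max_diff)
```

How much such a difference matters for prediction accuracy is an empirical question that this sketch does not answer; it only shows that the features of held-out subjects are not independent of how the reference was fit.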

Is there any consensus on whether tangent space representations should be calculated separately within each cross-validation fold, or whether the risk of leakage is negligible when they are calculated on the entire dataset?

Kind regards,
Benedikt

GaelVaroquaux · November 20, 2020, 10:37am

It is clear that the group model of the tangent space must be fit on the training set only. Fitting it on the test set as well risks creating leakage.

jeromedockes · November 20, 2020, 12:08pm

See, for example, https://nilearn.github.io/auto_examples/07_advanced/plot_age_group_prediction_cross_val.html#sphx-glr-auto-examples-07-advanced-plot-age-group-prediction-cross-val-py