Potential dependency between training and testing datasets for machine learning

Hi all,

I hope someone has run into this problem before.

I want to check whether there is a dependency problem between the training and testing datasets if I account for total intracranial volume (TIV) across the whole dataset before splitting it into training and testing sets.

For example, if you have a total sample of 500 participants with voxel-based morphometry data and you want to account for TIV in all of them before doing any training/testing (i.e., learning/validation), would it be okay to regress out TIV across all 500 subjects at once?

Or, would the regression to remove TIV need to be done separately within the training and testing sets?

Cheers, E

Hi Eunice,

I’m not familiar with voxel-based morphometry pipelines, but I’d imagine what you’re effectively doing is within-subject normalization, either z-scoring or simply rescaling to values between 0 and 1 (or -1 and 1). If that’s the case, each subject’s normalization uses only that subject’s own data, so no information bleeds over between the training and testing sets and you’re definitely okay.
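To make that concrete, here’s a minimal sketch of what I mean by within-subject normalization, using made-up array shapes (500 subjects × 1000 voxels; the data are synthetic, not real VBM output). Every statistic is computed per row, so nothing crosses subjects, let alone splits:

```python
import numpy as np

rng = np.random.default_rng(0)
gm = rng.normal(size=(500, 1000))  # hypothetical: 500 subjects x 1000 voxels

# Within-subject z-scoring: each row is standardized using only that
# subject's own mean and SD, so no statistic is shared across subjects
# (and therefore none is shared between training and testing sets).
row_mean = gm.mean(axis=1, keepdims=True)
row_sd = gm.std(axis=1, keepdims=True)
gm_z = (gm - row_mean) / row_sd
```

Because the transform is a fixed function of each input row, applying it before or after the train/test split gives identical results.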

If, on the other hand, the step uses information from the whole sample (e.g., regression coefficients estimated across all 500 subjects), it’s not necessarily the case that you’re poisoning the analysis, but you may be reducing its external validity: you’re normalizing to a sample statistic that would vary with a new sample, rather than applying a fixed function of each input. In that case I’d perform the regression/normalization step independently in the training and testing sets, so at least you’re getting an honest measure of the external validity.
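Here’s a rough sketch of that suggestion with plain numpy OLS, again on synthetic data (the sizes, the 400/100 split, and the `residualize` helper are all hypothetical, just for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: 500 subjects, 1000 voxels, one TIV value each,
# with a small linear TIV effect baked in.
tiv = rng.normal(1500.0, 150.0, size=500)
gm = rng.normal(size=(500, 1000)) + 0.01 * tiv[:, None]

gm_train, gm_test = gm[:400], gm[400:]
tiv_train, tiv_test = tiv[:400], tiv[400:]

def residualize(vox, tiv_col):
    """Regress TIV (plus an intercept) out of every voxel column via OLS."""
    X = np.column_stack([np.ones_like(tiv_col), tiv_col])
    beta, *_ = np.linalg.lstsq(X, vox, rcond=None)
    return vox - X @ beta

# Fit and remove the TIV effect independently in each split, so neither
# split's sample statistics enter the other's preprocessing.
resid_train = residualize(gm_train, tiv_train)
resid_test = residualize(gm_test, tiv_test)
```

A common alternative, for what it’s worth, is to estimate `beta` on the training set only and reuse those coefficients to residualize the test set; that also keeps test-set information out of training while using a single model of the TIV effect.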

Just some quick thoughts. Hopefully somebody more familiar with the specifics of VBM analyses can chime in.