Recently, a few papers came out on ComBat, a tool for removing site effects from multi-site datasets. According to these papers, it does a pretty good job of removing unwanted scanner artifacts from modalities such as DTI, functional connectivity (ref), and cortical thickness measurements (ref).
In short, ComBat is an extension of a linear regression model that uses empirical Bayes to improve the estimation of the site parameters. I wanted to give this a try myself, and luckily ComBat code is available for MATLAB, R, and Python.
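For reference, this is the model as I understand it from the ComBat papers. The notation is my own choice: $y_{ijv}$ is feature $v$ for subject $j$ at site $i$, $X_{ij}$ holds the covariates to preserve (e.g. age, sex, diagnosis), $\gamma_{iv}$ is the additive site effect and $\delta_{iv}$ the multiplicative one.

$$
y_{ijv} = \alpha_v + X_{ij}^{\top}\beta_v + \gamma_{iv} + \delta_{iv}\,\varepsilon_{ijv},
\qquad \varepsilon_{ijv} \sim \mathcal{N}(0, \sigma_v^2)
$$

$$
y_{ijv}^{\text{ComBat}} = \frac{y_{ijv} - \hat{\alpha}_v - X_{ij}^{\top}\hat{\beta}_v - \gamma_{iv}^{*}}{\delta_{iv}^{*}} + \hat{\alpha}_v + X_{ij}^{\top}\hat{\beta}_v
$$

Here $\gamma_{iv}^{*}$ and $\delta_{iv}^{*}$ are the empirical Bayes estimates of the site parameters, i.e. the per-site estimates shrunk towards a prior pooled across features.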
Now I want to try out ComBat for my classification study, in which I try to separate patients from controls using cortical features of >4000 subjects from 46 unique sites around the world. My model optimization, training, and testing are performed in separate (inner and outer) cross-validation loops. The problem, however, is that ComBat seems to be a "one-shot" approach, in the sense that it is run only once on the entire dataset. What I would typically do instead is fit the harmonization model's parameters on the training data only and then apply them to both the training and the test data.
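To make concrete what I mean by "fit on training data, apply to both", here is a minimal sketch. It is not ComBat: it is a plain per-site location/scale adjustment with no empirical Bayes shrinkage and no covariate preservation, and the class name `SiteLocationScale` and the synthetic data are made up purely for illustration. It only shows the fit/transform split I am after.

```python
import numpy as np


class SiteLocationScale:
    """Simplified stand-in for a harmonizer (NOT ComBat): per-site mean/std
    adjustment, without empirical Bayes and without covariates."""

    def fit(self, X, sites):
        X = np.asarray(X, dtype=float)
        sites = np.asarray(sites)
        self.grand_mean_ = X.mean(axis=0)
        self.site_mean_, self.site_std_ = {}, {}
        resid = np.empty_like(X)
        for s in np.unique(sites):
            mask = sites == s
            self.site_mean_[s] = X[mask].mean(axis=0)
            # Needs at least 2 training subjects per site, otherwise NaN.
            self.site_std_[s] = X[mask].std(axis=0, ddof=1)
            resid[mask] = X[mask] - self.site_mean_[s]
        # Common scale that every site is mapped onto.
        self.pooled_std_ = resid.std(axis=0, ddof=1)
        return self

    def transform(self, X, sites):
        X = np.asarray(X, dtype=float)
        sites = np.asarray(sites)
        out = np.empty_like(X)
        for s in np.unique(sites):
            if s not in self.site_mean_:
                raise ValueError(f"site {s!r} was not seen during fit")
            mask = sites == s
            z = (X[mask] - self.site_mean_[s]) / self.site_std_[s]
            out[mask] = z * self.pooled_std_ + self.grand_mean_
        return out


# Synthetic example: 4 sites, 50 subjects each, 3 features with site offsets.
rng = np.random.default_rng(0)
n_sites, n_per_site, n_feat = 4, 50, 3
sites = np.repeat(np.arange(n_sites), n_per_site)
offsets = rng.normal(0, 5, size=(n_sites, n_feat))
X = rng.normal(size=(sites.size, n_feat)) + offsets[sites]

# One train/test split; in practice this would sit inside each CV fold.
idx = rng.permutation(sites.size)
train_idx, test_idx = idx[:150], idx[150:]

harm = SiteLocationScale().fit(X[train_idx], sites[train_idx])  # train only
X_train_h = harm.transform(X[train_idx], sites[train_idx])
X_test_h = harm.transform(X[test_idx], sites[test_idx])  # reuses train parameters
```

Two limitations of this sketch are worth stating: a test subject from a site that was absent from the training fold cannot be transformed (so it would not work for leave-site-out validation), and because no covariates are modeled, a site whose patient/control ratio differs from the others could lose real group differences along with the site effect.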
Is anyone more familiar with this kind of harmonization technique, or does anyone have recommendations on how to get rid of site-specific effects in multi-site datasets in a cross-validated manner?