Nilearn vs. SPM very different results

Couple of points:

  • to make things more comparable, maybe use the same preprocessing (fMRIPrep) for both pipelines, i.e. the strategy you used for dataset 2. As far as I know, there are some preprocessing steps in fMRIPrep that SPM cannot even do.
  • along the lines of what @psadil said: I am not sure how you are comparing results, but I would start by comparing unthresholded group statistic maps with Bland–Altman plots, as they did in the earlier of the 2 references I mentioned above.
    I think the code they used to create those plots can be found here:
    https://github.com/AlexBowring/Software_Comparison/blob/master/figures/ds001_notebook.ipynb
  • try to base your pipeline selection on datasets or results that are independent of the results you are actually asking questions about, otherwise you are entering “p-hacking” territory (using the results of your analysis to decide which analysis you run). Maybe you can do that on some “positive control” condition (for example: button presses activate motor cortex).
  • note that the differences between software packages are in themselves interesting and may be worth reporting, as they speak to the computational robustness (same data, different methods) of a given result (though it does not make for easy storytelling)
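
To make the Bland–Altman suggestion a bit more concrete, here is a minimal sketch of the computation. It is not the code from the notebook linked above. It assumes you have already extracted the voxel values of the two unthresholded group maps within one shared mask, so both lists cover the same voxels in the same order. The function name `bland_altman` and the toy numbers are made up for illustration.

```python
from statistics import mean, stdev


def bland_altman(values_a, values_b):
    """Compute Bland-Altman quantities for two maps sampled at the same voxels.

    values_a, values_b: sequences of statistic values (e.g. t-values), one per
    voxel, in the same voxel order. How you extract them (shared mask,
    resampling to a common grid) is up to you and is assumed to be done already.
    """
    if len(values_a) != len(values_b):
        raise ValueError("both maps must contain the same voxels")

    # x-axis of the plot: per-voxel average of the two packages
    means = [(a + b) / 2 for a, b in zip(values_a, values_b)]
    # y-axis of the plot: per-voxel difference between the two packages
    diffs = [a - b for a, b in zip(values_a, values_b)]

    bias = mean(diffs)              # systematic offset between packages
    spread = 1.96 * stdev(diffs)    # half-width of the 95% limits of agreement
    return {
        "means": means,
        "diffs": diffs,
        "bias": bias,
        "lower": bias - spread,
        "upper": bias + spread,
    }


# Toy values standing in for real voxel data (hypothetical numbers).
nilearn_t = [2.1, 0.4, -1.3, 3.8, 0.0, 1.7]
spm_t = [1.9, 0.6, -1.0, 3.1, 0.2, 1.5]

result = bland_altman(nilearn_t, spm_t)
print("bias:", result["bias"])
print("limits of agreement:", result["lower"], result["upper"])
```

Plotting `means` against `diffs` as a scatter (or a 2D histogram, given the number of voxels) with horizontal lines at `bias`, `lower` and `upper` gives the usual Bland–Altman figure. A bias far from zero, or a difference that grows with the mean, points to a systematic disagreement between the two packages rather than just noise.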