A couple of points:
- To make things more comparable, maybe use the same preprocessing (fMRIPrep), i.e. the strategy you used for dataset 2. As far as I know, there are some preprocessing steps in fMRIPrep that SPM cannot even do.
- Along the same lines as @psadil: not sure how you are comparing results, but I would start by comparing unthresholded group statistic maps with Bland–Altman plots, as they did in the earlier of the 2 references I mentioned above.
I think the code they used to create those plots can be found here:
https://github.com/AlexBowring/Software_Comparison/blob/master/figures/ds001_notebook.ipynb
- Try to base your pipeline selection on datasets or results that are independent of the results you are actually asking questions about; otherwise you are entering “p-hacking” territory (using the results of your analysis to decide which analysis you run). Maybe you can do that on some “positive control” condition (for example: button presses activate motor cortex).
- Note that the differences between software packages are in themselves interesting and may be worth reporting, as they speak to the computational robustness (same data, different methods) of a given result (though it does not make for easy storytelling).
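
In case it helps with the Bland–Altman suggestion, here is a rough sketch of the voxelwise quantities behind such a plot: the mean and the difference of two unthresholded statistic maps, plus the bias and 95% limits of agreement. This is not the code from the notebook linked above. The function name `bland_altman`, the variable names, and the synthetic arrays are all made up for illustration, and it assumes both maps are already on the same voxel grid.

```python
import numpy as np


def bland_altman(map_a, map_b, mask=None):
    """Voxelwise Bland-Altman quantities for two statistic maps.

    Hypothetical helper: assumes both maps are sampled on the same grid.
    Returns the per-voxel mean, per-voxel difference (a - b), the bias
    (mean difference) and the 95% limits of agreement.
    """
    a = np.asarray(map_a, dtype=float).ravel()
    b = np.asarray(map_b, dtype=float).ravel()
    if a.shape != b.shape:
        raise ValueError("maps must have the same number of voxels")

    # Keep only voxels that are finite in both maps (and inside the mask).
    keep = np.isfinite(a) & np.isfinite(b)
    if mask is not None:
        keep &= np.asarray(mask, dtype=bool).ravel()
    a, b = a[keep], b[keep]

    mean = (a + b) / 2.0
    diff = a - b
    bias = diff.mean()
    sd = diff.std(ddof=1)
    return mean, diff, bias, (bias - 1.96 * sd, bias + 1.96 * sd)


# Synthetic stand-ins for two group-level T maps (NOT real data):
# the second map is the first plus a constant offset and a little noise.
rng = np.random.default_rng(0)
t_map_a = rng.normal(size=(4, 4, 4))
t_map_b = t_map_a + 0.5 + rng.normal(scale=0.1, size=t_map_a.shape)

mean, diff, bias, (lower, upper) = bland_altman(t_map_a, t_map_b)
print(f"bias = {bias:.3f}, limits of agreement = [{lower:.3f}, {upper:.3f}]")
```

With real maps you would load the arrays with something like nibabel (`nib.load(path).get_fdata()`) and then plot `diff` against `mean`, for example as a 2D histogram with matplotlib's `hexbin`, since a scatter plot of every voxel gets unreadable quickly.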