This feels like a good use case for datalad. It would be a lot for me to give the background and argument on why you want to use it, so I’ll just point you to the handbook (please ask questions on its GitHub issues page).
BIDS has the notion of a sourcedata/ subdirectory where you can store your original dataset all at once. If the main thing you’re doing with large files is renaming them (i.e., you don’t need to modify headers, so the contents of the majority of your large files will be unchanged), then datalad has some convenient deduplication features.
So you can start by creating a dataset. Here is a basic approach:
datalad create --description "BIDS reformatting of legacy project" \
-c text2git /path/to/dataset
cd /path/to/dataset
mkdir sourcedata
datalad run rsync -avP /path/to/original_dataset/. sourcedata/.
This will copy all of your data into the dataset. Large, non-text files will be annexed (see documentation for more details), while text files will be kept as-is and committed directly to the dataset's Git history.
If you then store whatever scripts you’re using to convert your source data to BIDS in code/, that will be a kind of provenance. At its simplest, you could have a big renaming script that explicitly shows the mapping:
#!/bin/bash
# Create the target directory, then copy the source file to its BIDS name
mkdir -p sub-01/anat
cp sourcedata/some/image.nii.gz sub-01/anat/sub-01_T1w.nii.gz
...
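If the list of files gets long, one option is to keep the mapping in a small table and loop over it instead of writing one cp line per file. This is only a sketch: the table file code/mapping.tsv, its two-column tab-separated layout, and the function name apply_mapping are all made up for illustration and are not part of datalad or BIDS.

```shell
#!/bin/bash
# Sketch: copy each source file to its BIDS name, as listed in a mapping table.
# code/mapping.tsv is a hypothetical file: one "<source><TAB><target>" per line.

apply_mapping() {
    tab="$(printf '\t')"
    while IFS="$tab" read -r src dest || [ -n "$src" ]; do
        # Skip blank lines and lines starting with '#'
        case "$src" in
            ''|'#'*) continue ;;
        esac
        # Create the target directory if needed, then copy the file
        mkdir -p "$(dirname "$dest")" || return 1
        cp "$src" "$dest" || return 1
    done < "$1"
}

# Apply the table if it exists; the default path is an assumption.
mapping="${1:-code/mapping.tsv}"
if [ -f "$mapping" ]; then
    apply_mapping "$mapping"
else
    echo "mapping file not found: $mapping" >&2
fi
```

The table then doubles as a human-readable record of which legacy file became which BIDS file.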
If you run that script with datalad run, then the resulting file will end up being a symlink pointing to the exact same content, so you won’t have a second copy of each file floating around. (It may use some extra disk space in the meantime.) There might be some datalad magic to do this without copying the file contents, making it an extremely quick operation, but I don’t know it off the top of my head.
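If you want to check afterwards that a renamed file really shares its content with the original, you can compare where the two paths resolve. A rough sketch, assuming both files are annexed (and therefore symlinks) and that readlink -f is available, as it is with GNU coreutils; same_content_target is a name invented for this example, and the paths are the ones from the renaming script above:

```shell
#!/bin/bash
# Sketch: succeed when both paths resolve to the same underlying file.
same_content_target() {
    [ "$(readlink -f "$1")" = "$(readlink -f "$2")" ]
}

# Example paths from the renaming script; skipped if they do not exist.
src=sourcedata/some/image.nii.gz
dest=sub-01/anat/sub-01_T1w.nii.gz
if [ -e "$src" ] && [ -e "$dest" ]; then
    if same_content_target "$src" "$dest"; then
        echo "deduplicated: $dest"
    else
        echo "separate copy: $dest"
    fi
fi
```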
And even if you don’t go down the datalad route, there’s no reason you can’t take the same sourcedata/ and code/ approach to preserving the information in the legacy dataset. It will just be larger.