BIDS - preserving legacy provenance

Bennett · September 12, 2019, 2:05am

Good evening,

We are starting to integrate BIDS across our legacy projects. One issue that comes up is BIDS file (re)naming. I am worried about renaming files on download as we move between databases, and about the potential for user error and confusion. What is the recommended procedure for linking original (pre-BIDS) file names within BIDS-compliant structures? If at all possible, I would like to keep this information in human-readable text, as “physically” close to the imaging data as possible.

Thanks,
Bennett

effigies · September 12, 2019, 4:37pm

This feels like a good use case for datalad. It would be a lot for me to give the background and argument on why you want to use it, so I’ll just point you to the handbook (please ask questions on its GitHub issues page).

BIDS has the notion of a sourcedata/ subdirectory where you can store your original dataset all at once. If the main thing you’re doing with large files is renaming them (i.e., you don’t need to modify headers, so the contents of the majority of your large files will be unchanged), then datalad has some convenient deduplication features.

So you can start by creating a dataset. Here is a basic approach:

datalad create --description "BIDS reformatting of legacy project" \
    -c text2git /path/to/dataset
cd /path/to/dataset
mkdir sourcedata
datalad run rsync -avP /path/to/original_dataset/. sourcedata/.

This will copy all of your data into the repository. Non-text files (typically the large ones) will be annexed (see documentation for more details), while text files will be preserved as-is and committed directly to the history of the dataset.

If you then store whatever scripts you’re using to convert your source data to BIDS in code/, that will be a kind of provenance. At its simplest, you could have a big renaming script that explicitly shows the mapping:

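#!/bin/bash
# One cp line per file; the script itself documents the source-to-BIDS mapping.
mkdir -p sub-01/anat
cp sourcedata/some/image.nii.gz sub-01/anat/sub-01_T1w.nii.gz
...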

If you run that script with datalad run, then the resulting file will end up being a symlink pointing to the exact same content, so you won’t have a second copy of each file floating around. (It may use some extra disk space in the meantime.) There might be some datalad magic to do this without copying the file contents, making it an extremely quick operation, but I don’t know it off the top of my head.

And even if you don’t go down the datalad route, there’s no reason you can’t take the same sourcedata/ and code/ approach to preserving the information in the legacy dataset. It will just be larger.

Bennett · September 12, 2019, 5:04pm

Datalad looks interesting, especially for one-time renaming. Unfortunately, we have a large user base who will often retrieve subsets (or supersets) of data across projects. Also, we need a solution that does not generate online resources, for a variety of confidentiality, data use agreement, and PHI restriction reasons.

Has anyone defined a protocol for storing the renaming structure in sourcedata? I need every copy of the data to carry the history of where it came from, even if the datasets are downloaded multiple times by many different individuals.

effigies · September 12, 2019, 5:44pm

Datalad does not require you to publish any data anywhere in particular; it’s just an organizational scheme. Everything can be done on a single machine, if desired, or on a LAN with a private git server, or with any other distribution setup. If you did want to make the dataset public for simpler distribution, any sensitive data could be placed in the annex, so only its metadata would be public. The remotes that contain the actual private data can require authentication, so that unauthorized downloaders would only see the public metadata and have no access to the private data.

No, sourcedata is completely unspecified, and it kind of needs to be in order to avoid making people leave out data they think is important. You’re free to use whatever technique you want. A simple option would just be a TSV where the first column is the BIDS location and the second is the sourcedata location, which would make lookup very easy.
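As a rough sketch of the TSV idea (not anything BIDS prescribes), the mapping could be written and queried with plain shell tools. The file name sourcedata/bids_to_source.tsv, the column headers, and the helper functions source_of and bids_of are all made up for illustration; the example row reuses the paths from the renaming script earlier in the thread.

```shell
#!/bin/bash
# Sketch only: a two-column TSV that maps each BIDS path to the sourcedata
# path it was copied from. File name, headers, and helpers are hypothetical.
set -euo pipefail

# Assumed location; BIDS leaves the contents of sourcedata/ unspecified.
MAPPING="sourcedata/bids_to_source.tsv"

mkdir -p sourcedata

# Header row, then one tab-separated row per renamed file.
printf 'bids_path\tsource_path\n' > "$MAPPING"
printf '%s\t%s\n' "sub-01/anat/sub-01_T1w.nii.gz" "sourcedata/some/image.nii.gz" >> "$MAPPING"

# Print the original location of a BIDS file (exact match on column 1).
source_of() {
    awk -F '\t' -v key="$1" 'NR > 1 && $1 == key { print $2 }' "$MAPPING"
}

# Print the BIDS location of an original file (exact match on column 2).
bids_of() {
    awk -F '\t' -v key="$1" 'NR > 1 && $2 == key { print $1 }' "$MAPPING"
}

source_of "sub-01/anat/sub-01_T1w.nii.gz"
bids_of "sourcedata/some/image.nii.gz"
```

Because the file is plain text and lives inside the dataset, it travels with every copy, and anyone can read it without special tooling.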

I’m not really sure how multiple downloads affect provenance. Could you elaborate?

adina · November 4, 2019, 11:43am

I’m seconding @effigies on the recommendation of Datalad. This tutorial may contain useful insights for you, @Bennett: http://www.repronim.org/ohbm2018-training/03-01-reproin/ (starting from heading “Modular Study Components”)