Replicable scripts, BIDS, and curating data

I’m currently analyzing a dataset, and I would like each step of the analysis to be completely automated, so that I could publish the dataset and have the analysis replicated exactly. (I’m not editing FreeSurfer surfaces, so I shouldn’t actually need much interaction with the pipeline.)

The BIDS framework and its ecosystem of apps make that pretty simple: I run heudiconv on a list of subjects to get BIDS directories, run MRIQC and fMRIPrep to check data quality and do preprocessing, and then have a nipype modeling script, et cetera.
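
Concretely, the pipeline looks something like this (a minimal sketch; the DICOM template, subject label, heuristic filename, and paths are placeholders for illustration):

#!/bin/bash
# Hypothetical end-to-end sketch; all paths and labels are placeholders.
BIDS=/data/bids

# 1. Convert DICOMs into a BIDS directory with heudiconv
heudiconv -d '/data/dicom/{subject}/*/*.dcm' -s 01 \
  -f heuristic.py -c dcm2niix -b -o "$BIDS"

# 2. Quality control with MRIQC
mriqc "$BIDS" "$BIDS/derivatives/mriqc" participant --participant-label 01

# 3. Preprocessing with fMRIPrep
fmriprep "$BIDS" "$BIDS/derivatives/fmriprep" participant \
  --participant-label 01 --fs-license-file /opt/freesurfer/license.txt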

The thing I’m struggling with is this: in most large projects with a bunch of subjects, you’ve got some one-off subjects that you need to exclude. Maybe there’s an excessive amount of motion and the data is garbage. Maybe there’s a run where the projector turned off midway through. Maybe there’s one subject with a really unfortunate slice prescription that you don’t want to include in group analysis because the intersection of his mask with the other masks excludes too much data.

How is this documented and managed? Is there a BIDS standard for this? I’d like to keep the data in the dataset that we upload, and even if I didn’t, I wouldn’t want to deal with this in the heudiconv heuristics file (“if TRs = 128 and task = ‘faces’, process it, unless it’s the 2nd run of subject 8, or the 3rd run of subject 10, or …”)

Ideally, there’d be something like an “excluded runs” file somewhere, so that the bad runs were documented in a standardized place, and so that by the time modeling scripts ran, they could intelligently skip fitting first-level models on garbage data.

Has anyone done this in a clever way?


The core of this problem is that “exclusion of runs” is your particular interpretation of the quality of the data, which depends on what tools you used to assess it and what you are planning to use the data for. So the answer to which runs to keep will differ from one person to another and from one analysis to another (T1w scans with some motion could be fine as an intermediate coregistration target, but not good for cortical thickness measurements).

At the moment the spec does not specify how to do this, but you can do the following:

  1. Add a known issues section to the README describing what you found problematic about specific runs.
  2. Add a custom column to _scans.tsv files denoting which scans should be excluded or not, and add a _scans.json data dictionary explaining what this column means and how you made the decision (see the sketch below).
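
For example, a sub-01_scans.tsv with such a column might look like this (the column name “exclude” and the filenames are made up for illustration; the real file is tab-separated):

filename	acq_time	exclude
func/sub-01_task-faces_run-01_bold.nii.gz	2017-02-01T10:15:00	0
func/sub-01_task-faces_run-02_bold.nii.gz	2017-02-01T10:30:00	1

with an accompanying data dictionary:

{
  "exclude": {
    "Description": "1 if the run was flagged for exclusion from group analysis (e.g. excessive motion, projector failure), 0 otherwise. Decisions were based on visual inspection of MRIQC reports."
  }
}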

This might also be a good thing to add to the spec. If you could propose a change at https://github.com/bids-standard/bids-specification, that would be great.


Is this something that has been implemented, or could be accessed, when running fMRIPrep? For example, I have some subjects with two T1w scans, and one of them has much more motion artifact than the other, so I’d like fMRIPrep to use only the good T1w rather than averaging the two together. How could I point fMRIPrep to that column in order to decide which T1w to use?

I know I could just put a number of specific subject file names into the .bidsignore file, but that doesn’t seem to be the best long-term solution. We have a similar situation with some rest scans, where we want to exclude one of two rest scans because we saw that the subject fell asleep.

This would be an interesting new feature. However, fMRIPrep currently performs robust averaging of T1w images that excludes outlier voxels. Check whether the output volume looks good; manual exclusion may not be necessary.

I agree (and I’m not sure if FMRIPREP uses .bidsignore anyway). A solution based on a specific column in _scans.tsv would probably be best.
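
In the meantime, downstream scripts can read such a column themselves. A minimal sketch, assuming the hypothetical “exclude” column from the example above:

# Print the filenames of runs flagged for exclusion in one scans.tsv,
# so a modeling script can skip them before fitting first-level models.
awk -F'\t' '
  NR == 1 { for (i = 1; i <= NF; i++) if ($i == "exclude") col = i; next }
  $col == "1" { print $1 }
' /bids/sub-01/sub-01_scans.tsv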

Thanks, @ChrisGorgolewski! I’ll check how it runs on a few people with that situation.

As for the rest scans (wanting to ignore specific scans where subjects fell asleep), I’ll see if the .bidsignore workaround does anything at all. I think it is consulted at some point, as I used it to ignore fieldmap scans that I didn’t want to use (before I realized there was a flag for that). But the column idea seems a much better long-term solution.
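
For reference, .bidsignore takes .gitignore-style patterns, so the workaround would be something like this (the subject and run labels here are just an example):

sub-12/func/sub-12_task-rest_run-02_bold.nii.gz
sub-12/func/sub-12_task-rest_run-02_bold.json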

Hi folks,
I wanted to post a question here as it relates to the original post. I have a BIDS dataset that includes multiple runs of multiple tasks and multiple rest scans. At the moment there’s one task that I’m interested in having fMRIPrep preprocess, but it’s still a little fuzzy to me how to get fMRIPrep to ignore specific scans. I could move the files I’m not interested in, or copy the root BIDS directory and pare it down to contain only the scans I’d like to process, but I’d like to know what other options are out there before proceeding. Any thoughts on this would be greatly appreciated.

Thanks,
C

fMRIPrep has a --task-id argument (see the usage notes). That lets you limit your fMRIPrep run to only that task.
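
For example, a minimal invocation (the paths here are placeholders, not from this thread):

fmriprep /bids /out participant \
  --participant-label 003 \
  --task-id MID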

Thank you @tsalo, I see that in the usage notes. I don’t think I would have made that connection from the argument description alone, so that helps a lot to clarify what that argument does.
Best,
C

Hi @tsalo, please forgive my ignorance on this issue, but I have to be missing something here. If I use the --task-id flag with MID as the argument, I don’t get any of the expected output: no processing of functional data, though I do get _desc-preproc_T1w .json and .nii.gz files in the anat folder of the fmriprep/sub- output directory. This approach took about 30 minutes to complete, so I’m sure something isn’t right.

If I remove the --task-id flag, then I get the expected output: all anatomical and functional scans are processed, normalized, etc. The problem, though, is that I have rest data with multiple runs and three different tasks with two runs each, and I’m only interested in processing one task. So processing the whole enchilada takes a bit longer and takes up unnecessary disk space.

It seems as if the FreeSurfer step isn’t running either, since that’s the one that takes up quite a bit of time. I’ve looked at the usage notes for the version that I’m using and I’m not seeing anything that may be missing. I see there’s an --anat-only flag, but that doesn’t sound right to include. What am I missing? Here’s the command I’m running:

docker run -ti --rm \
  -v /scratch/crodriguez/data:/data:ro \
  -v /scratch/crodriguez/prep/derivatives:/out \
  -v /scratch/crodriguez/prep/wd:/work \
  -v /export/apps/linux-x86_64/freesurfer/freesurfer_6.0/license.txt:/opt/freesurfer/license.txt \
  poldracklab/fmriprep:20.1.1 /data /out/fmriprep-20.1.1 \
  participant -w /work \
  --ignore slicetiming \
  --output-spaces MNIPediatricAsym:cohort-3:res-2 \
  --participant-label 003 \
  --task-id MID \

The function that identifies relevant files based on your criteria must not be catching the MID task data. Can you share an example filename (and path) for the MID task in the dataset?


Absolutely!

MID data path:
/scratch/crodriguez/data/sub-*/ses-baselineYear1Arm1/func/sub-*_ses-baselineYear1Arm1_task-MID_run-01_bold.nii.gz

Each MID task has two runs and each *bold.nii.gz file is accompanied by a name-matched .json file. I’m starting to wonder if it has something to do with how the func, anat, fmap folders are nested in a session directory.

Hm, --task-id should identify MID as long as the filename has MID for the task entity. I was hoping that the filename had something like mid instead.

The ABCD data you’re working with (I assume from Fair’s BIDS version) should be in valid BIDS format, so I wouldn’t be too concerned about the folder structure.

What about that final backslash? When you ran the workflow did you include that?
Alternatively, are you 100% sure that sub-003 has the MID task?

Thanks @tsalo! I ran that same subject with the --sloppy flag and it ran just fine. I was hoping that it would just run quicker, but what ended up happening was that it didn’t stop after about 30 minutes like before. I then tried eliminating the final backslash, and that worked too. When I ran it with the --sloppy flag there was no backslash at the end, so my guess at this point is that the backslash was the culprit. I had included the backslash originally as a way to double-check the command in the terminal before running it, but I guess it throws things off.
Thanks again for the help.