Quality Control files in BIDS

earl · November 18, 2022, 6:44pm

Hello BIDS Community members who like quality control (QC)! I find myself once again pondering the need for a QC file in BIDS so I am starting a Neurostars thread for anyone to weigh in before jumping into making a GitHub pull request or reporting an issue to the bids-specification repo.

The problem is that there appears to be no BIDS mechanism for reporting QC metrics at the individual file level. If I curate a dataset now where I have QC metrics and comments available for each file, I would like to share that in a standardized way. However, it doesn’t belong in participants.tsv or sessions.tsv because there are many ratings per subject and per session, and it doesn’t really belong in the phenotype/ folder as a quality_control.tsv file because it is not phenotypic data.

We did something like this in the ABCC (NDA Collection 3165) for fMRI quality control scores. It would not pass the BIDS Validator, but it feels like a good step forward:

https://collection3165.readthedocs.io/en/stable/recommendations/#3-the-bids-quality-control-file

The NDA also has another few nice examples of this:

The qc_outcome and qc_fail_quest_reason fields in NIMH Data Archive - Data Dictionary: Data Structure
The abcd_fastqc01 data structure here NIMH Data Archive - Data Dictionary: Data Structure

Notably, each of these solutions involves some granularity at the file level or folder level, but importantly they contain QC judgments with a variable amount of information besides pass or fail and the file path. I am leaning toward suggesting something similar to files in the phenotype/ folder in BIDS, but instead it would be a quality_control/ folder.

We are preparing a dataset now at the NIMH Data Science & Sharing Team for which we have QC ratings for each individual defaced anatomical file and I intend to use the phenotype/quality_control.{tsv,json} filename pattern for now to avoid issues with the BIDS Validator.

So, what do other folks think about this problem? Has anyone else done something similar? Should this really be a BIDS Extension Proposal (BEP) with a GitHub Issue and associated BIDS Specification Pull Request or should we just deal with it existing in the phenotype/ folder?

Thanks all!

robert · November 19, 2022, 12:45pm

Dear Earl,

Although not applicable to your use case, for EEG/iEEG and MEG we have the channels.tsv file which can contain the status column that can detail the quality of individual channels in a recording. That is at a more fine-grained level than what you have in mind though, and not for MRI.

In your case I would use the existing scans.tsv file. It already lists each individual recording and hence is at the level of granularity at which you want to store QC. In the scans.json you can store the data dictionary that explains what the QC scores mean.
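A minimal sketch of what this could look like, assuming Python and only the standard library. The filename column and the n/a placeholder follow BIDS conventions for tabular files; the qc_rating and qc_comment column names, their levels, and the helper name write_scans_with_qc are hypothetical choices for illustration, not something the specification defines.

```python
import csv
import json
from pathlib import Path

QC_COLUMNS = ["filename", "qc_rating", "qc_comment"]

# Data dictionary for the sidecar JSON. The QC column names and levels
# are hypothetical, chosen only for illustration.
QC_DATA_DICTIONARY = {
    "qc_rating": {
        "Description": "Overall quality judgment for the file",
        "Levels": {"pass": "Usable as is", "fail": "Not usable"},
    },
    "qc_comment": {"Description": "Free-text note from the rater"},
}


def write_scans_with_qc(folder, prefix, rows):
    """Write <prefix>_scans.tsv and <prefix>_scans.json into folder."""
    folder = Path(folder)
    folder.mkdir(parents=True, exist_ok=True)

    tsv_path = folder / f"{prefix}_scans.tsv"
    with tsv_path.open("w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(
            f, fieldnames=QC_COLUMNS, delimiter="\t", lineterminator="\n"
        )
        writer.writeheader()
        for row in rows:
            # BIDS tabular files mark missing values as "n/a".
            writer.writerow({col: row.get(col) or "n/a" for col in QC_COLUMNS})

    json_path = folder / f"{prefix}_scans.json"
    json_path.write_text(
        json.dumps(QC_DATA_DICTIONARY, indent=2) + "\n", encoding="utf-8"
    )
    return tsv_path, json_path
```

Calling write_scans_with_qc("sub-01", "sub-01", rows) with one dict per recording would place both files in the subject folder, with filename paths given relative to that folder.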

best regards,
Robert

Ariel_Rokem · November 21, 2022, 12:57am

Hey @earl: in HBN POD2, we dealt with this by adding several different QC columns to the participants.tsv file and relevant metadata in the participants.json. But I can see why this would not work in all cases. The scans.tsv/scans.json solution that @robert proposes seems good to me as a general approach.

Related, but not exactly the same, is a BEP we started drafting up around the time of OHBM last summer (together with @richford) that describes how you would organize the html reports that many pipelines are generating: https://docs.google.com/document/d/15C1jI1YC9Yx-Bzo-_LYx12okWY5XaOWR1Xu2dOOvj20/edit#heading=h.4k1noo90gelw.

If the above solution doesn’t work, maybe one BEP could encompass both of these issues?

earl · November 21, 2022, 5:39pm

@robert and @Ariel_Rokem,

I think Robert is right that scans.tsv matches the intent of what I’m talking about, so I’m marking that as the “Solution” here on Neurostars.

And Ariel, I like your approach to putting info like that in the participants.tsv. I’ve done similar things before as well, and it works well to aggregate info at the top level when there is one value per level of granularity. In other words, it works well for one thing per participant_id or one thing per session_id, but I feel it does not extend well to many items per participant_id or session_id. That is a strong motivator, I think, for the BEP036 we’ve got going: the question of whether to aggregate tabular data at the root BIDS level or to segregate it at the subject or session level.

Thanks again!

Marcel_Zwiers · November 21, 2022, 10:53pm

I think the scans.tsv file is a tempting and convenient target to store QC data, but I think the derivatives folder is a more correct target. The purpose of the scans file is to describe timing and other properties of each neural recording file, not results of an analysis of the data. QC data, whether it is the result of computer evaluations (think mriqc) or human evaluations, is not raw/original data; it consists of features extracted from the raw data (features we call noise or artefacts, as opposed to other features we call signal), and I think it thus belongs in the derivatives folder.
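A rough sketch of setting up such a derivatives folder, assuming Python's standard library. The pipeline name manualqc and the helper name init_qc_derivative are hypothetical; the dataset_description.json fields shown (Name, BIDSVersion, DatasetType, GeneratedBy) are the ones BIDS expects for a derivative dataset as far as I understand it, and the version string is only a placeholder.

```python
import json
from pathlib import Path


def init_qc_derivative(bids_root, pipeline_name="manualqc"):
    """Create derivatives/<pipeline_name>/ with a dataset_description.json."""
    deriv_dir = Path(bids_root) / "derivatives" / pipeline_name
    deriv_dir.mkdir(parents=True, exist_ok=True)

    description = {
        "Name": f"{pipeline_name} quality control ratings",
        "BIDSVersion": "1.8.0",  # placeholder; use the version you target
        "DatasetType": "derivative",
        # "manualqc" is a made-up name standing in for whatever tool or
        # human rating protocol produced the QC judgments.
        "GeneratedBy": [{"Name": pipeline_name}],
    }
    path = deriv_dir / "dataset_description.json"
    path.write_text(json.dumps(description, indent=2) + "\n", encoding="utf-8")
    return deriv_dir
```

QC tables written under the returned folder would then sit apart from the raw data while still travelling with the dataset.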

earl · November 23, 2022, 2:06pm

@Marcel_Zwiers Thank you for chiming in! That is an interesting suggestion and I agree that, strictly speaking, QC is a derivative. However, I feel the intent of BIDS is to be flexible while making concrete suggestions. I think the end-user is less likely to find the QC data in the derivatives for the same reason not everyone reads the README, and the curator is less likely to follow standards set forth by the BIDS Validator due to the less strictly validated nature of derivatives folders.

In the same way that I would prepare my breakfast the night before to make the morning easier, I want to curate and share data that the end-user is ready to consume. So I want the data I’ve curated to be as straightforward to use as possible.

This is why in the near future I will be advocating for the ability to house tabular files and/or their sidecar JSONs at any level of the (rawdata) folder structure: I feel putting the data where people will readily find it is nearly as important as curating it in the first place.

Cyril_Pernet · December 9, 2022, 5:47pm

FYI, I have been storing my QC of raw data in their respective JSON files - that is the simplest option IMO, just need to define in a dictionary what the added fields mean.
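As a small illustration of this option, here is a sketch (Python standard library only) that merges QC fields into a scan's existing sidecar JSON. The field names QCRating and QCComment and the helper name add_qc_to_sidecar are hypothetical; they are not defined by BIDS and would need to be documented somewhere for users of the dataset.

```python
import json
from pathlib import Path


def add_qc_to_sidecar(sidecar_path, qc_fields):
    """Merge qc_fields into the sidecar JSON, keeping existing metadata."""
    sidecar_path = Path(sidecar_path)
    metadata = {}
    if sidecar_path.exists():
        metadata = json.loads(sidecar_path.read_text(encoding="utf-8"))
    # Existing keys with the same name are overwritten by the new QC values.
    metadata.update(qc_fields)
    sidecar_path.write_text(
        json.dumps(metadata, indent=2) + "\n", encoding="utf-8"
    )
    return metadata
```

For example, add_qc_to_sidecar("sub-01/anat/sub-01_T1w.json", {"QCRating": "pass", "QCComment": "n/a"}) would leave acquisition fields such as RepetitionTime untouched and append the two QC entries.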

earl · December 21, 2022, 4:14pm

@Cyril_Pernet This is another great point. Sidecar JSON metadata like that is a great idea for housing a human-made judgment call like QC as well.

My revised position, which I will be advocating for and which relates a bit to BEP036, is that curated metadata for tabular data (derivative or not) should be allowed, and validated, at any level if it satisfies the necessary details at that level. I think this would also satisfy the spirit of the inheritance principle. For instance, if I wanted to keep QC metrics per scan they should be allowed at:

1. The root of the folder structure as scans.tsv, provided it includes the participant_id (and session_id, as necessary).
2. The subject subfolder, named sub-{participant_id}_scans.tsv, where it should not be required to include the participant_id column.
3. The session subfolder, when present, named sub-{participant_id}_ses-{session_id}_scans.tsv, where it should not be required to include the participant_id and session_id columns.
4. The individual scan’s sidecar JSON metadata, where no participant_id or session_id is necessary. (I’m not married to Cyril’s use case here, but it is worth including for discussion’s sake.)

Number 1 in the above list is my preference because I feel like people want to be able to subset their data more quickly on categories like “QC == pass” or “group == case” without having to crawl a bunch of smaller files.
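To make the trade-off concrete, here is a hedged sketch (Python standard library only) of the crawl that a root-level table would spare the end-user: it walks the per-subject and per-session scans.tsv files, adds participant_id and session_id from each file name, and then subsets on a QC column. The qc_rating column and the helper names aggregate_scans and select_passing are hypothetical.

```python
import csv
from pathlib import Path


def aggregate_scans(bids_root):
    """Merge every sub-*/**/*_scans.tsv under bids_root into one row list."""
    rows = []
    for tsv_path in sorted(Path(bids_root).glob("sub-*/**/*_scans.tsv")):
        # Parse key-value entities such as sub-01 and ses-01 from the name.
        entities = dict(
            part.split("-", 1)
            for part in tsv_path.name.split("_")
            if "-" in part
        )
        with tsv_path.open(newline="", encoding="utf-8") as f:
            for row in csv.DictReader(f, delimiter="\t"):
                rows.append(
                    {
                        "participant_id": f"sub-{entities['sub']}",
                        "session_id": (
                            f"ses-{entities['ses']}" if "ses" in entities else "n/a"
                        ),
                        **row,  # filename stays relative to its scans.tsv folder
                    }
                )
    return rows


def select_passing(rows, column="qc_rating", value="pass"):
    """Keep only the rows whose QC column equals the requested value."""
    return [row for row in rows if row.get(column) == value]
```

With the table already aggregated at the root, only the second step would be needed.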

Thanks for this lively discussion, everyone!