I’m trying to use the BIDS schema for generating BIDS datasets, with a good amount of metadata stored in an electronic lab notebook… For this, I need to know “in which BIDS tsv file(s) should a given metadata field go?”, and I’d like to do this with the BIDS schema…
After some (heavy) thinking, I think this would be solved if I get the answer to the following question: in the BIDS specs available on the website (let’s say here for the data summary files: Data summary files - Brain Imaging Data Structure 1.10.1), each tsv file has a table listing the metadata fields (~columns) that go into that tsv file. How could I access (or reconstruct) such a table from the BIDS schema?
(I’ve looked here and there in the schema, as well as in bidsschematools, but I didn’t manage to get the answer)
I guess the overall challenge in using the BIDS schema is its intrinsic complexity, which makes the information somewhat difficult to find; but the doc you all wrote seems really good, it’s just vast! Here, in this particular case, I think that:
in the schema itself, I could have found the info myself by looking a bit better…
in the doc (that you linked), I have to admit that the title “Valid fields for definitions” may be a bit too vague… but I’m not sure I can provide a good suggestion at this point (I have to digest its content)
Other than that, I have one question: why is there a capital letter in the schema at the beginning of the category in rules.tabular_data.<category> (Participants, Samples, Scans etc.), whereas everything seems to be fully in lowercase everywhere else…? Is there a purpose / a function / a match with something else where this capitalization exists?
Thanks for the feedback! Please feel free to open issues/PRs with any suggestions you have and tag them with schema. I’ll also see updates to this thread.
Generally, under rules, you have <path>.<rulename>.<rule>. path is snake case, rulename is CamelCase, and rule contents are (I think) snake case.
I think this arose because the schema started as what is now objects.metadata, which are JSON fields that BIDS uses CamelCase for. So when we started writing other files, we kept doing the same thing. This isn’t universal, though; since columns are snake_case in BIDS, the entries in objects.columns are, too.
That said, you shouldn’t attach a semantic meaning to the capitalization. It’s just a name for the rule object.
Hi @effigies! I’m following up on this, and in particular I’m coming back to your request for suggestions on how to improve the doc of BidsSchemaTools… What would be great is to provide code snippets that answer practical questions using the schema… This is actually what I’m trying to do at the moment, and it is fairly difficult (probably because of my lack of skills)… Anyhow, for my initial question (title of the thread), I came up (fairly easily) with:
If I want to know which of these columns is required / recommended / optional, the information is clearly present in what follows, but is not at the same level in the schema for participant_id vs. the other columns…
{'participant_id': {'level': 'required',
                    'description_addendum': 'There MUST be exactly one row for each participant.\n'},
 'species': 'recommended',
 'age': 'recommended',
 'sex': 'recommended',
 'handedness': 'recommended',
 'strain': 'recommended',
 'strain_rrid': 'recommended',
 'HED': 'optional'}
Is this intended? I find this neither very logical nor practical, but there’s probably a good reason for this… Anyhow, how would you get the info easily (I guess this is a Python question)?
for a given data modality (e.g. microscopy), what is the list of files (tsv, json, sidecars) that can/should be included next to the microscopy data files (bs.objects.files.to_dict().keys() gives me part of the answer for the top-level files and directories; I need to select only the files, and get the other ones that are not at the top level)? and what is the status (mandatory or not) of each file?
for a given metadata item (e.g. participant_id, or NumericalAperture, which don’t have exactly the same status…), in what file(s) should it go?
Sorry, I started writing and got bogged down and failed to hit send. Here’s a partial post.
In this case, I would probably do:
levels = {
    key: value if isinstance(value, str) else value['level']
    for key, value in bs.rules.tabular_data.modality_agnostic.Participants.columns.items()
}
A human reader/writer may benefit from the flexibility (why add indirection in the common case?) while tool developers benefit from consistency. Possibly we made the wrong choice when we found we needed more than just requirement levels (the level and description addenda are used in rendering).
I can see a couple ways to go from here:
We can just convert all string values in the source files to {'level': orig_value}.
We could auto-convert these values when compiling the YAML, so programmatic consumers could count on the full structure.
These are backwards compatible in the sense that existing tools that are flexible will continue to work, but new tools will either need to be flexible or set a hard minimum on the version of schema supported.
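To make the auto-conversion idea concrete, here is a minimal sketch of what expanding the bare strings could look like. It runs on a plain dict copied (and shortened) from the Participants output earlier in the thread, so it doesn't need bidsschematools; `normalize_columns` is a hypothetical helper name, not something that exists in the schema tools.

```python
from collections.abc import Mapping


def normalize_columns(columns):
    """Expand bare requirement-level strings into {'level': ...} mappings.

    Hypothetical helper: entries that are already mappings are copied as-is.
    """
    normalized = {}
    for name, spec in columns.items():
        if isinstance(spec, str):
            normalized[name] = {"level": spec}
        elif isinstance(spec, Mapping):
            normalized[name] = dict(spec)  # copy so the input is not mutated
        else:
            raise TypeError(f"Unexpected column spec for {name!r}: {spec!r}")
    return normalized


# Shortened sample of the Participants columns shown earlier in the thread.
participants_columns = {
    "participant_id": {
        "level": "required",
        "description_addendum": "There MUST be exactly one row for each participant.\n",
    },
    "species": "recommended",
    "age": "recommended",
    "HED": "optional",
}

normalized = normalize_columns(participants_columns)
# After normalization every entry can be read the same way.
levels = {name: spec["level"] for name, spec in normalized.items()}
print(levels)
```

With the real schema you would presumably pass in the `columns` mapping of a rule instead of the sample dict; I haven't checked that against every schema version.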
Another option would be to start building out a class structure where we could write something like TabularDataRule(bs.rules.tabular_data.modality_agnostic.Participants), using a structure like:
The loaders could perform the expansions, and that only needs to be coded up once, and can support a range of schema versions. The trick here is that if we write this directly into bidsschematools, there’s not a lot of difference between changing the schema and adding the structures, since they’ll both have a minimum version. I think we need to think about an API that lives outside the schema tools in order for this to be useful.
In general, files aren’t required (you can have a BIDS dataset with nothing but a dataset_description.json); metadata is required. If you have a microscopy file, then it has required metadata, which can be found in either the immediate sidecar or a sidecar file that matches according to the inheritance principle. The logic followed by the validator is:
Collect dataset-level metadata.
For each file, construct a context containing dataset-level and file-specific metadata.
Search through schema.rules.files for a rule that the file satisfies.
Loop over all rules in schema.rules.sidecars, evaluating the expressions in selectors to determine if a rule applies to the file. If so, check whether the described fields are present and match the types of the metadata fields.
Loop over all rules in schema.rules.checks, evaluating the expressions in selectors to determine if a rule applies to the file. If so, evaluate all of the expressions in checks to determine if the rule is satisfied or violated.
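The sidecar step above can be sketched roughly as follows. This is a toy, not the validator's code: in the real schema the selectors are expression strings that the validator evaluates, whereas here they are replaced by plain Python callables, and the rule names, the `context` layout and the requirement levels are all made up for illustration.

```python
# Toy stand-ins for entries of schema.rules.sidecars (assumption: the real
# rules use expression strings as selectors, not Python callables).
sidecar_rules = {
    "MicroscopyExample": {
        "selectors": [lambda ctx: ctx["datatype"] == "micr"],
        "fields": {"PixelSize": "required", "NumericalAperture": "recommended"},
    },
    "EegExample": {
        "selectors": [lambda ctx: ctx["datatype"] == "eeg"],
        "fields": {"SamplingFrequency": "required"},
    },
}


def applicable_rules(context, rules):
    """Keep only the rules whose selectors all evaluate to True for this file."""
    return {
        name: rule
        for name, rule in rules.items()
        if all(selector(context) for selector in rule["selectors"])
    }


def missing_fields(context, rules, level="required"):
    """List '<rule>.<field>' for fields at the given level absent from the sidecar."""
    missing = []
    for name, rule in applicable_rules(context, rules).items():
        for field_name, field_level in rule["fields"].items():
            if field_level == level and field_name not in context["sidecar"]:
                missing.append(f"{name}.{field_name}")
    return missing


# Hypothetical per-file context: file properties plus the merged sidecar metadata.
context = {
    "datatype": "micr",
    "suffix": "SEM",
    "sidecar": {"NumericalAperture": 1.4},
}

print(missing_fields(context, sidecar_rules))
```

The loop over schema.rules.checks would follow the same select-then-evaluate pattern, with the checks expressions evaluated instead of looking up field names.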
Finishing up with the last question (which files a given metadata item goes in): this can only be obtained by iterating over the rules.sidecars, rules.json (for objects.metadata) and rules.tabular_data (for objects.columns) entries. Something like:
table_rules = {
    column.name: [
        f'{tables}.{table_key}'
        for tables in ('rules.tabular_data', 'rules.tabular_data.derivatives')
        if tables in schema  # Consider that `derivatives` might get squashed to match raw
        for table_key, table in schema[tables].items(level=2)
        if 'columns' in table and column_key in table.columns
    ]
    for column_key, column in schema.objects.columns.items()
}
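If it helps to see the same reverse lookup without the schema object, here is a self-contained sketch on a plain nested dict. The data is a toy that only mimics the shape of rules.tabular_data, and `iter_tables` / `columns_to_tables` are hypothetical helper names. Note that it indexes by the column key used in the rule rather than by `column.name`, and it walks to any depth instead of relying on `items(level=2)`.

```python
from collections import defaultdict
from collections.abc import Mapping


def iter_tables(node, prefix):
    """Yield (dotted_path, table) for every nested mapping that has a 'columns' key."""
    for key, value in node.items():
        if not isinstance(value, Mapping):
            continue  # skip strings, lists and other leaf values
        path = f"{prefix}.{key}"
        if "columns" in value:
            yield path, value
        else:
            yield from iter_tables(value, path)


def columns_to_tables(tabular_data, prefix="rules.tabular_data"):
    """Map each column key to the dotted paths of the table rules that list it."""
    index = defaultdict(list)
    for path, table in iter_tables(tabular_data, prefix):
        for column_key in table["columns"]:
            index[column_key].append(path)
    return dict(index)


# Toy data shaped like rules.tabular_data (not the full schema content).
tabular_data = {
    "modality_agnostic": {
        "Participants": {
            "columns": {"participant_id": {"level": "required"}, "age": "recommended"},
        },
        "Samples": {
            "columns": {"sample_id": "required", "participant_id": "required"},
        },
    },
}

index = columns_to_tables(tabular_data)
for column_key, tables in index.items():
    print(column_key, tables)
```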