How to update metadata with datalad?

apraga · November 23, 2024, 9:51pm

Hi,

I’m trying to use metadata to track file versions beyond what a simple git log provides. I managed to add new metadata with
datalad meta-extract metalad_core clinvar_GRCh38.vcf.gz | datalad meta-add -

Writing it to a JSON file and using `datalad meta-add` again did not update it but added a new metadata item.
How can I update the first and remove the second ?

There is no delete command yet and I cannot figure out where the metadata is stored.

Thanks !

StephanHeunis · November 23, 2024, 11:40pm

Hi @apraga

You’re right, there isn’t a command to delete a metadata item explicitly. I recommend reading through the handbook chapter about datalad-metalad to get more context for how the tool works, especially the “Find out more” section that explains how metalad-specific metadata is stored in a datalad dataset. The metadata is stored as blobs in git’s internal object store, which is not really designed to be changed manually and requires a more advanced level of understanding of git’s internal functioning. So it’s not recommended to make any changes to those blobs without knowing exactly what you are doing. It’s still possible though, e.g. with a tool like git-filter-repo mentioned here in the handbook: 9.2. Miscellaneous file system operations — The DataLad Handbook

In the absence of a delete command for datalad-metalad, I would suggest sorting/filtering the metadata that is returned by meta-dump based on some other field, such as extraction_time, in order to identify the exact metadata item you are interested in.
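
The sorting and filtering suggested above could be sketched in Python roughly as follows. This is only a sketch: it assumes the records come from `datalad meta-dump -r` as one JSON object per line, and that each record carries `extractor_name`, `extraction_time` and (for file-level records) `path` fields. The helper name `latest_records` is made up for illustration.

```python
import json


def latest_records(dump_text):
    """Keep only the newest record per (extractor, path) pair.

    dump_text: JSON lines, e.g. captured from `datalad meta-dump -r`.
    The field names used here are assumptions about the record layout.
    """
    newest = {}
    for line in dump_text.splitlines():
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        # Dataset-level records have no "path"; use "" for them.
        key = (record.get("extractor_name"), record.get("path", ""))
        current = newest.get(key)
        if current is None or record["extraction_time"] > current["extraction_time"]:
            newest[key] = record
    # Oldest first, so the last entry is the most recent one overall.
    return sorted(newest.values(), key=lambda r: r["extraction_time"])
```

The older items stay in the dataset; this only selects which ones to look at.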

apraga · November 24, 2024, 2:08pm

Hi @StephanHeunis,

Thanks for clearing that up. I wanted to track file versions with metadata but that makes it hard to update them.
As a follow-up question, what are best practices for tracking file versions ?
Trying to follow YODA principles, my idea was to have one dataset per major version and track minor versions with git commits and tags.
Another option would be to use git branches, but that would not work well with git tags.
A third idea would be to use git annex metadata.

Thanks !

StephanHeunis · November 24, 2024, 2:43pm

No problem.

Could you maybe explain your use case a bit more? I am not sure I understand the goal. What changes when a file version changes? What is the aim of tracking file versions with something like semantic versioning or different git branches/tags (i.e. refs), and why is this needed above what git already provides with commits (that point to tree hashes and blob hashes)?

apraga · November 24, 2024, 11:06pm

Sorry if it was lacking context. I’m working with human genomic data, both the genome data and other data needed for our analysis.
The thing is these “other data” have their own versions but often must exist for different genome versions.
Usually, they are stored per directory according to the genome version:

genome_v1.0/
- database1_v7
- database2_v1
genome_v2.0/
- database1_v7
- database2_v2

It is like having a major version (the genome version) and a minor version (the database version). My idea was to have one dataset per database, grouped by origin, and to offer each database for every genome version:

database1/
- database1_genome_v1.0
- database1_genome_v2.0
database2/
- database2_genome_v1.0
- database2_genome_v2.0

Which does not work well with tags…

Also, a final constraint is to be able to easily print out all the versions at once (the genome version and each database version). So just a git commit works… but is not ideal !

Hope that’s clearer now !
Thanks,

StephanHeunis · November 25, 2024, 1:27pm

That’s indeed clearer, thanks for the extra info.

It sounds to me that you might be able to use datalad subdatasets (a.k.a. submodules in git) for this purpose. It does exactly what I think you want, which is to track versioned dependencies of specific dataset versions.

E.g. have a look at the file tree on this page of a datalad dataset: studyforrest-data/artifact at master · psychoinformatics-de/studyforrest-data · GitHub. It lists, and links to, several subdatasets each having a specific version. E.g. 3T_movie_eyetracking @ 276ceff, where the version is just a shortened version of the exact commit hash of the dependent subdataset repository.
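
To print all pinned subdataset versions at once, one option is to parse the output of `git submodule status` in the superdataset. Here is a minimal sketch, assuming paths without spaces; the helper name `parse_submodule_status` is made up for illustration.

```python
import re

# One line of `git submodule status`: an optional state flag, the commit
# hash, the submodule path, and optionally "(describe output)".
_STATUS_LINE = re.compile(r"^([ +\-U]?)([0-9a-f]{7,64})\s+(\S+)(?:\s+\((.*)\))?$")


def parse_submodule_status(status_text):
    """Return a {path: short_commit} mapping from `git submodule status` output."""
    versions = {}
    for line in status_text.splitlines():
        match = _STATUS_LINE.match(line)
        if match:
            _flag, commit, path, _describe = match.groups()
            # Shorten the hash the same way hosting sites display it.
            versions[path] = commit[:7]
    return versions
```

The input text could come from `subprocess.run(["git", "submodule", "status"], capture_output=True, text=True, check=True).stdout`, run inside the superdataset.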

Does this sound useful for you?

FTR, metalad’s metalad_core extractor also extracts subdataset information by default. So if you have the genome_v1.0 dataset, with subdatasets database1_v7 and database2_v1, their dependency information will be extracted as well (i.e. the commit at which the identified subdataset is a dependency). And later when you update your main dataset to v2.0 (which includes the update of the subdataset database2 to v2) and then extract the metadata again, these changed dataset and subdataset versions will be reflected in the metadata. So you will have two top level metadata items that reflect the versioned evolution of the whole hierarchy.

apraga · November 26, 2024, 7:06pm

Thanks for the pointers. I have indeed been playing with subdatasets. After some more reflection, I think the simpler approach is a meta-dataset with a subdataset for each database.
Each database has a branch for each genome version.
See GitHub - apraga/dgenomes for the metadataset and GitHub - apraga/clinvar for a database.
Do you think that’s acceptable ?

The only drawback is the database version and genome version are hard-coded in git commit messages, so not very robust for extracting them automatically.
If the file is updated, is it possible somehow to update the metadata ?
Edit: after testing, custom keys in metadata do not seem to be available, so adding the genome version is not doable.

Edit2: I plan to submit the datasets to datalad if possible, so your advice is much appreciated !

StephanHeunis · November 26, 2024, 9:57pm

This seems like the same setup that I suggested, so yes this makes sense to me. One thing I noticed though is that the link to the pinned version of the subdataset gives a 404. I checked the .gitmodules file and the url seems to point to a path in the same repo, i.e. not to the subdataset repo. Is this intentional? What sequence of commands did you use to add the subdataset / submodule?
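
A quick way to spot this kind of entry is to scan `.gitmodules` for relative urls, which a hosting site would be expected to resolve against the superdataset repository rather than a separate one. The snippet below is a rough sketch using only the standard library; the function name `relative_submodule_urls` is made up, and it assumes a simple `.gitmodules` layout with `path` and `url` keys.

```python
import configparser


def relative_submodule_urls(gitmodules_text):
    """Return {submodule_name: url} for entries whose url is a relative path."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(gitmodules_text)
    suspicious = {}
    for section in parser.sections():
        # Section headers look like: submodule "clinvar"
        if not section.startswith("submodule "):
            continue
        name = section[len("submodule "):].strip().strip('"')
        url = parser.get(section, "url", fallback="")
        if url.startswith(("./", "../")):
            suspicious[name] = url
    return suspicious
```

For example, `relative_submodule_urls(open(".gitmodules").read())` run at the root of the superdataset lists the entries worth a second look.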

With regards to having a branch per genome version, this is not strictly necessary, because a branch is also just a reference to a commit. So different versions could be different commits in a single branch, or in different branches. The branch names make the versions easy to identify for a human looking at them, but they are not necessarily easy to extract automatically.

> The only drawback is the database version and genome version are hard-coded in git commit messages, so not very robust for extracting them automatically.

I saw that you did this. Not a bad idea, but there could be other options if your goal is to make automated extraction possible. If I understand correctly, the database version is tied to some external way of tracking it? Is this version contained in the file itself, i.e. in the clinvar.vcf file? If so you could create your own custom extractor that extracts the version from the file in the subdataset, and adds it as metadata to the subdataset. Otherwise you could add a textfile containing the version to the subdataset and create an extractor to do the same from the text file. Or you could create an extractor to get the version from the branch name, i.e. git ref.
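
As an illustration of the first option, the core of such an extractor could be a small function that reads the version information from the VCF header. This is only a sketch of the extraction logic, not metalad's extractor API. The header keys `fileDate` and `reference` are an assumption about what the clinvar file contains, and the function name is made up.

```python
import gzip


def read_vcf_header_fields(path, wanted=("fileDate", "reference")):
    """Collect selected `##key=value` lines from the header of a VCF file."""
    opener = gzip.open if str(path).endswith(".gz") else open
    found = {}
    with opener(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            # Meta-information lines start with "##"; the header ends at
            # the first line that does not.
            if not line.startswith("##"):
                break
            key, sep, value = line[2:].rstrip("\n").partition("=")
            if sep and key in wanted:
                found[key] = value
    return found
```

The file content has to be present locally (e.g. after `datalad get`) for this to work. The returned dictionary is what a custom extractor would then report as metadata.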

> If the file is updated, is it possible somehow to update the metadata ?

Yes, you could run meta-extract again. And if you create a custom extractor for getting the version, this would each time give you an updated metadata item with the database version.

> Edit: after testing, custom keys in metadata do not seem to be available, so adding the genome version is not doable.

I am not sure what exactly you tested and how your conclusion was drawn, so I can’t add much substance to this other than the previous comments about a custom extractor.

I hope all of this helps. It seems like you may have a use case that requires a bit of troubleshooting and exploration. We have a virtual datalad office hour every Tuesday at 16h00 CET, in case you want to show up and speak to all the experts about your use case in real time. You can find more info about this office hour here: You're invited to talk on Matrix

apraga · December 9, 2024, 10:11am

Hi Stephan,

Thanks for the link. I’ll play around with the data with your suggestions and may join the office hour if needed.

Have a nice day,