This seems like the same setup that I suggested, so yes, this makes sense to me. One thing I noticed, though, is that the link to the pinned version of the subdataset gives a 404. I checked the .gitmodules file, and the URL seems to point to a path in the same repo, i.e. not to the subdataset repo. Is this intentional? What sequence of commands did you use to add the subdataset / submodule?
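In case it helps with checking this, here is a minimal Python sketch that lists the submodule URLs recorded in a .gitmodules file and flags the relative ones. A URL starting with ./ or ../ is resolved by git relative to the superdataset's own remote, which could explain a 404 on the hosting site. The function names and the example content are made up for illustration; your actual file may look different.

```python
import configparser


def submodule_urls(gitmodules_text):
    """Return {submodule_name: url} parsed from the text of a .gitmodules file."""
    parser = configparser.ConfigParser(interpolation=None)
    # Strip the leading tabs git writes so every line is a plain "key = value".
    cleaned = "\n".join(line.strip() for line in gitmodules_text.splitlines())
    parser.read_string(cleaned)
    urls = {}
    for section in parser.sections():
        # Section headers look like: submodule "clinvar"
        name = section.split('"')[1] if '"' in section else section
        urls[name] = parser.get(section, "url", fallback=None)
    return urls


def is_relative(url):
    """True for URLs that git resolves relative to the superdataset's remote."""
    return url is not None and url.startswith(("./", "../"))


# Hypothetical .gitmodules content, only for illustration.
example = '''[submodule "clinvar"]
\tpath = clinvar
\turl = ./clinvar
'''

for name, url in submodule_urls(example).items():
    note = "relative to the superdataset" if is_relative(url) else "absolute"
    print(f"{name}: {url} ({note})")
```

If the URL turns out to be relative, pointing it at the actual subdataset repository and committing the change to .gitmodules should make the pinned link resolve.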
With regard to having a branch per genome version: this is not strictly necessary, because a branch is also just a reference to a commit. So different versions could be different commits in a single branch, or in different branches. Branch names make the version easy to identify for a human looking at the repo, but the version is not necessarily easy to extract automatically from them.
The only drawback is the database version and genome version are hard-coded in git commit messages, so not very robust for extracting them automatically.
I saw that you did this. Not a bad idea, but there could be other options if your goal is to make automated extraction possible. If I understand correctly, the database version is tracked in some external way? Is this version contained in the file itself, i.e. in the clinvar.vcf file? If so, you could create your own custom extractor that extracts the version from the file in the subdataset and adds it as metadata to the subdataset. Otherwise, you could add a text file containing the version to the subdataset and create an extractor to do the same from the text file. Or you could create an extractor to get the version from the branch name, i.e. the git ref.
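To make the first option a bit more concrete, here is a rough Python sketch of the part of such an extractor that reads the version from the file. It assumes the VCF header contains lines such as ##fileDate=... and ##reference=..., which you would need to verify against your own clinvar.vcf. The function names and the keys database_version and genome_version are hypothetical, and this is not the extractor interface itself, only the logic you could wrap in one.

```python
import gzip


def read_vcf_meta(path):
    """Collect the '##key=value' header lines of a VCF file into a dict."""
    opener = gzip.open if str(path).endswith(".gz") else open
    meta = {}
    with opener(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            if not line.startswith("##"):
                break  # the header ends at the '#CHROM' line or the first record
            key, sep, value = line[2:].rstrip("\r\n").partition("=")
            if sep:
                # Keep the first occurrence of repeated keys such as INFO.
                meta.setdefault(key, value)
    return meta


def version_record(path):
    """Build a small metadata record from the header (key names are assumptions)."""
    meta = read_vcf_meta(path)
    return {
        "database_version": meta.get("fileDate"),
        "genome_version": meta.get("reference"),
    }
```

Calling version_record("clinvar.vcf") would return a small dict that an extractor could emit as its result. If the header does not carry these fields, the same approach works for a plain text file that you maintain next to the VCF.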
If the file is updated, is it possible somehow to update the metadata ?
Yes, you could run meta-extract again. And if you create a custom extractor for getting the version, each run would give you an updated metadata item with the database version.
Edit: after testing, custom keys in metadata seems not available so adding the genome version is not doable.
I am not sure what exactly you tested and how your conclusion was drawn, so I can’t add much substance to this other than the previous comments about a custom extractor.
I hope all of this helps. It seems like you may have a use case that requires a bit of troubleshooting and exploration. We have a virtual datalad office hour every Tuesday at 16h00 CET, in case you want to show up and speak to all the experts about your use case in real time. You can find more info about this office hour here: You're invited to talk on Matrix