Incomplete data on datasets.datalad.org

kinshuk · December 9, 2021, 1:28pm

Hello!
I’m working on archiving important open-access datasets in long-term decentralized storage (Filecoin). I’m currently using the datasets on the datalad index as a starting point to map out the general pipeline.

It looks like several datasets have sources hosting incomplete data. When I run datalad get -r, I get the following error multiple times: "(file) [not available; Try making some of these repositories available:; ...]".
For example, when I run

datalad install -r nidm
cd nidm
datalad get . -r

I get a stream of errors which begin like this:

[ERROR ] not available; Try making some of these repositories available:; c5b2b9f1-2062-45ce-87d9-8794dfcb000f [get(/mnt/volume_sgp1_01/scripts/datasets/nidm/.datalad/metadata/objects/0e/cn-2e262d6ab40f1f0fa399e80866f732.xz)]
get(error): .datalad/metadata/objects/0e/cn-2e262d6ab40f1f0fa399e80866f732.xz (file) [not available; Try making some of these repositories available:; c5b2b9f1-2062-45ce-87d9-8794dfcb000f]
[ERROR ] not available; Try making some of these repositories available:; c5b2b9f1-2062-45ce-87d9-8794dfcb000f [get(/mnt/volume_sgp1_01/scripts/datasets/nidm/.datalad/metadata/objects/0e/ds-2e262d6ab40f1f0fa399e80866f732)]
get(error): .datalad/metadata/objects/0e/ds-2e262d6ab40f1f0fa399e80866f732 (file) [not available; Try making some of these repositories available:; c5b2b9f1-2062-45ce-87d9-8794dfcb000f]

I get similar errors for some other datasets I tried - studyforrest, physionet, neurovault. I haven’t tried some of the larger datasets as I wanted to get the pipeline working before scaling up the infrastructure.
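To see how many files are affected and which remote they point to, the error stream can be tallied. This is a minimal sketch that assumes the "get(error): ..." line format shown above with a single remote UUID per line; the function name summarize_get_errors is made up for illustration and is not part of datalad.

```python
import re
from collections import Counter

# Matches the per-file result lines shown above, e.g.
#   get(error): <path> (file) [not available; ...:; <uuid>]
# Assumption: exactly one remote UUID is listed at the end of the line.
ERROR_LINE = re.compile(
    r"^get\(error\): (?P<path>.+?) \(file\) "
    r"\[not available; .*?; (?P<uuid>[0-9a-f-]{36})\]$"
)


def summarize_get_errors(lines):
    """Return (missing_paths, Counter of remote UUIDs) from `datalad get` output."""
    missing = []
    remotes = Counter()
    for line in lines:
        match = ERROR_LINE.match(line.strip())
        if match:
            missing.append(match.group("path"))
            remotes[match.group("uuid")] += 1
    return missing, remotes


# Two result lines copied from the output above; the "[ERROR ]" log lines
# describe the same files, so they are deliberately not matched.
sample = [
    "get(error): .datalad/metadata/objects/0e/cn-2e262d6ab40f1f0fa399e80866f732.xz (file) "
    "[not available; Try making some of these repositories available:; "
    "c5b2b9f1-2062-45ce-87d9-8794dfcb000f]",
    "get(error): .datalad/metadata/objects/0e/ds-2e262d6ab40f1f0fa399e80866f732 (file) "
    "[not available; Try making some of these repositories available:; "
    "c5b2b9f1-2062-45ce-87d9-8794dfcb000f]",
]

missing, remotes = summarize_get_errors(sample)
for uuid, count in remotes.items():
    print(f"{uuid}: {count} file(s) not available")
```

In the excerpt above, both failures point at the same remote UUID, which suggests one unreachable source rather than many.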

I would really appreciate being pointed in the right direction; I think I’m missing some important step here.

Thanks!

yarikoptic · December 9, 2021, 2:50pm

COOL! Thank you!

would there be a way to backreference files in filecoin? (maybe for some datasets we could point to those locations as well, if they are going to be public or even if they require some auth)
for some we might be able to address the issue (e.g. by working with @eknahm for studyforrest), for some – might not (without providing our own copy).
for some, e.g. openneuro – since they use datalad for their backend, such issues would need to be brought to their attention
for some, we might not even be able to address the issues (data might be gone… we aren’t hosting the data typically, so if gone from original project, might be gone forever)

so, altogether, we could try to go piecemeal and start with those datasets, or maybe prepare a full list, well – spreadsheet on google docs? – and annotate the status/references to related issues, etc. WDYT?

edit1: actually, it might be even better to establish a github repo with a machine-fillable and human-modifiable json or yaml file which would provide the status for all those datasets, with comments etc, from which we could render the “dashboard” in the README or something like that. I have become a big fan of that approach, as we use it for https://github.com/datalad/datalad-usage-dashboard https://github.com/datalad/datalad-extensions/ etc (although in those cases it is primarily all machine generated).
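As a rough illustration of the status-file idea, here is a minimal sketch that renders a JSON status list into a Markdown table for a README. The field names (dataset, status, issue, comment) and the status values are assumptions made up for this example; this is not how the dashboards linked above are generated.

```python
import json

# Hypothetical status records. The schema and values are placeholders:
# a script could fill in "status", and people could edit "issue" and "comment".
STATUS_JSON = """
[
  {"dataset": "nidm", "status": "errors on get", "issue": null,
   "comment": "metadata objects not available"},
  {"dataset": "studyforrest", "status": "errors on get", "issue": null,
   "comment": ""}
]
"""

HEADER = "| Dataset | Status | Issue | Comment |"
SEPARATOR = "| --- | --- | --- | --- |"


def render_dashboard(records):
    """Render status records as a Markdown table, one row per dataset."""
    rows = [HEADER, SEPARATOR]
    for record in records:
        rows.append(
            "| {dataset} | {status} | {issue} | {comment} |".format(
                dataset=record["dataset"],
                status=record["status"],
                # Missing or null optional fields render as empty cells.
                issue=record.get("issue") or "",
                comment=record.get("comment") or "",
            )
        )
    return "\n".join(rows)


table = render_dashboard(json.loads(STATUS_JSON))
print(table)
```

Keeping the data in one file and regenerating the table from it means manual annotations and automated checks can both update the same source.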

kinshuk · December 10, 2021, 8:25am

Addressing each of your points:

Broken symlinks are a definite no-go due to the way Filecoin requires data to be packaged before storage (it creates a DAG of the directory tree, serializes this tree and encodes it in a particular way - Content Addressable aRchives, or CAR files). One way could be to ‘fork’ the datasets after removing the broken links, and re-add the missing data later on, while recording this forking info in the git history.
It’s more useful to think of Filecoin storage as giving us the ability to take periodic “snapshots” of the state of the dataset.
This would be really useful!
Agree. Although it might take quite a bit of work to manually repair sources for each dataset.
This is exactly the kind of problem we want to solve with decentralized archival.
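On the broken-symlink point above: in a git-annex repository in its default (non-adjusted) mode, annexed files whose content is not present locally typically show up as dangling symlinks, so they can be listed before packaging. A minimal sketch using only the Python standard library follows; find_broken_symlinks is a hypothetical helper, not part of datalad or the Filecoin tooling, and the demo needs a platform that allows creating symlinks.

```python
import os
import tempfile


def find_broken_symlinks(root):
    """Return sorted paths (relative to root) of symlinks whose target is missing."""
    broken = []
    for dirpath, dirnames, filenames in os.walk(root):
        # Do not descend into git's own directory.
        if ".git" in dirnames:
            dirnames.remove(".git")
        for name in filenames:
            path = os.path.join(dirpath, name)
            # islink() inspects the link itself; exists() follows it,
            # so a dangling link is a link for which exists() is False.
            if os.path.islink(path) and not os.path.exists(path):
                broken.append(os.path.relpath(path, root))
    return sorted(broken)


# Small self-contained demo in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    with open(os.path.join(tmp, "present.dat"), "w") as handle:
        handle.write("content")
    os.symlink("present.dat", os.path.join(tmp, "ok_link.dat"))
    os.symlink("no_such_object", os.path.join(tmp, "missing.dat"))
    result = find_broken_symlinks(tmp)

print(result)
```

The resulting list could feed the "fork without the broken links" step, with the removed paths recorded so the data can be re-added later.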

I have a very rough doc that @seldamat and I have been preparing. We can collab on getting a polished version up for everyone to access if there’s interest.

This is interesting. How is the usage data collected?

yarikoptic · December 15, 2021, 2:45pm

quick overarching answer on data usage data (if that is the “usage” you were referring to): it isn’t collected at all

Longer one – FWIW there are some datasets.datalad.org http server logs, but since data could be coming from other locations, we would not even be able to get a hint that any particular dataset is still used/accessed from those. Clones shared on github, OSF, etc aren’t tracked at all. We aim for maximal dissemination and transparency, thus need to sacrifice some aspects (such as tracking) along the way.

edit: and yes – there is interest to collab on the doc.