Hi! I was hoping to see if anyone in the community has any ideas on a problem we ran into recently.
Background
An HPC system we use has a giant but constrained file system that holds many large open datasets. These are free to access, and storing data on this file system is very inexpensive. We'll call this system A.
There is a second file system that is not constrained but is more expensive to store data on. We'll call this one system B.
Both A and B are network-mounted and available to the entire HPC.
Goal
We'd like to use DataLad/BABS to process these large datasets without copying their content from A to B. The processing would happen on B, and we'd ultimately like to store the results in a RIA store on A.
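For concreteness, the end state we're picturing is that anyone on the HPC could retrieve results directly from the store on A, along these lines (the store path and dataset ID are placeholders, not real values):

```bash
# Hypothetical end state: clone analysis results from the RIA store on A.
# Both the path and <dataset-id> below are placeholders.
datalad clone "ria+file:///path/on/A/ria-store#<dataset-id>" results-ds
cd results-ds
datalad get sub-01/
```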
Current plan
Our current idea is to create a DataLad dataset on file system B, similar to how the HCP open access dataset was created. We would use `datalad addurls`, with the URLs pointing at file paths on system A (a rough sketch is below). Then we'd use BABS/FAIRly big to process the data on system B. After all the result branches are merged, we'd move the resulting RIA store to system A.
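To make the `addurls` step concrete, here is roughly what we have in mind; the dataset name, CSV layout, and all paths are made up for illustration:

```bash
# Create the input dataset on system B (name is illustrative).
datalad create raw-on-B
cd raw-on-B

# table.csv maps each file's location on A to a path inside the dataset, e.g.:
#   url,path
#   file:///path/on/A/ds000001/sub-01/anat/sub-01_T1w.nii.gz,sub-01/anat/sub-01_T1w.nii.gz
datalad addurls table.csv '{url}' '{path}'

# addurls downloads each file once to compute its checksum; afterwards the
# content can be dropped from B, since git-annex knows it can re-fetch it
# from the registered file:// URL on A.
datalad drop .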
Question
Does this sound reasonable?
In this setup, the data would be stored only on system A, with checksums and remote locations recorded in the DataLad dataset on B. Is there a way to regularly verify that the data on A hasn't been changed?
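One thought we had (please correct us if this is the wrong tool) is to periodically run git-annex's fsck against the URL remote from the dataset on B, assuming the file:// URLs end up registered with the web special remote:

```bash
# Re-download each file from its registered URL on A and verify it still
# matches the recorded checksum (slow but thorough):
git annex fsck --from web

# Presence-only check that skips checksum verification (much faster):
git annex fsck --fast --from web
```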
Has anyone else attempted to use BABS/FAIRly big with data on a crippled file system?
Thanks in advance!