Hi! I was hoping to see if anyone in the community has any ideas on a problem we ran into recently.
Background
An HPC system we use has a giant but constrained file system that holds many large open datasets. These are free to access, and storing data on this file system is very inexpensive. We'll call this system A.
There is a second file system that is not constrained but is more expensive to store data on. We'll call this one system B.
Both A and B are network-mounted and available to the entire HPC.
Goal
We'd like to use DataLad/BABS to process these large datasets without copying their content from A to B. The processing would happen on B, and we'd ultimately like to store the results in a RIA store on A.
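For concreteness, the end state we're picturing is that anyone on the HPC could retrieve results directly from the store on A, along these lines (the store path and dataset ID are placeholders, not real values):

```bash
# Hypothetical end state: clone analysis results from the RIA store on A.
# Both the path and <dataset-id> below are placeholders.
datalad clone "ria+file:///path/on/A/ria-store#<dataset-id>" results-ds
cd results-ds
datalad get sub-01/
```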
Current plan
Our current idea is to create a DataLad dataset on file system B, similar to how the HCP open access dataset was created. We would use `datalad addurls`, with the URLs pointing at file paths on system A (a rough sketch is below). Then we'd use BABS/FAIRly big to process the data on system B. After all the result branches are merged, we'd move the resulting RIA store to system A.
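To make the `addurls` step concrete, here is roughly what we have in mind; the dataset name, CSV layout, and all paths are made up for illustration:

```bash
# Create the input dataset on system B (name is illustrative).
datalad create raw-on-B
cd raw-on-B

# table.csv maps each file's location on A to a path inside the dataset, e.g.:
#   url,path
#   file:///path/on/A/ds000001/sub-01/anat/sub-01_T1w.nii.gz,sub-01/anat/sub-01_T1w.nii.gz
datalad addurls table.csv '{url}' '{path}'

# addurls downloads each file once to compute its checksum; afterwards the
# content can be dropped from B, since git-annex knows it can re-fetch it
# from the registered file:// URL on A.
datalad drop .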
Question
Does this sound reasonable?
In this setup, the data would be stored only on system A, with checksums and remote locations recorded in the DataLad dataset on B. Is there a way to regularly verify that the data on A hasn't been changed?
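One thought we had (please correct us if this is the wrong tool) is to periodically run git-annex's fsck against the URL remote from the dataset on B, assuming the file:// URLs end up registered with the web special remote:

```bash
# Re-download each file from its registered URL on A and verify it still
# matches the recorded checksum (slow but thorough):
git annex fsck --from web

# Presence-only check that skips checksum verification (much faster):
git annex fsck --fast --from web
```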
Has anyone else attempted to use BABS/FAIRly big with data on a crippled file system?
Thanks in advance!