Using DataLad for managing data on a cluster

Hello everyone,

We want to use DataLad for managing the data on our institute's cluster.

However, when we convert an existing (test) folder into a DataLad dataset, all files are moved to ./.git/annex/objects/…, and only links to the moved data in ./.git/annex/objects/… remain at the previous file locations.

Our questions:

  1. Is this the normal behavior of DataLad?
  2. We are worried that if we convert our data folders into DataLad datasets, all data on our cluster will be relocated to the .git/annex/objects path.

Thanks in advance,
Julian


Hi Julian,

We want to use DataLad for managing the data on our institute's cluster.

Great!

  1. Is this the normal behavior of DataLad?

Yes, without any particular configuration this is the normal behavior. DataLad manages your data with two main tools: git and git-annex. This is useful because it creates a modular and portable dataset (the git repository) that contains information about all the data (and its history) inside the dataset, while the actual data content is managed by git-annex, which can handle arbitrarily large files. File content that is placed under git-annex management is moved into ./.git/annex/objects/…, while a symbolic link (symlink) to this content remains at the original path and is tracked in the git repository.
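
For illustration, here is a minimal sketch of converting an existing folder and inspecting the result (the folder path and file name are made up, and this assumes datalad and git-annex are installed):

    cd /data/testfolder                    # hypothetical existing data folder
    datalad create --force .               # turn the folder into a dataset (--force: directory is not empty)
    datalad save -m "add existing files"   # this is the step that moves content into .git/annex/objects
    ls -l sub-01.nii.gz                    # hypothetical file: now a symlink into .git/annex/objects/...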

By managing the dataset with these tools, you can, for example, share the dataset (the git repository) publicly while keeping the contents safe on your cluster. People can then obtain the dataset (with datalad clone) and download individual files in the dataset (with datalad get) if they have access credentials for that location on your cluster.
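
As a sketch of what that access pattern could look like for a collaborator (the URL and file path below are placeholders):

    datalad clone ssh://cluster.example.org/data/mydataset   # obtain the lightweight git repository
    cd mydataset
    datalad get sub-01/data.nii.gz                           # fetch actual file content on demand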

However, you can use configurations to specify how DataLad should manage your data. You might want to commit all your files to git (not advisable if they are large or numerous, or if you don't want to make them publicly available!), only some files to git (e.g., small text files, or files that should be publicly available), or everything to the annex.

The DataLad Handbook has some very useful information on applying standard or custom configurations to your datasets; one common setup is sketched below.
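
For example, one standard configuration described in the handbook is the text2git procedure, which keeps text files in git and annexes everything else; alternatively, git-annex's largefiles setting in .gitattributes lets you define your own rule (the dataset name and the 10 MB threshold below are just examples):

    datalad create -c text2git my-dataset   # run-procedure: text files go to git, the rest to the annex

    # or, as a line in the dataset's .gitattributes, annex only files larger than 10 MB:
    * annex.largefiles=(largerthan=10mb)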

  2. We are worried that if we convert our data folders into DataLad datasets, all data on our cluster will be relocated to the .git/annex/objects path.

This will happen, but it shouldn't be something to worry about. The files will still be accessible through DataLad via their original paths. Is there something specific about this that worries you?

Also, if you later decide to stop using DataLad and want to remove version control from your files/datasets and take the content back out of the annex, you can do this with git annex unannex. See a complete description here: 9.2. Miscellaneous file system operations — The DataLad Handbook
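
A minimal sketch of that removal path (the file name is hypothetical):

    git annex unannex sub-01.nii.gz   # replaces the symlink with the actual file content again
    # afterwards the file is no longer under git-annex management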

Best,

Stephan

Thank you, Stephan, for your quick reply, which has already helped us a lot in our decision process.

This will happen, but it shouldn't be something to worry about. The files will still be accessible through DataLad via their original paths. Is there something specific about this that worries you?

We were worried because DataLad shifts the data in the dataset to the .git/annex/objects path upon creation of the dataset, and we assumed that shifting all the data would be a computationally heavy operation. But if I understand it correctly now, the data is not shifted physically to the .git/annex/objects path. I mean, the location of the data on the hard drive is not changed, is it?
So it would not take too long to initialize a data folder of ~0.5 TB as a DataLad dataset, would it?

Thanks again!

Best
Julian

DataLad, via git-annex, does in fact shift the file content into the .git/annex/objects/ directory once you run datalad save after creating the dataset with datalad create. There is a high-level explanation of what happens when file content is put under git-annex management here: how it works. However, when accessing the content via DataLad, a user can still find the file at its original path.
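
You can see this for yourself by inspecting one of the symlinks after a save (the file name is hypothetical, and the exact key in the link target depends on the git-annex backend in use):

    readlink sub-01.nii.gz
    # -> .git/annex/objects/xx/yy/MD5E-s...--....nii.gz   (content lives inside the dataset's annex)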

How long this "annexing" operation takes depends very much on the infrastructure and file system on which it is executed, and on the number and size of files in the dataset; a large part of the cost is git-annex computing a checksum of every file's content. My guess is that turning a single 0.5 TB folder into a DataLad dataset can take quite some time (many minutes to hours, or even longer). What you could do is test it on a partial sample dataset resembling your actual data and see how your system handles that.
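
If you want a rough number before committing the full 0.5 TB, something along these lines would give you an estimate (paths are made up):

    cp -r /data/real/sub-01 /scratch/annex-test   # copy a representative sample
    cd /scratch/annex-test
    datalad create --force .
    time datalad save -m "annexing benchmark"     # the save step is where the annexing cost shows up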

A common option for making these operations more manageable is to split the dataset into several subdatasets. See a nice write-up here: 2.1. Going big with DataLad — The DataLad Handbook. E.g., if it is a typical neuroimaging dataset, you could turn each subject's collection of files into a dataset and register all of these as subdatasets of a single superdataset. That way you could use a job scheduling system on your HPC to run parallel jobs that turn the data into DataLad datasets, one per subject.
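
A rough sketch of that layout (subject names are placeholders; on a real cluster you would wrap the per-subject work in scheduler jobs):

    datalad create superds        # the superdataset
    cd superds
    datalad create -d . sub-01    # create and register one subdataset per subject
    datalad create -d . sub-02
    # then place each subject's files inside its subdataset and run the datalad save jobs in parallel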

Dear Stephan,

Thanks again for your insights and the explanatory links.

The approach you proposed (splitting up the original dataset and running parallel jobs to convert these chunks into DataLad subdatasets) is probably the way we are going to go.

Best
Julian