Datalad and S3 Bucket Usage on HPC


Our lab is considering using a local S3 bucket to facilitate distribution of, and collaboration on, large-scale DataLad datasets on our cluster. The idea is to keep centralised datasets in the bucket, from which every lab member can pull and to which they can publish. A motivating factor is that the pipeline outputs of these datasets exceed our local quotas on the cluster (the bucket is far larger), so pulling from and pushing to the bucket immediately/during the job might alleviate our storage problems. We are wondering whether such a setup is feasible, or whether a local (on-cluster) dataset that everybody interacts with is necessary to easily address merging issues etc., with a subsequent push to the S3 sibling.
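Concretely, the per-job cycle I have in mind is roughly the following (a sketch; `s3-storage` and the paths are placeholders for our actual sibling name and layout):

```shell
# Clone the centralised dataset (cheap: no annexed file content yet)
datalad clone ssh://cluster.example/data/central-dataset ds
cd ds

# At job start: fetch only the inputs this job needs
datalad get inputs/sub-01

# ... run the pipeline, writing into outputs/ ...

# Record the results and push their content to the S3 special remote
datalad save -m "Add pipeline outputs for sub-01" outputs/sub-01
datalad push --to s3-storage outputs/sub-01

# Free local quota once the content is safely in the bucket
datalad drop outputs/sub-01
```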

Can someone provide some guidance in this regard? Other input on handling large datasets with DataLad is also very welcome (I have found and studied the relevant DataLad handbook chapters).

Hi @vinpetersen

This doesn’t answer your full question, but in case you missed it: there is a chapter in the handbook that provides guidance on setting up an S3 bucket as a git-annex special remote for your DataLad dataset.
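For reference, the setup from that chapter boils down to something like this (bucket name, host, and credentials are placeholders; the `host`/`port`/`requeststyle` options are what you would add to point git-annex at a local, non-AWS S3 endpoint such as MinIO):

```shell
# Credentials for the S3 endpoint (placeholder values)
export AWS_ACCESS_KEY_ID=<your-key-id>
export AWS_SECRET_ACCESS_KEY=<your-secret>

cd /path/to/dataset

# Register the bucket as a git-annex special remote;
# autoenable=true lets collaborators' clones use it automatically
git annex initremote s3-storage type=S3 encryption=none \
    bucket=lab-datasets host=s3.our-cluster.example port=443 \
    requeststyle=path autoenable=true

# Publish annexed content to the bucket
datalad push --to s3-storage
```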

Others with more experience collaborating on large datalad datasets might be better able to give tips about when/how to update siblings.


You might bottleneck at the outbound network level if a lot of traffic goes to/from S3. You could instead set up your own storage server on the local network, accessible e.g. via SSH, to which you push/pull data, and then push from it to S3 (or maybe an institutional Dropbox or similar) only for extra backup/collaboration, if so desired.
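To sketch that idea: a dataset store on a lab-owned server, reachable over SSH, can be set up as a RIA sibling that everyone pushes to day-to-day, with S3 kept as a secondary backup copy (hostnames, paths, and the `s3-storage` sibling are placeholders):

```shell
# One-off: create a RIA store sibling on the lab storage server
datalad create-sibling-ria -s lab-storage --new-store-ok \
    "ria+ssh://user@storage.our-cluster.example/data/ria-store"

# Day-to-day: collaborate via the on-network SSH sibling
datalad push --to lab-storage

# Occasionally: mirror content to S3 for extra backup
datalad push --to s3-storage
```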


Hi @StephanHeunis and @yarikoptic ,

thanks for your answers. Indeed, I had already found the chapter on establishing an S3 bucket. Thanks for your perspective, Yaroslav; that's exactly what I was looking for.
