Our lab is considering using a local S3 bucket to facilitate distribution of, and collaboration on, large-scale DataLad datasets on our cluster. The idea is to keep centralised datasets in the bucket that every lab member can pull from and publish to. A motivating factor is that the pipeline outputs of these datasets exceed our local quotas on the cluster (the bucket is far larger), so pulling from and pushing to the bucket immediately, i.e. during the job, might alleviate our storage problems. We are wondering whether such a setup is feasible, or whether a local dataset on the cluster that everybody interacts with is necessary to easily address (e)merging issues etc., with results then pushed to the S3 sibling. A rough sketch of the workflow we have in mind is below.
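To make the question more concrete, here is a minimal sketch of what we imagine, using the DataLad Python API plus a plain `git annex initremote` call for the special remote. Hostnames, ports, bucket, paths, and sibling names are placeholders for our setup, and credentials are assumed to be provided via `AWS_ACCESS_KEY_ID`/`AWS_SECRET_ACCESS_KEY`:

```python
import subprocess
import datalad.api as dl

# One-time setup by whoever creates the central dataset:
# register the cluster's S3-compatible bucket as a git-annex special remote.
# All endpoint/bucket values below are placeholders.
ds = dl.Dataset("/path/to/central-dataset")
subprocess.run(
    [
        "git", "annex", "initremote", "cluster-s3",
        "type=S3",
        "encryption=none",
        "host=s3.our-cluster.example",   # placeholder endpoint
        "port=9000",
        "protocol=http",
        "requeststyle=path",
        "bucket=lab-datalad-store",      # placeholder bucket name
        "autoenable=true",
    ],
    cwd=ds.path,
    check=True,
)

# What an individual compute job would do: clone, fetch only the inputs it
# needs, compute, save, push results to the bucket, and drop local copies
# to stay within quota.
job_ds = dl.clone(
    source="/shared/git-repos/central-dataset",  # placeholder git sibling
    path="/scratch/job-clone",
)
job_ds.get("inputs/subject-01")      # pull only what this job needs
# ... run the pipeline, writing results into the clone ...
job_ds.save(message="Add pipeline outputs for subject-01")
job_ds.push(to="cluster-s3")         # annexed file content -> S3 bucket
job_ds.push(to="origin")             # git history -> shared git sibling
job_ds.drop("outputs/subject-01")    # free local scratch space again
```

As far as we understand, the S3 special remote only holds annexed file content, so the Git history itself would still need a regular Git sibling that everyone can clone from and push to (that is what `origin` stands for above), which is part of why we are unsure whether a shared on-cluster dataset is needed anyway.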
Can someone provide some guidance in this regard? Other input on handling large datasets with DataLad is also very welcome (I have found and studied the relevant DataLad handbook chapters).