I want to use Bitbucket for my DataLad repository (the non-large files) and Amazon S3 for large data storage. I think I know how to go about the latter, but I am not clear about the former. I believe I cannot do “git remote add” as normal for the repository, and will have to use something similar to datalad create-sibling-github but for Bitbucket. Is this correct?
Hi @sumodm,
There is no Bitbucket equivalent of create-sibling-github, i.e., no command that will create a sibling repository on Bitbucket for you. What I would recommend is to visit Bitbucket's web interface and create a new, empty repository there. Afterwards, add the repository as a sibling (remote) to your dataset with
$ datalad siblings add --name bitbucket --url clone-url.from.bitbucket
This will create a sibling bitbucket that you can publish your non-large files to. Beyond that, if you have already configured your S3 special remote sibling, you can add it as a publication dependency. If you have a special remote sibling called “s3bucket”, the command would look like this:
$ datalad siblings add --name bitbucket --url clone-url.from.bitbucket --publish-depends s3bucket
With this setup, all annex files will be pushed automatically to your s3 bucket prior to pushing your dataset to Bitbucket.
However, you could also do a git remote add bitbucket <url>, since a dataset sibling is the DataLad equivalent of a Git remote. Using datalad siblings add, however, has the advantage that you can configure the publication dependency to S3 (which saves you from publishing to both Bitbucket and S3 separately).
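For the plain-Git route, the dependency can apparently still be recorded by hand, because (as noted further down in this thread) a publication dependency is just the config option remote.<name>.datalad-publish-depends. The following is a minimal sketch under that assumption, not a tested DataLad workflow. The URL is a made-up placeholder, "s3bucket" is the example special remote name from above, and the scratch repository exists only so the commands can be tried safely.

```shell
# Scratch repository so the sketch can be run without touching a real dataset.
demo_repo="$(mktemp -d)"
git init --quiet "$demo_repo"

# Placeholder URL (hypothetical): substitute the clone URL shown by Bitbucket.
git -C "$demo_repo" remote add bitbucket https://bitbucket.org/example-user/example-dataset.git

# Record the publication dependency by hand in .git/config.
git -C "$demo_repo" config remote.bitbucket.datalad-publish-depends s3bucket

# Show everything that is now configured for this remote.
git -C "$demo_repo" config --get-regexp '^remote\.bitbucket\.'
```

In a real dataset the same two git commands would be run from the dataset root, without the -C option.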
Thanks Adina. Adding the sibling with datalad siblings doesn’t seem to work (see error below), but manually adding it with “git remote add” works and I was able to use it normally. For the large files, I am using S3 and was able to use it after enabling the S3 repo.
Error message while adding Bitbucket as a repo using “datalad siblings add”:
[INFO ] Configure additional publication dependency on “priv_s3”
[INFO ] Failed to enable annex remote bitbucket, could be a pure git or not accessible
[WARNING] Failed to determine if bitbucket carries annex. Remote was marked by annex as annex-ignore. Edit .git/config to reset if you think that was done by mistake due to absent connection etc
Just FTR: the siblings command has probably worked. Just the reporting you saw is rather unfortunate. I opened an issue to address this: https://github.com/datalad/datalad/issues/4322
Hey, I have a related question: when doing this, every person who wants to use this dataset would have to do two steps:
datalad clone clone-url.from.bitbucket foo-dataset
cd foo-dataset
datalad siblings configure --name origin --publish-depends s3bucket
because the datalad-publish-depends config param is stored in .git/config and so will not be cloned/fetched from the remote git repo.
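The point about .git/config can be checked with plain Git. The sketch below uses two throwaway repositories (the directory names are arbitrary and no DataLad is involved): an option is set in one repository's local config, the repository is cloned, and the clone's local config is queried for the same option, which should come back empty.

```shell
# Throwaway repositories; nothing here touches a real dataset.
work="$(mktemp -d)"
git init --quiet "$work/upstream"
git -C "$work/upstream" -c user.name=demo -c user.email=demo@example.com \
    commit --quiet --allow-empty -m "initial commit"

# Set the option only in upstream's local .git/config.
git -C "$work/upstream" config remote.origin.datalad-publish-depends s3bucket

# Clone, then look for the option in the clone's own .git/config.
git clone --quiet "$work/upstream" "$work/clone"
cloned_value="$(git -C "$work/clone" config --local --get remote.origin.datalad-publish-depends || true)"
echo "value in clone: '${cloned_value}'"
```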
Is there any way to work around this so that datalad clone immediately sets up the publishing dependency? Maybe datalad could store the publish-depends configuration in .datalad/config instead (which is part of the git repo)? This is obviously not specific to Bitbucket, but applies to plain git remote repos in general.
Thanks in advance!
every person who wants to use this dataset would have to do two steps …
Just FTR, this would only be necessary for people who have push access to origin and intend to use it. For consumption of the dataset this is not needed.
Maybe datalad could store the publish-depends configuration in .datalad/config instead (which is part of the git repo)?
That may work. Setting a publication dependency is merely setting a config option
remote.<name>.datalad-publish-depends=<dep>
or in your case
remote.origin.datalad-publish-depends=s3bucket
It is possible to (manually) put this in .datalad/config. I have not tested it (yet), though. I am also not 100% confident that there are no negative side-effects, or security implications that would prevent DataLad from honoring such a remote configuration (now or in the future).
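One way to put the option into .datalad/config without editing the file by hand is git config --file. This is only a sketch: the scratch directory stands in for a dataset root, "origin" and "s3bucket" are the names used in this thread, and whether DataLad honors the setting from that file is exactly the untested part mentioned above.

```shell
# Scratch directory standing in for the root of a dataset.
demo_dir="$(mktemp -d)"
mkdir -p "$demo_dir/.datalad"

# Write the option into the tracked .datalad/config rather than .git/config.
git config --file "$demo_dir/.datalad/config" remote.origin.datalad-publish-depends s3bucket

# Read it back to confirm what was written.
stored_value="$(git config --file "$demo_dir/.datalad/config" --get remote.origin.datalad-publish-depends)"
echo "stored: $stored_value"

# In a real dataset the change would then need to be committed, e.g.:
#   datalad save -m "Record publish dependency" .datalad/config
```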
Amazing, thanks! This worked like a charm.
Wouldn’t it make sense to somehow make this the default? It would essentially eliminate the need for custom create-sibling-{github,gitlab} commands, no? Or at least have a CLI flag like --save-sibling-config-repo when doing a regular create-sibling.
By the way, it would be great to also update the Handbook documentation on using shared infrastructure to inform users that if they do datalad create-sibling <..> --publish-depends 'foo', this will NOT be persisted / made available for others who clone the repo. As it stands, it was a bit confusing, and I had to figure out manually that this section can’t work with regular create-sibling commands: you MUST use create-sibling-{github,gitlab} if you want to share the publish-depends configuration.
Thanks for the feedback! Will look into it, and PRs are also always welcome!