Using Bitbucket with Datalad

I want to use Bitbucket for my DataLad repository (the non-large files) and Amazon S3 for large data storage. I think I know how to go about the latter, but I am not clear about the former. I believe I cannot simply do “git remote add” as usual for the repository, and will instead have to use something similar to datalad create-sibling-github, but for Bitbucket. Is this correct?

Hi @sumodm,

There is no Bitbucket equivalent of create-sibling-github, i.e., no command that will create a sibling repository on Bitbucket for you. What I would recommend is to visit Bitbucket's web interface and create a new, empty repository there. Afterwards, add the repository as a sibling (remote) to your dataset with

$ datalad siblings add --name bitbucket --url clone-url.from.bitbucket

This will create a sibling named bitbucket that you can publish your non-large files to. In addition, if you have already configured your S3 special remote sibling, you can add it as a publication dependency. If you have a special remote sibling called “s3bucket”, the command would look like this:

$ datalad siblings add --name bitbucket --url clone-url.from.bitbucket --publish-depends s3bucket

With this setup, all annexed files will be pushed automatically to your S3 bucket prior to pushing your dataset to Bitbucket.

However, you could also do a git remote add bitbucket <url>: a dataset sibling is the DataLad equivalent of a Git remote. Using datalad siblings add, however, has the advantage that you can configure the publication dependency on S3 in the same step (which saves you from publishing to Bitbucket and to S3 separately).
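For the plain Git route, here is a minimal sketch of what the equivalent setup might look like. It relies on the point made later in this thread that a publication dependency is just the config option remote.<name>.datalad-publish-depends. The scratch repository and the Bitbucket URL are placeholders I made up for illustration, and I have not checked this against every DataLad version:

```shell
set -eu

# Scratch repository so the sketch does not touch a real dataset (assumption).
demo=$(mktemp -d)
git init -q "$demo"

# Hypothetical URL: substitute the clone URL that Bitbucket shows for your repository.
git -C "$demo" remote add bitbucket https://bitbucket.org/example-user/example-repo.git

# Record the publication dependency by hand; "s3bucket" is the name of the
# special remote sibling used in the example above.
git -C "$demo" config remote.bitbucket.datalad-publish-depends s3bucket

# Show what was stored.
git -C "$demo" config --get remote.bitbucket.datalad-publish-depends
```

In a real dataset you would skip the scratch repository and run the two git commands inside the dataset directory.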

Thanks Adina. Adding the sibling with datalad siblings add doesn’t seem to work (see error below), but manually adding it with “git remote add” works, and I was able to use it normally. For the large files, I am using S3 and was able to use it after enabling the S3 repo.

Error message while adding Bitbucket as a sibling using “datalad siblings add”:

[INFO ] Configure additional publication dependency on “priv_s3”
[INFO ] Failed to enable annex remote bitbucket, could be a pure git or not accessible
[WARNING] Failed to determine if bitbucket carries annex. Remote was marked by annex as annex-ignore. Edit .git/config to reset if you think that was done by mistake due to absent connection etc

Just FTR: the siblings command has probably worked. Just the reporting you saw is rather unfortunate. I opened an issue to address this: https://github.com/datalad/datalad/issues/4322
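One way to check whether the sibling was recorded despite those messages is to list everything Git has stored for the remote. The sketch below recreates by hand, in a scratch repository, the state the messages describe (an annex-ignore flag plus the publication dependency on priv_s3); the URL is a placeholder, and the recreated state is my assumption about what the command leaves behind, not captured output:

```shell
set -eu

# Scratch repository standing in for the dataset (assumption).
repo=$(mktemp -d)
git init -q "$repo"

# Recreate the assumed state by hand; the URL is hypothetical.
git -C "$repo" remote add bitbucket https://bitbucket.org/example-user/example-repo.git
git -C "$repo" config remote.bitbucket.annex-ignore true
git -C "$repo" config remote.bitbucket.datalad-publish-depends priv_s3

# The actual check: list every setting stored for the remote.
git -C "$repo" config --get-regexp '^remote\.bitbucket\.'
```

In a real dataset only the last command is needed. If the url and datalad-publish-depends entries show up, the sibling was added.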

Hey, I have a related question: when doing this, every person who wants to use this dataset would have to do two steps:

datalad clone clone-url.from.bitbucket foo-dataset
cd foo-dataset 
datalad siblings configure --name origin --publish-depends s3bucket

because the datalad-publish-depends config param is stored in .git/config and so will not be cloned/fetched from the remote git repo.
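To illustrate, here is a small sketch using plain Git only. The scratch directories and the demo commit identity are made up for the example:

```shell
set -eu

# Scratch area with an "upstream" repository (hypothetical names).
work=$(mktemp -d)
git init -q "$work/upstream"
git -C "$work/upstream" -c user.name=demo -c user.email=demo@example.com \
    commit -q --allow-empty -m "initial commit"

# Store the dependency in the upstream repository's local .git/config.
git -C "$work/upstream" config remote.origin.datalad-publish-depends s3bucket

# Clone it and look for the setting in the clone.
git clone -q "$work/upstream" "$work/clone"
if git -C "$work/clone" config --get remote.origin.datalad-publish-depends; then
    echo "dependency present in clone"
else
    echo "dependency not present in clone"
fi
```

The clone gets its own fresh .git/config, so the key set in the upstream repository does not come along.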

Is there any way to work around this so that datalad clone immediately sets up the publishing dependency? Maybe datalad could store the publish-depends configuration in .datalad/config instead (which is part of the git repo)? This is obviously not specific to Bitbucket, but applies to plain Git remotes in general :)

Thanks in advance!

every person who wants to use this dataset would have to do two steps …

Just FTR, this is only necessary for people who have push access to origin and intend to use it. For consumption of the dataset this is not needed.

Maybe datalad could store the publish-depends configuration in .datalad/config instead (which is part of the git repo)?

That may work. Setting a publication dependency is merely setting a config option

remote.<name>.datalad-publish-depends=<dep>

or in your case

remote.origin.datalad-publish-depends=s3bucket

It is possible to (manually) put this in .datalad/config. I have not tested it (yet), though. I am also not 100% confident that there are no negative side effects or security implications that would prevent DataLad from honoring such a remote configuration (now or in the future).
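A minimal sketch of one way to write that entry, using git config --file so the file does not have to be edited by hand. The scratch repository and the demo commit identity are assumptions for illustration, and the caveats above about whether DataLad honors the setting still apply:

```shell
set -eu

# Scratch repository standing in for the dataset (assumption); a real
# dataset already has a .datalad directory.
ds=$(mktemp -d)
git init -q "$ds"
mkdir -p "$ds/.datalad"

# Write the option into the committed config file instead of .git/config.
git config --file "$ds/.datalad/config" remote.origin.datalad-publish-depends s3bucket

# Inspect the resulting file, then commit it so clones receive it.
cat "$ds/.datalad/config"
git -C "$ds" add .datalad/config
git -C "$ds" -c user.name=demo -c user.email=demo@example.com \
    commit -q -m "Record publication dependency in .datalad/config"
```

In a real dataset the last two steps could probably be replaced by a datalad save.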

Amazing, thanks! This worked like a charm.

Wouldn’t it make sense to somehow make this the default? It would essentially eliminate the need for custom create-sibling-{github,gitlab} commands, no? Or at least have a CLI flag along the lines of --save-sibling-config-repo when doing a regular create-sibling :)

By the way, it would be great to also update the Handbook documentation on using shared infrastructure to inform users that if they do datalad create-sibling <..> --publish-depends 'foo', then this will NOT be persisted or made available to others who clone the repo. As it stands, it was a bit confusing, and I had to figure out manually that this section can’t work with regular create-sibling commands: you MUST use create-sibling-{github,gitlab} if you want to share the publish-depends configuration.

Thanks for the feedback! :) Will look into it, and PRs are also always welcome!