How to set clone locations to local only on HPC

Summary of what happened:

Hi,

I was wondering how to set the clone candidate priorities in such a way that remotes reached via the internet are tried last or not at all.
We are using DataLad a lot on compute nodes of our HPC environment that don’t have internet access, and cloning takes longer than it needs to because it tries web URLs first.

I have tried to set a clone candidate priority with remote-origin/{path} but had no luck. How can I set the clone candidate to local only?

Thanks.

Hi @akieslinger! If I understand your situation correctly, I would recommend setting the clone candidate priority via the configuration variable datalad.get.subdataset-source-candidate-<name>. The DataLad Handbook chapter 1.5.2 (Clone candidate priority) outlines how to configure the priority of subdataset clone locations by attaching a cost to a source candidate.

I think the information in the Handbook will allow you to achieve what you need, but if that is not the case, please let us know.
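
For illustration, the general shape is a configuration item whose name carries a three-digit cost prefix and whose value is a URL template; a lower cost means the candidate is tried earlier. A minimal sketch, assuming a made-up candidate name (100localmirror) and a made-up mirror path (/data/mirrors); {path} is expanded by DataLad to the subdataset's path within the superdataset:

    # register a clone candidate with cost 100 in the dataset's .datalad/config,
    # so that it is committed and travels with the dataset
    git config -f .datalad/config --add datalad.get.subdataset-source-candidate-100localmirror "/data/mirrors/{path}"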

Hi loj,

thanks for your reply! I have read the Handbook, but was not able to reach my goal with that information alone.

My problem is that I don’t know how to reference the remote-origin clone in the config variable correctly without specifying its path explicitly.
Edit, a short example: I have a nested dataset at location A on my system and clone it to location B. Dataset A is the remote origin here. Dataset B then tries to “get” its subdatasets from web sources first instead of from dataset A.
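
To make it concrete, a minimal sketch of what I do (all paths and the subdataset name are made up):

    # dataset A is a nested dataset that lives locally
    datalad clone /data/templates/A /scratch/B
    cd /scratch/B
    # on a compute node without internet access this tries the web URLs
    # recorded in .gitmodules first, instead of getting the subdataset from A
    datalad get -n some/subdataset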

The API documentation for datalad get says that it should be possible to use remote-origin in the path, but this either causes errors for me or the path is not found. I do not want to specify the explicit path to the remote origin in this dataset.
Alternatively, I would want to set the cost for the local clone candidate to 000 (“In case .gitmodules contains a relative path as a URL, the absolute path of the superdataset, appended with this relative path (cost 900).” check here).

Could you give an example for how to do this?

Edit: I don’t want to give an absolute path as I want this configuration to stay with a “dataset template” no matter its physical location.

Hi @akieslinger

If I understand correctly, you want the remote-origin URL (the superdataset’s “origin” remote) to be prioritized for subdataset clones, ahead of the URL recorded in .gitmodules (by default it is the other way around).

Can you try the following configuration in your .datalad/config:

[datalad "get"]
	subdataset-source-candidate-000startLocal = {remoteurl-origin}/{path}

FTR, 000 is “top priority”, but any cost below 590 (or maybe 500, I’m not quite sure at the moment) should work.
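
In case it helps, here is roughly how that could be applied and checked from the superdataset root; an untested sketch, and the subdataset path code/somesub is only a placeholder:

    # write the candidate into .datalad/config so it travels with the dataset
    git config -f .datalad/config datalad.get.subdataset-source-candidate-000startLocal "{remoteurl-origin}/{path}"
    # commit the configuration change
    datalad save -m "prefer origin-derived URLs for subdataset clones" .datalad/config
    # in a fresh clone, this should now try the origin-derived URL first
    datalad get -n code/somesub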

It seems like a documentation issue. Based on what we learned here, I filed an issue on GitHub: “Documentation for `datalad.get.subdataset-source-candidate` option gives an incorrect property for URL of a configured remote” (datalad/datalad#7458).

Please let us know if that worked for you!

Hi mszczepanik,

thank you for your help, this worked for me!