Datalad data copy for untracked usage

Hi datalad gurus

Here are 2 use cases that @melanieganz and I cannot figure out:

  • we can datalad clone, then datalad get, and work on our files, yeh
  • A - sometimes we can’t … some containers, like fastsurfer, do not resolve the symbolic link to the git-annex object
  • B - other times we want to copy some files for students to play with, untracked, just to mess about with them

→ how can one make copies of the data (dataset) located in my git-annex to some other location, not tracked and not as symbolic links? (would solve both A and B)

thx
@yarikoptic @StephanHeunis @eknahm

I think you may find answers (or at least good pointers) here:

It is hard to provide an exhaustive answer without more details being shared. Let me try:

A. git annex whereis and git annex list might be of help to see where you can find the specific files of interest.

B. Well, just mess around with them, nobody forbids you. So again, it is hard to understand what you mean. Maybe something about datalad run?

scp or rsync -L ?

Dear all,

so I think what Remi pointed to (9.2. Miscellaneous file system operations — The DataLad Handbook) can fix what went wrong with the datalad remove. So this we can fix.

Now the other two issues we are having are related to the file representation within DataLad and git-annex.

  1. I want to run an existing script we have for running fastsurfer with singularity on a BIDS dataset we downloaded. The command looks like this:

python recon-all_bids.py -s 718211 -ses 01 -fs_seg -recon_dir /staff/mganz/myelinproject/ds003653/derivatives/fastsurfer/ -bids_dir /staff/mganz/myelinproject/ds003653 -sif_image fastsurfer-gpu-v2.0.1.sif

Now since the file doesn’t actually exist, but is a symlink into the git-annex where the actual file resides, the binding doesn’t work:

singularity exec --cleanenv --nv --bind /staff/mganz/myelinproject/ds003653/derivatives/fastsurfer:/data/recon-all --bind /staff/mganz/myelinproject/ds003653/sub-718211/ses-01/anat:/data/t1 --bind /indirect/staff/mganz/myelinproject/code:/license_dir fastsurfer-gpu-v2.0.1.sif /fastsurfer/run_fastsurfer.sh --fs_license /license_dir/license --t1 /data/t1/sub-718211_ses-01_T1w.nii.gz --sid sub-718211_ses-01 --sd /data/recon-all --seg_only
ERROR: T1 image (/data/t1/sub-718211_ses-01_T1w.nii.gz) could not be found. Must supply an existing T1 input (full head) via --t1 (absolute path and name) for generating the segmentation.

The path is all correct, but singularity can’t resolve the symlink. Does this mean I can only run containers through datalad run? Can I never do it like this because of the binding?

  2. We just had a server change and that has brought about some issues for us. For example, just for testing we want to copy one or two subjects of a dataset into a student folder, since the new setup gives us loads of issues with them reading from common folders - don’t ask, we are in a big fight with our admin about this.
    But is there a way to “un-git-annex” a datalad dataset, meaning to stop tracking, pull the data out of the annex, and actually place it on disk, so we can give it to students to mess around with in their own folder before we run their code on the whole dataset?

Cheers, Mel

short: no

longer: datalad run or containers-run don’t do any real magic. They just first datalad get all the input files you specify and the container image, git annex unlock all the output files you specify (so they can be modified), and then singularity, by default, bind-mounts the top of the dataset, so that all relative annexed symlinks into .git/annex at the top of the dataset resolve just fine. As long as you ensure that happens, you can do everything manually.
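To illustrate those steps, here is a rough sketch of doing them by hand, reusing the paths from this thread (an assumption about your layout, not a tested recipe; adjust paths and arguments to your setup):

```shell
cd /staff/mganz/myelinproject/ds003653

# what `datalad run` would do first: fetch the input content
datalad get sub-718211/ses-01/anat/sub-718211_ses-01_T1w.nii.gz

# unlock intended outputs so the container can write into them
datalad unlock derivatives/fastsurfer/

# then invoke singularity yourself, binding the TOP of the dataset
# so relative symlinks into .git/annex resolve inside the container
singularity exec --cleanenv --nv \
    --bind /staff/mganz/myelinproject/ds003653:/data \
    fastsurfer-gpu-v2.0.1.sif ...
```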

short: yes

longer:

  1. I don’t know from which location you run, or what the top of the dataset for anat/ is, which might matter but probably doesn’t, since /staff/mganz/myelinproject/ds003653/ is likely the top
  2. you bind-mount a subfolder within the dataset, sub-718211/ses-01/anat, where files have symlinks pointing to ../../../.git/annex, as /data/t1. Within your container environment those symlinks under /data/t1 have no chance to reach that ../../../.git/annex, since it was not bind-mounted.
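Point 2 can be demonstrated without any container, using a plain copy of a subtree to stand in for a bind mount (made-up paths and file names):

```shell
set -e
tmp=$(mktemp -d)
cd "$tmp"

# mimic an annexed dataset: content under .git/annex, working-tree
# file is a relative symlink into it
mkdir -p ds/.git/annex/objects ds/sub-01/ses-01/anat
echo "T1 data" > ds/.git/annex/objects/KEY
ln -s ../../../.git/annex/objects/KEY ds/sub-01/ses-01/anat/T1w.nii.gz

# exposing only the anat/ subfolder (as --bind .../anat:/data/t1 does)
# leaves the symlink dangling: its ../../../ target is outside the subtree
cp -a ds/sub-01/ses-01/anat only-anat
test -e only-anat/T1w.nii.gz || echo "dangling: target escapes the subtree"

# exposing the dataset top (as --bind .../ds003653:/data does)
# keeps the relative link resolvable
cp -a ds whole-ds
test -e whole-ds/sub-01/ses-01/anat/T1w.nii.gz && echo "resolves"
```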

Fix: try something like

singularity exec --cleanenv --nv \
    --bind /staff/mganz/myelinproject/ds003653/:/data/  \
    --bind /indirect/staff/mganz/myelinproject/code:/license_dir \
    fastsurfer-gpu-v2.0.1.sif /fastsurfer/run_fastsurfer.sh \
    --fs_license /license_dir/license \
    --t1 /data/sub-718211/ses-01/anat/sub-718211_ses-01_T1w.nii.gz \
    --sid sub-718211_ses-01 --sd /data/derivatives/fastsurfer/recon-all --seg_only

or even shorter

singularity exec --cleanenv --nv \
    --bind /staff/mganz/myelinproject/:/data/ \
    fastsurfer-gpu-v2.0.1.sif /fastsurfer/run_fastsurfer.sh \
    --fs_license /data/license_dir/license \
    --t1 /data/ds003653/sub-718211/ses-01/anat/sub-718211_ses-01_T1w.nii.gz \
    --sid sub-718211_ses-01 --sd /data/ds003653/derivatives/fastsurfer/recon-all --seg_only

hints:

  • the fewer bind mounts you have, the fewer things there are to “debug”
  • bigger hint: organize more the YODA way, where the “derived” dataset sits “above” the rawdata/, and not the other way around where you populate derivatives/ within your BIDS dataset. Then you just bind-mount that top location of the dataset and have everything needed right under it. See e.g. GitHub - ReproNim/containers: Containers "distribution" for reproducible neuroimaging for more illustration and a prototypical workflow fulfilling the YODA principle in such cases.
  2. I would actually recommend giving them an actual DataLad dataset and just git rm the subjects you don’t want them to analyze. But if you really want them to not get a datalad dataset, and thus to lose track of what those students delete/add/change in that dataset while “working with it”, just cp -L or rsync -L the files you want, which would dereference symlinks and thus simply copy the files.
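As a small illustration of the cp -L route (made-up file names; rsync -L behaves the same way for this purpose):

```shell
set -e
tmp=$(mktemp -d)
cd "$tmp"

# stand-in for an annexed file: a relative symlink to the content
mkdir -p ds/.git/annex/objects students
echo "content" > ds/.git/annex/objects/KEY
ln -s .git/annex/objects/KEY ds/file.nii.gz

# -L dereferences symlinks, so the students get a regular, editable copy
cp -L ds/file.nii.gz students/
# rsync -L ds/file.nii.gz students/   # equivalent with rsync

# the copy is a real file, not a link
test ! -L students/file.nii.gz && test -f students/file.nii.gz && echo "plain file"
```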

If you want to export a dataset (i.e., strip version control), the export-archive command may be a solution.

Your “fastsurfer does not see the file” issue can be solved with an explicit unlock or by using adjusted branches (read more at 3.1. Data safety — The DataLad Handbook). The former can be performed by run or containers-run when declaring a path as output, and the latter also works with both commands, but doesn’t need this trick.


Thanks, Yarik! I will try it like this!

Thanks Michael, then we can consider doing this in case we want to strip a dataset and just hand out files directly.