Turning subdataset into subdirectory that is gitignored

lukasvo76 · March 1, 2023, 8:37am

Hi Datalad Team,

I have a superdataset in which one subdataset (sourcedata) is slowing things down tremendously (datalad save, datalad status, etc.) because of a large number of (classic) DICOM files, all of which are annexed.

I tried adding the following to .gitignore in my superdataset, to ask git(annex)/datalad to ignore the entire subdataset

sourcedata/**

In addition, I added the following to .gitignore in the subdataset, to ask git(annex)/datalad to ignore the content of the directories containing the DICOMs in each subject dir

**/DICOM

Things seem to remain slow, however. In hindsight, the best solution may be to not have any version control on the sourcedata subdataset (these DICOMs do not change anyway, they are just copied once into the dir), by turning it into a subdirectory and keeping it in the .gitignore of the superdataset (which works fine for other subdirs I have).

What would be the best way of doing this? I could not find this in the handbook. It is no problem to lose history, so maybe just get rid of the .git and .datalad inside the folder, and remove the submodule/subdataset from .gitmodules in the superdataset? Or would I need to unannex everything first?

Thanks a ton for the help - for future datasets, I will make sourcedata a subdir rather than subdataset and have it gitignored from the start

Cheers,

Lukas

yarikoptic · March 1, 2023, 2:59pm

since it is a subdataset, could you just uninstall it (if you have it also securely in another location) for regular operations? or do you need it to be present?

NB paths which are already known to git are not ignored (by git either), even if they are also listed in .gitignore, so some other explicit notion would have to be introduced to ignore “committed” paths.
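
The point about already tracked paths can be reproduced with plain git. The following is only a minimal sketch in a throwaway repository; the file names (scan.dcm, other.dcm) are made up for illustration and DataLad is not involved at all.

```shell
#!/bin/bash
set -eu

# Throwaway repository; all names below are hypothetical.
repo="$(mktemp -d)"
cd "$repo"
git init -q .
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

# Commit a file first, so that git knows about it
mkdir sourcedata
echo "v1" > sourcedata/scan.dcm
git add sourcedata/scan.dcm
git commit -q -m "Track a file"

# Only afterwards add the ignore rule
echo "sourcedata/**" > .gitignore
git add .gitignore
git commit -q -m "Ignore sourcedata after the fact"

echo "v2" > sourcedata/scan.dcm      # already tracked: still reported
echo "new" > sourcedata/other.dcm    # never tracked: ignored

# Expected to list only the modified tracked file
status_out="$(git status --porcelain)"
echo "$status_out"
```

The ignore rule only hides other.dcm, which was never committed; the change to the committed scan.dcm still shows up in the status.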

lukasvo76 · March 1, 2023, 3:44pm

Thanks Yarik!

I would like to keep it present if possible, but do have the sourcedata securely in another place, so I could indeed uninstall and copy it over again as a subdirectory if that is the easiest workaround. For the purpose of uninstalling, would you recommend datalad uninstall or datalad remove in this case?

Is there any way to add such an explicit notion to .gitignore (or somewhere else) that would result in committed paths being ignored?

Cheers again,

Lukas

yarikoptic · March 1, 2023, 11:22pm

not remove for sure - that one would be analogous to git rm. Just use datalad uninstall or datalad drop --what datasets
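
To illustrate the difference, here is a rough plain-git analogy, not DataLad's actual implementation: git submodule deinit empties the submodule's working tree but keeps it registered in the superproject, whereas git rm would drop the registration itself. The repository names (super, sub) are hypothetical, and protocol.file.allow=always is only set because recent git versions refuse local-path submodule clones by default.

```shell
#!/bin/bash
set -eu

work="$(mktemp -d)"
cd "$work"
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

# A small repository that will serve as the submodule source
git init -q sub
( cd sub && echo "data" > data.dat && git add data.dat && git commit -q -m "sub content" )

# Superproject that registers it as a submodule
git init -q super
cd super
git -c protocol.file.allow=always submodule --quiet add "$work/sub" sub
git commit -q -m "Register sub"

# Empty the submodule's working tree; .gitmodules and the
# gitlink entry in the superproject stay untouched
git submodule --quiet deinit -f sub

git status --porcelain   # prints nothing if the superproject is clean
git ls-files -s sub      # the gitlink (mode 160000) is still recorded
```

After this, the sub directory is empty but the superproject still records which commit of it belongs there, so it could be populated again later.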

I am not aware of one. Share if you find anything related, e.g. explicit annotation for a submodule to not be investigated for changes.

lukasvo76 · March 2, 2023, 7:47am

Thanks Yarik!

Just to make sure, shall I run the following from my superdataset

datalad drop -d sourcedata --what datasets

I guess I will also need --reckless availability since I do not have any remote sources for this particular subdataset (contrary to the superdataset and all the others)? And do I need -r as I want to drop a subdataset? Or will this drop all subdatasets?

Or should I rather use --what filecontent or --what allkeys to only empty the annex, after which .gitignore may work?

Finally, I want to create a copy of the subdataset first outside the superdataset boundaries, without symlinks, git history, etc - what would be the best way of doing this?

Cheers,

Lukas
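
For the last question (a copy without symlinks or git history), one possible sketch uses tar to follow symlinks while skipping the repository metadata. It assumes that all annexed content is present locally (otherwise the symlinks are dangling) and that the tar in use supports -h and --exclude. The directory layout below is a made-up stand-in for an annexed dataset, not a real one.

```shell
#!/bin/bash
set -eu

work="$(mktemp -d)"
src="$work/sourcedata"         # hypothetical stand-in for the subdataset
dst="$work/sourcedata-plain"   # hypothetical target outside the superdataset

# Fake an annex-like layout: content lives under .git/annex/objects,
# the working tree only holds a symlink to it
mkdir -p "$src/.git/annex/objects" "$src/.datalad" "$src/sub-01/DICOM"
echo "pixel data" > "$src/.git/annex/objects/KEY1"
ln -s ../../.git/annex/objects/KEY1 "$src/sub-01/DICOM/0001.dcm"

# -h makes tar archive the files that symlinks point to;
# .git and .datalad are left out of the copy
mkdir -p "$dst"
( cd "$src" && tar -chf - --exclude=.git --exclude=.datalad . ) \
    | ( cd "$dst" && tar -xf - )

ls -lR "$dst"
```

The copy should then contain regular files only. Files copied out of a real annex may come out read-only, so a chmod could be needed before editing them.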

lukasvo76 · March 20, 2023, 9:37am

Hi Yarik,

Could you please have a look at my previous post?

Just wanted to confirm whether what I wrote there is correct before trying things!

Thanks a lot,

Lukas

yarikoptic · March 20, 2023, 2:05pm

I would recommend to

- avoid doing reckless things whenever possible
- try it out in some throwaway demo script first, to see how/if it all would work and to make sure that it does what you want it to do. E.g. here is the one I quickly sketched:

#!/bin/bash
set -ex

cd "$(mktemp -d ${TMPDIR:-/tmp}/dl-XXXXXXX)"

datalad create super
cd super
datalad create  -d . subds

echo "precious" >| subds/data.dat
datalad save -m "Super with data in subdataset" -r -d .

# Now we decided to move subds out
cd subds
datalad create-sibling -s origin ../../subds-origin
#  could be I guess this right away
# datalad push --to origin
#  but let's just move 
git annex move --to=origin
# declare dead here as datalad drop recommended
git annex dead here
datalad push --to origin

cd ..
datalad drop -d . --what datasets subds


echo "See what we have"
datalad status
cd ..
tree

which IMHO walks through your use case and results in

+ echo 'See what we have'
See what we have
+ datalad status
nothing to save, working tree clean
+ cd ..
+ tree
.
├── subds-origin
│   └── data.dat -> .git/annex/objects/jF/3W/MD5E-s9--4ee9c5beec01edf6b964ab48999c0c58.dat/MD5E-s9--4ee9c5beec01edf6b964ab48999c0c58.dat
└── super
    └── subds

4 directories, 1 file

lukasvo76 · March 20, 2023, 2:21pm

Thanks a lot Yarik, will look into it!

Cheers,

Lukas

eknahm · April 18, 2023, 7:29am

Here is a related bit of info on dropping subdatasets: KBI0005, "Drop a subdataset to speed up superdataset operations", in the PsyInf Knowledge Base documentation.

lukasvo76 · April 18, 2023, 7:40am

Thanks a lot, this is probably the most elegant solution indeed!

Lukas