I have a superdataset in which one subdataset (sourcedata) is slowing things down tremendously (datalad save, datalad status, etc.) because it contains a large number of (classic) DICOM files, all of which are annexed.
I tried adding the following to .gitignore in my superdataset, to ask git(annex)/datalad to ignore the entire subdataset:
sourcedata/**
In addition, I added the following to .gitignore in the subdataset, to ask git(annex)/datalad to ignore the content of the directories containing the DICOMs in each subject dir:
**/DICOM
Things still seem slow, however. In hindsight, the best solution may be to not have any version control on the sourcedata subdataset at all (these DICOMs do not change anyway; they are just copied into the dir once): turn it into a plain subdirectory and keep it in the .gitignore of the superdataset, which works fine for other subdirs I have.
What would be the best way of doing this? I could not find this in the handbook. It is no problem to lose history, so maybe I could just get rid of the .git and .datalad inside the folder, and remove the submodule/subdataset entry in .gitmodules in the superdataset? Or would I need to unannex everything first?
Thanks a ton for the help - for future datasets, I will make sourcedata a subdir rather than subdataset and have it gitignored from the start
Since it is a subdataset, could you just uninstall it for regular operations (if you also have it securely in another location)? Or do you need it to be present?
NB: paths which are already known to git are not ignored (by git either), even if they are also listed in .gitignore, so some other explicit mechanism would have to be introduced to ignore "committed" paths.
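To illustrate the point about already-tracked paths, here is a minimal plain-git sketch (no DataLad or git-annex involved; the repository layout and file names are made up for the demo). It shows that a path committed before the ignore rule was added still shows up as modified, and that the rule only takes effect once the path is removed from the index. A subdataset is registered as a submodule rather than as ordinary files, so treat this as an analogy, not as the exact behaviour you will see with sourcedata.

```sh
#!/bin/sh
set -eu

# Throwaway repository; all names below are hypothetical.
repo="$(mktemp -d "${TMPDIR:-/tmp}/ignore-demo-XXXXXXX")"
cd "$repo"
git init -q .
git config user.email "demo@example.com"
git config user.name "Demo"

# Commit a file first ...
mkdir sourcedata
echo "v1" > sourcedata/tracked.txt
git add sourcedata/tracked.txt
git commit -q -m "Track a file"

# ... and only afterwards add the ignore rule.
echo "sourcedata/**" > .gitignore
git add .gitignore
git commit -q -m "Ignore sourcedata"

echo "v2" > sourcedata/tracked.txt     # change a tracked file
echo "new" > sourcedata/untracked.txt  # add a never-tracked file

# The tracked file is still reported; only the new file is ignored.
status_before="$(git status --porcelain)"
echo "before: $status_before"

# Remove the path from the index (files stay on disk); now the rule applies.
git rm -q -r --cached sourcedata
git commit -q -m "Stop tracking sourcedata"
status_after="$(git status --porcelain)"
echo "after: $status_after"
```

The files remain on disk after `git rm --cached`; only git's record of them is dropped.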
I would like to keep it present if possible, but do have the sourcedata securely in another place, so I could indeed uninstall and copy it over again as a subdirectory if that is the easiest workaround. For the purpose of uninstalling, would you recommend datalad uninstall or datalad remove in this case?
Is there any way to add such an explicit notion to .gitignore (or somewhere else) which would result in committed paths being ignored?
Just to make sure: should I run the following from my superdataset?
datalad drop -d sourcedata --what datasets
I guess I will also need --reckless availability since I do not have any remote sources for this particular subdataset (contrary to the superdataset and all the others)? And do I need -r as I want to drop a subdataset? Or will this drop all subdatasets?
Or should I rather use --what filecontent or --what allkeys to only empty the annex, after which .gitignore may work?
Finally, I want to create a copy of the subdataset first outside the superdataset boundaries, without symlinks, git history, etc - what would be the best way of doing this?
Try it out in a throwaway demo script first, to make sure it all works and does what you want it to do. E.g. here is one I quickly sketched:
#!/bin/bash
set -ex
cd "$(mktemp -d "${TMPDIR:-/tmp}/dl-XXXXXXX")"
datalad create super
cd super
datalad create -d . subds
echo "precious" >| subds/data.dat
datalad save -m "Super with data in subdataset" -r -d .
# Now we decided to move subds out
cd subds
datalad create-sibling -s origin ../../subds-origin
# I guess we could just do this right away:
# datalad push --to origin
# but let's move the annexed content explicitly instead
git annex move --to=origin
# declare this location dead, as datalad drop recommended
git annex dead here
datalad push --to origin
cd ..
datalad drop -d . --what datasets subds
echo "See what we have"
datalad status
cd ..
tree
which IMHO goes through your use case and results in:
+ echo 'See what we have'
See what we have
+ datalad status
nothing to save, working tree clean
+ cd ..
+ tree
.
├── subds-origin
│   └── data.dat -> .git/annex/objects/jF/3W/MD5E-s9--4ee9c5beec01edf6b964ab48999c0c58.dat/MD5E-s9--4ee9c5beec01edf6b964ab48999c0c58.dat
└── super
    └── subds

4 directories, 1 file
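On the earlier question about making a copy outside the superdataset without symlinks or git history: one possible approach, sketched here under assumptions, is to stream the tree through tar with symlink dereferencing while excluding .git and .datalad. The sketch does not use a real dataset; it fakes the annex layout with a plain symlink (the paths sub-01/DICOM/img.dcm and KEY are made up), and it assumes a tar that supports --exclude and -h (GNU tar and bsdtar do). With a real subdataset, all annexed content would need to be present locally first, since a dangling symlink has nothing to dereference.

```sh
#!/bin/sh
set -eu

work="$(mktemp -d "${TMPDIR:-/tmp}/copy-demo-XXXXXXX")"
src="$work/sourcedata"        # stand-in for the subdataset
dst="$work/sourcedata-plain"  # plain copy to be created

# Fake an annexed file: a symlink pointing into .git/annex/objects.
mkdir -p "$src/.git/annex/objects" "$src/.datalad" "$src/sub-01/DICOM" "$dst"
echo "precious" > "$src/.git/annex/objects/KEY"
ln -s ../../.git/annex/objects/KEY "$src/sub-01/DICOM/img.dcm"

# -h follows symlinks, so file content is copied instead of the link;
# the .git and .datalad directories themselves are left out of the copy.
tar -C "$src" --exclude=.git --exclude=.datalad -chf - . | tar -C "$dst" -xf -

ls -lR "$dst"
```

Checking the result with something like `find "$dst" -type l` (which should print nothing) before deleting anything from the original seems prudent.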