Datalad won't publish to Google Drive and is slow to save

Hi – I'm trying to incorporate datalad into my general lab procedures for data management. So far, it's been really nice for data sharing across labs and institutions, but I've run into a couple of issues (perhaps because I'm missing something):

  1. When I use datalad publish, my files do not go to the Google Drive remote that I set up and linked when I created the sibling on GitHub.
  2. Once I accumulate a lot of files (e.g., after running FMRIPREP on a dataset with 50 participants), the datalad save command becomes very slow. It doesn't seem to be copying anything to the remote, but it is creating the symlinks to the files in the annex. Even after 12 hours, it is not done.

Here are some of my commands:

# create dataset
mkdir srndna-public-test
cd srndna-public-test
datalad create --annex-version 7 --text-no-annex --description "SRNDNA public test data on Smith Lab Linux" --shared-access group

# create remote on Google Drive via Rclone
git annex initremote gdrive type=external externaltype=rclone target=dvs-temple prefix=srndna-public-test/annex chunk=50MiB encryption=none rclone_layout=lower
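
# A quick way to sanity-check the special remote before any publish attempt:
# push one already-annexed file to it and confirm git-annex records a copy there
# (the file path below is just a placeholder, not part of the original setup).
git annex copy sub-01/anat/sub-01_T1w.nii.gz --to gdrive
git annex whereis sub-01/anat/sub-01_T1w.nii.gz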

# create sibling and save setup
datalad create-sibling-github srndna-public-test -s DVS-Lab-GitHub --github-organization DVS-Lab --publish-depends gdrive
datalad save . -m "initial save" --version-tag "initialsetup" 

# convert to BIDS and save; this works OK
datalad run -m "heudiconv, defacing, and mriqc" "bash run_prepdata.sh"

# run FMRIPREP and FSL and then try to save again
datalad save -m "add preprocessing and level 1 stats"

That last save runs forever, and nothing ever appears to go to Google Drive. Sorry if I'm missing anything, and thanks for such a great tool!

Best,
David

Hi – just to follow up, maybe the slowness is related to the post below? Though I'm still confused, since it doesn't seem like files are being transferred to the remote annex (it is creating the symlinks, though).

Any ideas? If I'm reading this GitHub issue correctly (https://github.com/datalad/datalad/issues/3869), it may not be unexpected for the datalad save to take twice as long as it took to generate the data?
(Linked post: Creating datalad dataset with existing directories)

Also, with neuroimaging data, would it be best to make a subdataset for each subject? It seems like that was recommended in this thread for the HCP data? And maybe it is the default approach in heudiconv when the --datalad option is enabled?

Maybe yarikoptic and/or eknahm would know best here? Thanks for any advice! We're very excited to start making datalad a regular part of our workflow.

Best,
David

@dvsmith FYI I have switched to one dataset per subject to decrease the time overhead and other headaches when working on individual subjects. The problem is with some pipelines, like fmriprep, which scatter subject files across separate places. I also wish I had done the same with my raw data, which is all lumped into one dataset. My suggestion for @yarikoptic et al. is to create a command that takes a path, removes it from an existing dataset, and makes it a subdataset. That way it would be easy to move to the desired structure without needing a careful plan from the beginning.


Thanks, and sorry for the slow reply! I'd be curious to hear what @yarikoptic or @eknahm think (I mistakenly thought I had tagged them initially).

I'm going to try to go with the one-dataset-per-subject approach, but I agree this can get a little messy with some outputs like FMRIPREP and MRIQC. I'll try to play around with this soon, since it seems like it would be the biggest timesaver for these data; it also looks relatively easy to implement and track in datalad, unlike plain git.

Re publish:

See datalad publish --help

--transfer-data {auto|none|all}
                    ADDME. Constraints: value must be one of ('auto',
                    'none', 'all') [Default: 'auto']

where just now I noticed the ADDME :wink: So, with the default auto, it relies on the "wanted" setting for the remote. You could use

git annex wanted gdrive 'not metadata=distribution-restrictions=*';

to make datalad publish always transfer all data to the gdrive remote, unless files have the git-annex metadata key distribution-restrictions set to some value. I use that to annotate sensitive or non-redistributable data, e.g.:

/tmp > datalad install ///labs/haxby/raiders                 
install(ok): /tmp/raiders (dataset)                                                                                                                                                                                                
(dev3) 
/tmp > git -C raiders annex metadata stimuli | head
metadata stimuli/task002/orig/INDIANA_JONES_RAIDERS_LOST_ARK_part_1.m4v 
  distribution-restrictions=proprietary
  distribution-restrictions-lastchanged=2016-09-26@18-14-28
  lastchanged=2016-09-26@18-14-28
ok
metadata stimuli/task002/orig/INDIANA_JONES_RAIDERS_LOST_ARK_part_2.m4v 
  distribution-restrictions=proprietary
  distribution-restrictions-lastchanged=2016-09-26@18-14-28
  lastchanged=2016-09-26@18-14-28
ok
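
A minimal sketch of how such metadata gets attached in the first place, using the same file and value as in the listing above:

git annex metadata --set distribution-restrictions=proprietary stimuli/task002/orig/INDIANA_JONES_RAIDERS_LOST_ARK_part_1.m4v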

Re slow save:

  • What version of datalad? If up-to-date-ish (0.12.2), the slowness is probably largely because of --annex-version 7 and files going through smudge filtering; things might be faster if you stay with the older version 5. What busy processes do you see in top?

  • Isn't there a progress bar reporting an ETA etc.? Related: https://github.com/datalad/datalad/issues/4129 :wink:

  • I would also recommend not piling everything (heudiconv, defacing, mriqc) into a single command/dataset. If you run heudiconv with --datalad, you get per-conversion commits. I would have placed mriqc into a separate subdataset; see the sketch after this list.
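
A minimal sketch of that kind of split, run from within the superdataset (the sourcedata/derivatives layout and subject labels here are just illustrative):

# register one subdataset per subject for the raw data, and one per derivative pipeline
datalad create -d . sourcedata/sub-01
datalad create -d . sourcedata/sub-02
datalad create -d . derivatives/mriqc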

BTW, @eknahm et al. have prepared the full collection of HCP dataset(s), if you need that data: see https://github.com/datalad-datasets/human-connectome-project-openaccess
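
For example, something along these lines should fetch that superdataset (actual file content then comes via datalad get, subject to the access setup described in that repository):

datalad install https://github.com/datalad-datasets/human-connectome-project-openaccess.git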

Thanks! Very helpful! If I recall correctly, the main process that I saw running during the long save was git. I'll update to the latest version of datalad and start creating subdatasets. It sounds like I should use the older annex repository version (5) when creating the initial superdataset?

datalad create --annex-version 5 --text-no-annex --description "SRNDNA public test data on Smith Lab Linux" --shared-access group

Thanks for the tip regarding the HCP data! We are using that for other projects, and the datalad integration would make our lives much easier.

Best,
David

Maybe you would also need to follow this:

From https://git-annex.branchable.com/upgrades/: "To prevent automatic upgrades in a repository, run: git config annex.autoupgraderepository false". You might even just add --global to that for now on that box :wink:
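
That is, per repository or, with --global, for the whole box:

git config annex.autoupgraderepository false
git config --global annex.autoupgraderepository false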

--text-no-annex suggests that you are still using a 0.11.x version. You might want to upgrade to our "flagship" 0.12.x series now (there that option is gone; use -c text2git instead). Some things could get faster, some slower, but overall it is more correct etc. :wink:
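
For example, a rough 0.12-style equivalent of the earlier create call might look like this (worth double-checking the exact options against datalad create --help on your installed version):

datalad create -c text2git --description "SRNDNA public test data on Smith Lab Linux" srndna-public-test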

Thanks! Will do! Looking forward to working more with datalad! Hopefully this Google Drive point will be obsolete once we have our data on OpenNeuro, but that’s a conversation for another day.