If I want to run a procedure over a ton of data and can process it piecewise (sequentially or in parallel), can `datalad` be told when to get and drop data during the run?
I assume that normally `datalad run` will run `datalad get` prior to running any code. If `--jobs` is specified, the getting is parallelized, but not in a delayed way: all the data has to be gotten before the scripts are run.
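For concreteness, here is roughly what that looks like through the Python API today (a sketch; the dataset path, the file list, and `process` are placeholders I made up, though `datalad.api.get`/`drop` and the `jobs` parameter are real):

```python
import datalad.api as dl

# Placeholders -- stand-ins for a real dataset and its annexed files.
ds_path = "/data/my-dataset"
inputs = ["sub-01/func.nii.gz", "sub-02/func.nii.gz"]

# All content is fetched up front (in parallel with jobs=...),
# so peak disk usage is the full input set.
dl.get(path=inputs, dataset=ds_path, jobs=4)

for f in inputs:
    process(f)  # hypothetical per-file processing step

# Content can only be dropped once everything has run.
dl.drop(path=inputs, dataset=ds_path)
```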
It would be awesome if data could be gotten “just-in-time”, so to speak: when the script accesses the pointer to the data, then the data is gotten. Assuming this kind of magic is hard, maybe there could be a way to import a datalad package in e.g. Python that comes with its own run-time methods?
What if, for example, I could call `with datalad.get(datalad.run.imports[0]) as data` (Python pseudo-code) to get the data, such that when the scope of the `with` expires, this `datalad.get` method `datalad drop`s the data? (Here `datalad.run.imports` might be a list of the files specified as imports on the command line.) This would be a nice way to take advantage of Python's context-manager/on-exit capabilities to manually specify the lifetime of imports so they may be dropped to save disk space.
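In the meantime, something with those semantics can be hand-rolled on top of the existing Python API with `contextlib` (a sketch, not an actual DataLad feature; `datalad.api.get`/`drop` are real, but `fetched`, `imports`, and `process` are made up here):

```python
import contextlib
import datalad.api as dl

@contextlib.contextmanager
def fetched(path, dataset=None):
    """Get annexed content on entry, drop it again on exit.

    A hand-rolled stand-in for the hypothetical
    `with datalad.get(...) as data` above.
    """
    dl.get(path=path, dataset=dataset)
    try:
        yield path
    finally:
        # Reclaim disk space as soon as the with-block is left.
        dl.drop(path=path, dataset=dataset)

# 'imports' plays the role of the hypothetical datalad.run.imports:
# a plain list of the input files named on the command line.
imports = ["sub-01/func.nii.gz", "sub-02/func.nii.gz"]

for item in imports:
    with fetched(item, dataset="/data/my-dataset") as data:
        process(data)  # hypothetical per-file processing step
```

Peak disk usage then stays at one import at a time instead of the whole input set.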