If I want to run a procedure over a ton of data and can process it piecewise (sequentially or in parallel), can `datalad` be told when to get and drop data during the run?
I assume that normally `datalad run` will run `datalad get` prior to running any code. If `--jobs` is specified, the getting is parallelized, but not in a delayed way: all the data has to be gotten before the scripts are run.
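For concreteness, here is roughly what that looks like through the Python API today (a sketch; the dataset path, the file list, and `process` are placeholders I made up, though `datalad.api.get`/`drop` and the `jobs` parameter are real):

```python
import datalad.api as dl

# Placeholders -- stand-ins for a real dataset and its annexed files.
ds_path = "/data/my-dataset"
inputs = ["sub-01/func.nii.gz", "sub-02/func.nii.gz"]

# All content is fetched up front (in parallel with jobs=...),
# so peak disk usage is the full input set.
dl.get(path=inputs, dataset=ds_path, jobs=4)

for f in inputs:
    process(f)  # hypothetical per-file processing step

# Content can only be dropped once everything has run.
dl.drop(path=inputs, dataset=ds_path)
```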
It would be awesome if data could be gotten “just-in-time”, so to speak: when the script accesses the pointer to the data, then the data is gotten. Assuming this kind of magic is hard, maybe there could be a way to import a datalad package in e.g. Python that comes with its own run-time methods?
What if, for example, I could call `with datalad.get(datalad.run.imports[0]) as data` (Python pseudo-code) to get the data, such that when the scope of the `with` expires, this `datalad.get` method `datalad drop`s the data? (Here `datalad.run.imports` might be a list of the files specified as imports on the command line.) This would be a nice way to take advantage of Python's context-manager/on-exit capabilities to manually specify the lifetime of imports so they may be dropped to save disk space.
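In the meantime, something with those semantics can be hand-rolled on top of the existing Python API with `contextlib` (a sketch, not an actual DataLad feature; `datalad.api.get`/`drop` are real, but `fetched`, `imports`, and `process` are made up here):

```python
import contextlib
import datalad.api as dl

@contextlib.contextmanager
def fetched(path, dataset=None):
    """Get annexed content on entry, drop it again on exit.

    A hand-rolled stand-in for the hypothetical
    `with datalad.get(...) as data` above.
    """
    dl.get(path=path, dataset=dataset)
    try:
        yield path
    finally:
        # Reclaim disk space as soon as the with-block is left.
        dl.drop(path=path, dataset=dataset)

# 'imports' plays the role of the hypothetical datalad.run.imports:
# a plain list of the input files named on the command line.
imports = ["sub-01/func.nii.gz", "sub-02/func.nii.gz"]

for item in imports:
    with fetched(item, dataset="/data/my-dataset") as data:
        process(data)  # hypothetical per-file processing step
```

Peak disk usage then stays at one import at a time instead of the whole input set.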