Datalad workflow for a growing number of input files

Hi there,
I’m very excited about Datalad and curious how you would set up a workflow that has to deal with a growing, and occasionally updated, input dataset.
To be specific, I run climate simulations that produce several output files on a native grid, which I then convert to a different grid and post-process further. Because the simulation runs for a long time, I want to start my workflow while new input files are still being produced. Later on, I want to simply re-run the workflow to process the remaining unprocessed or updated input files.

The general/simplified setup

  • an input directory with files of patterns variableA.dat, variableB.dat and variableC.dat
  • a function that converts each individual file, e.g. variableA_atTimeX.dat to variableA_atTimeX_converted.dat (there are no dependencies between input files)
  • all output files are collected in one output folder
  • further files of any pattern are added continuously

Trial
I could use something like the following code snippet:

datalad run -m "convert all files" \
--input "*variable*.dat" \
--output "output_folder/" \
"convert {inputs} {outputs}"

However, this solution has some drawbacks:

  • *variable*.dat is expanded at run time, so no new files are considered when re-running the workflow
  • the input files are not handled individually: if any file changes, all files are converted again
  • alternatively, the convert script itself would have to decide which files need to be processed and would need to check the hash of each input file individually

A possible solution?
I think what I am looking for is something like the above example, but

  • with no expansion of the wildcard, i.e. the pattern information persists for future re-runs, but
  • at the same time still an expansion for the current run, and
  • an additional flag for the datalad run command, something like --vectorize, that would map the function over each input/output pair, e.g. convert input1 output1; convert input2 output2; … convert inputN outputN (see the rough sketch after this list).
    Instead of the whole list of input/output files, each input/output pair would be passed individually to the convert function.
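
For now, I imagine I could emulate this with a plain shell loop around datalad run, roughly like the sketch below (the output naming follows my example above, and the loop does not yet skip files that were already converted and have not changed):

for f in *variable*.dat; do
    out="output_folder/$(basename "${f%.dat}")_converted.dat"
    datalad run -m "convert $f" \
    --input "$f" \
    --output "$out" \
    "convert {inputs} {outputs}"
done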

I’m a complete Datalad beginner, so maybe this functionality already exists, cannot be implemented, or simply does not make sense.

Anyway, I’m interested to see your suggestions or questions if I haven’t been clear enough.

Cheers,
Hauke

I think the best way would be to have one clone of the dataset be the one where input files are “collected”, let’s call it “incoming”. Then you could have a clone, or perhaps even better a “derived” dataset that follows the YODA principles (have a peek at the myyoda/myyoda repository on GitHub, or the handbook has more), let’s call it “derived”, which would simply have the incoming dataset as a subdataset (you can either benefit from CoW on your file system if you have that, or install it in some “reckless” mode). This way you would always have clarity on which state of “incoming” you performed the conversion on in your “derived” dataset.
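
A rough sketch of that layout, assuming the incoming dataset lives at some local <path/to/incoming> (placeholder) and that your DataLad version supports the yoda procedure and --reckless cloning:

# create the derived analysis dataset following YODA
datalad create -c yoda derived
cd derived
# register "incoming" as a subdataset; --reckless ephemeral avoids duplicating annexed content
datalad clone -d . --reckless ephemeral <path/to/incoming> inputs/incoming
# conversions then operate on a well-defined state of inputs/incoming
datalad run -m "convert current snapshot of incoming" \
--input "inputs/incoming/*variable*.dat" \
--output "outputs/" \
"convert {inputs} {outputs}"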

If you want to populate the same dataset, still just create a clone (again with CoW or reckless mode) and do the conversions in it, while periodically git pull-ing (or datalad update --how=merge) from the original location, thus “merging in” new incoming files. Again, it would be clear which state you performed the conversion on, while allowing incoming files to keep flowing in.
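
In commands, a sketch with placeholder paths (the conversion call is just the one from your example, and --reckless ephemeral is only an option when the source is local):

# one-time setup: a working clone next to the original "collection" location
datalad clone --reckless ephemeral /path/to/incoming /path/to/conversion
cd /path/to/conversion

# repeat whenever new files have arrived in the original location
datalad update --how=merge
datalad run -m "convert newly arrived files" \
--input "*variable*.dat" \
--output "output_folder/" \
"convert {inputs} {outputs}"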

This is somewhat in line with how datalad crawler operates whenever there is a need to extract some archives: there is an incoming branch – the original stuff from the web, processed – the extracted archives, and master – the extracted content merged with possible other manual tune-ups. Examples: any of the good old OpenfMRI datasets in the DataLad Repository, IIRC.
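
If you ever wanted to mimic that branch layout by hand, the core of it is ordinary git branching (just a sketch of the idea, not what the crawler literally runs):

git checkout incoming      # branch that only receives the original/raw files
# ... add new raw files, datalad save ...
git checkout master
git merge incoming         # bring the new raw files into the processing branch
# ... run the conversions on master (e.g. via datalad run) and save ...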

Asking questions about datalad run and establishing automated workflows graduates you right away from the “beginners’ bench” :wink: