Hi there,
I’m very excited about Datalad and am curious to know how you would set up a workflow that has to deal with a growing and at times updating input dataset.
To be specific, I run climate simulations that produce several output files on a native grid, which I then convert to a different grid and post-process further. Because the simulation runs for a long time, I want to start my workflow while new input files are still being produced. Later, I want to simply re-run the workflow to process the input files that are new or have been updated since.
The general/simplified setup
- an input directory with files of patterns variableA.dat, variableB.dat and variableC.dat
- a function that converts each individual file, e.g. variableA_atTimeX.dat to variableA_atTimeX_converted.dat (there are no dependencies between input files)
- all output files are collected in one output folder
- further files of any pattern are added continuously
Trial
I could use something similar to the example code snippet:
datalad run -m "convert all files" \
--input "*variable*.dat" \
--output "output_folder/" \
"convert {inputs} {outputs}"
However, this solution would
- expand *variable*.dat, so no new files will be regarded when re-running the workflow
- not handle the input files individually: if any file changes, all files are converted again
- or require the convert script to decide which files need to be processed, which means checking the hash of each input file individually
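To make the last point concrete, here is a rough sketch of what such a hash check could look like if done outside of DataLad. It assumes GNU `sha256sum` is available and that the file content is present locally; the manifest file `.converted.sha256` and the helper names `pending_inputs` and `mark_done` are made up for illustration and are not part of DataLad.

```shell
#!/bin/sh
# Sketch: track which inputs were already converted via a checksum manifest.
# MANIFEST is a hypothetical bookkeeping file, not something DataLad maintains.
MANIFEST="${MANIFEST:-.converted.sha256}"

# Print every input file whose current checksum is not recorded in the manifest,
# i.e. files that are new or whose content changed since they were marked done.
pending_inputs() (
    touch "$MANIFEST"
    for f in *variable*.dat; do
        [ -e "$f" ] || continue          # the pattern matched nothing
        line=$(sha256sum "$f")           # "<64 hex chars>  <name>"
        grep -qxF -- "$line" "$MANIFEST" || printf '%s\n' "$f"
    done
)

# Record the current checksum of one file after a successful conversion.
mark_done() (
    touch "$MANIFEST"
    tmp=$(mktemp)
    # Drop a stale entry for this file (the name starts at column 67),
    # then append the fresh checksum.
    awk -v n="$1" 'substr($0, 67) != n' "$MANIFEST" > "$tmp"
    sha256sum "$1" >> "$tmp"
    mv "$tmp" "$MANIFEST"
)
```

The convert script would then loop over the output of `pending_inputs` and call `mark_done` for each file it has processed. This duplicates bookkeeping that a version control tool already does, which is exactly why I would prefer not to maintain it myself.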
A possible solution?
I think, what I am looking for is something like the above example, but
- with no expansion of the wildcard, i.e. the pattern information persists for future re-runs, but
- at the same time still an expansion for the current run, and
- an additional flag for the datalad run command, like --vectorize, that would result in a map of the function on each input/output argument, e.g. convert input1 output1; convert input2 output2; …; convert inputN outputN.
Instead of the whole list of input/output files, each input-output pair is given individually to the convert function.
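The per-pair mapping described above can also be approximated by hand with a plain shell loop that issues one `datalad run` per input file. This is only a sketch under a few assumptions: the helper names `output_for` and `convert_each` are hypothetical, the output naming follows the `_converted.dat` rule from the setup, and new input files have already been saved to the dataset (e.g. with `datalad save`), since `datalad run` expects a clean dataset by default.

```shell
#!/bin/sh
# Sketch: emulate a per-file mapping with one `datalad run` call per input.
# DATALAD can be overridden (e.g. DATALAD=echo) to preview the commands.
DATALAD="${DATALAD:-datalad}"

# Hypothetical naming rule taken from the setup:
# variableA_atTimeX.dat -> output_folder/variableA_atTimeX_converted.dat
output_for() {
    printf 'output_folder/%s_converted.dat\n' "$(basename "$1" .dat)"
}

# Run the conversion once per input, giving each pair its own run record.
convert_each() (
    for f in *variable*.dat; do
        [ -e "$f" ] || continue          # the pattern matched nothing
        out=$(output_for "$f")
        # Skip inputs that already have an output (also as an annexed symlink).
        # This catches new files only, not updated ones.
        if [ -e "$out" ] || [ -L "$out" ]; then
            continue
        fi
        "$DATALAD" run -m "convert $f" \
            --input "$f" \
            --output "$out" \
            "convert {inputs} {outputs}" || exit 1
    done
)
```

Calling `convert_each` again later picks up files that arrived in the meantime. What it does not cover is the case of updated input files, and the wildcard still lives in the script rather than in the run record, which is why a built-in option would be nicer.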
I’m a complete Datalad beginner, so maybe this functionality already exists or cannot be implemented or does not make sense.
Anyway, I’m interested to see your suggestions or questions if I haven’t been clear enough.
Cheers,
Hauke