Can the "datalad run" command of a python script automatically save input and output files?

I’m currently wrapping my head around DataLad and all its impressive functionality.
I have the following question:
If I use datalad run for running Python scripts, is there a way to automatically track the input and output files?
That is, is there a possibility to have DataLad track the input and output files used inside a Python script that I run via datalad run, without explicitly appending them to the datalad run command?
In the DataLad Handbook there is one example using input and output files with Python, but the workflow does not seem intuitive to me.
That first script uses (hardcoded) inputs inside the Python code AND the datalad run command passes the same inputs to the script, where they seem to be ignored. To me this rather seems like a workaround, as the input the code actually uses is in fact not tracked by DataLad.
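For reference, the explicit pattern I mean looks roughly like this via the Python API (just a sketch; the script name and paths are made up):

import datalad.api as dl

dl.run(
    cmd="python code/script.py",      # the analysis script
    inputs=["data/input.csv"],        # content to get before running
    outputs=["results/output.csv"],   # files to unlock before / save after
    message="run analysis",
)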
Another similar approach is given here, in the “Find-out-more: Write your own procedures”.
However, there I would also have to append the inputs to the datalad run command by hand, or write another script that runs the Python script and appends the inputs to the command.
Is there a more intuitive way to datalad run a script and track which files are used inside it?

Thanks in advance :slight_smile:

not at the moment. It would require implementing some IO tracking that is transparent to the user, to establish/discover what is used as input or output, and to immediately react to it (get the input content, unlock outputs). Something could be bolted on top, though, e.g. a FUSE-based approach such as the datalad-fuse extension, or monkey-patching Python IO as in the AutomagicIO prototype.

Either of those would mean “slowing down the process to do that extra tracking”. Although in the case of datalad-fuse it might actually provide an efficiency angle, since it might avoid fetching large files in full (so it might be worth considering). And an AutomagicIO-based approach would be “user-efficient” in sparing the user from specifying inputs/outputs; BUT it would never be “functionally complete”, since it relies on monkey-patching open calls at the Python level and is thus guaranteed to miss some IO.
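To illustrate the idea and its limitation, a minimal sketch of such monkey-patching (not the actual AutomagicIO code):

import builtins
import datalad.api as dl

_original_open = builtins.open
_fetching = False  # crude guard against recursing through datalad's own IO

def _datalad_open(file, mode="r", *args, **kwargs):
    global _fetching
    # For reads, try to fetch annexed content first. Anything that does not
    # go through Python's open() (C extensions, subprocesses, raw file
    # descriptors, ...) bypasses this wrapper entirely, hence "never
    # functionally complete".
    if "r" in mode and not _fetching:
        _fetching = True
        try:
            dl.get(file)
        except Exception:
            pass  # not in a dataset, content already present, etc.
        finally:
            _fetching = False
    return _original_open(file, mode, *args, **kwargs)

builtins.open = _datalad_open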

correct, I also think that the example is a little “suboptimal” in that there is a hardcoded path in the script; an ideal example would have taken that file path from the command line. I guess the idea of the authors could have been to keep it really simple and to demonstrate that you don’t need to modify original scripts for them to be used with datalad run. But since that section describes the YODA principles (not just an intro to datalad run), I think providing a more “kosher” example would be better. Since you discovered this (so I don’t steal the honor): would you be interested in submitting a PR against the handbook to make that file path a positional argument to the script?
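E.g. something along these lines (a sketch, not the actual handbook script):

import argparse

parser = argparse.ArgumentParser(description="toy analysis script")
parser.add_argument("infile", help="input file")
parser.add_argument("outfile", help="output file")
args = parser.parse_args()

with open(args.infile) as f:
    data = f.read()

with open(args.outfile, "w") as f:
    f.write(data.upper())  # placeholder "analysis"

so that the very same paths given to datalad run -i/-o are the ones the script actually reads and writes.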

sorry – you lost me here a bit. Maybe it is just a confusion about the relationship between datalad run and datalad run-procedure, of which there is none! To datalad run you can give any command invocation, and it is up to that command’s interface to define its arguments. run-procedure has no relationship to datalad run; it is for running “DataLad procedures”, which require the first positional argument to be a dataset. The rest of their arguments is not “controlled” by DataLad AFAIK.
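In Python API terms, roughly (the procedure name here is one of the built-in ones):

import datalad.api as dl

# datalad run: any command invocation; you declare inputs/outputs yourself
dl.run(
    cmd="python code/script.py data/in.txt results/out.txt",
    inputs=["data/in.txt"],
    outputs=["results/out.txt"],
)

# datalad run-procedure: runs a registered DataLad procedure; the dataset is
# handed to the procedure as its first positional argument
dl.run_procedure(spec="cfg_text2git")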

thank you very much for the detailed answer!
I think AutomagicIO is approximately what I was looking for.
I’m not particularly experienced in software development, but alternatively I would have thought of a function that allows me to declare files as inputs from within Python:

input1 = "somepath"
datalad.add_input(input1)

and the same with outputs. These would then be included in the run record of the datalad run commit. But I really can’t foresee how successful/efficient this would be, and of course inputs/outputs could still be overlooked here…
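For what it is worth, part of this can already be approximated with the Python API (a sketch; the declared paths would still not end up in the run record, which is exactly what the feature request is about):

import datalad.api as dl

declared_inputs, declared_outputs = [], []

def add_input(path):
    declared_inputs.append(path)
    dl.get(path)     # make sure the content is present for reading

def add_output(path):
    declared_outputs.append(path)
    dl.unlock(path)  # needed to overwrite an existing annexed file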

Thanks for the input. I created a feature request.

I see, I guess I misinterpreted the two commands as being more or less the same. Thanks for your clarification.