Intro: Pydra is a new lightweight dataflow engine written in Python. The package is part of the second generation of the Nipype ecosystem, an open-source framework that provides a uniform interface to existing neuroimaging software and facilitates interaction between different software components. The Nipype project was born in the neuroimaging community and has been helping scientists build workflows for a decade, with interfaces to neuroimaging packages such as FSL, ANTs, AFNI, FreeSurfer and SPM. This flexibility has made it an ideal basis for popular preprocessing tools, such as fMRIPrep and C-PAC. The second generation of the Nipype ecosystem is meant to provide additional flexibility and is being developed with reproducibility, ease of use, and scalability in mind.
Pydra is developed as an open-source project in the neuroimaging community, but it is designed as a general-purpose dataflow engine to support any scientific domain. Scientific workflows often require sophisticated analyses that encompass a large collection of algorithms. These algorithms were not necessarily designed to work together and were often written by different authors. Some may be written in Python, while others might require calling external programs. It is common practice to create semi-manual workflows that require scientists to handle the files and interact with partial results from algorithms and external tools. This approach is conceptually simple and easy to implement, but the resulting workflow is often time-consuming, error-prone and difficult to share with others. Consistency, reproducibility and scalability demand that scientific workflows be organized into fully automated pipelines. This was the motivation behind Pydra.
Project: Provenance captures the relationship of data objects to the processes that generated them and to the inputs of those processes, enabling scientific results to be interpreted and compared based on their generating processes. To be useful, the provenance must be comprehensive, understandable, easily communicated, and captured automatically in machine-accessible form.
In the neuroimaging context, there is an ongoing effort (BIDS-Prov) to create a specification for the provenance of BIDS datasets, built on the W3C PROV standard.
There is a natural correspondence between a workflow definition, which describes the flow of data objects, and the provenance of the results of the workflow. We would like to build provenance tracking capabilities directly into Pydra, to dynamically construct BIDS-Prov compatible metadata that can be saved alongside any results. This work will be done in close collaboration with the BIDS-Prov working group.
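The correspondence between a workflow and the provenance of its results can be illustrated with a minimal sketch. It uses only the Python standard library and does not reflect the Pydra API or the BIDS-Prov schema. The helper names (`entity_id`, `run_with_provenance`) and the record layout are hypothetical. Only the relation names `used` and `wasGeneratedBy` are borrowed from W3C PROV.

```python
import hashlib
import json
from datetime import datetime, timezone


def entity_id(value):
    """Derive a stable identifier for a data object from its JSON form (illustrative only)."""
    digest = hashlib.sha256(json.dumps(value, sort_keys=True).encode()).hexdigest()
    return f"entity:{digest[:12]}"


def run_with_provenance(func, **inputs):
    """Run func(**inputs) and return (result, provenance record).

    The record layout is an assumption made for this sketch, loosely
    modeled on the W3C PROV relations `used` and `wasGeneratedBy`.
    """
    started = datetime.now(timezone.utc).isoformat()
    result = func(**inputs)
    ended = datetime.now(timezone.utc).isoformat()

    activity = f"activity:{func.__name__}"
    record = {
        "activity": {activity: {"startedAt": started, "endedAt": ended}},
        # One `used` relation per input data object.
        "used": [
            {"activity": activity, "entity": entity_id(value), "role": name}
            for name, value in inputs.items()
        ],
        # The output data object is attributed to this activity.
        "wasGeneratedBy": [{"entity": entity_id(result), "activity": activity}],
    }
    return result, record


def add(a, b):
    return a + b


def double(x):
    return 2 * x


# A two-step "workflow": the output of `add` feeds `double`.
total, prov_add = run_with_provenance(add, a=1, b=2)
doubled, prov_double = run_with_provenance(double, x=total)

print(json.dumps([prov_add, prov_double], indent=2))
```

Because entity identifiers are derived from content, the entity generated by `add` is the same entity used by `double`, so the two records can be joined into a provenance graph that mirrors the dataflow. A real implementation would presumably hash files rather than JSON values and emit metadata that conforms to the BIDS-Prov specification.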
Planned effort: 175 hours
Skills:
- Python 3: novice +
- Bash: novice +
- Semantic web, RDF, JSON schema: novice +
- Data workflows: beginner +
Mentors: Dorota Jarecka @djarecka, Michael Dayan @michael, Satra Ghosh @satra; collaboration with BIDS-Prov working group
Tech keywords: Provenance, BIDS, Workflow, Python, Pydra, Nipype