GSoC 2022 Project Idea 17.1: Adding BIDS Prov to new dataflow engine written in Python: Pydra (175 h)

Intro: Pydra is a new lightweight dataflow engine written in Python. The package is part of the second generation of the Nipype ecosystem, an open-source framework that provides a uniform interface to existing neuroimaging software and facilitates interaction between different software components. The Nipype project was born in the neuroimaging community and has been helping scientists build workflows for a decade, providing a uniform interface to neuroimaging packages such as FSL, ANTs, AFNI, FreeSurfer and SPM. This flexibility has made it an ideal basis for popular preprocessing tools such as fMRIPrep and C-PAC. The second generation of the Nipype ecosystem is meant to provide additional flexibility and is being developed with reproducibility, ease of use, and scalability in mind.

Pydra is developed as an open-source project in the neuroimaging community, but it is designed as a general-purpose dataflow engine to support any scientific domain. Scientific workflows often require sophisticated analyses that encompass a large collection of algorithms. These algorithms were not necessarily designed to work together and were often written by different authors. Some may be written in Python, while others might require calling external programs. It is common practice to create semi-manual workflows that require scientists to handle the files and interact with partial results from algorithms and external tools. This approach is conceptually simple and easy to implement, but the resulting workflow is often time-consuming, error-prone and difficult to share with others. Consistency, reproducibility and scalability demand that scientific workflows be organized into fully automated pipelines. This was the motivation behind Pydra.

Project: Provenance captures the relationship of data objects to the processes that generated them and the inputs to those processes, enabling scientific results to be interpreted and compared based on their generating processes. To be useful, the provenance must be comprehensive, understandable, easily communicated, and captured automatically in machine-accessible form.

In the neuroimaging context, there is an ongoing effort (BIDS-Prov) to create a specification for the provenance of BIDS datasets, built on the W3C PROV standard.

There is a natural correspondence between a workflow definition, which describes the flow of data objects, and the provenance of the results of the workflow. We would like to build provenance tracking capabilities directly into Pydra, to dynamically construct BIDS-Prov compatible metadata that can be saved alongside any results. This work will be done in close collaboration with the BIDS-Prov working group.
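To illustrate the correspondence between a workflow and its provenance, here is a minimal sketch in plain Python. It is not Pydra's API and does not follow the BIDS-Prov schema: `run_with_provenance`, `entity_id`, and the record layout are hypothetical names chosen for this example. Only the relation names `used` and `wasGeneratedBy` and the entity/activity distinction are borrowed from W3C PROV.

```python
import hashlib
import json


def entity_id(value):
    """Return a short content-based identifier for a data object (illustrative only)."""
    digest = hashlib.sha256(repr(value).encode("utf-8")).hexdigest()
    return f"urn:sha256:{digest[:12]}"


def run_with_provenance(steps, initial):
    """Run the functions in `steps` in order and record PROV-style relations.

    Each function call becomes an activity, each input/output becomes an entity,
    and the links between them are stored as `used` and `wasGeneratedBy` records.
    """
    records = {"entity": {}, "activity": {}, "used": [], "wasGeneratedBy": []}
    current = initial
    records["entity"][entity_id(current)] = {"value": repr(current)}

    for index, func in enumerate(steps):
        activity = f"activity:{index}-{func.__name__}"
        records["activity"][activity] = {"label": func.__name__}
        # The activity consumed the current data object.
        records["used"].append({"activity": activity, "entity": entity_id(current)})

        current = func(current)

        # The activity produced a new data object.
        records["entity"][entity_id(current)] = {"value": repr(current)}
        records["wasGeneratedBy"].append(
            {"entity": entity_id(current), "activity": activity}
        )
    return current, records


def double(x):
    return 2 * x


def increment(x):
    return x + 1


result, prov = run_with_provenance([double, increment], 3)
print(json.dumps(prov, indent=2))
```

The output of one step appears as the `used` entity of the next, so the provenance graph mirrors the dataflow graph. A real implementation would hook into the workflow engine itself and serialize to the format agreed with the BIDS-Prov working group.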

Planned effort: 175 hours

Skills:

  • Python 3: novice +
  • Bash: novice +
  • Semantic web, RDF, JSON schema: novice +
  • Data workflows: beginner +

Mentors: Dorota Jarecka @djarecka, Michael Dayan @michael, Satra Ghosh @satra; collaboration with BIDS-Prov working group

Tech keywords: Provenance, BIDS, Workflow, Python, Pydra, Nipype


Hi,
My name is Salma and I am really interested in the project that you have proposed. I have been programming in Python for 2 years (with formal undergraduate-level training), and have fair experience in bash shell scripting. As an applied machine learning engineer intern and undergrad researcher, I have experience using Nipype, as well as the NiLearn and NiBabel libraries, for fMRI/fNIRS analyses such as exploring brain function in ASD.
I’ve developed a keen interest in computational neuroscience and I would be thrilled to work on this project and gain experience in neuroimaging software development.

Let me know what steps have to be taken next on my part in order to apply. I can send you my CV if you’d like.
Looking forward to hearing from you!

Hello,

I am an incoming PhD student at USC who wants to focus on reproducible and accurate neuroimaging research at the population level (specifically in the dMRI space) during my studies. I have been working with all of the mentioned neuroimaging analysis packages for about 4 years now and have been programming in Python for 2 years (with formal graduate-level training). In my current position as a research tech, I work with Nipype and shell daily to develop pipelines for a team of clinicians to use on traumatic brain injury patient data.

I commented on another GSoC Project Idea from your group, but I would be thrilled to work on this project as well. @djarecka @effigies @satra

Hi @Salma_Shaik ! Thank you for your interest in this project! Please feel free to send me your CV in a direct message. Do you have any public repository that you would like to share with us?

I’m wondering if you have any experience with semantic web?

Also, have you already made any significant contribution to any open source project? If you have, that’s really great, but please note that GSoC is not intended for already-established open source contributors.

Hi @rcali Thank you for your interest in this project! Do you have any public repository that you would like to share with us to show your work?

I’m wondering if you have any experience with semantic web?

Also, have you already made any significant contribution to any open source project? If you have, that’s really great, but please note that GSoC is not intended for already-established open source contributors.

Hi @djarecka,

I have not contributed to an open source project yet but I am eager to! I will private message you my CV and Github.

Cheers,

Ryan