GSoC 2022 Project Idea 17.2: Adding new workers and resource management to new dataflow engine written in Python: Pydra (175 h)

Intro: Pydra is a new lightweight dataflow engine written in Python. The package is a part of the second generation of the Nipype ecosystem — an open-source framework that provides a uniform interface to existing neuroimaging software and facilitates interaction between different software components. The Nipype project was born in the neuroimaging community, and has been helping scientists build workflows for a decade, providing a uniform interface to such neuroimaging packages as FSL, ANTs, AFNI, FreeSurfer and SPM. This flexibility has made it an ideal basis for popular preprocessing tools, such as fMRIPrep and C-PAC. The second generation of Nipype ecosystem is meant to provide additional flexibility and is being developed with reproducibility, ease of use, and scalability in mind.

Project: Pydra workflows are intended to be written independently of the computational resources that will ultimately execute them. Pydra “workers” are classes that describe how to submit nodes in an execution graph to a computational resource, for instance a process pool on a local machine or a high-performance computing cluster. Currently Pydra has support for local multiprocessing, Slurm, Dask, and Oracle/Sun Grid Engine (SGE). We would like to expand the range of systems that Pydra can manage, as well as make better use of the features of existing workers. One goal of particular interest to us is tracking resource (CPU, GPU, RAM) allocation, to allow scheduling that makes efficient use of those resources. The specific task can depend on the participant’s interests and experience.
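To make the “worker” idea concrete, here is a minimal sketch of what such a class looks like. This is an illustration only, not Pydra’s actual API: the class and method names (`LocalWorker`, `submit`, `close`) are invented for this example. It uses a local thread pool for simplicity; Pydra’s default local worker uses a process pool, and a Slurm or SGE worker would instead wrap each task in a batch script and submit it with `sbatch`/`qsub`.

```python
# Hypothetical sketch of the "worker" concept (not Pydra's real API):
# a worker class encapsulates how task callables reach a compute resource.
from concurrent.futures import ThreadPoolExecutor


class LocalWorker:
    """Run tasks on a local pool. A cluster worker would replace
    submit() with code that writes a batch script and calls the
    scheduler's submission command (e.g. sbatch for Slurm)."""

    def __init__(self, n_workers=2):
        self._pool = ThreadPoolExecutor(max_workers=n_workers)

    def submit(self, fn, *args, **kwargs):
        # Return a Future so the engine can poll for completion.
        return self._pool.submit(fn, *args, **kwargs)

    def close(self):
        self._pool.shutdown(wait=True)


# Usage: submit a toy "node" and wait for its result.
worker = LocalWorker()
future = worker.submit(lambda x, y: x + y, 1, 2)
result = future.result()
worker.close()
```

A resource-aware version of this interface would additionally accept per-task requirements (CPUs, GPUs, RAM) in `submit()` and defer tasks until the requested resources are free, which is one direction this project could take.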

Planned effort: 175 hours

Skills:

  • Programming, OOP: intermediate +
  • Python 3: novice +
  • Bash/Shell: novice +
  • HPC and schedulers: novice +

Mentors: Dorota Jarecka @djarecka, Chris Markiewicz @effigies, Satra Ghosh @satra

Tech keywords: HPC, schedulers, Python, Pydra, Nipype

I am an aspiring software engineer with strong problem-solving and strategic planning skills. In partnership with my fellow classmates, I have worked on and successfully completed a variety of projects. I also have open-source contribution experience, which has taught me how to collaborate with others on a project. I’m proficient in programming languages such as C, C++, Python, and JavaScript, with a basic comprehension of Java. I’ve worked with Django, DRF, and ReactJS, among various other frameworks, and am also willing and curious to learn many other technologies.

I’m interested in this project and was wondering how I could start working towards a proposal. What would be some good first steps?

Hi @bridyash

Thanks for your interest. Tagging the mentors @djarecka, @effigies, @satra

@arnab1896 - thanks for tagging!

I’ve replied to @bridyash and asked them to confirm that they meet the GSoC requirements.

@bridyash - btw. INCF just got confirmation that it was approved as a GSoC organization.

Hello,

I am an incoming PhD student at USC who wants to focus on reproducible and accurate neuroimaging research at the population level (specifically in the dMRI space) during my studies. I have been working with all of the mentioned neuroimaging analysis packages for about 4 years now and have been programming in Python for 2 (with formal graduate-level training). In my current position as a research tech, I work with nipype and shell daily to develop pipelines for a team of clinicians to use on traumatic brain injury patient data.

I would be thrilled to work on this project and gain experience working directly with the nipype/nipreps devs. @djarecka @effigies @satra

@bridyash - have you worked with any HPC scheduler before? Are you interested in working on any specific HPC scheduler, or on “workers” in general? Have you ever worked with Dask?

@rcali - thank you for your interest in this project. I saw you also replied to 17.1, so I won’t repeat myself, but I’m wondering which project would be your first choice?
Have you worked with any HPC scheduler before? Are you interested in working on any specific HPC scheduler, or on “workers” in general? Have you ever worked with Dask?