GSoC 2022 Project Idea 17.3: Converting existing scientific workflows to new dataflow engine written in Python: Pydra (350 h)

Intro: Pydra is a new lightweight dataflow engine written in Python. The package is part of the second generation of the Nipype ecosystem, an open-source framework that provides a uniform interface to existing neuroimaging software and facilitates interaction between different software components. The Nipype project was born in the neuroimaging community and has been helping scientists build workflows for a decade, providing a uniform interface to neuroimaging packages such as FSL, ANTs, AFNI, FreeSurfer and SPM. This flexibility has made it an ideal basis for popular preprocessing tools, such as fMRIPrep and C-PAC. The second generation of the Nipype ecosystem is meant to provide additional flexibility and is being developed with reproducibility, ease of use, and scalability in mind.

Project: There are many scientific workflows, written in a variety of languages and frameworks. In neuroimaging, many use Nipype 1 or bash. Pydra is intended to be a generic dataflow engine, and we would like to demonstrate its utility by converting existing workflows from different scientific domains. The specific workflows can be selected according to the participant’s interests and experience. This project will require the creation of new Pydra Task classes that wrap the necessary tools. Depending on the tools to be wrapped, it may be possible to generate classes automatically from a pre-existing specification, or they may be written manually as needed. Likewise, there may be opportunities to convert specifications of entire workflows into Pydra workflows. Generated workflows will be made accessible through the Niflows framework (or similar) and submitted for reuse to workflow hubs such as workflowhub.eu and dockstore.org.

Planned effort: 350 hours

Skills:

  • Python 3: novice +
  • Bash: novice +
  • Scientific software (e.g. AFNI, SPM, ITK, SpikeInterface, CaImAn): intermediate +
  • Creating data workflows: beginner +
  • Data file formats (e.g. NWB, NIfTI, OME-TIFF, PLINK, HDF5): beginner +

Mentors: Dorota Jarecka @djarecka, Chris Markiewicz @effigies, Hao-Ting Wang @HaoTing_Wang, Satra Ghosh @satra, collaboration with groups responsible for specific workflows

Keywords: Workflow, Python, Pydra, Nipype, CWL

Hello, I’m a third-year Ph.D. student with an emphasis in media neuroscience. I’ve used Nipype to reproduce three levels of GLM in FSL, so I’m familiar with Python, Nipype, and FSL :) I’m interested in joining this project! What kind of preparation would you suggest before I start writing a proposal for this project for GSoC 2022?
Thank you!
Yibei

Hi Yibei, let me tag the mentors for you: @djarecka, @effigies, @HaoTing_Wang, @satra

/Malin, org admin

Hi @yibeichen! Thank you for your interest in the project! Do you have any specific pipelines in mind that you would like to work on? If not, we can suggest some options, but it would be good to know if you have any preferences. Do you have an example of your analyses that you would like to share with us?

Also, have you already made any significant contributions to an open source project? If you have, that’s really great, but please note that GSoC is not intended for already-established open source contributors.

Hi @djarecka! Here is the anonymous OSF link for the Nipype project; I’m writing the full manuscript now.
Re pipeline, I’m most familiar with FSL, but open to other options too. I’d love to try SPM, which I’ve never had a chance to use.
Re open science, this Nipype + FSL project is my first, and I’m fairly new to this field. My background is in social science, and I have been learning Python and neuroimaging (through online tutorials and classes) for the past 3 years. It would be great if I could take part in GSoC and make more contributions!

@djarecka hello again! I read through the Pydra documentation. I just want to confirm my understanding of “pipeline”. For example, I’m currently using group ICA + dual regression in FSL. Can I convert it to Pydra? Would it be considered a pipeline?
If yes, I would like to work on it since I need it for my current research project, so (1) I can experiment with real-world data while having hypotheses/expected results in mind, and (2) I can create this pipeline from the user’s perspective.
Please let me know whether my understanding is accurate :’)

Hi @yibeichen - thank you for your answer.

We would like to convert all Nipype/FSL interfaces to Pydra. Any workflow/pipeline that uses multiple algorithms/interfaces could potentially be a good example, but we can discuss the details further.

Do you have a GitHub repository? Perhaps you can also send us your current CV or resume (this can be via direct message; please include all co-leaders).

Pipelines can range from straightforward (e.g., splitting across subjects, processing, joining results for further processing) to complicated (e.g., dynamically changing the workflow based on the data found in the inputs), so this should qualify unless it’s basically a single call to an FSL program.

Even if it is the latter, you could still write a pipeline that calls the FSL program (you would need to wrap that in a Pydra Task), extracts summary statistics, and generates plots for publication.
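To make the two cases above a bit more concrete, here is a minimal sketch in plain Python. It is not Pydra code and does not use the Pydra API; it only illustrates the pattern of wrapping a command-line call as a task, splitting across subjects, joining the results, and extracting a summary statistic. The names `run_tool` and `run_pipeline` are hypothetical, and the "tool" is a stand-in command (the Python interpreter printing a mean) rather than a real FSL program.

```python
import statistics
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor

# Stand-in for an external program: prints the mean of its numeric arguments.
# ASSUMPTION: a real pipeline would call an FSL executable here instead.
_TOOL_CODE = (
    "import sys; v = [float(a) for a in sys.argv[1:]]; print(sum(v) / len(v))"
)


def run_tool(values):
    """Wrap a single command-line call as a Python function (one 'task')."""
    result = subprocess.run(
        [sys.executable, "-c", _TOOL_CODE, *[str(v) for v in values]],
        capture_output=True,
        text=True,
        check=True,  # raise if the command exits with a non-zero status
    )
    return float(result.stdout.strip())


def run_pipeline(inputs):
    """Split across subjects, run the wrapped tool, then join the outputs."""
    subjects = sorted(inputs)

    # Split: each subject is processed independently, so the calls can run
    # concurrently.
    with ThreadPoolExecutor() as pool:
        outputs = list(pool.map(lambda s: run_tool(inputs[s]), subjects))

    # Join: collect the per-subject outputs and compute a group-level summary.
    return {
        "per_subject": dict(zip(subjects, outputs)),
        "group_mean": statistics.mean(outputs),
    }


if __name__ == "__main__":
    # Hypothetical per-subject inputs, for illustration only.
    demo = {"sub-01": [1.0, 2.0, 3.0], "sub-02": [2.0, 4.0, 6.0]}
    print(run_pipeline(demo))
```

In a dataflow engine, the split and join steps are declared on the workflow rather than hand-written with a thread pool, which is a large part of what the engine adds.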

This sounds entirely reasonable. One thing to consider is whether there are any comparable open datasets (e.g., on OpenNeuro) that could be used to validate/demonstrate the pipeline.

Hello @effigies! Thank you for explaining! So the group ICA + dual regression should belong to your first category. I’m happy to start with anything related to FSL, then expand to other software (e.g., SPM).
Re dataset, haha, I should not have said “real data”. Yes, the data I’m using is from OpenNeuro. By “real” I mean that I’m familiar with the data structure and know what kinds of questions can be asked and answered.