GSoC 2022 Project Idea 9.2: Conversion of public neurophysiology datasets to Neurodata Without Borders (NWB) format (175/350 h)

More and more of the experimental datasets behind publications in neuroscience are being publicly released, increasing transparency of the scientific process and allowing reuse of the data for new investigations. However, these datasets, which can include electrophysiology, 2D/3D imaging data and behavioural recordings, are not always in an accessible format, and it may take some effort for researchers to access and analyse the data before deciding to use it in their own research. The Neurodata Without Borders (NWB, https://www.nwb.org) initiative is developing a format for sharing data from neurophysiology experiments which, together with APIs for handling files in the format, promises to greatly facilitate the sharing and reuse of data in neuroscience.

This project will involve converting a number of publicly available datasets to NWB format, adding structured metadata to ensure maximal understandability and reusability of the data. The converted datasets will be made available through the new NWB Explorer on the Open Source Brain repository (http://nwbexplorer.opensourcebrain.org) which allows visualisation of the data as well as interactive analysis through an inbuilt Jupyter notebook.

Skills required: Python; open source development; neuroscience (experimental or computational) background; data analysis.

Aims:

  1. Select a number of publicly available datasets which require conversion to NWB format (see here for examples).

  2. Read, understand and appreciate original publications related to data, convert datasets to NWB format, adding annotations and metadata to facilitate interpretability & reuse of the data by others. Document process to aid others.

  3. Make data available via the NWB Explorer on the Open Source Brain repository

Note: this project is suitable for a half-time or full-time commitment by the GSoC contributor, with the scope of the data conversion scaled as appropriate.

Mentors: Padraig Gleeson (lead), Ankur Sinha

Tech keywords: Python, HDF5, data analysis, open access.

Ma'am, can you guide me on how to proceed further? I am very much interested.

@imad08 To get a sense of what this would involve, have a look at the advice for a similar project we ran last year: GSoC 2021 project idea 12.2: Conversion of public neurophysiology datasets to NeuroData Without Borders format - #5 by pgleeson.

I’ll post updated info after INCF gets approved as an organisation in a few weeks.

Hi @pgleeson ,

I’ve been going through the material on NWB and OSB, along with the advice for previous year’s GSoC. I also added a potential dataset (that I have used in the past) to the OSB/NWBShowcase/issues.

Overall I think it’s a very cool project and I would be quite excited to add more datasets to the showcase!

About me:
I’m a PhD student in neuroscience working on understanding visual coding in the rodent retina at IST Austria. Before this, I studied computer science and engineering for my bachelor’s and master’s. I have been using Python, git etc for many many years but somehow never found the time to contribute to any OSS. Looking forward to changing that this summer!

Advice for 2022 OSB/NWB GSoC applicants

Background reading

Read the Open Source Brain paper as well as the recent Neurodata Without Borders paper. Note the OSB paper only briefly discusses extensions for NWB; OSB is undergoing a major expansion (v2.0) to allow sharing of data as well as models in neuroscience. The beta site for sharing NWB files on OSB is here: http://v2.opensourcebrain.org and a standalone instance of the NWB Explorer (accessible without logging in) can be found here: http://nwbexplorer.opensourcebrain.org.

Suggested activities prior to application

Sign up to GitHub if you’re not already there.

Create an OSB v2 user account & link your GitHub account to it.

Have a look at the example converted data sets which have been put online here: http://nwbexplorer.opensourcebrain.org.

There are scripts for converting different data formats (e.g. Matlab, IgorPro) to NWB format here.

Install pynwb and get some of the above scripts/notebooks working locally.

Make a minor update to the existing scripts (or just README) to improve these existing examples.

There is also a list of potentially interesting datasets which could be converted to NWB here: Issues · OpenSourceBrain/NWBShowcase · GitHub.

Some datasets which were converted during previous years’ GSoC project were:

Find some other public datasets (e.g. single cell electrophysiology recordings, population (calcium) imaging, behavioural studies) which you think would be appropriate for conversion to NWB format, to list with your application. Focus on datasets that are well described/structured/annotated, but in a non-NWB format (to minimise need to involve original data producers)! Also open issues as outlined above with links to the data.

Note 1: There are an increasing number of NWB compatible datasets available on the DANDI Archive. For this reason, there is a pressing need to test and ensure these are compatible with our NWB Explorer, rather than make new datasets which will be compatible with it from the start. Applicants who would be prepared to work to test the NWBE interface and make updates for compatibility with other independently developed datasets (e.g. as last year’s applicant did) would be very welcome!

Note 2: Please share the draft of your application early to allow feedback before the application deadline!

Essential information to include in your application:

  1. The list of potential datasets to convert as discussed above
  2. Details on the course currently being followed and a link to the course webpage.
  3. What are your time commitments during the coding period? Please be specific about this, work/exam commitments etc. Are you planning any vacations this summer? How many classes are you taking this summer?
  4. How many hours per week will you be able to spend on this project?
  5. If you have any evidence of your coding abilities (e.g. contributions to open-source projects) and/or background in neuroscience, please let us know about it. Send links to specific public repositories showing commits by you.
  6. Details of any previous experience in data analysis or computational modelling.

Thanks for your interest in the project @guptadivyansh. Your background sounds great, and your suggested data set looks like a good place to start.

I’ve updated the advice for applicants in the post above, and please see Note 1 there about also potentially working to improve the NWB Explorer interface by testing it with other datasets.

Hi @pgleeson ,

I’m Anh, a PhD student in Neuroscience and a prospective candidate for this GSoC project. I work mostly with behavioral data and my programming skill (Python/Matlab) is at the lower middle end.

I’ve been looking through your notes from this and previous years about this project, as well as trying my hand at the nitty-gritty of converting and testing non-NWB/NWB datasets. From my understanding, converting public neurophysiology datasets to NWB format is a rather straightforward process, where one needs to make sure they understand the publication’s data structure and how well it fits the NWB requirements. I might have overlooked potential challenges of the project due to my inexperience, so I’m hoping you could let me know about them. I’d also like to know, from your experience of working with past GSoC contributors, the average amount of time one spends converting and validating one dataset. That information would help shape my expectations and time commitments for the project.

Anh.

Hi @anhknguyen96.

Yes, in the ideal case the dataset is well described and can be converted to NWB with a few scripts to generate it in the required format using pynwb etc. However, most datasets are not fully complete/annotated, and will require careful reading of the source publications, maybe even interaction with the authors. I would expect 2-3 average-sized datasets could be well converted during a full time project.
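To make the "not fully complete/annotated" point concrete, here is a minimal sketch of a metadata audit one could run before starting a conversion. It assumes the metadata has already been collected into a plain dict. `session_description`, `identifier` and `session_start_time` are the arguments pynwb's `NWBFile` requires; the list of recommended fields, the function name `audit_metadata` and the example values are illustrative assumptions, not part of any existing script.

```python
from datetime import datetime

# The three arguments pynwb's NWBFile constructor requires.
REQUIRED_FIELDS = ("session_description", "identifier", "session_start_time")

# Optional NWBFile fields worth chasing up; this selection is an illustrative choice.
RECOMMENDED_FIELDS = ("experimenter", "lab", "institution", "related_publications")


def audit_metadata(meta):
    """Return (missing_required, missing_recommended, problems) for a metadata dict."""
    missing_required = [k for k in REQUIRED_FIELDS if not meta.get(k)]
    missing_recommended = [k for k in RECOMMENDED_FIELDS if not meta.get(k)]
    problems = []
    start = meta.get("session_start_time")
    # A naive datetime is ambiguous once the file is shared across time zones.
    if isinstance(start, datetime) and start.tzinfo is None:
        problems.append("session_start_time has no timezone")
    return missing_required, missing_recommended, problems


# Hypothetical metadata gathered from a paper and its data README.
example = {
    "session_description": "Placeholder description of one recording session",
    "identifier": "example-session-001",
    "session_start_time": datetime(2020, 1, 15, 10, 30),  # naive: no tzinfo
    "lab": "Example Lab",
}

required, recommended, problems = audit_metadata(example)
print("missing required:", required)
print("missing recommended:", recommended)
print("problems:", problems)
```

Anything reported as missing is a prompt to go back to the publication, or to the authors, before writing the conversion script.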

Bear in mind though Note 1 above, that it’s increasingly important to test how the dataset works/can be viewed/analysed in NWB Explorer, as well as just getting it into NWB. Last year, the student spent ~50% of her time converting data and the rest working to ensure features of the data were well supported in NWBE, so a willingness to dive into the code of NWBE too would be valuable.

Hi @pgleeson ,

As I attempt to test an existing NWB compatible dataset, I ran into several issues:

  • None of the three example files I randomly chose to open with the NWB Explorer can be loaded (sub-mouse3-fni18_ses-170503152245.nwb (1), sub-mouse1-fni16_ses-170808180842.nwb(2), sub-mouse2-fni17_ses-161004115936.nwb (3)). The webpage shows a connection error warning. I would appreciate suggestions on the possible causes for this issue and approaches to tackle it.

  • Among the tested files:

    • (3) couldn’t be loaded/read with io = NWBHDF5IO(file_path, mode="r"). Error message: gsoc/nwb_project/lib/python3.8/site-packages/pynwb/ophys.py:363: UserWarning: The second dimension of data does not match the length of rois. Your data may be transposed. The error message seems to stem from a warning in class RoiResponseSeries(TimeSeries)

    • (1)&(2) don’t have the Trial-based-Segmentation module as instructed to be retrieved by line 4 in this notebook

    • These seem to stem from the format conversion process, so one sensible approach would be to create an issue on the repo. If you suggest otherwise, please let me know.
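The message quoted for file (3) is a UserWarning about array shape, so it may help to see what kind of check it describes. The sketch below is plain Python and is not pynwb's own code: the function name `check_roi_dims` and the return labels are hypothetical, and it assumes the data are laid out as (time, ROI), which is what the warning text implies.

```python
def check_roi_dims(data_shape, n_rois):
    """Classify a 2D (time, ROI) data shape against the number of ROIs."""
    if len(data_shape) != 2:
        return "not 2D"
    n_rows, n_cols = data_shape
    if n_cols == n_rois:
        return "ok"  # second dimension matches the ROI count
    if n_rows == n_rois:
        return "probably transposed"  # ROIs ended up on the first axis
    return "mismatch"  # neither axis matches the ROI count


# Made-up shapes: 1000 time points, 25 ROIs.
print(check_roi_dims((1000, 25), 25))
print(check_roi_dims((25, 1000), 25))
print(check_roi_dims((1000, 30), 25))
```

If a file falls into the "probably transposed" case, the fix belongs in the conversion script that wrote the file rather than in the viewer.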

Thanks for checking this @anhknguyen96. It can be difficult to determine whether such NWB files are failing because of NWBE, or just because they were built with an older version of pynwb, or even because they’re too big and the loading is timing out.

I’ve opened an issue here: https://github.com/MetaCell/nwb-explorer/issues/293 which will help determine this, and potentially help solve other issues. If you were keen to try making such a simple script for testing NWB files, it would be a great contribution to highlight in your application.
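As a rough idea of what such a script could look like (a sketch under assumptions, not the contents of the issue or of any existing tool): try to open each file, record warnings separately from exceptions, and report both. `check_file`, `open_with_pynwb` and `fake_opener` are hypothetical names. The default opener uses pynwb's `NWBHDF5IO`, imported lazily, while the demo swaps in a fake opener so it runs without pynwb or any NWB file.

```python
import warnings


def open_with_pynwb(path):
    """Default opener: read the file with pynwb (imported only when used)."""
    from pynwb import NWBHDF5IO

    with NWBHDF5IO(path, mode="r") as io:
        io.read()


def check_file(path, opener=open_with_pynwb):
    """Try to open one file; return its status, any warnings and any error."""
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")  # record every warning, even repeats
        try:
            opener(path)
            error = None
        except Exception as exc:
            error = f"{type(exc).__name__}: {exc}"
    return {
        "path": path,
        "status": "failed" if error else "ok",
        "warnings": [str(w.message) for w in caught],
        "error": error,
    }


def fake_opener(path):
    """Stand-in for pynwb so the demo needs no real NWB file."""
    warnings.warn("Your data may be transposed.", UserWarning)
    if path.endswith("bad.nwb"):
        raise OSError("unable to open file")


for name in ["good.nwb", "bad.nwb"]:
    print(check_file(name, opener=fake_opener))
```

Keeping warnings and errors apart matters here: a file that opens in pynwb with only warnings but still fails in NWBE points at NWBE or at loading time, whereas a file that raises in pynwb points at the file itself.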

Hi, I added a pull request #295 to address the issue. Please check it out when you have the time. Thanks.

Hi, I am very interested in this project.
How can I join?
I can do data analysis in R and machine learning.
I am also learning Python now and am very interested in neuroimaging.

Thanks for that @anhknguyen96. PR merged!

Welcome @drahmdshahn! Thanks for your interest in the project. You need to apply via Google. Have a look here: https://summerofcode.withgoogle.com. There is also some info from INCF here: Google Summer of Code | INCF. They have a template for applications too: INCF GSoC 2022 Application template - Google Docs and this is submitted to Google when you select what project you want to work on.

Please read carefully the Advice for applicants above too.

Hi @pgleeson , I’m wrapping up the proposal and planning to submit it tomorrow (sorry I couldn’t do it sooner). I’m applying for this as a 175h project (brief description of deliverables: conversion of 1 dataset, ensuring compatibility of that dataset and one other (the dataset that I tested but NWBE failed to load) with NWBE, and an executable script to determine the sources of incompatibility with NWBE), so does that mean I should answer “medium” for this question in the submission form? (screenshot attached).

Hi @anhknguyen96, yes, I believe the 175h option is described as medium.

Thanks for the prompt reply!

Hi @pgleeson , I know it’s a bit tight on time but I would appreciate any feedback on the proposal that you might have. Thanks!

Thanks @anhknguyen96. Proposal looks good, though a bit short. Should give sufficient detail though on what’s to be done. Did you download and look at the pfc-3 dataset yourself though?

Yes I did, but I haven’t tried my hand at exploring the data yet. I proposed that dataset because of its comprehensiveness, which gives ample room for analysis: it has both Behavior and Electrophysiology modules, with the former having 2 sets of behavioral tasks, and the latter having pre- and post-training recordings. I wasn’t sure how detailed I should be, so what I planned out for the timeline are mostly the goals I want to achieve in each period. Would you mind suggesting how I could add more detail?
