GSoC 2020 project idea 11: Extended support for NIX file format in GIN

The G-Node Data Infrastructure (GIN) services[1] provide a platform for management and sharing of data in neuroscience. Inspired by GitHub, the platform uses git and git-annex for versioning and sharing of scientific data, offering the power of a web-based repository management service combined with distributed version control. It addresses the range of research data workflows from data processing and analysis on the local workstation to remote collaboration and data publication. GIN also provides indexing services for convenient searching of data and metadata, including information in well-defined formats like the odML[2] metadata format and the NIX[3] format for scientific data.

In this project we want to enhance the GIN data management services by making use of specific features of the NIX format, such as the comprehensive organization of metadata and the representation of relationships between the data. This would materialize as a set of features on the GIN web frontend for extended search, visualization and exploration of data stored on GIN.

Outcomes of this project would be the ability to search for and extract structural properties and metadata from files and to present and visualize the results.

A successful applicant will have some experience with Python and Go as well as git and will be interested in working with ElasticSearch and JavaScript based data visualization.

Mentors: G-Node & NIX Core Team (Achilleas Koutsou, @achilleas-k; Jan Grewe, @jgrewe; Michael Sonntag, @mpsonntag)

[1] https://gin.g-node.org
[2] https://github.com/G-Node/python-odml
[3] https://github.com/G-Node/nix

Hi @achilleas @jgrewe @mpsonntag, I am Mohini Tripathi, an undergraduate student in Computer Science at IET Bundelkhand University. I have prior experience with Python, Go, JavaScript and Git, and I am currently familiarizing myself with ElasticSearch and GIN. I would like to contribute to this project and am happy to solve a warm-up task if required. I am looking forward to your guidance on this project.

Hello Mohini. Thanks for showing interest in this project. Before we look at warm-up tasks, do you have any public code repositories we can look at?

In the meantime, I realise that we neglected to link to the repository of the current indexing service. You can have a look at it here: https://github.com/G-Node/gin-dex. This service indexes and serves the search results for the data search on GIN: https://gin.g-node.org/explore/data

Hi @achilleas, please have a look at my GitHub profile: https://github.com/mohini-tripathi/

Background & motivation

Hi @achilleas @jgrewe & @mpsonntag, I’m Huzi Cheng, currently a second-year PhD student at Indiana University studying theoretical neuroscience. Data-sharing libraries like this always attract me because we depend on data from experimental neuroscientists, which is why I’m interested in this project.

Programming experience
I have experience with several of the technologies you listed: Python (my mother tongue for programming; I use it for everything, including writing backend services and running neural network simulations), the modern frontend development stack (JS, React.js, Vue.js, CSS3, HTML5, WebGL, etc.) and Go (no large projects, only some playground code). I’m also familiar with the git workflow, so I think I can contribute to this project.

My GitHub profile: https://github.com/chenghuzi

Questions about this project

  • If I understand the introduction correctly, are we supposed to build a browser-based visualization and manipulation interface for GIN data? It sounds more like a FE project but the two repos you mentioned above are more like a backend service.

  • Is there any task/issue that we can work with or use to write a proposal?

Hi @chenghuzi, thanks for showing interest.

Indeed, some frontend work is necessary, but extensions to the backend will be needed as well. The relevant project is GIN, which is a fork of GOGS: https://github.com/G-Node/gogs
Any frontend work will probably be part of this project, though it’s possible a separate service could be deployed that provides GIN with the processing and visualisation data.

The gin-dex project is a backend service that provides indexing and search for GIN. So part of the project is to also extend gin-dex to index information from NIX files and provide it to GIN through search and other means.

The frontend work need not be too substantial. Individual file rendering functionality can be added to GIN with few modifications to the core service. Much like a markdown file is rendered as HTML when previewed on GIN (or GitHub or similar), a separate library can take care of displaying NIX files in interesting ways, embedded in the file preview page.

@mohini-tripathi had the same question and unfortunately I didn’t have a good answer until now.
I think a good warm-up task would be to add NIX file indexing to gin-dex. This is a rather big task if done properly, so I didn’t feel comfortable mentioning it. But a simpler version of this which might help you both get familiar with the project is to treat NIX like a regular HDF5 file while more specialised NIX support would be part of the bigger project.
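To illustrate the simpler version: a NIX file that is an HDF5 container can be recognised by the HDF5 format signature instead of by its extension, and then handled like any other HDF5 file. The snippet below is only a minimal sketch of that check, written in Python for brevity. The helper name `is_hdf5` and the `max_offset` limit are assumptions made for this example and are not part of gin-dex.

```python
import os

# The 8-byte signature that opens every HDF5 file.
HDF5_SIGNATURE = b"\x89HDF\r\n\x1a\n"


def is_hdf5(path, max_offset=1 << 20):
    """Return True if the file carries the HDF5 format signature.

    The signature sits at byte 0, or at 512, 1024, 2048, ... when the
    file starts with a user block, so those offsets are probed in turn.
    max_offset is an arbitrary cut-off chosen for this sketch.
    """
    size = os.path.getsize(path)
    with open(path, "rb") as fh:
        offset = 0
        while offset < size and offset <= max_offset:
            fh.seek(offset)
            if fh.read(len(HDF5_SIGNATURE)) == HDF5_SIGNATURE:
                return True
            # Next candidate offset: 512 first, then doubling.
            offset = 512 if offset == 0 else offset * 2
    return False
```

A file that passes this check could be walked with a generic HDF5 reader, leaving NIX-specific interpretation for the bigger project.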

The tricky part of such a task, beyond the basic task of getting data into the index, is understanding how to best format the data in the index to make it searchable in useful ways and deciding what should be indexed and what not.
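One possible shape for the indexed data, purely as a sketch: turn the nested metadata into one flat document per property, keyed by its full path, and decide what to leave out with an explicit skip list. The nested dict layout used here (`name`, `properties`, `sections`) and the function name `flatten_sections` are made up for illustration; they are not the actual NIX schema or the gin-dex index mapping.

```python
def flatten_sections(section, path="", skip=frozenset()):
    """Yield one flat index document per property in a section tree."""
    here = f"{path}/{section['name']}"
    for name, value in section.get("properties", {}).items():
        if name in skip:
            continue  # deliberately left out of the index
        yield {
            "path": f"{here}/{name}",  # unique, searchable location
            "section": here,           # allows filtering by parent section
            "name": name,
            "value": value,
        }
    # Recurse into nested sections, extending the path as we go.
    for child in section.get("sections", []):
        yield from flatten_sections(child, here, skip)


# Hypothetical metadata tree, for illustration only.
metadata = {
    "name": "session",
    "properties": {"author": "A. Researcher", "internal_id": 17},
    "sections": [
        {"name": "subject", "properties": {"species": "Mus musculus"}},
    ],
}

docs = list(flatten_sections(metadata, skip={"internal_id"}))
```

Flat documents like these are easy to query by name, value or path prefix, at the cost of losing the tree structure unless the path is stored alongside each entry.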

@achilleas Thanks for your response! I checked the repo https://github.com/G-Node/gin-dex and now have a general idea about the project. If the first step is to treat NIX files as regular HDF5 files, are we supposed to add a case for *.nix files in indexObjects.go?

If so, a possible next step is to add a dedicated parsing function for this specialized type of HDF5 file, in a file like “util.go”. I also checked the NIX storage schema and found that the properties stored under sections would be a plausible source of metadata to parse, since in the demos the data description and information such as author names are stored there. That gives me a starting point for deciding what should be indexed. As for how to format the data we pick, I don’t have a clear answer yet, as I haven’t dived into the details.
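The idea of a dedicated case for *.nix files can be sketched as a lookup from file extension to indexing function, with a generic fallback. This is written in Python only to keep it short; gin-dex is a Go project and whatever dispatch indexObjects.go actually uses will look different. All names below (`index_file`, `index_nix`, `index_generic`, `INDEXERS`) are hypothetical.

```python
from pathlib import PurePosixPath


def index_generic(path):
    """Fallback for files without a specialised indexer."""
    return {"path": path, "kind": "generic"}


def index_nix(path):
    """Placeholder where NIX-specific parsing would go."""
    return {"path": path, "kind": "nix"}


# Extension -> indexer; further formats would be registered here.
INDEXERS = {".nix": index_nix}


def index_file(path):
    """Pick an indexer by (case-insensitive) extension and run it."""
    suffix = PurePosixPath(path).suffix.lower()
    return INDEXERS.get(suffix, index_generic)(path)
```

For example, `index_file("data/recording.nix")` is routed to `index_nix`, while `index_file("README.md")` falls through to `index_generic`.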

Please tell me if this is a plausible approach. If yes, can you tell me what level of granularity we’re supposed to reach in the proposal? Should we just describe a rough roadmap, or the whole plan?

That is a good approach and certainly the place to start, yes.

Regarding granularity: At the top level there should be long-term goals and milestones, but the more detail the better. A rough roadmap is usually enough, but any more information that shows you understand how to tackle the problem at each stage, or have an idea about what it might involve, will help.

@achilleas Thanks for your suggestions! I’ve submitted my proposal.