GSoC Project Idea 8.1: Continuous integration for research data

malin · January 15, 2019, 3:26pm

The G-Node Data Infrastructure (GIN) services[1] provide a platform for management and sharing of data in neuroscience. Inspired by GitHub, the platform uses a git/git-annex backend for versioning and sharing of scientific data, offering the power of a web-based repository management service combined with distributed file storage. It addresses the range of research data workflows, from data analysis on the local workstation to remote collaboration and data publication. GIN also provides indexing services for convenient searching of data and metadata, including information in well-defined formats like the odML[2] metadata format and the NIX[3] format for scientific data.

Considering existing continuous integration services like Travis[4] or CircleCI[5] and build pipelines for the scientific field like SnakeMake[6], this project aims to prototype a continuous integration microservice for research data.

The scope of the project is to set up a GIN microservice for automated organization and processing of data and metadata using established CI technology. The development will be based on a use case of electrophysiological data.

Skills: A successful applicant will have some experience with the Python or Go programming languages and ideally is familiar with git, continuous integration services and/or SnakeMake.

Mentors: Achilleas Koutsou, Michael Sonntag, G-Node

[1] https://gin.g-node.org
[2] https://github.com/G-Node/python-odml
[3] https://github.com/G-Node/nix
[4] https://travis-ci.org/
[5] https://circleci.com/
[6] https://snakemake.readthedocs.io/en/stable/

Rahul_Verma · January 17, 2019, 10:12pm

Hi @malin,

My name is Rahul Verma, and I am a 4th year undergrad doing my bachelor’s in computer science. It is really awesome how INCF is improving the lives of millions of people by advancing collaborative brain research. I want to become part of it and make people’s lives better by doing the project “continuous integration for research data”. I have contributed to Mozilla (Release Engineering) and GNOME (Nautilus, their official file manager) before, and I am fairly proficient in Python, Git and Linux. I don’t know much about Travis/Circle, but I am very much interested in learning more about them. I am really excited to work on this project, so can you please guide me on what I should do next?

Thanks.

malin · January 18, 2019, 8:51am

Hello Rahul, you will want to talk to the mentors for this project, Achilleas Koutsou and Michael Sonntag from G-Node. They should contact you soon.

Rahul_Verma · January 18, 2019, 12:31pm

@malin, Thanks.
Hey @achilleas, besides learning about Travis CI and SnakeMake, as you told me on IRC, is there anything else you would suggest?

achilleas · January 18, 2019, 4:44pm

Hello Rahul. If you want to become more familiar with the project, you can have a look at the GIN services to get an idea of what this will be about.
The first link in the description is the main service. The code for that service is hosted here: https://github.com/G-Node/gogs
It’s a slightly modified version of the GOGS project.

The GOGS project has some support for Drone for CI and we experimented with this a bit in the past, but we’re open to trying out any available CI/CD platform that fits our needs.

The goal of this project is a little different from traditional continuous integration and continuous delivery services. As the project description mentions, the goal is to have automated processing of research data and it should be geared towards (but not limited to) electrophysiological data. While researching available technologies and designing the implementation of the project, this goal should be taken into account.

Feel free to ask any further questions once you start getting familiar with the relevant projects.

Rahul_Verma · January 19, 2019, 11:32am

@achilleas. Thanks a lot for your advice. :). Surely will ping you again for any further queries.

wahal · March 27, 2019, 11:08am

Hey Mr. Achilleas, I have been looking into existing CI/CD tools for a few weeks now to see which would be most compatible with our requirements. I have also been in contact with the developers of Travis CI over email, and from what I have found (and have been told by the devs), Travis CI isn’t compatible with GIN-like sources for now. In fact, all of their integrations apart from GitHub (including Bitbucket) are quite problematic and fail with bugs repeatedly. I heard the same from the Circle CI devs and a few other services.

Drone seems good to me, and I’m checking it out currently. I believe Buddy should also serve our purpose; however, as far as I have found out, Buddy is only free for a single user and is not open source.

Moreover, I have also been reading and trying to figure out how we could build our own CI tool. I understand that would be a tedious project with a fairly long timeline, but it’s just me being prepared with a contingency idea in case existing tools don’t work out for us.

Sometimes, it’s better and faster to write code from scratch instead of reading and modifying existing code from the middle of it. ; )

However, Drone looks promising to me, so I’ll go further with studying it. And let’s have an engaging discussion in this forum to collectively go forward. : )

wahal · March 27, 2019, 12:17pm

@achilleas, I also had a few more doubts:

[1] Where can I get sample (electrophysiological) data we wish to supply to the CI platform, along with its corresponding tests? If, perhaps, you have built certain data which isn’t confidential, can I get access to it?

[2] Do we plan to run the CI server on a single machine at INCF nodes, on multiple distributed machines, or are we opting for something like Kubernetes?

Thanks.

achilleas · March 30, 2019, 1:17pm

Hi @wahal,

For the kind of data we expect to be used on the system, you can look at the public datasets that are already hosted on the GIN service https://web.gin.g-node.org/explore/repos as well as datasets published through our service https://doid.gin.g-node.org/.

I can’t say I have a solid plan for how the service would be deployed currently, but of course if it’s built with scalability in mind it would be more future-proof. There are likely going to be some changes to how the whole GIN family of services is deployed and run, so this might be part of the discussion during the early stages of the project. It’s good that you’re thinking about it.

For a service like a CI, which in its barest form has little to persist between sessions, I think scalability won’t be a great challenge (I recognise I may be oversimplifying). That said, given one of our main requirements is the ability to define pipelines using SnakeMake, it’s likely that we will want to limit redundant repetition of processing steps for unchanged dependencies or inputs, so some form of persistence or caching will be considered.
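
A minimal sketch of that caching idea, assuming a content hash of a step's input files is stored between runs and the step is skipped when the hash is unchanged. The names `fingerprint`, `run_if_changed` and the JSON cache file are hypothetical and not part of GIN or SnakeMake; SnakeMake applies its own rules for deciding what to rerun, so this only illustrates the kind of persistence the service itself might need.

```python
import hashlib
import json
from pathlib import Path


def fingerprint(paths):
    """Return a SHA-256 digest over the names and contents of the input files."""
    digest = hashlib.sha256()
    for path in sorted(Path(p) for p in paths):
        digest.update(str(path).encode("utf-8"))
        digest.update(b"\0")
        digest.update(path.read_bytes())
    return digest.hexdigest()


def run_if_changed(step_name, inputs, action, cache_file):
    """Run action() only if the inputs changed since the last recorded run.

    Returns True if the step ran and False if it was skipped.
    The cache file (hypothetical) maps step names to input fingerprints.
    """
    cache_path = Path(cache_file)
    cache = json.loads(cache_path.read_text()) if cache_path.exists() else {}
    current = fingerprint(inputs)
    if cache.get(step_name) == current:
        return False  # inputs unchanged, so the step is skipped
    action()
    cache[step_name] = current
    cache_path.write_text(json.dumps(cache, indent=2))
    return True
```

Hashing file contents rather than relying on timestamps is one possible design choice here: a fresh checkout on a CI worker typically resets modification times, while the contents of unchanged files stay the same.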

wahal · April 6, 2019, 12:01pm

Okay, @achilleas. I’ll wait for the proposed changes in the deployment model and keep this aside for now.

But if it’s only about automating the Snakemake files, then I believe there already exists a graphical version of Snakemake which makes things easier for those who don’t know how to code. We might just have to automate the CI/CD process in that case and handle the deployment of Drone.

Or are we looking for a cleaner and more professional-looking platform to make writing the Snakemake files easier? Something like a web platform?

To be honest, not being from the field of neuroscience, I’m a bit confused as to exactly which parts of writing the Snakemake files are deemed redundant (as you mentioned) and have to be automated. I guess I’ll be learning this during the initial days of the project itself.

achilleas · April 7, 2019, 7:52am

I would say automating the CI/CD process is the core priority of the project, yes.

No, I don’t think a web platform for writing SnakeMake files is a realistic deliverable for the scope of this project. The emphasis should be on the CI service. The pipelining tool, while an important core component, is secondary, and we don’t expect the person working on this project to work on the workflow management tool (SnakeMake), but with it.

In its most general form, the service could be field and workflow agnostic. Any neuroscientific-specific features could be built on a more general core if it’s designed to support such extensions. Of course, as you mentioned, part of the project will consist of learning what kinds of workflows are most likely to be useful for our users and adjusting accordingly.
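
As a rough illustration of a general core with field-specific extensions, the sketch below dispatches files to handlers registered per file suffix and falls back to a generic handler. Everything in it is hypothetical (the `register` decorator, the handler names, and the assumption that NIX files carry a `.nix` suffix); it is not the design of the actual service.

```python
from pathlib import Path

# Maps a file suffix to a handler function; all names here are hypothetical.
HANDLERS = {}


def register(suffix):
    """Decorator that registers a handler for files with the given suffix."""
    def decorator(func):
        HANDLERS[suffix] = func
        return func
    return decorator


def generic_handler(path):
    """Field-agnostic fallback: report only the file name."""
    return {"file": Path(path).name, "kind": "generic"}


def process(path):
    """Dispatch a file to its registered handler, or to the generic fallback."""
    handler = HANDLERS.get(Path(path).suffix.lower(), generic_handler)
    return handler(path)


@register(".nix")
def nix_handler(path):
    """Hypothetical neuroscience-specific extension for NIX files."""
    return {"file": Path(path).name, "kind": "nix"}
```

With this layout the core (`process`) never needs to know about neuroscience formats; an extension only adds another `@register(...)` handler.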

wahal · April 14, 2019, 7:44pm

Hey @achilleas, I was just thinking about best practices for deploying a service like GIN (while being completely unaware of the current deployment configuration), and I believe Google may have saved my day. Google’s new GKE On-Prem could perhaps reduce our trouble considerably, assuming we use an on-premise, locally distributed environment for GIN’s deployment. I’ve used K8s before, and the only reason I preferred to stay away from it was that I’d like complete security over my data by employing on-prem servers. That seems to be resolved now, so K8s can offer us a good solution to automate load distribution and other complex features as well.

If it’s deployed over the cloud, then perhaps it’s another case. But I doubt this will happen, considering that data security and complete control over the data seem to be paramount for the G-Node.