Neurobagel builds tools that let researchers and other data users define and find cohorts of individuals across a federated ecosystem of data nodes. These tools are developed with the goal of making annotation, integration, and searching of datasets easier. With the increasing popularity of standardized vocabularies for dataset harmonization, dataset variables can be annotated with machine-understandable tags based on a globally accessible standard. Unfortunately, this annotation process requires human experts who know the data well, and it can be technically challenging and tedious if done entirely manually. We hope to improve the user experience by assisting human experts in the data standardization process with LLM-based recommendations.
Neurobagel nodes use a graph database to store harmonized datasets for query federation. To participate in the Neurobagel federation, datasets must conform to Neurobagel’s data model. Currently, Neurobagel provides a web-based annotation tool that features a graphical interface enabling users to manually upload a tabular data file to annotate a dataset for inclusion in the graph. This manual annotation can become cumbersome, especially with large datasets or cohorts.
The goal of this project is to develop an LLM-driven agent that creates a first-pass annotation for the user to check and correct. The agent will interface with both the ingested dataset and the standardized vocabularies in use to identify the most likely coding of the provided data. This assistant will be capable of interpreting uploaded files, recommending mappings from dataset columns to variables modeled by Neurobagel, applying suitable heuristics, and identifying any missing values. The ability to audit the decision process will facilitate continuous improvements to the service. The user will then be able to override the initial coding as necessary to accurately reflect the data.
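To make the idea of a first-pass annotation more concrete, here is a minimal sketch in Python of what a purely heuristic suggestion step might look like. It is not part of the Neurobagel codebase: the function name `suggest_annotations`, the keyword table, the variable labels, and the list of missing-value placeholders are all assumptions made up for illustration. In the actual project, an LLM would likely replace or complement the simple keyword lookup.

```python
import csv
import io
import re

# Hypothetical keyword table: column-name tokens -> illustrative variable labels.
# These labels are placeholders, not Neurobagel's actual vocabulary terms.
NAME_HINTS = {
    "participant": "Participant ID",
    "subject": "Participant ID",
    "age": "Age",
    "sex": "Sex",
    "gender": "Sex",
    "diagnosis": "Diagnosis",
}

# Common placeholders for missing data (an assumption; extend as needed).
MISSING_SENTINELS = {"", "na", "n/a", "nan", "null", "-999"}


def suggest_annotations(tsv_text):
    """Return a first-pass suggestion per column for a human to review."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    rows = list(reader)
    suggestions = {}
    for column in reader.fieldnames or []:
        # Split the column name into tokens so "image" does not match "age".
        tokens = re.split(r"[^a-z0-9]+", column.lower())
        variable = next(
            (label for hint, label in NAME_HINTS.items() if hint in tokens),
            None,
        )
        # Collect the distinct cell values that look like missing-data codes.
        values = [(row[column] or "").strip() for row in rows]
        missing = sorted({v for v in values if v.lower() in MISSING_SENTINELS})
        suggestions[column] = {
            "suggested_variable": variable,
            "missing_values": missing,
            # Keep the reason so the decision can be audited later.
            "reason": "column name match" if variable else "no match",
        }
    return suggestions
```

Each suggestion carries a short reason so the user can see why a mapping was proposed and override it where it does not reflect the data.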
The tasks involved in this project include:
- Becoming acquainted with the annotation tool’s codebase
- Exploring LLMs and relevant libraries, such as LangChain and Ollama
- Embedding the automated assistant within the annotation tool for a more efficient process. Because the time commitment is flexible, this task would only be part of the project for a contributor who would like to spend the full 350 h with us
What can I do before GSoC?
Check out Neurobagel’s website and GitHub organization to familiarize yourself with the relevant tools and codebases. Please feel free to reach out to one of the mentors (Brent and Arman) through email with any questions or discussion points you may have about the project.
Skill level: Beginner / Intermediate
Required skills: Python or JavaScript/TypeScript
Helpful skills: Basic understanding of Linux command line, Git, Docker, network requests / API calls via HTTP
Time commitment: Flexible (175/350 h)
Lead mentors:
- Brent McPherson (@bcmcpher)
- Arman Jahanpour (@Arman)
- Sebastian Urchs (@surchs)
- Alyssa Dai (@alyssadai)
Project website: https://neurobagel.org/
Backup mentors: Members of the Neurobagel team and the Origami Laboratory at McGill
Tech keywords: Python, JavaScript, TypeScript, React, Large Language Models, Artificial Intelligence, Knowledge Graph
IMPORTANT
What to do if you want to work on this project / how to apply
First: Thanks a lot for your interest in our project! We’re excited to talk with you, discuss the project, and answer any questions you have. Our project is open to everyone and we want to make sure you feel welcome here! So don’t hesitate to reach out even if you are coming from a different field, are new to this space, or have questions you want answered first.
Here are some concrete next steps:
- Get to know us and get your questions answered! If something is unclear or you have a question, ask it here directly in the forum so everyone can benefit from the answer. Please don’t get in touch directly via email at this point, just ask your questions here in the forum.
- If you have a more technical question or want to see how we work, meet us on our GitHub Organization where we do and discuss most of our work: Neurobagel · GitHub. Feel free to comment on issues or even open a new one for a specific question, feature, or problem. Our contributor guide has some pointers for how we contribute to the projects: How to contribute - Neurobagel
- Discuss your idea for the project with us so we can help you refine your proposal before you submit it. You can send an email or direct message to @Arman or @bcmcpher for this if you prefer.
- Finally: make sure to look closely at the GSoC rules (Google Summer of Code), guides (What is Google Summer of Code? | Google Summer of Code Guides), timeline (Google Summer of Code 2024 Timeline | Google for Developers), and advice (Advice for People Applying for GSoC | Google Summer of Code | Google for Developers) so you have a good idea of how the process works.
Please note that we do not expect you to contribute any work to our repositories before you are selected for the project through GSoC. If you still want to contribute in your own time to our open-source project, you are very welcome to do so! But please understand that this is not a requirement for your application to be selected.
Once you are ready to submit your proposal for this project, please go through the GSoC website (https://summerofcode.withgoogle.com/) and follow the instructions there. We will make an effort to review and respond to your submissions quickly.