GSoC 2024 Project Idea 19.1 An LLM-assisted service for annotating research data with machine-understandable, semantic data dictionaries (175/350 h)

Neurobagel builds tools that let researchers and other data users define and find cohorts of individuals across a federated ecosystem of data nodes. These tools are developed with the goal of making the annotation, integration, and searching of datasets easier. With the increasing popularity of standardized vocabularies for dataset harmonization, dataset variables can be annotated with machine-understandable tags based on a globally accessible standard. Unfortunately, this annotation process requires human experts who know the data well, and it can be technically challenging and tedious if done entirely manually. We hope to improve the user experience by assisting human experts in the data standardization process with LLM-based recommendations.

Neurobagel nodes use a graph database to store harmonized datasets for query federation. To participate in the Neurobagel federation, datasets must conform to Neurobagel’s data model. Currently, Neurobagel provides a web-based annotation tool that features a graphical interface enabling users to manually upload a tabular data file to annotate a dataset for inclusion in the graph. This manual annotation can become cumbersome, especially with large datasets or cohorts.

The goal of this project is to develop an LLM-driven agent that creates a first-pass annotation for the user to check and correct. The agent will interface with both the ingested dataset and the standardized vocabularies in use to identify the most likely coding of the provided data. This assistant will be capable of interpreting uploaded files, offering recommendations for mapping dataset columns to variables modeled by Neurobagel, applying suitable heuristics, and identifying any missing values. The ability to audit the decision process will facilitate continuous improvements to the service. The user will then be able to override the initial coding as necessary to accurately reflect the data.

The tasks involved in this project include:

  • Becoming acquainted with the annotation tool’s codebase
  • Exploring LLMs and relevant libraries such as LangChain and Ollama
  • Embedding the automated assistant within the annotation tool for a more efficient process. Given the flexible time commitment, this task would only be part of the project for a contributor who would like to spend the full 350 h with us

What can I do before GSoC?

Check out Neurobagel’s website and GitHub organization to familiarize yourself with the relevant tools and codebases. Please feel free to reach out to one of the mentors through email (Brent and Arman) to raise questions/discussions that you may have about the project.

Skill level: Beginner / Intermediate

Required skills: Python or JavaScript/TypeScript

Helpful skills: Basic understanding of Linux command line, Git, Docker, network requests / API calls via HTTP

Time commitment: Flexible (175/350 h)

Lead mentors:

Project website: https://neurobagel.org/

Backup mentors: Members of the Neurobagel team and the Origami Laboratory at McGill

Tech keywords: Python, JavaScript, TypeScript, React, Large Language Models, Artificial Intelligence, Knowledge Graph



IMPORTANT

What to do if you want to work on this project / how to apply

First: Thanks a lot for your interest in our project! We're excited to talk with you, discuss the project, and answer any questions you have. Our project is open to everyone, and we want to make sure you feel welcome here. So don't hesitate to reach out even if you are coming from a different field, are new to this space, or have questions you'd like answered first.

Here are some concrete next steps:

Please note that we do not expect you to contribute any work to our repositories before you are selected for the project through GSoC. If you still want to contribute in your own time to our open-source project, you are very welcome to do so! But please understand that this is not a requirement for your application to be selected.

Once you are ready to submit your proposal for this project, please go through the GSoC website (https://summerofcode.withgoogle.com/) and follow the instructions there. We will make an effort to review and respond to your submissions quickly.

Greetings @bcmcpher @ArmanJahanpour @SebastianUrchs @AlyssaDai ,
My name is Tvisha, a sophomore at VJTI, Mumbai, pursuing a B.Tech in Computer Science. I am an NLP and machine learning enthusiast!
I am well-versed in Python and C++.
I have experience with natural language processing (NLP), image processing, Python libraries like NumPy and pandas, web development (MERN stack), the Linux command line, and version control with Git/GitHub. I have worked extensively with LLMs, frameworks like LangChain, the Transformer architecture, and NLP, which I applied in developing a healthcare chatbot (using the Llama model from Hugging Face after testing many models, and also building our own decoder-only Transformer architecture in PyTorch).

I found the project exceptionally intriguing. By developing this LLM-driven annotation agent, we can streamline the data standardization process, reduce the burden on human experts, and improve the overall efficiency and accuracy of dataset annotation. I feel that I can make meaningful contributions to this project given my background with LLMs and also web development for improving the interface. Additionally, multilingual support and audio instruction services could be added in the future.

I feel that handling research data and representing it systematically is a very important task, as it can help bring more contributors as well as donations to research initiatives, thereby contributing to the social good. Contributing at Neurobagel would therefore be an honor.

It would be great to be a part of this project.
Having spent two years in this ever-expanding field of computer science, I have developed a knack for learning new technologies smoothly and quickly, and I am ready to learn anything new for this project.
Please guide me to the resources I would need to study in order to start working on this project.

Looking forward to hearing from you and contributing under your mentorship.

Please do share any updates regarding the tasks to be performed for GSoC '24.

Email:
tnvedant_b22@ce.vjti.ac.in
My github repo:
https://github.com/tvilight4
My resume:
Resume


My name is Sudip Mukherjee, and I am eager to express my interest in participating in Google Summer of Code 2024, collaborating with the Neurobagel team on Project 19.1 - the development of an LLM-assisted service for annotating research data.

I will dedicate significant time to understand the annotation tool’s codebase thoroughly. My research will identify the most suitable LLM for generating annotations, drawing on my experience with llama2. I’ll integrate the automated assistant into the annotation tool, enhancing user efficiency. Additionally, I’m open to UI development if needed, leveraging my JavaScript/TypeScript, React, and Redux/Recoil skills. My background in JavaScript/TypeScript, coupled with proficiency in Linux command line, Git, Docker, aligns seamlessly with the technical requirements of this endeavour. (Skills: Python, JavaScript, TypeScript, React, Large Language Models)

Concerning the tool repositories, there are three distinct ones: annotation_tool, query-tool, and react-query-tool. While annotation_tool and query-tool use Vue, I’m more proficient in React. Do I need to understand all three, or is there a specific repository to prioritise? Similarly, with API repositories (api and federation-api), should I acquaint myself with both or focus on a specific one for the project?

I am committed to dedicating the required 350 hours to this project. My technical skills, coupled with my passion for the subject matter, make me a suitable candidate for this opportunity. I am eager to collaborate closely with the Neurobagel team, learn from experienced mentors, and contribute meaningfully to the project’s success.

Immersing myself in extensive research on LangChain, I’ve acquired a profound understanding of its intricacies, spanning architecture, functionalities, and applications in linguistic data processing. This knowledge enables me to seamlessly integrate LangChain into Neurobagel, enhancing language-related aspects adeptly. My hands-on experience and active forum participation assure you of both theoretical expertise and practical insights. Eager to contribute, I wonder if there’s a task I can undertake under your guidance to further solidify my understanding and skills.

I am researching LLMs and libraries like LangChain and Ollama to find the optimal model for precise API call generation and interpreting user prompts. I have previously interacted with llama2, utilising it as an API with curl.

Example using curl:

curl -X POST http://localhost:11434/api/generate -d '{
  "model": "llama2",
  "prompt": "Explain the process of photosynthesis."
}'
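For reference, an equivalent request can be sketched in Python using only the standard library (this assumes a local Ollama server on its default port 11434; the `stream` field asks for a single JSON response instead of a token stream):

```python
import json
import urllib.request

# Build the same request as the curl example (assumes Ollama on localhost:11434).
payload = {
    "model": "llama2",
    "prompt": "Explain the process of photosynthesis.",
    "stream": False,  # one JSON response instead of a token stream
}
req = urllib.request.Request(
    "http://localhost:11434/api/generate",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)

# Uncomment to send the request (requires a running Ollama server):
# with urllib.request.urlopen(req) as resp:
#     print(json.load(resp)["response"])
```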

I look forward to the chance to contribute to the Neurobagel project and make a meaningful impact on advancing the LLM-assisted service for annotating research data.

I am excited about the prospect of working together during Google Summer of Code 2024.

Best regards,

Sudip Mukherjee
Email: sudipmukherjee96144@gmail.com
github:- SudipMukhejee (SudipMukherjee) · GitHub


Hi @Tvisha_Vedant
Thank you for your interest in the project! I’m glad to see that your experience aligns with our project goals.

At this stage, focusing on familiarizing yourself with the annotation tool is sufficient for this project. There’s no need to begin working on anything before the project officially starts.

Regarding GSoC tasks, I recommend visiting their official website and keeping an eye on the deadlines. I believe the contributor application period begins on March 18th and I encourage you to apply for this project and/or our 2nd project if you’re interested.

Looking forward to hearing from you during the application period.

In the meantime, please don’t hesitate to reach out with any questions you may have regarding the projects.


Hi @Sudip_Mukherjee1,

Thanks for expressing your interest in the project! It’s great to see the alignment of your experience with our project goals.

Regarding your inquiry about the tool repositories: As we’re in the process of transitioning the query tool from Vue to React, and plan to do the same for the annotation tool soon, there may be some changes in the codebase. However, the overall workflow of the tools is likely to remain consistent. If you’re taking the initiative to familiarize yourself with our tools, I suggest focusing on the annotation tool for this project, as it will be our primary focus.

While I commend your proactive approach, I want to emphasize that there’s no obligation for applicants to undertake any tasks before the project officially begins.

I’m excited to hear from you during the application period. If you have any further inquiries or require clarification on any aspect of the project, please don’t hesitate to reach out.


Hi everyone, thanks again for your interest in this project!

  1. We have recently added more details in the project description (under the “IMPORTANT” section) about our recommended next steps/what you can do right now if you are interested in submitting a proposal for this project. Please read through this section when you have a chance.
  2. To be able to answer your questions more easily and help you create a good proposal, we invite you to join our Neurobagel Discord server! All the mentors will be present on the server, and here you will be able to ask general questions as well as message mentors one-on-one about your specific proposal. (You can still reach us on email and Neurostars, but responses may be slower on these platforms.)

Greetings mentors,

I am Manya Gupta, an LLM enthusiast, and I am extremely interested in contributing to this project. Building an LLM assistant to make the data annotation process faster is a task I am confident I can perform.

Looking forward to your guidance.

Email: manyaag231103@gmail.com

Is there any work remaining for this project?

Hello @Arman , @alyssadai,
My name is Aravind. I am a final-year student at NGIT, Hyderabad, currently pursuing a Bachelor of Engineering in Computer Science. I have experience building web apps and am currently learning machine learning and cloud-native technologies. I previously built AI-integrated web apps with the Google Gemini API and small ML projects with NumPy and pandas.

I’m planning to apply for GSoC 2026 with INCF, and I’m especially interested in the Neurobagel natural-language querying project. I am also open to learning new technologies and can adapt quickly.

Could you please guide me on which resources I should refer to in order to get started, and share any updates regarding the tasks to be performed for GSoC 2026?

Looking forward to contributing and learning soon :smiley:

Hello @Arman, @alyssadai and team,

I’m Sanjay, a final-year CSE student at VIT Chennai, planning to apply for GSoC 2026 with INCF. I’m particularly interested in Project 19.1, the Neurobagel LLM-assisted annotation project. My recent work has focused on LLM-driven data workflows, including building a RAG system using LangChain and developing production UI features with React and Vue. Since this project combines LLMs with an annotation interface, it aligns well with what I’m working on.

I’m currently going through the project description and wanted to make sure I’m exploring the right areas. Could you please confirm:
whether the annotation tool is still the best place to start understanding the workflow, and
whether there are particular components (LLM integration points or UI flow) that would be helpful to study before drafting a proposal.

Looking forward to learning more and refining my proposal direction with your guidance.

Thanks!
Sanjay
Email: sanjayhariharan9@gmail.com
GitHub: TheSanjBot (Sanjay H) · GitHub

Hello @Arman, @Brent and the Neurobagel team,

My name is Aryan, and I’m currently pursuing a B.Sc. in AI & ML at SRM University (Chennai). I’m planning to apply for GSoC 2026 with INCF, and I’m especially interested in Project 19.1 – the LLM-assisted annotation service for Neurobagel.

I’ve been focusing deeply on machine learning and LLM-based systems. I’ve built retrieval-augmented generation (RAG) pipelines using LangChain, worked with structured JSON outputs from LLMs, and explored ways to make model outputs reliable and schema-constrained. I’m particularly interested in designing AI systems that bridge natural language understanding with structured data — which is exactly why this project stood out to me.

I’m currently going through the Neurobagel website and GitHub repositories to better understand the data model and the existing annotation workflow.

I wanted to ask:

  • Is the current web-based annotation tool the best entry point for understanding the overall workflow and integration points?
  • Are there specific parts of the codebase (e.g., data model definitions, API layer, or UI components) that would be especially helpful to study before drafting a proposal?
  • Would you recommend prototyping the LLM mapping logic independently first, or focusing on understanding how it would integrate into the existing system?

Thank you for your time — I’m looking forward to learning more and hopefully contributing to Neurobagel.

Best regards,
Aryan Pagaria
GitHub: Aryanpagaria (Aryan Pagaria) · GitHub
Email: aryanpagaria7@gmail.com

Hello @alyssadai @bcmcpher

My name is Akshat Pal, and I am currently pursuing a B.E. in ECE at Thapar University, Patiala. I am very interested in contributing to Project 19.1 – developing an LLM-assisted service for annotating research data within Neurobagel.

I come from a background in Python-based data systems and applied AI. I have developed a multi-threaded ETL pipeline in Python and MySQL focused on structured data processing and performance optimization. I have also worked on AI-integrated systems involving real-time data analysis and classification tasks, which has given me experience in designing reproducible and modular pipelines.

This project specifically interests me because it sits at the nexus of structured data harmonization, knowledge graph modeling, standardized vocabularies, and explainable LLM integration.

Before writing a proposal, I would like to clarify a few architectural details:

  1. Of the repositories (annotation_tool, query-tool, react-query-tool), would contributors specifically target the annotation_tool for this project?

  2. Regarding the backend layer, would the assistant specifically communicate with the api repository, or would federation-api also be applicable for this purpose?

  3. Would there be a preferred method for LLM integration, such as:
    a. Structured output generation based on prompts?
    b. Embedding similarity comparison against standardized vocabularies?
    c. Or perhaps a hybrid approach that leverages heuristics and model inference?

  4. What is the expected level of auditability and logging for the reasoning process of the LLM?
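Purely as an illustration of option (b) above, embedding-similarity matching against a standardized vocabulary could be sketched with a toy character-trigram "embedding" standing in for a real learned one (the vocabulary terms here are hypothetical):

```python
# Toy embedding-similarity matcher: compares a column name against vocabulary
# terms using bag-of-character-trigram vectors and cosine similarity. A real
# system would use learned embeddings; terms here are illustrative only.
from collections import Counter
from math import sqrt

def trigram_vector(text: str) -> Counter:
    """Bag of character trigrams, padded so short strings still get trigrams."""
    t = f"  {text.lower()}  "
    return Counter(t[i:i + 3] for i in range(len(t) - 2))

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[k] * b[k] for k in a)
    na = sqrt(sum(v * v for v in a.values()))
    nb = sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def best_match(column: str, vocabulary: list[str]) -> str:
    """Return the vocabulary term most similar to the column name."""
    return max(vocabulary,
               key=lambda term: cosine(trigram_vector(column),
                                       trigram_vector(term)))

vocab = ["participant id", "age", "sex", "diagnosis"]
print(best_match("diagnosis_group", vocab))  # prints: diagnosis
```

A hybrid approach (option c) would combine such similarity scores with rule-based heuristics and LLM inference, keeping the individual scores around for auditability.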

I am comfortable working with Python and integrating external services via APIs. I am also familiar with Docker, Git-based workflows, and Linux environments. I would be happy to explore the relevant repositories and understand the current annotation workflow in depth before refining my proposal.

If there are specific documentation sections or starter issues you recommend reviewing first, I would greatly appreciate your guidance.

GitHub: akshat4703 (Akshat Pal) · GitHub
Resume: https://drive.google.com/file/d/1JuCajqaQwZwSzb4SJtOTC47vEQBzJyzL/view?usp=sharing
Email: akshat4703@gmail.com

Hi all! I’m Sejal Punwatkar (sejalpunwatkar (Sejal) · GitHub), a developer focused on neurodata standards. I’ve been contributing to the NWB ecosystem (specifically PyNWB and HDMF), where I’ve been lucky to have Ryan Ly as technical mentor.

I’m now expanding my focus to Neurobagel to tackle the LLM-assisted annotation project (19.1). After working on low-level data validation, I’m eager to apply that knowledge to high-level semantic mapping using LLMs. If anyone is into JSON-LD or automated metadata, I’d love to chat! (punwatkarsejal@gmail.com)

Hello mentors and Neurobagel team,

My name is Lohitha Reddy Karamala, and I am a final-year Electronics and Communication Engineering student from India. I am planning to apply for GSoC 2026 with INCF, and I am strongly interested in Project 19.1 – the LLM-assisted annotation service for Neurobagel.

I am particularly excited about this project because it sits at the intersection of structured data, AI, and real-world research workflows. I have a strong foundation in Python and have been actively exploring LLM-based systems, data processing pipelines, and API-based integrations. The idea of building an intelligent assistant that can bridge natural language understanding with structured scientific data annotation is something I find extremely meaningful and impactful.

I have started going through the Neurobagel documentation and repositories to understand the data model and the existing annotation workflow. I will also be joining the Discord server to stay connected with mentors and contributors and to ensure I align my proposal with the project’s real needs.

I am very serious about contributing meaningfully and would love guidance on:
• The best starting point in the codebase to understand the current annotation pipeline
• Any recommended issues, modules, or prototype ideas I can begin exploring
• How I can best prepare a strong proposal aligned with your expectations

Looking forward to learning from this community and contributing actively.

Thank you for your time and guidance.

Best Regards,
Lohitha Reddy Karamala