GSoC 2024 Project Idea 19.1 An LLM-assisted service for annotating research data with machine-understandable, semantic data dictionaries (175/350 h)

Neurobagel builds tools that let researchers and other data users define and find cohorts of individuals across a federated ecosystem of data nodes. These tools are developed with the goal of making the annotation, integration, and searching of datasets easier. With the increasing popularity of standardized vocabularies for dataset harmonization, dataset variables can be annotated with machine-understandable tags based on a globally accessible standard. Unfortunately, this annotation process requires human experts who know the data well, and it can be technically challenging and tedious if done entirely manually. We hope to improve the user experience by assisting human experts in the data standardization process with LLM-based recommendations.

Neurobagel nodes use a graph database to store harmonized datasets for query federation. To participate in the Neurobagel federation, datasets must conform to Neurobagel’s data model. Currently, Neurobagel provides a web-based annotation tool that features a graphical interface enabling users to manually upload a tabular data file to annotate a dataset for inclusion in the graph. This manual annotation can become cumbersome, especially with large datasets or cohorts.

The goal of this project is to develop an LLM-driven agent that creates a first-pass annotation for the user to check and correct. The agent will interface with both the ingested dataset and the standardized vocabularies in use to identify the most likely coding of the provided data. This assistant will be capable of interpreting uploaded files, offering recommendations for mapping dataset columns to variables modeled by Neurobagel, applying suitable heuristics, and identifying any missing values. The ability to audit the decision process will facilitate continuous improvements to the service. The user will then be able to override the initial coding as necessary to accurately reflect the data.

The tasks involved in this project include:

  • Becoming acquainted with the annotation tool’s codebase
  • Exploring LLMs and relevant libraries such as LangChain and Ollama
  • Embedding the automated assistant within the annotation tool for a more efficient process. Given the flexible time commitment, this task would only be part of the project for a contributor who would like to spend the full 350 h with us

What can I do before GSoC?

Check out Neurobagel’s website and GitHub organization to familiarize yourself with the relevant tools and codebases. Please feel free to reach out to one of the mentors through email (Brent and Arman) to raise questions/discussions that you may have about the project.

Skill level: Beginner / Intermediate

Required skills: Python or JavaScript/TypeScript

Helpful skills: Basic understanding of Linux command line, Git, Docker, network requests / API calls via HTTP

Time commitment: Flexible (175/350 h)

Lead mentors:

Project website: https://neurobagel.org/

Backup mentors: Members of the Neurobagel team and the Origami Laboratory at McGill

Tech keywords: Python, JavaScript, TypeScript, React, Large Language Models, Artificial Intelligence, Knowledge Graph



IMPORTANT

What to do if you want to work on this project / how to apply

First: Thanks a lot for your interest in our project! We're excited to talk with you, discuss the project, and answer any questions you have. Our project is open to everyone, and we want to make sure you feel welcome here. So don't hesitate to reach out even if you are coming from a different field, are new to this space, or have questions you'd like answered first.

Here are some concrete next steps:

Please note that we do not expect you to contribute any work to our repositories before you are selected for the project through GSoC. If you still want to contribute in your own time to our open-source project, you are very welcome to do so! But please understand that this is not a requirement for your application to be selected.

Once you are ready to submit your proposal for this project, please go through the GSoC website (https://summerofcode.withgoogle.com/) and follow the instructions there. We will make an effort to review and respond to your submissions quickly.

Greetings @bcmcpher @ArmanJahanpour @SebastianUrchs @AlyssaDai ,
My name is Tvisha, a sophomore at VJTI, Mumbai, pursuing a B.Tech in Computer Science. I am an NLP and machine learning enthusiast!
I am well-versed in Python and C++.
I have experience with natural language processing (NLP), image processing, Python libraries like NumPy and pandas, web development (MERN stack), the Linux command line, and version control with Git/GitHub. I have worked extensively with LLMs, frameworks like LangChain, the Transformer architecture, and NLP, which I applied in developing a healthcare chatbot (using the Llama model from Hugging Face after testing many models, and also building our own decoder-only Transformer architecture in PyTorch).

I found the project exceptionally intriguing. By developing this LLM-driven annotation agent, we can streamline the data standardization process, reduce the burden on human experts, and improve the overall efficiency and accuracy of dataset annotation. I feel that I can make meaningful contributions to this project given my background with LLMs and also web development for improving the interface. Additionally, multilingual support and audio instruction services could be added in the future.

I feel that handling research data and representing it systematically is a very important task, as it can help bring more contributors as well as donations to research initiatives, thereby contributing to the social good. Contributing at Neurobagel would therefore be an honor.

It would be great to be a part of this project.
Having spent two years in this ever-expanding field of computer science, I have developed a knack for learning new technologies smoothly and quickly, and I am ready to learn anything new for this project.
Please guide me to the resources I would need to study in order to start working on this project.

Looking forward to hearing from you and contributing under your mentorship.

Please do share any updates regarding the tasks to be performed for GSoC '24.

Email:
tnvedant_b22@ce.vjti.ac.in
My github repo:
https://github.com/tvilight4
My resume:
Resume


My name is Sudip Mukherjee, and I am eager to express my interest in participating in Google Summer of Code 2024, collaborating with the Neurobagel team on Project 19.1 - the development of an LLM-assisted service for annotating research data.

I will dedicate significant time to understand the annotation tool’s codebase thoroughly. My research will identify the most suitable LLM for generating annotations, drawing on my experience with llama2. I’ll integrate the automated assistant into the annotation tool, enhancing user efficiency. Additionally, I’m open to UI development if needed, leveraging my JavaScript/TypeScript, React, and Redux/Recoil skills. My background in JavaScript/TypeScript, coupled with proficiency in Linux command line, Git, Docker, aligns seamlessly with the technical requirements of this endeavour. (Skills: Python, JavaScript, TypeScript, React, Large Language Models)

Concerning the tool repositories, there are three distinct ones: annotation_tool, query-tool, and react-query-tool. While annotation_tool and query-tool use Vue, I’m more proficient in React. Do I need to understand all three, or is there a specific repository to prioritise? Similarly, with API repositories (api and federation-api), should I acquaint myself with both or focus on a specific one for the project?

I am committed to dedicating the required 350 hours to this project. My technical skills, coupled with my passion for the subject matter, make me a suitable candidate for this opportunity. I am eager to collaborate closely with the Neurobagel team, learn from experienced mentors, and contribute meaningfully to the project’s success.

Immersing myself in extensive research on LangChain, I’ve acquired a profound understanding of its intricacies, spanning architecture, functionalities, and applications in linguistic data processing. This knowledge enables me to seamlessly integrate LangChain into Neurobagel, enhancing language-related aspects adeptly. My hands-on experience and active forum participation assure you of both theoretical expertise and practical insights. Eager to contribute, I wonder if there’s a task I can undertake under your guidance to further solidify my understanding and skills.

I am researching LLMs and libraries like LangChain and Ollama to find the optimal model for precise API call generation and interpreting user prompts. I have previously interacted with llama2, utilising it as an API with curl.

Example using curl:

curl -X POST http://localhost:11434/api/generate -d '{
  "model": "llama2",
  "prompt": "Explain the process of photosynthesis."
}'
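For reference, an equivalent request can be sketched in Python using only the standard library (this assumes a local Ollama server on its default port 11434; the `stream` field asks for a single JSON response instead of a token stream):

```python
import json
import urllib.request

# Build the same request as the curl example (assumes Ollama on localhost:11434).
payload = {
    "model": "llama2",
    "prompt": "Explain the process of photosynthesis.",
    "stream": False,  # one JSON response instead of a token stream
}
req = urllib.request.Request(
    "http://localhost:11434/api/generate",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)

# Uncomment to send the request (requires a running Ollama server):
# with urllib.request.urlopen(req) as resp:
#     print(json.load(resp)["response"])
```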

I look forward to the chance to contribute to the Neurobagel project and make a meaningful impact on advancing the LLM-assisted service for annotating research data.

I am excited about the prospect of working together during Google Summer of Code 2024.

Best regards,

Sudip Mukherjee
Email: sudipmukherjee96144@gmail.com
github:- SudipMukhejee (SudipMukherjee) · GitHub


Hi @Tvisha_Vedant
Thank you for your interest in the project! I’m glad to see that your experience aligns with our project goals.

At this stage, focusing on familiarizing yourself with the annotation tool is sufficient for this project. There’s no need to begin working on anything before the project officially starts.

Regarding GSoC tasks, I recommend visiting their official website and keeping an eye on the deadlines. I believe the contributor application period begins on March 18th and I encourage you to apply for this project and/or our 2nd project if you’re interested.

Looking forward to hearing from you during the application period.

In the meantime, please don’t hesitate to reach out with any questions you may have regarding the projects.


Hi @Sudip_Mukherjee1,

Thanks for expressing your interest in the project! It’s great to see the alignment of your experience with our project goals.

Regarding your inquiry about the tool repositories: As we’re in the process of transitioning the query tool from Vue to React, and plan to do the same for the annotation tool soon, there may be some changes in the codebase. However, the overall workflow of the tools is likely to remain consistent. If you’re taking the initiative to familiarize yourself with our tools, I suggest focusing on the annotation tool for this project, as it will be our primary focus.

While I commend your proactive approach, I want to emphasize that there’s no obligation for applicants to undertake any tasks before the project officially begins.

I’m excited to hear from you during the application period. If you have any further inquiries or require clarification on any aspect of the project, please don’t hesitate to reach out.


Hi everyone, thanks again for your interest in this project!

  1. We have recently added more details in the project description (under the “IMPORTANT” section) about our recommended next steps/what you can do right now if you are interested in submitting a proposal for this project. Please read through this section when you have a chance.
  2. To be able to answer your questions more easily and help you create a good proposal, we invite you to join our Neurobagel Discord server! All the mentors will be present on the server, and here you will be able to ask general questions as well as message mentors one-on-one about your specific proposal. (You can still reach us on email and Neurostars, but responses may be slower on these platforms.)

Greetings mentors,

I am Manya Gupta, an LLM enthusiast, and I am extremely interested in contributing to this project. Building an LLM assistant to make the data annotation process faster is a task I am confident I can perform.

Looking forward to your guidance.

Email: manyaag231103@gmail.com

Is there any work remaining for this project?

Hello @Arman , @alyssadai,
My name is Aravind. I am a final-year student at NGIT, Hyderabad, currently pursuing a Bachelor of Engineering in Computer Science. I have experience building web apps and am currently learning machine learning and cloud-native technologies. I previously built AI-integrated web apps with the Google Gemini API and small ML projects with NumPy and pandas.

I’m planning to apply for GSoC 2026 with INCF, and I’m especially interested in the Neurobagel natural-language querying project. I am also open to learning new technologies and can adapt quickly.

Could you please guide me on which resources I should refer to in order to get started, and share any updates regarding the tasks to be performed for GSoC 2026?

Looking forward to contributing and learning soon :smiley:

Hello @Arman, @alyssadai and team,

I’m Sanjay, a final-year CSE student at VIT Chennai, planning to apply for GSoC 2026 with INCF. I’m particularly interested in Project 19.1, the Neurobagel LLM-assisted annotation project. My recent work has focused on LLM-driven data workflows, including building a RAG system using LangChain and developing production UI features with React and Vue. Since this project combines LLMs with an annotation interface, it aligns well with what I’m working on.

I’m currently going through the project description and wanted to make sure I’m exploring the right areas. Could you please confirm:
whether the annotation tool is still the best place to start understanding the workflow, and
whether there are particular components (LLM integration points or UI flow) that would be helpful to study before drafting a proposal.

Looking forward to learning more and refining my proposal direction with your guidance.

Thanks!
Sanjay
Email: sanjayhariharan9@gmail.com
GitHub: TheSanjBot (Sanjay H) · GitHub

Hello @Arman, @Brent and the Neurobagel team,

My name is Aryan, and I’m currently pursuing a B.Sc. in AI & ML at SRM University (Chennai). I’m planning to apply for GSoC 2026 with INCF, and I’m especially interested in Project 19.1 – the LLM-assisted annotation service for Neurobagel.

I’ve been focusing deeply on machine learning and LLM-based systems. I’ve built retrieval-augmented generation (RAG) pipelines using LangChain, worked with structured JSON outputs from LLMs, and explored ways to make model outputs reliable and schema-constrained. I’m particularly interested in designing AI systems that bridge natural language understanding with structured data — which is exactly why this project stood out to me.

I’m currently going through the Neurobagel website and GitHub repositories to better understand the data model and the existing annotation workflow.

I wanted to ask:

  • Is the current web-based annotation tool the best entry point for understanding the overall workflow and integration points?
  • Are there specific parts of the codebase (e.g., data model definitions, API layer, or UI components) that would be especially helpful to study before drafting a proposal?
  • Would you recommend prototyping the LLM mapping logic independently first, or focusing on understanding how it would integrate into the existing system?

Thank you for your time — I’m looking forward to learning more and hopefully contributing to Neurobagel.

Best regards,
Aryan Pagaria
GitHub: Aryanpagaria (Aryan Pagaria) · GitHub
Email: aryanpagaria7@gmail.com

Hello @alyssadai @bcmcpher

My name is Akshat Pal, and I am currently pursuing a B.E. in ECE at Thapar University, Patiala. I am very interested in contributing to Project 19.1 – developing an LLM-assisted service for annotating research data within Neurobagel.

I come from a background in Python-based data systems and applied AI. I have developed a multi-threaded ETL pipeline in Python and MySQL focused on structured data processing and performance optimization. I have also worked on AI-integrated systems involving real-time data analysis and classification tasks, which has given me experience in designing reproducible and modular pipelines.

This project specifically interests me because it sits at the nexus of structured data harmonization, knowledge graph modeling, standardized vocabularies, and explainable LLM integration.

Before writing a proposal, I would like to clarify a few architectural details:

  1. Of the repositories (annotation_tool, query-tool, react-query-tool), would contributors specifically target the annotation_tool for this project?

  2. Regarding the backend layer, would the assistant specifically communicate with the api repository, or would federation-api also be applicable for this purpose?

  3. Would there be a preferred method for LLM integration, such as:
    a. Structured output generation based on prompts?
    b. Embedding similarity comparison against standardized vocabularies?
    c. Or perhaps a hybrid approach that leverages heuristics and model inference?

  4. What is the expected level of auditability and logging for the reasoning process of the LLM?
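Purely as an illustration of option (b) above, embedding-similarity matching against a standardized vocabulary could be sketched with a toy character-trigram "embedding" standing in for a real learned one (the vocabulary terms here are hypothetical):

```python
# Toy embedding-similarity matcher: compares a column name against vocabulary
# terms using bag-of-character-trigram vectors and cosine similarity. A real
# system would use learned embeddings; terms here are illustrative only.
from collections import Counter
from math import sqrt

def trigram_vector(text: str) -> Counter:
    """Bag of character trigrams, padded so short strings still get trigrams."""
    t = f"  {text.lower()}  "
    return Counter(t[i:i + 3] for i in range(len(t) - 2))

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[k] * b[k] for k in a)
    na = sqrt(sum(v * v for v in a.values()))
    nb = sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def best_match(column: str, vocabulary: list[str]) -> str:
    """Return the vocabulary term most similar to the column name."""
    return max(vocabulary,
               key=lambda term: cosine(trigram_vector(column),
                                       trigram_vector(term)))

vocab = ["participant id", "age", "sex", "diagnosis"]
print(best_match("diagnosis_group", vocab))  # prints: diagnosis
```

A hybrid approach (option c) would combine such similarity scores with rule-based heuristics and LLM inference, keeping the individual scores around for auditability.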

I am comfortable working with Python and integrating external services via APIs. I am also familiar with Docker, Git-based workflows, and Linux environments. I would be happy to explore the relevant repositories and understand the current annotation workflow in depth before refining my proposal.

If there are specific documentation sections or starter issues you recommend reviewing first, I would greatly appreciate your guidance.

GitHub: akshat4703 (Akshat Pal) · GitHub
Resume: https://drive.google.com/file/d/1JuCajqaQwZwSzb4SJtOTC47vEQBzJyzL/view?usp=sharing
Email: akshat4703@gmail.com

Hi all! I’m Sejal Punwatkar (sejalpunwatkar (Sejal) · GitHub), a developer focused on neurodata standards. I’ve been contributing to the NWB ecosystem (specifically PyNWB and HDMF), where I’ve been lucky to have Ryan Ly as technical mentor.

I’m now expanding my focus to Neurobagel to tackle the LLM-assisted annotation project (19.1). After working on low-level data validation, I’m eager to apply that knowledge to high-level semantic mapping using LLMs. If anyone is into JSON-LD or automated metadata, I’d love to chat! (punwatkarsejal@gmail.com)

Hello mentors and Neurobagel team,

My name is Lohitha Reddy Karamala, and I am a final-year Electronics and Communication Engineering student from India. I am planning to apply for GSoC 2026 with INCF, and I am strongly interested in Project 19.1 – the LLM-assisted annotation service for Neurobagel.

I am particularly excited about this project because it sits at the intersection of structured data, AI, and real-world research workflows. I have a strong foundation in Python and have been actively exploring LLM-based systems, data processing pipelines, and API-based integrations. The idea of building an intelligent assistant that can bridge natural language understanding with structured scientific data annotation is something I find extremely meaningful and impactful.

I have started going through the Neurobagel documentation and repositories to understand the data model and the existing annotation workflow. I will also be joining the Discord server to stay connected with mentors and contributors and to ensure I align my proposal with the project’s real needs.

I am very serious about contributing meaningfully and would love guidance on:
• The best starting point in the codebase to understand the current annotation pipeline
• Any recommended issues, modules, or prototype ideas I can begin exploring
• How I can best prepare a strong proposal aligned with your expectations

Looking forward to learning from this community and contributing actively.

Thank you for your time and guidance.

Best Regards,
Lohitha Reddy Karamala