GSoC 2024 Project Idea 19.2 A natural language interface for querying federated research data (175h/350h)

greg_incf · February 23, 2024, 5:47pm

Neurobagel is a federated data ecosystem that allows researchers and other data users to find and consume research data that has to remain at its original institute for data governance reasons. To make this possible, Neurobagel provides tools that make annotation, integration, and searching of data easier, and maintains a common data model that allows for federating data queries. However, the scope of query options can be daunting for a user, and obtaining the desired results often requires iteration. Making the search process more accessible and conversational would motivate people to use and contribute back to the federated harmonization ecosystem of Neurobagel, ultimately benefiting all users.

The Neurobagel cohort query workflow allows users to search for cohorts of individual research participants across federated data nodes hosted at each participating institute. Each Neurobagel node consists of a graph database for data storage and an API that exposes specific query parameters and controls what results a user can see. Currently, Neurobagel provides a graphical web query interface that communicates with the node APIs on behalf of the user, making complex queries easier to formulate. We hope to improve the user query experience further by providing an LLM chatbot-style interface to populate queries and elaborate on search results.

Leveraging the existing Neurobagel cohort query workflow, this project aims to create a chatbot using existing large language models (LLMs) for parsing user-provided text into accurate queries and reliably summarizing the results to the user. At a high level, this chatbot should be capable of receiving and understanding user prompts in natural language, initiating the corresponding API calls using predefined Neurobagel parameters (minimum age, maximum age, sex, etc.), interpreting the results, and conveying that information to the user. Ideally, open tools and models can be selected to provide flexible hosting options.
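As a rough illustration of the "predefined parameters" step described above, the sketch below turns a set of already-extracted parameters into a query URL. The base URL and the parameter names `min_age`, `max_age`, and `sex` are assumptions made for illustration; they are not confirmed details of the Neurobagel API.

```python
from dataclasses import asdict, dataclass
from typing import Optional
from urllib.parse import urlencode

# Hypothetical node address, used for illustration only.
BASE_URL = "https://example.org/query"


@dataclass
class CohortQuery:
    """Parameters a chatbot might extract from a user prompt (names are assumed)."""

    min_age: Optional[float] = None
    max_age: Optional[float] = None
    sex: Optional[str] = None


def build_query_url(query: CohortQuery, base_url: str = BASE_URL) -> str:
    """Return a GET URL containing only the parameters the user specified."""
    if (
        query.min_age is not None
        and query.max_age is not None
        and query.min_age > query.max_age
    ):
        raise ValueError("min_age must not exceed max_age")
    # Drop unset parameters so the node only receives explicit filters.
    params = {key: value for key, value in asdict(query).items() if value is not None}
    return f"{base_url}?{urlencode(params)}" if params else base_url


print(build_query_url(CohortQuery(min_age=18, max_age=30, sex="female")))
```

Keeping the URL construction in ordinary code, rather than asking the model to write the URL itself, is one possible design choice: the model only has to fill in a small set of fields, and the code can reject inconsistent values before any request is sent.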

The tasks involved in this project include:

Getting familiar with the codebase of existing tools, including the API and cohort query tool
Exploring LLMs and relevant libraries, such as LangChain and Ollama
Identifying a model and sequence of prompts that can generate accurate API calls for the project
Developing a simple user interface for the agent. Given the flexible time commitment, this task would only be part of the project for a contributor who would like to spend the full 350 h with us
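To make the "model and sequence of prompts" task a little more concrete, here is a minimal sketch of one possible approach: ask the model to reply with a small JSON object and validate that reply before using it. The prompt wording, the key names, and both helper functions are hypothetical and only meant to illustrate the idea.

```python
import json

# Hypothetical parameter names, for illustration only.
ALLOWED_KEYS = {"min_age", "max_age", "sex"}

PROMPT_TEMPLATE = (
    "Extract cohort search parameters from the user's request.\n"
    "Reply with a JSON object using only these keys: min_age, max_age, sex.\n"
    "Omit any key the user did not mention.\n"
    "User request: {request}\n"
)


def build_prompt(request: str) -> str:
    """Insert the user's request into the instruction template."""
    return PROMPT_TEMPLATE.format(request=request)


def parse_model_reply(reply: str) -> dict:
    """Validate the model's reply before it is turned into an API call."""
    data = json.loads(reply)  # raises a ValueError subclass on invalid JSON
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    unknown = set(data) - ALLOWED_KEYS
    if unknown:
        raise ValueError(f"unexpected keys: {sorted(unknown)}")
    for key in ("min_age", "max_age"):
        value = data.get(key)
        # bool is a subclass of int in Python, so it is excluded explicitly.
        if key in data and (isinstance(value, bool) or not isinstance(value, (int, float))):
            raise ValueError(f"{key} must be a number")
    if "sex" in data and not isinstance(data["sex"], str):
        raise ValueError("sex must be a string")
    return data
```

For example, `parse_model_reply('{"min_age": 18, "sex": "female"}')` returns the parsed dictionary, while a reply containing an unexpected key or a non-numeric age raises `ValueError`, which the chatbot could use as a signal to re-prompt the model.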

What can I do before GSoC?

Check out Neurobagel’s website and GitHub organization to familiarize yourself with the relevant tools and codebases. Please feel free to reach out to one of the mentors through email (Brent and Arman) to raise questions/discussions that you may have about the project.

Skill level: Beginner / Intermediate

Required skills: Python or JavaScript/TypeScript

Helpful skills: Basic understanding of Linux command line, Git, Docker, network requests / API calls via HTTP

Time commitment: Flexible (175/350 h)

Lead mentors:

Brent McPherson (@bcmcpher)
Arman Jahanpour (@Arman)
Sebastian Urchs (@surchs)
Alyssa Dai (@alyssadai)

Project website: https://neurobagel.org/

Backup mentors: Members of the Neurobagel team and the Origami Laboratory at McGill

Tech keywords: Python, JavaScript, TypeScript, React, Large Language Models, Artificial Intelligence, Knowledge Graph

IMPORTANT

What to do if you want to work on this project / how to apply

First: Thanks a lot for your interest in our project! We’re excited to talk with you, discuss the project, and answer any questions you have. Our project is open to everyone and we want to make sure you feel welcome here! So don’t hesitate to reach out even if you are coming from a different field, are new to this space, or have questions you want answered first.

Here are some concrete next steps:

Get to know us and get your questions answered! If something is unclear or you have a question, ask it here directly in the forum so everyone can benefit from the answer. Please don’t get in touch directly via email at this point, just ask your questions here in the forum.
If you have a more technical question or want to see how we work, meet us on our GitHub Organization where we do and discuss most of our work: Neurobagel · GitHub. Feel free to comment on issues or even open a new one for a specific question, feature, or problem. Our contributor guide has some pointers for how we contribute to the projects: How to contribute - Neurobagel
Discuss your idea for the project with us so we can help you refine your proposal before you submit it. You can send an email or direct message to @Arman or @bcmcpher for this if you prefer.
Finally: make sure to look closely at the GSoC rules (Google Summer of Code), guides (What is Google Summer of Code? | Google Summer of Code Guides), timeline (Google Summer of Code 2024 Timeline | Google for Developers), and advice for applicants (Advice for People Applying for GSoC | Google Summer of Code | Google for Developers) so you have a good idea of how the process works.

Please note that we do not expect you to contribute any work to our repositories before you are selected for the project through GSoC. If you still want to contribute in your own time to our open-source project, you are very welcome to do so! But please understand that this is not a requirement for your application to be selected.

Once you are ready to submit your proposal for this project, please go through the GSoC website (https://summerofcode.withgoogle.com/) and follow the instructions there. We will make an effort to review and respond to your submissions quickly.

Sauradip07 · February 23, 2024, 6:42pm

My name is Sauradip Ghosh, and I am writing to express my strong interest in participating in Google Summer of Code 2024 under the mentorship of the Neurobagel team. Specifically, I am keen on contributing to the development of the chatbot project outlined in the problem statement (19.2).

Having thoroughly reviewed the project description, I find the prospect of leveraging existing tools and large language models (LLMs) to build a chatbot capable of parsing user prompts, initiating API calls, and summarizing results intriguing. My background in JavaScript/TypeScript, as well as familiarity with Linux command line, Git, Docker, and HTTP requests, aligns well with the technical requirements and keywords of the project (Python, JavaScript, TypeScript, React, Large Language Models).

Here is how I would approach this project:

Codebase Familiarization: I am very interested in this project and will dedicate time to understanding the existing Neurobagel codebase, including the API and cohort query tool.

Clarification Needed on Project Repositories

Regarding the tool repositories, I noticed that there are three distinct repositories: annotation_tool, query-tool, and react-query-tool. I understand that the annotation_tool and query-tool are written in Vue, while I am more familiar with React. Could you please clarify if it is necessary for me to familiarize myself with all three repositories, or if there is a specific repository that I should focus on?

Similarly, I noticed that there are two API repositories: api and federation-api. Could you please advise if I need to familiarize myself with both of these repositories, or if there is a specific one that is more relevant to the project?

Exploration of LLMs and Libraries: I will research and evaluate LLMs and relevant libraries such as LangChain and Ollama to identify the most suitable model for generating accurate API calls and interpreting results based on user prompts. I have previously experimented with llama2 and have also called it through an API using curl.
Example using curl:

curl -X POST http://localhost:11434/api/generate -d '{
  "model": "llama2",
  "prompt": "Why is the sky blue?"
}'
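For anyone who prefers Python, the same request can be sketched with only the standard library. This assumes the endpoint is a local Ollama server, as the port and path suggest. The `"stream": False` field and the `"response"` key in the reply reflect my understanding of the Ollama generate API and should be checked against its documentation; the function names are my own.

```python
import json
import urllib.request

# Same local endpoint as in the curl example above.
GENERATE_URL = "http://localhost:11434/api/generate"


def build_generate_request(prompt: str, model: str = "llama2") -> urllib.request.Request:
    """Build the POST request without sending it."""
    # "stream": False asks for a single JSON reply instead of a stream of chunks
    # (assumed behaviour of the Ollama generate endpoint).
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        GENERATE_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(prompt: str, model: str = "llama2") -> str:
    """Send the request and return the generated text (needs a server running locally)."""
    with urllib.request.urlopen(build_generate_request(prompt, model)) as response:
        return json.loads(response.read())["response"]
```

With a model server running locally, `generate("Why is the sky blue?")` would send the same prompt as the curl command. Separating request construction from sending makes the payload easy to inspect without a running server.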

User Interface Development: I am willing to contribute to the development of a simple user interface for the chatbot using my proficiency in JavaScript/TypeScript, React, and Redux/Recoil.

I am fully committed to dedicating the required 350 hours to this project. My technical skills, coupled with my enthusiasm for the subject matter, make me a strong candidate for this endeavor. I am eager to collaborate closely with the Neurobagel team, learn from experienced mentors, and contribute meaningfully to the project’s advancement.

Furthermore, I understand the importance of this project and am prepared to allocate additional time beyond the stipulated 350 hours, if necessary, to ensure the project’s success. I am flexible and willing to go the extra mile to meet the project’s objectives and deadlines.

Additionally, I would greatly appreciate it if you could assign me specific tasks or provide guidance on tasks that I can work on for this project. I am eager to contribute in any way possible and ensure that my efforts are aligned with the project goals.

Best Regards,
Sauradip Ghosh

LinkedIn: https://www.linkedin.com/in/sauradip-ghosh-726742222/
GitHub: Sauradip07 (Sauradip Ghosh) · GitHub

Arman · March 1, 2024, 9:44pm

Hi @Sauradip07

Thanks for reaching out to express your interest in the project! I’m glad to see that your technical background and experience align well with our project goals.

Regarding your inquiry about the project repositories: We are currently transitioning the query tool from Vue to React. If you’re taking the initiative to familiarize yourself with our tools, I would recommend focusing on the react-query-tool repository and the api repository since our primary focus will be on them for this project.

While I commend your proactive approach, I want to emphasize that there’s no need for you to undertake any tasks before the project officially begins.

I hope this addresses your questions adequately. If you have any further inquiries or require clarification on any aspect of the project, please don’t hesitate to reach out.

arnabch20k · March 6, 2024, 5:22am

Dear Brent, Arman, Sebastian, Alyssa, and the Neurobagel Team,

I hope this message finds you well.
I am Arnab Chatterjee, a sophomore Computer Science student specializing in AI/ML at Asansol Engineering College. After thoroughly reviewing the details of the Neurobagel project and its exciting potential, I am writing to express my keen interest in contributing to the development of a conversational interface for the Neurobagel federated data ecosystem.

Neurobagel’s mission to facilitate data accessibility and collaboration among researchers while maintaining data governance standards resonates deeply with me. The prospect of enhancing user experience through a chatbot-style interface aligns perfectly with the project’s vision of making data querying more intuitive and user-friendly.

Here are some compelling points that motivate me to be a part of this endeavor:

Federated Data Ecosystem: I am intrigued by the concept of a federated data ecosystem that enables researchers to access and utilize data while respecting institutional data governance policies. It is crucial for advancing scientific research while upholding data integrity and security.
Enhanced Query Workflow: The Neurobagel cohort query workflow presents an innovative approach to searching for research cohorts across multiple data nodes. By leveraging graph databases and APIs, Neurobagel streamlines the querying process and offers researchers valuable insights into diverse datasets.
Conversational Interface: The integration of a chatbot-style interface powered by large language models (LLMs) represents a significant step towards enhancing user engagement and accessibility. By enabling users to interact with the system in natural language, we can lower the barrier to entry and encourage broader participation within the research community.
Skill Development: As someone with proficiency in Python and a basic understanding of relevant technologies such as Git and network requests, I am eager to expand my skill set and contribute meaningfully to the project. The opportunity to work with mentors and explore cutting-edge technologies like LLMs excites me tremendously.

While I possess a solid foundation in Python, I am also eager to delve into JavaScript/TypeScript to contribute effectively to the project. Moreover, I am comfortable with basic Linux command line operations, Git, and Docker, and have experience with network requests and API calls via HTTP.
Before the official commencement of GSoC, I plan to familiarize myself with the Neurobagel codebase and relevant tools, including the API and cohort query tool. Additionally, I aim to explore LLMs and associated libraries to gain a deeper understanding of their capabilities and potential applications within the project.

I am enthusiastic about the possibility of collaborating with the Neurobagel team and contributing to the development of a transformative solution that empowers researchers worldwide.

Moreover, I recognize the significance of this project and am ready to invest extra time beyond the designated 350 hours, should it be required, to guarantee its triumph. My flexibility and dedication extend to surpassing expectations to fulfill the project’s aims and meet deadlines effectively.

Furthermore, I would be grateful if you could designate particular tasks or offer guidance regarding areas where I can contribute to this project. I am enthusiastic about contributing in any capacity necessary and ensuring that my contributions align seamlessly with the project’s overarching objectives.

Thank you so much for considering me! I look forward to your reply and to your guidance on taking further steps toward this project.

Warm Regards,
Arnab Chatterjee

LinkedIn
GitHub
Twitter

alyssadai · March 7, 2024, 4:28pm

Hi everyone, thanks so much for your interest in this project!

We have recently added more details in the project description (under the “IMPORTANT” section) about our recommended next steps/what you can do right now if you are interested in submitting a proposal for this project. Please read through this section when you have a chance.
To be able to answer your questions more easily and help you create a good proposal, we invite you to join our Neurobagel Discord server! All the mentors will be present on the server, where you will be able to ask general questions as well as message mentors one-on-one about your specific proposal. (You can still reach us by email and on Neurostars, but responses may be slower on these platforms.)