GSoC 2025 Project #20 INCF Secretariat :: Build an AI Agent for KnowledgeSpace using RAG (350h)

Mentors: Visakh Muraleedharan <visakh@incf.org> and Tom Gillespie <tom.h.gillespie@gmail.com>

Skill level: Intermediate

Required skills: Python, AI/ML, NLP, Neo4j, ElasticSearch, React, Node.js, GCP, GitLab CI/CD.

Time commitment: Full time (350 hours)

Forum for discussion

About: KnowledgeSpace is a community-driven online resource for the neuroscience community, facilitating open and accessible sharing of data, knowledge, and tools. Neuroscience research generates vast amounts of complex data and literature, making it challenging for users to locate specific, relevant information. By implementing an AI agent powered by RAG, the project aims to address this challenge by creating a tool that facilitates quick and reliable access to the right data and knowledge. This will empower neuroscientists, educators, and the broader community to leverage KnowledgeSpace more effectively.

Aims: To further enhance the platform’s usability, this project aims to develop an AI-powered agent that uses Retrieval-Augmented Generation (RAG) to provide precise, contextually relevant, and human-like answers to user queries. This will improve the user experience by providing concise, context-aware, and scientifically accurate information about neuroscience concepts and datasets.

Scope:

  • Integration with existing KnowledgeSpace metadata
  • Indexing and data retrieval based on text and vector search
  • Neuroscience domain context adaptation using standards such as NIFSTD
  • Model deployment and integration in Vertex AI
  • User Interface development

Websites: https://knowledge-space.org/ and GitHub (INCF/knowledge-space: KnowledgeSpace (KS) is a data-driven encyclopedia and search engine for the neuroscience community).

Tech keywords: Python, AI/ML, NLP, Neo4j, ElasticSearch

Greetings @visakh, Tom Gillespie,
I am Kinshuk Trivedi, a third-year undergraduate student at VJTI, Mumbai, India, pursuing a B.Tech in Electrical Engineering. I have a strong enthusiasm for LLMs, NLP, and machine learning! I am proficient in Python and C++, with hands-on experience in RAG, NLP, Computer Vision, and Python libraries such as NumPy, Pandas, TensorFlow, PyTorch, and Streamlit. My skill set also includes Web Development (MERN stack), MySQL, APIs, Linux CLI, and version control with Git/GitHub. I have worked extensively with LLMs, including LangChain, RAG, and the Transformer architecture, applying them to develop a news research tool and a cold email generator using OpenAI and LLaMA 3.1 models.

I found the project “Build an AI Agent for KnowledgeSpace using RAG” incredibly fascinating. KnowledgeSpace serves as a vital hub for the neuroscience community, providing open and accessible sharing of research data, knowledge, and tools. The idea of integrating an AI-powered agent that leverages Retrieval-Augmented Generation (RAG) to enhance search capabilities and deliver precise, context-aware responses is both innovative and impactful. Given my experience with LLMs and RAG, I believe I can make meaningful contributions to this project. With my proficiency in web development, I can also build and improve the user interface. Future improvements could include multilingual support and personalization features to refine search relevance based on user preferences.

I feel that enhancing the way neuroscience data is structured, explored, and interacted with can make a real difference in how knowledge is shared and discoveries are made. This project has the potential to bridge gaps, making it easier for researchers, educators, and enthusiasts to find the right information when they need it, and it would be an honor to contribute to it.

I am committed to dedicating the required 350 hours to this project. Over the years, I have developed a strong ability to grasp new technologies quickly, and I am eager to learn and adapt to contribute meaningfully to this project.

I’ve been exploring the ElasticSearch APIs, graph databases like Neo4j, ArrayDB, and the KnowledgeSpace platform itself by running it locally on my system, experimenting with improvements and potential integrations to enhance its functionality and user experience. Please point me to the resources I should study to start working on this project.

Looking forward to hearing from you and contributing under your mentorship.

Do share any updates regarding the tasks to be performed for GSoC’25.

Email: kinshuktrivedi03@gmail.com
My github repo: GitHub
My resume: Resume

Best Regards,
Kinshuk Trivedi

Hello @visakh @TomGillespie
I am Mahi S. Palimkar, a Computer Engineering sophomore at Veermata Jijabai Technological Institute (VJTI), Mumbai. I am a machine learning enthusiast and have also worked extensively with natural language processing, computer vision, image processing, and web development. I have a good command of Python, C/C++, and version control using Git/GitHub.

I’m fascinated by the challenge of making complex neuroscience data accessible, as this directly accelerates scientific discovery. RAG is the ideal solution here as it combines the conversational abilities of LLMs with fact-based retrieval from authoritative sources—ensuring both scientific accuracy and intuitive access to specialized knowledge.

In the last few days, I have gained a deeper understanding of how RAG works. This is a blog I wrote about it. I also tried building a basic RAG pipeline, and later a RAG-powered chatbot that answers Indian legal queries.
You can see it here: RAG-mini-project

I am currently learning Neo4j, as it is relevant to implementing the ontology-driven aspects of this RAG system.

I would love to contribute to issues and take on tasks to get more familiar with the project. It would be really helpful if you could guide me along these lines.

I am also attaching my resume here.

Thank you!

Thank you all for your interest!
Besides the RAG part, which we have already started building, we are looking for someone who can build a chat interface with conversational memory.

Greetings @visakh,
Thanks for your update. Yes, we can also build a chat interface with conversational memory. There are two ways we can implement it:

1. Using RAG (AI agent): we store past conversations as vector embeddings and query the vector database to fetch the relevant past conversations. The retrieved memory is then concatenated with the user query and passed to the LLM, which generates the response. With this method we can also have more than one database (for different knowledge domains), with the agent using LLM reasoning to decide which database to retrieve from.

2. Using LLM fine-tuning (best for domain-specific memory retention): if we want the chatbot to retain knowledge and respond more naturally, we can fine-tune an open-source model (e.g., LLaMA, Mistral, Falcon) on custom conversation datasets, including past user interactions and responses, so that the memory is stored in the model weights.

We can also build a hybrid of the two methods above. The UI can be built using React/Flask or any other tech stack that is suitable for integration. I hope this conveys the idea behind both methods.
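
To make the first method more concrete, here is a minimal sketch of retrieving past turns and concatenating them with a new query. It is only an illustration under stated assumptions: the bag-of-words `embed` function is a toy stand-in for a real embedding model, the in-memory list stands in for a vector database, and the names `ConversationMemory` and `build_prompt` are hypothetical.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy embedding: term counts (a stand-in for a real embedding model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class ConversationMemory:
    """Keeps past turns with their vectors (stand-in for a vector database)."""

    def __init__(self):
        self.turns = []  # list of (text, vector) pairs

    def add(self, text):
        self.turns.append((text, embed(text)))

    def retrieve(self, query, k=2):
        """Return up to k past turns most similar to the query."""
        q = embed(query)
        scored = [(cosine(q, vec), text) for text, vec in self.turns]
        scored = [(s, t) for s, t in scored if s > 0]  # drop unrelated turns
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return [text for _, text in scored[:k]]


def build_prompt(memory, query, k=2):
    """Concatenate the retrieved memory with the user query for the LLM."""
    context = "\n".join(f"- {t}" for t in memory.retrieve(query, k))
    return f"Relevant past conversation:\n{context}\n\nUser question: {query}"
```

In a real system the prompt returned by `build_prompt` would be sent to the LLM, and each new exchange would be added back to the memory with `add`.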

Regards,
Kinshuk Trivedi

Thank you for your response @visakh!
I will look into the best ways to do so and update you at the earliest.

Hello @visakh

I hope you’re doing well. My name is Uzair Sayyed, and I’m currently a second-year Computer Science Engineering student with a strong passion for LLMs and agentic AI workflows. I’m excited about the opportunity to contribute to our current system by building a Retrieval Augmented Generation (RAG) layer on top of our existing Elasticsearch data.

The approach focuses on leveraging the power of Elasticsearch combined with advanced language models. The idea is to convert our existing text data into dense, numerical representations that capture the semantic meaning of the text. These dense vectors will be stored alongside the original content within Elasticsearch. Storing the dense vectors with the original text is crucial: the vectors enable efficient semantic search using techniques like kNN, while the original text provides the full context and readability for users once relevant documents are retrieved.

When a user submits a query, we will convert that query into an embedding using the same model. Elasticsearch’s k-Nearest Neighbors (kNN) search will then be used to retrieve the most semantically similar documents from our dataset. These documents, rich in context, are subsequently fed into a language model along with the query to generate an informed and context-aware response.
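
As a rough sketch of the two steps above, the request bodies could look like the following. This assumes Elasticsearch 8.x with a `dense_vector` field and the top-level `knn` search option; the field names `text` and `embedding`, the helper names, and the prompt wording are my own placeholders, not anything taken from the existing KnowledgeSpace index.

```python
def index_mapping(dims):
    """Mapping that stores the original text next to its dense vector."""
    return {
        "mappings": {
            "properties": {
                "text": {"type": "text"},  # hypothetical field name
                "embedding": {  # hypothetical field name
                    "type": "dense_vector",
                    "dims": dims,
                    "index": True,
                    "similarity": "cosine",
                },
            }
        }
    }


def knn_search_body(query_vector, k=5, num_candidates=50):
    """Search body for an approximate kNN query over the embedding field."""
    return {
        "knn": {
            "field": "embedding",
            "query_vector": list(query_vector),
            "k": k,
            "num_candidates": num_candidates,
        },
        "_source": ["text"],  # return only the readable text
    }


def prompt_from_hits(query, hits):
    """Join the retrieved documents with the user query for the language model."""
    context = "\n\n".join(hit["_source"]["text"] for hit in hits)
    return f"Context:\n{context}\n\nQuestion: {query}"
```

Here `hits` is expected to be the `hits.hits` list from the search response, and the query vector must come from the same embedding model that was used at indexing time.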

Additionally, to further enhance the user experience, I propose building a conversational chat interface using LangChain. This interface will not only handle the retrieval and generation process but also incorporate conversational memory, ensuring that the dialogue remains coherent and engaging across multiple interactions.

Overall, this approach allows us to seamlessly integrate our existing data into a robust RAG system, combining semantic search with generative AI to deliver more accurate, context-rich responses. I’m very willing to contribute to this project and collaborate with you all to refine the details.

Looking forward to your thoughts and feedback.

Email: uzairsayyed010@gmail.com

Hey @visakh ,
I am Satvik, a third-year student at Amrita Vishwa Vidyapeetham and a member of amFOSS.
I sent you an email asking how to obtain the metadata so that I could explore this project further. I hope you can respond to it soon (a reply here would work too!). I’d love to be part of building the RAG system even though you have already started working on it: it is a genuine interest of mine, and it would be an incredibly enlightening experience to be part of such a project, regardless of whether it counts as part of GSoC. I believe that building the RAG system is far more challenging than developing a chat interface that records conversational memory (which I would also be interested in contributing to), but my main interest lies in joining the team currently working on it. I recently worked on a personal project implementing RAPTOR RAG from scratch, so this would be a good continuation of that project.
I hope you see this and respond as soon as possible, as I would love to be part of this project.

Regards,
Satvik Mishra
Email: satvmishi@gmail.com

Hi @visakh, just to understand: are you looking for individuals with expertise in frontend and not RAG?

Hi Visakh,

I’m Atharva, a final-year AI student from Mumbai.
I’ve been exploring knowledge graphs for a while now, and I think I can contribute to this project.
I have been working on building my own conversational memory layer, similar to mem0, using Neo4j and ArangoDB.

Greetings @visakh @TomGillespie,

I am Viraj Vora, a second-year Computer Engineering student at VJTI, Mumbai, with a passion for AI/ML, NLP, Generative AI and web development. I was thrilled to discover the “Build an AI Agent for KnowledgeSpace using RAG” project and am deeply inspired by its mission to empower neuroscientists through enhanced data accessibility. KnowledgeSpace’s role as a bridge between neuroscience concepts, datasets, and tools resonates with my aspiration to create impactful AI solutions for real-world challenges.

The opportunity to enhance neuroscience data accessibility through RAG resonates with my technical interests and aspirations. KnowledgeSpace’s mission to bridge vast neuroscience datasets with intuitive tools inspires me to apply my skills in semantic search, graph databases, and generative AI. The prospect of integrating ElasticSearch’s vector-based retrieval with Neo4j’s relational knowledge graphs to deliver precise, context-aware responses is particularly compelling. I am eager to contribute to a system that empowers researchers to navigate intricate data landscapes efficiently.

Proposed Approach:
The solution combines ElasticSearch’s semantic search (via kNN and embeddings) with Neo4j’s graph-based entity relationships (e.g., linking “hippocampus” to memory datasets via NIFSTD standards). A lightweight LLM, fine-tuned on Vertex AI, will generate responses grounded in retrieved data. A React-LangChain chat interface will incorporate conversational memory and feedback loops for iterative improvement.
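
To illustrate how graph relationships could widen a search, here is a small sketch under loud assumptions. A plain dictionary stands in for the Neo4j/NIFSTD relations, simple keyword matching stands in for the ElasticSearch semantic search, and the function names and example strings are made up for illustration; they are not real KnowledgeSpace records.

```python
def expand_terms(query, related):
    """Return the query words plus any terms one hop away in the relation map."""
    words = query.lower().split()
    expanded = list(words)
    for word in words:
        for term in related.get(word, []):
            if term not in expanded:
                expanded.append(term)
    return expanded


def rank_documents(query, docs, related):
    """Rank documents by how many expanded terms they mention, best first."""
    terms = expand_terms(query, related)
    scored = []
    for doc in docs:
        text = doc.lower()
        score = sum(1 for term in terms if term in text)
        if score:  # keep only documents that match at least one term
            scored.append((score, doc))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in scored]


# Hypothetical relation, mirroring the "hippocampus" -> memory example above.
related = {"hippocampus": ["memory"]}
# Illustrative strings only.
docs = [
    "Spatial memory task recordings in rats",
    "Retina cell morphology atlas",
]
results = rank_documents("hippocampus datasets", docs, related)
```

Without the expansion step neither document mentions a query word; with it, the first document is found through the related term "memory". The retrieved documents would then be passed to the LLM as grounding context.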

Relevant Project Experience
My prior work demonstrates alignment with this project’s scope:

  • XCELERATE (Autonomous Vehicle Project): Built a CNN for traffic sign recognition (98% accuracy) and integrated YOLOv8 for object tracking; these skills are transferable to dataset handling and model optimization for RAG.
  • ExpenFlow (AI Expense Management): Engineered an OCR-LangChain pipeline for data extraction and LLM-driven policy validation, showcasing expertise in NLP workflows.
  • Decora (Virtual Interior Designer): Leveraged React and Three.js for UI development and web scraping for data aggregation, both relevant to KnowledgeSpace’s UI and metadata integration needs.

Can you guide me on the next best steps to be taken or any prior tasks to be done for this project?

Thank you for your time—I am eager to contribute to this transformative project!

Best regards,
Viraj Vora
virajvora2409@gmail.com | LinkedIn | GitHub

Greetings @visakh and @TomGillespie
I’m Vijayavallabh, a second-year B.Tech Biological Engineering student minoring in AI at IIT Madras, India. I am interested in contributing to the “Build an AI Agent for KnowledgeSpace using RAG” project. My experience with AI agents and RAG includes leading a team of 8 in developing SECure RAG (Inter-IIT Tech Meet 13.0), a patent-pending dynamic agentic RAG framework for financial documents such as SEC 10-K filings. The system integrates a novel state-machine-based Multi-HyDE retrieval and Explainable AI components, and was benchmarked against GPT-4o and Gemini-1.5-flash. The work involved advanced PDF parsing, conversational and long-term memory integration using ZEP, and dynamic data handling with Pathway’s vector store.

I am looking forward to connecting with the mentors to discuss the potential contribution towards the project and am excited about the summer of learning ahead.

Hi everyone

I am Aru Sharma, a third-year engineering undergrad from India.
I was an LFX mentee last term at CNCF WasmEdge, where I built a RAG-based application using open-source models. We used Qdrant as the vector database, and for faster indexing we compiled to a native WasmEdge binary. I have also worked with UC Berkeley on a project called DocETL to improve responses from LLMs for a given context, and I am creating a RAG-based security bot for clouddefense.ai. Based on my experience with tools like LlamaIndex, LangChain, and various vector databases, I think it’s best to write everything from scratch, as it will give us more control over the flow. I am looking forward to contributing to the project.

Thanks
Aru Sharma
Github

Intro:

Hey @visakh , my name is Soham Shah, a sophomore at VJTI. I have hands-on experience in the MERN stack, deep learning models, and Retrieval-Augmented Generation (RAG). I previously worked on ChromaSight, a project aimed at assisting visually impaired individuals using Vision Transformers for image descriptions and YOLOv11 for object detection. Additionally, I have experience building client-side AI Agents using RAGs and WebLLM, along with proficiency in Docker for deployment.

Question:

I wanted to check in on the current progress of the RAG-based AI agent for KnowledgeSpace. You previously mentioned that work on RAG has already started—does that mean the data has already been web scraped, converted into a knowledge graph, and integrated with RAG, and now only the chat interface and conversational memory are left to be implemented? Or do we still need to convert the data into a knowledge graph? Could you provide an update on the current progress and any revised deliverables for the proposal and the project? Thanks!

Hi @visakh

For this project, do you see any significant advantage in using a frontend tech stack like Node.js, or would it be sufficient to go with a built-in UI framework like Streamlit or Gradio?

Greetings @visakh 👋

I’m Mrityunjay Kukreti, an AI/ML enthusiast eager to contribute to the AI Agent for KnowledgeSpace using RAG project for Google Summer of Code (GSoC) 2025.

Why I’m Interested

KnowledgeSpace plays a crucial role in making neuroscience knowledge more accessible, and I’m excited about the potential of AI-powered retrieval systems. Using Retrieval-Augmented Generation (RAG), we can create a smart agent that helps users quickly find relevant neuroscience data.

My Skills & Motivation

  • AI/ML & NLP Enthusiast – I’ve been actively learning and working on AI/ML projects, including Python-based AI models and NLP applications.
  • Eager to Work with New Technologies – While I haven’t worked with Neo4j and ElasticSearch before, I’m highly motivated to learn and apply them to this project.
  • Open-Source Contributor – I’ve been exploring open-source development and am excited about the opportunity to contribute meaningfully to KnowledgeSpace.

Next Steps

I’m currently reviewing the KnowledgeSpace documentation and exploring how RAG can be effectively implemented for this use case. I’d love to hear from mentors and the community on how I can best get started! Looking forward to collaborating and learning. :rocket:

Would appreciate any guidance on where to begin! :blush: