Hi Visakh,
Would you prefer using something like FastAPI or Flask to create a backend for the chatbot, or would you rather bundle the entire functionality into the Streamlit or Gradio code? I have experience building an end-to-end RAG-based chatbot with LangChain, and I think I can contribute to building this one. Thanks in advance!
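To make the first option concrete, here is a minimal sketch of a standalone FastAPI backend; the /chat route, request model, and answer_query() helper are hypothetical placeholders, not existing project code:

```python
# Minimal sketch of a standalone chat backend (hypothetical names throughout).
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class ChatRequest(BaseModel):
    session_id: str
    message: str

def answer_query(message: str) -> str:
    # Placeholder for the actual RAG call (retrieve chunks, prompt the LLM).
    return f"(stub) answer for: {message}"

@app.post("/chat")
def chat(req: ChatRequest):
    return {"session_id": req.session_id, "answer": answer_query(req.message)}
```

A Streamlit or Gradio front end could then call this endpoint over HTTP, or the same answer_query() function could be imported directly if we bundle everything into one app.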
I'm Abhishek, currently pursuing my Master's in Computer Engineering with a focus on Machine Learning at Virginia Tech. Before starting grad school, I worked as a Software Engineer at an ML-focused firm for two years. My experience includes building APIs and developing user interfaces, particularly for chatbot applications.
I came across the "Build an AI Agent for KnowledgeSpace using RAG" project and found it closely aligned with my interests and experience. I'm currently involved in a research project focused on developing a reasoning model using Retrieval-Augmented Generation (RAG), and I believe I could meaningfully contribute to this initiative.
I've attached my resume and portfolio for your reference.
My name is Mohamed Awad, a fourth-year undergraduate student at CUNY Queens College studying computer science in New York City. I have a strong interest in LLMs, machine learning, and NLP. I have demonstrated proficiency in Python, with hands-on experience in ML and RAG and with Python libraries such as NumPy, pandas, and PyTorch. Recently, I did ML research in which I analyzed thousands of images to classify signs of skin cancer. My skill set also includes web development with tools like JavaScript, React, Node.js, Git, REST APIs, AWS, and Firebase.
I came across this project, "Build an AI Agent for KnowledgeSpace using RAG," and I found it interesting that in a world of LLMs, the models themselves need to be accurate, fast, and up to date with the relevant context. With my knowledge of RAG, NLP, and LLMs, I believe I can make meaningful contributions to this project and help make neuroscience data structured and accessible to community members.
I am committed to dedicating at least 350 hours to this project, given my background in AI and web development. I have developed a strong ability to learn quickly and adapt to new technologies, which allows me to make contributions that benefit the entire KnowledgeSpace community. I'm wondering if you could point me to any resources to get started before getting into the nitty-gritty of this project.
I'm currently a third-year computer science PhD student at UMass Boston, researching bias mitigation and fairness enhancement in Recommender Systems (RS). In my research, I'm also experimenting with LLMs to open up the black-box nature of RS. I wanted to put my summer learning to use and contribute meaningfully, and I found the "Build an AI Agent for KnowledgeSpace using RAG" project interesting and close to my research and interests.
I've worked with RAG and chatbots in past academic projects and also got to explore AI agents through hackathons at the MIT/Harvard club. Through these hackathons, I came across a tool called Maestro by AI21 Labs. I believe the problem of creating an AI agent for KnowledgeSpace would need a similar approach, where the agent dynamically plans the task based on the user query and makes an execution plan to retrieve the correct data effectively.
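As a toy illustration of that planning idea (all helper names and routing rules here are made up for illustration, not anything from Maestro or the existing codebase):

```python
# Toy sketch of query planning: decide which retrieval actions to run,
# then execute them in order. Every helper here is a hypothetical stub.
def lookup_definition(query: str) -> str:
    return f"(stub) ontology definition for: {query}"

def search_datasets(query: str) -> str:
    return f"(stub) relevant datasets for: {query}"

def plan(query: str) -> list:
    """Build an execution plan based on what the user seems to be asking."""
    steps = []
    if any(w in query.lower() for w in ("what is", "define", "definition of")):
        steps.append(lookup_definition)   # concept definition from the ontology
    steps.append(search_datasets)         # always look for relevant datasets
    return steps

def run(query: str) -> list:
    return [step(query) for step in plan(query)]

print(run("What is the hippocampus, and which datasets cover it?"))
```

A real planner would let the LLM itself choose and order the steps, but the overall structure (plan first, then execute) stays the same.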
Additionally, I read that you are looking for someone to build the chat interface with conversational memory. I have worked in web development in the past and can create a chat interface similar to ChatGPT, DeepSeek, etc. Also, as part of the academic project where I worked on a RAG-based chatbot, we built a conversational pipeline that saves the conversation history and tailors the LLM response to it.
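To give a rough idea of the kind of memory pipeline I mean, here is a minimal sketch; the window size and prompt layout are illustrative choices, not a proposal for the final design:

```python
# Minimal sketch of conversational memory: keep a rolling window of recent
# turns and prepend it to each new prompt. Window size and prompt format
# are illustrative only.
from collections import deque

class ChatMemory:
    def __init__(self, max_turns: int = 6):
        self.turns = deque(maxlen=max_turns)  # (role, text) pairs

    def add(self, role: str, text: str):
        self.turns.append((role, text))

    def as_prompt(self, new_question: str) -> str:
        history = "\n".join(f"{role}: {text}" for role, text in self.turns)
        return f"{history}\nuser: {new_question}\nassistant:"

memory = ChatMemory()
memory.add("user", "Which datasets cover the hippocampus?")
memory.add("assistant", "Several EEG and fMRI datasets mention the hippocampus.")
# The stored history gives "those" its referent in the follow-up question.
print(memory.as_prompt("Which of those are EEG?"))
```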
I would love to get this conversation going and am looking forward to your response.
PS: attaching my resume for you to get a better understanding of my skills. Resume
I'm Rishika, a third-year CSE undergrad with a strong interest in AI and backend development. I recently built a WhatsApp chatbot for the 2025 Indian elections as part of a government-backed project, handling large-scale queries on AWS with NLP-based retrieval and optimization techniques. Through that, I worked on efficient retrieval methods, prompt engineering for factual accuracy, and scalable caching strategies, which directly align with the challenges in this project.
I know I'm joining the discussion a bit late, but the work you're doing with RAG and conversational memory in KnowledgeSpace is incredibly exciting! I had a quick question regarding the retrieval pipeline:
Are we looking at a hybrid memory approach, combining vector search (FAISS/Qdrant) with structured storage for long-term context, or are we optimizing more for short-term recall with token-window-based solutions? Also, given the neuroscience domain, what trade-offs are we considering in retrieval: speed, storage constraints, or something else?
Really looking forward to your insights and excited to contribute.
The datasets are metadata from large neuroscience data repositories, together with curated ontologies from NIFSTD.
The agent can provide concept definitions and return results on relevant datasets. The metadata schema for each dataset/source is not standardised, but the LLM can be quite useful for retrieving relevant information there.
I hope you're doing well. I'm writing to express my strong interest in the GSoC 2025 project: "Build an AI Agent for KnowledgeSpace using RAG."
Thank you for the recent update clarifying that the RAG backend is already underway. I'm genuinely excited to hear that! Based on your guidance, I've begun exploring the development of a chat interface with conversational memory, which will integrate with the existing RAG system and enhance KnowledgeSpace's usability.
Over the past few days, I've:
Researched and experimented with frontend chat UIs using React
Explored how conversational memory can be built using LangChain concepts (e.g., buffer memory and context windows; a rough sketch follows this list)
Studied how vector stores and graph databases like Neo4j might support long-term memory or structured metadata retrieval
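Here is that sketch, assuming the classic langchain.memory API (newer LangChain releases are moving this kind of memory to other abstractions, so treat it as illustrative):

```python
# Rough sketch of buffer-window memory with the classic langchain.memory API.
from langchain.memory import ConversationBufferWindowMemory

# Keep only the last k exchanges in the prompt context (a sliding window).
memory = ConversationBufferWindowMemory(k=4, return_messages=True)

memory.save_context(
    {"input": "What is the NIFSTD ontology?"},
    {"output": "NIFSTD is a set of curated neuroscience ontologies used by KnowledgeSpace."},
)
memory.save_context(
    {"input": "Which datasets use it?"},
    {"output": "Several repositories annotate their metadata with NIFSTD terms."},
)

# These messages would be prepended to the next prompt sent to the RAG backend.
print(memory.load_memory_variables({}))
```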
I understand that metadata across datasets is not standardized, and I see the huge potential of using LLMs to bridge this gap by providing accurate, contextual answers through a conversational experience. I'm also beginning to experiment with model deployment on Vertex AI, as mentioned in the project scope.
Attached is a mockup that showcases my early thoughts and interface concept. I'd love your feedback and would be happy to iterate or prototype further based on your suggestions.
Understanding Metadata Structure in KnowledgeSpace
To effectively enable contextual responses and relevant dataset retrieval in the AI agent, I studied the existing metadata structuring pipeline. The diagram below outlines how dataset provenance, measurement device data, and results are formalized using OWL ontologies and stored in an ontology repository. My proposed conversational UI will interface with this layer to provide accurate, ontology-powered search and interaction.
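As a small illustration of what "interfacing with this layer" could look like, here is a sketch that loads a local copy of an ontology file with rdflib and looks up a term's label and definition; the file name and the use of skos:definition are assumptions, since the actual NIFSTD annotation properties may differ:

```python
# Illustrative only: query a locally exported ontology file for a concept's
# label and definition. File name and annotation properties are assumptions.
from rdflib import Graph

g = Graph()
g.parse("nifstd_subset.owl")  # hypothetical local export of the ontology

QUERY = """
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
SELECT ?term ?label ?definition WHERE {
    ?term rdfs:label ?label .
    OPTIONAL { ?term skos:definition ?definition }
    FILTER(CONTAINS(LCASE(STR(?label)), "hippocampus"))
}
"""

for term, label, definition in g.query(QUERY):
    print(term, label, definition)
```

The conversational UI would call something like this (or a hosted SPARQL endpoint) so that concept definitions are backed by the ontology rather than by the LLM alone.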
I'm currently drafting my full GSoC proposal and would be honored to collaborate on this project under your mentorship.
I am Tao He, currently an MSCS Align student at Northeastern University (Boston). I'm incredibly excited about the KnowledgeSpace RAG project and would love to contribute as a GSoC contributor.
What draws me to this project is its rare intersection of AI, knowledge retrieval, and neuroscience, areas I'm now deeply passionate about. I've worked on multiple end-to-end NLP and ML projects, including:
A comparative time-series modeling study (ARIMA vs. SVR) published with Taylor & Francis
A document-level sentiment classification project using an improved BiLSTM with attention (accepted at ICDSE 2024)
A solo-built full-stack prompt-sharing platform, Promptllery, using React + Supabase, demonstrating my frontend/backend and user-centric design skills
Within this project, I would be especially keen to work on:
UI/UX for querying and interacting with neuroscience data
Fine-tuning LLM outputs for clarity and domain relevance in the RAG pipeline
As someone transitioning from economics and public policy into CS and AI, I bring not only technical curiosity but also a deep respect for the complexity of domain knowledge, especially in fields like neuroscience.
I'd love the opportunity to dive deeper into the problem scope and propose a concrete implementation plan if selected. Thank you for considering my interest!
Hello @visakh
I'm Rishee, a Computer Science undergrad with a strong interest in LLMs, multi-agent systems, and Retrieval-Augmented Generation (RAG). I've worked on projects using Gemini, Pinecone, and LangGraph, and built custom logic with LangChain and Hugging Face. I'm comfortable with Python and have experience in prompt engineering, RAG evaluation, and LLM observability.
Recently, I built a RAG pipeline without using any orchestration tools, implementing everything from scratch: chunking, embedding, vector search, and response generation. I've documented the process in this blog post. I also created a context-aware medical chatbot using Gemini and Pinecone, with a full-stack deployment via Flask. link here
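For a sense of what the from-scratch retrieval step looked like, here is a bare-bones sketch; the sentence-transformers model name and the toy chunks are just example choices, not what KnowledgeSpace uses:

```python
# Bare-bones retrieval without an orchestration framework: embed chunks,
# embed the query, rank by cosine similarity. Model name and chunks are
# example choices only.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

chunks = [
    "NIFSTD provides curated ontologies for neuroscience concepts.",
    "KnowledgeSpace aggregates metadata from several neuroscience data repositories.",
    "EEG datasets often report sampling rate and electrode montage.",
]
chunk_vecs = model.encode(chunks, normalize_embeddings=True)

def retrieve(query: str, k: int = 2):
    q = model.encode([query], normalize_embeddings=True)[0]
    scores = chunk_vecs @ q  # cosine similarity, since vectors are normalized
    top = np.argsort(scores)[::-1][:k]
    return [(chunks[i], float(scores[i])) for i in top]

print(retrieve("Where does KnowledgeSpace get its metadata?"))
```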
I would love to contribute to issues and pick up tasks to get more familiar with the project. It would be really helpful if you could guide me.
I know I'm a bit late to join the party, but I'm willing to give my best to this project.
My name is Navanish Pandey, and I'm interested in contributing to the KnowledgeSpace Chat Agent project under INCF.
I have already set up the full project locally (backend + frontend + API integration), and I want to start improving it by fixing issues and adding useful features.
Before I begin, I wanted to ask:
Who is the current mentor or maintainer for this project?
Is there a preferred way to reach out to them for discussing ideas, issues, or improvements?
Is the project still active and open for contributions outside the GSoC period?
I'm genuinely interested in contributing to this project long-term and preparing myself for GSoC 2026 with INCF.
Any guidance from your side will really help me get started in the right direction.
My name is Rudra. I am a Full-Stack AI Developer specializing in the FastAPI + React + LangChain stack.
I am highly interested in Project #20 (Building an AI Agent). My end-semester exams run until December, but I wanted to share a quick proof-of-concept I built today to test the data retrieval architecture.
Hybrid Stack: A React frontend connected to a Python/FastAPI backend (simulated).
Agentic Chunking: Instead of naive character splitting, I wrote a custom ingestion pipeline that detects scientific headers (e.g., "Pathophysiology", "Methods") and uses "sticky metadata" to ensure every chunk retains its context (a rough sketch follows this list).
Citation-Ready: The system retrieves specific page numbers and section names to prevent hallucinations.
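Here is that sketch of the header-aware chunking idea; the header list, regex, and chunk size are toy choices standing in for the actual detection logic:

```python
# Toy sketch of header-aware chunking: split on known section headers and
# attach the section name as "sticky" metadata to every chunk. The header
# list and chunk size are illustrative only.
import re

HEADERS = ("Abstract", "Methods", "Results", "Pathophysiology", "Discussion")
HEADER_RE = re.compile(rf"^({'|'.join(HEADERS)})\s*$", re.MULTILINE)

def chunk_with_sections(text: str, max_chars: int = 300):
    parts = HEADER_RE.split(text)  # [preamble, header, body, header, body, ...]
    sections = [("Preamble", parts[0])] + list(zip(parts[1::2], parts[2::2]))
    chunks = []
    for section, body in sections:
        body = body.strip()
        for start in range(0, len(body), max_chars):
            chunks.append({
                "section": section,  # sticky metadata carried by every chunk
                "text": body[start:start + max_chars],
            })
    return chunks

doc = "Methods\nWe recorded EEG from 20 subjects.\nResults\nAlpha power increased."
for c in chunk_with_sections(doc):
    print(c["section"], "->", c["text"])
```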
I will be fully available to start contributing code and discussing the Neo4j + LangGraph integration starting in December.
Best regards, Rudra
My name is Pratik Dhaktode, a second-year Computer Engineering student at PICT, Pune. I have a strong background in Deep Learning, Applied Maths, and LLM Agents, primarily working with Python and C++.
I am writing to express my strong interest in contributing to the KnowledgeSpace AI Agent. I have significant experience building RAG systems and recently won the Smart India Hackathon (SIH) solving a problem statement for ISRO, which gave me rigorous experience in delivering scalable AI solutions.
My Relevant Experience:
OceanGPT: Built a RAG + LLM chatbot with conversational memory (similar to the goals of Project #20).
Agentic AI: Developed a high-scale fraud detection system using Agentic workflows and Python microservices.
Tech Stack: PyTorch, LangChain, FAISS, and Vector Databases.
Current Progress & Proposal: I have already set up the knowledge-space-agent repository locally and run the ingestion pipeline to understand how you are currently chunking and indexing data.
I noticed the current implementation relies heavily on vector search. Based on my experience with OceanGPT, I believe implementing a Hybrid Search (Vector + Keyword/BM25) approach could significantly reduce hallucinations when retrieving specific dataset names or technical terms.
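To illustrate the kind of fusion I have in mind, here is a toy sketch using rank_bm25 for the keyword side and Reciprocal Rank Fusion to combine it with dense scores; the documents and the dense similarities are made-up stand-ins for whatever the existing vector index returns:

```python
# Toy hybrid retrieval: BM25 keyword scores fused with dense vector scores
# via Reciprocal Rank Fusion (RRF). Documents and dense scores are made up.
import numpy as np
from rank_bm25 import BM25Okapi

docs = [
    "Allen Brain Atlas single-cell RNA-seq metadata",
    "NeuroMorpho.Org neuron reconstruction dataset",
    "OpenNeuro EEG dataset on working memory",
]
bm25 = BM25Okapi([d.lower().split() for d in docs])
bm25_scores = bm25.get_scores("neuromorpho neuron reconstructions".split())

dense_scores = np.array([0.41, 0.78, 0.35])  # placeholder cosine similarities

def rrf(*score_lists, k: int = 60):
    # Rank documents within each score list, then sum 1 / (k + rank + 1).
    fused = np.zeros(len(score_lists[0]))
    for scores in score_lists:
        ranks = np.argsort(np.argsort(-np.asarray(scores)))  # 0 = best
        fused += 1.0 / (k + ranks + 1)
    return fused

for i in np.argsort(-rrf(bm25_scores, dense_scores)):
    print(docs[i])
```

Exact keyword matches on dataset names then survive even when the embedding similarity is weak, which is where I would expect the reduction in hallucinated retrievals to come from.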
I would love to work on this or any other priority tasks. Is there a specific branch or issue you recommend I tackle for my first PR?