Hi Visakh,
Would you prefer using something like FastAPI or Flask to create a backend for the chatbot, or would you rather bundle the entire functionality into the Streamlit or Gradio code? I have experience building an end-to-end RAG-based chatbot with LangChain, and I think I can contribute to building this one. Thanks in advance!
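To make the first option concrete, here is a minimal sketch of a standalone FastAPI backend; the /chat route, request model, and answer_query() helper are hypothetical placeholders, not existing project code:

```python
# Minimal sketch of a standalone chat backend (hypothetical names throughout).
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class ChatRequest(BaseModel):
    session_id: str
    message: str

def answer_query(message: str) -> str:
    # Placeholder for the actual RAG call (retrieve chunks, prompt the LLM).
    return f"(stub) answer for: {message}"

@app.post("/chat")
def chat(req: ChatRequest):
    return {"session_id": req.session_id, "answer": answer_query(req.message)}
```

A Streamlit or Gradio front end could then call this endpoint over HTTP, or the same answer_query() function could be imported directly if we bundle everything into one app.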
I'm Abhishek, currently pursuing my Master's in Computer Engineering with a focus on Machine Learning at Virginia Tech. Before starting grad school, I worked as a Software Engineer at an ML-focused firm for two years. My experience includes building APIs and developing user interfaces, particularly for chatbot applications.
I came across the "Build an AI Agent for KnowledgeSpace using RAG" project and found it closely aligned with my interests and experience. I'm currently involved in a research project focused on developing a reasoning model using Retrieval-Augmented Generation (RAG), and I believe I could meaningfully contribute to this initiative.
I've attached my resume and portfolio for your reference.
My name is Mohamed Awad, a fourth-year undergraduate student at CUNY Queens College studying computer science in New York City. I have a strong interest in LLMs, machine learning, and NLP. I have demonstrated proficiency in Python, with hands-on experience in ML and RAG and with Python libraries such as NumPy, pandas, and PyTorch. Recently, I did ML research in which I analyzed thousands of images to classify signs of skin cancer. My skill set also includes web development with tools like JavaScript, React, Node.js, Git, REST APIs, AWS, and Firebase.
I came across this project, "Build an AI Agent for KnowledgeSpace using RAG," and I found it interesting that in a world of LLMs, the models themselves need to be accurate, fast, and up to date with the relevant context. With my knowledge of RAG, NLP, and LLMs, I believe I can make meaningful contributions to this project and help make neuroscience data structured and accessible to community members.
I am committed to dedicating at least 350 hours to this project, given my background in AI and web development. I have developed a strong ability to learn quickly and adapt to new technologies, which allows me to make contributions that benefit the entire KnowledgeSpace community. I'm wondering if you could point me to any resources to get started before getting into the nitty-gritty of this project.
I'm currently a third-year computer science PhD student at UMass Boston, researching bias mitigation and fairness enhancement in Recommender Systems (RS). In my research, I'm also experimenting with LLMs to open up the black-box nature of RS. I wanted to put my summer learning to use and contribute meaningfully, and I found the "Build an AI Agent for KnowledgeSpace using RAG" project interesting and close to my research and interests.
I've worked with RAG and chatbots in past academic projects and also got to explore AI agents through hackathons at the MIT/Harvard club. Through these hackathons, I came across a tool called Maestro by AI21 Labs. I believe the problem of creating an AI agent for KnowledgeSpace would need a similar approach, where the agent dynamically plans the task based on the user query and makes an execution plan to retrieve the correct data effectively.
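As a toy illustration of that planning idea (all helper names and routing rules here are made up for illustration, not anything from Maestro or the existing codebase):

```python
# Toy sketch of query planning: decide which retrieval actions to run,
# then execute them in order. Every helper here is a hypothetical stub.
def lookup_definition(query: str) -> str:
    return f"(stub) ontology definition for: {query}"

def search_datasets(query: str) -> str:
    return f"(stub) relevant datasets for: {query}"

def plan(query: str) -> list:
    """Build an execution plan based on what the user seems to be asking."""
    steps = []
    if any(w in query.lower() for w in ("what is", "define", "definition of")):
        steps.append(lookup_definition)   # concept definition from the ontology
    steps.append(search_datasets)         # always look for relevant datasets
    return steps

def run(query: str) -> list:
    return [step(query) for step in plan(query)]

print(run("What is the hippocampus, and which datasets cover it?"))
```

A real planner would let the LLM itself choose and order the steps, but the overall structure (plan first, then execute) stays the same.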
Additionally, I read that you are looking for someone to build the chat interface with conversational memory. I have worked in web development in the past and can create a chat interface similar to ChatGPT, DeepSeek, etc. Also, as part of the academic project where I worked on a RAG-based chatbot, we built a conversational pipeline that saves the conversation history and tailors the LLM response to it.
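To give a rough idea of the kind of memory pipeline I mean, here is a minimal sketch; the window size and prompt layout are illustrative choices, not a proposal for the final design:

```python
# Minimal sketch of conversational memory: keep a rolling window of recent
# turns and prepend it to each new prompt. Window size and prompt format
# are illustrative only.
from collections import deque

class ChatMemory:
    def __init__(self, max_turns: int = 6):
        self.turns = deque(maxlen=max_turns)  # (role, text) pairs

    def add(self, role: str, text: str):
        self.turns.append((role, text))

    def as_prompt(self, new_question: str) -> str:
        history = "\n".join(f"{role}: {text}" for role, text in self.turns)
        return f"{history}\nuser: {new_question}\nassistant:"

memory = ChatMemory()
memory.add("user", "Which datasets cover the hippocampus?")
memory.add("assistant", "Several EEG and fMRI datasets mention the hippocampus.")
# The stored history gives "those" its referent in the follow-up question.
print(memory.as_prompt("Which of those are EEG?"))
```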
I would love to get this conversation going and am looking forward to your response.
PS: attaching my resume for you to get a better understanding of my skills. Resume
I'm Rishika, a third-year CSE undergrad with a strong interest in AI and backend development. I recently built a WhatsApp chatbot for the 2025 Indian elections as part of a government-backed project, handling large-scale queries on AWS with NLP-based retrieval and optimization techniques. Through that, I worked on efficient retrieval methods, prompt engineering for factual accuracy, and scalable caching strategies, which directly align with the challenges in this project.
I know I'm joining the discussion a bit late, but the work you're doing with RAG and conversational memory in KnowledgeSpace is incredibly exciting! I had a quick question regarding the retrieval pipeline:
Are we looking at a hybrid memory approach, combining vector search (FAISS/Qdrant) with structured storage for long-term context, or are we optimizing more for short-term recall with token-window-based solutions? Also, given the neuroscience domain, what trade-offs are we considering in retrieval: speed, storage constraints, or something else?
Really looking forward to your insights and excited to contribute.
The datasets are metadata from large neuroscience data repositories, together with curated ontologies from NIFSTD.
The agent can provide concept definitions and return results on relevant datasets. The metadata schema for each dataset/source is not standardised, but the LLM can be quite useful for retrieving relevant information there.
I hope you're doing well. I'm writing to express my strong interest in the GSoC 2025 project: "Build an AI Agent for KnowledgeSpace using RAG."
Thank you for the recent update clarifying that the RAG backend is already underway. I'm genuinely excited to hear that! Based on your guidance, I've begun exploring the development of a chat interface with conversational memory, which will integrate with the existing RAG system and enhance KnowledgeSpace's usability.
Over the past few days, I've:
Researched and experimented with frontend chat UIs using React
Explored how conversational memory can be built using LangChain concepts (e.g., buffer memory and context windows; a rough sketch follows this list)
Studied how vector stores and graph databases like Neo4j might support long-term memory or structured metadata retrieval
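Here is that sketch, assuming the classic langchain.memory API (newer LangChain releases are moving this kind of memory to other abstractions, so treat it as illustrative):

```python
# Rough sketch of buffer-window memory with the classic langchain.memory API.
from langchain.memory import ConversationBufferWindowMemory

# Keep only the last k exchanges in the prompt context (a sliding window).
memory = ConversationBufferWindowMemory(k=4, return_messages=True)

memory.save_context(
    {"input": "What is the NIFSTD ontology?"},
    {"output": "NIFSTD is a set of curated neuroscience ontologies used by KnowledgeSpace."},
)
memory.save_context(
    {"input": "Which datasets use it?"},
    {"output": "Several repositories annotate their metadata with NIFSTD terms."},
)

# These messages would be prepended to the next prompt sent to the RAG backend.
print(memory.load_memory_variables({}))
```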
I understand that metadata across datasets is not standardized, and I see the huge potential of using LLMs to bridge this gap by providing accurate, contextual answers through a conversational experience. I'm also beginning to experiment with model deployment on Vertex AI, as mentioned in the project scope.
Attached is a mockup that showcases my early thoughts and interface concept. I'd love your feedback and would be happy to iterate or prototype further based on your suggestions.
Understanding Metadata Structure in KnowledgeSpace
To effectively enable contextual responses and relevant dataset retrieval in the AI agent, I studied the existing metadata structuring pipeline. The diagram below outlines how dataset provenance, measurement device data, and results are formalized using OWL ontologies and stored in an ontology repository. My proposed conversational UI will interface with this layer to provide accurate, ontology-powered search and interaction.
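As a small illustration of what "interfacing with this layer" could look like, here is a sketch that loads a local copy of an ontology file with rdflib and looks up a term's label and definition; the file name and the use of skos:definition are assumptions, since the actual NIFSTD annotation properties may differ:

```python
# Illustrative only: query a locally exported ontology file for a concept's
# label and definition. File name and annotation properties are assumptions.
from rdflib import Graph

g = Graph()
g.parse("nifstd_subset.owl")  # hypothetical local export of the ontology

QUERY = """
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
SELECT ?term ?label ?definition WHERE {
    ?term rdfs:label ?label .
    OPTIONAL { ?term skos:definition ?definition }
    FILTER(CONTAINS(LCASE(STR(?label)), "hippocampus"))
}
"""

for term, label, definition in g.query(QUERY):
    print(term, label, definition)
```

The conversational UI would call something like this (or a hosted SPARQL endpoint) so that concept definitions are backed by the ontology rather than by the LLM alone.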
I'm currently drafting my full GSoC proposal and would be honored to collaborate on this project under your mentorship.
I am Tao He, currently an MSCS Align student at Northeastern University (Boston). I'm incredibly excited about the KnowledgeSpace RAG project and would love to contribute as a GSoC contributor.
What draws me to this project is its rare intersection of AI, knowledge retrieval, and neuroscience, areas I'm now deeply passionate about. I've worked on multiple end-to-end NLP and ML projects, including:
A comparative time-series modeling study (ARIMA vs. SVR) published with Taylor & Francis
A document-level sentiment classification project using an improved BiLSTM with attention (accepted at ICDSE 2024)
A solo-built full-stack prompt-sharing platform, Promptllery, using React + Supabase, demonstrating my frontend/backend and user-centric design skills
Within this project, I would be especially keen to work on:
UI/UX for querying and interacting with neuroscience data
Fine-tuning LLM outputs for clarity and domain relevance in the RAG pipeline
As someone transitioning from economics and public policy into CS and AI, I bring not only technical curiosity but also a deep respect for the complexity of domain knowledge, especially in fields like neuroscience.
I'd love the opportunity to dive deeper into the problem scope and propose a concrete implementation plan if selected. Thank you for considering my interest!
Hello @visakh
I'm Rishee, a Computer Science undergrad with a strong interest in LLMs, multi-agent systems, and Retrieval-Augmented Generation (RAG). I've worked on projects using Gemini, Pinecone, and LangGraph, and built custom logic with LangChain and Hugging Face. I'm comfortable with Python and have experience in prompt engineering, RAG evaluation, and LLM observability.
Recently, I built a RAG pipeline without using any orchestration tools, implementing everything from scratch: chunking, embedding, vector search, and response generation. I've documented the process in this blog post. I also created a context-aware medical chatbot using Gemini and Pinecone, with a full-stack deployment via Flask. link here
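For a sense of what the from-scratch retrieval step looked like, here is a bare-bones sketch; the sentence-transformers model name and the toy chunks are just example choices, not what KnowledgeSpace uses:

```python
# Bare-bones retrieval without an orchestration framework: embed chunks,
# embed the query, rank by cosine similarity. Model name and chunks are
# example choices only.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

chunks = [
    "NIFSTD provides curated ontologies for neuroscience concepts.",
    "KnowledgeSpace aggregates metadata from several neuroscience data repositories.",
    "EEG datasets often report sampling rate and electrode montage.",
]
chunk_vecs = model.encode(chunks, normalize_embeddings=True)

def retrieve(query: str, k: int = 2):
    q = model.encode([query], normalize_embeddings=True)[0]
    scores = chunk_vecs @ q  # cosine similarity, since vectors are normalized
    top = np.argsort(scores)[::-1][:k]
    return [(chunks[i], float(scores[i])) for i in top]

print(retrieve("Where does KnowledgeSpace get its metadata?"))
```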
I would love to contribute to issues and pick up tasks to get more familiar with the project. It would be really helpful if you could guide me.
I know I'm a bit late to join the party, but I'm willing to give my best to this project.
My name is Navanish Pandey, and I'm interested in contributing to the KnowledgeSpace Chat Agent project under INCF.
I have already set up the full project locally (backend + frontend + API integration), and I want to start improving it by fixing issues and adding useful features.
Before I begin, I wanted to ask:
Who is the current mentor or maintainer for this project?
Is there a preferred way to reach out to them for discussing ideas, issues, or improvements?
Is the project still active and open for contributions outside the GSoC period?
I'm genuinely interested in contributing to this project long-term and preparing myself for GSoC 2026 with INCF.
Any guidance from your side will really help me get started in the right direction.
My name is Rudra. I am a Full-Stack AI Developer specializing in the FastAPI + React + LangChain stack.
I am highly interested in Project #20 (Building an AI Agent). My end-semester exams run until December, but I wanted to share a quick proof-of-concept I built today to test the data retrieval architecture.
Hybrid Stack: A React frontend connected to a Python/FastAPI backend (simulated).
Agentic Chunking: Instead of naive character splitting, I wrote a custom ingestion pipeline that detects scientific headers (e.g., "Pathophysiology", "Methods") and uses "sticky metadata" to ensure every chunk retains its context (a rough sketch follows this list).
Citation-Ready: The system retrieves specific page numbers and section names to prevent hallucinations.
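Here is that sketch of the header-aware chunking idea; the header list, regex, and chunk size are toy choices standing in for the actual detection logic:

```python
# Toy sketch of header-aware chunking: split on known section headers and
# attach the section name as "sticky" metadata to every chunk. The header
# list and chunk size are illustrative only.
import re

HEADERS = ("Abstract", "Methods", "Results", "Pathophysiology", "Discussion")
HEADER_RE = re.compile(rf"^({'|'.join(HEADERS)})\s*$", re.MULTILINE)

def chunk_with_sections(text: str, max_chars: int = 300):
    parts = HEADER_RE.split(text)  # [preamble, header, body, header, body, ...]
    sections = [("Preamble", parts[0])] + list(zip(parts[1::2], parts[2::2]))
    chunks = []
    for section, body in sections:
        body = body.strip()
        for start in range(0, len(body), max_chars):
            chunks.append({
                "section": section,  # sticky metadata carried by every chunk
                "text": body[start:start + max_chars],
            })
    return chunks

doc = "Methods\nWe recorded EEG from 20 subjects.\nResults\nAlpha power increased."
for c in chunk_with_sections(doc):
    print(c["section"], "->", c["text"])
```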
I will be fully available to start contributing code and discussing the Neo4j + LangGraph integration starting in December.
Best regards, Rudra
My name is Pratik Dhaktode, a second-year Computer Engineering student at PICT, Pune. I have a strong background in Deep Learning, Applied Maths, and LLM Agents, primarily working with Python and C++.
I am writing to express my strong interest in contributing to the KnowledgeSpace AI Agent. I have significant experience building RAG systems and recently won the Smart India Hackathon (SIH) solving a problem statement for ISRO, which gave me rigorous experience in delivering scalable AI solutions.
My Relevant Experience:
OceanGPT: Built a RAG + LLM chatbot with conversational memory (similar to the goals of Project #20).
Agentic AI: Developed a high-scale fraud detection system using Agentic workflows and Python microservices.
Tech Stack: PyTorch, LangChain, FAISS, and Vector Databases.
Current Progress & Proposal: I have already set up the knowledge-space-agent repository locally and run the ingestion pipeline to understand how you are currently chunking and indexing data.
I noticed the current implementation relies heavily on vector search. Based on my experience with OceanGPT, I believe implementing a Hybrid Search (Vector + Keyword/BM25) approach could significantly reduce hallucinations when retrieving specific dataset names or technical terms.
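To illustrate the kind of fusion I have in mind, here is a toy sketch using rank_bm25 for the keyword side and Reciprocal Rank Fusion to combine it with dense scores; the documents and the dense similarities are made-up stand-ins for whatever the existing vector index returns:

```python
# Toy hybrid retrieval: BM25 keyword scores fused with dense vector scores
# via Reciprocal Rank Fusion (RRF). Documents and dense scores are made up.
import numpy as np
from rank_bm25 import BM25Okapi

docs = [
    "Allen Brain Atlas single-cell RNA-seq metadata",
    "NeuroMorpho.Org neuron reconstruction dataset",
    "OpenNeuro EEG dataset on working memory",
]
bm25 = BM25Okapi([d.lower().split() for d in docs])
bm25_scores = bm25.get_scores("neuromorpho neuron reconstructions".split())

dense_scores = np.array([0.41, 0.78, 0.35])  # placeholder cosine similarities

def rrf(*score_lists, k: int = 60):
    # Rank documents within each score list, then sum 1 / (k + rank + 1).
    fused = np.zeros(len(score_lists[0]))
    for scores in score_lists:
        ranks = np.argsort(np.argsort(-np.asarray(scores)))  # 0 = best
        fused += 1.0 / (k + ranks + 1)
    return fused

for i in np.argsort(-rrf(bm25_scores, dense_scores)):
    print(docs[i])
```

Exact keyword matches on dataset names then survive even when the embedding similarity is weak, which is where I would expect the reduction in hallucinated retrievals to come from.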
I would love to work on this or any other priority tasks. Is there a specific branch or issue you recommend I tackle for my first PR?