GSoC 2025 Project #30 UCSD Projects :: LinkML support for the Resource Information Network: unlocking LLM interrogation of a scientific resource and citation knowledge graph (350h)

Mentors: Jeffrey Grethe <jeffrey.grethe@gmail.com> and Anita Bandrowski <abandrowski@health.ucsd.edu>

Skill level: Intermediate or greater

Required skills: Python, Graph Databases (e.g. Neo4J), RESTful APIs

Time commitment: Full time (350 hours)

Forum for discussion

About: The Resource Information Network (RIN) is the largest collection of information about research resources (e.g. cell lines, antibodies, software tools). The RIN has been developed through support of projects like the NIDDK Information Network (https://dknet.org), the Neuroscience Information Framework (https://neuinfo.org), and the Research Resource Identification Initiative (https://rrid.site). Resources within the RIN are uniquely identified by a Research Resource Identifier (RRID). This allows the RIN to aggregate information about the performance of a resource and about who has used a particular resource in a certain context, via mentions of the resource in the scientific literature. All of this data for the RIN is available in JSON format through an Elasticsearch API.

Aims: The primary goal of this project is to generate a knowledge graph from this information that can be interrogated via Large Language Models utilizing Retrieval Augmented Generation (RAG) methods. RAG enhances LLMs by integrating structured knowledge graphs into their query process. Instead of relying solely on pre-trained data, RAG dynamically retrieves relevant information from knowledge graphs, improving accuracy and reducing hallucinations. To transition the JSON data to a formal knowledge graph we will utilize LinkML (https://linkml.io/), a general purpose modeling language that can be used with linked data, JSON, and other formalisms. Steps involved in the project would include: 1) developing a LinkML model of the data sources for the knowledge graph; 2) making a knowledge graph of this data available in Neo4J (e.g. through use of LinkML-Store); 3) inspecting the knowledge graph through the development of some initial Cypher queries; 4) exploring the use of RAG techniques through an LLM; and 5) exploring the use of an open source LLM user interface framework (e.g. Open WebUI, GitHub: open-webui/open-webui).
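
As a rough illustration of steps 1 to 3, the sketch below splits one JSON record into a resource node and its mention edges. It is only a sketch under stated assumptions: the field names (`rrid`, `name`, `type`, `mentions`), the `MENTIONS` edge label, and the sample identifiers are hypothetical placeholders, not the actual RIN record layout, and the dataclasses merely stand in for classes that a LinkML model would define.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Resource:
    """Stand-in for a class that a LinkML schema might define (hypothetical)."""
    rrid: str
    name: str
    resource_type: str


def record_to_graph(record: dict):
    """Split one hypothetical RIN-style record into a node and its edges."""
    resource = Resource(
        rrid=record["rrid"],
        name=record["name"],
        resource_type=record.get("type", "unknown"),
    )
    # One (subject, predicate, object) edge per literature mention.
    edges = [
        (f"PMID:{pmid}", "MENTIONS", resource.rrid)
        for pmid in record.get("mentions", [])
    ]
    return resource, edges


# Made-up record; real RIN JSON will have a different shape.
record = {
    "rrid": "RRID:EXAMPLE_0001",
    "name": "Example antibody",
    "type": "antibody",
    "mentions": ["00000001", "00000002"],
}
resource, edges = record_to_graph(record)
```

Keeping the node and edge construction in one small function makes it easy to swap in the real field names once the JSON has been inspected.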

Websites:

  • Research Resource Identification Initiative (https://rrid.site) website which provides a searchable user interface for resources and related mentions in the literature
  • Biomed Resource Watch is a unique knowledge base for storing validation and performance information about research resources such as antibodies, cell lines, and tools. It is a new service platform that uses Research Resource Identifiers (RRIDs) to aggregate and disseminate known issues and validation information about these resources
  • Resource Information Network APIs (https://docs.scicrunch.io/). This site provides documentation for RIN APIs that are accessible via https://api.scicrunch.io. The APIs provide access to all components of the RIN (i.e. information on resources, their mentions, and performance)
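
Since the RIN data is served through an Elasticsearch API, a lookup by RRID could plausibly be prepared along the following lines. This is a minimal sketch, not the documented API: `SEARCH_PATH` and `RRID_FIELD` are placeholders that must be replaced with values from https://docs.scicrunch.io/, authentication is omitted, and the request is only constructed, never sent.

```python
import json
import urllib.request

API_BASE = "https://api.scicrunch.io"        # base URL given in the thread
SEARCH_PATH = "/PLACEHOLDER_INDEX/_search"   # placeholder: see the API docs
RRID_FIELD = "rrid"                          # placeholder field name


def build_search_request(rrid: str, size: int = 10) -> urllib.request.Request:
    """Build (but do not send) an Elasticsearch-style search request."""
    body = {
        "size": size,
        # Standard Elasticsearch query DSL: a match query on one field.
        "query": {"match": {RRID_FIELD: rrid}},
    }
    return urllib.request.Request(
        API_BASE + SEARCH_PATH,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_search_request("RRID:EXAMPLE_0001")
```

Passing `request` to `urllib.request.urlopen` would send it, once the placeholders and any required credentials are filled in.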

Tech keywords: Knowledge Graph, Retrieval Augmented Generation (RAG), Artificial Intelligence, Large Language Model (LLM), LinkML, Python

Hello, I am Janvi Soni, a second-year undergraduate student at VJTI Mumbai, studying Information Technology. I am passionate about AI and have worked extensively in deep learning, generative AI, and retrieval-based LLMs. I developed a text-to-image generation model from scratch using GANs, VAEs, and Vision Transformers (ViTs) to generate images from text descriptions. I have also built a multimodal RAG-based research paper analyzer that processes and retrieves key insights from scientific documents, analyzing text, tables, graphs, images, and other structured data to improve research accessibility.
I am familiar with graph-based RAG, knowledge graphs, and Neo4J, and am now implementing them to enhance structured retrieval and AI-driven reasoning. This project aligns with my interest in intelligent retrieval systems, focusing on knowledge graphs and RAG methods to enhance LLM accuracy. I look forward to contributing and would love any guidance on how best to move ahead with this project.

Greetings @jgrethe, @anita_bandrowski,
I hope you’re doing well! I am Kinshuk Trivedi, a third-year undergraduate student at VJTI, Mumbai, India. I previously introduced myself via email regarding my interest in contributing to the “LinkML support for the Resource Information Network: unlocking LLM interrogation of a scientific resource and citation knowledge graph” project for GSoC’25. Now that the forum is set up for discussions, I wanted to follow up and better understand how I can start contributing effectively.

I’m excited about working on scientific research data. I’ve been looking into the RRID and Biomed Resource Watch websites and will soon share the knowledge graph-based application I’m currently working on, to demonstrate my understanding of graph databases.

I wanted to know whether this knowledge graph-based LLM (using RAG) is to be built as a separate web application (e.g. with React, Django, etc.) or integrated with an existing web interface.

Do share any updates regarding the tasks to be performed for GSoC’25.

Looking forward to hearing your insights! Thanks in advance for your guidance.

Best Regards,
Kinshuk Trivedi

Since this is part of Google Summer of Code, the timeline below and the application process apply. Work on projects cannot begin until contributor applications have been submitted and formally accepted (accepted projects are announced on May 8th). Reminder of upcoming deadlines:

  1. On March 24th, the GSoC contributor application period begins
  2. April 8 - 18:00 UTC is the deadline for the GSoC contributor application [You MUST SUBMIT AN APPLICATION to be considered for a project]

The Google Summer of Code "Get Started" page is where you will find out more about contributor eligibility and where you would start and submit your application for the program.

This can be a separate interface for the project. There are also some open source LLM interfaces that could be used (e.g. Open WebUI, https://openwebui.com/).

The primary activity for the project is to become familiar with the JSON data available for resources via the API (URL above) and to develop a schema within LinkML that will allow for the transformation of the JSON data to a Knowledge Graph. The Google Summer of Code application period opens later this month (March 24th).
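
One possible way to get familiar with the JSON before drafting a LinkML schema is to list every field path and the value types seen under it. The sketch below does this for a made-up record; the sample fields are hypothetical, and the output is only a list of candidate slots to review by hand, not a LinkML schema.

```python
def collect_slots(value, prefix="", slots=None):
    """Walk nested JSON and map each dotted path to the value types seen."""
    if slots is None:
        slots = {}
    if isinstance(value, dict):
        for key, child in value.items():
            path = f"{prefix}.{key}" if prefix else key
            collect_slots(child, path, slots)
    elif isinstance(value, list):
        # "[]" marks a multivalued path, a hint for a multivalued slot.
        for child in value:
            collect_slots(child, prefix + "[]", slots)
    else:
        slots.setdefault(prefix, set()).add(type(value).__name__)
    return slots


# Made-up record; field names are placeholders, not the real RIN layout.
sample = {
    "rrid": "RRID:EXAMPLE_0001",
    "name": "Example tool",
    "synonyms": ["ex", "example"],
    "mentions": [{"pmid": "00000001", "year": 2020}],
}
slots = collect_slots(sample)
for path in sorted(slots):
    print(path, sorted(slots[path]))
```

Running this over many records and merging the results shows which fields are optional, multivalued, or inconsistently typed, which are the decisions a schema has to make explicit.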

Hi @jgrethe @anita_bandrowski ,
Atharva here, a final-year AI undergrad from Mumbai, India. I am interested in GSoC ’25 at INCF. My focus is core ML and GenAI (agents, knowledge graphs).

I have been working on GraphRAG for the past few months in my internship, and I have been exploring various graph databases such as Neo4j, ArangoDB, and Memgraph.

I’ve previously built GraphRAG systems and improved variants of them. What is the scope for research in this project? Are we free to explore the research aspect (if any), and if so, could you point me to the concepts that would be the primary research topics?

Thanks,
Atharva Nawadkar

Greetings @jgrethe,
Thanks for sharing this insight. Will surely check out the open source LLM interfaces (Open WebUI).

Currently getting familiar with the JSON data via the API and diving into LinkML’s working mechanism. I’ll touch base with updates soon.

Regards,
Kinshuk Trivedi

Hey @jgrethe, are there any specific evaluation tasks one can do apart from getting familiar with the topics? Can the development of a schema be considered a task of sorts?

Hello @jgrethe , @anita_bandrowski

I hope you’re doing well.

I am Mahi S. Palimkar, a Computer Engineering sophomore at Veermata Jijabai Technological Institute (VJTI), Mumbai. I am passionate about machine learning, with experience in NLP, computer vision, image processing, and web development. I have strong programming skills in Python, C/C++, and version control using Git/GitHub.

I found the “LinkML support for the Resource Information Network” project in this year’s INCF GSoC idea list particularly exciting because it aims to make specialized research resources more accessible through AI. The challenge of transforming JSON data from RIN’s Elasticsearch API into a structured knowledge graph using LinkML is something I find both interesting and impactful.

Recently, I have been exploring Retrieval-Augmented Generation (RAG) for improving scientific resource discovery. I documented my learnings in this blog and also built a RAG system from scratch, which you can find here: RAG-from-Scratch. Additionally, I have started learning Neo4j to understand how graph databases can effectively model complex relationships.

I previously contacted you by email to share my interest in the project and my implementations of RAG. I have started familiarizing myself with the JSON data as you mentioned above. Can you guide me on the next best steps?

Regards
Mahi

Thank you for your insights. I am currently exploring the RIN JSON data and the process of converting it into a Knowledge Graph using LinkML and Neo4j, and will update you with my findings as soon as possible. Are there any particular tasks to be completed?

Yes, the development of a schema would be one of the tasks in the pipeline to go from JSON to a Knowledge Graph
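
To make the JSON-to-Knowledge-Graph pipeline concrete, the sketch below turns one record into parameterized Cypher `MERGE` statements. This is a hand-rolled illustration rather than the LinkML-Store route mentioned in the project description, and the node labels (`Resource`, `Publication`), the `MENTIONS` relationship, and the record fields are all assumptions.

```python
def merge_statements(record: dict):
    """Return (cypher, params) pairs that upsert a resource and its mentions."""
    statements = [(
        "MERGE (r:Resource {rrid: $rrid}) SET r.name = $name",
        {"rrid": record["rrid"], "name": record["name"]},
    )]
    for pmid in record.get("mentions", []):
        statements.append((
            "MATCH (r:Resource {rrid: $rrid}) "
            "MERGE (p:Publication {pmid: $pmid}) "
            "MERGE (p)-[:MENTIONS]->(r)",
            {"rrid": record["rrid"], "pmid": pmid},
        ))
    return statements


# Made-up record; field names are placeholders.
statements = merge_statements({
    "rrid": "RRID:EXAMPLE_0001",
    "name": "Example antibody",
    "mentions": ["00000001", "00000002"],
})
for cypher, params in statements:
    print(cypher, params)
```

Using `MERGE` with parameters keeps the load idempotent, so re-running it over the same records should not create duplicate nodes or relationships. Each pair could then be passed to a Neo4j session, for example with the official Python driver.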

The area where research could come in is the GraphRAG for querying across the knowledge graph
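
For orientation, the simplest form of graph-based retrieval looks roughly like the sketch below: find the RRIDs named in a question, pull the facts attached to them, and place those facts in the prompt. The in-memory triples, identifiers, and prompt wording are invented for illustration, and no LLM is called; research questions would start where this baseline falls short (multi-hop retrieval, ranking, query generation).

```python
import re

# Toy graph as (subject, predicate, object) triples; all values are made up.
TRIPLES = [
    ("PMID:00000001", "MENTIONS", "RRID:EXAMPLE_0001"),
    ("PMID:00000002", "MENTIONS", "RRID:EXAMPLE_0001"),
    ("RRID:EXAMPLE_0001", "HAS_TYPE", "antibody"),
    ("PMID:00000003", "MENTIONS", "RRID:EXAMPLE_0002"),
]


def retrieve(question: str, triples):
    """Return triples that touch any RRID mentioned in the question."""
    ids = set(re.findall(r"RRID:[A-Za-z0-9_\-]+", question))
    return [t for t in triples if t[0] in ids or t[2] in ids]


def build_prompt(question: str, triples) -> str:
    """Put the retrieved facts ahead of the question as grounding context."""
    facts = retrieve(question, triples)
    context = "\n".join(f"{s} {p} {o}" for s, p, o in facts)
    return f"Answer using only these facts:\n{context}\n\nQuestion: {question}"


question = "Which papers mention RRID:EXAMPLE_0001?"
prompt = build_prompt(question, TRIPLES)
print(prompt)
```

In a real setup the retrieval step would be a Cypher query against Neo4j, and the prompt would be sent to the chosen LLM.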

The next step is preparing your application for GSoC. Formal applications must be submitted to the Google Summer of Code program. Once applications are accepted, work on the projects can begin.

Greetings @jgrethe @anita_bandrowski

I hope you’re doing well.

I am Viraj Vora, a second-year B.Tech Computer Engineering student at VJTI, Mumbai. I am passionate about AI/ML, with hands-on experience in computer vision, NLP, and web development. My projects include ExpenFlow, an AI-driven expense management and receipt fraud detection tool using LLMs, LangChain, and OCR (93% accuracy); Decora, a React/Three.js-based virtual interior design tool for 2D-to-3D model mapping, a skill transferable to Neo4j graph design; and XCELERATE, an autonomous vehicle model that achieves 95% lane detection accuracy using OpenCV and YOLOv8 and implements behavioral cloning using NVIDIA’s architecture on the Udacity simulator.

I am particularly excited about the “LinkML support for the Resource Information Network: unlocking LLM interrogation of a scientific resource and citation knowledge graph” project as it combines my interests in structured data modeling, graph databases, and RAG workflows.
I’ve started diving into LinkML and Neo4j, exploring the RIN JSON data and the process of converting it into a Knowledge Graph.
Can you guide me on the next best steps to take?

Thank you for your time, and I look forward to contributing meaningfully to this impactful project.

Best regards,
Viraj Vora
GitHub: viraj200524 | LinkedIn: viraj-vora-09aa19288 | Resume

Since this is part of Google Summer of Code, the timeline below and the application process apply. Work on projects cannot begin until contributor applications have been submitted and formally accepted (accepted projects are announced on May 8th). Reminder of upcoming deadlines:

  1. On March 24th, the GSoC contributor application period begins
  2. April 8 - 18:00 UTC is the deadline for the GSoC contributor application [You MUST SUBMIT AN APPLICATION to be considered for a project]

The Google Summer of Code "Get Started" page is where you will find out more about contributor eligibility and where you would start and submit your application for the program.

Got it.
Thanks. :raised_hands: :raised_hands:

Greetings @jgrethe and @anita_bandrowski,
I’m Vijayavallabh, a second-year B.Tech Biological Engineering student minoring in AI at IIT Madras, India. I am interested in contributing to the “unlocking LLM interrogation of a scientific resource and citation knowledge graph” project. My experience in AI agents and RAG includes leading a team of 8 in developing SECure RAG (Inter-IIT Tech Meet 13.0), a patent-pending dynamic agentic RAG framework for financial documents such as SEC 10-K filings. The system integrates a novel state-machine-based Multi-HyDE retrieval method and Explainable AI components, benchmarked against GPT-4o and Gemini-1.5-flash. This involved advanced PDF parsing, conversational and long-term memory integration using ZEP, and dynamic data handling with Pathway’s vector store.

I am looking forward to connecting with the mentors to discuss the potential contribution towards the project and am excited about the summer of learning ahead.