GSoC 2025 Project #30 UCSD Projects :: LinkML support for the Resource Information Network: unlocking LLM interrogation of a scientific resource and citation knowledge graph (350h)

Mentors: Jeffrey Grethe <jeffrey.grethe@gmail.com> and Anita Bandrowski <abandrowski@health.ucsd.edu>

Skill level: Intermediate or greater

Required skills: Python, Graph Databases (e.g. Neo4J), RESTful APIs

Time commitment: Full time (350 hours)

Forum for discussion

About: The Resource Information Network (RIN) is the largest collection of information about research resources (e.g. cell lines, antibodies, software tools). The RIN has been developed with support from projects such as the NIDDK Information Network (https://dknet.org), the Neuroscience Information Framework (https://neuinfo.org), and the Research Resource Identification Initiative (https://rrid.site). Each resource within the RIN is uniquely identified by a Research Resource Identifier (RRID). Because mentions of a resource in the scientific literature can be tied to its RRID, the RIN can aggregate information about how a resource performs and about who has used it and in what context. All of this data is available in JSON format through an Elasticsearch API.

Aims: The primary goal of this project is to generate a knowledge graph from this information that can be interrogated via Large Language Models (LLMs) using Retrieval Augmented Generation (RAG) methods. RAG enhances LLMs by integrating structured knowledge graphs into their query process. Instead of relying solely on pre-trained data, RAG dynamically retrieves relevant information from the knowledge graph, improving accuracy and reducing hallucinations. To transition the JSON data to a formal knowledge graph we will use LinkML (https://linkml.io/), a general-purpose modeling language that can be used with linked data, JSON, and other formalisms. Steps involved in the project would include: 1) developing a LinkML model of the data sources for the knowledge graph; 2) making a knowledge graph of this data available in Neo4J (e.g. through use of LinkML-Store); 3) inspecting the knowledge graph through the development of some initial Cypher queries; 4) exploring the use of RAG techniques with an LLM; and 5) exploring the use of an open source LLM user interface framework (e.g. Open WebUI, GitHub: open-webui/open-webui).
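To make steps 1 to 4 a little more concrete, here is a minimal Python sketch. It is only an illustration under stated assumptions, not the project's design: the `Resource` and `Mention` classes, their field names, the `Publication` label, the `MENTIONS` relationship, and the placeholder identifiers are all hypothetical, and a real LinkML model would be written as a LinkML schema rather than as dataclasses. The functions only build Cypher text and a prompt string; running the queries would need a Neo4J driver, which is left out here.

```python
from dataclasses import dataclass


# Hypothetical, simplified stand-ins for classes a LinkML model might define.
@dataclass(frozen=True)
class Resource:
    rrid: str            # Research Resource Identifier
    name: str
    resource_type: str   # e.g. "antibody", "cell line", "software tool"


@dataclass(frozen=True)
class Mention:
    rrid: str   # the resource that is mentioned
    pmid: str   # identifier of the mentioning publication (assumed field)


def resource_to_cypher(resource):
    """Return (query, params) that upserts one Resource node."""
    query = (
        "MERGE (r:Resource {rrid: $rrid}) "
        "SET r.name = $name, r.resource_type = $resource_type"
    )
    params = {
        "rrid": resource.rrid,
        "name": resource.name,
        "resource_type": resource.resource_type,
    }
    return query, params


def mention_to_cypher(mention):
    """Return (query, params) linking a publication to an existing resource."""
    query = (
        "MATCH (r:Resource {rrid: $rrid}) "
        "MERGE (p:Publication {pmid: $pmid}) "
        "MERGE (p)-[:MENTIONS]->(r)"
    )
    return query, {"rrid": mention.rrid, "pmid": mention.pmid}


def facts_for(resource, mentions):
    """Turn graph content about one resource into plain-text facts."""
    facts = [f"{resource.name} ({resource.rrid}) is a {resource.resource_type}."]
    for mention in mentions:
        if mention.rrid == resource.rrid:
            facts.append(
                f"{resource.rrid} is mentioned in publication {mention.pmid}."
            )
    return facts


def build_rag_prompt(question, facts):
    """Assemble an LLM prompt that grounds the question in retrieved facts."""
    context = "\n".join(f"- {fact}" for fact in facts)
    return (
        "Answer using only the facts below.\n"
        f"Facts:\n{context}\n"
        f"Question: {question}"
    )


if __name__ == "__main__":
    # Placeholder identifiers, not real RRIDs or PMIDs.
    tool = Resource("RRID:EXAMPLE_0001", "Example Tool", "software tool")
    mentions = [Mention("RRID:EXAMPLE_0001", "PMID:0000001")]
    print(resource_to_cypher(tool)[0])
    print(build_rag_prompt("Which papers mention Example Tool?",
                           facts_for(tool, mentions)))
```

Keeping query text and parameters separate (rather than formatting values into the Cypher string) is a deliberate choice here, since it avoids quoting problems with resource names. In an actual RAG setup the facts would come from a Cypher query against the graph instead of from in-memory objects.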

Websites:

  • Research Resource Identification Initiative (https://rrid.site) website which provides a searchable user interface for resources and related mentions in the literature
  • Biomed Resource Watch is a knowledge base for storing validation and performance information about research resources such as antibodies, cell lines, and tools. It is a new service platform that uses Research Resource Identifiers (RRIDs) to aggregate and disseminate known issues and validation information about these resources
  • Resource Information Network APIs (https://docs.scicrunch.io/). This site provides documentation for RIN APIs that are accessible via https://api.scicrunch.io. The APIs provide access to all components of the RIN (i.e. information on resources, their mentions, and performance)
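As a rough idea of what pulling JSON from the Elasticsearch API could look like, here is a small Python sketch that only builds the request without sending it. The base URL comes from the list above; the index path, the `apikey` header name, and the function name are assumptions made for illustration, so please check https://docs.scicrunch.io/ for the actual endpoints and authentication. The request body uses the standard Elasticsearch `query_string` query.

```python
import json
import urllib.request

API_BASE = "https://api.scicrunch.io"       # from the project description
SEARCH_PATH = "/PLACEHOLDER_INDEX/_search"  # hypothetical path, see the API docs


def build_search_request(term, api_key, size=10):
    """Build (but do not send) an Elasticsearch search request for `term`."""
    body = {
        "size": size,                                  # maximum number of hits
        "query": {"query_string": {"query": term}},    # free-text query
    }
    return urllib.request.Request(
        API_BASE + SEARCH_PATH,
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "apikey": api_key,  # assumed header name for the API key
        },
        method="POST",
    )


if __name__ == "__main__":
    request = build_search_request("antibody", api_key="YOUR_KEY", size=5)
    print(request.get_method(), request.full_url)
    print(request.data.decode("utf-8"))
```

Once the real path and credentials are filled in, the request could be sent with `urllib.request.urlopen(request)` and the response parsed with `json.load`.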

Tech keywords: Knowledge Graph, Retrieval Augmented Generation (RAG), Artificial Intelligence, Large Language Model (LLM), LinkML, Python

Hello, I am Janvi Soni, a second-year undergraduate student at VJTI Mumbai, studying Information Technology. I am passionate about AI and have worked extensively in deep learning, generative AI, and retrieval-based LLMs. I developed a text-to-image generation model from scratch using GANs, VAEs, and Vision Transformers (ViTs) to generate images from text descriptions. I have also built a multimodal RAG-based research paper analyzer that processes scientific documents and retrieves key insights from them, analyzing text, tables, graphs, images, and other structured data to improve research accessibility.
I am familiar with graph-based RAG, knowledge graphs, and Neo4J, and am now implementing them to enhance structured retrieval and AI-driven reasoning. This project aligns with my interest in intelligent retrieval systems, focusing on knowledge graphs and RAG methods to enhance LLM accuracy. I am looking forward to contributing and would love any guidance on how best to get started and move ahead with this project.

Greetings @jgrethe, @anita_bandrowski,
I hope you’re doing well! I am Kinshuk Trivedi, a third-year undergraduate student at VJTI, Mumbai, India. I previously introduced myself via email regarding my interest in contributing to the “LinkML support for the Resource Information Network: unlocking LLM interrogation of a scientific resource and citation knowledge graph” project for GSoC’25. Now that the forum is set up for discussions, I wanted to follow up and better understand how I can start contributing effectively.

I’m excited about working on scientific research data. I’ve been looking into the RRID and Biomed Resource Watch websites and will soon share the knowledge graph-based application that I’m currently working on to demonstrate my understanding of graph databases.

I wanted to ask whether this knowledge graph-based LLM (using RAG) is to be built as a separate web application (e.g. with ReactJs, Django, etc.) or integrated into an existing web interface.

Do share any updates regarding the tasks to be performed for GSoC’25.

Looking forward to hearing your insights! Thanks in advance for your guidance.

Best Regards,
Kinshuk Trivedi