Mentors: Jeffrey Grethe <jeffrey.grethe@gmail.com> and Anita Bandrowski <abandrowski@health.ucsd.edu>
Skill level: Intermediate or greater
Required skills: Python, graph databases (e.g. Neo4j), RESTful APIs
Time commitment: Full time (350 hours)
About: The Resource Information Network (RIN) is the largest collection of information about research resources (e.g. cell lines, antibodies, software tools). The RIN has been developed with support from projects such as the NIDDK Information Network (https://dknet.org), the Neuroscience Information Framework (https://neuinfo.org), and the Research Resource Identification Initiative (https://rrid.site). Each resource within the RIN is uniquely identified by a Research Resource Identifier (RRID). Because mentions of a resource in the scientific literature can be tied to its RRID, the RIN can aggregate information about how a resource performs and who has used it in a given context. All of the RIN data is available in JSON format through an Elasticsearch API.
Aims: The primary goal of this project is to generate a knowledge graph from this information that can be interrogated via Large Language Models (LLMs) using Retrieval Augmented Generation (RAG) methods. RAG enhances LLMs by integrating structured knowledge graphs into their query process: instead of relying solely on pre-trained data, RAG dynamically retrieves relevant information from the knowledge graph, which improves accuracy and reduces hallucinations. To transition the JSON data to a formal knowledge graph we will use LinkML (https://linkml.io/), a general purpose modeling language that can be used with linked data, JSON, and other formalisms. Steps involved in the project would include: 1) developing a LinkML model of the data sources for the knowledge graph; 2) making a knowledge graph of this data available in Neo4j (e.g. through use of LinkML-Store); 3) inspecting the knowledge graph through the development of some initial Cypher queries; 4) exploring the use of RAG techniques through an LLM; and 5) exploring the use of an open source LLM user interface framework (e.g. Open WebUI, GitHub: open-webui/open-webui).
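As a rough illustration of steps 2 to 4 above, the sketch below turns one JSON record into parameterized Cypher statements and then assembles a RAG prompt from rows a graph query might return. Everything specific here is an assumption made for illustration: the record fields (rrid, name, type, mentions, pmid), the Resource and Publication node labels, the MENTIONS relationship, and the placeholder identifiers. The actual RIN JSON and the LinkML model developed in step 1 would determine the real shape.

```python
# Hypothetical record shape; the real RIN JSON fields may differ.
record = {
    "rrid": "RRID:EXAMPLE_0001",  # placeholder, not a real RRID
    "name": "Example antibody",
    "type": "antibody",
    "mentions": [{"pmid": "PMID:0000001"}, {"pmid": "PMID:0000002"}],
}


def record_to_cypher(rec):
    """Return (statement, parameters) pairs that upsert a resource and its mentions."""
    statements = [(
        "MERGE (r:Resource {rrid: $rrid}) SET r.name = $name, r.type = $type",
        {"rrid": rec["rrid"], "name": rec["name"], "type": rec["type"]},
    )]
    for mention in rec.get("mentions", []):
        statements.append((
            "MATCH (r:Resource {rrid: $rrid}) "
            "MERGE (p:Publication {pmid: $pmid}) "
            "MERGE (p)-[:MENTIONS]->(r)",
            {"rrid": rec["rrid"], "pmid": mention["pmid"]},
        ))
    return statements


def build_prompt(question, rows):
    """Assemble a RAG prompt from rows returned by a graph query."""
    facts = "\n".join(f"- {row['pmid']} mentions {row['rrid']}" for row in rows)
    return f"Answer using only these facts:\n{facts}\n\nQuestion: {question}"


statements = record_to_cypher(record)
rows = [{"pmid": m["pmid"], "rrid": record["rrid"]} for m in record["mentions"]]
prompt = build_prompt("Which publications mention this resource?", rows)
```

Each (statement, parameters) pair could be passed to a Neo4j session, for example with the official Python driver's session.run(statement, parameters). Using MERGE rather than CREATE keeps the load idempotent, so re-running it over the same records should not duplicate nodes or relationships.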
Websites:
- Research Resource Identification Initiative (https://rrid.site) website which provides a searchable user interface for resources and related mentions in the literature
- Biomed Resource Watch is a knowledge base for storing validation and performance information about research resources such as antibodies, cell lines, and tools. It is a new service platform that uses Research Resource Identifiers (RRIDs) to aggregate and disseminate known issues and validation information about these resources
- Resource Information Network APIs (https://docs.scicrunch.io/). This site provides documentation for RIN APIs that are accessible via https://api.scicrunch.io. The APIs provide access to all components of the RIN (i.e. information on resources, their mentions, and performance)
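Since the RIN data is served through an Elasticsearch API, a lookup by RRID might be expressed as an Elasticsearch term query. The sketch below only builds the request body; it does not call the service. The field name "rrid" and the placeholder identifier are assumptions, and the index path and authentication details are left to the documentation at https://docs.scicrunch.io/.

```python
import json

API_BASE = "https://api.scicrunch.io"  # base URL named in the project description


def build_rrid_query(rrid, size=10):
    """Build an Elasticsearch query body matching one RRID exactly.

    "rrid" is a hypothetical field name; check the API docs for the real mapping.
    """
    return {"size": size, "query": {"term": {"rrid": rrid}}}


body = build_rrid_query("RRID:EXAMPLE_0001")  # placeholder, not a real RRID
payload = json.dumps(body)  # JSON string to send as the request body
```

A term query is used here because identifiers are normally matched exactly rather than analyzed as free text; a match query would be the usual alternative for searching resource names.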
Tech keywords: Knowledge Graph, Retrieval Augmented Generation (RAG), Artificial Intelligence, Large Language Model (LLM), LinkML, Python