GSoC 2025 Project #32 UCSD Projects :: Semantic annotation and ingestion of domain specific knowledge types (350h)

Mentors: Tom Gillespie <tgbugs@gmail.com> and Troy Sincomb <troysincomb@gmail.com>

Skill level: Intermediate or greater

Required skills: Python

Time commitment: Full time (350 hours)

Forum for discussion

About: Knowledge systems that support research in life science routinely need to represent data about a wide variety of different kinds of biological entities, from species, to cells, to genes, proteins, ion channels, and biological processes. Each of these types of entities has distinct relations that define them and distinct data that are collected to study them. There are a wide variety of existing sources of knowledge about these entities. In order to create an integrated knowledge base about biology it is critical to ingest the wide variety of sources into a single system.

Aims: This project focuses on the more tractable problem of accounting for existing known biological entities and their formal names, so that duplicate names are not created by accident. The primary way to prevent duplicate names is to have as complete a listing of existing names as possible. To this end we want to make it possible to ingest a wide variety of sources into InterLex. For example, we want to be able to keep the list of known mouse lines from the Jackson Laboratory (JAX) up to date with the latest information. In this project you will develop Extract, Transform, Load (ETL) pipelines in Python to make existing knowledge sources accessible to InterLex. You will also learn how to build such pipelines so that they are maintainable as part of a larger system and so that the operational overhead of running them is minimized. Said another way, this project will build a set of mini web crawlers that work together to build a knowledge graph and keep it up to date. The concrete goal is the ingestion of specific knowledge types (e.g. cell types from neurondm, mouse lines from JAX or MGI, ion channel protein complexes) into InterLex.
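
To give a feel for the kind of pipeline described above, here is a minimal sketch of an extract, transform, load flow that flags duplicate names. It is an illustration only, not InterLex's actual API or schema: the CSV layout, the EX: identifiers, the terms table, and the function names are all hypothetical, and an in-memory SQLite database stands in for the real target system.

```python
import csv
import io
import sqlite3

# Hypothetical sample data; a real source export will have a different layout.
SAMPLE_CSV = """identifier,label
EX:0001,Example Line A
EX:0002,EXAMPLE LINE A
EX:0003,Example Line B
"""


def extract(text):
    """Yield raw rows (dicts) from a CSV export."""
    yield from csv.DictReader(io.StringIO(text))


def normalize(label):
    """Case-fold and collapse whitespace so near-identical names compare equal."""
    return " ".join(label.split()).casefold()


def transform(rows):
    """Turn raw rows into (identifier, label, normalized label) tuples."""
    for row in rows:
        label = row["label"].strip()
        yield row["identifier"].strip(), label, normalize(label)


def load(conn, records):
    """Insert records; return those rejected because the identifier or
    normalized label is already present."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS terms ("
        "identifier TEXT PRIMARY KEY, label TEXT, norm_label TEXT UNIQUE)"
    )
    duplicates = []
    for identifier, label, norm in records:
        try:
            conn.execute(
                "INSERT INTO terms VALUES (?, ?, ?)", (identifier, label, norm)
            )
        except sqlite3.IntegrityError:
            # The PRIMARY KEY or UNIQUE constraint rejected this row.
            duplicates.append((identifier, label))
    conn.commit()
    return duplicates


def run_pipeline(conn, text):
    """Chain the three stages for one source."""
    return load(conn, transform(extract(text)))


conn = sqlite3.connect(":memory:")
duplicates = run_pipeline(conn, SAMPLE_CSV)
print(duplicates)
```

Keeping the three stages as separate functions means a new source would usually only need its own extract step, which is one way to keep many small pipelines maintainable. Note that in this sketch a second run over the same input reports every row as a duplicate; a real pipeline would need to tell an unchanged record apart from a genuine name clash.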

Websites:

Tech keywords: Python, SQL, OWL, RDF, biology, neuroscience, ETL, ontology, knowledge graphs