GSoC 2025 Project #32 UCSD Projects :: Semantic annotation and ingestion of domain specific knowledge types (350h)

Mentors: Tom Gillespie <tgbugs@gmail.com> and Troy Sincomb <troysincomb@gmail.com>

Skill level: Intermediate or greater

Required skills: Python

Time commitment: Full time (350 hours)

Forum for discussion

About: Knowledge systems that support research in life science routinely need to represent data about a wide variety of different kinds of biological entities, from species, to cells, to genes, proteins, ion channels, and biological processes. Each of these types of entities has distinct relations that define them and distinct data that are collected to study them. There are a wide variety of existing sources of knowledge about these entities. In order to create an integrated knowledge base about biology it is critical to ingest the wide variety of sources into a single system.

Aims: This project focuses on the more tractable problem of accounting for existing known biological entities and their formal names so that duplicate names are not created by accident. The primary way to prevent duplicate names is to have as complete a listing of existing names as possible. To this end we want to make it possible to ingest a wide variety of sources into InterLex. For example, we want to be able to keep the list of known mouse lines from the Jackson Laboratory (JAX) up to date with the latest information. In this project you will develop Extract Transform Load (ETL) pipelines in Python to make existing knowledge sources accessible to InterLex. You will also learn how to build such pipelines so that they are maintainable as part of a larger system and so that the operational overhead of running them is minimal. Said another way, this project will build a set of mini-web crawlers that work together to build a knowledge graph and keep it up to date. The work covers ingestion of specific knowledge types (e.g. cell types from neurondm, mouse lines from JAX or MGI, ion channel protein complexes) into InterLex.
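
To make the ETL idea above a bit more concrete, here is a minimal sketch of one extract, transform, load pass that skips names already present in a listing. Everything in it is an assumption for illustration: the `Term` record, the `id` and `name` column names, the in-memory `index` dict standing in for the target system, and the sample rows are all hypothetical and do not reflect the actual InterLex API or real JAX data.

```python
import csv
import io
from dataclasses import dataclass


@dataclass(frozen=True)
class Term:
    """Hypothetical record for one named entity from a source."""
    label: str      # formal name as given by the source
    source_id: str  # identifier assigned by the source
    source: str     # short name of the source


def normalize(label: str) -> str:
    """Transform: collapse whitespace and case so trivial variants compare equal."""
    return " ".join(label.split()).casefold()


def extract(raw_csv: str, source: str):
    """Extract: yield one Term per CSV row (columns 'id' and 'name' are assumed)."""
    for row in csv.DictReader(io.StringIO(raw_csv)):
        yield Term(label=row["name"], source_id=row["id"], source=source)


def load(terms, index: dict) -> list:
    """Load: add unseen names to the index; return the terms skipped as duplicates."""
    duplicates = []
    for term in terms:
        key = normalize(term.label)
        if key in index:
            duplicates.append(term)
        else:
            index[key] = term
    return duplicates


# Made-up sample data, not real records from any source.
SAMPLE = "id,name\nX:1,Example Line A\nX:2,example  line a\nX:3,Example Line B\n"

index = {}
duplicates = load(extract(SAMPLE, source="EXAMPLE"), index)
print(len(index), "names indexed;", len(duplicates), "duplicate(s) skipped")
```

Keeping extract, transform, and load as separate small functions is one way to let each source get its own extractor while sharing the duplicate check; a real pipeline would likely need a more careful notion of name equivalence than case and whitespace.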

Websites:

Tech keywords: Python, SQL, OWL, RDF, biology, neuroscience, ETL, ontology, knowledge graphs

Hello Dr. Gillespie and Dr. Sincomb,

I hope you are doing well. My name is Vrushali, and I am currently pursuing my undergraduate studies in Data Science at IIT Madras. I am very interested in contributing to “Project #32 – Semantic Annotation and Ingestion of Domain-Specific Knowledge Types” as part of Google Summer of Code 2025.

With experience in Python, ETL pipelines, and knowledge graphs, along with my background in AI, data science, and neuroinformatics, I find this project particularly exciting. I would love to learn more about the challenges involved in integrating structured biological knowledge into InterLex and how I can contribute meaningfully to this effort.

I would greatly appreciate any guidance on the next steps, such as relevant resources, preliminary tasks, or discussions that could help me gain deeper insights into the project. Looking forward to your advice and the opportunity to engage further.

Thank you for your time and consideration. I look forward to your response.

@arnab1896 Could you please share the usernames of the respective mentors or pin them? I am unable to find their profiles. Your help would be greatly appreciated!

Thanks in advance!

Dear Mentors,

My name is Sodir, and I am a bachelor's student in Engineering Technology - Electronics ICT at KU Leuven, preparing for my master's. This project caught my attention because of its focus on semantic programming and the benefits it can have.

My understanding of semantic programming and how it could be used came after encountering RedPencil at a job fair. It is a Belgian FOSS company that uses linked data principles. Their applications showed me how these kinds of technologies can be used to create open, integrated systems. Seeing how they applied linked data and semantic programming made it click for me how beneficial this kind of technology can be in all kinds of applications. Their focus on government applications made sense to me, but I initially couldn’t think of other possible uses. In this context, I think this project idea is great and could probably help a lot of researchers.

My background is in electrical and computer engineering rather than bioinformatics, but I have some experience with Python and SQL, and I’m confident in my ability to learn quickly, adapt, and solve problems.

If possible, I’d love to learn more about the ways I can contribute to this project and what steps to take.

Kind regards!