GSoC 2025 Project #32 UCSD Projects :: Semantic annotation and ingestion of domain specific knowledge types (350h)

arnab1896 · March 8, 2025, 8:16pm

Mentors: Tom Gillespie <tgbugs@gmail.com> and Troy Sincomb <troysincomb@gmail.com>

Skill level: Intermediate or greater

Required skills: Python

Time commitment: Full time (350 hours)

About: Knowledge systems that support research in life science routinely need to represent data about a wide variety of different kinds of biological entities, from species, to cells, to genes, proteins, ion channels, and biological processes. Each of these types of entities has distinct relations that define them and distinct data that are collected to study them. There are a wide variety of existing sources of knowledge about these entities. In order to create an integrated knowledge base about biology it is critical to ingest the wide variety of sources into a single system.

Aims: This project focuses on the more tractable problem of accounting for existing known biological entities and their formal names so that duplicate names are not created by accident. The primary way to prevent duplicate names is to have as complete a listing of existing names as possible. To this end we want to make it possible to ingest a wide variety of sources into InterLex. For example, we want to be able to keep the list of known mouse lines from the Jackson Laboratory (JAX) up to date with the latest information. In this project you will develop Extract Transform Load (ETL) pipelines in Python to make existing knowledge sources accessible to InterLex. You will also learn how to build such pipelines so that they are maintainable as part of a larger system and so that the operational overhead of running them is minimized. Said another way, this project will build a set of mini web crawlers that work together to build a knowledge graph and keep it up to date. The target is ingestion of specific knowledge types (e.g. cell types from neurondm, mouse lines from JAX or MGI, ion channel protein complexes) into InterLex.
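To make the extract, transform, load idea above concrete, here is a minimal sketch of such a pipeline using only the Python standard library. Everything in it is an assumption for illustration: the `Term` record, the CSV column names `id` and `name`, the `EX:` identifiers, and the in-memory `index` dict standing in for the target system. It does not use real JAX data or the InterLex API.

```python
import csv
import io
from dataclasses import dataclass


@dataclass(frozen=True)
class Term:
    source_id: str  # identifier from the upstream source (hypothetical format)
    label: str      # human-readable name as published by the source


def extract(raw_csv: str) -> list[dict]:
    """Extract: parse a CSV export into one dict per row."""
    return list(csv.DictReader(io.StringIO(raw_csv)))


def normalize(label: str) -> str:
    """Collapse whitespace and case so trivially different names compare equal."""
    return " ".join(label.split()).casefold()


def transform(rows: list[dict]) -> list[Term]:
    """Transform: map rows to Term records, dropping rows with no name."""
    terms = []
    for row in rows:
        name = " ".join((row.get("name") or "").split())
        if name:
            terms.append(Term((row.get("id") or "").strip(), name))
    return terms


def load(terms: list[Term], index: dict[str, Term]) -> list[Term]:
    """Load: add terms to the index keyed by normalized label.

    Returns the terms that were skipped because their name already exists.
    """
    skipped = []
    for term in terms:
        key = normalize(term.label)
        if key in index:
            skipped.append(term)
        else:
            index[key] = term
    return skipped


# Made-up sample export; not real source data.
SAMPLE = """id,name
EX:0001,Example  line A
EX:0002,example line a
EX:0003,Example line B
EX:0004,
"""

index: dict[str, Term] = {}
skipped = load(transform(extract(SAMPLE)), index)
print(len(index), "terms loaded;", len(skipped), "duplicate(s) skipped")
```

Keeping the three stages as separate functions means each source only needs its own `extract` and `transform`, while the duplicate check in `load` stays shared. A real pipeline would likely need a more careful notion of name equivalence than case and whitespace folding.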

Websites:

InterLex is a terminology management system for life sciences and provides the content for the visualization(s): https://interlex.org
Python utilities for working with ontologies: tgbugs/pyontutils on GitHub

Tech keywords: Python, SQL, OWL, RDF, biology, neuroscience, ETL, ontology, knowledge graphs

vrun · March 12, 2025, 9:01am

Hello Dr. Gillespie and Dr. Sincomb,

I hope you are doing well. My name is Vrushali, and I am currently pursuing my undergraduate studies in Data Science at IIT Madras. I am very interested in contributing to “Project #32 – Semantic Annotation and Ingestion of Domain-Specific Knowledge Types” as part of Google Summer of Code 2025.

With experience in Python, ETL pipelines, and knowledge graphs, along with my background in AI, data science, and neuroinformatics, I find this project particularly exciting. I would love to learn more about the challenges involved in integrating structured biological knowledge into InterLex and how I can contribute meaningfully to this effort.

I would greatly appreciate any guidance on the next steps, such as relevant resources, preliminary tasks, or discussions that could help me gain deeper insights into the project. Looking forward to your advice and the opportunity to engage further.

Thank you for your time and consideration. I look forward to your response.

@arnab1896 Could you please share the usernames of the respective mentors or pin them? I am unable to find their profiles. Your help would be greatly appreciated!

Thanks in advance!

sydon1 · March 12, 2025, 6:51pm

Dear Mentors,

My name is Sodir, and I am a bachelor’s student in Engineering Technology - Electronics ICT at KU Leuven, preparing for my master’s. This project caught my attention because of its focus on semantic programming and the benefits it can have.

My understanding of semantic programming and how it could be used came after encountering RedPencil at a job fair. This is a Belgian FOSS company that uses linked data principles. Their applications showed me how these kinds of technologies can be used to create open, integrated systems. Seeing how they apply linked data and semantic programming made it click for me how beneficial this technology can be in all kinds of applications. Their work on government applications made sense to me, but I initially couldn’t think of other possible applications. In this context, I think this idea is great and could probably help a lot of researchers.

While my background is in electrical and computer engineering rather than bioinformatics, I have some experience with Python and SQL, and I’m confident in my ability to learn quickly, adapt, and solve problems.

If possible, I’d love to learn more about the ways I can contribute to this project and what steps to take.

Kind regards!

mabelle_attieh · March 30, 2025, 4:46pm

Dear Mentors,

I hope this message finds you well. My name is Mabelle Attieh, and I am currently pursuing my undergraduate studies in Computer Engineering with a premed track at the Lebanese American University (LAU). I am very interested in contributing to Project #32 as part of Google Summer of Code 2025.

I have a strong background in bioinformatics and software engineering, and I am particularly drawn to this project due to my passion for applying technology to improve healthcare and biological research. I have experience working with Python, SQL, and Java, and I am excited about the opportunity to collaborate with everyone to contribute to this project.

I would love to learn more about the specific challenges for this project, and I am excited to explore how I can contribute to its success. If there are any resources or next steps that would help me prepare, I would greatly appreciate any guidance provided.

Thank you for your time and consideration.