GeneCLIP: Understanding Genes Through DNA

Research Journey

This project was not something I understood immediately. I initially struggled with biology, especially genetics, where long DNA sequences and dense terminology felt disconnected and difficult to interpret. Early on, I found myself relying on memorization rather than true understanding, which made the subject feel frustrating and inaccessible.

GeneCLIP became a way for me to confront that challenge by approaching biology through a lens I was more comfortable with — computation and machine learning. Through many failed experiments, confusing results, and iterations that did not work as expected, I gradually began to see how biological meaning could emerge from patterns in data.

Building this system helped me bridge the gap between abstract genetic concepts and concrete representations, turning my earlier confusion into curiosity and, eventually, confidence in navigating interdisciplinary research.


GeneCLIP Overview

I have always been curious about how much information is truly hidden inside DNA. Genes are usually described using text — their function, behavior, and clinical relevance — but DNA itself is rarely connected directly to language in a meaningful way.

GeneCLIP explores whether a model can learn the *meaning* of a gene directly from its DNA sequence. Instead of predicting a single label, the goal is to align DNA with natural language descriptions so that the model captures what a gene does, not just where it sits in the genome.

By learning a shared space between genomic sequences and gene descriptions, this project aims to create a foundation for tasks like gene discovery, similarity search, and understanding genetic variants using language-based reasoning.

GeneCLIP starts by loading two types of data: genomic DNA sequences extracted from a reference genome, and text-based gene annotations describing gene function, biological pathways, and clinical relevance.

Each DNA sequence is converted into a numerical format and passed through a transformer-based encoder that learns patterns in nucleotide sequences. At the same time, gene descriptions are processed using a biomedical language model trained on scientific literature.
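To make the "numerical format" step concrete, here is a minimal sketch of one common way to turn DNA into integers: overlapping k-mer tokenization. This is an assumption for illustration only; the write-up does not say which encoding GeneCLIP uses, and the names `build_kmer_vocab` and `encode_dna`, the choice of k = 3, and the `<unk>` token are all hypothetical.

```python
from itertools import product

NUCLEOTIDES = "ACGT"


def build_kmer_vocab(k):
    """Map every k-mer over A/C/G/T to an integer id; id 0 is reserved for unknowns."""
    vocab = {"<unk>": 0}
    for i, kmer in enumerate(product(NUCLEOTIDES, repeat=k), start=1):
        vocab["".join(kmer)] = i
    return vocab


def encode_dna(sequence, vocab, k):
    """Split a DNA string into overlapping k-mers and return their integer ids.

    K-mers containing anything other than A/C/G/T (for example N) map to <unk>.
    """
    sequence = sequence.upper()
    ids = []
    for start in range(len(sequence) - k + 1):
        kmer = sequence[start:start + k]
        ids.append(vocab.get(kmer, vocab["<unk>"]))
    return ids


# Hypothetical toy input: a 7-base fragment with one ambiguous base (N).
vocab = build_kmer_vocab(3)
token_ids = encode_dna("ACGTNAC", vocab, k=3)
print(token_ids)
```

The resulting list of ids is the kind of input a transformer encoder's embedding layer would consume. A real pipeline would also need padding, truncation, and a policy for very long sequences, which this sketch leaves out.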

Both the DNA and text representations are projected into the same embedding space and trained using a contrastive learning objective. This encourages matching DNA–text pairs to move closer together while pushing unrelated pairs farther apart.
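The contrastive objective can be sketched in a few lines. The version below is a symmetric, CLIP-style cross-entropy over cosine similarities, written in plain Python so it runs without any dependencies. It is an assumption about the form of the loss, not a copy of the GeneCLIP training code; the function names and the temperature value of 0.07 are illustrative.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (assumes the vector is not all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def cross_entropy_diagonal(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        row_max = max(row)  # subtract the max for numerical stability
        log_sum_exp = row_max + math.log(sum(math.exp(x - row_max) for x in row))
        total += log_sum_exp - row[i]
    return total / len(logits)


def contrastive_loss(dna_embeddings, text_embeddings, temperature=0.07):
    """Symmetric contrastive loss over a batch of paired embeddings.

    Row i of dna_embeddings is assumed to match row i of text_embeddings.
    """
    dna = [l2_normalize(v) for v in dna_embeddings]
    text = [l2_normalize(v) for v in text_embeddings]
    # logits[i][j] = cosine similarity of DNA i and text j, scaled by temperature
    logits = [
        [sum(a * b for a, b in zip(d, t)) / temperature for t in text]
        for d in dna
    ]
    logits_transposed = [list(col) for col in zip(*logits)]
    dna_to_text = cross_entropy_diagonal(logits)
    text_to_dna = cross_entropy_diagonal(logits_transposed)
    return 0.5 * (dna_to_text + text_to_dna)


# Toy batch: matched pairs give a low loss, mismatched pairs a high one.
dna_batch = [[1.0, 0.0], [0.0, 1.0]]
loss_matched = contrastive_loss(dna_batch, [[1.0, 0.0], [0.0, 1.0]])
loss_mismatched = contrastive_loss(dna_batch, [[0.0, 1.0], [1.0, 0.0]])
print(loss_matched, loss_mismatched)
```

Minimizing this quantity raises the similarity of each matching DNA-text pair relative to every other pairing in the batch, which is the "pull together, push apart" behavior described above. In practice this would be computed on tensors in a deep learning framework so that gradients flow back into both encoders.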

As training progresses, the model learns to associate specific DNA patterns with meaningful biological descriptions. Once trained, GeneCLIP can retrieve the correct gene description from DNA alone, or find the most relevant DNA sequence given a textual description. This suggests that genetic information can be represented and compared in the same embedding space as language.
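Retrieval in a shared embedding space comes down to a nearest-neighbor search. The sketch below ranks candidate embeddings by cosine similarity to a query; the same function works in either direction (DNA query against text candidates, or the reverse). The vectors are made-up toy values, not outputs of the trained model, and `retrieve` is a hypothetical helper name.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def retrieve(query_embedding, candidate_embeddings, top_k=1):
    """Return (candidate_index, similarity) pairs for the top_k closest candidates."""
    scored = [
        (cosine_similarity(query_embedding, candidate), index)
        for index, candidate in enumerate(candidate_embeddings)
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [(index, score) for score, index in scored[:top_k]]


# Toy example: one DNA embedding queried against three description embeddings.
description_embeddings = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]
dna_query = [0.9, 0.1, 0.0]
matches = retrieve(dna_query, description_embeddings, top_k=2)
print(matches)
```

For a large gene catalog, the exhaustive loop would normally be replaced by a batched matrix product or an approximate nearest-neighbor index, but the ranking logic stays the same.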

Download the Code and Try It Out!

Replicated from Jon Barron.