Biomedical Data Science Seminar Series

BIOMICS will establish a dedicated seminar series featuring scientific talks by experts from the partnering institutions, held every three months. These seminars will cover the latest achievements in the field and will be publicized and open to the entire GIMM, as well as to the regional and national scientific communities.

[5] The fifth seminar is presented by Mafalda Dias and Jonathan Frazer, group leaders at our partner CRG, Barcelona, where they co-lead the Probabilistic Machine Learning and Genomics group. They develop deep learning and probabilistic modeling approaches applied to genomics and other large-scale multimodal biological data to address questions in disease genetics, protein function and conservation genomics.

Reading evolutionary constraint with deep learning: from rare disease to endangered species

Deep learning models trained on the genetic variation seen across life on Earth provide a unique opportunity to learn about ultra-rare variants. By identifying constraint patterns on large evolutionary timescales, these models provide valuable information about which rare variants are likely to be benign and which are likely to give rise to disease. While there has been considerable progress on this front in recent years, and these models are now a standard component of diagnostic pipelines, it seems clear that our understanding of both how to build useful models and how to make effective use of them is still in its infancy. In this talk I’ll review recent successes and failures in developing deep learning models for learning about the genetic underpinnings of rare disease, first steps in developing similar models for endangered species, and what these species can teach us about lethal variants.

[4] The fourth speaker of the Biomedical Data Science Seminar Series is Manuel Irimia, Group Leader at the UPF/CRG Barcelona and Coordinator of the Joint Programme on Evolutionary Medical Genomics (EvoMG).

Evolutionary landscapes of zygotic genome activation across animals

During early animal embryogenesis, control over gene expression transitions from maternally deposited products to newly transcribed zygotic RNA. This process, termed zygotic genome activation (ZGA), is universal and essential but remains poorly characterized beyond a handful of model species. Here, we generated a comprehensive transcriptomic atlas of early embryogenesis from 61 animal species, spanning 13 phyla. By applying a unified computational framework, we systematically inferred the timing of ZGA across species. We uncover a large variation in ZGA timing but find that a proxy for nuclear-to-cytoplasmic (N/C) ratio robustly predicts the onset of genome activation. Comparative analyses of the properties of zygotic genes showed that they have distinct genomic architectures, functional enrichments, and evolutionary conservation patterns compared to maternal transcripts. Altogether, our findings suggest that ZGA is universally timed by the stoichiometry between DNA content and specific maternally deposited factors, and this activation involves a highly flexible transcriptomic program that nonetheless follows a deeply conserved molecular logic.

[3] The third speaker of the Biomedical Data Science Seminar Series is Bernardo Almeida, a Senior AI Research Scientist at InstaDeep, in Paris, where he is developing large language foundational models for biology.

Decoding the genome with foundation models

The human genome encodes the fundamental instructions of human biology, yet deciphering how its sequence governs molecular function and influences disease remains one of the central challenges in biomedicine. As genomics and biomedical data continue to expand exponentially, genomics foundation models have emerged as powerful approaches capable of capturing complex, multi-scale patterns embedded in these sequences. In this talk, I will present our efforts to develop such models to learn the “code” of the genome – beginning with self-supervised models trained directly on genomic sequences, extending these architectures to integrate natural language, and introducing a next generation of unified models that bridge multiple training paradigms. Together, these advances bring us closer to a comprehensive, computable understanding of genome function.

[2] The second speaker of the Biomedical Data Science Seminar Series is Pedro Beltrão, a core member of the BIOMICS project and spokesperson for ETH in the consortium.

The genetics of human trait variation across the scales of biological organization

The number of genetic studies of human traits and diseases has grown over the past years, with hundreds of thousands of gene-to-phenotype mappings done through genome-wide association, clinical studies or studies of model organisms. However, connecting trait-associated genetic variation to mechanisms through individual proteins and cellular processes remains a challenge. Our group is interested in building computational and experimental approaches that aim to address this challenge. In this talk I will briefly introduce some of our work on using AlphaFold models to study the impact of protein missense mutations and on predicting tissue type differences in protein-protein interactions. I will then focus primarily on describing our ongoing efforts to study the differences and similarities between genes linked to traits by different genetic approaches: GWAS, rare disorder studies and mouse KO phenotypes. We find that rare disorder studies and GWAS are biased toward identifying different types of genes that often do not overlap, even for the same or related traits. Despite the low gene-level overlap, we observe convergence at the level of cellular processes linked to the same types of traits regardless of the technologies used to study gene-to-trait associations. Finally, we show how this convergence allows us to improve the prediction of novel candidate disease genes.

[1] The first speaker of the Biomedical Data Science Seminar Series is Hagen Tilgner, a member of the BIOMICS Scientific Advisory Board.