Mathématiques et Informatique Appliquées
du Génome à l'Environnement

BRETON Hugo

Type
Intern
Subject
Development of Machine Learning Methods for Detecting Genomic Context Modules from Pangenome Multigraphs
Start date
End date
Supervisor(s)
Christophe Ambroise ; Marie Szafranski ; Guillaume Gautreau
Team(s)
StatInfOmics
Defense year (for theses or internships)
2026
School/university (for theses and internships)
Université Paris-Cité
Level/degree (for internships)
M2
Description/abstract

Context and Objectives:

 

Prokaryotes (i.e. bacteria and archaea) are a remarkably diverse and ubiquitous group of living organisms. Their impact on the biosphere is considerable: they influence human and animal health as well as soil and ocean biogeochemistry, among many other processes. Large-scale exploration of microbial genomes has helped uncover the molecular mechanisms underlying this diversity, in particular the role of Mobile Genetic Elements (MGEs).

In recent years, with the rapid growth of sequencing projects, several bioinformatics approaches based on the pangenome concept have been developed, offering solutions for efficiently managing and exploiting large quantities of data [1]. Pangenomics examines genetic variability across all available genomes of a given group, usually a species, rather than relying on a single reference genome or on pairwise comparisons. In terms of gene content, a distinction is made between the core genome, i.e. the genes present in all individuals, and the accessory (or variable) genome, i.e. the genes conserved to varying degrees across genomes and therefore likely to explain phenotypic differences. The development of pangenomic methods is thus a response to the challenge of massive data in biology, and it helps in understanding the evolution of microorganisms in relation to epidemiological or environmental data.
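
The core/accessory distinction above can be illustrated with a minimal sketch. It assumes the strict definition given in the text (core = present in every genome); the genome names, family names, and the `partition_pangenome` helper are hypothetical and only serve the illustration.

```python
# Toy gene-family content for three hypothetical genomes (illustrative data).
genomes = {
    "genome_A": {"famA", "famB", "famC"},
    "genome_B": {"famA", "famB", "famD"},
    "genome_C": {"famA", "famC", "famD", "famE"},
}


def partition_pangenome(genomes):
    """Split gene families into core (present in every genome) and accessory."""
    all_families = set().union(*genomes.values())
    core = set.intersection(*genomes.values())
    accessory = all_families - core
    return core, accessory


core, accessory = partition_pangenome(genomes)
print(sorted(core))       # families shared by all genomes
print(sorted(accessory))  # families missing from at least one genome
```

With real data, a strict "present in all genomes" threshold is sensitive to incomplete assemblies, so this sketch should be read as a definition rather than as a recommended procedure.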

 

For several years now, the LABGeM and LaMME teams have been working on a model that represents genomic data as a pangenome graph at the gene family level, enabling the compression of information from thousands of genomes while preserving the chromosomal organization of genes. The PPanGGOLiN software suite [2] (awarded an Open Science Research Prize by the French Ministry of Research in 2023; >220 citations since 2020) has been developed to reconstruct and analyze pangenome graphs. It includes methods for identifying regions of genomic plasticity (panRGP method) [3] and for describing them at a finer scale as conserved modules (panModule method) [4], which have proven useful for identifying genomic islands and their MGEs. LABGeM is also developing PanGBank, a database of pangenomes reconstructed from public genomes in the GenBank and RefSeq databases using the GTDB classification. It currently gathers pangenomes for >4300 prokaryotic species.
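
As a rough illustration of a pangenome graph at the gene family level, the sketch below takes nodes to be gene families and links two families when their genes are adjacent in at least one genome, counting how many genomes support each node and edge. This is a simplified reading of the model described above, not the PPanGGOLiN implementation; the input data and the `build_pangenome_graph` helper are hypothetical.

```python
from collections import Counter

# Each hypothetical genome is an ordered list of gene-family identifiers
# along one contig (illustrative data, not taken from PanGBank).
genomes = {
    "genome_A": ["fam1", "fam2", "fam3", "fam4"],
    "genome_B": ["fam1", "fam2", "fam4"],
    "genome_C": ["fam1", "fam2", "fam3", "fam4"],
}


def build_pangenome_graph(genomes):
    """Count, for each family and each adjacency, the genomes that contain it."""
    node_support = Counter()
    edge_support = Counter()
    for order in genomes.values():
        node_support.update(set(order))
        # Undirected edges between consecutive families; the set ensures an
        # adjacency is counted at most once per genome.
        adjacencies = {
            frozenset((left, right))
            for left, right in zip(order, order[1:])
            if left != right
        }
        edge_support.update(adjacencies)
    return node_support, edge_support


node_support, edge_support = build_pangenome_graph(genomes)
for edge, count in sorted(edge_support.items(), key=lambda item: sorted(item[0])):
    print(sorted(edge), count)
```

Thousands of genomes collapse into one node per family and one edge per observed adjacency, which is where the compression comes from, while the edge set still records gene order.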

 

The PanGAIMiX project aims to revolutionize microbial genome analysis by integrating pangenome graph models with advanced machine learning techniques. Within this project, Work Package 2 (WP2) focuses on developing methods to detect conserved genomic context modules across multiple pangenomes using a MultiGraph Neural Network approach. The goal is to overcome the scalability limitations of traditional graph-based algorithms and enable the detection of evolutionary patterns across hundreds of species.

 

Tasks:

  • Build a cross-species layered pangenome multigraph from the PanGBank resource, where edges encode either gene co-localization within genomes or homology relationships across species.
  • Identify conserved modules across pangenomes by applying deep learning architectures, such as U-Net [5], adapted for graph-based data segmentation.
  • Benchmark the method against state-of-the-art approaches for detecting conserved modules (e.g. panModule, STRING-DB: https://string-db.org/).
  • Interpret and visualize the detected genomic modules by projecting their learned embeddings into low-dimensional spaces using dimensionality reduction techniques such as UMAP. This will facilitate the exploration of phylogenomic relationships by revealing clusters, gradients, and evolutionary trajectories among gene families across species.
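
The layered multigraph in the first task can be sketched as follows, under the assumption that each layer is one species' pangenome graph and that homology links connect families across layers. The data structure, the edge-type labels, and the `build_layered_multigraph` helper are hypothetical choices made for illustration; the project may well use a dedicated graph library and a different encoding.

```python
from collections import defaultdict

# Hypothetical per-species layers: pairs of co-localized gene families.
colocalization = {
    "species_1": [("s1_famA", "s1_famB"), ("s1_famB", "s1_famC")],
    "species_2": [("s2_famX", "s2_famY")],
}
# Hypothetical homology links between families of different species.
homology = [
    (("species_1", "s1_famA"), ("species_2", "s2_famX")),
    (("species_1", "s1_famB"), ("species_2", "s2_famY")),
]


def build_layered_multigraph(colocalization, homology):
    """Adjacency lists keyed by (species, family); each neighbor carries an edge type."""
    adjacency = defaultdict(list)

    def add_edge(u, v, edge_type):
        adjacency[u].append((v, edge_type))
        adjacency[v].append((u, edge_type))

    # Intra-layer edges: co-localization within one species.
    for species, edges in colocalization.items():
        for fam_u, fam_v in edges:
            add_edge((species, fam_u), (species, fam_v), "colocalization")
    # Inter-layer edges: homology across species.
    for u, v in homology:
        add_edge(u, v, "homology")
    return dict(adjacency)


graph = build_layered_multigraph(colocalization, homology)
for node, neighbors in graph.items():
    print(node, neighbors)
```

Keeping the edge type on every link lets a downstream model treat the two relations differently, for example with separate message-passing weights per edge type.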

 

Environment:

 

This internship is part of the ANR PanGAIMiX project. It will be conducted in collaboration with leading researchers from LABGeM (David Vallenet, Alexandra Calteau), MaIAGE (Guillaume Gautreau), and LaMME (Christophe Ambroise and Marie Szafranski). The intern will have access to high-performance computing resources and will work in a multidisciplinary environment that combines expertise in microbial genomics, bioinformatics, and machine learning.