Mathématiques et Informatique Appliquées
du Génome à l'Environnement

Monday, 13 April 2026

Title
MetagenBERT: a Transformer Architecture using Foundational DNA Read Embedding Models to enhance Disease Classification from metagenomic data
Speaker
Gaspar Roy
Speaker's organization (or team, for internal seminars)
IRD; invited by Guillaume G
Location
Meeting room 142, building 210
Date
Abstract

Metagenomic disease prediction commonly relies on species-abundance tables derived from large and incomplete reference catalogs, a strategy that constrains resolution, increases computational costs, and discards valuable information contained in raw sequencing reads. To overcome these limitations, we introduce MetagenBERT, a Transformer-based framework that produces end-to-end metagenome embeddings directly from raw DNA sequences, without taxonomic or functional annotations. Individual reads are embedded using foundational genomic language models (DNABERT-2 and the microbiome-specialized DNABERT-MS), then aggregated through a scalable global clustering strategy based on FAISS-accelerated K-Means. Each metagenome is represented as a cluster-abundance vector summarizing the distribution of its embedded reads.
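
The aggregation step described above can be illustrated with a minimal sketch. It assumes that "global" means centroids are fitted once on reads pooled across all metagenomes, and it swaps in a plain NumPy Lloyd's K-Means for the FAISS-accelerated version. The random arrays stand in for read embeddings from DNABERT-2 or DNABERT-MS, and all function names are hypothetical.

```python
import numpy as np

def assign(points, centroids):
    """Return the index of the nearest centroid for each point."""
    # Squared Euclidean distance from every point to every centroid.
    d2 = ((points[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return d2.argmin(axis=1)

def kmeans(points, k, n_iter=20, seed=0):
    """Plain Lloyd's K-Means (a stand-in for FAISS-accelerated K-Means)."""
    rng = np.random.default_rng(seed)
    # Initialise centroids from k distinct points.
    centroids = points[rng.choice(len(points), size=k, replace=False)].copy()
    for _ in range(n_iter):
        labels = assign(points, centroids)
        for j in range(k):
            members = points[labels == j]
            if len(members) > 0:  # keep the old centroid if a cluster empties
                centroids[j] = members.mean(axis=0)
    return centroids

def cluster_abundance(read_embeddings, centroids):
    """Fraction of a metagenome's reads falling in each cluster."""
    labels = assign(read_embeddings, centroids)
    counts = np.bincount(labels, minlength=len(centroids))
    return counts / counts.sum()

# Toy data: 3 metagenomes, 200 reads each, 8-dimensional embeddings (assumed sizes).
rng = np.random.default_rng(42)
samples = [rng.normal(size=(200, 8)) for _ in range(3)]

# Fit centroids once on the pooled reads, then profile each metagenome.
centroids = kmeans(np.vstack(samples), k=4)
profiles = np.stack([cluster_abundance(s, centroids) for s in samples])
print(profiles.shape)
```

Each row of `profiles` is one metagenome's cluster-abundance vector and can be passed to any standard classifier.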

We evaluate this approach on five benchmark gut microbiome datasets (Cirrhosis, T2D, Obesity, IBD, CRC). MetagenBERT achieves competitive or superior AUC performance relative to species-abundance baselines across most tasks. Concatenating species abundances with embedding-based cluster abundances further improves prediction, demonstrating complementarity between taxonomic and embedding-derived signals. Global clustering remains robust when applied to as few as 5-10% of reads, highlighting substantial redundancy in metagenomes and enabling major computational gains.
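
Two operations mentioned above, read subsampling and feature concatenation, can be sketched as follows. This is an illustration under stated assumptions, not the authors' pipeline: subsampling is assumed to be uniform without replacement, the feature tables are random placeholders, and the function names are hypothetical.

```python
import numpy as np

def subsample_reads(read_embeddings, fraction, rng):
    """Keep a uniform random fraction of reads (assumed sampling scheme)."""
    n_keep = max(1, int(round(fraction * len(read_embeddings))))
    idx = rng.choice(len(read_embeddings), size=n_keep, replace=False)
    return read_embeddings[idx]

def combine_features(species_abundance, cluster_abundance):
    """Concatenate the two per-sample feature tables column-wise."""
    return np.concatenate([species_abundance, cluster_abundance], axis=1)

rng = np.random.default_rng(0)

# Keep 10% of 1000 placeholder read embeddings.
reads = rng.normal(size=(1000, 8))
subset = subsample_reads(reads, 0.10, rng)

# Placeholder tables: 3 samples, 5 species and 4 clusters (rows sum to 1).
species = rng.dirichlet(np.ones(5), size=3)
clusters = rng.dirichlet(np.ones(4), size=3)
features = combine_features(species, clusters)
print(subset.shape, features.shape)
```

The combined table has one row per sample and one column per species or cluster, so the same classifier can be trained on either block alone or on both.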

We additionally introduce MetagenBERT-Glob-MCardis, a cross-cohort variant in which clusters are trained on the larger MetaCardis dataset and transferred to the benchmark datasets. Despite reduced performance compared to dataset-specific clustering, the MetaCardis-trained clusters retain a strong predictive signal, including for phenotypes absent from MetaCardis, indicating the feasibility of a foundation model for metagenome representation. Robustness analyses show consistent separation of healthy and diseased states and stable cluster structures across subsamples.
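
The cross-cohort transfer can be pictured as freezing the centroids learned on a reference cohort and only assigning reads from a new cohort to them. The sketch below assumes nearest-centroid assignment under Euclidean distance. The centroids and reads are random placeholders, not MetaCardis data, and the function names are hypothetical.

```python
import numpy as np

def nearest_centroid(embeddings, centroids):
    """Nearest-centroid index per read, without a 3-D intermediate array."""
    # ||x - c||^2 = ||x||^2 - 2 x.c + ||c||^2
    d2 = ((embeddings ** 2).sum(axis=1)[:, None]
          - 2.0 * embeddings @ centroids.T
          + (centroids ** 2).sum(axis=1)[None, :])
    return d2.argmin(axis=1)

def transfer_profile(target_reads, frozen_centroids):
    """Cluster-abundance vector of a new sample under fixed centroids."""
    labels = nearest_centroid(target_reads, frozen_centroids)
    counts = np.bincount(labels, minlength=len(frozen_centroids))
    return counts / counts.sum()

rng = np.random.default_rng(1)
frozen_centroids = rng.normal(size=(6, 8))  # placeholder for reference-cohort centroids
target_reads = rng.normal(size=(300, 8))    # placeholder for a benchmark sample's reads
profile = transfer_profile(target_reads, frozen_centroids)
print(profile.shape)
```

Because no clustering is refitted on the target cohort, profiles from different cohorts share the same coordinates and can be compared directly.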

Overall, MetagenBERT provides a scalable, annotation-free, and interpretable representation of metagenomes, bridging foundational genomic language models with population-level microbiome variation. These results point toward future phenotype-aware metagenomic LLMs capable of generalizing across heterogeneous cohorts and sequencing technologies.