Metagenomic disease prediction commonly relies on species-abundance tables derived from large and incomplete reference catalogs, a strategy that constrains resolution, increases computational costs, and discards valuable information contained in raw sequencing reads. To overcome these limitations, we introduce MetagenBERT, a Transformer-based framework that produces end-to-end metagenome embeddings directly from raw DNA sequences, without taxonomic or functional annotations. Individual reads are embedded using foundational genomic language models (DNABERT-2 and the microbiome-specialized DNABERT-MS), then aggregated through a scalable global clustering strategy based on FAISS-accelerated K-Means. Each metagenome is represented as a cluster-abundance vector summarizing the distribution of its embedded reads.
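As a rough illustration of the cluster-abundance representation described above, the following minimal sketch assigns each read embedding to its nearest centroid and normalizes the counts into a per-sample vector. It assumes the centroids have already been learned (for example by K-Means) and uses toy 2-D vectors in place of real DNABERT-2 or DNABERT-MS embeddings; the function names `nearest_centroid` and `cluster_abundance` are hypothetical and not taken from the MetagenBERT code, and a brute-force search stands in for the FAISS-accelerated assignment.

```python
import math
from collections import Counter


def nearest_centroid(embedding, centroids):
    """Return the index of the centroid closest to `embedding` (squared L2 distance)."""
    best_idx, best_dist = -1, math.inf
    for idx, centroid in enumerate(centroids):
        dist = sum((e - c) ** 2 for e, c in zip(embedding, centroid))
        if dist < best_dist:
            best_idx, best_dist = idx, dist
    return best_idx


def cluster_abundance(read_embeddings, centroids):
    """Return the fraction of a sample's reads assigned to each cluster."""
    n_reads = len(read_embeddings)
    if n_reads == 0:
        return [0.0] * len(centroids)
    counts = Counter(nearest_centroid(e, centroids) for e in read_embeddings)
    return [counts[k] / n_reads for k in range(len(centroids))]


# Toy stand-ins (assumption): 3 pre-trained centroids and 4 embedded reads in 2-D.
centroids = [[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]]
sample = [[0.1, -0.1], [0.9, 1.2], [1.1, 0.8], [4.8, 5.1]]

abundance = cluster_abundance(sample, centroids)
print(abundance)  # one entry per cluster, summing to 1
```

Because the output length equals the number of clusters rather than the number of reads, samples of very different sequencing depth map to vectors of the same dimension, which is what allows a standard classifier to be trained on top.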
We evaluate this approach on five benchmark gut microbiome datasets (Cirrhosis, T2D, Obesity, IBD, CRC). MetagenBERT achieves AUC values competitive with or superior to species-abundance baselines on most tasks. Concatenating species abundances with embedding-based cluster abundances further improves prediction, demonstrating complementarity between taxonomic and embedding-derived signals. Global clustering remains robust when applied to as few as 5--10\% of reads, highlighting substantial redundancy in metagenomes and enabling major computational gains.
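The two operations mentioned here, read subsampling and feature concatenation, can be sketched as follows. This is only an illustration under stated assumptions: uniform sampling without replacement is assumed, since the exact sampling scheme is not specified here, and the helper names `subsample_reads` and `concatenate_features` as well as the toy feature values are hypothetical.

```python
import random


def subsample_reads(reads, fraction, seed=0):
    """Uniformly sample a fraction of reads without replacement (assumed scheme)."""
    if not 0.0 < fraction <= 1.0:
        raise ValueError("fraction must be in (0, 1]")
    if not reads:
        return []
    n_keep = max(1, round(len(reads) * fraction))
    rng = random.Random(seed)  # fixed seed so the subsample is reproducible
    return rng.sample(reads, n_keep)


def concatenate_features(species_abundance, cluster_abundance):
    """Join taxonomic and embedding-derived features into one vector per sample."""
    return list(species_abundance) + list(cluster_abundance)


# Toy stand-ins (assumption): 1000 read identifiers and small feature vectors.
reads = [f"read_{i}" for i in range(1000)]
subset = subsample_reads(reads, fraction=0.10)

species = [0.7, 0.3]          # hypothetical species-abundance profile
clusters = [0.25, 0.5, 0.25]  # hypothetical cluster-abundance profile
features = concatenate_features(species, clusters)
```

Only the retained subset would need to be embedded and clustered, which is where the computational saving comes from; the concatenated vector can then be passed to any downstream classifier.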
We additionally introduce MetagenBERT-Glob-MCardis, a cross-cohort variant in which clusters are trained on the larger MetaCardis dataset and transferred to the benchmark datasets. Although performance is lower than with dataset-specific clustering, the MetaCardis-trained clusters retain strong predictive signal, including for phenotypes absent from MetaCardis, indicating the feasibility of a foundation model for metagenome representation. Robustness analyses show consistent separation of healthy and diseased states and stable cluster structures across subsamples.
Overall, MetagenBERT provides a scalable, annotation-free, and interpretable representation of metagenomes, bridging foundational genomic language models with population-level microbiome variation. These results point toward future phenotype-aware metagenomic LLMs capable of generalizing across heterogeneous cohorts and sequencing technologies.