Antibiotic-resistance genes in bacteria are the focus of intense attention, given the global concern over the rise of multi-resistant pathogens. These genes occur in several variants, and precisely identifying the variants present in samples helps track the propagation of antibiotic-resistance genes. We identify the variants of antibiotic-resistance genes and their relative abundances using shotgun metagenomic data collected from multiple samples. We take a probabilistic approach, since modeling DNA sequencing errors is essential to distinguish actual variants from noise. We adapt ideas from the DESMAN software of Quince et al. (2017). The model is essentially a hierarchy of finite mixture submodels, which allows components (variants) to be shared among the different metagenomic samples. Efficient posterior sampling is challenging because the mixture components, which correspond to the variants' genomes, take values in a high-dimensional discrete space of long nucleotide sequences. This space is difficult to explore, and standard Gibbs sampling or MCMC strategies suffer from severe mixing issues. Drawing on recent advances in tempered Sequential Monte Carlo (Dau & Chopin 2022), we exploit natural tempering parameters in the mixture model to build efficient posterior sampling algorithms, and we use Sequential Monte Carlo estimates of marginal likelihoods to build a model choice framework for estimating the number of variants present in a collection of metagenomic samples.
Mathématiques et Informatique Appliquées du Génome à l'Environnement