The aim of this thesis is to develop and analyse Bayesian nonparametric models to explore diversity in metagenomic data. This involves 1) furthering the knowledge about the fundamental properties of existing Bayesian nonparametric processes, 2) using them as building blocks to develop flexible models for noisy and high-dimensional data and 3) designing efficient and scalable inference algorithms, via parallelisation, optimisation and/or careful approximations.

The conceptual framework of Bayesian nonparametric models is particularly well-suited to describe complex and noisy data such as metagenomic data. Such data represent a crucial tool to explore the diversity of environments, such as marine environments (with environmental DNA, Cowart et al., 2018), human body (Van Rossum et al., 2020), tumor diversity (Nik-Zainal et al., 2012) or virus strain diversity. They are inherently large dimensional, suffer from multiple sources of noise, exhibit a complex latent structure (clusters, tree, network) and present challenges for which Bayesian nonparametric approaches have been recognised as promising, e.g. Lee et al. (2015); Roth et al. (2014). Bayesian nonparametric approaches are particularly interesting for complex data because they naturally account for uncertainty about the precise data generating mechanism, allowing flexibility in crucial aspects such as the functional form of the dependence to covariates, the error model, or the size of the latent space. On top of this, the Bayesian framework allows carrying this uncertainty seamlessly into the estimation uncertainty or real-time prediction uncertainty.

Moreover, addressing concrete biological questions often stimulates the development of new Bayesian nonparametric processes, for instance because standard processes such as the Dirichlet process are sometimes too simplistic and fail to describe certain patterns in the data (such as power-law behaviours). Additional advances are stimulated by the computational challenges in dealing with large dimensional data, requiring the development of bespoke inference strategies. We envision several possible approaches: variational inference (Blei and Jordan, 2006) replaces costly Markov chain Monte Carlo sampling by a high-dimensional optimisation problem for which efficient algorithms such as stochastic gradient descent may be used. Approximate Bayesian computation (ABC) replaces unfeasible likelihood evaluations, which may occur when modelling data with complex latent discrete structures (e.g. trees), by a large number of simulations. We have used ABC in the context of Bayesian nonparametric models and of inverse problems respectively in Kon Kam King et al. (2019) and Forbes et al. (2021). An efficient implementation of approximate Bayesian computation strategies will entail particular efforts in parallelisation and high-performance computing. Finally, when possible, we will also investigate the possibility of analytical approximations of Bayesian nonparametric processes, for which finite-size and asymptotic approximations have been shown to give good results (Bystrova et al., 2021).

The main biological application which will be addressed in this thesis will be to describe the diversity observed in metagenomic data and its relation to covariates. The type of metagenomic data considered can include environmental DNA (eDNA), describing for instance how the composition of soil microbial communities relates to environmental pollution (Arbel et al., 2016), or shotgun metagenomic data characterising the microbial composition of several compartments (milk, air, grass, cheese) in an agroecological cheese production line (TANDEM project, see below). We have access to a couple of rich datasets to carry out this application: (i) the StatInfOmics team is involved in the project TANDEM, supported by the INRAE flagship project HOLOFLUX. This project aims to study bacterial fluxes inside agro-ecological systems for cheese production, from grazing material to cheese through cows and milk. This involvement will present multiple opportunities for tackling interesting biological questions, analyse original data and develop mature and practical methodology directly benefiting areas of interest to INRAE; (ii) eDNA data sampled at study sites in the northern French Alps thanks to collaborations of Daria Bystrova and Julyan Arbel with Wilfried Thuiller at LECA. This study sites belong to the long-term observatory ORCHAMP (https://orchamp.osug.fr/home), which aims to observe, understand and model biodiversity and ecosystem functioning over space and time.

References

Arbel, J. (2019). Bayesian Statistical Learning and Applications. HDR thesis, Universit ́e Grenoble-Alpes.

Arbel, J., Kon Kam King, G., Lijoi, A., Nieto-Barajas, L. E., and Pr ̈unster, I. (2021). BNPdensity: Bayesian

nonparametric mixture modeling in R. Australian & New Zealand Journal of Statistics, in press.

Arbel, J., Mengersen, K., and Rousseau, J. (2016). Bayesian nonparametric dependent model for par-

tially replicated data: The influence of fuel spills on species diversity. The Annals of Applied Statistics,

10(3):1496–1516.

Blei, D. M. and Jordan, M. I. (2006). Variational inference for Dirichlet process mixtures. Bayesian Analysis,

1(1).

Bystrova, D., Arbel, J., Kon Kam King, G., and Deslandes, F. (2021). Approximating the clusters’ prior dis-

tribution in Bayesian nonparametric models. In Third Symposium on Advances in Approximate Bayesian

Inference.

Cowart, D. A., Murphy, K. R., and Cheng, C.-H. C. (2018). Metagenomic sequencing of environmental DNA

reveals marine faunal assemblages from the West Antarctic Peninsula. Marine Genomics, 37:148–160.

Forbes, F., Nguyen, H. D., Nguyen, T. T., and Arbel, J. (2021). Approximate Bayesian computation with

surrogate posteriors. Submitted.

Kon Kam King, G., Arbel, J., and Pr ̈unster, I. (2017). Bayesian Statistics in Action, chapter A Bayesian non-

parametric approach to ecological risk assessment, pages 151–159. Springer Proceedings in Mathematics

& Statistics, Volume 194. Springer International Publishing, Editors: Raffaele Argiento et al.

Kon Kam King, G., Canale, A., and Ruggiero, M. (2019). Bayesian functional forecasting with locally-

autoregressive dependent processes. Bayesian Analysis, 14(4):1121–1141.

Lee, J., M ̈uller, P., Gulukota, K., and Ji, Y. (2015). A Bayesian feature allocation model for tumor hetero-

geneity. The Annals of Applied Statistics, 9(2):621–639.

Nik-Zainal, S., Van Loo, P., Wedge, D. C., Alexandrov, L. B., Greenman, C. D., Lau, K. W., Raine, K.,

Jones, D., Marshall, J., Ramakrishna, M., Shlien, A., Cooke, S. L., Hinton, J., Menzies, A., Stebbings,

L. A., Leroy, C., Jia, M., Rance, R., Mudie, L. J., Gamble, S. J., Stephens, P. J., McLaren, S., Tarpey,

P. S., Papaemmanuil, E., Davies, H. R., Varela, I., McBride, D. J., Bignell, G. R., Leung, K., Butler,

A. P., Teague, J. W., Martin, S., J ̈onsson, G., Mariani, O., Boyault, S., Miron, P., Fatima, A., Langerød,

A., Aparicio, S. A. J. R., Tutt, A., Sieuwerts, A. M., Borg, ̊A., Thomas, G., Salomon, A. V., Richardson,

A. L., Børresen-Dale, A.-L., Futreal, P. A., Stratton, M. R., and Campbell, P. J. (2012). The Life History

of 21 Breast Cancers. Cell, 149(5):994–1007.

Roth, A., Khattra, J., Yap, D., Wan, A., Laks, E., Biele, J., Ha, G., Aparicio, S., Bouchard-Côté, A.,

and Shah, S. P. (2014). PyClone: Statistical inference of clonal population structure in cancer. Nature

Methods, 11(4):396–398.

Van Rossum, T., Ferretti, P., Maistrenko, O. M., and Bork, P. (2020). Diversity within species: Interpreting

strains in microbiomes. Nature Reviews Microbiology, 18(9):491–506.