Anne-Carmen Sanchez

Internship

Subject

Analysis of microbial diversity in metagenomic datasets

Start date

Mon 24/02/2020 - 12:00

End date

Fri 17/07/2020 - 12:00

Supervisor(s)

Anne-Laure Abraham, Hélène Chiapello, Pierre Nicolas

StatInfOmics

Description/abstract

Complex microbial ecosystems are composed of a large number of microorganisms, including
hundreds to several thousand bacterial organisms. Metagenomic shotgun sequencing
methods and taxonomic assignment tools have recently been developed to better understand the
precise composition of these ecosystems, and some methods can assign taxonomy at the
species/strain level (StrainPhlAn, Truong et al. Genome Research, 2017; MetaSNV, Costea et
al. PLoS One, 2017; ConStrains, Luo et al. Nat Biotechnol, 2016).
As more and more metagenomes and genomes are sequenced, it is now possible to go a step
further in understanding these ecosystems by studying the population genetics of microbial
species at short evolutionary scales. The human gut microbiota is an interesting ecosystem to
focus on, since it has been extensively studied in recent years and more than 9000
metagenomes and 4000 genomes are now publicly available (Pasolli et al. Cell, 2019;
Huttenhower et al. Nature, 2012). The gut ecosystem can also evolve rapidly in
response to diet, host species, and other colonizing taxa, adapting to new environmental
conditions. The study of Garud et al. (PLoS Biology, 2019) focused on the evolutionary
dynamics of 40 prevalent species and suggests that gut bacteria evolve on human-relevant
timescales.
To study this microbial diversity, a pipeline is currently being developed in the
StatInfOmics team of the MaIAGE research unit, written in Python 3 and Snakemake
(Köster et al. Bioinformatics, 2012). Briefly, metagenomic samples are first aligned to
reference bacterial genomes (obtained from RefSeq) with BWA (Li et al. Bioinformatics, 2009).
In a second step, allelic frequencies of ecosystem strains are computed using Samtools (Li,
Bioinformatics, 2011). Finally, two genomic diversity indices were designed, based on diversity
indices used in population genetics. The first index evaluates, for each species, the diversity
across the metagenomes within a dataset, and the second evaluates the diversity between
metagenomes of two datasets. Our preliminary results indicate that both indices are robust
to low coverage, which makes it possible to analyze low-abundance species and goes beyond
similar published approaches, which are limited to prevalent and/or the most abundant species
(for example Garud et al. PLoS Biology, 2019; Zhao et al. Cell Host & Microbe, 2019;
Schloissnig et al. Nature, 2013).
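To make the idea of a within-sample diversity index concrete, here is a minimal Python sketch. It is not the team's index, whose definition is not given here. It assumes the classical population-genetics estimator of nucleotide diversity: at each genome position, the probability that two reads drawn without replacement carry different alleles, averaged over sufficiently covered positions. The function names, the per-position count dictionaries and the `min_depth` parameter are hypothetical and chosen for illustration only.

```python
from typing import Dict, Iterable, Optional

# Hypothetical input format: one dict of read counts per base at each position,
# e.g. {"A": 5, "C": 5}, as could be derived from a pileup of aligned reads.
BaseCounts = Dict[str, int]


def allele_frequencies(counts: BaseCounts) -> Dict[str, float]:
    """Relative frequency of each observed allele at one position."""
    total = sum(counts.values())
    if total == 0:
        return {}
    return {base: n / total for base, n in counts.items() if n > 0}


def site_diversity(counts: BaseCounts) -> Optional[float]:
    """Probability that two reads drawn without replacement differ.

    Returns None when fewer than two reads cover the position.
    """
    depth = sum(counts.values())
    if depth < 2:
        return None
    identical_pairs = sum(c * (c - 1) for c in counts.values())
    return 1.0 - identical_pairs / (depth * (depth - 1))


def mean_diversity(sites: Iterable[BaseCounts], min_depth: int = 2) -> Optional[float]:
    """Average per-site diversity over positions covered by at least min_depth reads."""
    values = []
    for counts in sites:
        if sum(counts.values()) < min_depth:
            continue  # position not covered well enough, skip it
        value = site_diversity(counts)
        if value is not None:
            values.append(value)
    return sum(values) / len(values) if values else None


if __name__ == "__main__":
    toy_sites = [{"A": 5, "C": 5}, {"A": 10}, {"A": 1}]
    print(allele_frequencies(toy_sites[0]))
    print(mean_diversity(toy_sites))
```

Drawing pairs without replacement keeps the estimate unbiased at low depth, which matters when low-abundance species are analyzed. Whether the pipeline's indices use this correction is not stated here.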
A previous L3 intern in the team developed several scripts to compute
polymorphisms in public complete genomes of 261 prevalent gut microbiota species (Schmidt
et al. eLife, 2019). These scripts will be useful to identify the closest and most relevant strains
of an ecosystem.
The aim of the M2 internship is to apply the whole pipeline to one or two published
metagenomic dataset(s) and to design relevant approaches to visualize, analyze and interpret the
results. These analyses will be a first step in testing and evaluating our approach to studying
the diversity of gut microbiota species on a complete biological dataset.
We have identified two relevant datasets to study the evolution of diversity over a few months. A
first study (Yassour et al. Sci Transl Med, 2016) provides metagenomic samples of the gut
microbiota of 39 children at 2, 12, 24 and 36 months. A second study (Ferretti et al. Cell Host
and Microbe, 2018) presents the microbiota of 25 children at 1, 3 and 7 days, and at 1 and 3 months.
We would like to answer questions such as: (i) Which sequenced genomes are close
to ecosystem strains? (ii) What is the diversity of bacterial species in a given sample? (iii) What
is the diversity between samples of different individuals? (iv) How does the diversity vary over
time for each individual and each bacterial species? In the first dataset, some children have
taken antibiotics, and these questions may provide relevant biological results on the impact of
antibiotics on strain diversity.
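Questions (iii) and (iv) both come down to comparing two samples of the same species, either from different individuals or from the same individual at two time points. The sketch below is an illustration under stated assumptions, not the index designed by the team. It uses a standard between-sample measure: at each position, the probability that a read drawn from one sample and a read drawn from the other carry different alleles, averaged over positions covered in both samples. The function names and the count-dictionary input format are hypothetical.

```python
from typing import Dict, List, Optional

# Hypothetical input format: one dict of read counts per base at each position.
BaseCounts = Dict[str, int]


def between_site_diversity(counts_a: BaseCounts, counts_b: BaseCounts) -> Optional[float]:
    """Probability that one read from each sample differ at this position.

    Returns None when the position is not covered in one of the samples.
    """
    depth_a = sum(counts_a.values())
    depth_b = sum(counts_b.values())
    if depth_a == 0 or depth_b == 0:
        return None
    # Number of (read from a, read from b) pairs that carry the same base.
    identical_pairs = sum(n * counts_b.get(base, 0) for base, n in counts_a.items())
    return 1.0 - identical_pairs / (depth_a * depth_b)


def mean_between_diversity(sample_a: List[BaseCounts], sample_b: List[BaseCounts]) -> Optional[float]:
    """Average between-sample diversity over positions covered in both samples.

    Both lists are assumed to be indexed by the same reference positions.
    """
    values = []
    for counts_a, counts_b in zip(sample_a, sample_b):
        value = between_site_diversity(counts_a, counts_b)
        if value is not None:
            values.append(value)
    return sum(values) / len(values) if values else None


if __name__ == "__main__":
    # Toy example: the same species in one individual at two time points.
    early = [{"A": 4}, {"C": 3}, {"G": 2}]
    late = [{"A": 2, "G": 2}, {"C": 7}, {}]
    print(mean_between_diversity(early, late))
```

Applied to consecutive time points of one individual, such a value gives a simple trajectory of strain turnover for each species. Comparing it with the within-sample diversity of each time point helps separate a change of dominant strain from a mixture of strains.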

Year of defense (for theses or internships)

2020

School/university (for theses and internships)

Sorbonne Université

Level/degree (for internships)

Master 2

Mathématiques et Informatique Appliquées du Génome à l'Environnement