The internship project aims to identify fluxes of micro-organisms between ecosystems. This work is part of TANDEM, an INRAE flagship project that gathers 10 teams and aims to better understand fluxes of micro-organisms in an agri-food cheese production chain. Metagenomic samples have been collected from 10 compartments, from grass to cheese through milk, cow bedding, rumen, etc. Our idea is to identify the species present in the metagenomic samples and to detect whether some strains are shared between samples, in order to infer flows between the compartments. A workflow was developed for a previous project, based on mapping metagenomic reads onto a dedicated catalog of reference genomes. Nucleotide polymorphisms shared across samples are then used to identify strain fluxes with various statistical techniques.
The objective of the internship will be to adapt the workflow to the specificities of the samples of the current TANDEM project, analyze the results and improve the statistical model used to identify strain fluxes. It will require programming in python3, R, and snakemake, using git and R notebooks for analysis reproducibility, and using the computational resources of the Migale platform cluster.
Complex microbial ecosystems are composed of a large number of micro-organisms, including hundreds to several thousands of bacterial organisms. Metagenomic shotgun sequencing methods and taxonomic assignment tools have been developed to better understand the precise composition of these ecosystems. However, to better understand the flux of micro-organisms between compartments and the evolution of ecosystems over time, an important step is to reach the sub-species level and identify strains shared between metagenomic samples. These analyses need more refined tools than taxonomic assignment, taking into account, for instance, the fact that the relative abundance of the strains of a species varies between samples.
We are working within the TANDEM project, an INRAE flagship project that gathers 10 teams and aims to better understand fluxes of micro-organisms in an agri-food chain. An experimental plan has been designed to compare 4 farming conditions that differ in the diet of the cows. A total of 250 samples have been collected across a cheese production chain (soil, grass, litter, cow feces and rumen, milk, cheese) and will be sequenced by shotgun metagenomics. Our objective in the project is to identify fluxes between compartments at the strain level and to compare them between the 4 farming conditions.
To identify these fluxes, a workflow has been developed in the team. The workflow is based on aligning the reads of each metagenomic sample on a catalog of reference genomes with BWA-MEM (Li and Durbin, Bioinformatics, 2009) and samtools (Li, Bioinformatics, 2011). The resulting alignments are used to compute nucleotide abundances at each position. The next step is to identify nucleotide polymorphisms shared across samples from the different ecosystems. A research assistant has been working for 18 months on this workflow, using samples from a previous project on the same ecosystem, and has built a dedicated reference genome catalog adapted to the ecosystems of interest. This dedicated catalog is based on the RefSeq database (O'Leary et al., Nucleic Acids Res., 2016), with the addition of relevant genomes from different origins and projects: metagenome-assembled genomes (MAGs) from the project, and microbial genomes isolated from cow rumen and feces and from cheese. One of the specificities of the pipeline is that the species in the reference catalog must be different enough to avoid ambiguous mapping of the metagenomic reads, which requires aggregating similar genomes and choosing a representative for each group of aggregated species.
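As a rough illustration of the step from per-position nucleotide abundances to shared polymorphisms, the sketch below works on base counts that are assumed to have already been extracted from the alignments. It is not the team's implementation: the function names, the input layout, and the thresholds `min_depth` and `min_minor_freq` are all hypothetical choices made for the example.

```python
BASES = "ACGT"

def base_frequencies(counts, min_depth=10):
    """Return per-base frequencies at one position, or None if coverage is too low.

    `counts` maps a base to its read count; `min_depth` is an assumed threshold.
    """
    depth = sum(counts.get(b, 0) for b in BASES)
    if depth < min_depth:
        return None
    return {b: counts.get(b, 0) / depth for b in BASES}

def is_polymorphic(freqs, min_minor_freq=0.05):
    """Call a position polymorphic if at least two bases reach `min_minor_freq`."""
    return sum(f >= min_minor_freq for f in freqs.values()) >= 2

def shared_polymorphic_positions(samples, min_depth=10, min_minor_freq=0.05):
    """Find positions that are polymorphic in at least two samples.

    `samples` has the (hypothetical) layout
    {sample_name: {(genome_id, position): {base: read_count}}}.
    Returns {(genome_id, position): set_of_sample_names}.
    """
    seen = {}
    for name, positions in samples.items():
        for key, counts in positions.items():
            freqs = base_frequencies(counts, min_depth)
            if freqs is not None and is_polymorphic(freqs, min_minor_freq):
                seen.setdefault(key, set()).add(name)
    return {key: names for key, names in seen.items() if len(names) >= 2}

# Toy example with two compartments and two positions on one genome.
toy = {
    "milk": {("g1", 100): {"A": 30, "G": 10}, ("g1", 200): {"C": 40}},
    "cheese": {("g1", 100): {"A": 5, "G": 15}, ("g1", 200): {"C": 3}},
}
shared = shared_polymorphic_positions(toy)
```

In this toy input only position 100 of genome `g1` is reported: position 200 is monomorphic in milk and insufficiently covered in cheese. A position shared in this sense is only a candidate, since the same variant can arise independently in two samples.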
Once the nucleotide polymorphisms shared across samples are identified, various statistical techniques are used to describe the relations among the samples and potential strain fluxes. One approach is to fully resolve the strain genomes and estimate the relative abundance of each strain in each sample. This can be used to look for evidence of strain transfers between samples, based on ideas from population genetics. Another approach is akin to kernel methods in machine learning: it involves building a suitable distance metric between samples based on strain-level information, which can then be used to cluster the samples using dimension reduction techniques such as Nonnegative Tensor Factorisation.
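To make the second approach more concrete, here is a minimal sketch of one possible distance between samples, computed from allele frequencies at the positions both samples cover. The choice of metric (mean total variation distance per position) is an assumption made for illustration and not necessarily the one used in the team's workflow; the function names and input layout are hypothetical as well.

```python
from itertools import combinations

BASES = "ACGT"

def allele_distance(freqs_a, freqs_b):
    """Mean total variation distance over positions covered in both samples.

    Each argument maps (genome_id, position) to {base: frequency}.
    Returns a value in [0, 1], or None if the samples share no position.
    """
    shared = freqs_a.keys() & freqs_b.keys()
    if not shared:
        return None
    total = 0.0
    for pos in shared:
        fa, fb = freqs_a[pos], freqs_b[pos]
        # Half the L1 distance between two frequency vectors lies in [0, 1].
        total += 0.5 * sum(abs(fa.get(b, 0.0) - fb.get(b, 0.0)) for b in BASES)
    return total / len(shared)

def pairwise_distances(samples):
    """Return {(name_a, name_b): distance} for every unordered pair of samples."""
    return {
        (a, b): allele_distance(samples[a], samples[b])
        for a, b in combinations(sorted(samples), 2)
    }

# Toy example: milk and cheese share one position, soil shares none.
toy = {
    "milk": {("g1", 100): {"A": 0.75, "G": 0.25}},
    "cheese": {("g1", 100): {"A": 0.25, "G": 0.75}},
    "soil": {("g2", 5): {"C": 1.0}},
}
distances = pairwise_distances(toy)
```

The resulting pairwise distances could then feed a clustering or dimension reduction step. Pairs with no shared position are returned as `None` rather than being given an arbitrary distance, so that the downstream method has to handle missing comparisons explicitly.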
Missions & competences:
The objective of the internship will be to adapt the current pipeline (snakemake) to analyse the TANDEM samples and to identify strain fluxes between compartments using the dedicated workflow. The intern will build on the analyses and scripts written by our current research assistant and adapt them to accommodate the specificities of the TANDEM samples and to take into account the large amount of metadata available for the project. A significant part of the internship will be devoted to furthering the statistical modelling of the data. This work will be carried out on a modern high-performance computing cluster with tools interfaced with python3 and R (a certain proficiency with these two programming languages will be needed). Code will be versioned with git, and the traceability of analyses will be ensured using RStudio notebooks.
Li, H., & Durbin, R. (2009). Fast and accurate short read alignment with Burrows–Wheeler transform. Bioinformatics, 25(14), 1754-1760.
Li, H. (2011). A statistical framework for SNP calling, mutation discovery, association mapping and population genetical parameter estimation from sequencing data. Bioinformatics, 27(21), 2987-2993.
O'Leary, N. A., Wright, M. W., et al. (2016). Reference sequence (RefSeq) database at NCBI: current status, taxonomic expansion, and functional annotation. Nucleic Acids Research, 44(D1), D733-D745.