Mathématiques et Informatique Appliquées
du Génome à l'Environnement

SULTAN Ibrahim

Type
PhD student
Subject
Statistical modeling of bacterial promoter sequences for regulatory motif discovery
Start date
End date
Supervisor(s)
P. Nicolas (MaIAGE, INRA Jouy en Josas)
Team(s)
StatInfOmics
Research contract
ITN List_MAPS
Doctoral school (for theses)
ED577 SDSV
Thesis director (for theses)
S. Schbath (MaIAGE, INRA Jouy en Josas)
Year of defense (for theses or internships)
2019
Defense date (for theses)
School/university (for theses and internships)
Université Paris-Saclay
Description/abstract

Transcription factors play a key role in mediating the adaptation of bacteria to environmental conditions. Powerful algorithms and approaches have been developed for the discovery of their binding sites, but the automatic de novo identification of the main regulons of a bacterium from genome and transcriptome data remains a challenge. The approach that we propose to address this task is based on a probabilistic model of the DNA sequence that can make use of precise information on the position of the transcription start sites and of condition-dependent transcription profiles. The model has two main novelties: it allows motif occurrences to overlap, and it incorporates covariates summarising transcription profiles into the probability that a motif occurs in a given promoter region. Each covariate may correspond to the coordinate of the gene on an axis (e.g. obtained by PCA or ICA) or to its position in a tree (e.g. obtained by hierarchical clustering). All the parameters are estimated in a Bayesian framework using a dedicated trans-dimensional MCMC algorithm. This makes it possible to adjust simultaneously, for many motifs and with many transcription covariates, the width of the corresponding position weight matrices, the number of parameters needed to describe positions with respect to the transcription start site, and the set of covariates that are relevant.
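
To make two of the ingredients named above more concrete, the following minimal Python sketch scores a candidate site against a position weight matrix and links transcription covariates to the probability that a motif occurs in a promoter. It is an illustration under stated assumptions, not the model of the thesis: the matrix values, the uniform background, the logistic link and the names `PWM`, `log_odds`, `occurrence_probability` and `best_site` are all hypothetical, and the sketch covers neither overlapping occurrences nor the trans-dimensional MCMC estimation.

```python
import math

# Hypothetical position weight matrix (PWM) of width 3.
# Each row gives P(A), P(C), P(G), P(T) at one motif position.
PWM = [
    [0.7, 0.1, 0.1, 0.1],
    [0.1, 0.1, 0.7, 0.1],
    [0.1, 0.7, 0.1, 0.1],
]
BACKGROUND = [0.25, 0.25, 0.25, 0.25]  # assumed uniform background model
INDEX = {"A": 0, "C": 1, "G": 2, "T": 3}


def log_odds(site, pwm=PWM, background=BACKGROUND):
    """Log-likelihood ratio of a site under the motif versus the background."""
    if len(site) != len(pwm):
        raise ValueError("site length must equal the PWM width")
    return sum(
        math.log(row[INDEX[base]] / background[INDEX[base]])
        for row, base in zip(pwm, site)
    )


def occurrence_probability(covariates, weights, intercept):
    """Probability that a promoter contains the motif, given its covariates.

    Assumption: a logistic link is used here for illustration only;
    the abstract does not state which link the model uses.
    """
    linear = intercept + sum(w * x for w, x in zip(weights, covariates))
    return 1.0 / (1.0 + math.exp(-linear))


def best_site(promoter, width=len(PWM)):
    """Return (score, offset) of the highest-scoring window in a promoter."""
    return max(
        (log_odds(promoter[i:i + width]), i)
        for i in range(len(promoter) - width + 1)
    )
```

In this sketch a covariate could be, for example, the coordinate of a gene on a PCA axis: a positive weight then raises the occurrence probability for promoters of genes with a high coordinate on that axis, while a weight of zero makes the covariate irrelevant to the motif.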