Mathématiques et Informatique Appliquées
du Génome à l'Environnement
LE ROUX Zoé

Type
Doctoral candidate
Subject
Characterisation of Knowledge Graph through network tools: application to the Omnicrobe knowledge graph
Start date
End date
Supervisor(s)
Sandra Dérozier
Team(s)
StatInfOmics
Research contract
FAIROmics
Doctoral school (for theses)
UNIBO / UPSaclay SDSV
Thesis director (for theses)
Hélène Chiapello
Year of defence (for theses or internships)
2027
School/university (for theses and internships)
UNIBO / UPSaclay
Description/abstract

FAIROmics project:

The FAIROmics initiative is an interdisciplinary research programme that brings together universities, research centres and private companies. It aims to enable the FAIRification of omics data and the interoperability of databases, and to develop knowledge graphs for data-driven decision-making, so that microbial communities can be rationally designed to impart desirable characteristics to plant-based fermented foods, in the context of open science and its regulations. The FAIROmics training programme aims to develop doctoral candidates’ skills at the interface between artificial intelligence, life sciences, humanities, and social sciences.

Scientific context:

Plant-based dairy and meat alternatives have grown in popularity in recent years for various reasons, including sustainability and health benefits, as well as lifestyle trends and dietary restrictions. However, plant-based food products can be nutritionally unbalanced, and their flavour profiles may limit their acceptance by consumers. Microorganisms have been used in making food products for millennia. Yet the diversity of the microbial communities driving plant-based fermentations, their key genetic and phenotypic traits, and the potential synergies among community members remain poorly characterised. A great deal of data exists, but it is scattered across the scientific and grey literature or, at best, across different databases. It is not always reusable, because it is difficult to find and access and because databases are not systematically interoperable.

Objectives:

The objective is to understand the network structure of knowledge graphs (KGs) in order to develop and test algorithms for their characterisation and optimisation. In particular, spectral approaches based on the network Laplacian operator and AI-derived node embedding techniques (e.g. DeepWalk, node2vec, Transformer networks or autoencoders) will be tested, providing an interpretation of KG elements that is useful for manifold learning or geometric deep learning. In addition, methods from network analysis theory, such as community structure characterisation and the identification of key elements (nodes, links, pathways), will be studied and applied to the available cases.
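As an illustration of the spectral approach mentioned above, the following minimal sketch builds the combinatorial Laplacian L = D - A of a toy graph and splits the graph using the sign of the Fiedler vector (the eigenvector of the second-smallest eigenvalue). The toy graph, the function names and the power-iteration solver are assumptions made for this example; they are not the project's actual data or method.

```python
# Toy undirected graph: two triangles joined by a single bridge edge (2-3).
# Node labels are hypothetical placeholders, not Omnicrobe data.
edges = [(0, 1), (0, 2), (1, 2), (2, 3), (3, 4), (3, 5), (4, 5)]
n = 6


def laplacian(n, edges):
    """Return the combinatorial Laplacian L = D - A as a list of lists."""
    L = [[0.0] * n for _ in range(n)]
    for u, v in edges:
        L[u][u] += 1.0
        L[v][v] += 1.0
        L[u][v] -= 1.0
        L[v][u] -= 1.0
    return L


def fiedler_vector(L, iters=2000):
    """Approximate the eigenvector of the second-smallest eigenvalue of L.

    Runs power iteration on M = c*I - L, where c bounds the largest
    eigenvalue of L, and removes the constant component at every step.
    """
    n = len(L)
    c = 2.0 * max(L[i][i] for i in range(n))  # Gershgorin bound
    # Deterministic zero-mean starting vector.
    x = [float(i) - (n - 1) / 2.0 for i in range(n)]
    for _ in range(iters):
        y = [c * x[i] - sum(L[i][j] * x[j] for j in range(n)) for i in range(n)]
        mean = sum(y) / n
        y = [v - mean for v in y]          # project out the constant vector
        norm = sum(v * v for v in y) ** 0.5
        x = [v / norm for v in y]
    return x


L = laplacian(n, edges)
f = fiedler_vector(L)

# Spectral bisection: split the nodes by the sign of the Fiedler vector.
part_a = sorted(i for i in range(n) if f[i] < 0)
part_b = sorted(i for i in range(n) if f[i] >= 0)
print(part_a, part_b)
```

On a real knowledge graph one would more likely rely on a sparse eigensolver from a numerical library; the pure-Python iteration here only keeps the sketch self-contained.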

In the FAIROmics project framework, the nodes of the graph represent biological entities, e.g. bacteria, food matrices, food ingredients, metabolites, genes, and the functions of these genes. The edges represent relationships between these entities, e.g. a bacterium growing in a food matrix. Moreover, the entities are themselves linked to reference classes defined in knowledge resources such as ontologies (e.g. bacterial taxa in the NCBI Taxonomy, food matrices in FoodEx2).
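The data model described above can be sketched as typed nodes, (subject, relation, object) triples, and a mapping from entities to reference classes. All entity names, relation names and class identifiers below are hypothetical placeholders chosen for illustration; none of them comes from Omnicrobe, the NCBI Taxonomy or FoodEx2.

```python
from collections import defaultdict

# Hypothetical placeholder entities with their entity type.
node_type = {
    "bacterium_A": "bacterium",
    "bacterium_B": "bacterium",
    "food_matrix_X": "food_matrix",
    "metabolite_M": "metabolite",
}

# Edges stored as (subject, relation, object) triples.
# Relation names are assumptions made for this sketch.
triples = [
    ("bacterium_A", "grows_in", "food_matrix_X"),
    ("bacterium_B", "grows_in", "food_matrix_X"),
    ("bacterium_A", "produces", "metabolite_M"),
]

# Links from entities to reference classes in an external resource.
# The class identifiers are placeholders, not real ontology IDs.
reference_class = {
    "bacterium_A": ("NCBI Taxonomy", "placeholder_taxon_1"),
    "bacterium_B": ("NCBI Taxonomy", "placeholder_taxon_2"),
    "food_matrix_X": ("FoodEx2", "placeholder_food_class"),
}

# Index the triples by (relation, object) for simple lookups.
subjects_by = defaultdict(set)
for s, r, o in triples:
    subjects_by[(r, o)].add(s)


def subjects(relation, obj):
    """Return all subjects linked to `obj` by `relation`, sorted by name."""
    return sorted(subjects_by.get((relation, obj), set()))


growing = subjects("grows_in", "food_matrix_X")
print(growing)
```

Keeping the ontology links separate from the entity-to-entity triples makes it easy to analyse the instance-level network on its own, or to add the class hierarchy as an extra layer later.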

Specific data will be analysed and produced within the FAIROmics project. As a starting case study, the Omnicrobe knowledge graph, which contains bacterial habitats and phenotypes, will be analysed to characterise the similarity of elements at different levels (nodes, paths, modules, communities). This will make it possible to check the network structure and to detect possible inconsistencies and missing or hidden relationships.
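One simple way to quantify node-level similarity and to surface possibly missing relationships is the Jaccard index of neighbour sets. The sketch below is only an illustration under stated assumptions: the bacterium and habitat names are placeholders (not Omnicrobe content), and the threshold of 0.5 and the helper `candidate_links` are hypothetical choices.

```python
from itertools import combinations

# Hypothetical bacterium -> habitat links (placeholders, not Omnicrobe data).
habitats = {
    "bacterium_A": {"habitat_1", "habitat_2", "habitat_3"},
    "bacterium_B": {"habitat_1", "habitat_2"},
    "bacterium_C": {"habitat_4"},
}


def jaccard(a, b):
    """Jaccard index: size of the intersection over size of the union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Pairwise similarity between bacteria, based on shared habitats.
similarity = {
    (u, v): jaccard(habitats[u], habitats[v])
    for u, v in combinations(sorted(habitats), 2)
}


def candidate_links(habitats, threshold=0.5):
    """Suggest (bacterium, habitat) pairs that are absent from the graph
    but present for a sufficiently similar bacterium.

    The output is a set of hypotheses to review, not confirmed facts.
    """
    suggestions = set()
    for u, v in combinations(sorted(habitats), 2):
        if jaccard(habitats[u], habitats[v]) >= threshold:
            for h in habitats[u] - habitats[v]:
                suggestions.add((v, h))
            for h in habitats[v] - habitats[u]:
                suggestions.add((u, h))
    return suggestions


print(similarity)
print(candidate_links(habitats))
```

The same idea extends to embedding-based similarity by replacing `jaccard` with, for instance, a cosine similarity between node vectors.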

Expected results:

The project is expected to yield a deeper understanding of knowledge graph structures and of how to query and manipulate them. This should make KGs easier to interpret and use, for example by 1) identifying hidden relationships through link imputation and analysis of node embedding similarity; 2) extracting relevant outliers or anomalies (in ontologies/nodes and/or relationships/links) that correspond to erroneous elements within KGs; 3) clustering KG elements through network community algorithms; 4) identifying "knowledge modules" through network diffusion algorithms. The network tools developed should also be useful in a wide range of contexts, from biological networks (e.g. protein interaction, gene regulation, microbial community ecology) to social networks (structure and dynamics of social networks, sentiment analysis, node classification, etc.).
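To make the idea of network diffusion concrete, here is a minimal sketch of a random walk with restart, one common diffusion scheme, used to rank nodes around a seed and read off a small "module". The toy graph, the restart probability of 0.3 and the choice of keeping the top three nodes are assumptions made for this example, not the project's actual algorithm or settings.

```python
# Toy undirected graph as an adjacency dict: two triangles joined by the
# edge 2-3. Node labels are hypothetical placeholders.
adj = {
    0: [1, 2],
    1: [0, 2],
    2: [0, 1, 3],
    3: [2, 4, 5],
    4: [3, 5],
    5: [3, 4],
}


def random_walk_with_restart(adj, seed, restart=0.3, iters=200):
    """Iterate p <- (1 - restart) * W p + restart * e_seed.

    W spreads the probability of each node evenly over its neighbours.
    Every node is assumed to have at least one neighbour.
    """
    nodes = sorted(adj)
    p = {v: 1.0 if v == seed else 0.0 for v in nodes}
    for _ in range(iters):
        nxt = {v: (restart if v == seed else 0.0) for v in nodes}
        for u in nodes:
            share = (1.0 - restart) * p[u] / len(adj[u])
            for v in adj[u]:
                nxt[v] += share
        p = nxt
    return p


scores = random_walk_with_restart(adj, seed=0)

# Nodes most strongly reached by diffusion from the seed form a candidate module.
module = sorted(sorted(scores, key=scores.get, reverse=True)[:3])
print(module)
```

A higher restart probability keeps the diffusion closer to the seed and yields tighter modules; a lower one lets it spread further across the graph.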