The Ph.D. project aims to develop information extraction (IE) methods that automatically build a knowledge graph describing the biology of microbes involved in the transformation or preservation of plant-based foods. The knowledge graph will formalize the molecules that microorganisms produce and degrade during fermentation.
The IE methods will cover named-entity recognition, entity normalization against semantic references, and relation extraction. They will build on recent deep learning approaches that adapt language models with few or no training examples, either through transfer learning or through distant or weak supervision that exploits existing structured information (knowledge bases and ontologies). These structured resources will be selected according to the needs of the dedicated FAIROmics use cases (e.g. NCBI Taxonomy for taxa, FoodEx2 for food, ChEBI for molecules, KEGG for pathways). Existing annotated corpora will serve as a starting point for training (e.g. CHEMDNER, Pathway Curation, Bacteria Biotope).
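As an illustration of the distant supervision idea, the following minimal Python sketch projects a small lexicon onto raw text to obtain weakly labelled, normalized mentions and co-occurrence candidates for relation extraction. It is not the project's actual workflow: the lexicon, the identifiers (placeholders, not real NCBI Taxonomy, ChEBI or FoodEx2 entries), the function names and the `co_occurs_with` label are all assumptions made for the example.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Mention:
    start: int    # character offset of the first character
    end: int      # character offset just past the last character
    text: str     # surface form as it appears in the input
    etype: str    # entity type (Taxon, Molecule, Food)
    ref_id: str   # identifier in the semantic reference (placeholder here)


# Toy lexicon standing in for a knowledge base or ontology export.
# The identifiers are hypothetical placeholders, not real database entries.
LEXICON = {
    "lactobacillus plantarum": ("Taxon", "TAXON:PLACEHOLDER-1"),
    "lactic acid": ("Molecule", "MOLECULE:PLACEHOLDER-1"),
    "sauerkraut": ("Food", "FOOD:PLACEHOLDER-1"),
}


def weak_label(text, lexicon=LEXICON):
    """Return non-overlapping lexicon matches as weakly labelled mentions."""
    mentions = []
    # Try longer terms first so that a long match wins over a nested short one.
    for term in sorted(lexicon, key=len, reverse=True):
        pattern = re.compile(r"\b" + re.escape(term) + r"\b", re.IGNORECASE)
        for m in pattern.finditer(text):
            overlaps = any(m.start() < x.end and x.start < m.end() for x in mentions)
            if overlaps:
                continue
            etype, ref_id = lexicon[term]
            mentions.append(Mention(m.start(), m.end(), m.group(0), etype, ref_id))
    return sorted(mentions, key=lambda x: x.start)


def candidate_relations(mentions):
    """Pair every Taxon with every Molecule found in the same text.

    These pairs are only co-occurrence candidates; a trained relation
    extraction model would have to decide which ones really hold.
    """
    taxa = [m for m in mentions if m.etype == "Taxon"]
    molecules = [m for m in mentions if m.etype == "Molecule"]
    return [(t.ref_id, "co_occurs_with", c.ref_id) for t in taxa for c in molecules]


if __name__ == "__main__":
    sentence = "Lactobacillus plantarum produces lactic acid in sauerkraut."
    found = weak_label(sentence)
    for mention in found:
        print(mention)
    print(candidate_relations(found))
```

Weak labels of this kind are noisy (ambiguous names, missing synonyms), which is why they are typically used to pre-train or augment a model that is then refined on manually annotated corpora.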
The project will rely on existing tools and resources for microbial biology developed by MaIAGE partners (e.g. Omnicrobe application*, OntoBiotope ontology*, extraction workflow).