Mathématiques et Informatique Appliquées
du Génome à l'Environnement

TANG Anfu

Subject
Extraction of relational information from text in specific domain - adaptability and scalability
Start date
End date
Supervisor(s)
C. Nédellec, L. Deléger, P. Zweigenbaum
Team(s)
Research contract
DigiCosme
Description/abstract

This thesis addresses the extraction of relational information from scientific documents in the Life Sciences, i.e. the transformation of unstructured text into machine-readable structured information. Extracting semantic relationships between entities detected in text makes the underlying structures explicit and formalizes them. Current state-of-the-art methods rely on supervised machine learning. Supervised learning, and recent deep learning methods even more so, requires many training examples that are costly to produce, especially in specialized domains such as the Life Sciences. We hypothesize that combining the information and knowledge available in specific domains with the latest deep learning word embedding models can offset the absence or limited amount of annotated training data. To this end, the thesis will design a rich representation of texts that draws on both linguistic information obtained from syntactic parsing and domain knowledge obtained from knowledge graphs such as ontologies. Integrating ontologies into the information extraction process will additionally facilitate the integration of extracted information with other data, such as experimental or analytical data.

Doctoral school (for theses)
ED STIC
Director (for theses)
A. Denise
Year of defense (for theses or internships)
2023
Defense date (for theses)
School/university (for theses and internships)
Université Paris-Saclay