Type
Equipe(s)
This thesis addresses the extraction of relational information from scientific documents in Life Sciences, i.e. transforming unstructured text into machine-readable structured information. The extraction of semantic relationships between entities detected in text makes explicit and formalizes the underlying structures. Current state-of-the art methods rely on supervised machine learning. Supervised learning, and even more so recent deep learning methods, require many training examples that are costly to produce, all the more in specific domains such as Life Sciences. We hypothesize that combining information and knowledge available in specific domains with the latest deep learning word embedding models can offset the absence or limited amount of annotated training data. For this purpose, the thesis will design a rich representation of texts that draws both from linguistic information obtained from syntactic parsing and domain knowledge obtained from knowledge graphs such as ontologies. Integrating ontologies in the information extraction process will additionally facilitate information integration with other data, such as experimental or analytical data.