Knowledge acquisition and formalization from text
Team leader: Claire Nédellec
The Bibliome team develops natural language processing (NLP) and machine learning (ML) methods to extract information from texts in the field of biology.
We work on specific information extraction (IE) tasks such as entity recognition, entity normalization (entity linking), and relation extraction. We focus on methods that combine linguistic information, machine learning, and domain knowledge (ontologies and taxonomies), and that can cope with a small number of training examples.
We apply our methods to a wide range of Life Science applications, from microbial diversity to plant biology and epidemiological surveillance.
A significant part of our activity is also to promote the development and evaluation of IE systems by organizing challenges.
Projects
Ongoing projects
EcoControl: Community Ecology and Digital Tools to Increase the Natural Regulation of Insect Pests in Agriculture. PEPR Agroécologie et Numérique (2025-2029)
FairOmics: FAIRification of multiOmics data to link databases and create knowledge graphs for fermented foods. DTN H2020 - HORIZON-MSCA-2022-DN-01-01
Omnicrobe: development of a database of information on microbial habitats and phenotypes extracted from texts. Ongoing.
HoloOligo Structure diversity, functionality and modulation of milk oligosaccharides in monogastric livestock species: towards optimal development of rabbit and pig holobionts. Project-ANR-21-CE20-0045 - Biologie des animaux, des organismes photosynthétiques et des microorganismes (2022-2025)
TIERS - ESV. Traitement de l’Information et Expertise des Risques Sanitaires pour l’Epidémiosurveillance en Santé Végétal. IB2021 Departments INRAE MathNum and SPE. (2021-2023)
TyDI Terminology Design Interface. DiBiSO, université Paris-Saclay, INIST-CNRS, BIA-INRAE et MaIAGE-INRAE. (2021-2025).
Beyond - ANR Programme Prioritaire de Recherche Cultiver et protéger autrement. Building epidemiological surveillance and prophylaxis with observations both near and distant Projet IA-20-PCPA-0002 (2021-2026)
D2KAB Data to Knowledge in Agriculture and Biodiversity. ANR AAPG 2018-CE23-0017. (2019-2024)
Recent projects
ENovFood Linking a phenotypic and a network food microbe databases: an application to food microbial ecology and food innovation. Metaprogramme MEM INRA. 2018-2020.
OntoBedding. Improving word embeddings with ontologies to adapt them to specialized domains, with LIMSI. Project funded by DIM RFSI. 2019
Visa TM (Towards an advanced infrastructure in text-mining) CoSO project (2017-2019)
OpenMinTeD (Open Mining Infrastructure for Text and Data) Infrastructure H2020 project (2015-2018)
D-ONT, Optimized use of phenotypic databases - Ontologies for information sharing, ACI Phase 2016-2018.
IMSV, Institut de modélisation des systèmes vivants, Lidex de l'Université Paris-Saclay (2014-2016)
SeeDev, Regulations in the development of Arabidopsis thaliana seed (Challenge Lidex CDS) (2015)
OntoBiotope: Metaprogramme INRA MEM (Metagenomics of microbial ecosystems). (2012-2013).
Triphase: Semantic information system for publications in animal physiology and agricultural systems. PHASE department (2013-2014).
Quaero: Automatic multimedia content processing. Oséo. (2008-2013).
FSOV SAM Blé: Selection of wheat by genetic markers. Fond de soutien à l'obtention végétale (2010-2013).
Community activities
Réseau2Neurones INRAE Workgroup
Workgroup Labex DigiCosme D2K (from Data to Knowledge)
BioNLP-Open Shared Task 2019: annotated corpora and online evaluation services
BioNLP-Shared Task (2011, 2013, 2016): annotated corpora and on-line evaluation services
Members
Research director, team leader | Research engineer, lead of the "Alvis Suite" | Research scientist | Research scientist |
PhD student | PhD student | Postdoctoral researcher |
Former members
Mariya Borovikova | PhD student |
Myriam Dulor | Intern |
Anfu Tang | PhD student |
Elisa Lubrini | Intern |
Clara Sauvion | R&D |
Mouhamadou Ba | Postdoc, OpenMinTeD project |
Estelle Chaix | Postdoc, OpenMinTeD project |
Philippe Bessières | Research director |
Dialekti Valsamou | PhD student, IDEX IDI |
Software
AlvisNLP is a corpus processing engine. AlvisNLP is highly configurable, supports a wide range of file formats, and provides a wide range of natural language processing, machine learning, and corpus analysis tools. AlvisNLP is well suited to information extraction and information retrieval experiments, as well as to deploying text-mining services. Funding: Alvis (EU project), Quaero (French project). Cite: Nédellec et al., 2009; Ba & Bossy, 2016.
TyDI is a collaborative tool for the validation and structuring of terms originating from either an existing terminology or a term extractor program (like BioYaTeA or TermSuite). TyDI supports collaborative term validation and term relations (synonymy, hyponymy, see-also). TyDI exports projects into standard formats (CSV, SKOS). Funding: Quaero (French project) and the INRAE-CNRS-Univ Paris-Saclay consortium. Cite: Golik et al., 2010.
re-bert is a BERT-based relation extraction architecture. re-bert supports cross-sentence relation extraction and ensemble voting.
CNorm is a shallow neural network method that outperforms other evaluated methods in multi-class and few-shot entity normalization tasks in biomedical and life science domains, such as Bacteria Biotope 4. Funding: OpenMinTeD (EU-INFRA). Cite: Ferré et al., 2020.
AlvisAE is a manual annotation Web application. AlvisAE supports the annotation of named entities, relations, and normalization with free properties or with an ontology. AlvisAE also features functions to manage annotation campaigns (attribution of documents, adjudication of double annotation). Funding: Quaero (French project). Cite: Papazian, et al., 2012.
AlvisIR is a framework for building and deploying semantic search engines. AlvisIR semantic search engines index document terms as well as annotations extracted from the text. The user can search for terms, named entities, relations and concepts from named entity normalization. See for example search on microorganisms. Funding: Alvis (EU project), Quaero (French project). Cite: Bossy et al., 2008.
BioYaTeA is an extension of the YaTeA term extractor that deals with prepositional attachments and adjectival participles. It extracts terms from documents in French and in English. Its distribution includes post-filtering of irrelevant terms. It is publicly available as a CPAN module. Part of this work has been funded by the European project Alvis and the French project Quaero. Cite: Golik et al., 2013.
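The ensemble voting mentioned for re-bert above can be illustrated with plain majority voting over the labels predicted by several models. This is a minimal sketch, not re-bert's actual code: the function name, the label names and the tie-breaking rule are assumptions made for the example, and the voting scheme really used by re-bert is not described here.

```python
from collections import Counter
from typing import List, Sequence


def majority_vote(predictions: Sequence[Sequence[str]]) -> List[str]:
    """Combine per-model labels by majority vote.

    predictions[m][i] is the label given by model m to candidate relation i.
    Ties are broken alphabetically (an arbitrary choice for this sketch).
    """
    if not predictions:
        return []
    n_candidates = len(predictions[0])
    if any(len(p) != n_candidates for p in predictions):
        raise ValueError("all models must label the same candidates")
    voted = []
    for labels in zip(*predictions):
        counts = Counter(labels)
        best = max(counts.values())
        voted.append(min(label for label, c in counts.items() if c == best))
    return voted


# Three hypothetical models labelling three candidate relations.
model_outputs = [
    ["Lives_In", "None", "Exhibits"],
    ["Lives_In", "Lives_In", "None"],
    ["None", "None", "Exhibits"],
]
print(majority_vote(model_outputs))
```

Voting on discrete labels keeps the sketch independent of any model internals; averaging class probabilities is a common alternative.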
Libraries
obo-utils (Python): read, validate, serialize ontologies in the OBO format.
bionlp-st-py (Python): read, validate, serialize annotations in the BioNLP-ST format.
IEval (Java): specify and deploy services for Information Extraction tasks evaluation. Supports Recall/Precision/F-Score, Slot Error Rate, Jaccard and many other metrics.
evaluate (Python): Information Extraction tasks evaluation library.
JSFragments (CSS/JS): widgets for presenting text-bound annotations.
Bibliome Utils (Java): miscellaneous utility classes and boilerplate.
misc-utils (Python): miscellaneous utility classes and boilerplate.
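For readers unfamiliar with the metrics named for IEval above, the following sketch shows how precision, recall, F-score, the Jaccard index and the Slot Error Rate are commonly defined. It is an illustration under stated assumptions, not the API of IEval or evaluate: the function names are hypothetical, and annotations are compared by exact match on (type, start, end) tuples.

```python
from typing import Hashable, Set, Tuple


def precision_recall_f1(
    reference: Set[Hashable], predicted: Set[Hashable]
) -> Tuple[float, float, float]:
    """Exact-match precision, recall and F1 between two annotation sets."""
    true_positives = len(reference & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(reference) if reference else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


def jaccard(a: Set[Hashable], b: Set[Hashable]) -> float:
    """Jaccard index: size of the intersection over size of the union."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def slot_error_rate(
    substitutions: int, deletions: int, insertions: int, n_reference: int
) -> float:
    """SER = (S + D + I) / N, where N is the number of reference slots."""
    if n_reference <= 0:
        raise ValueError("n_reference must be positive")
    return (substitutions + deletions + insertions) / n_reference


# Made-up annotations: (entity type, start offset, end offset).
reference = {("Microorganism", 0, 17), ("Habitat", 30, 34), ("Habitat", 50, 60)}
predicted = {("Microorganism", 0, 17), ("Habitat", 30, 34), ("Habitat", 70, 75)}
print(precision_recall_f1(reference, predicted))
print(jaccard(reference, predicted))
```

Unlike the F-score, the Slot Error Rate can exceed 1 when a system produces many spurious predictions.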
Online Services
Online Evaluation Service: evaluate your predictions for several Information Extraction datasets from the BioNLP-ST challenge series.
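Predictions for BioNLP-ST datasets are exchanged as tab-separated standoff annotation files. The sketch below reads text-bound entities (T lines) and binary relations (R lines) from such content. It is a simplified illustration, not the bionlp-st-py API: the class and function names are hypothetical, the sample annotations are made up, and other annotation kinds (for instance normalizations) are skipped.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Entity:
    id: str
    type: str
    spans: Tuple[Tuple[int, int], ...]  # one (start, end) pair per fragment
    text: str


@dataclass(frozen=True)
class Relation:
    id: str
    type: str
    args: Tuple[Tuple[str, str], ...]  # (role, entity id) pairs


def parse_standoff(content: str) -> Tuple[Dict[str, Entity], List[Relation]]:
    """Parse T (entity) and R (relation) lines; ignore everything else."""
    entities: Dict[str, Entity] = {}
    relations: List[Relation] = []
    for line in content.splitlines():
        if not line.strip():
            continue
        fields = line.split("\t")
        ann_id = fields[0]
        if ann_id.startswith("T"):
            type_, _, offsets = fields[1].partition(" ")
            # Discontinuous entities list several fragments separated by ';'.
            spans = tuple(
                (int(start), int(end))
                for start, end in (frag.split() for frag in offsets.split(";"))
            )
            text = fields[2] if len(fields) > 2 else ""
            entities[ann_id] = Entity(ann_id, type_, spans, text)
        elif ann_id.startswith("R"):
            type_, *raw_args = fields[1].split()
            args = tuple(tuple(arg.split(":", 1)) for arg in raw_args)
            relations.append(Relation(ann_id, type_, args))
    return entities, relations


# Made-up sample; type and role names are illustrative only.
SAMPLE = (
    "T1\tMicroorganism 0 17\tBacillus subtilis\n"
    "T2\tHabitat 30 34\tsoil\n"
    "R1\tLives_In Microorganism:T1 Location:T2\n"
)
entities, relations = parse_standoff(SAMPLE)
print(entities["T1"], relations[0])
```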
AlvisIR Semantic search engines
Omnicrobe search engine: 3M PubMed articles indexed with microorganism taxa using the NCBI Taxonomy, and with habitats and phenotypes using the OntoBiotope Ontology.
SamBlé indexes a large set of references on genetic markers and phenotypes in bread wheat with Alvis Suite technology and the Wheat Trait Ontology. Supported by the FSOV SamBlé project and OpenMinTeD.
SeeDev indexes a large set of references on molecular mechanisms involved in seed development using Alvis Suite technology. Supported by UPSay CDS&IMSV projects and OpenMinTeD.
TriPhasIR indexes the publications of the PHASE scientific department (2010-2014) with the TriPhase termino-ontology.
AnimalIR indexes Animal Journal articles with the ATOL ontology
Omnicrobe
Omnicrobe is an online database that integrates information on microbial habitats and phenotypes from articles and from databases, including Biological Resource Centers (BRC) and genetic databases.
AlvisNLP API endpoints
Keyword Selection: extract terms from your corpus and rank them with TFIDF/Okapi-BM25.
Keyword Selection with Terminology: detect terms in your corpus and rank them with TFIDF/Okapi-BM25.
PESV Classifier: document filtering by relevance for Plateforme ESV (crop epidemiomonitoring).
YaTeA API: extract terms from a set of documents.
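To give an idea of what ranking terms with Okapi-BM25 means for the Keyword Selection endpoints above, here is a minimal sketch. It is not the AlvisNLP implementation: summing a term's per-document BM25 weights into a corpus-level score is an assumption made for illustration, as are the function names and the default parameters k1 = 1.2 and b = 0.75.

```python
import math
from collections import Counter
from typing import Dict, List


def bm25_term_scores(
    docs: List[List[str]], k1: float = 1.2, b: float = 0.75
) -> Dict[str, float]:
    """Score each term by the sum of its BM25 weights over all documents."""
    n_docs = len(docs)
    if n_docs == 0:
        return {}
    avg_len = sum(len(doc) for doc in docs) / n_docs
    # Document frequency: number of documents containing each term.
    doc_freq = Counter(term for doc in docs for term in set(doc))
    scores: Counter = Counter()
    for doc in docs:
        if not doc:
            continue
        length_norm = k1 * (1 - b + b * len(doc) / avg_len)
        for term, tf in Counter(doc).items():
            df = doc_freq[term]
            # Non-negative IDF variant: log(1 + (N - df + 0.5) / (df + 0.5)).
            idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
            scores[term] += idf * tf * (k1 + 1) / (tf + length_norm)
    return dict(scores)


def rank_terms(docs: List[List[str]], top_n: int = 10) -> List[str]:
    """Return the top_n terms by decreasing score, ties broken alphabetically."""
    scores = bm25_term_scores(docs)
    return sorted(scores, key=lambda term: (-scores[term], term))[:top_n]


# Toy corpus: each document is a list of already-extracted terms.
docs = [
    ["bacillus", "subtilis", "sporulation"],
    ["bacillus", "subtilis", "biofilm"],
    ["bacillus", "cereus", "toxin"],
]
print(rank_terms(docs, top_n=6))
```

In this toy corpus, "bacillus" occurs in every document and therefore ranks last, which is the behavior expected from an IDF-weighted score.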
Shared Tasks, Corpora and Ontologies
Corpora and Shared Tasks
- The BB'19 Corpus is part of the Bacteria Biotope Task at BioNLP Open Shared Tasks 2019. The goal is (1) to identify microorganisms and their habitats and phenotypes; (2) to normalize them with taxa from the NCBI taxonomy or concepts from the OntoBiotope ontologies; and (3) to extract relations between microorganisms and their habitats and phenotypes. The online evaluation service is available.
References
Robert Bossy, Louise Deléger, Estelle Chaix, Mouhamadou Ba, and Claire Nédellec. 2019. Bacteria Biotope at BioNLP Open Shared Tasks 2019. In Proceedings of the 5th Workshop on BioNLP Open Shared Tasks, pages 121–131, Hong Kong, China. Association for Computational Linguistics.
- The BB'16 Corpus is part of the Bacteria Biotope Task in the BioNLP Shared Task 2016. The goal is (1) to identify bacteria and their habitats, which must be categorized with concepts from the OntoBiotope ontology, and (2) to extract relations between bacteria and their habitats from PubMed references. The online evaluation service is available.
References
Louise Deléger, Robert Bossy, Estelle Chaix, Mouhamadou Ba, Arnaud Ferré, Philippe Bessières, Claire Nédellec, Overview of the Bacteria Biotope Task at BioNLP Shared Task. In Proceedings of the BioNLP Shared Task 2016 Workshop, Association for Computational Linguistics, Berlin, Germany 2016.
- The SeeDev'16 Corpus is part of the SeeDev Task of the BioNLP Shared Task 2016. The goal is to extract complex interaction events involved in the development of Arabidopsis model plant seed. The online evaluation service is available.
References
Estelle Chaix, Bertrand Dubreucq, Abdelhak Fatihi, Dialekti Valsamou, Robert Bossy, Mouhamadou Ba, Louise Deléger, Pierre Zweigenbaum, Philippe Bessières, Loïc Lepiniec, Claire Nédellec. Overview of the Regulatory Network of Plant Seed Development (SeeDev) Task at the BioNLP Shared Task. In Proceedings of the BioNLP Shared Task 2016 Workshop, Association for Computational Linguistics, Berlin, Germany 2016.
- The BB'13 Corpus is part of the Bacteria Biotope Task in the BioNLP Shared Task 2013. The goal is (1) to identify bacteria and their habitats, which must be categorized with concepts from the OntoBiotope ontology, and (2) to extract relations between bacteria and their habitats from web pages. The online evaluation service is available.
References
- Robert Bossy, Wiktoria Golik, Zorana Ratkovic, Dialekti Valsamou, Philippe Bessières, Claire Nédellec. An Overview of the Gene Regulation Network and the Bacteria Biotope Tasks in BioNLP’13. BMC Bioinformatics, Vol 16 Suppl 10, 2015.
- Bossy R., Golik W., Ratkovic Z., Bessières P., Nédellec C. BioNLP shared Task 2013 – An Overview of the Bacteria Biotope Task. In Proceedings of the BioNLP 2013 Workshop, Association for Computational Linguistics, Sofia, Bulgaria, 2013.
- The BB'11 Corpus is part of the Bacteria Biotope Task in the BioNLP Shared Task 2011. The goal is (1) to identify bacteria and their habitats, which must be categorized into seven different types, and (2) to extract relations between bacteria and their habitats.
References
- Robert Bossy, Julien Jourde, Alain-Pierre Manine, Philippe Veber, Erick Alphonse, Maarten van de Guchte, Philippe Bessières, Claire Nédellec. BioNLP Shared Task - The Bacteria Track. BMC Bioinformatics, 13(Suppl 11):S3, June 2012.
- Robert Bossy, Julien Jourde, Philippe Bessières, Maarten van de Guchte, Claire Nédellec, "BioNLP shared Tasks 2011 - Bacteria Biotope", BioNLP workshop co-located with ACL, Portland, USA, 2011.
- The GRN Corpus is part of the Gene Regulation Network in Bacteria task in the BioNLP Shared Task 2013. The goal is to extract the full regulation network of Bacillus subtilis sporulation. The on-line evaluation service is available.
References
- Robert Bossy, Wiktoria Golik, Zorana Ratkovic, Dialekti Valsamou, Philippe Bessières, Claire Nédellec. An Overview of the Gene Regulation Network and the Bacteria Biotope Tasks in BioNLP’13. BMC Bioinformatics, Vol 16 Suppl 10, 2015
- Bossy R., Bessières P., Nédellec C. BioNLP Shared Task 2013 – An overview of the Genic Regulation Network Task. In Proceedings of the BioNLP 2013 Workshop, Association for Computational Linguistics, Sofia, Bulgaria, 2013.
- The BI Corpus is part of the Bacteria Interaction task in the BioNLP Shared Task 2011. The goal is to extract complex interaction events from PubMed references.
References
- Robert Bossy, Julien Jourde, Alain-Pierre Manine, Philippe Veber, Erick Alphonse, Maarten van de Guchte, Philippe Bessières, Claire Nédellec. BioNLP Shared Task - The Bacteria Track. BMC Bioinformatics, 13(Suppl 11):S3, June 2012.
- Julien Jourde, Alain-Pierre Manine, Philippe Veber, Karen Fort, Robert Bossy, Erick Alphonse, Philippe Bessières, "BioNLP Shared Task 2011 - Bacteria Gene Interactions and Renaming", BioNLP workshop co-located with ACL, Portland, USA, 2011.
- LLL Corpus (Learning Language in Logic): This is the original corpus of the LLL challenge. The goal of the LLL challenge is to evaluate the ability of participating Information Extraction systems to identify directed interactions and the genes/proteins that interact (named entities must be detected). The online evaluation service is still available. Note that the LLL corpus differs from the BioInfer LLL corpus. The BioInfer corpus is a transformation of the original LLL corpus in which the IE task has been made much easier: the relation arguments are given and the relation is not directed.
References
Nédellec C. "Learning Language in Logic - Genic Interaction Extraction Challenge" in Proceedings of the Learning Language in Logic (LLL05) workshop joint to ICML'05. Cussens J. and Nédellec C. (eds). p 31-37, Bonn, August 2005.
Ontologies
- WheatPhenotype Ontology
WheatPhenotype describes bread wheat phenotypes (Triticum aestivum) and environmental factors that influence them. Traits include resistance, development, nutrition, and bread quality. Environmental factors include biotic and abiotic traits.
References
- Dialekti Valsamou, Robert Bossy, Marion Ranoux, Wiktoria Golik, Pierre Sourdille, Claire Nédellec. "Extraction d’information pour la sélection du blé par marqueur génétique". Actes de l'atelier IN-OVIVE 2ème édition des 25èmes Journées francophones d'Ingénierie des Connaissances, Clermont-Ferrand, May 14, 2014.
- Claire Nédellec, Robert Bossy, Dialekti Valsamou, Marion Ranoux, Wiktoria Golik, Pierre Sourdille. Information Extraction from Bibliography for Marker Assisted Selection in Wheat. In Proceedings of Metadata and Semantics for Agriculture, Food & Environment (AgroSEM'14), special track of the 8th Metadata and Semantics Research Conference (MTSR’14), Springer Communications in Computer and Information Science, Series Volume 478, pp 301-313, Karlsruhe, Germany, 2014. DOI: 10.1007/978-3-319-13674-5_28
- R. Bossy and C. Nédellec. SamBlé. Moteur de recherche bibliographique sur la Sélection du blé assistée par marqueur. Projet FSOV Sélection du Blé Assistée par Marqueur.
- OntoBiotope Ontology
OntoBiotope describes all types of microorganism habitats. The BioNLP-ST'16 version of the ontology contains more than 2,000 concepts. OntoBiotope is used for the annotation of the corpora of the BioNLP-ST'11, '13 and '16 Bacteria Biotope tasks and for the indexing of the PubMed Biotope semantic search engine. It is distributed by AgroPortal and LovINRA.
References
- Louise Deléger, Robert Bossy, Estelle Chaix, Mouhamadou Ba, Arnaud Ferré, Philippe Bessières, Claire Nédellec, Overview of the Bacteria Biotope Task at BioNLP Shared Task, In Proceedings of the BioNLP Shared Task 2016 Workshop, Association for Computational Linguistics, Berlin, Germany 2016.
- Robert Bossy, Wiktoria Golik, Zorana Ratkovic, Dialekti Valsamou, Philippe Bessières, Claire Nédellec. An Overview of the Gene Regulation Network and the Bacteria Biotope Tasks in BioNLP’13. BMC Bioinformatics, July 2015.
- Robert Bossy, Julien Jourde, Alain-Pierre Manine, Philippe Veber, Erick Alphonse, Maarten van de Guchte, Philippe Bessières, Claire Nédellec. BioNLP Shared Task - The Bacteria Track. BMC Bioinformatics, 13(Suppl 11):S3, June 2012.
- Bossy R., Golik W., Ratkovic Z., Bessières P., Nédellec C. BioNLP shared Task 2013 - An Overview of the Bacteria Biotope Task. In Proceedings of the BioNLP 2013 Workshop, Association for Computational Linguistics, pages 74-82. Sofia, Bulgaria, 2013.
- Zorana Ratkovic, Wiktoria Golik, Pierre Warnier. BioNLP 2011 Task Bacteria Biotope - The Alvis System. BMC Bioinformatics 13(Suppl 11):S3, June 2012.
- Zorana Ratkovic, Wiktoria Golik, Pierre Warnier, Philippe Veber, Claire Nédellec, "BioNLP 2011 Task Bacteria Biotope - The Alvis system", BioNLP workshop co-located with ACL, Portland, USA, 2011.
- Robert Bossy, Julien Jourde, Philippe Bessières, Maarten van de Guchte, Claire Nédellec, "BioNLP shared Tasks 2011 - Bacteria Biotope", BioNLP workshop co-located with ACL, Portland, USA, 2011.
- ATOL Ontology
ATOL, the Animal Trait Ontology for Livestock, describes the traits of livestock animals. It is developed by the INRA scientific department Phase in collaboration with the Bibliome group (D-ONT project).
References
- P.-Y. Le Bail, J. Bugeon, O. Dameron, A. Fatet, W. Golik, J.-F. Hocquette, C. Hurtaud, I. Hue, C. Jondreville, L. Joret, M.-C. Meunier-Salaün, J. Vernet, C. Nédellec, M. Reichstadt, P. Chemineau. Un langage de référence pour le phénotypage des animaux d’élevage : l’ontologie ATOL, INRA Prod. Anim., 2014, 27 (3), 195-208.
- Hue I, Bugeon J, Dameron O, Fatet A, Hurtaud C, Joret L, Meunier-Salaün MC, Nédellec C, Reichstadt M, Vernet J, Le Bail PY. ATOL and EOL ontologies, steps towards embryonic phenotypes shared worldwide? 4th Mammalian Embryo Genomics meeting, Québec, October 2013.
- Salaün, M.-C., Bugeon, J., Dameron, O., Fatet, A., Hue, I., Hurtaud, C., Nédellec, C., Reichstadt, M., Vernet, J., Reecy, J., Park, C., Le Bail, P.-Y. ATOL: an ontology for livestock. In: Book of abstracts of the 63rd Annual Meeting of the European Federation of Animal Science, Bratislava (Slovakia). Wageningen (NLD): Wageningen Academic Publishers (EAAP Book of Abstracts, 18), page 299, 2012.
- Wiktoria Golik, Olivier Dameron, Jérôme Bugeon, Alice Fatet, Isabelle Hue, Catherine Hurtaud, Matthieu Reichstadt, Marie-Christine Salaün, Jean Vernet, Léa Joret, Frédéric Papazian, Claire Nédellec and Pierre-Yves Le Bail. "ATOL: the multi-species livestock trait ontology" in Proceedings of the 6th Metadata and Semantics Research Conference (MTSR 2012), pp 289-300. Springer Verlag Communications in Computer and Information Science Series. Cadiz, Spain, November 28-30, 2012. DOI: 10.1007/978-3-642-35233-1_28
- M. C. Meunier-Salaün, J. Bugeon, O. Dameron, A. Fatet, I. Hue, C. Hurtaud, L. Joret, C. Nédellec, M. Reichstadt, J. Vernet, P.-Y. Le Bail. Les ontologies ATOL et EOL : des outils en appui aux nouveaux challenges en production porcine : phénotypage et élevage de précision, Journées de la Recherche Porcine (JRP), February 4-5, 2014.
- TriPhase Ontology ("Terminologie pour la recherche d’information du département Phase")
Objective:
The Triphase termino-ontology formally represents the research topics of the INRA scientific department PHASE, i.e. animal physiology and farming systems. Dedicated text-mining tools use TriPhase for the analysis of topics of Phase department researchers from their publications referenced in the ProdInra bibliographic database.
It has been developed by the Bibliome team and Information Science specialists from the Phase department to answer the needs for strategic analysis.
The structure of the TriPhase termino-ontology is hierarchical. It represents the entirety of the research themes of the Phase department. This set of research themes is defined in the department's scientific orientation document (2010–2015 departmental strategic plan). It contains 1,320 concepts named by 2,093 terms. The fine granularity of TriPhase is useful for the analysis of minor and transdisciplinary topics.
Use:
TriPhase has been used for the analysis of concept distribution and evolution in time in publications from 2009 to 2013. The ANStrat tool developed by the Bibliome group is used to express queries on various criteria (e.g. topics, laboratories, type of publication, co-author partnership) and to display the results. Interactive navigation of TriPhase and concept selection is used to analyze topics at various levels of detail in combination with other bibliographic criteria.
Access and Licence:
TriPhase is available on AgroPortal under CC-BY-SA license v3.0. Copyright Inra 2014.
References:
Agnès Girard and the network of information specialists of the Phase department, Inra Rennes, and Claire Nédellec and the Bibliome research team. Triphase : co-construction d’une ressource termino-ontologique. Arabesque, Revue trimestrielle de l'agence bibliographique de l'Enseignement Supérieur, August 2016.
Corpora and Ontologies are distributed under Creative Commons CC-BY-SA license.