Mathématiques et Informatique Appliquées du Génome à l'Environnement

Bibliome Team

Knowledge acquisition and formalization from text

Team leader: Claire Nédellec

The Bibliome team develops natural language processing (NLP) and machine learning (ML) methods to extract information from texts in the field of biology.

We work on specific information extraction (IE) tasks such as entity recognition, entity normalization (entity linking), and relation extraction. We focus on methods that combine linguistic information, machine learning, and domain knowledge (ontologies and taxonomies), and that can cope with a small number of training examples.

We apply our methods to a wide range of applications in the Life Sciences, from microbial diversity to plant biology and epidemiological surveillance.

An important part of our activity also consists in promoting the development and evaluation of IE systems by organizing challenges.


Projects

Ongoing projects

EcoControl: Community Ecology and Digital Tools to Increase the Natural Regulation of Insect Pests in Agriculture. PEPR Agroécologie et Numérique (2025-2029)

FairOmics: FAIRification of multiOmics data to link databases and create knowledge graphs for fermented foods. DTN H2020 - HORIZON-MSCA-2022-DN-01-01

Omnicrobe: development of a database of information on microbial habitats and phenotypes extracted from text. Ongoing.

HoloOligo: Structure diversity, functionality and modulation of milk oligosaccharides in monogastric livestock species: towards optimal development of rabbit and pig holobionts. Project-ANR-21-CE20-0045 - Biologie des animaux, des organismes photosynthétiques et des microorganismes (2022-2025)

TIERS - ESV. Traitement de l’Information et Expertise des Risques Sanitaires pour l’Epidémiosurveillance en Santé Végétal. IB2021 Departments INRAE MathNum and SPE. (2021-2023)

TyDI: Terminology Design Interface. DiBiSO, Université Paris-Saclay, INIST-CNRS, BIA-INRAE and MaIAGE-INRAE. (2021-2025)

Beyond: Building epidemiological surveillance and prophylaxis with observations both near and distant. ANR Programme Prioritaire de Recherche "Cultiver et protéger autrement". Project IA-20-PCPA-0002 (2021-2026)

D2KAB Data to Knowledge in Agriculture and Biodiversity. ANR AAPG 2018-CE23-0017. (2019-2024)

 

Recent projects

ENovFood. Linking a phenotypic and a network food microbe databases: an application to food microbial ecology and food innovation. Metaprogramme MEM INRA. 2018-2020.

OntoBedding. Improving word embeddings with ontologies to adapt them to specialized domains, with LIMSI. Project funded by DIM RFSI. 2019

Visa TM (Towards an advanced infrastructure in text-mining) CoSO project, (2017-2019)

OpenMinTeD (Open Mining Infrastructure for Text and Data) Infrastructure H2020 project (2015-2018)

D-ONT, Optimized exploitation of phenotypic databases - Ontologies for information sharing, ACI Phase 2016-2018.

IMSV, Institut de modélisation des systèmes vivants, Lidex of Université Paris-Saclay (2014-2016)

SeeDev, Regulations in the development of Arabidopsis thaliana seed (Challenge Lidex CDS) (2015)

OntoBiotope. Metaprogramme INRA MEM (Metagenomics of microbial ecosystems). (2012-2013).

Triphase: Semantic information system for publications in animal physiology and agricultural systems. PHASE department (2013-2014).

Quaero: Automatic multimedia content processing. Oséo. (2008-2013).

FSOV SAM Blé: Selection of wheat by genetic markers. Fonds de soutien à l'obtention végétale (2010-2013).


Scientific coordination

Réseau2Neurones INRAE Workgroup

Workgroup Labex DigiCosme D2K (from Data to Knowledge)

BioNLP-Open Shared Task 2019: annotated corpora and online evaluation services

BioNLP-Shared Task (2011, 2013, 2016): annotated corpora and online evaluation services


Members

Claire Nédellec

Research director

Team leader

Robert Bossy

Research engineer

Lead of the "Alvis Suite"

Louise Deléger

Research scientist

Arnaud Ferré

Research scientist

Anne-Sophie Foussat

PhD student

Mariya Borovikova

PhD student

Marine Courtin

Postdoctoral researcher

 

 

Former members

Mariya Borovikova, PhD student
Myriam Dulor, intern
Anfu Tang, PhD student
Elisa Lubrini, intern
Clara Sauvion, R&D
Mouhamadou Ba, postdoc, OpenMinTeD project
Estelle Chaix, postdoc, OpenMinTeD project
Philippe Bessières, research director
Dialekti Valsamou, PhD student, IDEX IDI

Software

 

  • AlvisNLP is a corpus processing engine. AlvisNLP is highly configurable, supports a wide range of file formats, and provides a wide range of natural language processing, machine learning, and corpus analysis tools. AlvisNLP is well suited to information extraction and information retrieval experiments, as well as to deploying text-mining services. Funding: Alvis (EU project), Quaero (French project). Cite: Nédellec et al., 2009; Ba & Bossy, 2016.

  • TyDI is a collaborative tool for the validation and structuring of terms originating from either an existing terminology or a term extractor (such as BioYaTeA or TermSuite). TyDI supports collaborative term validation and term relations (synonymy, hyponymy, see-also). TyDI exports projects into standard formats (CSV, SKOS). Funding: Quaero (French project) and the INRAE-CNRS-Univ Paris-Saclay consortium. Cite: Golik et al., 2010.

  • re-bert is a BERT-based relation extraction architecture. re-bert supports cross-sentence relation extraction and ensemble voting.

  • CNorm is a shallow neural network method that outperforms the other evaluated methods on multi-class and few-shot entity normalization tasks in the biomedical and life science domains, such as Bacteria Biotope 4. Funding: OpenMinTeD (EU-INFRA). Cite: Ferré et al., 2020.

  • AlvisAE is a Web application for manual annotation. AlvisAE supports the annotation of named entities, relations, and normalization with free properties or with an ontology. AlvisAE also provides functions to manage annotation campaigns (assignment of documents, adjudication of double annotation). Funding: Quaero (French project). Cite: Papazian et al., 2012.

  • AlvisIR is a framework for building and deploying semantic search engines. AlvisIR semantic search engines index document terms as well as annotations extracted from the text. The user can search for terms, named entities, relations, and concepts obtained from named entity normalization. See, for example, the search on microorganisms. Funding: Alvis (EU project), Quaero (French project). Cite: Bossy et al., 2008.

  • BioYaTeA is an extension of the YaTeA term extractor that deals with prepositional attachments and adjectival participles. It extracts terms from documents in French and in English. Its distribution includes post-filtering of irrelevant terms. It is publicly available as a CPAN module. Part of this work has been funded by the European project Alvis and the French project Quaero. Cite: Golik et al., 2013.

  • Libraries

    • obo-utils (Python): read, validate, serialize ontologies in the OBO format.

    • bionlp-st-py (Python): read, validate, serialize annotations in the BioNLP-ST format.

    • IEval (Java): specify and deploy services for Information Extraction tasks evaluation. Supports Recall/Precision/F-Score, Slot Error Rate, Jaccard and many other metrics.

    • evaluate (Python): Information Extraction tasks evaluation library.

    • JSFragments (CSS/JS): widgets for presenting text-bound annotations.

    • Bibliome Utils (Java): miscellaneous utility classes and boilerplate.

    • misc-utils (Python): miscellaneous utility classes and boilerplate.
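To give an idea of the kind of data the BioNLP-ST libraries listed above deal with, here is a minimal sketch of reading text-bound entities and binary relations from BioNLP-ST standoff lines. It does not use the API of bionlp-st-py, which is not described on this page: the names `Entity`, `Relation` and `parse_standoff`, as well as the sample sentence and annotations, are hypothetical and chosen for illustration only. Other annotation kinds (for example normalizations) are simply skipped.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entity:
    ann_id: str
    ann_type: str
    start: int  # character offset, inclusive
    end: int    # character offset, exclusive
    text: str


@dataclass(frozen=True)
class Relation:
    ann_id: str
    ann_type: str
    args: tuple  # ((role, entity_id), ...)


def parse_standoff(lines):
    """Parse T (text-bound) and R (relation) lines; ignore everything else."""
    entities, relations = {}, {}
    for raw in lines:
        line = raw.rstrip("\n")
        if not line:
            continue
        fields = line.split("\t")
        ann_id = fields[0]
        if ann_id.startswith("T"):
            ann_type, offsets = fields[1].split(" ", 1)
            # Discontinuous entities use "s1 e1;s2 e2"; keep the overall extent.
            fragments = [tuple(map(int, f.split())) for f in offsets.split(";")]
            text = fields[2] if len(fields) > 2 else ""
            entities[ann_id] = Entity(
                ann_id, ann_type, fragments[0][0], fragments[-1][1], text
            )
        elif ann_id.startswith("R"):
            parts = fields[1].split()
            args = tuple(tuple(p.split(":", 1)) for p in parts[1:])
            relations[ann_id] = Relation(ann_id, parts[0], args)
    return entities, relations


# Made-up example sentence and annotations (not taken from any corpus).
TEXT = "Vibrio cholerae is found in brackish water."
SAMPLE = [
    "T1\tMicroorganism 0 15\tVibrio cholerae",
    "T2\tHabitat 28 42\tbrackish water",
    "R1\tLives_In Microorganism:T1 Location:T2",
]

entities, relations = parse_standoff(SAMPLE)
for ent in entities.values():
    print(ent.ann_id, ent.ann_type, repr(TEXT[ent.start:ent.end]))
for rel in relations.values():
    print(rel.ann_id, rel.ann_type, dict(rel.args))
```

Keeping entities and relations in dictionaries keyed by annotation identifier makes it easy to resolve relation arguments back to their text spans.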


Online Services

  • Online Evaluation Service: evaluate your predictions for several Information Extraction datasets from the BioNLP-ST challenge series.
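As a rough illustration of what such an evaluation computes, the sketch below implements three of the metrics named in the Libraries section: Recall/Precision/F-Score over sets of predictions, the Jaccard index between two character spans, and the Slot Error Rate. It is not the code of the online service or of IEval; the function names, the exact-match assumption for precision and recall, and the sample data are illustrative assumptions.

```python
def precision_recall_f1(reference, predicted):
    """Exact-match precision, recall and F1 between two collections of items."""
    reference, predicted = set(reference), set(predicted)
    true_positives = len(reference & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(reference) if reference else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


def span_jaccard(a, b):
    """Jaccard index between two half-open character spans (start, end)."""
    intersection = max(0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - intersection
    return intersection / union if union else 0.0


def slot_error_rate(substitutions, deletions, insertions, n_reference):
    """Slot Error Rate: (S + D + I) divided by the number of reference slots."""
    if n_reference <= 0:
        raise ValueError("n_reference must be positive")
    return (substitutions + deletions + insertions) / n_reference


# Made-up relations, written as (type, argument 1, argument 2).
reference = {("Lives_In", "T1", "T2"), ("Lives_In", "T3", "T4")}
predicted = {("Lives_In", "T1", "T2"), ("Lives_In", "T1", "T4")}

print(precision_recall_f1(reference, predicted))
print(span_jaccard((0, 15), (0, 6)))
print(slot_error_rate(substitutions=1, deletions=0, insertions=0, n_reference=2))
```

Exact matching is the strictest choice; a span-overlap measure such as the Jaccard index can be used instead to give partial credit for entity boundaries that are nearly right.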

AlvisIR Semantic search engines
  • Omnicrobe search engine: 3M PubMed articles indexed with microorganism taxa from the NCBI Taxonomy, and with habitats and phenotypes from the OntoBiotope Ontology.

  • SamBlé indexes a large set of references on genetic markers and phenotypes in bread wheat with Alvis Suite technology and the Wheat Trait Ontology. FSOV SamBlé Project and OpenMinTeD.

  • SeeDev indexes a large set of references on the molecular mechanisms involved in seed development using Alvis Suite technology. Supported by the UPSay CDS and IMSV projects and OpenMinTeD.

  • TriPhasIR indexes the publications of the PHASE scientific department (2010-2014) with the TriPhase termino-ontology.

  • AnimalIR indexes Animal Journal articles with the ATOL ontology

Omnicrobe
  • Omnicrobe is an online database that integrates information on microbe habitats and phenotypes from articles, BRC databases, and genetic databases.

AlvisNLP API endpoints

Shared Tasks, Corpora and Ontologies

Corpora and Shared Tasks

  • The BB'19 Corpus is part of the Bacteria Biotope Task at BioNLP Open Shared Tasks 2019. The goal is (1) to identify microorganisms and their habitats and phenotypes; (2) to normalize them with taxa from the NCBI taxonomy or concepts from the OntoBiotope ontologies; and (3) to extract relations between microorganisms and their habitats and phenotypes. The online evaluation service is available. 
    References 
    Robert Bossy, Louise Deléger, Estelle Chaix, Mouhamadou Ba, and Claire Nédellec. 2019. Bacteria Biotope at BioNLP Open Shared Tasks 2019. In Proceedings of the 5th Workshop on BioNLP Open Shared Tasks, pages 121–131, Hong Kong, China. Association for Computational Linguistics.
     
  • The BB'16 Corpus is part of the Bacteria Biotope Task in the BioNLP Shared Task 2016. The goal is (1) to identify bacteria and their habitats and categorize them with concepts of the OntoBiotope ontologies and (2) to extract relations between bacteria and their habitats from PubMed references. The online evaluation service is available.
    References 
    Louise Deléger, Robert Bossy, Estelle Chaix, Mouhamadou Ba, Arnaud Ferré, Philippe Bessières, Claire Nédellec, Overview of the Bacteria Biotope Task at BioNLP Shared Task.  In Proceedings of the BioNLP Shared Task 2016 Workshop, Association for Computational Linguistics, Berlin, Germany 2016. 
     
  • The SeeDev'16 Corpus is part of the SeeDev Task of the BioNLP Shared Task 2016. The goal is to extract complex interaction events involved in seed development in the model plant Arabidopsis. The online evaluation service is available.
    References 
    Estelle Chaix, Bertrand Dubreucq, Abdelhak Fatihi, Dialekti Valsamou, Robert Bossy, Mouhamadou Ba, Louise Deléger, Pierre Zweigenbaum, Philippe Bessières, Loïc Lepiniec, Claire Nédellec. Overview of the Regulatory Network of Plant Seed Development (SeeDev) Task at the BioNLP Shared Task.  In Proceedings of the BioNLP Shared Task 2016 Workshop, Association for Computational Linguistics, Berlin, Germany 2016. 
     
  • The BB'13 Corpus is part of the Bacteria Biotope Task in the BioNLP Shared Task 2013. The goal is (1) to identify bacteria and their habitats and categorize them with concepts of the OntoBiotope ontologies and (2) to extract relations between bacteria and their habitats from web pages. The online evaluation service is available.
    References 
    - Robert Bossy, Wiktoria Golik, Zorana Ratkovic, Dialekti Valsamou, Philippe Bessières, Claire Nédellec. An Overview of the  Gene Regulation Network and the Bacteria Biotope Tasks in BioNLP’13. BMC Bioinformatics, Vol 16 Suppl 10, 2015.
    - Bossy R., Golik W., Ratkovic Z., Bessières P., Nédellec C. BioNLP shared Task 2013 – An Overview of the  Bacteria Biotope Task. In Proceedings of the BioNLP 2013 Workshop, Association for Computational Linguistics, Sofia, Bulgaria, 2013. 
     
  • The BB'11 Corpus is part of the Bacteria Biotope Task in the BioNLP Shared Task 2011. The goal is (1) to identify bacteria and their habitats and categorize them into seven types and (2) to extract relations between bacteria and their habitats.
    References 
    - Robert Bossy, Julien Jourde, Alain-Pierre Manine, Philippe Veber, Erick Alphonse, Maarten van de Guchte, Philippe Bessières, Claire Nédellec. BioNLP Shared Task - The Bacteria Track. BMC Bioinformatics, (Suppl 11):S3, June 2012.
    - Robert Bossy, Julien Jourde, Philippe Bessières, Maarten van de Guchte, Claire Nédellec, "BioNLP shared Tasks 2011 - Bacteria Biotope", BioNLP workshop joint to ACL, Portland, USA, 2011.
     
  • The GRN Corpus is part of the Gene Regulation Network in Bacteria task in the BioNLP Shared Task 2013. The goal is to extract the full regulation network of Bacillus subtilis sporulation. The on-line evaluation service is available.
    References 
    - Robert Bossy, Wiktoria Golik, Zorana Ratkovic, Dialekti Valsamou, Philippe Bessières, Claire Nédellec. An Overview of the  Gene Regulation Network and the Bacteria Biotope Tasks in BioNLP’13. BMC Bioinformatics, Vol 16 Suppl 10, 2015
    - Bossy R., Bessières P., Nédellec C. BioNLP Shared Task 2013 – An overview of the Genic Regulation Network Task. In Proceedings of the BioNLP 2013 Workshop, Association for Computational Linguistics, Sofia, Bulgaria, 2013. 
     
  • The BI Corpus is part of the Bacteria Interaction task in the BioNLP Shared Task 2011. The goal is to extract complex interaction events from PubMed references.
    References 
    - Robert Bossy, Julien Jourde, Alain-Pierre Manine, Philippe Veber, Erick Alphonse, Maarten van de Guchte, Philippe Bessières, Claire Nédellec. BioNLP Shared Task - The Bacteria Track. BMC Bioinformatics, (Suppl 11):S3, June 2012.
    - Julien Jourde, Alain-Pierre Manine, Philippe Veber, Karen Fort, Robert Bossy, Erick Alphonse, Philippe Bessières, "BioNLP Shared Task 2011 - Bacteria Gene Interactions and Renaming", BioNLP workshop joint to ACL, Portland, USA, 2011. 
     
  • LLL Corpus (Learning Language in Logic): This is the original corpus of the LLL challenge. The goal of the LLL challenge is to evaluate the ability of participating Information Extraction systems to identify directed interactions and the genes/proteins that interact (named entities must be detected). The online evaluation service is still available. Note that the LLL corpus differs from the BioInfer LLL corpus. The BioInfer corpus is a transformation of the original LLL corpus in which the IE task has been made much easier: the relation arguments are given and the relation is not directed.
    References 
    Nédellec C. "Learning Language in Logic - Genic Interaction Extraction Challenge" in Proceedings of the Learning Language in Logic (LLL05) workshop joint to ICML'05. Cussens J. and Nédellec C. (eds). p 31-37, Bonn, August 2005. 
     

Ontologies

  • WheatPhenotype Ontology
    WheatPhenotype describes bread wheat phenotypes (Triticum aestivum) and environmental factors that influence them. Traits include resistance, development, nutrition, and bread quality. Environmental factors include biotic and abiotic traits. 
    References
    - Dialekti Valsamou, Robert Bossy, Marion Ranoux, Wiktoria Golik, Pierre Sourdille, Claire Nédellec. "Extraction d’information pour la sélection du blé par marqueur génétique". Actes de l'atelier IN-OVIVE 2ème édition des 25èmes Journées francophones d'Ingénierie des Connaissances, Clermont Ferrand, 14 mai 2014.
    - Claire Nédellec, Robert Bossy, Dialekti Valsamou, Marion Ranoux, Wiktoria Golik, Pierre Sourdille. Information Extraction from Bibliography for Marker Assisted Selection in Wheat. In proceedings of Metadata and Semantics for Agriculture, Food & Environment (AgroSEM'14), special track of the 8th Metadata and Semantics Research Conference (MTSR’14), Springer Communications in Computer and Information Science, Series Volume 478, pp 301-313, Karlsruhe, Germany, 2014. DOI: 10.1007/978-3-319-13674-5_28
    - R. Bossy and C. Nédellec. SamBlé. Moteur de recherche bibliographique sur la Sélection du blé assistée par marqueur. Projet FSOV Sélection du Blé Assistée par Marqueur
     
  • OntoBiotope Ontology
    OntoBiotope describes all types of microorganism habitats. The BioNLP-ST'16 version of the ontology contains more than 2,000 concepts. OntoBiotope is used for the annotation of the corpora of the BioNLP-ST'11, '13 and '16 Bacteria Biotope tasks and for the indexing of the PubMed Biotope semantic search engine. It is distributed by AgroPortal and LovINRA.
    References
    - Louise Deléger, Robert Bossy, Estelle Chaix, Mouhamadou Ba, Arnaud Ferré, Philippe Bessières, Claire Nédellec, Overview of the Bacteria Biotope Task at BioNLP Shared Task,  In Proceedings of the BioNLP Shared Task 2016 Workshop, Association for Computational Linguistics, Berlin, Germany 2016.
    - Robert Bossy, Wiktoria Golik, Zorana Ratkovic, Dialekti Valsamou, Philippe Bessières, Claire Nédellec. An Overview of the Gene Regulation Network and the Bacteria Biotope Tasks in BioNLP’13. BMC Bioinformatics, July 2015.
    - Robert Bossy, Julien Jourde, Alain-Pierre Manine, Philippe Veber, Erick Alphonse, Maarten van de Guchte, Philippe Bessières, Claire Nédellec. BioNLP Shared Task - The Bacteria Track. BMC Bioinformatics, (Suppl 11):S3, June 2012.
    - Bossy R., Golik W., Ratkovic Z., Bessières P., Nédellec C. BioNLP shared Task 2013 - An Overview of the Bacteria Biotope Task. In Proceedings of the BioNLP 2013 Workshop, Association for Computational Linguistics, pages 74-82. Sofia, Bulgaria, 2013.
    - Zorana Ratkovic, Wiktoria Golik, Pierre Warnier. BioNLP 2011 Task Bacteria Biotope - The Alvis System. BMC Bioinformatics 13(Suppl 11):S3, June 2012.
    - Zorana Ratkovic, Wiktoria Golik, Pierre Warnier, Philippe Veber, Claire Nédellec, "BioNLP 2011 Task Bacteria Biotope - The Alvis system", BioNLP workshop joint to ACL, Portland, USA, 2011.
    - Robert Bossy, Julien Jourde, Philippe Bessières, Maarten van de Guchte, Claire Nédellec, "BioNLP shared Tasks 2011 - Bacteria Biotope", BioNLP workshop joint to ACL, Portland, USA, 2011.
     
  • ATOL Ontology 
    ATOL, the Animal Trait Ontology for Livestock, describes the traits of livestock animals. It is developed by the INRA scientific department Phase in collaboration with the Bibliome group (D-ONT project).
    References
    - P.-Y. Le Bail, J. Bugeon, O. Dameron, A. Fatet, W. Golik, J.-F. Hocquette, C. Hurtaud, I. Hue, C. Jondreville, L. Joret, M.-C. Meunier-Salaün, J. Vernet, C. Nédellec, M. Reichstadt, P. Chemineau. Un langage de référence pour le phénotypage des animaux d’élevage : l’ontologie ATOL, INRA Prod. Anim., 2014, 27 (3), 195-208.
    - Hue I, Bugeon J, Dameron O, Fatet A, Hurtaud C, Joret L, Meunier-Salaün MC, Nédellec C, Reichstadt M, Vernet J, Le Bail PY. ATOL AND EOL ONTOLOGIES, STEPS TOWARDS EMBRYONIC PHENOTYPES SHARED WORLDWIDE?, 4th Mammalian Embryo Genomics meeting, Québec, October 2013.
    - Salaün, M.-C., Bugeon, J., Dameron, O., Fatet, A., Hue, I., Hurtaud, C., Nédellec, C., Reichstadt, M., Vernet, J., Reecy, J., Park, C., Le Bail, P.-Y. ATOL: an ontology for livestock. In: Book of abstracts of the 63rd Annual Meeting of the European Federation of Animal Science, Bratislava (Slovakia). Wageningen (NLD): Wageningen Academic Publishers (EAAP Book of Abstracts, 18), page 299, 2012.
    - Wiktoria Golik, Olivier Dameron, Jérôme Bugeon, Alice Fatet, Isabelle Hue, Catherine Hurtaud, Matthieu Reichstadt, Marie-Christine Salaün, Jean Vernet, Léa Joret, Frédéric Papazian, Claire Nédellec and Pierre-Yves Le Bail. "ATOL: the multi-species livestock trait ontology" in proceedings of The 6th Metadata and Semantics Research Conference (MTSR 2012), pp 289-300. Springer Verlag Communications in Computer and Information Science Series. Cadiz, Spain, 28-30 November 2012. DOI: 10.1007/978-3-642-35233-1_28
    - M. C. Meunier-Salaun, J. Bugeon, O. Dameron, A. Fatet, I. Hue, C. Hurtaud, L. Joret, C. Nédellec, M. Reichstadt, J. Vernet, PY Le Bail. Les ontologies ATOL et /EOL: des outils en appui aux nouveaux challenges en production porcine : phénotypage et élevage de précision, Journées de la Recherche Porcine (JRP), 4-5 February 2014.
     
  • TriPhase Ontology, « Terminologie pour la recherche d’information du département Phase » (terminology for information retrieval of the Phase department)
    Objective:
    The Triphase termino-ontology formally represents the research topics of the INRA scientific department PHASE, i.e. animal physiology and farming systems. Dedicated text-mining tools use TriPhase for the analysis of topics of Phase department researchers from their publications referenced in the ProdInra bibliographic database. 
    It has been developed by the Bibliome team and Information Science specialists from the Phase department to answer the needs for strategic analysis. 
    The structure of the TriPhase termino-ontology is hierarchical. It represents the entirety of the research themes of the Phase department. This set of research themes is defined in the department's scientific orientation document (2010–2015 departmental strategic plan). It contains 1,320 concepts named by 2,093 terms. The fine granularity of TriPhase is useful for the analysis of minor and transdisciplinary topics. 
    Use:
    TriPhase has been used for the analysis of concept distribution and evolution in time in publications from 2009 to 2013. The ANStrat tool developed by the Bibliome group is used to express queries on various criteria (e.g. topics, laboratories, type of publication, co-author partnership) and to display the results. Interactive navigation of TriPhase and concept selection is used to analyze topics at various levels of detail in combination with other bibliographic criteria. 
    Access and Licence:
    TriPhase is available on AgroPortal under CC-BY-SA license v3.0. Copyright Inra 2014.
    References:
    Agnès Girard and the network of librarians of the Phase department, Inra Rennes, and Claire Nédellec and the Bibliome research team. Triphase : co-construction d’une ressource termino-ontologique. Arabesque, Revue trimestrielle de l'agence bibliographique de l'Enseignement Supérieur, August 2016.

     

Corpora and ontologies are distributed under the Creative Commons CC-BY-SA license.