Bibliome Team

"Extraction and formalization of knowledge from text"

Lead: Claire Nédellec

The Bibliome team develops methods in natural language processing (NLP) and machine learning (ML) to extract information from texts in the Life Science domain.

We work on specific information extraction (IE) tasks such as entity recognition, entity normalization (also called entity linking), and relation extraction. Our focus is on methods that combine linguistic information, machine learning, and domain knowledge (deep ontologies and taxonomies), and that can learn from a small number of training examples.

We adapt our methods to a wide range of applications in Life Sciences, from microbial diversity to plant biology and epidemiological surveillance.

An important part of our activity also involves promoting the development and evaluation of IE systems by organizing challenges.

Projects

On-going projects

EcoControl: COmmunity ecology and Numerical Tools to promote the natural Regulation Of insect pests in agricuLture. PEPR Agroécologie et Numérique (2025-2030).

FAIROmics: FAIRification of multiOmics data to link databases and create knowledge graphs for fermented foods. DTN H2020 - HORIZON-MSCA-2022-DN-01-01 (2024-2027).

Omnicrobe: reference database on microbe habitats and phenotypes. CRD ANSES (ongoing).

HoloOligo: Structure diversity, functionality and modulation of milk oligosaccharides in monogastric livestock species: towards optimal development of rabbit and pig holobionts. Project-ANR-21-CE20-0045 - Biologie des animaux, des organismes photosynthétiques et des microorganismes (2022-2025).

TyDI: Terminology Design Interface. INRAE (MaIAGE, DiPSO, BIA) DiBiSO, University of Paris-Saclay, INIST-CNRS (2021-2025).

TIERS - ESV: Information Processing and Expertise on Health Risks for Plant Health Epidemiological Surveillance. IB2021 Departments INRAE MathNum and SPE (2021–2023).

BEYOND: Building epidemiological surveillance and prophylaxis using both near and distant observations. ANR Priority Research Program “Cultiver et protéger autrement” (Growing and Protecting crops Differently). Project IA-20-PCPA-0002 (2021-2026).

D2KAB: Data to Knowledge in Agriculture and Biodiversity. ANR AAPG 2018-CE23-0017 (2019–2024).

Recent projects

ENovFood: Linking a phenotypic database and a network database on food microbes: an application to food microbial ecology and food innovation. Metaprogramme MEM INRA (2018-2020).

OntoBedding: Improving word embeddings with ontologies to adapt them to specialized domains, with LIMSI. Project funded by the DIM RFSI (2019).

Visa TM: Towards an advanced infrastructure in text-mining. CoSO project (2017-2019).

OpenMinTeD (Open Mining Infrastructure for Text and Data) Infrastructure H2020 project (2015-2018)

D-ONT: Optimized use of phenotypic databases - Ontologies for information sharing. ACI Phase (2016-2018).

IMSV, Institut de modélisation des systèmes vivants, Lidex de l'Université Paris-Saclay (2014-2016)

SeeDev, Regulations in the development of Arabidopsis thaliana seed (Challenge Lidex CDS) (2015)

OntoBiotope: Metaprogramme INRA MEM (Metagenomics of microbial ecosystems). (2012-2013).

Triphase: Semantic information system for publications in animal physiology and agricultural systems. PHASE department (2013-2014).

Quaero: Automatic multimedia content processing. Oséo. (2008-2013).

FSOV SAM Blé: Selection of wheat by genetic markers. Fond de soutien à l'obtention végétale (2010-2013).

Scientific community engagement

Réseau2Neurones INRAE Workgroup

Working group Labex DigiCosme D2K (from Data to Knowledge)

BioNLP-Open Shared Task 2019: annotated corpora and online evaluation services

BioNLP-Shared Task (2011, 2013, 2016): annotated corpora and online evaluation services

LLL, Learning Language in Logic (2005)

Members

Claire Nédellec

Research director

Head of "Bibliome"

Robert Bossy

Research engineer

Head of the "Alvis Suite"

Louise Deléger

Research Scientist

Arnaud Ferré

Research Scientist

Anne-Sophie Foussat

PhD student

Xingyu Zhu

PhD student

Marine Courtin

PhD candidate

Xinzhi Yao

Visiting researcher

Past members

Mariya Borovikova	PhD student
Myriam Dulor	Intern
Sofiane Sadat	Intern
Anfu Tang	PhD student
Antoine Toffano	Intern
Elisa Lubrini	Engineer
Clara Sauvion	Engineer
Mouhamadou Ba	PhD candidate, OpenMinTeD project
Estelle Chaix	PhD candidate, OpenMinTeD project
Philippe Bessières	Research director
Zorana Ratkovic	PhD student
Dialekti Valsamou	PhD student, IDEX IDI

Software

AlvisNLP is a corpus processing engine. It is highly configurable, supports a wide range of file formats, and provides a wide range of natural language processing, machine learning, and corpus analysis tools. AlvisNLP is well suited to information extraction and information retrieval experiments, as well as to deploying text-mining services. Funding: Alvis (EU project), Quaero (French project). Cite: Nédellec et al., 2009; Ba & Bossy, 2016.
TyDI is a collaborative tool for validating and structuring terms that originate either from an existing terminology or from a term extractor (such as BioYaTeA or TermSuite). TyDI supports collaborative term validation and term relations (synonymy, hyponymy, see-also). TyDI exports projects into standard formats (CSV, SKOS). Funding: Quaero (French project) and the INRAE-CNRS-Univ Paris-Saclay consortium. Cite: Golik et al., 2010.
re-bert is a BERT-based relation extraction architecture. re-bert supports cross-sentence relation extraction and ensemble voting.
CNorm is a shallow neural network method that outperforms the other evaluated methods on multi-class and few-shot entity normalization tasks in the biomedical and life science domains, such as Bacteria Biotope 4. Funding: OpenMinTeD (EU-INFRA). Cite: Ferré et al., 2020.
AlvisAE is a manual annotation Web application. AlvisAE supports the annotation of named entities, relations, and normalization with free properties or with an ontology. AlvisAE also features functions to manage annotation campaigns (attribution of documents, adjudication of double annotation). Funding: Quaero (French project). Cite: Papazian et al., 2012.
AlvisIR is a framework for building and deploying semantic search engines. AlvisIR semantic search engines index document terms as well as annotations extracted from the text. The user can search for terms, named entities, relations, and concepts obtained from named entity normalization. See, for example, the search engine on microorganisms. Funding: Alvis (EU project), Quaero (French project). Cite: Bossy et al., 2008.
BioYaTeA is an extension of the YaTeA term extractor that deals with prepositional attachments and adjectival participles. It extracts terms from documents in French and in English. Its distribution includes post-filtering of irrelevant terms. It is publicly available as a CPAN module. Part of this work has been funded by the European project Alvis and the French project Quaero. Cite: Golik et al., 2013.
Libraries
- obo-utils (Python): read, validate, serialize ontologies in the OBO format.
- bionlp-st-py (Python): read, validate, serialize annotations in the BioNLP-ST format.
- IEval (Java): specify and deploy services for Information Extraction tasks evaluation. Supports Recall/Precision/F-Score, Slot Error Rate, Jaccard and many other metrics.
- evaluate (Python): Information Extraction tasks evaluation library.
- JSFragments (CSS/JS): widgets for presenting text-bound annotations.
- Bibliome Utils (Java): miscellaneous utility classes and boilerplate.
- misc-utils (Python): miscellaneous utility classes and boilerplate.
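As an illustration of the kind of metrics listed for IEval above (Recall/Precision/F-Score and Jaccard), here is a minimal Python sketch. It is not the IEval or evaluate API: the function names `precision_recall_f1` and `jaccard_span` are hypothetical, predictions are assumed to be comparable by exact match, and text-bound annotations are assumed to be half-open character spans.

```python
from typing import Hashable, Iterable, Tuple


def precision_recall_f1(
    reference: Iterable[Hashable], predicted: Iterable[Hashable]
) -> Tuple[float, float, float]:
    """Exact-match precision, recall and F1 between two sets of annotations."""
    ref, pred = set(reference), set(predicted)
    true_positives = len(ref & pred)
    precision = true_positives / len(pred) if pred else 0.0
    recall = true_positives / len(ref) if ref else 0.0
    denominator = precision + recall
    f1 = 2 * precision * recall / denominator if denominator else 0.0
    return precision, recall, f1


def jaccard_span(start_a: int, end_a: int, start_b: int, end_b: int) -> float:
    """Jaccard index of two half-open character spans [start, end)."""
    overlap = max(0, min(end_a, end_b) - max(start_a, start_b))
    union = (end_a - start_a) + (end_b - start_b) - overlap
    return overlap / union if union else 0.0


# Hypothetical relation triples (microorganism, relation type, habitat).
reference = {
    ("Bacillus subtilis", "Lives_In", "soil"),
    ("Escherichia coli", "Lives_In", "gut"),
    ("Lactococcus lactis", "Lives_In", "cheese"),
}
predicted = {
    ("Bacillus subtilis", "Lives_In", "soil"),
    ("Escherichia coli", "Lives_In", "water"),
}
p, r, f = precision_recall_f1(reference, predicted)
print(f"precision={p:.2f} recall={r:.2f} f1={f:.2f}")
```

A span-overlap measure such as `jaccard_span` is one way to give partial credit when predicted entity boundaries do not match the reference exactly; how an actual evaluation service combines these measures is task-specific.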

Online Services

Online Evaluation Service: evaluate your predictions for several Information Extraction datasets from the BioNLP-ST challenge series.

Semantic search engines based on the AlvisIR technology

Omnicrobe search engine: 3M PubMed articles indexed with microorganism taxa using the NCBI Taxonomy, and with habitats and phenotypes using the OntoBiotope Ontology.
SamBlé indexes a large set of references on genetic markers and phenotypes in bread wheat with Alvis Suite technology and the Wheat Trait Ontology. FSOV SamBlé Project and OpenMinTeD.
SeeDev indexes a large set of references on the molecular mechanisms involved in seed development using Alvis Suite technology. Supported by UPSay CDS&IMSV projects and OpenMinTeD.
TriPhasIR indexes the publications of the PHASE scientific department (2010-2014) with the TriPhase termino-ontology.
AnimalIR indexes Animal Journal articles with the ATOL ontology.

Omnicrobe

Omnicrobe is an online database that integrates information on microbe habitats and phenotypes from articles and from databases, including BRC and genetic databases.

AlvisNLP API endpoints

Keyword Selection: extract terms from your corpus and rank them with TF-IDF/Okapi BM25.
Keyword Selection with Terminology: detect terms in your corpus and rank them with TF-IDF/Okapi BM25.
PESV Classifier: document filtering by relevance for Plateforme ESV (crop epidemiological monitoring).
YaTeA API: extract terms from a set of documents.
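To give an idea of what ranking terms with Okapi BM25 can look like, here is a minimal Python sketch. It does not reproduce the AlvisNLP endpoints: the function names `bm25_term_scores` and `rank_terms`, the parameter defaults `k1=1.5` and `b=0.75`, and the choice to sum each term's per-document BM25 weight over the corpus are all assumptions made for illustration.

```python
import math
from collections import Counter
from typing import Dict, List, Tuple


def bm25_term_scores(
    docs: List[List[str]], k1: float = 1.5, b: float = 0.75
) -> Dict[str, float]:
    """Sum, over all documents, the Okapi BM25 weight of each term."""
    n_docs = len(docs)
    if n_docs == 0:
        return {}
    # Average document length; fall back to 1.0 if every document is empty.
    avg_len = sum(len(doc) for doc in docs) / n_docs or 1.0
    # Number of documents in which each term occurs at least once.
    doc_freq = Counter(term for doc in docs for term in set(doc))
    scores: Counter = Counter()
    for doc in docs:
        length_norm = k1 * (1 - b + b * len(doc) / avg_len)
        for term, freq in Counter(doc).items():
            df = doc_freq[term]
            idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
            scores[term] += idf * freq * (k1 + 1) / (freq + length_norm)
    return dict(scores)


def rank_terms(docs: List[List[str]]) -> List[Tuple[str, float]]:
    """Return (term, score) pairs sorted from highest to lowest score."""
    return sorted(bm25_term_scores(docs).items(), key=lambda kv: kv[1], reverse=True)


# Hypothetical pre-tokenized corpus; a real pipeline would extract terms first.
corpus = [
    ["bacillus", "subtilis", "soil"],
    ["soil", "sample"],
    ["soil", "bacillus"],
]
for term, score in rank_terms(corpus):
    print(f"{term}\t{score:.3f}")
```

In this sketch, a term that occurs in every document (here "soil") receives a low inverse document frequency and therefore ranks below terms that are concentrated in fewer documents.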

Shared Tasks, Ontologies and Corpora

Corpora and Shared Tasks

The BB'19 Corpus is part of the Bacteria Biotope Task at BioNLP Open Shared Tasks 2019. The goal is (1) to identify microorganisms and their habitats and phenotypes; (2) to normalize them with taxa from the NCBI taxonomy or concepts from the OntoBiotope ontologies; and (3) to extract relations between microorganisms and their habitats and phenotypes. The online evaluation service is available.
References
Robert Bossy, Louise Deléger, Estelle Chaix, Mouhamadou Ba, and Claire Nédellec. 2019. Bacteria Biotope at BioNLP Open Shared Tasks 2019. In Proceedings of the 5th Workshop on BioNLP Open Shared Tasks, pages 121–131, Hong Kong, China. Association for Computational Linguistics.
The BB'16 Corpus is part of the Bacteria Biotope Task in the BioNLP Shared Task 2016. The goal is (1) to identify bacteria and their habitats, which must be categorized with concepts from the OntoBiotope ontologies, and (2) to extract relations between bacteria and their habitats from PubMed references. The online evaluation service is available.
References
Louise Deléger, Robert Bossy, Estelle Chaix, Mouhamadou Ba, Arnaud Ferré, Philippe Bessières, Claire Nédellec, Overview of the Bacteria Biotope Task at BioNLP Shared Task. In Proceedings of the BioNLP Shared Task 2016 Workshop, Association for Computational Linguistics, Berlin, Germany 2016.
The SeeDev'16 Corpus is part of the SeeDev Task of the BioNLP Shared Task 2016. The goal is to extract complex interaction events involved in the development of Arabidopsis model plant seed. The online evaluation service is available.
References
Estelle Chaix, Bertrand Dubreucq, Abdelhak Fatihi, Dialekti Valsamou, Robert Bossy, Mouhamadou Ba, Louise Deléger, Pierre Zweigenbaum, Philippe Bessières, Loïc Lepiniec, Claire Nédellec. Overview of the Regulatory Network of Plant Seed Development (SeeDev) Task at the BioNLP Shared Task. In Proceedings of the BioNLP Shared Task 2016 Workshop, Association for Computational Linguistics, Berlin, Germany 2016.
The BB'13 Corpus is part of the Bacteria Biotope Task in the BioNLP Shared Task 2013. The goal is (1) to identify bacteria and their habitats, which must be categorized with concepts from the OntoBiotope ontologies, and (2) to extract relations between bacteria and their habitats from webpages. The online evaluation service is available.
References
- Robert Bossy, Wiktoria Golik, Zorana Ratkovic, Dialekti Valsamou, Philippe Bessières, Claire Nédellec. An Overview of the Gene Regulation Network and the Bacteria Biotope Tasks in BioNLP’13. BMC Bioinformatics, Vol 16 Suppl 10, 2015.
- Bossy R., Golik W., Ratkovic Z., Bessières P., Nédellec C. BioNLP shared Task 2013 – An Overview of the Bacteria Biotope Task. In Proceedings of the BioNLP 2013 Workshop, Association for Computational Linguistics, Sofia, Bulgaria, 2013.
The BB'11 Corpus is part of the Bacteria Biotope Task in the BioNLP Shared Task 2011. The goal is (1) to identify bacteria and their habitats, which must be categorized into seven types, and (2) to extract relations between bacteria and their habitats.
References
- Robert Bossy, Julien Jourde, Alain-Pierre Manine, Philippe Veber, Erick Alphonse, Maarten van de Guchte, Philippe Bessières, Claire Nédellec. BioNLP Shared Task - The Bacteria Track. BMC Bioinformatics, (Suppl 11):S3, June 2012.
- Robert Bossy, Julien Jourde, Philippe Bessières, Maarten van de Guchte, Claire Nédellec, "BioNLP shared Tasks 2011 - Bacteria Biotope", BioNLP workshop at ACL, Portland, USA, 2011.
The GRN Corpus is part of the Gene Regulation Network in Bacteria task in the BioNLP Shared Task 2013. The goal is to extract the full regulation network of Bacillus subtilis sporulation. The online evaluation service is available.
References
- Robert Bossy, Wiktoria Golik, Zorana Ratkovic, Dialekti Valsamou, Philippe Bessières, Claire Nédellec. An Overview of the Gene Regulation Network and the Bacteria Biotope Tasks in BioNLP’13. BMC Bioinformatics, Vol 16 Suppl 10, 2015
- Bossy R., Bessières P., Nédellec C. BioNLP Shared Task 2013 – An overview of the Genic Regulation Network Task. In Proceedings of the BioNLP 2013 Workshop, Association for Computational Linguistics, Sofia, Bulgaria, 2013.
The BI Corpus is part of the Bacteria Interaction task in the BioNLP Shared Task 2011. The goal is to extract complex interaction events from PubMed references.
References
- Robert Bossy, Julien Jourde, Alain-Pierre Manine, Philippe Veber, Erick Alphonse, Maarten van de Guchte, Philippe Bessières, Claire Nédellec. BioNLP Shared Task - The Bacteria Track. BMC Bioinformatics, (Suppl 11):S3, June 2012.
- Julien Jourde, Alain-Pierre Manine, Philippe Veber, Karen Fort, Robert Bossy, Erick Alphonse, Philippe Bessières, "BioNLP Shared Task 2011 - Bacteria Gene Interactions and Renaming", BioNLP workshop at ACL, Portland, USA, 2011.
LLL Corpus (Learning Language in Logic): This is the original corpus of the LLL challenge. The goal of the LLL challenge is to evaluate the ability of participating information extraction systems to identify directed interactions and the genes/proteins that interact (the named entities must be detected). Note that the LLL corpus differs from the BioInfer LLL corpus, which is a transformation of the original LLL corpus in which the IE task is much easier: the relation arguments are given and the relation is not directed.
References
Nédellec C. "Learning Language in Logic - Genic Interaction Extraction Challenge" in Proceedings of the Learning Language in Logic (LLL05) workshop, held jointly with ICML'05. Cussens J. and Nédellec C. (eds). pp. 31-37, Bonn, August 2005.

Ontologies

WheatPhenotype Ontology
WheatPhenotype describes the phenotypes of bread wheat (Triticum aestivum) and the environmental factors that influence them. Traits include resistance, development, nutrition, and bread quality. Environmental factors include biotic and abiotic factors.
References
- Dialekti Valsamou, Robert Bossy, Marion Ranoux, Wiktoria Golik, Pierre Sourdille, Claire Nédellec. "Extraction d’information pour la sélection du blé par marqueur génétique". Proceedings of the IN-OVIVE workshop, 2nd edition, 25èmes Journées francophones d'Ingénierie des Connaissances, Clermont-Ferrand, May 14, 2014.
- Claire Nédellec, Robert Bossy, Dialekti Valsamou, Marion Ranoux, Wiktoria Golik, Pierre Sourdille. Information Extraction from Bibliography for Marker Assisted Selection in Wheat. In Proceedings of Metadata and Semantics for Agriculture, Food & Environment (AgroSEM'14), special track of the 8th Metadata and Semantics Research Conference (MTSR’14), Springer Communications in Computer and Information Science, Series Volume 478, pp. 301-313, Karlsruhe, Germany, 2014. DOI: 10.1007/978-3-319-13674-5_28
- R. Bossy and C. Nédellec. SamBlé. Moteur de recherche bibliographique sur la Sélection du blé assistée par marqueur. FSOV project Sélection du Blé Assistée par Marqueur.
OntoBiotope Ontology
OntoBiotope describes all types of microorganism habitats. The BioNLP-ST'16 version of the ontology contains more than 2,000 concepts. OntoBiotope is used to annotate the corpora of the BioNLP-ST'13, '16 and '19 Bacteria Biotope tasks and to index the PubMed Biotope semantic search engine. It is distributed by AgroPortal and LovINRA.
References
- Louise Deléger, Robert Bossy, Estelle Chaix, Mouhamadou Ba, Arnaud Ferré, Philippe Bessières, Claire Nédellec, Overview of the Bacteria Biotope Task at BioNLP Shared Task, In Proceedings of the BioNLP Shared Task 2016 Workshop, Association for Computational Linguistics, Berlin, Germany 2016.
- Robert Bossy, Wiktoria Golik, Zorana Ratkovic, Dialekti Valsamou, Philippe Bessières, Claire Nédellec. An Overview of the Gene Regulation Network and the Bacteria Biotope Tasks in BioNLP’13. BMC Bioinformatics, July 2015.
- Robert Bossy, Julien Jourde, Alain-Pierre Manine, Philippe Veber, Erick Alphonse, Maarten van de Guchte, Philippe Bessières, Claire Nédellec. BioNLP Shared Task - The Bacteria Track. BMC Bioinformatics, (Suppl 11):S3, June 2012.
- Bossy R., Golik W., Ratkovic Z., Bessières P., Nédellec C. BioNLP shared Task 2013 - An Overview of the Bacteria Biotope Task. In Proceedings of the BioNLP 2013 Workshop, Association for Computational Linguistics, pages 74-82. Sofia, Bulgaria, 2013.
- Zorana Ratkovic, Wiktoria Golik, Pierre Warnier. BioNLP 2011 Task Bacteria Biotope - The Alvis System. BMC Bioinformatics 13(Suppl 11):S3, June 2012.
- Zorana Ratkovic, Wiktoria Golik, Pierre Warnier, Philippe Veber, Claire Nédellec, "BioNLP 2011 Task Bacteria Biotope - The Alvis system", BioNLP workshop at ACL, Portland, USA, 2011.
- Robert Bossy, Julien Jourde, Philippe Bessières, Maarten van de Guchte, Claire Nédellec, "BioNLP shared Tasks 2011 - Bacteria Biotope", BioNLP workshop at ACL, Portland, USA, 2011.
ATOL Ontology
ATOL, the Animal Trait Ontology for Livestock, describes the traits of livestock animals. It is developed by the INRA scientific department Phase in collaboration with the Bibliome group (D-ONT project).
References
- P.-Y. Le Bail, J. Bugeon, O. Dameron, A. Fatet, W. Golik, J.-F. Hocquette, C. Hurtaud, I. Hue, C. Jondreville, L. Joret, M.-C. Meunier-Salaün, J. Vernet, C. Nédellec, M. Reichstadt, P. Chemineau. Un langage de référence pour le phénotypage des animaux d’élevage : l’ontologie ATOL, INRA Prod. Anim., 2014, 27 (3), 195-208.
- Hue I, Bugeon J, Dameron O, Fatet A, Hurtaud C, Joret L, Meunier-Salaün MC, Nédellec C, Reichstadt M, Vernet J, Le Bail PY. ATOL AND EOL ONTOLOGIES, STEPS TOWARDS EMBRYONIC PHENOTYPES SHARED WORLDWIDE?, 4th Mammalian Embryo Genomics meeting, Québec, October 2013.
- Salaün, M.-C., Bugeon, J., Dameron, O., Fatet, A., Hue, I., Hurtaud, C., Nédellec, C., Reichstadt, M., Vernet, J., Reecy, J., Park, C., Le Bail, P.-Y. ATOL: an ontology for livestock. In: Book of abstracts of the 63rd Annual Meeting of the European Federation of Animal Science, Bratislava (Slovakia). Wageningen (NLD): Wageningen Academic Publishers (EAAP Book of Abstracts, 18), page 299, 2012.
- Wiktoria Golik, Olivier Dameron, Jérôme Bugeon, Alice Fatet, Isabelle Hue, Catherine Hurtaud, Matthieu Reichstadt, Marie-Christine Salaün, Jean Vernet, Léa Joret, Frédéric Papazian, Claire Nédellec and Pierre-Yves Le Bail. "ATOL: the multi-species livestock trait ontology" in Proceedings of the 6th Metadata and Semantics Research Conference (MTSR 2012), pp. 289-300. Springer Verlag Communications in Computer and Information Science Series. Cadiz, Spain, November 28-30, 2012. DOI: 10.1007/978-3-642-35233-1_28
- M. C. Meunier-Salaun, J. Bugeon, O. Dameron, A. Fatet, I. Hue, C. Hurtaud, L. Joret, C. Nédellec, M. Reichstadt, J. Vernet, PY Le Bail. Les ontologies ATOL et EOL : des outils en appui aux nouveaux challenges en production porcine : phénotypage et élevage de précision, Journées de la Recherche Porcine (JRP), February 4-5, 2014.
TriPhase Ontology ("Terminology for information retrieval of the Phase department")
Objective:
The Triphase termino-ontology formally represents the research topics of the INRA scientific department PHASE, i.e. animal physiology and farming systems. Dedicated text-mining tools use TriPhase for the analysis of topics of Phase department researchers from their publications referenced in the ProdInra bibliographic database.
It has been developed by the Bibliome team and Information Science specialists from the Phase department to answer the needs for strategic analysis.
The structure of the TriPhase termino-ontology is hierarchical. It represents the entirety of the research themes of the Phase department. This set of research themes is defined in the department's scientific orientation document (2010–2015 departmental strategic plan). It contains 1,320 concepts named by 2,093 terms. The fine granularity of TriPhase is useful for the analysis of minor and transdisciplinary topics.
Use:
TriPhase has been used for the analysis of concept distribution and evolution in time in publications from 2009 to 2013. The ANStrat tool developed by the Bibliome group is used to express queries on various criteria (e.g. topics, laboratories, type of publication, co-author partnership) and to display the results. Interactive navigation of TriPhase and concept selection is used to analyze topics at various levels of detail in combination with other bibliographic criteria.
Access and Licence:
TriPhase is available on AgroPortal under the CC-BY-SA v3.0 license. Copyright Inra 2014.
References:
Agnès Girard and the network of documentalists of the Phase department, Inra Rennes, and Claire Nédellec and the Bibliome research team. Triphase : co-construction d’une ressource termino-ontologique. Arabesque, quarterly journal of the Agence bibliographique de l'enseignement supérieur, August 2016.

Corpora and ontologies are distributed under the Creative Commons CC-BY-SA license.

Mathématiques et Informatique Appliquées du Génome à l'Environnement
