Mathématiques et Informatique Appliquées
du Génome à l'Environnement

EcoMiner

Funding agency
French State
Project title
Miner of knowledge contained in ecology and environmental literature
Call for proposals
IA Cluster
Coordinator
A. Ferré (MaIAGE, Jouy-en-Josas)
MaIAGE participants
L. Deléger, S. Dérozier, A. Ferré
Start year - End year
2024
Project end date
Abstract
Academic and private actors working on environmental topics (e.g., climatology, agroecology, sustainable food, biodiversity) need more comprehensive databases for their activities (e.g., simulation parameters, biodiversity indicators). Much of the relevant information already exists but is scattered across textual documents, particularly scientific publications. However, the growing volume of literature makes exhaustive extraction by human analysts impossible. A finalized information extraction solution, i.e. a set of natural language processing (NLP) methods capable of automatically building databases from large quantities of text, could address this problem.

Creating custom methods for a given formalized information extraction need once required significant effort over an extended period. These methods often relied on hand-built thematic lexicons and matching rules, and were not adaptable to other applications. Over the last decade in particular, deep neural network approaches have reduced this effort by enabling existing methods to be reused and adapted, provided that a sufficient number of manually annotated examples is available to train the models. However, producing these examples also demands substantial effort, especially in specialized domains where annotators must be experts in the targeted field. Since 2019, the latent knowledge contained in recent large language models (e.g. GPT-3 in 2020 and its well-known conversational variant, ChatGPT) seems to increasingly reduce the need for manually annotated examples [Brown et al., 2020]. Nevertheless, the current literature on information extraction largely focuses on how methods can further improve their predictions, rather than on ways to speed up or reduce the cost of producing a finalized solution.

We propose a strategy, based on automatic training data augmentation and active learning, to optimize the overall production of different solutions for various actors in the environmental domain through a single platform. We suggest developing a platform for these actors that enables:
- Automating the collection and formalization of non-experts' AI needs through a user-friendly interface and a testing environment;
- Executing a robust information extraction pipeline accordingly, producing a database as well as a query and visualization interface (e.g. in the style of Omnicrobe [Dérozier et al., 2023]);
- Taking into account user feedback for improving predictive quality [Nguyen et al., 2016].
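
As an illustration of the active learning component only, the sketch below shows one common selection rule, entropy-based uncertainty sampling: the items on which a model is least certain are sent to the user for feedback first. It is a minimal sketch under stated assumptions, not the project's method; the function names, the sentence identifiers, and the probability values are all hypothetical.

```python
import math
from typing import Dict, List, Tuple


def entropy(probs: List[float]) -> float:
    """Shannon entropy (in nats) of a probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def select_for_annotation(
    predictions: Dict[str, List[float]], budget: int
) -> List[Tuple[str, float]]:
    """Return the `budget` items with the most uncertain predictions.

    `predictions` maps an item identifier to the model's probabilities
    over candidate labels (hypothetical data layout).
    """
    scored = [(item_id, entropy(probs)) for item_id, probs in predictions.items()]
    # Highest entropy first: these are the items the model is least sure about.
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:budget]


# Hypothetical model outputs: sentence id -> probabilities over three labels.
predictions = {
    "sent-1": [0.98, 0.01, 0.01],  # confident prediction
    "sent-2": [0.40, 0.35, 0.25],  # very uncertain prediction
    "sent-3": [0.70, 0.20, 0.10],  # moderately uncertain prediction
}

queue = select_for_annotation(predictions, budget=2)
selected_ids = [item_id for item_id, _ in queue]
print(selected_ids)
```

With these made-up values, the two sentences with the flattest probability distributions are queued for user review, while the confidently predicted one is left out. Other selection criteria (margin-based, committee-based) would fit the same interface.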

While we aim to create an adaptable extraction method for any domain, the choice of a specific domain such as the environment will allow for the upfront selection of relevant corpora, as well as external resources that can enhance results (e.g., ontologies, databases, etc.). To demonstrate the platform's robustness to various needs and its general relevance to the environmental domain, other actors will be included during the project.

References
Brown T et al. Language Models are Few-Shot Learners. Advances in Neural Information Processing Systems. 2020.
Dérozier S et al. Omnicrobe, an open-access database of microbial habitats and phenotypes using a comprehensive text mining and data fusion approach. PLOS ONE. 2023.
Nguyen H, Patrick J. Text Mining in Clinical Domain: Dealing with Noise. ACM SIGKDD. 2016.

Call for proposals (AAP): https://anr.fr/fr/france-2030/france2030/call/ia-cluster-poles-de-recherche-et-de-formation-de-rang-mondial-en-intelligence-artificielle-app/
Submission year
2023