Mathématiques et Informatique Appliquées
du Génome à l'Environnement


Databases

A database design and development activity has grown within the unit. Databases are an essential part of bioinformatics for structuring and exploiting the mass of data produced by genomics programs. In parallel, the dissemination of the methods resulting from the MIG unit's research to the community of biologists, bioanalysts and bioinformaticians relies largely on the software that implements these methods.

Our long-term objective is a set of databases that is coherent in its design and interfaces and that will form the basis of the unit's future information system. The conceptual choices we have made are:

  • the use of the relational model for the design and implementation of physical models,
  • the physical centralization of data on a national SUN/Unix server,
  • the development of user-friendly Web interfaces to access all databases.

The technical constraints we have set ourselves are, on the one hand, the use of free software, so that the databases and their interfaces can be distributed, and, on the other hand, the use of standard software and modules, so that they can easily be ported to different platforms. All of the unit's databases are implemented on a PostgreSQL server (an object-relational database management system). The translators (parsers) and web interfaces were written in Perl using standard modules: DBI (DataBase independent Interface) for connecting to the database server, CGI (Common Gateway Interface) for the web interfaces, and BioPerl (a Perl toolkit for bioinformatics and genomics) for some of the translators.
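
The translator-plus-loader pattern described above can be sketched as follows. This is only an illustration, written in Python rather than the unit's Perl: the standard sqlite3 module stands in for a PostgreSQL connection opened through DBI, and the "sequence" table, its columns, and the parse_fasta, split_header and load_sequences helpers are hypothetical names chosen for the example, not taken from the unit's code.

```python
import sqlite3


def parse_fasta(text):
    """Yield (identifier, description, residues) for each FASTA record in text."""
    header, chunks = None, []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if header is not None:
                yield split_header(header) + ("".join(chunks),)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield split_header(header) + ("".join(chunks),)


def split_header(header):
    """Split 'id free text' into (id, description)."""
    identifier, _, description = header.partition(" ")
    return (identifier, description.strip())


def load_sequences(conn, records):
    """Insert parsed records into a hypothetical 'sequence' table; return the row count."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS sequence ("
        "id TEXT PRIMARY KEY, description TEXT, residues TEXT NOT NULL)"
    )
    with conn:  # commit on success, roll back on error
        conn.executemany("INSERT INTO sequence VALUES (?, ?, ?)", records)
    return conn.execute("SELECT COUNT(*) FROM sequence").fetchone()[0]


# Made-up records, for illustration only.
EXAMPLE = """>seq1 hypothetical protein
MKTAYI
AKQR
>seq2
MSE
"""

conn = sqlite3.connect(":memory:")  # stand-in for the PostgreSQL server
loaded = load_sequences(conn, parse_fasta(EXAMPLE))
print(loaded, "sequences loaded")
```

Keeping the parser independent of the database connection mirrors the portability constraint stated above: only the connection line would change when targeting a different server.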

We have already built the following databases:

  • FUNYBASE (FUNgal phYlogenomic dataBASE) Database dedicated to the analysis and classification of homologous proteins extracted from complete fungal genomes. This resource offers two types of results: on the one hand, all the orthologous and paralogous gene families detected in 31 complete fungal genomes and, on the other hand, a subset of 246 families of unique orthologous genes from 21 complete genomes, for which in-depth analyses are available: protein evolution model, average percentage identity of the aligned proteins, number of variable sites, and phylogenetic tree.
  • The IGO portal integrates the following databases (new features of version 2):
    • MICADO (MICrobial Advanced Database Organization) Relational database dedicated to microbial genomes. In particular, it integrates all the primary microbial sequences from GenBank, the complete microbial genomes reannotated in the Emglib bank, and functional analysis data for the model bacterium B. subtilis.
    • MOSAIC (Comparative Microbial Genome Analysis) Relational database for comparing bacterial genomes of the same species and defining their backbone and loops.
    • PAREO (PAthway RElational Organization) Relational database integrating knowledge on metabolic pathways from the Japanese KEGG database.
    • PROSE (PROtein SEquences) Relational database that manages protein sequences from SwissProt and TrEMBL. A user-friendly web interface supports fine-grained queries on the database and can even execute a custom SQL query directly on the database server (an account must be requested from the MIG unit). The relational model of the database is provided in the documentation section.
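
As an illustration of the kind of custom SQL query mentioned for PROSE, here is a minimal sketch in Python. It makes several assumptions: the "protein" table and its columns are hypothetical (the actual relational model is the one given in the PROSE documentation), the rows are made up, sqlite3 stands in for the PostgreSQL server, and run_select is a helper invented for the example.

```python
import sqlite3


def run_select(conn, sql, params=()):
    """Run a single SELECT statement and return all rows; reject anything else."""
    statement = sql.strip().rstrip(";")
    # Conservative check: also rejects SELECTs that contain ';' inside a literal.
    if not statement.lower().startswith("select") or ";" in statement:
        raise ValueError("only a single SELECT statement is accepted")
    return conn.execute(statement, params).fetchall()


conn = sqlite3.connect(":memory:")  # stand-in for the PostgreSQL server
conn.execute(
    "CREATE TABLE protein ("  # hypothetical schema
    "accession TEXT PRIMARY KEY, source TEXT, organism TEXT, length INTEGER)"
)
conn.executemany(
    "INSERT INTO protein VALUES (?, ?, ?, ?)",
    [  # made-up rows, for illustration only
        ("X0001", "SwissProt", "Bacillus subtilis", 120),
        ("X0002", "TrEMBL", "Bacillus subtilis", 340),
        ("X0003", "TrEMBL", "Escherichia coli", 95),
    ],
)

rows = run_select(
    conn,
    "SELECT accession, length FROM protein "
    "WHERE organism = ? AND length > ? ORDER BY accession",
    ("Bacillus subtilis", 100),
)
print(rows)
```

The prefix check is only a convenience filter. On a real server, restricting a user account to read-only privileges is what actually protects the data.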

Other database projects are under development in the unit. The most important is the creation of a relational database managing the 3D structures of proteins extracted from the PDB database. Beyond the information-system development described above, data on 3D protein structures play a central role in the analysis of sequence-structure relationships in proteins, a subject of interest in the unit.
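
To give an idea of what loading PDB structures into a relational database involves, the sketch below parses the fixed-column ATOM records of the PDB file format and stores them as rows. It is a hedged illustration in Python, not the unit's design: the "atom" table, its columns, and the parse_atom_line and store_atoms helpers are hypothetical, sqlite3 stands in for PostgreSQL, and the example record uses made-up coordinates.

```python
import sqlite3


def parse_atom_line(line):
    """Parse one ATOM/HETATM record using the PDB fixed-column layout."""
    if not line.startswith(("ATOM  ", "HETATM")):
        return None
    return {
        "serial": int(line[6:11]),            # columns 7-11
        "atom_name": line[12:16].strip(),     # columns 13-16
        "residue_name": line[17:20].strip(),  # columns 18-20
        "chain_id": line[21],                 # column 22
        "residue_seq": int(line[22:26]),      # columns 23-26
        "x": float(line[30:38]),              # columns 31-38
        "y": float(line[38:46]),              # columns 39-46
        "z": float(line[46:54]),              # columns 47-54
    }


def store_atoms(conn, lines):
    """Insert every ATOM/HETATM record into a hypothetical 'atom' table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS atom ("
        "serial INTEGER, atom_name TEXT, residue_name TEXT, chain_id TEXT, "
        "residue_seq INTEGER, x REAL, y REAL, z REAL)"
    )
    rows = [row for row in map(parse_atom_line, lines) if row is not None]
    with conn:  # commit on success, roll back on error
        conn.executemany(
            "INSERT INTO atom VALUES (:serial, :atom_name, :residue_name, "
            ":chain_id, :residue_seq, :x, :y, :z)",
            rows,
        )
    return len(rows)


# Made-up record assembled field by field so the column widths stay explicit.
EXAMPLE_LINE = "".join([
    "ATOM".ljust(6),    # columns 1-6: record name
    "1".rjust(5),       # columns 7-11: atom serial number
    " ",                # column 12: unused
    " N".ljust(4),      # columns 13-16: atom name
    " ",                # column 17: alternate location indicator
    "MET",              # columns 18-20: residue name
    " ",                # column 21: unused
    "A",                # column 22: chain identifier
    "1".rjust(4),       # columns 23-26: residue sequence number
    " " * 4,            # columns 27-30: insertion code and unused columns
    "1.000".rjust(8),   # columns 31-38: x coordinate
    "2.500".rjust(8),   # columns 39-46: y coordinate
    "-3.250".rjust(8),  # columns 47-54: z coordinate
    "1.00".rjust(6),    # columns 55-60: occupancy
    "20.00".rjust(6),   # columns 61-66: temperature factor
    " " * 10,           # columns 67-76: unused
    "N".rjust(2),       # columns 77-78: element symbol
])

conn = sqlite3.connect(":memory:")  # stand-in for the PostgreSQL server
stored = store_atoms(conn, ["HEADER    EXAMPLE", EXAMPLE_LINE])
print(stored, "atom record stored")
```

Slicing by column rather than splitting on whitespace matters here because adjacent PDB fields can touch when values fill their full width.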