Synthetic data generation via generative modeling has recently become a prominent research field in genomics, with applications ranging from functional sequence design to high-quality, privacy-preserving artificial genomes. Following a body of work on Artificial Genomes (AGs) created by various generative models trained on raw genomic input, we propose a conceptually different approach to address the scalability and complexity of generating genomic data in very high dimensions. Our method combines dimensionality reduction, achieved by Principal Component Analysis (PCA), with a generative neural network (GAN or diffusion model) that learns in this reduced space. Using this framework, we generated genomic proxy datasets for diverse human populations around the world. We compared the quality of AGs generated by our approach with AGs generated by established models and report improvements in capturing population structure and linkage disequilibrium, and in metrics related to privacy leakage. Furthermore, we developed a frugal model with orders of magnitude fewer parameters than larger models and comparable performance. For quality assessment, we also implemented a new evaluation metric, based on information theory, that measures local haplotypic diversity; it shows that genomes produced by generative models have higher local diversity than real genomes.
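To make the pipeline concrete, the following is a minimal sketch of the general idea, not the implementation used in this work. It assumes haplotypes are coded as a binary matrix (rows are haplotypes, columns are variant sites), replaces the GAN or diffusion model with a multivariate Gaussian fitted to the PC scores purely as a stand-in, and maps generated scores back to alleles with a 0.5 threshold, which is also an assumption. The function names are hypothetical. The window entropy function illustrates one possible information-theoretic measure of local haplotypic diversity (Shannon entropy of haplotype frequencies in a window) and may differ from the metric introduced here.

```python
import numpy as np


def fit_pca(X, n_components):
    """Center X and return (mean, components) from an SVD-based PCA."""
    mean = X.mean(axis=0)
    _, _, vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, vt[:n_components]


def generate_artificial_genomes(X, n_components, n_samples, rng):
    """Generate binary haplotypes by sampling in PCA space.

    ASSUMPTION: a Gaussian fitted to the PC scores stands in for the
    generative neural network (GAN or diffusion model) of the text.
    """
    mean, components = fit_pca(X, n_components)
    Z = (X - mean) @ components.T            # project real data to PC space
    mu = Z.mean(axis=0)
    cov = np.atleast_2d(np.cov(Z, rowvar=False))
    Z_new = rng.multivariate_normal(mu, cov, size=n_samples)
    X_new = Z_new @ components + mean        # map back to variant space
    # ASSUMPTION: threshold at 0.5 to recover 0/1 alleles.
    return (X_new > 0.5).astype(np.int8)


def window_haplotype_entropy(H, start, width):
    """Shannon entropy (bits) of distinct haplotypes in one window."""
    window = H[:, start:start + width]
    _, counts = np.unique(window, axis=0, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())


# Toy usage on random data (illustrative only, not real genomes).
rng = np.random.default_rng(0)
real = (rng.random((200, 50)) < 0.3).astype(np.int8)
artificial = generate_artificial_genomes(real, n_components=10,
                                         n_samples=100, rng=rng)
entropy_real = window_haplotype_entropy(real, start=0, width=8)
entropy_artificial = window_haplotype_entropy(artificial, start=0, width=8)
```

Working in PC space means the generative model sees a few components rather than every variant site, which is the source of the reduction in model size. Comparing the window entropy of real and artificial haplotypes along the genome gives a local view of diversity.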
Notwithstanding these advances, a key challenge remains: although some methods exist to evaluate the quality of artificially generated genomic data, comprehensive tools to assess its usefulness are still limited. To tackle this issue and present a promising use case, we test artificial genomes within the framework of population genetics and local ancestry inference (LAI). Our study highlights the potential of frugal generative models and of integrating synthetic data in genomics. This approach could improve fair representation across populations by overcoming data accessibility challenges, while ensuring the reliability of genomic analyses conducted on artificial data.
Mathématiques et Informatique Appliquées du Génome à l'Environnement