Behind the Paper

Predicting sequence evolution through machine learning

The ever-expanding genome sequencing surveys have unraveled unfathomable degrees of sequence diversity within species. Whether in humans, plants or microbes, genomes within a species are far more diverse than expected. How and where such sequence changes occur remains hard to track down though.

Published in Bioengineering & Biotechnology

Jun 10, 2021

Daniel Croll

Professor, University of Neuchatel

Predicting sequence evolution through machine learning

Like Be the first to like this

Explore the Research

Genome sequencing surveys have unraveled previously unfathomable degrees of sequence diversity. Beyond point mutations (or SNPs), massive amounts of sequence rearrangements were identified thanks to innovations in sequencing technology and assembly algorithms. Long-read genome assemblies make it now possible to assemble most prokaryote and smaller eukaryote genomes with ease. We are just at the cusp of getting fully resolved human chromosomes.

Zymoseptoria tritici on wheat leaves

Disease symptoms caused by the pathogen on wheat leaves. Public domain picture by Maccheek (Wikipedia)

With the progress in getting complete genomes, Thomas Badet, a junior group leader at the University of Neuchâtel and I started wondering how we could capture the full extent of chromosomal rearrangements segregating within a species. In our paper, just published in Nature Communications, we mostly focus on a fungal pathogen called Zymoseptoria tritici,which can severely affect wheat production worldwide. We started with defining the pangenome of the species based on representative strains across the world. Comparing genomes, we found an astonishing number of chromosomal rearrangements many of which were affecting genes. As a consequence of the rearrangements, strains carried substantially different sets of genes. We do know now that many of these genes carry out important functions to resist pesticides or cause disease on wheat. Hence, understanding the nature of these gene presence-absence variation is important.

Hypervariable chromosomes

Extensive variation in chromosome structure across the species range.

Given the massive scale of sequence rearrangements, we were wondering whether we could spot patterns where such rearrangements occur along chromosomes. Obviously, repetitive sequences were more likely associated with some form of rearrangement. But what about sequence composition? Epigenetic modifications? Recombination rates? Hence, we started putting together a set of 30 distinct features characterizing individual chromosomal segments. Using machine learning, we train models that were capable of predicting where sequence rearrangements happened. For this, we used a subset of the 19 completely assembled genomes of the species, trained the model and then tested whether we could predict chromosomal rearrangements not included in the training. To our surprise, this worked fairly well.

Machine learning of structural variation

Machine learning and prediction of chromosomal rearrangements.

To take a step further, we wanted to put our trained model to the test for rearrangements happening in future generations. We established controlled crosses of parents with fully assembled genomes. We have also assembled complete genomes of haploid progeny over multiple generations to track any spontaneous sequence rearrangement. Finally, we applied our previously trained model to predict the appearance of these new sequence rearrangements. Our model performed very well in particular for insertion/deletion variants but was capable of predicting all kinds of rearrangements with at least with some success. To further validate our machine learning approach, we have analyzed a set of eight completely assembled Arabidopsis thaliana genomes. Even for these substantially larger genomes, machine learning showed significant power to predict the location of sequence arrangements in genomes not included in the training.

Progeny analyses

Four-generation haploid pedigree to evaluate the predictive power of the machine learning approach.

Could machine learning also help to predict variation among human genomes? The idea to predict fragile sites prone to rearrangements has received considerable attention in cancer research or heritable diseases. If genomes are of sufficient quality, machine learning could help predict the mutational trajectory of cancer lineages and possibly identify particular weaknesses for treatment. Predicting sequence rearrangements could also help make antimicrobials more durable or genetic modifications in breeding more robust. Much road needs to be travelled still but we believe that machine learning of sequence features will find many future implementations.

Reference

Badet T, Fouché S, Hartmann FE, Zala M, Croll D. 2021. Machine-learning predicts genomic determinants of meiosis-driven structural variation in a eukaryotic pathogen. Nature Communications. DOI: 10.1038/s41467-021-23862-x

Daniel Croll

Professor, University of Neuchatel

Daniel joined the University of Neuchâtel as an Assistant Professor in 2017 where he leads the evolutionary genetics lab. The lab's main focus is host-pathogen interactions but Daniel is happy to dabble in many other population genetics or genomics topics.

Please sign in or register for FREE

If you are a registered user on Research Communities by Springer Nature, please sign in

Follow the Topic

Biotechnology

Life Sciences > Biological Sciences > Biotechnology

Nature Communications

Nature Communications

An open access, multidisciplinary journal dedicated to publishing high-quality research in all areas of the biological, health, physical, chemical and Earth sciences.

More about the journal

Related Collections

With Collections, you can get published faster and increase your visibility.

Women's Health

A selection of recent articles that highlight issues relevant to the treatment of neurological and psychiatric disorders in women.

Publishing Model: Hybrid

Deadline: Ongoing

Explore this Collection

Tumor Microenvironment Crosstalk and Therapeutic Implications

With this cross-journal Collection, the editors at Nature Immunology, Nature Communications, Communications Medicine and Scientific Reports invite manuscripts that highlight cutting-edge research on TME crosstalk and its therapeutic implications. Topics of interest include immune modulation and checkpoint pathways, cancer-associated fibroblasts and stromal remodeling, angiogenesis and vascular normalization, metabolic reprogramming within the TME, and the role of microbiota in tumor-immune dynamics. We also welcome studies on novel therapeutic approaches that exploit TME vulnerabilities to advance cancer treatment.

Publishing Model: Hybrid

Deadline: Nov 02, 2026

Explore this Collection

Predicting sequence evolution through machine learning

Behind the Paper

Of kings and Alpine ibex: the amazing resurrection of a species from near-extinction

Cookies

We use cookies to ensure the functionality of our website, to personalize content and advertising, to provide social media features, and to analyze our traffic. If you allow us to do so, we also inform our social media, advertising and analysis partners about your use of our website. You can decide for yourself which categories you want to deny or allow. Please note that based on your settings not all functionalities of the site are available.

Further information can be found in our privacy policy.

Predicting sequence evolution through machine learning

Share this post

Share with...

...or copy the link