A machine-learning approach to phylogenetic tree inference

Artificial intelligence and molecular evolution are two fields that have barely interacted. As a proof of concept, we built a machine-learning-based framework that substantially speeds up tree-search algorithms without compromising accuracy.
Published in Ecology & Evolution

Reconstructing a phylogenetic tree for a group of organisms based on molecular sequence data is a fundamental challenge in evolutionary research. Even when only a few dozen species are analyzed, billions of alternative phylogenetic trees could potentially describe the evolutionary patterns, which makes the search for the tree that best describes the data algorithmically challenging.
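
To give a sense of scale, the number of distinct unrooted, fully bifurcating tree topologies for n labeled sequences is the double factorial (2n-5)!!. This is a standard combinatorial result, not something specific to our study. The short Python sketch below computes it; the function name is ours, chosen for illustration.

```python
def num_unrooted_topologies(n_taxa: int) -> int:
    """Count unrooted binary tree topologies on n_taxa labeled leaves.

    Uses the standard formula (2n - 5)!! = 1 * 3 * 5 * ... * (2n - 5).
    """
    if n_taxa < 3:
        raise ValueError("need at least 3 taxa")
    count = 1
    # Multiply the odd numbers 3, 5, ..., 2n - 5 (empty product for n = 3).
    for k in range(3, 2 * n_taxa - 4, 2):
        count *= k
    return count


if __name__ == "__main__":
    for n in (4, 5, 10, 13, 20):
        print(n, num_unrooted_topologies(n))
```

The count first exceeds one billion at 13 sequences and keeps growing faster than any exponential, which is why exhaustive enumeration is out of the question.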

A naïve search algorithm would start with a candidate tree and generate all of its immediate neighbors by pruning each branch in turn and regrafting it onto another branch of the tree. The highest-scoring neighbor then becomes the new candidate tree, and the procedure is repeated until convergence, that is, until no neighbor scores higher than the current tree. Because the number of possible tree topologies grows super-exponentially with the number of sequences, existing heuristic strategies must trade accuracy against running time: a feasible running time comes at the cost of accuracy.
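
The naïve search described above is a greedy hill climb, and its control flow can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions: `neighbors` and `score` are hypothetical callables standing in for the prune-and-regraft neighbor generation and the fit-to-data computation, neither of which is implemented here, and the toy example at the bottom uses integers instead of trees.

```python
from typing import Callable, Iterable, TypeVar

T = TypeVar("T")


def hill_climb(
    start: T,
    neighbors: Callable[[T], Iterable[T]],
    score: Callable[[T], float],
) -> T:
    """Greedy search: move to the best-scoring neighbor until none improves."""
    current, current_score = start, score(start)
    while True:
        improved = False
        best, best_score = current, current_score
        # Score every immediate neighbor; this is the expensive step.
        for candidate in neighbors(current):
            s = score(candidate)
            if s > best_score:
                best, best_score, improved = candidate, s, True
        if not improved:
            # Convergence: no neighbor scores higher than the current state.
            return current
        current, current_score = best, best_score


if __name__ == "__main__":
    # Toy stand-in for a tree space: integers, with the best "tree" at 7.
    result = hill_climb(
        start=0,
        neighbors=lambda x: (x - 1, x + 1),
        score=lambda x: -(x - 7) ** 2,
    )
    print(result)
```

In a real tree search, every call to `score` is a costly computation, and each step scores all neighbors of the current tree, which is where the running time goes.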

The challenge: speed up heuristic searches without compromising accuracy.
Our solution: harness machine learning to boost heuristic tree searches.
How: we trained a machine-learning algorithm to rank the candidate trees according to their propensity to improve the fit to the data, without actually computing that fit.

To this end, we generated a starting tree and all its immediate neighboring trees for each of the 4,200 empirical datasets we collected (resulting in tens of millions of training samples). For each possible move to a neighboring tree, we extracted 19 features that represent that move, and for each potential neighboring tree, we computed the increase or decrease in the fit to the data. At this point, we were ready to train a machine-learning algorithm that predicts the change in fit from these features. Our trained random forest regression model rapidly predicts which candidate trees are the most promising and which can be discarded. This way, we avoid the computationally intensive evaluation of many trees.
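
The idea of ranking first and evaluating later can be sketched as follows. This is a simplified illustration, not our actual implementation: `predict_gain` is a hypothetical stand-in for the trained model (any cheap function from a move's features to a predicted change in fit), `true_gain` stands in for the expensive fit computation, and the two-feature toy data are invented for the example.

```python
from typing import Callable, Sequence, Tuple

Move = Tuple[float, ...]  # a move represented by its feature vector


def pick_move_with_ranking(
    moves: Sequence[Move],
    predict_gain: Callable[[Move], float],
    true_gain: Callable[[Move], float],
    top_k: int,
) -> Tuple[Move, float, int]:
    """Rank moves by predicted gain, then evaluate only the top_k for real.

    Returns the best evaluated move, its true gain, and the number of
    expensive evaluations that were performed.
    """
    if not moves or top_k < 1:
        raise ValueError("need at least one move and top_k >= 1")
    # Cheap step: sort all candidate moves by the model's prediction.
    ranked = sorted(moves, key=predict_gain, reverse=True)
    shortlist = ranked[:top_k]
    # Expensive step: compute the real change in fit for the shortlist only.
    evaluated = [(true_gain(m), m) for m in shortlist]
    best_gain, best_move = max(evaluated, key=lambda pair: pair[0])
    return best_move, best_gain, len(shortlist)


if __name__ == "__main__":
    # Invented toy data: five moves, each described by two features.
    moves = [(0.1, 0.5), (0.9, 0.2), (0.4, 0.4), (0.8, 0.9), (0.3, 0.1)]
    expensive = lambda m: 2.0 * m[0] - m[1]        # stand-in for the real fit change
    predictor = lambda m: 1.8 * m[0] - 0.8 * m[1]  # imperfect stand-in for the model
    best, gain, n_evaluated = pick_move_with_ranking(moves, predictor, expensive, top_k=2)
    print(best, gain, n_evaluated)
```

The predictor does not need to be exact. It only needs to place the truly best moves near the top of the ranking, so that the expensive evaluation can be limited to a small shortlist.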

Take-home message

There are patterns in the data that can be learned using a machine-learning model. More generally, we provided a proof of concept that learning approaches can greatly improve our ability to accurately and efficiently reconstruct phylogenetic trees.

---------

Stay tuned! We are already progressing with this research direction to provide improved AI-based algorithms for phylogeny reconstruction.
