Phylogeny based on the sequences of marker genes is a powerful bioinformatic method for studying the evolutionary relationships between organisms. While it usually gives reliable predictions, in some cases, especially for organisms that diverged very long ago and evolve rapidly, it gives problematic results caused by computational artifacts. One famous case is the phylogenetic relationship between mitochondria and Alphaproteobacteria. Our work, recently published in Nature Ecology & Evolution (read), is one of the most systematic studies on this topic to date.
The emergence of eukaryotic cells was one of the most significant innovations in the history of life on Earth. In contrast to bacterial or archaeal cells, eukaryotic cells contain nuclei along with various cellular organelles, including mitochondria. Mitochondria are the ‘energy factories’ of eukaryotic cells, where the biochemical processes of respiration and energy production occur.
Numerous studies have shown that the ancestor of mitochondria was likely a close relative of extant Alphaproteobacteria. However, the exact position of mitochondria in the phylogenetic tree of Alphaproteobacteria is still unresolved, leaving an open question about this most significant event in evolution.
The difficulty has several causes, one being that many alphaproteobacteria are either streamlined plankton or endosymbionts. Like mitochondria, these lineages are characterized by fast evolution and by genomic compositions biased towards adenine and thymine or towards specific amino acids. A typical outcome is that when these fast-evolving taxa and mitochondria are present in the same tree, they group together falsely, not necessarily because they are close relatives but as a result of long-branch attraction or compositional bias: two common artifacts of phylogenetic computation.
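As a minimal illustration of what compositional bias means in practice, the sketch below computes the A+T fraction of a few nucleotide sequences. The function name `at_fraction` and the toy sequences are made up for this example and are not taken from our paper; real analyses work on full genomes or concatenated marker alignments.

```python
def at_fraction(seq):
    """Return the fraction of A and T among the unambiguous bases of a sequence.

    Gaps and ambiguous characters (e.g. '-', 'N') are ignored.
    """
    bases = [b for b in seq.upper() if b in "ACGT"]
    if not bases:
        return 0.0
    return sum(b in "AT" for b in bases) / len(bases)


# Hypothetical toy sequences, for illustration only.
toy_sequences = {
    "at_rich_taxon": "AATTAATTGC",
    "gc_rich_taxon": "GCGCGCAT--",
}

for name, seq in toy_sequences.items():
    print(f"{name}: A+T fraction = {at_fraction(seq):.2f}")
```

Taxa whose composition is skewed in the same direction can be drawn together in a tree regardless of their true relationships, which is why such differences deserve attention before tree reconstruction.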
Scientists have attempted to alleviate the impact of these systematic errors on the phylogeny of mitochondria and Alphaproteobacteria. A remarkable study was published in Nature in 2018 by a European group (read). The authors applied several data treatment strategies, including site trimming, which removes the most compositionally heterogeneous sites from a sequence alignment before phylogenetic tree reconstruction. This approach efficiently improved model fit in Bayesian inference (CAT+GTR model setup). However, they reported a very odd tree topology that differs from all previous findings: mitochondria branch before the divergence of all alphaproteobacteria.
Surely a breathtaking discovery, but is it real, or is it actually an artifact caused by data pretreatment? (See Gawryluk's comment in Current Biology (read)).
When we took a close look at that paper, we found that while the authors emphasized their achievements in improving model fit, a critical question was not answered: did site trimming drop historical signal between mitochondria and Alphaproteobacteria and, if so, how did that affect the tree topology?
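To make the question concrete, the general idea of trimming compositionally heterogeneous sites can be sketched as follows. This is a simplified illustration under our own assumptions, not the implementation used in either paper: heterogeneity is measured as a chi-square distance between each taxon's composition and the pooled composition, each site is scored once by how much its removal alone reduces that value, and the top-scoring fraction of sites is dropped. All function names and the toy alignment are hypothetical.

```python
from collections import Counter


def chi2_heterogeneity(alignment):
    """Sum, over taxa, of the chi-square distance between each taxon's
    character composition and the pooled composition (gaps ignored)."""
    pooled = Counter()
    per_taxon = []
    for seq in alignment.values():
        counts = Counter(c for c in seq if c != "-")
        per_taxon.append(counts)
        pooled.update(counts)
    total = sum(pooled.values())
    score = 0.0
    for counts in per_taxon:
        n = sum(counts.values())
        if n == 0:
            continue  # a taxon with only gaps contributes nothing
        for state, pooled_count in pooled.items():
            expected = n * pooled_count / total
            score += (counts[state] - expected) ** 2 / expected
    return score


def drop_sites(alignment, sites):
    """Return a copy of the alignment without the given column indices."""
    drop = set(sites)
    return {
        name: "".join(c for i, c in enumerate(seq) if i not in drop)
        for name, seq in alignment.items()
    }


def trim_heterogeneous_sites(alignment, fraction):
    """Remove the given fraction of sites whose individual removal most
    reduces compositional heterogeneity (simplified, single-pass ranking)."""
    length = len(next(iter(alignment.values())))
    baseline = chi2_heterogeneity(alignment)
    gain = {
        i: baseline - chi2_heterogeneity(drop_sites(alignment, [i]))
        for i in range(length)
    }
    n_drop = int(length * fraction)
    worst = sorted(gain, key=gain.get, reverse=True)[:n_drop]
    return drop_sites(alignment, worst), sorted(worst)


# Hypothetical toy alignment: only the last column differs between taxa.
toy = {
    "taxon_a": "ACGTA",
    "taxon_b": "ACGTA",
    "taxon_c": "ACGTT",
    "taxon_d": "ACGTT",
}
trimmed, removed = trim_heterogeneous_sites(toy, fraction=0.25)
print(removed, trimmed)
```

In this toy case the only column removed is also the only column that distinguishes the two pairs of taxa, which illustrates the concern: sites that look compositionally heterogeneous may be the same sites that carry historical signal.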
In our paper, we took two approaches to address this issue. Firstly, in addition to the site-trimming algorithms used by Martijn et al., we applied several others and found that tree topology (mitochondria branching within or beside Alphaproteobacteria) is not causally related to the level of model fit. Factors including the algorithm and the specific sites trimmed clearly affected it.
Secondly, if the loss of specific informative sites does cause the loss of essential historical information between mitochondria and Alphaproteobacteria, then the ‘Alphaproteobacteria-sister’ topology (mitochondria branching beside Alphaproteobacteria) is likely a result of long-branch attraction of mitochondria towards the distant outgroup, which is composed of Beta- and Gammaproteobacteria. Martijn et al. did discuss this possibility in their paper. However, the evidence they provided was unconvincing (see Supplementary Note 3 in our paper). The catch is that, if one presupposes that mitochondria branch outside Alphaproteobacteria, it becomes impossible to use a short-branched outgroup to relieve the possible long-branch attraction between mitochondria and the outgroup, because no close bacterial relatives of Alphaproteobacteria are currently known.
Fortunately, by developing a new systematic taxon sampling strategy, we were able to infer that mitochondria branch within a subclade of Alphaproteobacteria, namely Alpha IIb, which contains Rickettsiales and currently unclassified marine lineages. This subclade is closely adjacent to the subclade Alpha IIa. Therefore, by using Alpha IIa taxa as the outgroup, we were able to test how robust the phylogenetic connection between mitochondria and Alphaproteobacteria is to site-trimming treatments, without the potent attraction exerted by a distant outgroup. Figure 1d in our paper clearly shows that the topology in which mitochondria branch within Alphaproteobacteria remained stable even when 60% of sites were removed.
Our results lead to the conclusion that the ‘Alphaproteobacteria-sister’ topology obtained by Martijn et al. was an artifact caused by unjustified data pretreatment. The verification approaches shown in our paper offer a useful template for future phylogenetic research on challenging evolutionary questions, such as those involving very ancient divergence events and fast-evolving lineages.
Generally speaking, data pretreatment in phylogenetic analyses is acceptable only if it is necessary and is applied under strict scrutiny.