Soon after I started working with genome-scale data, as a postdoc with Axel Meyer at the University of Konstanz (Germany), I realized that people used quite different approaches to generate and analyze phylogenomic datasets. Clearly, handling thousands of sequences from many species requires bioinformatic solutions and automatic (or semi-automatic) workflows. However, decisions made in such pipelines can greatly impact the final phylogenomic results, and there is a risk of bioinformatic workflows becoming a huge black box. The effects of contamination, paralogy, long-branch attraction, or model misspecification (systematic error) do not vanish when using large datasets; instead, they can be magnified. On top of that, complex evolutionary models that try to correct systematic errors can be computationally prohibitive for genome-scale data. Although these challenges were commonly recognized, no widely accepted “golden rules” or standards existed for data selection, dataset curation and tree inference.
This project evolved as a joint effort by several research groups aiming to establish a solid evolutionary framework for jawed vertebrates. While some of us were concerned with flaws in phylogenomic practice and the lack of “quality” standards, others worried about confirming or refuting key relationships in the jawed vertebrate tree. And I am passionate about both! The recovery of unorthodox relationships by phylogenomic studies has been a common theme since the beginning of this discipline, and jawed vertebrates are no exception. Therefore, we saw the need to reconstruct a robust evolutionary framework for jawed vertebrates using genomic data and high methodological standards that would ensure the reliability of our inferences. In addition, jawed vertebrates were a perfect system to benchmark a new phylogenomic pipeline, given the existence of several undisputed nodes to be used as controls as well as controversial relationships that required further scrutiny.
The long-nosed horned frog (Megophrys nasuta) is used to improve the jawed vertebrate tree. Picture: Quentin Martinez.
Existing genomic data were taxonomically biased and notably absent for species of high evolutionary relevance, which often represent the earliest offshoots of major groups. In many cases, these species are bizarre, rare, relict, or have extreme features such as very large genomes, or all of the above. We generated new genomic data for several early-branching ray-finned fish such as the bichir, sturgeon, bowfin and gar, lobe-finned lungfishes, the blind cave and giant Chinese salamanders, as well as several frogs and caecilians (limbless amphibians). Access to these species is not always easy, but we could obtain fresh material for transcriptomics thanks to various collaborators including Miguel Vences from the Braunschweig University of Technology (Germany). We adopted transcriptomics as a cost-effective sequencing technique and showed that even shallow sequencing (1.5-10 Gbp of total data per species) is enough to obtain thousands of suitable genes for phylogenomics.
Probably the most challenging and time-consuming part of the project was the assembly and refinement of a new genome-scale dataset, although we were lucky enough to have Hervé Philippe, a world-leading phylogenomics expert, on our team. I visited Hervé at the CNRS Center for Biodiversity Theory and Modelling in Moulis (France) and we worked elbow to elbow (literally, at the same desk!) for about three intense weeks. It was a great time learning lots of good phylogenomics, enjoying beautiful late summer landscapes in the Pyrenees (surrounded by more cows than people) and tasting French food and wine. Perhaps the most important lesson I learned was that one needs to carefully check every intermediate step in a workflow, which often implies visualizing dozens of gene alignments and trees. Careful checks also helped to spot new challenges we did not foresee at first, such as transcriptome cross-contaminations, which seem to be the rule rather than the exception. Another important lesson is to never underestimate computation time. We exploited all computational resources available to us, including taking advantage of weekends to use 6000 CPUs simultaneously.
The blind cave salamander (Proteus anguinus) is another key species used in our study. Picture: Patrick Cabrol © CNRS.
In this study, we present a robust time-calibrated tree of jawed vertebrates estimated from the largest and most comprehensive dataset analyzed to date. The tree is almost fully resolved and highly supported, and thus it can justifiably be considered the most solid reference framework for understanding jawed vertebrate evolution. Our divergence time estimates, averaged across loci and based on cross-validated fossil calibrations, largely agree with previous analyses and current knowledge of the fossil record. Nevertheless, the application of new analytical methods, including new clock models and locus partitioning strategies that are currently under rapid development, as well as the discovery of new fossils, will probably fine-tune these divergence times. As illustrated here with jawed vertebrates, our new phylogenomics pipeline has the potential to produce robust inferences of evolutionary history and can thus help resolve other recalcitrant nodes in the Tree of Life.
The paper in Nature Ecology & Evolution is here: http://go.nature.com/2vBS9Oi