Behind the Paper

Automated phylogenetic analysis of bacterial pathogens

Published in Ecology & Evolution

Mar 26, 2020

Judit Szarvas

PhD Student, Technical University of Denmark

Six years ago, when our group first started discussing the need for continuous phylogenetic analysis of whole-genome sequenced bacterial isolates, “large-scale” meant hundreds to thousands of samples, not hundreds of thousands to millions as it does now. Whole-genome sequencing (WGS) was gradually making its way into the toolbox of public health laboratories, having proven useful for detecting and tracking outbreaks of food-borne bacterial illnesses and other infectious diseases. WGS is now routine in many public health, clinical and food safety microbiology laboratories, and thousands of sequences are shared publicly every week. Our “outbreak-spotting” pipeline, called evergreen, has therefore gone through a long development phase to tackle these new challenges.

In a regular phylogenetic analysis, every sequence in the study is compared with every other sequence. The computational cost therefore increases with the square of the number of sequences, and the analysis becomes infeasible once that number grows too large. Furthermore, a phylogenetic tree is built from the sequences available at a given time, and when new sequences arrive, one often wishes to add them to those already analysed. The challenge was to do this quickly and accurately.
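To make the quadratic growth concrete, here is a small illustration that simply counts the unordered pairs in an all-against-all comparison. The function name `pairwise_comparisons` is ours, and the count says nothing about how long a single comparison takes.

```python
def pairwise_comparisons(n: int) -> int:
    """Number of unordered sequence pairs in an all-against-all comparison."""
    return n * (n - 1) // 2


# Sample sizes in the ranges mentioned in the text: thousands up to millions.
for n in (1_000, 100_000, 1_000_000):
    print(f"{n:>9,} sequences -> {pairwise_comparisons(n):,} pairwise comparisons")
```

Doubling the number of sequences roughly quadruples the number of comparisons, which is why a single tree over all public isolates quickly becomes impractical.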

Workflow from an early stage of development

From the get-go, the key idea for reducing the computational burden was to divide the problem into many smaller problems or, in this case, phylogenetic trees. In reference-based phylogenetic inference, the closer the reference is to the studied sequences, the better the results. Splitting up the sequences by their sequence identity to selected reference sequences was therefore an evident move. These references were chosen by homology reduction of complete chromosomal genomes, to lessen the overlap between trees. The threshold is set by default to 1% nucleotide sequence difference over the whole genome. In the beginning, k-mer based identity calculations were performed with the algorithm in KmerFinder¹, which was later replaced with KMA², speeding up the process a hundredfold.
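The splitting step can be pictured with a minimal sketch. It assumes the identity of each isolate to each reference has already been computed (in the pipeline this was done with KmerFinder and later KMA; no such call is made here). The function `bin_by_reference`, the `min_identity` cut-off, the rule of assigning an isolate only to its single best reference, and the identity values are all hypothetical and for illustration only.

```python
from collections import defaultdict


def bin_by_reference(identities, min_identity=0.0):
    """Group isolates under their best-matching reference.

    identities: {isolate_name: {reference_name: identity between 0 and 1}}
    Isolates whose best identity is below min_identity are left unassigned.
    """
    bins = defaultdict(list)
    for isolate, scores in identities.items():
        if not scores:
            continue  # no reference matched at all
        best_ref = max(scores, key=scores.get)
        if scores[best_ref] >= min_identity:
            bins[best_ref].append(isolate)
    return dict(bins)


# Made-up identity values, not real results.
example = {
    "isolate_1": {"ref_A": 0.995, "ref_B": 0.90},
    "isolate_2": {"ref_A": 0.91, "ref_B": 0.998},
    "isolate_3": {"ref_A": 0.60},
}
print(bin_by_reference(example, min_identity=0.9))
```

Each resulting group then gets its own, much smaller tree built against a close reference.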

The hope was that these smaller trees would flourish over time, and in anticipation we implemented a new genetic distance calculation method that does not require recalculating the full distance matrix each time a new sequence is added. This slowed the growth of the computational time needed to update an existing “evergreen” tree.
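The general idea of extending a distance matrix instead of rebuilding it can be sketched as follows. This is not the distance method implemented in evergreen; as a stand-in it uses a plain Hamming distance on equal-length aligned strings, and the names `hamming` and `add_sequence` are our own.

```python
def hamming(a: str, b: str) -> int:
    """Count differing positions between two equal-length aligned sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(x != y for x, y in zip(a, b))


def add_sequence(seqs, dist, new_seq):
    """Extend a square distance matrix in place with one new sequence.

    Only len(seqs) new distances are computed; existing entries are untouched.
    """
    new_row = [hamming(new_seq, s) for s in seqs]
    for row, d in zip(dist, new_row):
        row.append(d)           # new column in every existing row
    dist.append(new_row + [0])  # new row, zero on the diagonal
    seqs.append(new_seq)


seqs, dist = [], []
for s in ("ACGT", "ACGA", "TCGA"):
    add_sequence(seqs, dist, s)
print(dist)
```

Adding one sequence to a matrix of n sequences costs n comparisons here, rather than the roughly n²/2 needed to recompute everything.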

Initially, the plan was to couple this pipeline to the Bacterial Analysis Pipeline running on the Center for Genomic Epidemiology website³, so users could monitor their own isolates for outbreaks. However, as more and more laboratories published their WGS data in public repositories, the idea of the Evergreen Online platform emerged, which could connect food-related samples to clinical samples, possibly revealing the culprit behind foodborne-disease outbreaks. Even across country borders! But this meant many more samples than we had originally planned for, so we added a homology-reduction step for the WGS samples. The threshold in Evergreen Online is 10 bases, which loosely corresponds to cut-offs for outbreak clusters, so we can use these groupings for surveillance as well.
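One simple way to picture a homology-reduction step is a greedy grouping: a sample that differs from an existing representative by fewer than the threshold number of bases joins that representative's cluster, and otherwise it becomes a new representative. The sketch below is an assumption about the general approach, not the Evergreen Online implementation; the function names, the greedy first-match rule and the strict `<` comparison are ours. The 10-base default comes from the text, while the toy example uses a smaller threshold so short sequences suffice.

```python
def count_differences(a: str, b: str) -> int:
    """Count differing positions between two equal-length aligned sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(x != y for x, y in zip(a, b))


def homology_reduce(samples, threshold=10):
    """Greedily cluster (name, sequence) pairs around representatives.

    Returns {representative_name: [member names, representative first]}.
    """
    representatives = []  # (name, sequence) of each cluster representative
    clusters = {}
    for name, seq in samples:
        for rep_name, rep_seq in representatives:
            if count_differences(seq, rep_seq) < threshold:
                clusters[rep_name].append(name)
                break
        else:
            # No representative is close enough: start a new cluster.
            representatives.append((name, seq))
            clusters[name] = [name]
    return clusters


toy_samples = [
    ("s1", "AAAAAAAA"),
    ("s2", "AAAAAAAT"),
    ("s3", "TTTTAAAA"),
    ("s4", "TTTTAAAC"),
]
print(homology_reduce(toy_samples, threshold=2))
```

Only the representatives need to enter the tree, and because the threshold sits near typical outbreak cut-offs, each cluster doubles as a candidate outbreak group.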

Never-ending circle of genomic surveillance

These simple steps have made it possible to compare hundreds of thousands of isolates since Evergreen Online started running, and we are planning further development to meet new demands.

References
1. Larsen, M. V. et al. Benchmarking of Methods for Genomic Taxonomy. J. Clin. Microbiol. 52, 1529–1539 (2014).
2. Clausen, P. T. L. C., Aarestrup, F. M. & Lund, O. Rapid and precise alignment of raw reads against redundant databases with KMA. BMC Bioinformatics 19, 307 (2018).
3. https://cge.cbs.dtu.dk/services/
