Automated phylogenetic analysis of bacterial pathogens

Published in Ecology & Evolution

Six years ago, when our group first started discussing the need for continuous phylogenetic analysis of whole-genome sequenced bacterial isolates, “large-scale” meant hundreds to thousands of samples, not the hundreds of thousands to millions it means now. Whole-genome sequencing (WGS) was gradually making its way into the toolbox of public health laboratories, having proven useful for detecting and tracking outbreaks of food-borne bacterial illnesses and other infectious diseases. WGS is now routine in many public health, clinical and food safety microbiology laboratories, and thousands of sequences are shared publicly every week. Our “outbreak-spotting” pipeline, called evergreen, has therefore gone through a long development phase to tackle these new challenges.

In a regular phylogenetic analysis, every sequence in the study is compared with every other sequence, so the computational cost grows with the square of the number of sequences and becomes infeasible once that number is large enough. Furthermore, a phylogenetic tree is built from the sequences available at a given time, and when new sequences arrive, one usually wants to add them to those already analysed. The challenge was to do this quickly and accurately.
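To put numbers on that quadratic growth, here is a small back-of-the-envelope sketch. It only counts unordered pairs of sequences; the function name is ours, and it says nothing about how long a single comparison takes in evergreen.

```python
def pairwise_comparisons(n: int) -> int:
    """Number of unordered sequence pairs among n sequences: n * (n - 1) / 2."""
    return n * (n - 1) // 2


# Compare the "large-scale" of six years ago with today's.
for n in (1_000, 100_000, 1_000_000):
    print(f"{n:>9,} sequences -> {pairwise_comparisons(n):,} comparisons")
```

Going from a thousand to a million sequences multiplies the number of comparisons by roughly a million, which is why an all-against-all approach stops scaling.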

Workflow from an early stage of development

From the get-go, the key idea for reducing the computational burden was to divide the problem into many smaller problems, or in this case, phylogenetic trees. In reference-based phylogenetic inference, the closer the reference is to the studied sequences, the better the results. Splitting up the sequences by their sequence identity to selected reference sequences was therefore an obvious move. These references were chosen by homology-reducing complete chromosomal genomes, to lessen the overlap between trees. The threshold is set by default to 1% nucleotide sequence difference across the whole genome. In the beginning, the k-mer-based identity calculations were performed with the algorithm in KmerFinder¹, which was later replaced with KMA², speeding up the process a hundredfold.
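The splitting step can be pictured with a minimal sketch like the one below. It assumes the identity of each isolate to each reference has already been computed (in the pipeline that is the job of KMA) and simply groups isolates under their best-matching reference. The function name, the `min_identity` cut-off, the "unassigned" bucket, the choice to keep only the single best reference, and all the numbers are our illustrative assumptions, not the pipeline's actual code or settings.

```python
from collections import defaultdict


def split_by_reference(identities, min_identity):
    """Group isolates under their best-matching reference.

    identities: {isolate: {reference: identity between 0 and 1}}
    Isolates whose best identity is below min_identity are collected
    under the key "unassigned" (a hypothetical bucket for this sketch).
    """
    groups = defaultdict(list)
    for isolate, scores in identities.items():
        best_ref = max(scores, key=scores.get) if scores else None
        if best_ref is not None and scores[best_ref] >= min_identity:
            groups[best_ref].append(isolate)
        else:
            groups["unassigned"].append(isolate)
    return dict(groups)


# Toy identity values, invented for illustration only.
toy_identities = {
    "isolate_A": {"ref_1": 0.998, "ref_2": 0.951},
    "isolate_B": {"ref_1": 0.940, "ref_2": 0.995},
    "isolate_C": {"ref_1": 0.900, "ref_2": 0.910},
}
groups = split_by_reference(toy_identities, min_identity=0.99)
print(groups)
```

Each key of `groups` then corresponds to one smaller tree that can be inferred and updated on its own.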

The hope was that these smaller trees would flourish in time. In anticipation, we implemented a new genetic distance calculation method that does not require recalculating the full distance matrix each time a new sequence is added. This slowed the growth in the computational time needed to update an existing “evergreen” tree.
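The general idea of avoiding a full recalculation can be sketched as follows: when one sequence is added to n existing ones, only n new distances are needed, and the old matrix is extended by one row and one column. This is a toy illustration only. It uses a plain Hamming distance on equal-length, already aligned strings, which is our simplifying assumption and not the distance method implemented in evergreen.

```python
def hamming(a: str, b: str) -> int:
    """Count differing positions between two equal-length aligned sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(x != y for x, y in zip(a, b))


def full_matrix(seqs):
    """All-against-all distance matrix: the expensive baseline."""
    return [[hamming(a, b) for b in seqs] for a in seqs]


def add_sequence(seqs, dist, new_seq):
    """Extend seqs and dist in place, computing only len(seqs) new distances."""
    new_row = [hamming(new_seq, s) for s in seqs]
    for row, d in zip(dist, new_row):
        row.append(d)            # new column in every existing row
    dist.append(new_row + [0])   # new row, zero distance to itself
    seqs.append(new_seq)


# Toy aligned sequences, invented for illustration only.
seqs = ["ACGTACGT", "ACGTACGA", "ACGAACGA"]
dist = full_matrix(seqs)
add_sequence(seqs, dist, "TCGAACGA")
print(dist[-1])
```

The updated matrix is identical to one recomputed from scratch, but the update costs a number of comparisons that grows linearly, not quadratically, with the size of the tree.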

Initially, the plan was to couple this pipeline to the Bacterial Analysis Pipeline running on the Center for Genomic Epidemiology website³, so that users could monitor their own isolates for outbreaks. However, as more and more laboratories were publishing their WGS data in public repositories, the idea of the Evergreen Online platform emerged, which could connect food-related samples to clinical samples, possibly revealing the culprit behind foodborne-disease outbreaks. Even across country borders! This meant many more samples than we had originally planned for, so we added a homology-reduction step for the WGS samples. The threshold in Evergreen Online is 10 bases, which loosely corresponds to cut-offs for outbreak clusters, so we can use these groupings for surveillance as well.
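A homology-reduction step of this kind can be sketched as a greedy clustering: each incoming sample either joins the first cluster whose representative it closely resembles, or becomes the representative of a new cluster. The version below is a simplified illustration under our own assumptions: toy equal-length sequences, a plain Hamming distance, a strict "fewer than 10 differing bases" rule, and a first-match policy. The post only states that the threshold is 10 bases, so the exact comparison and ordering used by Evergreen Online may differ.

```python
def hamming(a: str, b: str) -> int:
    """Count differing positions between two equal-length aligned sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(x != y for x, y in zip(a, b))


def homology_reduce(samples, threshold=10):
    """Greedy clustering of {name: sequence} in insertion order.

    Returns {representative_name: [member names, representative first]}.
    A sample joins the first representative that is fewer than
    `threshold` bases away (an assumed rule for this sketch).
    """
    representatives = []  # (name, sequence) pairs kept for tree building
    clusters = {}
    for name, seq in samples.items():
        for rep_name, rep_seq in representatives:
            if hamming(seq, rep_seq) < threshold:
                clusters[rep_name].append(name)
                break
        else:
            representatives.append((name, seq))
            clusters[name] = [name]
    return clusters


# Toy sequences, invented for illustration only.
toy_samples = {
    "s1": "A" * 40,
    "s2": "C" * 3 + "A" * 37,   # 3 differences from s1
    "s3": "G" * 15 + "A" * 25,  # 15 differences from s1
}
clusters = homology_reduce(toy_samples)
print(clusters)
```

Only the representatives need to enter the tree, and the members of each cluster form a ready-made group of closely related isolates to inspect during surveillance.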

Never ending circle of genomic surveillance

These simple steps have made it possible to compare hundreds of thousands of isolates since Evergreen Online started running, and we are planning further development to meet new demands.

References
1. Larsen, M. V. et al. Benchmarking of Methods for Genomic Taxonomy. J. Clin. Microbiol. 52, 1529–1539 (2014).
2. Clausen, P. T. L. C., Aarestrup, F. M. & Lund, O. Rapid and precise alignment of raw reads against redundant databases with KMA. BMC Bioinformatics 19, 307 (2018).
3. https://cge.cbs.dtu.dk/services/
