Behind the Paper

ContScout: removing contamination from annotated genomes

Published in Ecology & Evolution, Protocols & Methods, and Genetics & Genomics

Feb 01, 2024

Balázs Bálint

postdoctoral researcher, HUN-REN Biological Research Centre, Szeged

Technological improvements in high-throughput sequencing have led to an unprecedented increase in the number of sequenced genomes in public databases. Several large-scale sequencing projects are underway, focusing on specific taxonomic groups such as insects, fungi, vertebrates or plants, with the ultimate goal of capturing the entire biodiversity of the Earth. On the way to this goal, an important question arises when we look at the published genomes: can we really believe the images that we see? (Fig. 1)

Fig. 1: Can we believe our eyes? These ambiguous images are hard to grasp at first glance. Similarly, published draft genomes often include foreign sequences that are not trivial to spot. If they remain undetected, such ambiguities can seriously interfere with downstream analysis.
Image sources: Fliegende Blätter (1892), Jorge Rodríguez (Pixabay)

Draft genomes are by definition incomplete. They often lack information about regions that are difficult to sequence, assemble or annotate. On the other hand, published genomes may contain excess sequences that do not belong to the target organism but result from contamination. If these sequences remain in the data set, they can affect all subsequent comparative genomic and phylogenomic analyses and possibly lead to completely wrong conclusions, as was the case with the infamous tardigrade genome¹. There, foreign sequences introduced by massive bacterial contamination were mistaken for horizontal gene transfer from bacteria to a eukaryote.

In our article, recently published in Nature Communications, we present a tool called ContScout, which was developed to identify and remove contamination from annotated draft genomes. The key feature that distinguishes ContScout from other decontamination software is its ability to combine taxonomic information with gene locus information (Fig. 2).

Fig. 2: Overview of the ContScout algorithm. (a) A quick database search is performed on each query protein sequence against a taxonomy-aware reference database. (b) Bar charts show the top hits ranked according to the alignment scores. For each query sequence, the taxon information of the best hit is taken, together with an assignment confidence score. (c) Protein taxon calls are summarised over contigs, yielding a consensus label on each contig. Contigs with a taxon call not matching the query genome are marked for removal together with all associated proteins.
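The contig-level step in panel (c) can be illustrated with a short Python sketch. This is not ContScout's actual code: the function name, the input tuples, the simple majority vote and the `min_fraction` threshold are all assumptions made for illustration, and the assignment confidence score mentioned above is ignored here.

```python
from collections import Counter, defaultdict


def flag_contaminant_contigs(protein_calls, expected_taxon, min_fraction=0.5):
    """Hypothetical sketch of a contig-level consensus call.

    protein_calls: iterable of (protein_id, contig_id, best_hit_taxon).
    expected_taxon: taxon label of the genome being screened.
    min_fraction: assumed share of proteins that must agree on a foreign
        taxon before the whole contig is flagged (illustrative value).
    """
    # Group the per-protein taxon calls by the contig that encodes them.
    by_contig = defaultdict(list)
    for protein_id, contig_id, taxon in protein_calls:
        by_contig[contig_id].append((protein_id, taxon))

    removed_contigs, removed_proteins = [], []
    for contig_id, members in by_contig.items():
        counts = Counter(taxon for _, taxon in members)
        consensus, votes = counts.most_common(1)[0]
        # Flag the contig only if its consensus label is foreign.
        if consensus != expected_taxon and votes / len(members) > min_fraction:
            removed_contigs.append(contig_id)
            removed_proteins.extend(pid for pid, _ in members)
    return removed_contigs, removed_proteins


# Toy input: a plant genome with one contig made up of fungal-looking proteins.
calls = [
    ("p1", "contig_1", "plant"),
    ("p2", "contig_1", "plant"),
    ("p3", "contig_1", "fungus"),  # lone foreign call on a host contig
    ("p4", "contig_2", "fungus"),
    ("p5", "contig_2", "fungus"),
]
contigs, proteins = flag_contaminant_contigs(calls, expected_taxon="plant")
print(contigs, proteins)
```

In this toy input, `contig_2` is flagged together with `p4` and `p5`, while the single fungal-looking protein `p3` stays because its contig is otherwise labelled as plant. That is the intuition behind combining taxon calls with locus information: a foreign-looking gene surrounded by host genes is treated differently from a contig that is foreign throughout.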

Using synthetic data and manually filtered sequences, we demonstrate that ContScout has outstanding sensitivity and specificity, outperforming most taxon labeling and sequence decontamination tools available to date (Conterminator, BASTA, MMSeqs and DIAMOND). By screening 844 published genomes representing all major clades of the eukaryotic tree of life, we show that contamination is widespread among them, with some extreme cases containing thousands of foreign proteins (Fig. 3).
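For readers less familiar with the two metrics: when the true contaminants are known, as in a synthetic data set, sensitivity and specificity follow directly from comparing the flagged proteins with the truth. The Python sketch below uses hypothetical protein IDs and a hypothetical helper function; it is not the benchmarking code from the paper.

```python
def sensitivity_specificity(all_proteins, true_contaminants, flagged):
    """Compute sensitivity and specificity of a contamination call.

    All three arguments are collections of protein IDs (hypothetical here).
    """
    all_proteins = set(all_proteins)
    true_contaminants = set(true_contaminants)
    flagged = set(flagged)

    tp = len(flagged & true_contaminants)   # contaminants correctly flagged
    fn = len(true_contaminants - flagged)   # contaminants that were missed
    fp = len(flagged - true_contaminants)   # host proteins wrongly flagged
    tn = len(all_proteins - flagged - true_contaminants)  # host proteins kept

    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
    specificity = tn / (tn + fp) if (tn + fp) else float("nan")
    return sensitivity, specificity


# Toy example: 10 proteins, 4 of them spiked-in contaminants.
proteins_all = [f"p{i}" for i in range(1, 11)]
truth = {"p1", "p2", "p3", "p4"}
called = {"p1", "p2", "p3", "p9"}
sens, spec = sensitivity_specificity(proteins_all, truth, called)
print(f"sensitivity={sens:.2f}, specificity={spec:.2f}")
```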

Fig. 3: Example of a massive contamination: oak leaf infected with fungi (powdery mildew). If such a sample is used for genomic DNA purification and subsequent sequencing, the assembled draft genome becomes a mixture with all contaminating fungal sequences mislabeled as plant. Image generated by DALL-E 2.

Using a ubiquitous pyridoxal kinase protein family as an example, we show in detail how the presence of two mislabelled proteins can confound the inferred evolutionary history of the gene family by introducing 15 excess deletion events throughout the species tree and several gene gain events near the root of the tree (Fig. 4).

Fig. 4: Effect of contamination in the pyridoxal kinase gene family. Evolutionary history of a pyridoxal kinase gene family inferred with Compare². Blue circles indicate gain events while red circles show gene losses. All differences (marked by a thick dark stroke around the circles) are the consequence of two contaminating proteins within the family: Quersube_4764 and Bombimpa_11962.

Using a 36-genome demonstration data set and a leave-one-out approach, we show that the confounding effect of contaminated genomes on ancestral gene content estimates is profound and cumulative. Finally, we challenge ContScout with a set of known horizontal gene transfer (HGT) cases. We show that the tool is able to distinguish HGT from contamination in most of the tested cases.

With our article, we aim to draw the attention of the scientific community to the prevalence of contamination among published genomes and show how unhandled contamination can bias phylogenomic studies. Furthermore, we present a user-friendly tool for the detection and removal of contamination in annotated genomes.

Separating closely related genomes from each other was the biggest challenge that we encountered during the development of ContScout. This major improvement was independently suggested by two of the referees during manuscript revision. To enable this feature, the taxon calling module needed a complete redesign and re-implementation, which took nearly two months to complete. I firmly believe that this improvement is a game changer and hope that the tool will prove useful for the research community.

References:

1. Boothby TC et al. Evidence for extensive horizontal gene transfer from the draft genome of a tardigrade. Proc Natl Acad Sci U S A. (2015). doi: 10.1073/pnas.1510461112

2. Nagy LG et al. Latent homology and convergent regulatory evolution underlies the repeated emergence of yeasts. Nat Commun. (2014). doi: 10.1038/ncomms5471
