Using degenerating genes to understand the evolution of rare intact genes across bacteria

An overview and broader look at the paper "Pseudogenes act as a neutral reference for detecting selection in prokaryotic pangenomes" published in Nature Ecology & Evolution (2024). Blog post authors: Gavin M. Douglas and B. Jesse Shapiro

Overview

Bacteriologists in the early 2000s made an astonishing observation: bacterial strains of (ostensibly) the same species often differ by >50% of their gene content. For instance, Welch and colleagues (2002) observed that only ~40% of genes encoded by three strains of Escherichia coli were shared by all three (Figure 1a). This was an early example of a ‘pangenome’ analysis, which is now routinely applied to hundreds or thousands of genomes. A common observation with such data is that most genes within a bacterial species are rare, observed in just one or a small fraction of the genomes sampled (Figure 1b).

Figure 1: Example plots displaying typical pangenome characteristics across bacterial species. (a) Venn diagram re-plotting data from Welch et al. 2002, showing that most genes are not encoded by all three E. coli strains compared. (b) Example gene prevalence plot (with dummy data) that highlights the typical observation that a high proportion (often the majority) of genes across a bacterial species’ pangenome are observed in a minority of the sampled genomes.
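
To make the idea of gene prevalence concrete, here is a minimal sketch of how the counts behind a plot like Figure 1b could be tallied from a presence/absence table. The strain names, gene identifiers, and the `gene_prevalence` helper are all hypothetical toy values for illustration, not data or code from the paper.

```python
from collections import Counter

# Hypothetical toy pangenome: genome name -> set of gene-family identifiers.
genomes = {
    "strain_A": {"g1", "g2", "g3", "g4"},
    "strain_B": {"g1", "g2", "g5"},
    "strain_C": {"g1", "g6", "g7"},
}


def gene_prevalence(genomes):
    """Count how many genomes encode each gene family."""
    counts = Counter()
    for genes in genomes.values():
        counts.update(genes)
    return counts


prevalence = gene_prevalence(genomes)
n_genomes = len(genomes)

# Core genes are present in every genome; singletons in exactly one.
core = sorted(g for g, n in prevalence.items() if n == n_genomes)
singletons = sorted(g for g, n in prevalence.items() if n == 1)
core_fraction = len(core) / len(prevalence)

print(f"core: {core}, singletons: {singletons}")
print(f"fraction of gene families that are core: {core_fraction:.2f}")
```

Even in this tiny example, most gene families are found in a single genome, which is the pattern that real pangenomes tend to show at much larger scale.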

So, what’s driving this extensive strain variation across bacteria? Is there a baseline rate of genetic drift and/or horizontal gene transfer that maintains rare genes? Or are these genes niche-specific adaptations? These possibilities have been debated in recent years, but remain unresolved, partially because it is unclear how to establish an appropriate null hypothesis of gene content diversity driven by neutral processes. 

This gap was the motivation for our work. We investigated pangenome diversity across diverse bacteria, with a key additional step: considering both intact genes and degenerating genes (pseudogenes). We show that pseudogenes can be used as a proxy for how rare elements are distributed in the absence of selection, which provides a null distribution that can be compared to the analogous data for intact genes. Our key finding is that we can reject a neutral model of pangenome variation, suggesting that a non-negligible portion of rare accessory genes provide adaptive value to their bacterial host genomes.

Context: similar observations, but very different conclusions

Recent investigations into the evolutionary forces underlying bacterial pangenome variation have focused on correlations of measures of pangenome size/diversity with measures of the effective population size (Ne). Ne determines the relative impact of genetic drift vs. selection, and thus the efficacy of natural selection. One common measure for this is the ratio of non-synonymous to synonymous substitutions (dN/dS) within core genes: lower values of this ratio are taken as evidence for effective selection against slightly deleterious non-synonymous mutations. A typical observation is shown in Figure 2: the proxy for Ne (inverted dN/dS in this case) and pangenome size are positively associated across bacterial species.

Figure 2: Clear association between dS/dN (a measure of Ne) and pangenome size. Originally Figure 4c from Bobay and Ochman (2018), where each point is a bacterial species. dS/dN refers to the ratio of the synonymous to non-synonymous substitution rates (an inversion of the standard ratio), which provides a proxy for Ne, the effective population size.

So, it seems like rare accessory genes must be adaptive on average, right? Not necessarily. Nucleotide diversity at neutral sites is also commonly used as a proxy for Ne, as overall neutral genetic variation is expected to scale proportionally with Ne (e.g., due to fewer population bottlenecks). Pangenome diversity is also positively associated with this Ne proxy (Figure 3), which Andreani et al. (2017) interpreted as evidence for pangenome diversity being primarily neutral.

Figure 3: A similar association between pangenome diversity and a measure of Ne, but with a different interpretation. Originally Figure 1 from Andreani et al. 2017, where each point is a bacterial species.

These contrasting interpretations of very similar analyses highlight the difficulty of interpreting a measure of genetic diversity (in this case the distribution of accessory genes) as a traditional phenotype. Selection is indeed more effective in species with higher Ne, but we also expect those species to have higher standing levels of neutral genetic variation.

 

Our paper: using pseudogenes as a neutral reference to help distinguish these explanations

As mentioned above, the lack of a true neutral null is one reason why it’s difficult to disentangle the two possible explanations illustrated by Figures 2 and 3. Our contribution is to introduce pseudogenes as a reference point for gauging the evolutionary forces acting upon intact accessory genes.

As a proof of concept across 10 highly-sampled bacterial species, we showed that ultra-rare intact genes and pseudogenes substantially differed in their functional annotations (Figure 4). Most strikingly, we found that non-redundant genes (i.e., those that cannot be compensated by a gene with the same functional annotation if they are knocked out) were much less likely to be pseudogenes. Under a neutral model, redundant and non-redundant genes would be equally likely to become pseudogenes, so we can confidently reject pure neutrality in this case. Rare genes in most functional categories are therefore under selection to be retained by the host genome, indicating their adaptive value. In contrast, ‘mobilome’ genes (and transposons in particular) are enriched in pseudogenes, suggesting they may often be deleterious to their host genome.
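
The logic of the redundancy comparison can be illustrated with a simple two-by-two odds ratio. This is a deliberately simplified stand-in, not the model used in the paper (see the manuscript for that), and the counts below are invented for illustration. The `log_odds_ratio` helper is hypothetical; it adds 0.5 to each cell (the Haldane-Anscombe correction) so that zero counts do not break the calculation.

```python
from math import log


def log_odds_ratio(pseudo_a, intact_a, pseudo_b, intact_b):
    """Log odds of being a pseudogene in group A relative to group B.

    Each count gets +0.5 (Haldane-Anscombe correction) to avoid division
    by zero. Values < 0 mean group A is pseudogene-depleted relative to B.
    """
    a, b, c, d = (x + 0.5 for x in (pseudo_a, intact_a, pseudo_b, intact_b))
    return log((a * d) / (b * c))


# Invented toy counts of ultra-rare elements (NOT results from the paper).
non_redundant = {"pseudo": 20, "intact": 180}
redundant = {"pseudo": 60, "intact": 140}

lor = log_odds_ratio(
    non_redundant["pseudo"], non_redundant["intact"],
    redundant["pseudo"], redundant["intact"],
)
print(f"log odds ratio (non-redundant vs. redundant): {lor:.2f}")
```

Under neutrality the two groups would pseudogenize at the same rate and the log odds ratio would sit near zero. A clearly negative value, as in this toy example, corresponds to pseudogene depletion among non-redundant genes.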

Figure 4: Functional categories enriched and depleted for extremely rare pseudogenes. Originally Figure 2 in our bioRxiv preprint. This is a summary of a model exploring which variables are predictive of whether an ultra-rare element is intact or is a pseudogene. Variables with estimates > 0 are pseudogene-enriched, while those < 0 are pseudogene-depleted. “Non-redun.” indicates genes that are not redundant with a gene of the same COG identifier in the same genome. See our manuscript for details.

Building upon these observations, we then explored the distribution of rare intact genes vs. pseudogenes across hundreds of prokaryotic (mainly bacterial) species. We focused on two metrics: the percent of intact genes that are singletons (i.e., elements found in only one genome) and the percent of pseudogenes that are singletons (Figure 5). Our hypothesis was that the ratio of these values would provide insight into the likely forces driving pangenome variation in each species.
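
As a rough sketch of these two metrics, the snippet below computes the percent of singletons separately for intact genes and pseudogenes in one toy species. It assumes one plausible reading of the metric (the share of distinct elements seen in exactly one genome); the paper's exact definition may differ, for example by averaging per genome. All identifiers and the `percent_singletons` helper are hypothetical.

```python
from collections import Counter


def percent_singletons(element_sets):
    """Percentage of distinct elements that occur in exactly one genome."""
    counts = Counter()
    for elements in element_sets:
        counts.update(set(elements))
    if not counts:
        return 0.0
    n_single = sum(1 for n in counts.values() if n == 1)
    return 100.0 * n_single / len(counts)


# Hypothetical toy species with three genomes (one set per genome).
intact_genes = [{"a", "b", "c"}, {"a", "b", "d"}, {"a", "e"}]
pseudogenes = [{"p1", "p2"}, {"p1", "p3"}, {"p1", "p2"}]

pct_intact = percent_singletons(intact_genes)
pct_pseudo = percent_singletons(pseudogenes)

print(f"intact singletons: {pct_intact:.1f}%")
print(f"pseudogene singletons: {pct_pseudo:.1f}%")
```

Keeping the two element types in separate tallies is the point of the exercise: the pseudogene value serves as the reference against which the intact-gene value is judged.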

Figure 5: Comparing mean percent intact singletons and pseudogenes across prokaryotic species. Coloured text indicates possible explanations for why points might be at those extremes. Originally Figure 3a in our bioRxiv preprint.

The ratio of these values (si/sp) provides a measure of pangenome diversity normalized by pseudogene content diversity. Assuming that pseudogene content is primarily determined by genetic drift, including pseudogene content in the denominator serves as a normalization for neutral genic diversity. After performing this normalization, si/sp remains significantly associated with dN/dS (as an inverted proxy for Ne; Figure 6), which would not be expected under a purely neutral model of pangenome evolution.
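
The normalization and the cross-species comparison can be sketched as follows. This is only an illustration under stated assumptions: the per-species values are invented (and deliberately constructed to show a negative trend), and a plain Spearman rank correlation stands in for whatever statistical approach the paper used. The helper names are hypothetical.

```python
from math import sqrt


def average_ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / sqrt(vx * vy)


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Invented per-species values (NOT results from the paper).
species = [
    {"name": "sp1", "s_i": 12.0, "s_p": 10.0, "dnds": 0.20},
    {"name": "sp2", "s_i": 15.0, "s_p": 10.0, "dnds": 0.12},
    {"name": "sp3", "s_i": 9.0, "s_p": 10.0, "dnds": 0.30},
    {"name": "sp4", "s_i": 20.0, "s_p": 10.0, "dnds": 0.08},
]

# si/sp: intact singleton percentage normalized by the pseudogene value.
ratios = [s["s_i"] / s["s_p"] for s in species]
dnds = [s["dnds"] for s in species]

rho = spearman(ratios, dnds)
print(f"Spearman rho between si/sp and dN/dS: {rho:.2f}")
```

A negative correlation means higher si/sp in species with lower dN/dS, i.e., where selection is more effective. Under a purely neutral model the normalization would be expected to remove this association.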

Figure 6: The si/sp ratio is higher when natural selection is more effective. Here, dN/dS is used as a metric of the efficacy of selection, with lower dN/dS ratios indicating more effective selection. That higher si/sp ratios tend to be observed at lower dN/dS ratios across hundreds of prokaryotic species suggests that rare intact genes tend to be retained by natural selection. Originally Figure 4d in our bioRxiv preprint.

Open questions

Our work establishes a new framework for exploring bacterial phylogenomic dynamics, by incorporating pseudogenes as neutral references. Granted, our approach relies on the assumption that pseudogene content is mainly determined by genetic drift. Although our overall results are consistent with this interpretation, individual species (such as obligate intracellular pathogens) have distinct pseudogene content dynamics, which should be considered in future work.

Another important point is that we were able to detect differential selection on functional categories, but not on individual genes. Analyzing finer-level groupings could help address this limitation. Nonetheless, tests for selection on gene content variation, especially regarding rare genes, are lacking, and so we hope our framework to identify gene categories under differential selection is useful to researchers studying different groups of microbes and even macrobes.

 

Licensing note

All figures (except for Figure 1, which was created for this blog post) were originally distributed under a Creative Commons Attribution 4.0 International License. Please see the caption of each figure for the original source material.

You can find the published Nature Ecology & Evolution article here.
