Overview
Bacteriologists in the early 2000s made an astonishing observation: bacterial strains of (ostensibly) the same species often differ by >50% of their gene content. For instance, Welch and colleagues (2002) observed that only ~40% of genes encoded by three strains of Escherichia coli were shared by all three (Figure 1a). This was an early example of a ‘pangenome’ analysis, an approach that is now routinely applied to hundreds or thousands of genomes. A common observation with such data is that most genes within a bacterial species are rare, observed in just one or a small fraction of the genomes sampled (Figure 1b).
So, what’s driving this extensive strain variation across bacteria? Is there a baseline rate of genetic drift and/or horizontal gene transfer that maintains rare genes? Or are these genes niche-specific adaptations? These possibilities have been debated in recent years, but remain unresolved, partially because it is unclear how to establish an appropriate null hypothesis of gene content diversity driven by neutral processes.
This gap was the motivation for our work. We investigated pangenome diversity across diverse bacteria, with a key additional step: considering both intact genes and degenerating genes (pseudogenes). We show that pseudogenes can be used as a proxy for how rare elements are distributed in the absence of selection, which provides a null distribution that can be compared to the analogous data for intact genes. Our key finding is that we can reject a neutral model of pangenome variation, suggesting that a non-negligible portion of rare accessory genes provide adaptive value to their bacterial host genomes.
Context: similar observations, but very different conclusions
Recent investigations into the evolutionary forces underlying bacterial pangenome variation have focused on correlations of measures of pangenome size/diversity with measures of the effective population size (Ne). Ne determines the relative impact of genetic drift vs. selection, and thus the efficacy of natural selection. One common measure for this is the ratio of non-synonymous to synonymous substitutions (dN/dS) within core genes: lower values of this ratio are taken as evidence for effective selection against slightly deleterious non-synonymous mutations. A typical observation is shown in Figure 2: the proxy for Ne (inverted dN/dS in this case) and pangenome size are positively associated across bacterial species.
So, it seems like rare accessory genes must be adaptive on average, right? Not necessarily. Nucleotide diversity at neutral sites is also commonly used as a proxy for Ne, as overall neutral genetic variation is expected to scale proportionally with Ne (e.g., due to fewer population bottlenecks). Pangenome diversity is also positively associated with this Ne proxy (Figure 3), which Andreani et al. interpreted as evidence for pangenome diversity being primarily neutral.
These contrasting interpretations of very similar analyses highlight the difficulty of interpreting a measure of genetic diversity (in this case the distribution of accessory genes) as a traditional phenotype. Selection is indeed more effective in species with higher Ne, but we also expect those species to have higher standing levels of neutral genetic variation.
Our paper: using pseudogenes as a neutral reference to help distinguish these explanations
As mentioned above, the lack of a true neutral null is one reason why it’s difficult to disentangle the two possible explanations illustrated by Figures 2 and 3. Our contribution is to introduce pseudogenes as a reference point for gauging the evolutionary forces acting upon intact accessory genes.
As a proof of concept across 10 highly-sampled bacterial species, we showed that ultra-rare intact genes and pseudogenes substantially differed in their functional annotations (Figure 4). Most strikingly, we found that non-redundant genes (i.e., those that cannot be compensated by a gene with the same functional annotation if they are knocked out) were much less likely to be pseudogenes. Under a neutral model, redundant and non-redundant genes would be equally likely to become pseudogenes, so we can confidently reject pure neutrality in this case. Rare genes in most functional categories are therefore under selection to be retained by the host genome, indicating their adaptive value. In contrast, ‘mobilome’ genes (and transposons in particular) are enriched in pseudogenes, suggesting they may often be deleterious to their host genome.
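To make the neutral expectation concrete, here is a minimal sketch of how such a comparison could be set up as a 2x2 table (redundant vs. non-redundant, pseudogene vs. intact). The counts are invented for illustration, the function names are hypothetical, and the one-sided Fisher-style test is an assumption chosen for simplicity rather than the procedure used in the paper. Under neutrality the odds ratio should be close to 1; a value well below 1 means non-redundant elements are pseudogenized less often than redundant ones.

```python
from math import comb


def odds_ratio(nonred_pseudo, nonred_intact, red_pseudo, red_intact):
    """Odds of being a pseudogene for non-redundant vs. redundant elements."""
    if nonred_intact == 0 or red_pseudo == 0:
        raise ValueError("odds ratio undefined when a denominator cell is zero")
    return (nonred_pseudo * red_intact) / (nonred_intact * red_pseudo)


def depletion_p_value(nonred_pseudo, nonred_intact, red_pseudo, red_intact):
    """One-sided hypergeometric (Fisher-style) p-value.

    Probability of seeing this few or fewer pseudogenes among the
    non-redundant elements if pseudogenization ignored redundancy.
    """
    total = nonred_pseudo + nonred_intact + red_pseudo + red_intact
    total_pseudo = nonred_pseudo + red_pseudo
    n_nonred = nonred_pseudo + nonred_intact
    # Sum hypergeometric probabilities for 0..observed pseudogenes.
    numerator = sum(
        comb(total_pseudo, k) * comb(total - total_pseudo, n_nonred - k)
        for k in range(nonred_pseudo + 1)
    )
    return numerator / comb(total, n_nonred)


# Invented toy counts (NOT data from the study).
toy = dict(nonred_pseudo=10, nonred_intact=90, red_pseudo=40, red_intact=60)
print("odds ratio:", odds_ratio(**toy))
print("one-sided p:", depletion_p_value(**toy))
```

With these toy counts, non-redundant elements have roughly one sixth the odds of being pseudogenes, which is the kind of departure from 1 that argues against pure neutrality.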
Building upon these observations, we then explored the distribution of rare intact genes vs. pseudogenes across hundreds of prokaryotic (mainly bacterial) species. We focused on two metrics: the percent of intact genes that are singletons (i.e., elements found in only one genome) and the corresponding percent of pseudogenes that are singletons (Figure 5). Our hypothesis was that the ratio of these values would provide insight into the likely forces driving pangenome variation in each species.
The ratio of these values (si/sp) provides a measure of pangenome diversity normalized by pseudogene content diversity. Assuming that pseudogene content is primarily determined by genetic drift, including pseudogene content in the denominator serves as a normalization for neutral genic diversity. After performing this normalization, si/sp remains significantly associated with dN/dS (as an inverted proxy for Ne; Figure 6), which would not be expected under a purely neutral model of pangenome evolution.
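As a rough illustration of the si/sp idea, the sketch below computes the percent of singletons for intact genes and for pseudogenes from toy presence/absence data, then takes their ratio. The data structure, the function names, and the choice of denominator (all distinct elements of that type in the pangenome) are assumptions made for this example; the paper's exact definition may differ.

```python
def percent_singletons(presence):
    """Percent of elements found in exactly one genome.

    presence: dict mapping an element ID to the set of genome IDs carrying it.
    """
    if not presence:
        raise ValueError("no elements provided")
    singletons = sum(1 for genomes in presence.values() if len(genomes) == 1)
    return 100.0 * singletons / len(presence)


def si_sp_ratio(intact, pseudo):
    """Singleton percent of intact genes (si) over that of pseudogenes (sp)."""
    si = percent_singletons(intact)
    sp = percent_singletons(pseudo)
    if sp == 0:
        raise ValueError("no singleton pseudogenes; si/sp is undefined")
    return si / sp


# Toy data for three genomes (g1..g3); all IDs are hypothetical.
intact = {
    "geneA": {"g1", "g2", "g3"},
    "geneB": {"g1"},
    "geneC": {"g2"},
    "geneD": {"g1", "g3"},
}
pseudo = {
    "psiA": {"g1", "g2"},
    "psiB": {"g3"},
    "psiC": {"g1", "g2", "g3"},
    "psiD": {"g2", "g3"},
}

print(si_sp_ratio(intact, pseudo))  # 50% / 25% = 2.0
```

In this toy case, intact genes are singletons twice as often as pseudogenes. Computed per species, such ratios could then be compared against a proxy for Ne such as dN/dS.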
Open questions
Our work establishes a new framework for exploring bacterial phylogenomic dynamics, by incorporating pseudogenes as neutral references. Granted, our approach relies on the assumption that pseudogene content is mainly determined by genetic drift. Although our overall results are consistent with this interpretation, individual species (such as obligate intracellular pathogens) have distinct pseudogene content dynamics, which should be considered in future work.
Another important point is that we were able to detect differential selection on functional categories, but not on individual genes. Analyzing finer-level groupings could help address this limitation. Nonetheless, tests for selection on gene content variation, especially regarding rare genes, are lacking, and so we hope our framework to identify gene categories under differential selection is useful to researchers studying different groups of microbes and even macrobes.
Licensing note
All figures (except for Figure 1, which was created for this blog post) were originally distributed under a Creative Commons Attribution 4.0 International License. Please see the caption of each figure for the original source material.
You can find the published Nature Ecology & Evolution article here.