The emergence of genes “de novo” from non-genic sequence seems an unlikely event. In fact, it was historically viewed as nearly impossible (Jacob 1977). In a 2022 review, “The Origins and Functions of De Novo Genes: Against All Odds?”, Caroline Weisman considers where this view came from and how it stands up to recent work that appears to document bona fide cases of de novo gene birth.
Weisman frames the article by proposing that this assumption of improbability rests on three premises about genetic material that seem intuitively true. First, almost all of sequence space produces no beneficial biological effect (“Sparsity”); second, non-genic sequences are a random sample of sequence space (“Fair play”); and third, evolution can only sample a modest proportion of non-genic sequences for biological effects (“Limited trials”). If de novo genes do in fact exist despite the low odds that these premises suggest, Weisman proposes that they must do so by violating one or more of the premises.
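To see how the three premises combine into "low odds", a rough back-of-the-envelope sketch may help. It assumes each sampled sequence is an independent trial with the same chance of being beneficial, which is a simplification introduced here and not a model from the review. The function name and all numbers are hypothetical. A Sparsity violation corresponds to a larger fraction f, a Limited-trials violation to a larger number of trials N, and a Fair-play violation to f being higher among real non-genic sequences than in sequence space as a whole.

```python
import math

def prob_at_least_one_hit(fraction_beneficial: float, trials: int) -> float:
    """Chance that at least one of `trials` independent random sequences is beneficial."""
    if fraction_beneficial >= 1.0:
        return 1.0
    # 1 - (1 - f)^N, written with log1p/expm1 so that a tiny f does not round to zero
    return -math.expm1(trials * math.log1p(-fraction_beneficial))

# Hypothetical (f, N) pairs chosen only to show how the premises interact
scenarios = {
    "sparse space, few trials": (1e-12, 10**6),
    "sparse space, many trials": (1e-12, 10**12),
    "less sparse space, few trials": (1e-6, 10**6),
}
for label, (f, n) in scenarios.items():
    print(f"{label}: P(at least one hit) = {prob_at_least_one_hit(f, n):.3g}")
```

Under these made-up values, relaxing either the sparsity or the trial count moves the outcome from very unlikely to more likely than not, which is the intuition behind looking for violations of each premise separately.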
The article first conducts a case study, using rigorous criteria to compile a conservative set of genes that are strongly supported as having originated de novo and that have well-characterized biological effects. This set includes the protein-coding genes northern gadid AFGP, Saccharomyces cerevisiae MDF1, Saccharomyces cerevisiae BSC4, Homo sapiens PBOV1, Homo sapiens NCYM and Homo sapiens MYEOV, and the RNA genes Homo sapiens ELFN1-AS1 and Mus musculus Poldi. The article then speculates about what these cases show about which premises de novo genes may violate, and in what way, as they beat the odds.
Sparsity Violations support the idea that a significant proportion of sequence space produces beneficial biological effects.
- “Some Biological Effects Require Only Small Regions of Watson-Crick Complementarity and So are Common in Sequence Space”
Some biological effects have minimal sequence requirements and are therefore expected to be common in sequence space; for example, a miRNA sponge requires a specific sequence only 6-8 nucleotides long.
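To give a sense of scale for "common in sequence space", the following minimal sketch computes how often one fixed short site is expected to appear by chance in a random transcript, and checks the figure by simulation. It assumes uniform, independent bases, which real transcripts do not have. The site sequence, transcript length, and function names are hypothetical and not taken from the review.

```python
import random

def expected_site_count(transcript_len: int, site_len: int) -> float:
    """Expected exact matches to one fixed site in a uniform random sequence."""
    positions = max(transcript_len - site_len + 1, 0)
    return positions / 4 ** site_len

def count_sites(seq: str, site: str) -> int:
    """Count occurrences of `site` in `seq`, allowing overlaps."""
    return sum(seq.startswith(site, i) for i in range(len(seq) - len(site) + 1))

rng = random.Random(0)        # fixed seed so the run is repeatable
site = "ACAUUCC"              # hypothetical 7-nt target site
n_transcripts, length = 500, 1000

total = sum(
    count_sites("".join(rng.choices("ACGU", k=length)), site)
    for _ in range(n_transcripts)
)
print("expected sites per transcript:", expected_site_count(length, len(site)))
print("simulated sites per transcript:", total / n_transcripts)
```

Because the expected count scales as 1 / 4^k for a site of length k, each nucleotide removed from the required site makes a chance match four times more frequent, which is why very short sequence requirements translate into functions that are easy to stumble upon.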
- “Sequence Space ‘attractors’ Increase the Probability of Function”
Walks in sequence space are not random: mutational biases make some sequences (for example, tandem repeats) more evolutionarily accessible than others. Functions that such sequences can perform are therefore likelier to arise than they may appear.
- “Interactions are easy to come by”
Proteins often have many lower-affinity binding partners. Perhaps most proteins start out promiscuous and acquire more specificity over time.
- “New Proteins Adopt Old Roles in New Contexts”
De novo proteins may perform the same roles as pre-existing ones and thereby bypass the need to forge a new useful function. For example, a de novo transcription factor can dimerize with transcription factors that usually have other interacting partners, filling the role of an existing transcription factor but under different conditions.
- “New Proteins Reactivate Existing Pathways in New Contexts”
For example, de novo genes often have oncogenic functions that work by activating conserved pathways and programs that may have been important in processes such as development. “New proteins may generally find it easy to flip all kinds of cellular switches; cancer may often be the result.”
- “The Cell offers Many ‘freeloader functions’ that require little more than binding and are abundant in sequence space”
Binding to a protein is often a function in and of itself: a binder may competitively inhibit the target's native reaction or stabilize the target, whether that target is a kinase, a transcription factor or a glycosylase.
Fair-Play Violations support the idea that de novo genes emerge from sequences that are not truly random but are enriched for beneficial biological effects.
- “Basic Structural Properties are Easy to Come By”
Computational and experimental work shows that random sequences have a significant amount of secondary structure (alpha helices and beta sheets). Many de novo genes may come from intergenic ORFs, which show evidence of not being entirely random.
- “Overlap with a Conserved Gene Lowers the Barrier to Expression”
Many de novo genes overlap a conserved gene, which allows for their expression, and some are regulated by the protein product of the gene they overlap. Expression may also be less of a hurdle than previously thought, because random sequences have a surprisingly high probability of basal promoter activity.
- “Noncoding function lowers the barrier to coding expression and function”
Instances of loci that encode both a functional protein and a functional RNA, in which the protein is recent but the locus is more deeply conserved, suggest that the RNA came first and the protein emerged later. In addition, non-genic ORFs with no function that sit inside transcripts are translated at low levels, so selection likely keeps them from causing harm when they are translated, which would bias their sequences.
- “New Proteins Inherit Older Noncoding functions”
The protein and RNA often have similar functions, suggesting that the protein inherited the function of its host transcript. This could happen for three reasons: (1) because they are at the same locus, they could share features that affect function, such as expression timing, cellular localization and interaction partners; (2) producing the new protein could reduce the RNA available for its function, creating selective pressure for the protein to compensate; and (3) as with ‘supergenes’, genes at the same locus avoid recombination, which can lead to the evolution of positive epistasis from their actions in the same pathway. A protein born atop an existing RNA therefore evolves a function more easily and is predisposed toward a particular function, although over time it could evolve novel roles.
Limited Trials Violations support the idea that evolution samples a large enough portion of sequence space to encounter the fraction of sequences that have beneficial biological effects.
- “Pervasive Transcription and translation offer many opportunities for de Novo birth”
Technologies like RNA-seq and ribosome profiling suggest that there may be far more “trialing” than previously believed, because many non-coding sequences may end up being transcribed and translated at low levels.
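As a rough illustration of how many candidate ORFs are available for such trialing, the sketch below scans a random DNA sequence for open reading frames on the forward strand. It assumes uniform base composition and a simple ATG-to-stop definition with an arbitrary minimum length. These are simplifications introduced here and not the methods used in the studies the review discusses, and the function name and parameters are hypothetical.

```python
import random

STOP_CODONS = {"TAA", "TAG", "TGA"}

def find_orfs(seq: str, min_codons: int = 30) -> list[tuple[int, int]]:
    """Forward-strand ORFs as (start, end) pairs, running from the first ATG in a
    frame to the next in-frame stop, with at least `min_codons` codons before the stop."""
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i                      # open a candidate ORF
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))  # end index includes the stop codon
                start = None                   # look for the next ATG in this frame
    return orfs

rng = random.Random(1)  # fixed seed so the run is repeatable
random_dna = "".join(rng.choices("ACGT", k=100_000))
print("ORFs of at least 30 codons in 100 kb of random DNA:", len(find_orfs(random_dna)))
```

Real intergenic sequence is not uniform random DNA, so the count printed here is only a baseline; whether such ORFs are actually transcribed and translated is what RNA-seq and ribosome profiling measure.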
Weisman proposes that future work should experimentally characterize the properties of random amino acid sequences and, more importantly, identify true de novo genes and characterize their origins and functions. Weisman also challenges the notion that genetic novelty drives functional novelty.
Understanding the factors that affect the frequency of de novo gene birth, as well as the factors that affect the functions and expression patterns of de novo genes, is a key part of uncovering the evolutionary history of our genomes. Through de novo gene research, we can better understand how selection, mutation, and biological mechanisms shape the development of new genes, functional novelty, pathway complexity, genetic divergence, and speciation. We gain a deeper understanding of and appreciation for how and why genomes operate.
Jacob F (1977) Evolution and tinkering. Science, 196(4295):1161–1166.
Weisman CM (2022) The Origins and Functions of De Novo Genes: Against All Odds? Journal of Molecular Evolution, 90:244–257. https://doi.org/10.1007/s00239-022-10055-3