This paper did not start as a theoretical exercise. Rather, it is a by-product of our long-term project on the characterization of partially conserved gene neighborhoods in archaeal and bacterial genomes. This type of analysis is extremely important for extracting maximum information from microbial genome sequences and for identifying new functional systems. Many years ago, a computational approach that we developed for genomic neighborhood analysis was pivotal for the identification of what has become known as CRISPR-Cas systems. Nowadays, such analysis has become computationally challenging due to the rapidly increasing amount of genome sequence data, necessitating the development of new, simplified computational methods. Yuri Wolf, the first author on this paper, undertook a major effort to develop and apply such methods using the current collection of archaeal genomes as the model data set. We chose archaea over bacteria because we have worked with the genomes of these organisms for years and know them rather well but, most importantly, because we have developed a robust set of orthologous archaeal gene clusters. Reliable identification of orthologs is essential for comparative genomics – otherwise, it is all garbage in, garbage out.
In the course of this work, Yuri was looking at the dependency of gene order conservation (measured in the simplest imaginable way, as the number of identical gene pairs) on the phylogenetic distance between archaea and compared this curve with the corresponding curve for individual genes. The two curves were strikingly similar and showed a peculiar shape: an extremely rapid initial drop in inter-genomic similarity followed by a much slower decay.
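To make the "number of identical gene pairs" measure concrete, here is a minimal sketch of how one might count it. It assumes that each genome is given as a linear list of ortholog cluster identifiers, that gene orientation is ignored, and that a pair means two genes adjacent to each other; the function names and the toy genomes are hypothetical and are not taken from the paper.

```python
from typing import FrozenSet, List, Set


def adjacent_pairs(gene_order: List[str]) -> Set[FrozenSet[str]]:
    """Return the set of unordered pairs of neighboring genes.

    Assumption: the genome is a linear list of ortholog cluster IDs,
    and orientation is ignored, so (a, b) and (b, a) are the same pair.
    """
    return {
        frozenset((a, b))
        for a, b in zip(gene_order, gene_order[1:])
        if a != b  # skip tandem duplicates of the same cluster
    }


def shared_pair_count(genome_1: List[str], genome_2: List[str]) -> int:
    """Count adjacent gene pairs that are present in both genomes."""
    return len(adjacent_pairs(genome_1) & adjacent_pairs(genome_2))


# Toy, made-up genomes for illustration only.
genome_a = ["cog1", "cog2", "cog3", "cog4", "cog5"]
genome_b = ["cog2", "cog1", "cog5", "cog3", "cog4"]

# The shared neighbors are {cog1, cog2} and {cog3, cog4}.
print(shared_pair_count(genome_a, genome_b))
```

Plotting such a count (suitably normalized) against the phylogenetic distance for many genome pairs would give a gene order conservation curve of the kind discussed here.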
The rapid divergence phase involved about 10% (or a little more) of the genes, which was suspiciously close to a number well known in microbial genomics, namely, the fraction of the so-called ORFans, i.e. genes without detectable homologs outside extremely closely related strains (ORFan is a pun of sorts, coming from Open Reading Frame [ORF] and Orphan). The persistence of ORFans, the fraction of which does not significantly drop despite the rapidly progressing genome sequencing, is a major puzzle of microbial genomics.
We were impressed by these biphasic curves and decided to investigate them theoretically by constructing a mathematical model in which the central variables were the gene replacement and gene shuffling rates. The central role of gene replacement comes from the by now well-recognized fact that microbial genomes are shaped primarily by horizontal gene transfer. The results of this modeling were shockingly unexpected to us. We could not come up with any smooth distribution of the replacement and shuffling rates that would yield a good fit to the data. Such a fit was possible only if we introduced a class of genes with an ‘infinite’ replacement rate!
Clearly, nothing is infinite in the strict mathematical sense on our very finite planet. However, our results mean that the genes in the rapidly changing gene class – which we believe consists of the notorious ORFans – are replaced by new genes so rapidly that two diverging genomes differ by some 12% of the genes ‘immediately’ after divergence. Thus, the major lesson here is that ORFans really evolve under different ‘laws’ (if there is such a thing in evolution). Although they only comprise a little more than 10% of an average microbial genome, collectively, they represent the vast majority of ‘matter’ in the genomic universe. And, as in the actual universe of modern cosmology, that matter is dark. We can now also estimate the size of that universe, contingent on the estimated number of microbial species. Even under the most conservative estimates, that size is vast: over a billion distinct genes! We are in no danger of running out of genetic diversity…
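The biphasic shape that follows from a class of genes with an ‘infinite’ replacement rate can be illustrated with a toy calculation. The sketch below is not the model from the paper: it assumes that a fixed fraction of genes (set to 0.12 here, after the figure quoted above) is lost from the shared set at any non-zero distance, and that the remaining genes decay as a single exponential with a made-up rate. The function name, the parameter names and the rate value are all hypothetical.

```python
import math


def expected_similarity(
    distance: float,
    fast_fraction: float = 0.12,  # share of 'instantly' replaced genes (from the text)
    slow_rate: float = 1.0,       # hypothetical decay rate of the slow class
) -> float:
    """Toy two-class decay of the fraction of genes shared by two genomes.

    Assumption: the fast class contributes nothing at any distance > 0,
    and the slow class decays as a single exponential. The real data
    would call for a distribution of rates rather than one value.
    """
    if distance < 0:
        raise ValueError("distance must be non-negative")
    if distance == 0:
        return 1.0  # a genome is identical to itself
    return (1.0 - fast_fraction) * math.exp(-slow_rate * distance)


# Print a few points: an immediate drop, then a slow decay.
for d in (0.0, 0.001, 0.1, 0.5, 1.0):
    print(f"distance={d:5.3f}  similarity={expected_similarity(d):.3f}")
```

In this toy version the similarity jumps from 1 to just under 0.88 at the smallest non-zero distance and then declines smoothly, which is the qualitative pattern of a rapid initial drop followed by a much slower decay.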
We also learned something interesting and important about gene shuffling. Although the evolutionary decay of gene order could not be explained by replacement of individual genes alone, a model in which the replacement and shuffling rates of a gene were proportional to each other yielded the best fit to the data. Thus, the same properties of a gene seem to determine the conservation – or lack thereof – of its presence and position in genomes. We still do not know for sure what those properties are but have reasons to believe they are related to biological importance.
It is surely highly satisfying when a (very) simple theoretical model yields results that fit the data well and yet are unexpected and biologically interesting. However, we do realize that this is just the beginning: the actual structure and dynamics of the microbial genomic universe, and the forces that act in it (dark and otherwise), remain to be studied, with much more data and more sophisticated models.