The vast amounts of DNA and RNA data being produced by sequencing projects around the world represent an incredible opportunity for making scientific discoveries. Hiding in these data sets are millions of virus sequences, including novel viruses which have not been isolated and virus genomic fossils found in the genomes of their host organisms. The increasing size of these nucleotide data sets (already in the tens of trillions of bases) presents an enormous challenge for data mining and analysis. How can we efficiently query such large data sets and obtain meaningful biological insights?
Studying genomic fossils of viruses
We were especially interested in the possibility of finding novel genomic virus fossils in the genomes of animals with backbones (vertebrates). The scientific term for these virus genomic fossils is endogenous viral elements, 'EVEs' for short. There are many EVEs in the genomes of animal species, and they can give us valuable information about the types of viruses which have infected animals in the past and how they have evolved. Just like the geological fossil record, the genomic fossil record of viruses is incomplete, biased and fragmentary, but it provides essential information about viruses and the hosts that they infect.
Bringing in the cloud
In 2022, Camacho and collaborators released an implementation of the BLAST algorithm which could benefit from the distributed computing infrastructure made available by the cloud. BLAST, an acronym for Basic Local Alignment Search Tool, is one of the algorithms most used in bioinformatics and most cited in the scientific literature (the original paper has been cited more than 77,200 times). BLAST is used to detect significant similarity between a sequence of interest and the sequences in a target database, which can in turn provide additional information about the sequence of interest. After some preliminary analyses, one of which resulted in the discovery of novel papillomaviruses infecting critically endangered pangolins, we realised that the cloud BLAST implementation (ElasticBLAST) would be a powerful tool to look for novel EVEs in vertebrate genomes using tens of thousands of virus protein queries.
And voilà!
In a matter of weeks, we were able to mine all available representative vertebrate genomes using 24,478 virus protein sequences. This returned 196,899 hits (significant matches) to our set of viral proteins. In a second round, we compared these hits against another database (non-redundant proteins) to obtain more information about them. In particular, we developed an approach based on the taxonomic labels of the sequence annotations to zoom in on the viral sequences with the highest chance of being true EVEs. We estimated that this approach had a true positive rate (sensitivity) of 71.3% and a true negative rate (specificity) of 97.1% for shortlisting EVEs. We then manually curated the shortlisted sequences, resulting in a final set of 2,040 validated EVEs in 295 host genomes.
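To make the two performance figures above concrete, here is a minimal Python sketch of how sensitivity and specificity are computed when an automatic shortlist is compared against manual curation. The function name, the boolean labels and the toy values are hypothetical and chosen only for illustration; they are not the code or the data from our study.

```python
from typing import Iterable, Tuple


def sensitivity_specificity(
    shortlisted: Iterable[bool], true_eve: Iterable[bool]
) -> Tuple[float, float]:
    """Return (sensitivity, specificity) for a shortlisting step.

    shortlisted: whether the automatic step kept each hit.
    true_eve:    whether curation confirmed each hit as an EVE.
    """
    tp = fp = tn = fn = 0
    for kept, truth in zip(shortlisted, true_eve):
        if kept and truth:
            tp += 1  # true EVE that was shortlisted
        elif kept and not truth:
            fp += 1  # non-EVE that was shortlisted
        elif not kept and truth:
            fn += 1  # true EVE that was missed
        else:
            tn += 1  # non-EVE that was correctly discarded
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("need at least one true EVE and one non-EVE")
    # Sensitivity = TP / (TP + FN); specificity = TN / (TN + FP)
    return tp / (tp + fn), tn / (tn + fp)


# Toy example with hypothetical labels (not data from the study)
kept = [True, True, False, True, False, False, False, False]
truth = [True, True, True, False, False, False, False, False]
sens, spec = sensitivity_specificity(kept, truth)
print(f"sensitivity={sens:.2f}, specificity={spec:.2f}")
```

A high specificity matters most at this scale: with close to two hundred thousand hits, even a small false positive rate would add many sequences to the manual curation step.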
Surprising discoveries
Among the 2,040 EVEs identified, there were virus sequences from four viral families (Chuviridae, Paramyxoviridae, Benyviridae and Nairoviridae) and a genus of flaviviruses (Hepacivirus) which had not been found in the genomes of vertebrates previously: we had discovered novel virus fossils! We identified chuvirus EVEs in fish, amphibians, non-avian reptiles and mammals, strongly indicating that chuviruses can infect a wide diversity of vertebrates. We found paramyxovirus EVEs in the genomes of a number of fish species, including a model species (zebrafish). Our analysis reveals a larger diversity of fish paramyxoviruses than previously recognised, which will require further characterisation. Surprisingly, we found benyvirus EVEs in the genomes of sharks, non-avian reptiles, legless amphibians (caecilians) and lungfish. Benyviruses, which infect plants and fungi, have been found associated with a few insects. Our results show that members of this family of plant/fungal viruses can also infect vertebrates, expanding its host range to a new kingdom (Metazoa).
Nairoviruses include highly pathogenic viruses of humans and other animals, such as Crimean-Congo Haemorrhagic Fever virus (CCHFV). We found a single nairovirus EVE in the genome of the Etruscan shrew (Suncus etruscus), one of the smallest species of mammal. This sequence was found to be the closest nairovirus EVE to the group of CCHFV, with other more distant relatives found in tick genomes (ticks are vectors of nairoviruses). This finding highlights the potential for shrews to act as a reservoir for nairoviruses, which should be further investigated. We also discovered a hepacivirus EVE in the genomes of murine rodents (mice and rats), resembling a single hepacivirus sequence discovered in the Ethiopian white-footed mouse (Stenocephalemys albipes). The two sequences are 75% identical and share a series of conserved amino acids in their C-terminal end which are present in multiple rodent hepaciviruses. Thus far, viruses related to hepatitis C virus have been notably absent from the genomic fossil record. This finding underlines the importance of rodents as hosts of hepaciviruses for at least 11.7-14.2 million years.
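For readers unfamiliar with how a figure such as "75% identical" is obtained, the following Python sketch computes percent identity from two already-aligned protein sequences. It assumes one common convention (counting only columns where neither sequence has a gap); other definitions use a different denominator, and this is not necessarily the one used in our analysis. The function name and the toy sequences are hypothetical.

```python
def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percent identity over aligned columns without gaps ('-')."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    compared = 0
    matches = 0
    for a, b in zip(aligned_a, aligned_b):
        if a == "-" or b == "-":
            continue  # skip columns where either sequence has a gap
        compared += 1
        if a == b:
            matches += 1
    if compared == 0:
        raise ValueError("no ungapped columns to compare")
    return 100.0 * matches / compared


# Toy aligned sequences (hypothetical, not the hepacivirus sequences)
seq_a = "MKVELTAG-"
seq_b = "MRVELTSGA"
identity = percent_identity(seq_a, seq_b)
print(f"{identity:.1f}% identical")
```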
Origin of the filovirus and arenavirus ectodomains
Some filoviruses such as Ebola virus and Marburg virus, and reptile-infecting arenaviruses (reptarenaviruses) have surface glycoproteins with a region that resembles the immunosuppressive ectodomain of retroviruses. We found hits closely related to the ectodomains of filoviruses and reptarenaviruses in the genomes of sharks and rays (cartilaginous fish), reptiles such as the Komodo dragon, and in tarsiers. Upon closer inspection, we realised that the hits to animal genomes were surrounded by additional retroviral genes and long terminal repeats: they were hits to endogenous retroviruses. By placing these sequences in an evolutionary tree, we found evidence for three independent captures of retroviral ectodomains (two by filoviruses, one by arenaviruses) over the course of hundreds of millions of years!
Figure: Structural models of the ectodomain trimers of filoviruses, reptarenaviruses, and retroviruses found in vertebrate genomes. Models were inferred using AlphaFold.
Conclusion
Our work revealed several surprising insights: we detected links to host reservoirs for CCHFV-like and hepatitis C-like viruses, discovered vertebrate-infecting relatives of viruses which infect plants and fungi, found evidence for chuvirus infections in a broad range of vertebrates, and resolved the connection between the filovirus/reptarenavirus ectodomains and those of retroviruses.
By leveraging cloud technologies and algorithms, we were able to expand the known genomic fossil record of viruses and obtain key insights into their ecology and evolution. Adopting novel computing technologies in our work will be essential for revealing the secrets of the virus world and life on our planet.
Acknowledgements
This work was funded by European Research Council grant no. 101001623-PALVIREVOL to Aris Katzourakis. We would like to thank Google Cloud for providing support in computing credits to Jose Gabriel Nino Barreat (EDU credit 212888085). This work was conducted with support from the Advanced Research Computing (ARC) service at the University of Oxford. The funders had no role in study design, data collection/analysis, decision to publish or preparation of the manuscript.