Where we started
The “viral” interest in virus genome binning started with the question: how well would our deep-learning-based binner perform on other ecological domains in the human gut? I asked this question together with Jakob Nissen and Simon Rasmussen, the first and last authors on the Vamb paper (Nissen et al. 2021).
Back in 2019, we were curious whether Vamb performed well enough that a researcher could process a huge metagenomic dataset (>=1000 samples) and retrieve not only high-quality bacterial genomes but also other entities such as bacteriophages. If this was possible, additional domains of the gut microbiome could be unlocked for downstream microbiome analysis. Initially, we probed how well our Vamb bins, each a set of contigs, resembled virus genomes by BLASTing them against the NCBI virus database, composed of roughly 3000-8000 reference genomes, most of them eukaryotic viruses. These initial efforts revealed very few viral bins but very high taxonomic consistency. Whenever one contig of a Vamb bin mapped to, e.g., a Bacillus phage genome or a bell pepper phage genome (yes, this is a real example), the remaining contigs did too, which gave us further incentive to benchmark and explore this in depth.
When the pieces came together
Together with our colleagues at the Copenhagen Prospective Studies of Asthma in Childhood (COPSAC), we started to benchmark Vamb's performance on viral genomes using 662 paired bulk metagenomic samples and viral-like particle (VLP) samples, which was the biggest dataset of its kind at the time. The COPSAC team established a gold-standard (truth) set of viral contigs discovered in the VLP dataset. With this gold standard available, we benchmarked how many of these viruses could be recovered in Vamb bins from the bulk metagenomic samples. We were surprised to find that thousands of bins resembled gold-standard viruses and that a large portion of the gold standard could be retrieved from the bulk metagenomic data. Furthermore, we found that the contigs of each bin mapped consistently to the same virus genome, and that bins typically contained few unrelated contigs. To make the identification of viral Vamb bins more accessible and less time-consuming on huge bulk metagenomic datasets, we trained a random forest (RF) model on viral protein families and single-contig prediction scores to identify putative viral Vamb bins. The great thing about dealing with bins of multiple contigs is that a majority vote or consensus score can be derived to gain higher confidence in a given bin's viral "likeness". If a single viral contig did not achieve a high prediction score, the whole bin was not thrown in the trash as a result.
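To make the majority-vote idea concrete, here is a minimal sketch of how per-contig prediction scores could be pooled per bin. This is not the RF model from the paper: the function name `bin_consensus`, the 0.5 threshold and the toy scores are assumptions chosen purely for illustration.

```python
from collections import defaultdict
from statistics import median


def bin_consensus(contig_scores, contig_to_bin, threshold=0.5):
    """Pool per-contig viral scores into per-bin summaries.

    Returns {bin_id: (median_score, fraction_of_contigs_at_or_above_threshold)}.
    The threshold is a hypothetical cutoff, not a value from the paper.
    """
    per_bin = defaultdict(list)
    for contig, score in contig_scores.items():
        per_bin[contig_to_bin[contig]].append(score)

    summary = {}
    for bin_id, scores in per_bin.items():
        viral_fraction = sum(s >= threshold for s in scores) / len(scores)
        summary[bin_id] = (median(scores), viral_fraction)
    return summary


# Toy data: binA has one weakly scoring contig but is still mostly viral.
scores = {"c1": 0.9, "c2": 0.8, "c3": 0.3, "c4": 0.1, "c5": 0.2}
bins = {"c1": "binA", "c2": "binA", "c3": "binA", "c4": "binB", "c5": "binB"}
summary = bin_consensus(scores, bins)

for bin_id, (med, frac) in sorted(summary.items()):
    call = "putative viral" if frac > 0.5 else "not viral"
    print(f"{bin_id}: median={med:.2f} viral_fraction={frac:.2f} -> {call}")
```

In this toy case binA is kept as putatively viral even though contig c3 scores poorly on its own, which is exactly the benefit of judging the bin rather than each contig.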
At the time, we lacked an external tool for mass validation of viral Vamb bins. Fortunately, CheckV was posted on bioRxiv not long after by Nayfach et al., which added whole new facets to the benchmark and quality control. We could then group the Vamb virus bins, and the viruses used as our gold standard, into different tiers of genome quality and completeness. Most importantly, CheckV allowed us to evaluate the RF-predicted bins at scale and narrow them down to a final subset of bona fide viruses. In essence, we could now establish the metavirome directly from bulk metagenomics.
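As a rough illustration of this tiering, the sketch below groups bins by estimated completeness. The tier names and the 90% and 50% boundaries are assumptions modelled on commonly used completeness bands, and the bin names and values are invented toy data, not results from the paper or actual CheckV output.

```python
from collections import Counter


def quality_tier(completeness):
    """Map an estimated completeness (percent, or None) to a quality tier.

    The boundaries below are assumed for illustration only.
    """
    if completeness is None:
        return "Not-determined"
    if completeness >= 90:
        return "High-quality"
    if completeness >= 50:
        return "Medium-quality"
    return "Low-quality"


# Hypothetical bins with estimated completeness in percent.
bin_completeness = {
    "bin_1": 97.5,
    "bin_2": 63.0,
    "bin_3": 12.4,
    "bin_4": None,
    "bin_5": 91.0,
}

tiers = {bin_id: quality_tier(c) for bin_id, c in bin_completeness.items()}
tier_counts = Counter(tiers.values())

for tier, count in tier_counts.most_common():
    print(f"{tier}: {count}")
```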
Large scale binning of viruses and MAGs
To evaluate the method's utility, we applied it to a massive public metagenomic dataset, the Human Microbiome Project 2 (HMP2), for which no virome characterisation had previously been described. From HMP2 we mined thousands of high-quality (HQ) viruses and bacterial MAGs via binning, which could be used for further analysis of the interplay between bacteria and viruses during an agitated state such as inflammation and severe dysbiosis. Here we identified 250 temperate viruses that expanded with increasing dysbiosis, suggesting an inflammation-driven prophage induction that could be aggravating the inflammatory state even further.
Furthermore, in all the benchmarks of the original Vamb paper, Vamb was superior not only for bacterial genome binning but also for separating highly similar strains from each other, even at 98–99.5% average nucleotide identity, thus elegantly dealing with complex biological diversity. Evidently, this was also the case for viral genomes such as crAssphage, a prevalent and abundant virus in the human gut. By overlaying our Vamb cluster labels on a crAssphage phylogenetic tree from the HMP2 dataset, we observed clear monophyletic clades of genomes corresponding to real diversity.
Viral binning is not a perfect computational process, and it results in many bins of fragmented or incomplete viruses, or of other mobile genetic elements such as plasmids that might be mistaken for a virus. Even though Vamb-driven binning is not 100% accurate, it is pretty darn good at binning viruses simultaneously with other entities such as bacteria. We think that the ultimate binner should be judged on its capacity to handle thousands of contigs from unrelated organisms at the same time, and here Vamb does a really great job.
To prevent the inclusion of false-positive viruses in downstream analysis, we have gone to great lengths to describe ways to handle the output of, e.g., CheckV, as well as cutoffs for filtering away contaminated viral bins. Post-processing is a vital element of any binning approach in metagenomics. We believe that careful validation cannot be ignored, and we do not recommend downstream analysis of metagenomic datasets without first evaluating and classifying Vamb- and PHAMB-derived bins into confident biological units with dedicated tools. We hope that future evaluation tools can help categorise the many viral-like bins that we also outlined in the manuscript and extend downstream analysis beyond known viral diversity.
With these considerations in mind, we believe that our manuscript has sufficiently outlined the immense value of viral binning and the way it provides a greater foundation for future metagenomic analysis with a focus on bacterial and viral ecology.
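A post-processing filter of this kind could look roughly like the sketch below. The record fields, the function `keep_bin` and both cutoffs are hypothetical placeholders for illustration; the cutoffs actually recommended are the ones described in the manuscript, and the values here should not be read as those.

```python
from dataclasses import dataclass


@dataclass
class BinQC:
    """Per-bin quality summary (hypothetical fields for illustration)."""
    bin_id: str
    completeness: float  # estimated completeness in percent
    viral_genes: int     # genes annotated as viral
    host_genes: int      # genes annotated as host (bacterial)


def keep_bin(qc, min_completeness=50.0, max_host_to_viral=1.0):
    """Keep a bin only if it is complete enough and not dominated by host genes.

    Both cutoffs are assumed example values, not taken from the paper.
    """
    if qc.viral_genes == 0:
        return False  # no viral signal at all
    host_ratio = qc.host_genes / qc.viral_genes
    return qc.completeness >= min_completeness and host_ratio <= max_host_to_viral


candidates = [
    BinQC("bin_1", completeness=95.0, viral_genes=40, host_genes=2),
    BinQC("bin_2", completeness=80.0, viral_genes=5, host_genes=30),   # likely contaminated
    BinQC("bin_3", completeness=20.0, viral_genes=10, host_genes=0),   # too fragmented
    BinQC("bin_4", completeness=70.0, viral_genes=0, host_genes=12),   # no viral genes
]

kept = [qc.bin_id for qc in candidates if keep_bin(qc)]
print(kept)
```

Keeping the cutoffs as function parameters makes it easy to tighten or relax the filter per dataset rather than hard-coding a single rule.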
Check out the paper here: https://www.nature.com/articles/s41467-022-28581-5
If you want to check out the HQ virus genomes uncovered from bulk metagenomic data such as HMP2, they can be downloaded here: https://zenodo.org/record/6200656#.YhN5XJPMIeY
Joachim Johansen, PhD fellow, The Novo Nordisk Foundation Center for Protein Research (NNFCPR), Faculty of Health and Medical Sciences, University of Copenhagen, Denmark