Motivation
For the last decade, high-accuracy short reads have been the gold standard in studies that search for disease-gene associations or catalog genetic variation. Short reads are cost-effective enough to run on thousands of samples, but they are poorly suited to resolving repetitive sequences longer than the reads themselves (typically ~150 base pairs). As a result, short reads miss important, potentially disease-relevant sources of genetic variation such as segmental duplications, tandem repeat expansions, and large structural variants. These regions may be critical to study in the context of the many complex, polygenic disorders whose genetic basis remains unresolved, such as Alzheimer’s disease and related dementias (ADRD).
Long reads (10,000 to 1,000,000 base pairs) can resolve complex variation that is inaccessible to short reads, but until now long-read sequencing has been too expensive or too inaccurate to replace short reads completely. We hope to usher in a new era of long-read sequencing in population-scale genomic studies by presenting a modular and scalable sequencing protocol and computational pipeline, NAPU (Nanopore Analysis Pipeline; https://github.com/nanoporegenomics/napu_wf) (Kolmogorov et al. 2023). NAPU produces high-quality phased single nucleotide variant (SNV), insertion-deletion (indel), and structural variant (SV) calls, haplotype-resolved de novo assemblies, and methylation calls, all from a single Oxford Nanopore Technologies (ONT) PromethION flow cell.
Sequencing Protocol
Our colleagues at the NIH Center for Alzheimer’s and Related Dementias (CARD, https://card.nih.gov/) fine-tuned a specialized ONT sequencing protocol to maximize read length and yield, achieving over 100 Gb of sequence (>30x coverage) and an average read N50 of 30 kb from a single flow cell (Billingsley et al. 2022; Baker et al. 2023). Classically, there is a trade-off between yield and read length, but this protocol strikes a balance between the two, which is essential for producing a high-quality assembly and reliable variant calls. At CARD, a single batch of DNA processing can extract sheared DNA for up to 16 samples at once in around 20 hours, and sequencing on the PromethION runs for three days.
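To make the two headline numbers concrete, here is a minimal Python sketch of how read N50 and depth of coverage are usually computed. It is an illustration only, not part of the CARD protocol or of NAPU. The function names are ours, the read lengths are toy values, and the genome size of 3.1 Gb is an assumed round figure for a haploid human genome.

```python
def read_n50(lengths):
    """Return the N50: the length L such that reads of length >= L
    together contain at least half of all sequenced bases."""
    if not lengths:
        raise ValueError("need at least one read length")
    ordered = sorted(lengths, reverse=True)
    half_of_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_of_total:
            return length


def depth_of_coverage(total_bases, genome_size=3.1e9):
    """Average coverage = sequenced bases / genome size.
    The 3.1 Gb default is an assumption, not a value from the paper."""
    return total_bases / genome_size


# Toy read lengths in base pairs (hypothetical, for illustration only).
toy_reads = [50_000, 40_000, 30_000, 20_000, 10_000]

if __name__ == "__main__":
    print("toy N50:", read_n50(toy_reads))
    print("coverage at 100 Gb:", round(depth_of_coverage(100e9), 1))
```

Under the assumed genome size, 100 Gb of sequence works out to a little over 32x, which is consistent with the ">30x coverage" quoted above.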
Nanopore Analysis Pipeline (NAPU)
The NAPU pipeline begins by running Shasta (Shafin et al. 2020), a graph-based de novo assembler, to generate an initial haploid assembly, which is then phased into a diploid assembly with Hapdup, a new tool we developed. Hapdup first aligns the original ONT reads back to the Shasta assembly, then uses PEPPER-Margin-DeepVariant (Shafin et al. 2021) to phase heterozygous SNVs and separate the reads into two haplotypes. It then runs the Flye polisher on the two sets of haplo-tagged reads to reconstruct heterozygous variants, and uses local realignment to rescue collapsed SVs. The resulting assemblies are highly accurate (QV ~34) and contiguous, with NG50s of ~25 Mbp relative to the CHM13 reference (Nurk et al. 2022). However, they fall short of reliably assembling highly repetitive regions such as centromeres and segmental duplications, because the read N50 is lower than that of ultra-long libraries.
NAPU produces small variant calls with PEPPER-Margin-DeepVariant from the ONT reads aligned to a reference genome. Our calls show substantial improvement over Illumina-based SNV calling in regions of structural variation and poor short-read mappability, highlighting the benefits of long reads. We also introduce a new method, Hapdiff, which calls SVs from the de novo assemblies, and demonstrate that it produces more accurate calls than read-based SV calling approaches. The NAPU pipeline uses a new mode of the tool Margin to jointly phase the small and structural variant calls, producing a single harmonized VCF file. Additionally, we leverage the ability of ONT sequencing to detect methylation status to produce phased methylation BED files using ONT’s modbam2bed. The methylation calls NAPU produces are highly concordant with bisulfite experiments, and offer the unprecedented opportunity to study epigenetic variation at haplotype resolution.
NAPU is a WDL workflow, publicly available on Dockstore, designed with several options for parallelism and with variable inputs and outputs, including workflows for both R9 and R10 chemistries. In our paper, we describe how we ran the NAPU workflow on all samples in a Terra workspace, which made the data easily accessible to collaborators while keeping patient data secure. The WDL workflows can also be run locally for convenient reproducibility and development.
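For readers less familiar with the assembly metrics quoted here, the following Python sketch shows the standard definitions of QV and NG50. It is not code from NAPU or Hapdup. The function names and the toy contig lengths are ours, and it assumes QV follows the usual Phred scale.

```python
import math


def qv_to_error_rate(qv):
    """Phred-scaled quality: QV = -10 * log10(per-base error rate)."""
    return 10 ** (-qv / 10)


def error_rate_to_qv(error_rate):
    """Inverse of qv_to_error_rate."""
    return -10 * math.log10(error_rate)


def ng50(contig_lengths, genome_size):
    """Return the NG50: the contig length L such that contigs of length >= L
    cover at least half of the reference genome size (not half of the
    assembly size, which would give the N50)."""
    half_of_genome = genome_size / 2
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if running >= half_of_genome:
            return length
    return 0  # the assembly covers less than half of the genome


# Toy contig lengths and genome size (hypothetical units, illustration only).
toy_contigs = [40, 30, 20, 10]

if __name__ == "__main__":
    rate = qv_to_error_rate(34)
    print(f"QV 34 is about one error per {1 / rate:.0f} bases")
    print("toy NG50:", ng50(toy_contigs, genome_size=160))
```

On this scale, QV ~34 corresponds to roughly one error per 2,500 assembled bases. Measuring NG50 against a fixed genome size is what makes the value comparable across assemblies of different total length.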
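As a rough illustration of what jointly phased calls look like in practice, the Python sketch below counts phased and unphased heterozygous genotypes in VCF-style records. It relies only on the VCF convention that a `|` separator in the GT field marks a phased genotype and that the PS field names the phase set. The records are invented toy data, not NAPU output, and the function names are ours. A real analysis would more likely use a dedicated VCF library.

```python
def parse_genotype(format_field, sample_field):
    """Split one sample column into its GT alleles, phasing flag, and PS tag."""
    fields = dict(zip(format_field.split(":"), sample_field.split(":")))
    gt = fields.get("GT", ".")
    return {
        "alleles": gt.replace("|", "/").split("/"),
        "phased": "|" in gt,
        "phase_set": fields.get("PS"),
    }


def summarize(vcf_lines):
    """Count phased heterozygous, unphased heterozygous, and all other calls
    for the first sample column of a single-sample VCF."""
    counts = {"phased_het": 0, "unphased_het": 0, "other": 0}
    for line in vcf_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and header lines
        columns = line.rstrip("\n").split("\t")
        call = parse_genotype(columns[8], columns[9])
        alleles = call["alleles"]
        is_het = (
            len(alleles) == 2 and "." not in alleles and alleles[0] != alleles[1]
        )
        if is_het and call["phased"]:
            counts["phased_het"] += 1
        elif is_het:
            counts["unphased_het"] += 1
        else:
            counts["other"] += 1
    return counts


# Invented toy records: a phased SNV and a phased insertion sharing one
# phase set, an unphased heterozygous SNV, and a homozygous SNV.
TOY_VCF = [
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tSAMPLE",
    "\t".join(["chr1", "1000", ".", "A", "G", "30", "PASS", ".", "GT:PS", "0|1:1000"]),
    "\t".join(["chr1", "5000", ".", "N", "<INS>", "30", "PASS", "SVTYPE=INS", "GT:PS", "1|0:1000"]),
    "\t".join(["chr1", "9000", ".", "C", "T", "30", "PASS", ".", "GT", "0/1"]),
    "\t".join(["chr1", "9500", ".", "G", "A", "30", "PASS", ".", "GT", "1/1"]),
]

if __name__ == "__main__":
    print(summarize(TOY_VCF))
```

In the toy data, the SNV at position 1000 and the insertion at position 5000 share phase set 1000 but carry their alternate alleles on opposite haplotypes. Placing small variants and SVs on a common set of haplotypes in this way is the kind of relationship that joint phasing is meant to capture.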
NIH CARD mission and application of NAPU
Our protocol and pipeline were designed with the goal of conducting large-scale long-read sequencing projects that are accessible and affordable. This work was a joint effort between researchers at multiple U.S. institutions, including UC Santa Cruz, National Cancer Institute (NCI), Johns Hopkins University, Baylor College of Medicine, National Human Genome Research Institute (NHGRI), Northeastern University, and NIH CARD.
The human brain samples sequenced for this article are part of CARD’s initiative to generate long-read sequencing data for thousands of brain samples from people with and without Alzheimer’s disease and related dementias (ADRD). The goal of this effort is to create a first-of-its-kind, publicly available long-read resource focused on ADRD, with the next cohort of hundreds of brain samples set to be completed in early 2024. We hope this will open up new opportunities to understand the genetic mechanisms underlying ADRD pathology, and ultimately contribute to the discovery of new therapeutic targets for these diseases.
References
Cover illustration by Cassandra Tyson
Baker, B. et al. Processing human frontal cortex brain tissue for population-scale SQK-LSK114 Oxford Nanopore long-read DNA sequencing SOP v1. (2023) doi:10.17504/protocols.io.kxygx3zzog8j/v1
Billingsley, K. J. et al. Processing human frontal cortex brain tissue for population-scale Oxford Nanopore long-read DNA sequencing SOP. (2022) doi:10.17504/protocols.io.kxygxzmmov8j/v2
Kolmogorov, M. et al. Scalable Nanopore sequencing of human genomes provides a comprehensive view of haplotype-resolved variation and methylation. Nat Methods (2023) doi:10.1038/s41592-023-01993-x.
Nurk, S. et al. The complete sequence of a human genome. Science 376, 44-53 (2022) doi:10.1126/science.abj6987
Shafin, K. et al. Haplotype-aware variant calling with PEPPER-Margin-DeepVariant enables high accuracy in nanopore long-reads. Nat Methods 18, 1322-1332 (2021) doi:10.1038/s41592-021-01299-w
Shafin, K. et al. Nanopore sequencing and the Shasta toolkit enable efficient de novo assembly of eleven human genomes. Nat Biotechnol 38, 1044-1053 (2020) doi:10.1038/s41587-020-0503-6