Cells contain all the genetic information required for protein synthesis for a biological system to function, and the development of high-throughput DNA sequencing has been key in elucidating cellular processes and understanding how variation and mutations play a role in disease states, such as cancer. But it’s not just genetic sequences that control cellular fate and function – epigenetic modifications regulate gene expression in response to changes in behaviour or environment and play a key role in many biological processes such as development and ageing, as well as disease. Variation in an individual’s genetic sequence will be associated with variation in DNA methylation that can functionally determine predisposition for disease. Measuring both genetic and epigenetic variation in the context of the other is therefore important for elucidating dynamic biological processes, allowing a more comprehensive view of cellular function.
The challenge of combined genetic and epigenetic sequencing
However, capturing both genetic and epigenetic information simultaneously comes with significant challenges due to the limitations of current sequencing technology. Next generation sequencing technologies only capture four letters or information states in their readout, the canonical bases A, T, C, and G. Base conversion chemistries can be used to differentiate unmodified C from its most prevalent epigenetic variants, 5mC and 5hmC, but this means sacrificing one of the information states (typically the T state) and missing important genetic sequence information.
So, when looking to identify 5mC or 5hmC using base conversion chemistries like bisulphite sequencing, the unmodified C base is converted to U (and so is read as a T) therefore compromising the detection of C-to-T changes in the genetic sequence. C-to-T changes are by far the most common mutations in mammalian genomes and cancer. Base conversion also causes ambiguity in genomic alignment as 3-letter reads are mapped to a 4-letter genome, resulting in slower, more expensive, and less accurate read mapping. To date, sequencing both genetic and epigenetic information from a single sample has come with a significant time and cost burden, requiring separate sequencing workflows, increased sample requirement and inexact integration of the data, wherein phase information is lost.
5-Letter sequencing for improved biological characterisation
To overcome these challenges of combined genetic and epigenetic analysis, we developed 5-Letter seq, a whole genome methodology that can capture all four canonical bases as well as modified C in a single workflow that is compatible with any sequencing platform. Our platform consists of a pre-sequencing enzymatic workflow that uses standard molecular biology techniques to prepare the sample for sequencing, as well as decoding software to resolve the raw sequencing data. In this work, we used an Illumina NovaSeq NGS instrument, but other sequencing adapters can be substituted, meaning the workflow can be used with any sequencing platform capable of decoding at least four genetic bases.
Two-base coding system for combination sequencing
Standard single-based sequencing results in a four-state (or four-letter) readout – here, we use a two-base coding approach where the combination of two bases denotes the original DNA base, enabling up to 16 information states to be decoded.
To achieve this, already fragmented DNA or sonicated genomic DNA is prepared in a simple workflow prior to sequencing. The workflow begins with the ligation of sample DNA fragments to short, synthetic hairpin adapters at both ends. Each strand is then copied by DNA polymerase to form a construct with the original sample strand connected to its complementary strand via a synthetic hairpin. Sequencing adapters are ligated at each end and modified Cs enzymatically protected by oxidation followed by glycosylation – unmodified Cs are then deaminated to uracil by an APOBEC enzyme. A helicase enables access to a single-stranded DNA for the deaminating APOBEC enzyme activity.
The base converted DNA template can be amplified by PCR and is then ready for sequencing. Our bioinformatic decoding software then performs pairwise alignment of the original and complementary strands (read 1 and read 2 in a pair) and computationally resolves the sequencing data, producing a “permissible base pair” that corresponds to a single genetic or epigenetic letter. Any errors arising from sample prep, amplification, or incorrect base-calling during sequencing will occur independently to cognate bases on each strand and so will result in a “non-permissible base pair” providing inherent proof-reading capability. Non-permissable pairs are resolved as a N. The rest of the read is preserved and genomic alignment of resolved reads occurs in standard 4-base space. The epigenetic states are kept with the reads in the SAM tags.
We performed 5-Letter seq on a mixed B-lymphoblast cell line DNA (one of the genome-in-a-bottle samples NA12878) and generated both genetic and epigenetic data in a single sequence. Having all four genetic states in the read sequences allowed for standard genomic alignment, significantly reducing execution times compared to WGBS and EM-seq. Comparison of data quality to WGBS and EM-seq showed increased specificity and sensitivity for the detection of epigenetic marks, and higher accuracy of read mapping. Simultaneous detection of both genetics and epigenetics in cis on the same DNA molecule also allowed us to detect differential DNA methylation between alleles. Allele-specific methylation (ASM) and methylation quantitative trait loci (methQTLs) can be used to identify the regulatory sequence variation that underpin many diseases.
Detection in low input samples for liquid biopsy
The combination of both genetic and epigenetic information within a single DNA sample can provide key insights on the dynamic interactions that are occurring, offering significant advantages in many research areas such as cellular fate studies, stem cell differentiation, population genomics, and cancer biology. One important area of interest is liquid biopsy for tumour diagnosis and disease monitoring. Combining DNA methylation information with the genetic sequence in cell-free DNA (cfDNA) from blood has shown significantly increased sensitivity to detect tumour DNA. We applied 5-Letter seq to a cell-free DNA sample from a human cancer patient and achieved very high-quality sequencing data on a sample containing only 2ng DNA. This shows that our method can detect both genetic and epigenetic data from the same low input sample, and so could therefore overcome the challenges of working with valuable and low volume biological samples for diagnostics.
The 5hmC modification is a known marker of disease states, including early cancer detection. The two-based coding nature of our platform means we were able to further adapt the platform to facilitate 6-Letter seq, and so differentiate between 5mC and 5hmC. Our system also has the potential to measure additional epigenetic modifications, such as formylcytosine, methyladenine, or carboxycytosine.
We have shown that 5-Letter seq can deliver accurate, phased genetic and epigenetic sequences in a single workflow that is faster and more accurate than bisulfite sequencing or EM-seq. Our workflow allows both genetics and epigenetics to be studied in the context of the other, offering comprehensive biological insight with ease of use and reduced overhead. The 5-Letter seq platform can also be used to generate accurate data from valuable, low input biological samples or cell-free DNA which could feasibly transform cfDNA analysis and liquid biopsy for detection of early cancer.