Unlocking Precision: How Haplotype Sampling Enhances Pangenome Mapping

Discover how haplotype sampling transforms the complexity of growing pangenomes into a straightforward and powerful tool for accurate and personalized genomic analysis.

Linear reference genome

The traditional linear reference genome, a single representation made up of individual chromosome sequences and used as a standard for a species, has been instrumental in advancing genomic research. Each human carries two haploid copies of the genome in their cells, yet the linear human reference genome provides only one to represent the entire species. This reliance on a single sequence introduces biases that limit its effectiveness. When identifying the variants in an individual’s genome relative to the linear reference, regions of the sample that differ significantly from the reference are difficult to detect. These differences can lead to mismatches or missing alignments, especially in highly variable or structurally diverse areas, which distorts the analysis.

Pangenome reference

A pangenome addresses the limitations of the linear reference by encompassing the common genomic variation present within a species. Pangenomes include a collection of haplotype-resolved genetic sequences that capture actual genetic differences among individuals. This comprehensive approach allows researchers to study all kinds of genetic variation, from individual base changes to larger structural differences, more reliably. By integrating more diverse sequences, a pangenome reduces the biases that arise from using a single linear reference genome and offers a more accurate and broader view of a species' genetic landscape, enabling better insights into its genetic makeup and evolution. One popular way to represent pangenomes is as variation graphs: mathematical graphs that represent the relationships between the component genomes. Each node in such a graph is labeled with a sequence, and different variants of a sequence are shown as separate nodes.
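As a minimal sketch of the variation graph idea described above, the toy structure below stores a sequence label per node and records each haplotype as a walk over node IDs. The class name, node IDs, and sequences are made up for illustration and do not correspond to any particular pangenome toolkit.

```python
from dataclasses import dataclass, field


@dataclass
class VariationGraph:
    """Toy variation graph: nodes carry sequence, haplotypes are walks over node IDs."""
    nodes: dict[int, str] = field(default_factory=dict)
    edges: set[tuple[int, int]] = field(default_factory=set)
    haplotypes: dict[str, list[int]] = field(default_factory=dict)

    def add_haplotype(self, name: str, walk: list[int]) -> None:
        # Record the walk and add an edge for every consecutive node pair.
        self.haplotypes[name] = walk
        self.edges.update(zip(walk, walk[1:]))

    def sequence(self, name: str) -> str:
        # Spell out a haplotype by concatenating the labels along its walk.
        return "".join(self.nodes[n] for n in self.haplotypes[name])


# Nodes 2 and 3 are the two alleles ('G' and 'A') of a single variant site.
graph = VariationGraph(nodes={1: "ACGT", 2: "G", 3: "A", 4: "TTCA"})
graph.add_haplotype("hap1", [1, 2, 4])
graph.add_haplotype("hap2", [1, 3, 4])

print(graph.sequence("hap1"))  # ACGTGTTCA
print(graph.sequence("hap2"))  # ACGTATTCA
```

The two haplotypes share nodes 1 and 4 and differ only in which variant node they pass through, which is how the graph stores shared sequence once while keeping each genome recoverable.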

As the number of sequences in a pangenome increases, more variation is added to the graph. This captures more information but also makes the pangenome graph more complex and, thus, more challenging to work with. One common application for pangenome graphs is to use them as a reference for read mapping, which is the process of matching short DNA sequences from a sample to a reference genome to determine where they come from. Pangenome-based aligners can be more accurate than aligners using a linear reference. If the graph contains a haplotype close enough to the sample’s genome in a given region, the aligner can usually find the correct mapping. On the other hand, sequence variation that is present in the graph but not in the sequenced genome can make the aligner less accurate. Such variation can imitate other regions, making incorrect mappings more likely.
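The effect of variation that "imitates" another region can be illustrated with a deliberately simplified example. The sketch below uses exact substring matching in place of a real aligner, and the loci, alleles, and read are all hypothetical.

```python
def candidate_loci(read, graph_alleles):
    """Return the loci where at least one allele in the graph contains the read exactly."""
    return sorted(
        locus
        for locus, alleles in graph_alleles.items()
        if any(read in allele for allele in alleles)
    )


read = "GATTACA"  # sequenced from locus X of the sample

# Graph containing only the alleles the sample actually carries.
personal = {"X": ["CCGATTACAGG"], "Y": ["TTGATCACAAC"]}

# Larger graph: another genome adds an allele at locus Y that resembles locus X.
full = {"X": ["CCGATTACAGG"], "Y": ["TTGATCACAAC", "TTGATTACAAC"]}

print(candidate_loci(read, personal))  # ['X']
print(candidate_loci(read, full))      # ['X', 'Y']
```

With the extra allele in the graph, the same read has two equally good placements, so a mapper has to guess or report low confidence even though the sample itself is unambiguous.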

Pangenomes are expanding rapidly. The Human Pangenome Reference Consortium’s pangenome, for example, grew from 47 individual sequences at its release in May 2023 to 350 in 2024. As pangenomes grow, the number of misleading variants increases, which can slow down the mapping process and result in more incorrect mappings. The typical way to reduce the number of these uninformative variants is to filter low-frequency variants out of the graph. We tested these filtering approaches and found that, although they improve mapping accuracy in many cases, they also remove some variants that matter for the samples that carry them. We wanted a method that filters non-informative variants out of the pangenome while retaining the variants present in the sample.

The frequency filtering method starts with the full graph (A), which includes both 'G' and 'A' variants as possibilities. However, since there is only one observation of 'A' among the haplotypes (B), the 'A' variant is removed in the frequency-filtered graph (C). This poses a problem because some individuals still carry low-frequency variants, and as a result, reads from someone with the 'A' variant (D) become harder to map to the graph.
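The situation in the figure can be reproduced with a few lines of code. This is only a sketch of generic frequency filtering at a single site; the panel of five haplotypes and the `min_count` threshold are assumptions chosen to match the caption, not values from the study.

```python
from collections import Counter


def frequency_filter(haplotype_alleles, min_count):
    """Keep only the alleles observed on at least `min_count` haplotypes."""
    counts = Counter(haplotype_alleles.values())
    return {allele for allele, n in counts.items() if n >= min_count}


# Hypothetical panel: five haplotypes at one variant site, 'A' seen only once.
panel = {"hap1": "G", "hap2": "G", "hap3": "G", "hap4": "A", "hap5": "G"}

kept = frequency_filter(panel, min_count=2)
print(sorted(kept))  # ['G']

# A sample that carries the rare allele no longer finds it in the filtered graph.
sample_allele = "A"
print(sample_allele in kept)  # False
```

The filter has no knowledge of the sample being mapped, so it drops the rare allele for everyone, including the individuals who carry it.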

Haplotype sampling

To address this, we designed an algorithm that samples only the parts of the graph that are most likely to be relevant to an individual’s genome. Our haplotype sampling method uses the information in the individual’s reads to work out which variants in the graph are present in that individual.

The haplotype sampling method starts with the full graph (A), which includes both 'G' and 'A' variants as possibilities. Unlike the frequency filtering method, it uses reads (B) from the individual's genome to identify the relevant haplotypes for that individual (C), and then creates the haplotype-sampled graph (D).
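One simple way to picture the selection step is to score each haplotype by how much of it is supported by the reads and keep the best-scoring ones. The sketch below does this with shared k-mers; the k-mer scoring, the function names, and the parameters `k` and `n` are assumptions made for illustration, since the text above says only that reads are used to identify relevant haplotypes.

```python
def kmers(seq, k):
    """Return the set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def sample_haplotypes(haplotypes, reads, k=5, n=1):
    """Rank haplotypes by the number of their k-mers seen in the reads; keep the top n."""
    read_kmers = set()
    for read in reads:
        read_kmers |= kmers(read, k)
    scores = {name: len(kmers(seq, k) & read_kmers) for name, seq in haplotypes.items()}
    # Sort by descending score, breaking ties by name for a stable result.
    ranked = sorted(scores, key=lambda name: (-scores[name], name))
    return ranked[:n], scores


# Two haplotypes that differ at one site: 'G' in hap1, the rarer 'A' in hap2.
haplotypes = {"hap1": "ACGTGTTCA", "hap2": "ACGTATTCA"}

# Hypothetical reads from an individual carrying the 'A' allele.
reads = ["CGTATT", "GTATTC"]

chosen, scores = sample_haplotypes(haplotypes, reads)
print(chosen)  # ['hap2']
print(scores)  # {'hap1': 0, 'hap2': 3}
```

Because the choice is driven by the sample's own reads rather than by population frequency, the rare 'A' haplotype is retained for this individual, while a sample whose reads support 'G' would get `hap1` instead.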

After developing the haplotype sampling method, we evaluated how it performs in practice and how it affects genomic analysis. First, we looked at mapping quality and runtime using simulated data. We were primarily interested in how much haplotype sampling adds to the mapping time. Fortunately, mapping with haplotype sampling took at most 25% more time than mapping to the frequency-filtered graphs. The extra time comes from creating a personalized graph for each sample individually, whereas the same frequency-filtered graph is used for all samples. Haplotype sampling was also faster than using BWA-MEM with the GRCh38 linear reference. BWA-MEM is a popular, fast, and accurate sequence alignment algorithm that aligns short or long DNA reads to a reference genome.

We also wanted to know whether haplotype sampling increases the accuracy of small variant calling and structural variant (SV) genotyping, so we compared it against frequency filtering. Haplotype sampling showed minor improvements in small variant calling. We were pleasantly surprised by the SV calling results with haplotype-sampled graphs: the scores for SV calling using short reads with haplotype sampling were comparable to those of long-read methods.

Discussion

Haplotype sampling offers advantages by preserving rare variants and accurately representing them within a sample's personalized graph, which is a subset of the broader pangenome. By reducing the complexity of the pangenome graph, haplotype sampling also improves the characterization of variants. This is especially valuable as pangenomes grow in size and complexity with more assemblies. Handling these large pangenomes can be resource-intensive and may introduce mapping inaccuracies. However, personalized pangenome algorithms, like haplotype sampling, help manage this complexity while retaining the rich genetic information of the original pangenome. This makes genomic analysis more precise, particularly when studying samples with variants that are less common among pangenome sequences.
