Protecting patient privacy in metagenomic sequencing studies
Microbiomes are present in many environments, from the air we breathe to the soil we walk on. Microorganisms such as bacteria and fungi drive complex and diverse processes in their environments, including nutrient cycling and carbon sequestration [1]. Within the human body, it is well established that stable microbiomes are present in the gut, in the mouth, and on the surface of the skin, but functional microbial communities have also been identified in more isolated organ systems, such as the lung, stomach, and vaginal tract. Studying the composition and function of the microbiomes of various human organ systems can reveal important clinical insights. For example, considerable research effort has been devoted to understanding how genetic variations in Helicobacter pylori strains colonizing the stomach can modulate gastric cancer risk, with certain bacterial genotypes associated with increased likelihood of cancer progression [2]. The metabolic diversity and complex growth requirements of human-associated microorganisms make traditional culturing techniques inadequate for comprehensively characterizing microbial communities. Metagenomic next-generation sequencing (mNGS) has increased in popularity over the last decade as a powerful complement to traditional culturing, enabling researchers to directly sequence and analyze the collective genetic material from a clinical or environmental sample without first isolating individual organisms.
Our group specializes in the analysis of complex microbial communities derived from sequencing data. Host filtration, the identification and removal of host genomic content from microbial genomic content, is an important preprocessing step in mNGS data analysis, especially for clinically derived samples where human and microbial cells may be in close proximity or mixed together. Host filtration is traditionally performed by aligning mNGS reads to a human reference genome, then discarding the reads that align successfully, leaving only microbial reads. Following host filtration of a large dataset from tissue-derived samples, principal coordinate analysis (PCoA) of the microbial community composition revealed a distinct, statistically significant separation of samples based on sex. In the absence of any plausible biological explanation for such an observation, we took a deeper look at the data and re-examined our routine preprocessing steps. We theorized that such a sex-specific effect might be caused by one of the main genomic differences between males and females: the Y chromosome. With the relatively recent publication of the “telomere-to-telomere” (or “T2T”) genome, the full sequence of the human Y chromosome had become available and could therefore provide broader coverage of human sequence for host removal in mNGS [3]. We added the T2T genome as an additional human reference for host filtration in our data preprocessing pipeline, and the sex-separation artifact was resolved! We observed that some human reads likely derived from the Y chromosome share sequence homology with common microbial reference genomes, so adequate removal of these human-derived reads is important to preserve accuracy in microbial quantification.
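To make the filtering step concrete, here is a minimal sketch of reference-based host filtration and of how adding a second human reference can change the outcome. It is not the pipeline used in this work: real pipelines align reads with a dedicated aligner, whereas this toy swaps alignment for exact k-mer lookup so that it stays self-contained. The sequences, the function names (`build_kmer_set`, `host_fraction`, `filter_host_reads`), the k-mer length, and the threshold are all hypothetical.

```python
from typing import Dict, Iterable, Set

K = 8  # toy k-mer length (assumption); real tools use longer seeds or full alignment


def build_kmer_set(references: Iterable[str], k: int = K) -> Set[str]:
    """Collect every k-mer that occurs in any of the host reference sequences."""
    kmers: Set[str] = set()
    for ref in references:
        for i in range(len(ref) - k + 1):
            kmers.add(ref[i:i + k])
    return kmers


def host_fraction(read: str, host_kmers: Set[str], k: int = K) -> float:
    """Fraction of the read's k-mers that are found in the host k-mer set."""
    n = len(read) - k + 1
    if n <= 0:
        return 0.0
    hits = sum(read[i:i + k] in host_kmers for i in range(n))
    return hits / n


def filter_host_reads(
    reads: Dict[str, str],
    host_kmers: Set[str],
    min_fraction: float = 0.5,  # hypothetical cutoff for calling a read "host"
    k: int = K,
) -> Dict[str, str]:
    """Keep only reads that do NOT look host-derived."""
    return {
        name: seq
        for name, seq in reads.items()
        if host_fraction(seq, host_kmers, k) < min_fraction
    }


# Made-up sequences: ref_a stands in for an older, incomplete human reference,
# ref_y for a region (e.g. Y-chromosome sequence) only present in a newer one.
ref_a = "ACGTTGCAAGGCTTAACCGGTTAGCATCGA"
ref_y = "TTGACCATGGCAATCGGATTCAGGCTAAGT"

reads = {
    "autosomal_read": ref_a[2:22],                   # human, covered by ref_a
    "y_read": ref_y[5:25],                           # human, covered only by ref_y
    "microbial_read": "GGGGCCCCAAAATTTTGGGGCCCC",   # matches neither reference
}

before = filter_host_reads(reads, build_kmer_set([ref_a]))
after = filter_host_reads(reads, build_kmer_set([ref_a, ref_y]))

print("retained with one reference: ", sorted(before))
print("retained with two references:", sorted(after))
```

With only the first reference, the read taken from the second reference survives filtering and would be passed on to microbial classification; with both references it is removed. That mirrors, in miniature, the kind of leakage we suspect was behind the sex-separation artifact.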
Beyond improving data quality, improving host filtration procedures provides a secondary benefit: protecting patient privacy. Tomofuji et al. recently showed that residual human reads from mNGS experiments can be used to identify patients from stool data, even when the data are properly anonymized [4]. We worked with Yukinori Okada’s group to validate that our improved host filtration methodologies would disrupt the potential for host reidentification, and indeed they did. Recognizing the drastic effect that a single additional reference genome had on our results, we sought to incorporate a more comprehensive collection of genomic references to best capture human genetic diversity. The Human Pangenome Reference Consortium (HPRC) is undertaking an ambitious effort to sequence and assemble high-quality reference genomes from diverse populations spanning the world [5]. Incorporating the currently available human references from the HPRC further improved our host filtration results, but the computational cost of performing iterative alignment to nearly one hundred human genomes was quickly overwhelming our computing infrastructure. We worked with Ben Langmead’s group to integrate Movi, a new tool introduced in Zakeri et al., to build a pangenome index over all the human references in our collection [6]. By performing runtime-efficient queries against the index to compute approximate matching statistics, we could identify and remove human reads much more rapidly than via traditional alignment. To complement the diversity of our expanded human reference collection, we sought to test our approach across a diverse range of biological samples. We leaned on the clinical expertise of Richard Gallo, George Hightower, Sergio Baranzini, and working group members of the Alzheimer’s Gut Microbiome Project to help collect, sequence, and validate novel host filtration methodologies for both low and high biomass samples from the gut, skin, and tissue.
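As a rough illustration of the matching-statistics idea mentioned above, the sketch below computes, for each position in a read, the length of the longest substring starting there that occurs anywhere in a reference collection, and then calls a read host-derived when enough of its windows are fully matched. This is a naive scan written for readability. It is not how Movi works internally (Movi answers such queries from a compressed pangenome index and reports approximate values), and the function names, the length cutoff, and the fraction threshold are hypothetical.

```python
from typing import List, Sequence


def matching_statistics(read: str, references: Sequence[str]) -> List[int]:
    """MS[i] = length of the longest prefix of read[i:] found in any reference."""
    ms: List[int] = []
    for i in range(len(read)):
        length = 0
        # Extend the match one base at a time while it still occurs somewhere.
        while i + length < len(read) and any(
            read[i:i + length + 1] in ref for ref in references
        ):
            length += 1
        ms.append(length)
    return ms


def looks_like_host(
    read: str,
    references: Sequence[str],
    min_len: int = 4,           # hypothetical: match length that counts as "long"
    min_fraction: float = 0.5,  # hypothetical: share of windows that must match
) -> bool:
    """Call a read host-derived if enough of its length-min_len windows
    occur exactly in some reference (i.e. MS[i] >= min_len)."""
    n_windows = len(read) - min_len + 1
    if n_windows <= 0:
        return False
    ms = matching_statistics(read, references)
    long_matches = sum(m >= min_len for m in ms[:n_windows])
    return long_matches / n_windows >= min_fraction


# Two made-up "human" references standing in for a pangenome collection.
references = ["ACGTACGGA", "TTGACCA"]

print(matching_statistics("CGGATTGA", references))
print(looks_like_host("CGTACGG", references))    # substring of the first reference
print(looks_like_host("GGGGCCCC", references))   # shares only short matches
```

The appeal of this formulation is that a single pass over the read yields a per-position signal against the whole reference collection at once, rather than one alignment per genome; an index-based tool gets the same kind of signal without scanning every reference for every query.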
Across all biological domains tested, we observe that the inclusion of diverse human references improves host filtration performance, increases biological accuracy and interpretability, and protects patient privacy when using mNGS in clinical contexts. This work demonstrates how collaborative, multidisciplinary approaches in advancing genomic technologies can provide clinical insights, and we hope the reported findings will serve the scientific community by establishing more robust, inclusive methodologies for microbiome research.
References
1. Tao, F. et al. Microbial carbon use efficiency promotes global soil carbon storage. Nature 618, 981–985 (2023).
2. Backert, S. & Blaser, M. J. The Role of CagA in the Gastric Biology of Helicobacter pylori. Cancer Res 76, 4028–4031 (2016).
3. Nurk, S. et al. The complete sequence of a human genome. Science 376, 44–53 (2022). doi:10.1126/science.abj6987.
4. Tomofuji, Y. et al. Reconstruction of the personal information from human genome reads in gut metagenome sequencing data. Nature Microbiology 8, 1079–1094 (2023).
5. Liao, W.-W. et al. A draft human pangenome reference. Nature 617, 312–324 (2023).
6. Zakeri, M., Brown, N. K., Ahmed, O. Y., Gagie, T. & Langmead, B. Movi: A fast and cache-efficient full-text pangenome index. iScience 27, 111464 (2024).