PheSeq: How Bayesian Deep Learning Conceptualizes the Gene-Disease Associations and Bridges ’em with P-values?
"This study introduces PheSeq, a Bayesian deep learning model designed to integrate p-value data from sequence analysis with phenotype descriptions from literature and network data. It improves the robustness and interpretability of gene-disease association studies."
Published in Genome Medicine
Apr 16, 2024
Behind the paper [1]: (https://genomemedicine.biomedcentral.com/articles/10.1186/s13073-024-01330-7)
Highlights:
- The Bayesian deep learning framework bridges phenotype description perception and association significance (p-value) in gene-disease association studies.
- Deep learning is used to derive embeddings for phenotype description from literature and network data.
- The framework treats the p-value as a weak supervision signal in the uncertainty inference.
- A probabilistic graphical model bridges these heterogeneous data modalities by activating a switch when the association significance and the phenotype description are consistent.
In genotype-phenotype association studies, p-values from sequence analyses such as GWAS and RNA-seq provide a measure of significance. However, these p-values often carry high uncertainty and lack interpretability.
The proposed PheSeq model addresses these challenges by fusing two heterogeneous types of association data: p-values from sequence analysis and deep learning-derived phenotype embeddings from literature and network data. This fusion enhances the robustness and interpretability of the results.
The figure outlines the framework of the Bayesian deep learning model, PheSeq.
a. General model input in PheSeq involves p-values for association significance in sequence analysis and phenotypic embeddings for phenotype description from texts or graphs. The associations with p-values are graphically depicted in a Manhattan-style plot. A threshold line with a strict criterion (red line) or a less strict criterion (green line) is then applied. Concurrently, a DL perception module learns the association description of gene-disease association from text or graph. Genes exhibiting significant association descriptions tend to aggregate in the top-left region of the semantic space, as shown in the figure. Analogous patterns emerge in other scenarios. Finally, PheSeq learns the data distributions and performs data fusion for gene-disease associations.
b, c. Data fusion of association significance and phenotype description for a significant (b) or non-significant (c) gene-disease association by PheSeq. For each gene-disease association, two distinct types of observations, denoted as L for phenotypic embedding and P for p-value, are considered for data fusion. Both sets of observations are input into the PGM inference module, facilitating the learning of dependency relationships among them in conjunction with latent variables. The phenotypic embedding L is initially processed through the DL perception module for semantic training, generating high-quality embeddings denoted as Z. The latent variable T serves a pivotal role in synchronizing the phenotypic embedding data with the p-value data, the latter adhering to a beta distribution indicative of a predisposition toward "small-p-value." In addition, another latent variable F functions as an association score, establishing connections among model parameters. Conceptually, the switch mechanism activates when both the association significance and phenotype description align, effectively bridging the above heterogeneous data modalities.
Part c shows the converse situation, wherein the data indicate non-significance for the gene-disease association. In this case, a uniform distribution is employed to characterize the distribution of the p-value. The remaining configurations of the model remain consistent.
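To make the switch idea in the caption concrete, here is a minimal sketch in Python of a two-component view of a single p-value: a beta density skewed toward small values when the switch is on, and a uniform density when it is off. This is an illustration only, not the PheSeq inference procedure. The function names, the Beta parameters (a = 0.5, b = 1.0), and the use of a single prior probability `prior_on` to stand in for the phenotype-description evidence are all assumptions made for this example; the post does not specify them.

```python
import math


def beta_pdf(p, a, b):
    """Density of a Beta(a, b) distribution at p, for 0 < p < 1."""
    log_norm = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    return math.exp(log_norm + (a - 1) * math.log(p) + (b - 1) * math.log(1 - p))


def switch_posterior(p_value, prior_on, a=0.5, b=1.0):
    """Posterior probability that the switch is on, given one p-value.

    prior_on is a hypothetical prior probability of association, here
    imagined as coming from the phenotype-description side (assumption).
    a and b are illustrative Beta parameters, not values from the paper.
    """
    if not 0.0 < p_value < 1.0:
        raise ValueError("p_value must lie strictly between 0 and 1")
    if not 0.0 <= prior_on <= 1.0:
        raise ValueError("prior_on must lie between 0 and 1")
    like_on = beta_pdf(p_value, a, b)  # switch on: Beta, favors small p-values
    like_off = 1.0                     # switch off: Uniform(0, 1) density
    numerator = prior_on * like_on
    return numerator / (numerator + (1.0 - prior_on) * like_off)


if __name__ == "__main__":
    # Same p-value, different strength of description-side evidence.
    for prior in (0.2, 0.5, 0.8):
        print(prior, round(switch_posterior(0.01, prior), 4))
```

In this toy setting, a small p-value raises the posterior only moderately when the description-side prior is weak, and a strong prior can keep a borderline p-value in play, which is the intuition behind fusing the two data types.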
The PheSeq model was tested in three case studies involving Alzheimer’s disease (AD), breast cancer (BC), and lung cancer (LC), using GWAS, transcriptomic, and methylation data, respectively. Phenotypic descriptions of the three diseases were collected from disease-related literature at PubMed and PMC scale. Sentences that address the phenotype description of a gene-disease association were filtered using a biomedical event extraction model on the AGAC (Annotation of Genes with Alteration-Centric function changes) corpus [2].
Finally, PheSeq identified 1024 priority genes for AD and 818 and 566 genes for BC and LC, respectively. Benefiting from data fusion, these findings show moderate positive rates and high recall rates, and they add interpretability to gene-disease association studies.
PheSeq is particularly useful when a single sequence analysis may introduce systematic bias and lead to flawed predictions of crucial genes. In such cases, PheSeq connects phenotype descriptions with association significance from sequence analysis and helps to recall the significant genes.
In conclusion, this research makes a worthwhile attempt at fusing heterogeneous association data. The framework bridges phenotype description perception and p-value uncertainty inference, using the association significance (p-value) as a fine-grained weak supervision signal. Overall, it offers an inspiring way to unveil genotype-phenotype associations and to investigate potential relation dependencies through data perception, data fusion, and probabilistic inference in a novel Bayesian framework.
Finally, we are delighted to share our work with the scientific community and domain experts in the prestigious journal, Genome Medicine. We sincerely hope that this resource can provide valuable research groundwork and further insights for the community.
References
- Yao, X., Ouyang, S., Lian, Y., Peng, Q., Zhou, X., Huang, F., ... & Xia, J. (2024). PheSeq, a Bayesian deep learning model to enhance and interpret the gene-disease association studies. Genome Medicine, 16(1), 56.
- Wang, Y., Zhou, K., Gachloo, M., & Xia, J. (2019, November). An overview of the active gene annotation corpus and the BioNLP OST 2019 AGAC track tasks. In Proceedings of The 5th Workshop on BioNLP Open Shared Tasks (pp. 62-71).
The blog is written by Yanhong He, Fumin Chen, Yawen Liu, Xinzhi Yao, and Jingbo Xia.