"This study introduces PheSeq, a Bayesian deep learning model designed to integrate p-value data from sequence analysis with phenotype descriptions from literature and network data. It improves the robustness and interpretability of gene-disease association studies."
Published in Genome Medicine
Apr 16, 2024
Behind the paper [1]: (https://genomemedicine.biomedcentral.com/articles/10.1186/s13073-024-01330-7)
Highlights:
- The Bayesian deep learning framework bridges phenotype description perception and association significance (p-value) in gene-disease association studies.
- Deep learning is used to derive embeddings for phenotype description from literature and network data.
- The framework treats the p-value as a weak supervised signal in the uncertainty inference.
- A probabilistic graphical model (PGM) bridges these heterogeneous data modalities by activating a switch when the association significance and the phenotype description are consistent.
In genotype-phenotype association studies, p-values from sequence analyses such as GWAS and RNA-seq provide a measure of significance. However, these p-values often carry high uncertainty and lack interpretability.
The proposed PheSeq model addresses these challenges by combining p-value data with deep learning-derived phenotype embeddings from literature and network data, and bridging two types of heterogeneous association data, thus enhancing the robustness and interpretability of the results.
The figure outlines the framework of the Bayesian deep learning model, PheSeq.
a. General model input in PheSeq involves p-values for association significance in sequence analysis and phenotypic embeddings for phenotype description from texts or graphs. The associations with p-values are graphically depicted in a Manhattan-style plot. A threshold line with a strict criterion (red line) or a less strict criterion (green line) is then applied. Concurrently, a DL perception module learns the association description of gene-disease association from text or graph. Genes exhibiting significant association descriptions tend to aggregate in the top-left region of the semantic space, as shown in the figure. Analogous patterns emerge in other scenarios. Finally, PheSeq learns the data distributions and performs data fusion for gene-disease associations.
b/c. Data fusion of association significance and phenotype description for a significant/non-significant gene-disease association by PheSeq. For each gene-disease association, two distinct types of observations, denoted as L for phenotypic embedding and P for p-value, are considered for data fusion. Both sets of observations are input into the PGM inference module, facilitating the learning of dependency relationships among them in conjunction with latent variables. The phenotypic embedding L is initially processed through the DL perception module for semantic training, generating high-quality embeddings denoted as Z. The latent variable T serves a pivotal role in synchronizing the phenotypic embedding data with the p-value data, the latter adhering to a beta distribution indicative of a predisposition toward "small p-values." In addition, another latent variable F functions as an association score, establishing connections among model parameters. Conceptually, the switch mechanism activates when both the association significance and phenotype description align, effectively bridging the above heterogeneous data modalities.
Part c shows the converse situation, wherein the data indicate non-significance for the gene-disease association. In this case, a uniform distribution is used to characterize the distribution of the p-value. The remaining configurations of the model are unchanged.
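To make the switch idea concrete, here is a minimal sketch of a two-component mixture over p-values: a beta distribution skewed toward small values when the association is significant, and a uniform distribution otherwise. It is an illustration only, not the inference procedure used in PheSeq. The function names, the beta shape parameters `a` and `b`, and the choice to turn the association score into a prior through a sigmoid are all assumptions made for this example.

```python
import math


def beta_pdf(p, a, b):
    """Density of Beta(a, b) at p, valid for 0 < p < 1."""
    log_norm = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    return math.exp(log_norm + (a - 1) * math.log(p) + (b - 1) * math.log(1 - p))


def sigmoid(x):
    """Map a real-valued score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def switch_posterior(p_value, score, a=0.3, b=1.0):
    """Posterior probability that the switch is on, given a p-value and a score.

    Assumptions (hypothetical, for illustration):
      - switch on:  p-value ~ Beta(a, b), with a < 1 so small p-values are favored
      - switch off: p-value ~ Uniform(0, 1), whose density is 1 everywhere
      - prior probability of "on" is sigmoid(score), where score stands in for
        an association score derived from the phenotype embedding
    """
    prior_on = sigmoid(score)
    likelihood_on = beta_pdf(p_value, a, b)
    likelihood_off = 1.0  # Uniform(0, 1) density
    numerator = prior_on * likelihood_on
    return numerator / (numerator + (1.0 - prior_on) * likelihood_off)


if __name__ == "__main__":
    # A small p-value with a supportive description versus an unremarkable p-value.
    print(switch_posterior(1e-6, score=2.0))
    print(switch_posterior(0.5, score=2.0))
```

In this toy setting, a small p-value pushes the posterior up and a mid-range p-value pulls it below the prior, which mirrors the intuition that the switch activates when the p-value and the phenotype description agree.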
The PheSeq model was tested in three case studies involving Alzheimer's disease (AD), breast cancer (BC), and lung cancer (LC), using GWAS, transcriptomic, and methylation data, respectively. Phenotypic descriptions of the three diseases were collected from disease-related literature at PubMed and PMC scale. Sentences describing the phenotype of a gene-disease association were filtered by a biomedical event extraction model based on the AGAC (Annotation of Genes with Alteration-Centric function changes [2]) corpus.
PheSeq identified 1024 priority genes for AD, 818 for BC, and 566 for LC. Benefiting from data fusion, these results show moderate positive rates, high recall rates, and interpretability in gene-disease association studies.
PheSeq is particularly useful when a single sequence analysis may introduce systematic bias and flawed predictions of crucial genes. In such cases, PheSeq connects phenotype descriptions with association significance in sequence analysis and helps to recall the significant genes.
In conclusion, this research is a worthwhile attempt at fusing heterogeneous association data. The framework bridges phenotype description perception and p-value uncertainty inference, using the association significance (p-value) as a fine-grained weak supervision signal. Overall, it offers a way to unveil genotype-phenotype associations and investigate potential relation dependencies through data perception, data fusion, and probabilistic inference in a novel Bayesian framework.
Finally, we are delighted to share our work with the scientific community and domain experts in the prestigious journal, Genome Medicine. We sincerely hope that this resource can provide valuable research groundwork and further insights for the community.
References
[1] Yao, X., Ouyang, S., Lian, Y., Peng, Q., Zhou, X., Huang, F., ... & Xia, J. (2024). PheSeq, a Bayesian deep learning model to enhance and interpret the gene-disease association studies. Genome Medicine, 16(1), 56.
[2] Wang, Y., Zhou, K., Gachloo, M., & Xia, J. (2019, November). An overview of the active gene annotation corpus and the BioNLP OST 2019 AGAC track tasks. In Proceedings of The 5th Workshop on BioNLP Open Shared Tasks (pp. 62-71).
The blog is written by Yanhong He, Fumin Chen, Yawen Liu, Xinzhi Yao, and Jingbo Xia.