Behind the Paper

PheSeq: How Bayesian Deep Learning Conceptualizes the Gene-Disease Associations and Bridges ’em with P-values?

"This study introduces PheSeq, a Bayesian deep learning model designed to integrate p-value data from sequence analysis with phenotype descriptions from literature and network data. It improves the robustness and interpretability of gene-disease association studies."

Published in Genetics & Genomics

Jun 08, 2024

Jingbo Xia

Researcher, Huazhong Agricultural University

Explore the Research

Published in Genome Medicine

Apr 16, 2024

The paper behind this post [1]: https://genomemedicine.biomedcentral.com/articles/10.1186/s13073-024-01330-7

Highlights:

The Bayesian deep learning framework bridges phenotype description perception and association significance (p-value) in gene-disease association studies.
Deep learning is used to derive phenotype description embeddings from literature and network data.
The framework treats the p-value as a weak supervision signal in the uncertainty inference.
A probabilistic graphical model bridges these heterogeneous data modalities by activating a switch when the association significance and the phenotype description are consistent.

In genotype-phenotype association studies, p-values from sequence analyses such as GWAS and RNA-seq provide a measure of significance. However, these p-values often carry high uncertainty and lack interpretability.

The proposed PheSeq model addresses these challenges by combining p-value data with deep learning-derived phenotype embeddings from literature and network data. Bridging these two types of heterogeneous association data enhances the robustness and interpretability of the results.

The figure outlines the framework of the Bayesian deep learning model, PheSeq.

a. The general model input in PheSeq consists of p-values for association significance from sequence analysis and phenotypic embeddings for phenotype description from texts or graphs. The associations with p-values are depicted in a Manhattan-style plot, to which a threshold line with a strict criterion (red line) or a less strict criterion (green line) is applied. Concurrently, a DL perception module learns the phenotype description of each gene-disease association from text or graph data. Genes with significant association descriptions tend to aggregate in the top-left region of the semantic space, as shown in the figure. Analogous patterns emerge in other scenarios. Finally, PheSeq learns the data distributions and performs data fusion for gene-disease associations.
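The two threshold lines in panel a can be illustrated with a minimal sketch. The cutoffs used here (5e-8 and 1e-5) are common GWAS conventions and are an assumption, not values taken from the paper; the gene names and p-values are hypothetical.

```python
import math

# Assumed cutoffs (common GWAS conventions, not taken from the paper).
STRICT_P = 5e-8   # stands in for the strict "red line"
LENIENT_P = 1e-5  # stands in for the less strict "green line"

def neg_log10(p):
    """Height of a point on a Manhattan-style plot."""
    return -math.log10(p)

def tier(p, strict=STRICT_P, lenient=LENIENT_P):
    """Label a p-value by the strictest threshold it passes."""
    if p <= strict:
        return "strict"
    if p <= lenient:
        return "lenient"
    return "below"

# Hypothetical gene-level p-values, for illustration only.
p_values = {"GENE_A": 3e-9, "GENE_B": 4e-6, "GENE_C": 0.02}
tiers = {gene: tier(p) for gene, p in p_values.items()}
```

A hard cutoff like this discards every gene in the "below" tier, which is the kind of loss that fusing p-values with phenotype descriptions is meant to reduce.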

b/c. Data fusion of association significance and phenotype description by PheSeq for a significant (b) or non-significant (c) gene-disease association. For each gene-disease association, two distinct types of observations are considered for data fusion: L, the phenotypic embedding, and P, the p-value. Both are input into the PGM inference module, which learns the dependency relationships among them together with latent variables. The phenotypic embedding L is first processed by the DL perception module for semantic training, generating high-quality embeddings denoted as Z. The latent variable T plays a pivotal role in synchronizing the phenotypic embedding data with the p-value data; the p-value follows a beta distribution that favors small p-values. Another latent variable, F, functions as an association score and establishes connections among the model parameters. Conceptually, the switch mechanism activates when the association significance and the phenotype description align, bridging the two heterogeneous data modalities. Part c shows the converse situation, in which the data indicate that the gene-disease association is non-significant. In this case, a uniform distribution characterizes the p-value. The rest of the model configuration remains the same.
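The beta-versus-uniform idea behind the switch can be sketched as a two-component mixture. This is a simplified illustration under stated assumptions, not the paper's inference procedure: the Beta shape parameters (a=0.5, b=1.0) are arbitrary choices that favor small p-values, and `prior_on` is a hypothetical stand-in for the evidence that the full model would draw from the embeddings Z and the association score F.

```python
import math

def beta_pdf(p, a, b):
    """Density of Beta(a, b) at p, for 0 < p < 1."""
    log_norm = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    return math.exp(log_norm + (a - 1) * math.log(p) + (b - 1) * math.log1p(-p))

def switch_posterior(p, prior_on, a=0.5, b=1.0):
    """Posterior probability that the switch is on, given p-value p.

    on  -> p ~ Beta(a, b), which favors small p-values when a < 1
    off -> p ~ Uniform(0, 1), whose density is 1 everywhere
    prior_on is a hypothetical prior; a and b are assumed values.
    """
    on = prior_on * beta_pdf(p, a, b)
    off = (1.0 - prior_on) * 1.0
    return on / (on + off)

# A small p-value pushes the switch on; a large one pushes it off.
strong = switch_posterior(1e-6, prior_on=0.5)
weak = switch_posterior(0.5, prior_on=0.5)
```

With these assumed parameters, a moderately small p-value can still yield a high posterior when `prior_on` is high, which mirrors how supportive phenotype descriptions can help recall genes that a hard threshold would miss.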

The PheSeq model was tested in three case studies involving Alzheimer’s disease (AD), breast cancer (BC), and lung cancer (LC), using GWAS, transcriptomic, and methylation data respectively. Phenotypic descriptions of the three diseases were collected from disease-related literature across PubMed and PMC. Sentences that describe the phenotype of a gene-disease association were filtered by a biomedical event extraction model based on the AGAC (Annotation of Genes with Alteration-Centric function changes [2]) corpus.

PheSeq identified 1024 priority genes for AD, 818 for BC, and 566 for LC. Benefiting from data fusion, these findings show moderate positive rates and high recall rates, and they add interpretability to gene-disease association studies.

PheSeq is particularly important in situations where a single sequence analysis may introduce systematic bias and flawed predictions of crucial genes. In such cases, PheSeq serves as an effective tool for connecting phenotype descriptions with association significance in sequence analysis, and it helps to recall the significant genes.

In conclusion, this research is a worthwhile attempt at heterogeneous association data fusion. The framework bridges phenotype description perception and p-value uncertainty inference, using the association significance as a fine-grained weak supervision signal. Overall, it offers a promising way to unveil genotype-phenotype associations and to investigate potential relation dependencies through data perception, data fusion, and probabilistic inference in a novel Bayesian framework.

Finally, we are delighted to share our work with the scientific community and domain experts in the prestigious journal, Genome Medicine. We sincerely hope that this resource can provide valuable research groundwork and further insights for the community.

References

[1] Yao, X., Ouyang, S., Lian, Y., Peng, Q., Zhou, X., Huang, F., ... & Xia, J. (2024). PheSeq, a Bayesian deep learning model to enhance and interpret the gene-disease association studies. Genome Medicine, 16(1), 56.
[2] Wang, Y., Zhou, K., Gachloo, M., & Xia, J. (2019, November). An overview of the active gene annotation corpus and the BioNLP OST 2019 AGAC track tasks. In Proceedings of The 5th Workshop on BioNLP Open Shared Tasks (pp. 62-71).

This blog post was written by Yanhong He, Fumin Chen, Yawen Liu, Xinzhi Yao, and Jingbo Xia.

Jingbo Xia

Researcher, Huazhong Agricultural University

Research Interests
- BioNLP (biomedical natural language processing)
- Data mining
- Bioinformatics
Research Projects
- Corpus design and biomedical knowledge discovery based on BioNLP
- Data mining for genotype-phenotype association
