Physics-informed machine learning predicts protein function

Published in Protocols & Methods

Understanding protein function is essential for unraveling biological processes, disease mechanisms, and evolutionary pathways at the molecular level. Despite advances in sequencing and computational methods, accurately annotating protein functions, particularly at the residue level, remains challenging. The majority of proteins lack detailed functional annotations, hindering comprehensive insights into their roles in cellular activities. Classical methods for function annotation are limited by sequence complexity, prompting the development of computational approaches, including deep learning, which excel at predicting protein structures but struggle with function prediction. In a recent publication in Nature Communications, we developed PhiGnet, a novel physics-informed learning approach that leverages evolutionary data through graph convolutional networks to enhance the precision of function annotation at the residue level. We showed that, by capturing coevolutionary relationships between residues, PhiGnet not only identifies functional sites within proteins but also quantifies the significance of individual residues in specific biological functions.

The Problem

Proteins are the workhorses of biological systems, playing indispensable roles in virtually every cellular process, from catalyzing reactions to transmitting signals and providing structural support. Understanding their functions is crucial for deciphering fundamental biological mechanisms, addressing diseases, and engineering novel therapeutics. A protein's amino acid sequence contains the necessary information for its three-dimensional structure and governs how it interacts with other molecules, thereby enabling it to carry out its specific functions within cells. Despite the monumental efforts in genome sequencing that have yielded an immense database of protein sequences, functional annotation remains a significant challenge. As of recent estimates, the UniProt database contains over 356 million protein entries, with approximately 80% lacking detailed functional annotations beyond their primary sequences. This gap underscores a critical bottleneck in translating genomic data into actionable biological knowledge.

Computational approaches have emerged as promising alternatives to address these limitations. Deep learning methods have revolutionized protein structure prediction by learning from vast datasets without a priori assumptions about sequence-structure relationships. These methods leverage neural networks with millions of parameters to predict protein structures with unprecedented accuracy, often rivaling experimental methods. Yet, accurately predicting protein functions remains elusive, primarily due to the complex and multifaceted nature of functional diversity encoded within protein sequences.

The challenge lies not only in predicting functions accurately but also in interpreting the biological significance of these predictions. Computational tools must distinguish residues that are crucial for protein function from those that are merely structurally conserved. This distinction is essential for understanding the mechanisms underlying protein activity, identifying disease-associated variants, and engineering proteins with desired functionalities for biotechnological applications. Moreover, the disparity between the abundance of sequenced proteins and the scarcity of experimentally determined structures further complicates function prediction efforts. While computational models can predict structures with high accuracy, how reliably those predicted structures translate into accurate function annotations varies significantly. Factors such as the confidence scores of predicted structures and the inherent variability in computational modeling make it difficult to achieve consistent and reliable function predictions across diverse protein families.

Our Method 

To address these challenges, we introduce PhiGnet, a physics-informed learning approach devised to annotate protein functions at the residue level. PhiGnet leverages evolutionary couplings between residues across diverse protein sequences, which reflect coevolutionary relationships shaped by functional constraints over evolutionary time scales. These coevolutionary signals indicate residues that interact or collaborate to maintain protein structure and function, even across evolutionary distances. PhiGnet centers on two stacked graph convolutional networks (GCNs) that are specifically designed to capture intricate relationships within evolutionary couplings and hierarchical couplings within residue communities. The first GCN extracts features from the protein sequence and its evolutionary couplings, encapsulating the coevolutionary patterns that underpin functional relationships. The second GCN then integrates the hierarchical couplings within residue communities to identify functional sites. By combining these features, PhiGnet learns to generalize across diverse protein sequences and accurately predict functional annotations.
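To make the idea of two stacked GCNs concrete, here is a minimal sketch in plain Python. It is not the PhiGnet implementation: the layer sizes, the coupling cutoffs, the random weights, the mean pooling, and the sigmoid output are all assumptions chosen for illustration, and a stricter cutoff on the same toy coupling matrix merely stands in for a residue-community graph. The propagation rule is the standard GCN update, ReLU(D^-1/2 (A + I) D^-1/2 H W).

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def threshold_graph(couplings, cutoff):
    """Binary residue graph: connect residues whose coupling score reaches the cutoff."""
    n = len(couplings)
    return [[1.0 if i != j and couplings[i][j] >= cutoff else 0.0
             for j in range(n)] for i in range(n)]


def normalize_adjacency(adj):
    """Return D^-1/2 (A + I) D^-1/2, the symmetric normalization of a standard GCN."""
    n = len(adj)
    a_hat = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)] for i in range(n)]
    deg = [sum(row) for row in a_hat]  # self-loops guarantee deg >= 1
    return [[a_hat[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)] for i in range(n)]


def gcn_layer(a_norm, h, w):
    """One graph convolution: aggregate neighbor features, project, apply ReLU."""
    z = matmul(matmul(a_norm, h), w)
    return [[max(0.0, v) for v in row] for row in z]


def forward(features, ec_adj, rc_adj, w1, w2, w_out):
    """Two stacked GCN layers followed by mean pooling and a sigmoid per function label."""
    h1 = gcn_layer(normalize_adjacency(ec_adj), features, w1)  # coupling graph
    h2 = gcn_layer(normalize_adjacency(rc_adj), h1, w2)        # community graph (stand-in)
    pooled = [sum(col) / len(h2) for col in zip(*h2)]
    logits = matmul([pooled], w_out)[0]
    probs = [1.0 / (1.0 + math.exp(-v)) for v in logits]
    return h2, probs


def random_weights(rows, cols, rng):
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


# Toy inputs: 6 residues, 4 input features, 3 hidden units, 2 function labels.
rng = random.Random(0)
n_res, n_feat, n_hidden, n_func = 6, 4, 3, 2
couplings = [[0.0] * n_res for _ in range(n_res)]
for i in range(n_res):
    for j in range(i + 1, n_res):
        couplings[i][j] = couplings[j][i] = rng.random()
features = [[rng.random() for _ in range(n_feat)] for _ in range(n_res)]

ec_adj = threshold_graph(couplings, 0.5)  # hypothetical cutoff
rc_adj = threshold_graph(couplings, 0.8)  # hypothetical stricter cutoff
residue_embeddings, function_probs = forward(
    features, ec_adj, rc_adj,
    random_weights(n_feat, n_hidden, rng),
    random_weights(n_hidden, n_hidden, rng),
    random_weights(n_hidden, n_func, rng),
)
print(function_probs)
```

With untrained random weights the probabilities carry no meaning; the sketch only shows how residue features flow through two graphs in sequence before being pooled into a protein-level prediction.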

Furthermore, PhiGnet introduces interpretability into its predictions by quantifying the significance of each residue with respect to specific biological functions. This capability not only aids in prioritizing functionally important residues for further experimental validation but also provides insights into the molecular mechanisms governing protein activity.
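One generic way to attach a significance score to each residue is occlusion: mask one residue at a time and record how much the predicted function score changes. The sketch below illustrates that idea only. This post does not describe how PhiGnet computes its scores, so occlusion is swapped in here as a stand-in technique, and `toy_predict`, its weights, and the feature values are hypothetical.

```python
import math


def occlusion_scores(predict, features):
    """Score each residue by the change in prediction when its features are zeroed.

    Scores are scaled so that the most influential residue gets 1.0.
    """
    baseline = predict(features)
    scores = []
    for i in range(len(features)):
        masked = [row[:] for row in features]
        masked[i] = [0.0] * len(features[i])  # occlude residue i
        scores.append(abs(baseline - predict(masked)))
    top = max(scores)
    return [s / top if top > 0 else 0.0 for s in scores]


def toy_predict(features):
    """Hypothetical stand-in model: sigmoid of a weighted sum over all residue features."""
    weights = [1.0, -0.5]
    total = sum(w * v for row in features for w, v in zip(weights, row))
    return 1.0 / (1.0 + math.exp(-total))


# Four residues with two features each; residue 1 dominates the prediction.
features = [[0.1, 0.1], [2.0, 0.0], [0.0, 0.0], [0.5, 1.0]]
scores = occlusion_scores(toy_predict, features)
ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
print(ranked[0], scores)
```

Any trained predictor that maps per-residue features to a function score could be passed in place of `toy_predict`; residues with the highest scores would be the natural first candidates for experimental validation.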

Overall, PhiGnet represents a novel approach that bridges the gap between sequence and function by harnessing evolutionary insights. By combining advanced machine learning techniques with evolutionary data, PhiGnet offers a promising pathway towards enhancing our understanding of protein function diversity and complexity, thereby advancing biomedical research and biotechnological applications.

Our Results

PhiGnet demonstrates remarkable performance in accurately assigning function annotations to proteins. Through rigorous evaluation on benchmark datasets and comparison with existing methods, PhiGnet consistently outperforms state-of-the-art (SOTA) approaches in predicting functional annotations. This improvement is attributed to PhiGnet's ability to leverage evolutionary data. Moreover, PhiGnet can identify functionally relevant residues within proteins. This capability provides valuable insights into the molecular basis of protein activities, making it useful for pinpointing crucial residues involved in catalytic sites, ligand-binding pockets, and allosteric sites, all of which are fundamental in drug discovery and enzyme engineering. Overall, by harnessing evolutionary information effectively, PhiGnet not only improves the accuracy of function prediction but also contributes to quantifying the significance of individual residues.

Outlook

Looking forward, future developments could focus on enhancing PhiGnet's interpretability, scalability, and application across various biological contexts. Integrating multi-omics data and refining evolutionary insights could further boost its predictive power and expand its applicability in understanding complex biological systems. 
