Behind the Paper

Domain-PFP allows protein function prediction using function-aware domain embedding representations.

Domain-PFP employs self-supervised learning to generate functionally-aware embedding presentations for protein domains, which has demonstrated remarkable performance in several protein function prediction benchmarks.

Published in Protocols & Methods

Nov 02, 2023

Nabil Ibtehaz and Daisuke Kihara

2 contributors

Domain-PFP allows protein function prediction using function-aware domain embedding representations.

Liked by Daisuke Kihara and 3 others

Explore the Research

Proteins act as the building blocks of life and govern the diverse activities in living organisms. From catalyzing biochemical reactions and transmitting signals within the body to providing structural support and transporting nutrients, proteins regulate biological processes and are the key players in the intricate orchestra of life (Fig. 1a). As a result, predicting protein function has been a longstanding problem in bioinformatics, given its plethora of downstream applications involving drug discovery and the understanding of disease and organic mechanisms.

Figure 1. Protein domains and functions. a, proteins perform diverse functions in living organisms. b, domains are functional and structural units of proteins, responsible specific tasks. — **Figure 1.** **Protein domains and functions. a**, proteins perform diverse functions in living organisms. b, domains are functional and structural units of proteins, responsible specific tasks.

Domains are the structural and functional units of proteins (Fig. 1b). Therefore, it is motivating to infer the functions a protein performs from the set of domains it comprises. Ongoing projects such as InterPro2GO, Pfam2GO, assign probable functions to the different domains through experimental validation. However, this approach is yet to be widely adopted for practical usage, primarily due to the lack of sufficient coverage, i.e., only a small subset of the domain-function relationships has been identified. As a result, machine learning-based methods have been developed to infer protein functions from the domains present within them.

The existing domain-based function prediction methods take the set of domains as input and aim to predict the set of functions in the form of Gene Ontology (GO) terms. However, both of these sets are quite large; for example, the InterPro database has approximately 14K domains, and there are more than 26K Gene Ontology terms. The high dimensionality of both the input and output space, coupled with insufficient functionally annotated proteins, hinders deep learning models from being sufficiently complex. Consequently, this results in high degrees of compression-expansion ratios within the network.

In order to avoid such limitations, we turn to self-supervised learning. Instead of predicting GO terms from domains, we aim to correlating them. Using a data-driven protocol, we learn the protein domain and GO terms co-occurrence relationships in the form of conditional probabilities p(GO_j|domain_i), i.e., the probability that a protein containing domain_iperforms the function GO_j . At the core of our method is a simple yet effective neural network, what we call DomainGO prob. Fig. 2a shows the network architecture. This model maps the domains and GO terms into two separate embedding spaces and strives to establish correlation between them. Instead of directly regulating the dot product or similarity of the two embeddings, we apply Hadamard product, which is element-wise multiplication between two embedding vectors, i.e, the individual components of the dot product. This enables us to exploit higher-dimensional relationships between the two embeddings, in contrast to an overall aggregation obtained from dot product. We intentionally kept the network architecture shallow and simple, so that the functional information is contained within the actual embeddings and not in the auxiliary network layers facilitating the learning. While it was possible to design a more complex network architectures with deeper and wider layers, in those cases, the networks tended to perform the heavy lifting, making the embeddings less informative in the process. As a result, the embeddings of the domains become functionally informed, through optimizing the domain-function conditional joint probability prediction.

Fig. 2b explains how we developed the dataset to compute domain-GO associations with the network in Fig. 2a. We collected GO functional annotations of proteins from a database, including Swiss-Prot and extracted the protein domains using InterPro Scan. Based on the sets of obtained GO terms and domains, we compute the aforementioned conditional probability p(GO_j|domain_i) through a conventional counting-based approach.

Fig. 2c shows how the computed DomainGOProb is used for predicting function of a query protein.. During inference, from a query protein we identify domains using InterPro Scan and obtain the corresponding domain embeddings. Upon averaging the domain embeddings a functionally informed representation for the protein can be devised, and a supervised classifier can be applied to predict the functions of the protein. We name this overall protocol Domain-PFP.

**Figure 2.** **Overview of Domain-PFP. a**, the network architecture used for self-supervised learning of domain embeddings. b, the overall pipeline of learning the functionally aware domain embeddings. c, the steps of computing the embeddings of a protein and inferring the functions.

Fig. 3 shows how protein embeddings in DomainPFP (Fig. 2c) correlates with functional similarity, measured with funSim Score. As shown, proteins at close proximity in the embedding space have a high funSim score greater than 0.4, which gradually falls as we consider proteins that are further apart. This score sharply falls when the embedding distance between proteins exceeds 5 unit.

**Figure 3. Correlation between the distance of protein embeddings and functional similarity.**

In our work, we also showed that that our developed embedding contains much richer functional information than state-of-the-art protein language models, which are orders of magnitudes more complex and trained on massive volumes of sequences compared to our model. We have achieved up to 20% higher weighted F1-Score (a score for function prediction accuracy) for rare and specific GO terms through a simple linear classifier, which indicates that our representation is so much functionally informed that it is almost linearly separable.

Finally, we compared Domain-PFP (Fig. 2c) with state-of-the-art function prediction methods. These methods involve complex network architectures for instance, convolutional neural networks, graph neural networks, and transformers. We have observed that Domain-PFP, outperforms all other existing methods with similar level of provided information. Domain-PFP substantially improved by incorporating complementary data sources, such as BLAST and Protein-Protein Interaction (PPI). Moreover, our method demonstrated competence in CAFA3 evaluation along with structure-based function prediction datasets.

To contribute to protein function prediction research, we have released our source codes at https://github.com/kiharalab/Domain-PFP, and we have also developed an online platform for easy access at https://bit.ly/domain-pfp-colab. If you have any questions or potential ideas to further improve Domain-PFP, please contact Daisuke Kihara (dkihara@purdue.edu).

Multiple Contributors

Nabil Ibtehaz and Daisuke Kihara

Please sign in or register for FREE

If you are a registered user on Research Communities by Springer Nature, please sign in

Follow the Topic

Biological Techniques

Life Sciences > Biological Sciences > Biological Techniques

Communications Biology

Communications Biology

An open access journal from Nature Portfolio publishing high-quality research, reviews and commentary in all areas of the biological sciences, representing significant advances and bringing new biological insight to a specialized area of research.

More about the journal

Related Collections

With Collections, you can get published faster and increase your visibility.

Signalling Pathways of Innate Immunity

In this cross-journal Collection, we invite research into the complex signalling pathways of innate immunity, emphasising the activation and regulation of pattern recognition receptors in response to microbial and endogenous triggers.

Publishing Model: Hybrid

Deadline: Feb 28, 2026

Explore this Collection

Forces in Cell Biology

Cell generate forces to maintain normal tissue morphology and function. Cells can also sense and process forces appropriate to their correct tissue context. With this cross-journal Collection between Communications Biology and Nature Communications, we welcome the submission of primary research articles exploring molecular mechanisms underlying how cells react to external mechanical stimuli, to forces between cells, and to intercellular forces

Publishing Model: Open Access

Deadline: Apr 30, 2026

Explore this Collection

Paving the Future of Intelligent Asphalt Defect Detection with Machine Learning

Behind the Paper

The functional role and regulatory mechanism of paeonol in the treatment of liver diseases

Behind the Paper

Pathogenesis of Sex Differences in Autism Risk: Evidence from Cohort and Animal Studies Focused on Maternal Perinatal Depression

Behind the Paper

Unlocking "Invisible Modes": How Metamaterials Help Catch the Dielectric Fingerprints of Cancer Cells

Behind the Paper

Building sustainable futures through CBET: Examining the role of teacher preparedness and leadership in the implementation of education-related SDG policies in Kenyan TVETs

Cookies

We use cookies to ensure the functionality of our website, to personalize content and advertising, to provide social media features, and to analyze our traffic. If you allow us to do so, we also inform our social media, advertising and analysis partners about your use of our website. You can decide for yourself which categories you want to deny or allow. Please note that based on your settings not all functionalities of the site are available.

Further information can be found in our privacy policy.

Domain-PFP allows protein function prediction using function-aware domain embedding representations.

Share this post

Share with...

...or copy the link