Behind the Paper

ProRefiner: A deep learning model for effective and efficient protein sequence design

We introduce ProRefiner, a deep learning model for structure-based protein sequence design through effective and efficient residue interaction learning.

Published in Computational Sciences

Nov 17, 2023

Xinyi ZHOU

PhD Student, The Chinese University of Hong Kong

Structure-based protein design, also known as Inverse Protein Folding (IPF), aims to design protein sequences that fold into a given structure. It has important applications in many protein design and engineering tasks, because adopting a particular conformation is often a prerequisite for performing a particular function. As the name indicates, IPF tackles the reverse of protein folding, the process in which complex interactions among different components guide the linear protein chain to fold into a three-dimensional structure. To translate a given structure back into a sequence, it is therefore important to analyze the interplay between residues. Deep learning offers an effective way to model and learn these intricate, non-linear, many-body relationships. In light of this, we set out to address two questions. How can we design a model that learns and extracts residue relationships within 3D structures? And how can we design a sequence generation pipeline that helps the model represent those relationships?

To address these challenges, we propose ProRefiner, a model that learns to represent global residue interactions effectively and efficiently, together with a sequence design pipeline that extracts a high-quality residue environment to aid the model's predictions.

**Figure 1.** The model architecture of ProRefiner and the sequence design pipeline.

Sequence design pipeline with ProRefiner

ProRefiner contains a stack of memory-efficient global attention layers. We first represent protein structures as graphs and construct edges between nearby residues to represent their connections. In the model, attention weights are computed between every pair of residues to quantify the strength of their interactions. Unlike previous work, we compute the attention weights from the residue features together with either the edge features, when two residues are connected by an edge, or layer-specific pseudo-edge features, when they are not. The learnable pseudo-edge features enable the model to perform global attention in a memory-efficient way, so residues can gather information from the whole protein structure even when they are not directly connected on the graph.
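The pseudo-edge idea can be illustrated with a minimal sketch in plain Python. This is not the ProRefiner implementation: the scoring rule (adding the edge vector to the key, then a scaled dot product), the absence of learned projections, and the names `global_attention`, `edge_feats` and `pseudo_edge` are all assumptions made for illustration. The point it shows is that real edge features are stored only for graph edges, while a single shared vector stands in for every unconnected pair, so no dense table of per-pair edge features is needed.

```python
import math

def global_attention(node_feats, edge_feats, pseudo_edge):
    """Toy global attention over all residue pairs (illustrative assumption).

    node_feats:  list of N feature vectors, one per residue
    edge_feats:  dict {(i, j): vector} holding features for graph edges only
    pseudo_edge: one vector shared by all pairs without an edge in this layer
    """
    n, dim = len(node_feats), len(node_feats[0])
    updated, weights = [], []
    for i in range(n):
        logits = []
        for j in range(n):
            # Real edge feature if residues i and j are connected, else the
            # shared pseudo-edge (self-pairs also fall back to it here).
            edge = edge_feats.get((i, j), pseudo_edge)
            key = [h + e for h, e in zip(node_feats[j], edge)]
            score = sum(q * k for q, k in zip(node_feats[i], key))
            logits.append(score / math.sqrt(dim))
        # Numerically stable softmax over all residues j.
        peak = max(logits)
        exps = [math.exp(x - peak) for x in logits]
        total = sum(exps)
        row = [x / total for x in exps]
        weights.append(row)
        # Each residue aggregates features from the whole structure.
        updated.append(
            [sum(row[j] * node_feats[j][d] for j in range(n)) for d in range(dim)]
        )
    return updated, weights

# Three residues; only residues 0 and 1 are connected by an edge.
nodes = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
edges = {(0, 1): [0.5, 0.5], (1, 0): [0.5, 0.5]}
pseudo = [0.1, -0.1]
updated, attn = global_attention(nodes, edges, pseudo)
```

In this toy version the attention weights themselves are still materialized for every pair; the saving it illustrates concerns the edge features only.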

We further propose an entropy-based residue selection method to select high-quality, meaningful residue information that serves as a reference for ProRefiner's predictions. Specifically, an existing IPF model is employed to predict sequences from structures in the form of probability distributions over the 20 amino acid types. The entropy of each residue's prediction is computed as a measure of model confidence, with lower entropy indicating higher confidence. We mask out the predictions with high entropy to reduce noisy residue information. ProRefiner can then complete and refine the partial residue environment and generate sequences that are more compatible with the corresponding 3D structure.
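The selection step can be sketched as follows. This is a minimal illustration, assuming Shannon entropy in nats, a fixed cutoff, and the most probable amino acid as the kept prediction; the threshold value, the function names and the toy probabilities are hypothetical and do not come from the paper.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acid types

def entropy(probs):
    """Shannon entropy (in nats) of one residue's predicted distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def select_confident_residues(prob_rows, max_entropy):
    """Keep the top prediction for low-entropy residues, mask the rest.

    prob_rows:   one probability distribution over AMINO_ACIDS per residue
    max_entropy: hypothetical cutoff; residues above it are masked (None)
    """
    partial = []
    for probs in prob_rows:
        if entropy(probs) <= max_entropy:
            best = max(range(len(probs)), key=probs.__getitem__)
            partial.append(AMINO_ACIDS[best])
        else:
            partial.append(None)  # masked position, left for refinement
    return partial

# Toy input: one confident prediction and one uniform (uninformative) one.
confident = [0.905] + [0.005] * 19
uniform = [0.05] * 20
partial_sequence = select_confident_residues([confident, uniform], max_entropy=1.0)
# partial_sequence is ['A', None]
```

The masked positions are the ones a refinement model would then fill in, using the retained residues as context.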

Experiment results

**Figure 2.** Some experiment results. a-c. Important inter-residue relationships captured by ProRefiner's attention operation. d. Results on Inverse Protein Folding of 2KCD. e. Single point mutant design results on TnpB.

Learning residue interactions with ProRefiner

To assess ProRefiner's ability to learn and capture residue interactions, we analyzed the average attention weights that each residue assigns to others. We observed that many residues form important chemical bonds with the residues they attend to most. Figure 2a-c shows selected case studies, in which central residues (in blue) and the residues with the highest attention weights (in orange) are highlighted. For 2KCD in Figure 2a, ProRefiner accurately identifies two hydrogen bonds: one between HIS 9 and LEU 5 on the helix, and another between ILE 70 and ASN 54 on the sheet. Similarly, in T4-lysozyme (1LYD), ASP 70 forms a hydrogen bond with LEU 66 and a salt bridge with HIS 31, and both residues are among its most attended residues. ProRefiner also captures the disulfide bond between CYS 99 and CYS 94 in human Ero1-alpha (Q96HE7).
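An analysis of this kind can be sketched in a few lines. The snippet below assumes that the attention maps are averaged over layers and that a residue's attention to itself is ignored; both choices, the function name `top_attended` and the toy matrices are assumptions for illustration, not details taken from the paper.

```python
def top_attended(attn_maps, residue, k=2):
    """Return the k residues that `residue` attends to most on average.

    attn_maps: list of N x N attention matrices (e.g. one per layer), where
               attn_maps[l][i][j] is the weight residue i assigns to residue j
    """
    n = len(attn_maps[0])
    # Average the weights that `residue` assigns to each other residue.
    avg = [sum(m[residue][j] for m in attn_maps) / len(attn_maps) for j in range(n)]
    others = [j for j in range(n) if j != residue]  # skip self-attention
    return sorted(others, key=lambda j: avg[j], reverse=True)[:k]

# Toy attention maps for a three-residue structure (made-up numbers).
layer_a = [[0.1, 0.6, 0.3], [0.5, 0.2, 0.3], [0.3, 0.3, 0.4]]
layer_b = [[0.3, 0.4, 0.3], [0.4, 0.4, 0.2], [0.6, 0.1, 0.3]]
partners = top_attended([layer_a, layer_b], residue=0)
```

The returned indices could then be compared against bonds annotated in the structure, as in the case studies above.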

Improving protein sequence design

We conducted two sequence design tasks to evaluate ProRefiner. First, we employed ProRefiner as an add-on module to refine the sequences generated by existing models, using the pipeline introduced above. We experimented with multiple recent inverse folding models, and ProRefiner significantly refined sequence quality and improved the recovery of native sequences. Detailed results and discussion can be found in our paper. Figure 2d presents the inverse folding results obtained on protein 2KCD. We used the ESM-IF1 model to generate the baseline sequence and then employed ProRefiner to refine it. The resulting sequence better recovers the native protein structure, as assessed by AlphaFold2.
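Recovery of the native sequence is commonly measured as the fraction of positions at which the designed residue matches the native one. The sketch below assumes that definition; the sequences are short made-up strings for illustration and are not the 2KCD results.

```python
def sequence_recovery(designed, native):
    """Fraction of positions where the designed residue equals the native one."""
    if len(designed) != len(native):
        raise ValueError("sequences must have the same length")
    if not native:
        raise ValueError("sequences must not be empty")
    matches = sum(d == n for d, n in zip(designed, native))
    return matches / len(native)

# Toy example (hypothetical sequences, not data from the paper).
native = "MKTAYIAK"
baseline = "MKSAYLAR"  # stand-in for a baseline model's design
refined = "MKTAYLAK"   # stand-in for the design after refinement

baseline_recovery = sequence_recovery(baseline, native)  # 5 of 8 match: 0.625
refined_recovery = sequence_recovery(refined, native)    # 7 of 8 match: 0.875
```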

Additionally, we applied ProRefiner to design single-point mutants of transposon-associated transposase B (TnpB) to improve its editing activity. This design scenario can be seen as a special case of Inverse Protein Folding, in which only one residue can be modified and the others are fixed and provided as design references. We use ProRefiner's predicted probabilities at the mutation site as a measure of mutant stability: amino acid types with higher probabilities are considered more stable and more compatible with the surrounding structural context. We ranked the mutants by predicted stability and selected the top 20 for experimental validation. Experiments show that 6 variants designed by ProRefiner exhibit more than a 1.2-fold improvement in indel activity relative to wild-type TnpB (TnpB WT). Figure 2e shows the improvement in indel activity of the variants recommended by ProRefiner relative to TnpB WT, as well as the indel formation at the on-target and off-target sites observed for TnpB WT and the variant with the highest activity, TnpB K84R.
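The ranking step can be sketched as follows, assuming that candidates from all sites are pooled and sorted by the raw predicted probability of the substituted amino acid. The pooling rule, the function name `rank_point_mutants`, and the toy sequence and probabilities are hypothetical; they are not TnpB data.

```python
def rank_point_mutants(wild_type, site_probs, top_n=20):
    """Rank single-point mutants by the predicted probability of the new residue.

    wild_type:  wild-type sequence as a string
    site_probs: dict {position (1-based): {amino acid: predicted probability}}
    Returns up to top_n (mutation label, probability) pairs, best first.
    """
    candidates = []
    for pos, probs in site_probs.items():
        wt_aa = wild_type[pos - 1]
        for aa, prob in probs.items():
            if aa != wt_aa:  # a mutant must change the residue
                candidates.append((f"{wt_aa}{pos}{aa}", prob))
    candidates.sort(key=lambda item: item[1], reverse=True)
    return candidates[:top_n]

# Toy wild type and made-up predicted probabilities at two sites.
toy_wild_type = "MKLV"
toy_site_probs = {
    2: {"K": 0.50, "R": 0.30, "E": 0.20},
    4: {"V": 0.60, "I": 0.35, "A": 0.05},
}
top_mutants = rank_point_mutants(toy_wild_type, toy_site_probs, top_n=2)
```

Labels follow the usual convention of wild-type residue, position, then substituted residue, as in K84R.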

Conclusion and discussion

In this work, we aim to improve how deep learning models represent and capture the interactions among residues within protein structures. While we focus on structure-based sequence design, future research could apply the proposed model to other protein-related tasks and extend it to other biomolecules.
