Proteins are essential to cellular function, and their activity is regulated by more than 400 types of post-translational modifications (PTMs) [1], which expand the human proteome to over a million unique protein species from roughly 20,000 genes [2]. Natural Language Processing (NLP) techniques, particularly generative pre-trained transformer (GPT) models, can help decode and predict the intricate patterns of PTMs, merging computational linguistics with molecular biology.
PTMGPT2 takes a prompt-based approach to predicting PTMs in protein sequences. The framework fine-tunes the PROTGPT2 model [3] on informative prompts, expanding the model's vocabulary with custom tokens so that it can generate sequence labels directly. During training, PTMGPT2 learns token relationships in an unsupervised manner; at inference, it predicts labels by filling in the blanks within the prompts. Reformulating classification as generation in this way preserves the biological context of the protein sequence and yields high prediction accuracy (Fig. 1).
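As a rough illustration of this workflow, here is a minimal sketch of what prompt-based label generation with a GPT-2-style protein model could look like. The base checkpoint is the public ProtGPT2 release, but the prompt template and the custom label tokens are illustrative assumptions, not PTMGPT2's actual vocabulary:

```python
# Minimal sketch of prompt-based PTM label generation with a GPT-2-style
# protein language model. The prompt template and the custom tokens below
# are illustrative assumptions, not PTMGPT2's actual vocabulary.
from transformers import AutoTokenizer, AutoModelForCausalLM

base = "nferruz/ProtGPT2"  # the public base model that PTMGPT2 fine-tunes
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base)

# Hypothetical prompt tokens added to the vocabulary during fine-tuning.
tokenizer.add_tokens(["<startoftag>", "<endoftag>", "POSITIVE", "NEGATIVE"])
model.resize_token_embeddings(len(tokenizer))

# A 21-residue window centered on a candidate modification site (illustrative).
window = "AGLKVEENSLKAQQLVESEMK"
prompt = f"SEQUENCE: {window} LABEL: <startoftag>"

inputs = tokenizer(prompt, return_tensors="pt")
out = model.generate(
    **inputs,
    max_new_tokens=2,
    do_sample=False,
    pad_token_id=tokenizer.eos_token_id,
)
# After fine-tuning, the model would fill the blank with POSITIVE or NEGATIVE.
print(tokenizer.decode(out[0][inputs["input_ids"].shape[1]:]))
```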
PTMGPT2's performance is benchmarked against the dbPTM database [4], which covers a broad spectrum of experimentally verified PTMs. PTMGPT2's ability to identify modified amino acid residues was assessed thoroughly on 19 distinct dbPTM datasets, each containing at least 500 data points. The comparative analysis, with the Matthews Correlation Coefficient (MCC) as the metric, showed improvements across various PTM types: for instance, PTMGPT2 achieved a 7.94% improvement in lysine succinylation, 5.91% in lysine sumoylation, and 12.74% in arginine methylation, and it performed robustly across the other PTMs, often surpassing existing methods. These results establish PTMGPT2 as a leading tool for PTM site prediction in proteomics research.
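For reference, MCC is computed from the confusion matrix as MCC = (TP*TN - FP*FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN)), and ranges from -1 to 1. A small sketch with made-up labels, using scikit-learn's built-in implementation:

```python
# Sketch of the benchmark metric. MCC is computed from the confusion matrix:
#   MCC = (TP*TN - FP*FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN))
# The labels below are illustrative, not dbPTM benchmark outputs.
from sklearn.metrics import matthews_corrcoef

y_true = [1, 0, 1, 1, 0, 0, 1, 0]  # 1 = experimentally verified PTM site
y_pred = [1, 0, 1, 0, 0, 0, 1, 1]  # model predictions for the same residues
print(f"MCC: {matthews_corrcoef(y_true, y_pred):.3f}")
```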
PTMGPT2 also uses an attention-driven framework to identify the sequence determinants critical for protein modification, extracting attention scores from its final decoder layer. These scores provide a granular view of the model's focus on specific amino acids or motifs, summarized in a Position-Specific Probability Matrix (PSPM) that reveals the relative importance of each residue. Analysis of individual attention heads identified motifs linked to different PTMs, including motifs for lysine acetylation and for kinase families, which align with previously validated experimental data. This attention-based analysis lets PTMGPT2 uncover intricate sequence patterns and preferences, offering insight into the underlying mechanisms of protein modification.
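The sketch below shows one simple way to pull final-layer attention out of a HuggingFace causal LM and aggregate it per token; the normalization into a per-position weight vector is a simplified stand-in for the paper's PSPM construction, not its exact procedure:

```python
# Sketch of extracting final-layer attention scores from a GPT-2-style
# protein model and aggregating them per token. The normalization into a
# per-position weight vector is an illustration, not the paper's exact
# PSPM construction. Note that BPE tokens may span several residues.
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("nferruz/ProtGPT2")
model = AutoModelForCausalLM.from_pretrained("nferruz/ProtGPT2", output_attentions=True)

inputs = tokenizer("AGLKVEENSLKAQQLVESEMK", return_tensors="pt")
with torch.no_grad():
    out = model(**inputs)

# out.attentions holds one tensor per layer, shaped (batch, heads, seq, seq).
final_layer = out.attentions[-1][0]             # (heads, seq, seq)
received = final_layer.mean(dim=0).sum(dim=0)   # attention each token receives
weights = received / received.sum()             # normalize to a distribution
for tok, w in zip(tokenizer.convert_ids_to_tokens(inputs["input_ids"][0]), weights):
    print(f"{tok}\t{w:.3f}")
```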
PTMGPT2 also identifies mutation hotspots around phosphoserine sites in key genes such as TP53, BRAF, and RAF1, a capability that is crucial for understanding PTM-related mutations and their impact on protein function. In an analysis of the TP53 gene, PTMGPT2 highlighted mutation patterns near phosphosites that are consistent with dbSNP data [5], indicating significant mutation clusters in this tumor suppressor protein.
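Conceptually, hotspot detection of this kind amounts to asking how many known variants fall near each predicted site. A toy sketch with invented coordinates (not actual TP53 positions) illustrates the idea:

```python
# Toy sketch of flagging mutation clusters near phosphosites: count variant
# positions that fall within +/-5 residues of each predicted phosphoserine.
# All positions below are illustrative, not actual TP53 coordinates.
phosphosites = [15, 33, 46]            # predicted phosphoserine positions
variants = [12, 14, 17, 33, 35, 90]    # variant residue positions (e.g., from dbSNP)

WINDOW = 5
for site in phosphosites:
    nearby = [v for v in variants if abs(v - site) <= WINDOW]
    print(f"phosphosite S{site}: {len(nearby)} nearby variants -> {nearby}")
```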
GPT models have revolutionized NLP through their transformer architecture and pre-training methods, driving advances across a wide range of tasks. PTMGPT2 brings these models to protein PTM site prediction by reformulating classification as label generation, outperforms existing methods on most PTM types, and sets the stage for future work on refining prompt designs.
References: