PTMGPT2: An Interpretable Protein Language Model for Enhanced Post-Translational Modification Prediction

PTMGPT2 uses a GPT-based architecture and prompt-based fine-tuning to predict post-translational modifications. It outperforms existing methods across 19 PTM types while offering interpretability and mutation analysis, advancing the understanding of protein function and disease research.

Proteins are essential for cellular functions, and their activity is regulated by over 400 types of post-translational modifications (PTMs) [1], expanding the human proteome to over a million unique species from about 20,000 genes [2]. Utilizing Natural Language Processing (NLP) techniques, particularly generative pre-trained transformers (GPT) models, can help decode and predict the intricate patterns of PTMs, merging computational linguistics with molecular biology.

PTMGPT2 utilizes a prompt-based approach to predict post-translational modifications (PTMs) in protein sequences. The framework fine-tunes the ProtGPT2 model [3] with informative prompts, which expand the model's vocabulary and guide it toward generating accurate sequence labels. During training, PTMGPT2 learns token relationships in an unsupervised manner; during inference, it predicts labels by filling in the blanks within the prompts. This design preserves the biological relevance of the protein sequences and thereby sustains high prediction accuracy (Fig. 1).
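The fill-in-the-blank idea above can be sketched in a few lines. The exact PTMGPT2 prompt template is not reproduced here, so the field names (`SEQUENCE:`, `LABEL:`) and label tokens below are illustrative assumptions, not the published format:

```python
def build_training_prompt(subsequence: str, is_modified: bool) -> str:
    """Pair a protein subsequence with its PTM label in one prompt,
    so the decoder learns to generate the label token after the sequence.
    Field names and label tokens are hypothetical placeholders."""
    label = "POSITIVE" if is_modified else "NEGATIVE"
    return f"SEQUENCE: {subsequence} LABEL: {label}"


def build_inference_prompt(subsequence: str) -> str:
    """At inference the label slot is left blank; the fine-tuned model
    fills it in by generating the next token(s)."""
    return f"SEQUENCE: {subsequence} LABEL:"


# A lysine-centered window around a candidate modification site:
prompt = build_training_prompt("MKTAYIAKQR", is_modified=True)
```

Reformulating classification as label generation in this way lets the same decoder-only objective serve both pre-training and the downstream PTM task.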

Fig. 1 Schematic representation of the PTMGPT2 framework. (A) Preparation of inputs for PTMGPT2. (B) Method-specific data preparation process for benchmarking. (C) Architecture of the PTMGPT2 model and the training and inference processes.

PTMGPT2’s performance is benchmarked against the dbPTM database [4], which covers a broad spectrum of experimentally verified PTMs. Using 19 distinct dbPTM datasets, each with at least 500 data points, PTMGPT2’s ability to identify modified amino acid residues was thoroughly assessed. The comparative analysis, with the Matthews Correlation Coefficient (MCC) as the metric, showed improvements across PTM types: PTMGPT2 achieved a 7.94% improvement for lysine succinylation, 5.91% for lysine sumoylation, and 12.74% for arginine methylation, and it performed robustly across the remaining PTMs, often surpassing existing methods. These results establish PTMGPT2 as a leading tool for PTM site prediction in proteomics research.
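For reference, the MCC metric used in the benchmark reduces to a single formula over the binary confusion matrix; the counts in the example are made-up illustrative numbers, not results from the paper:

```python
import math


def matthews_corrcoef(tp: int, tn: int, fp: int, fn: int) -> float:
    """Matthews Correlation Coefficient from a binary confusion matrix.
    Ranges from -1 (total disagreement) through 0 (random) to +1 (perfect),
    and stays informative even on imbalanced PTM datasets."""
    num = tp * tn - fp * fn
    den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return num / den if den else 0.0


# Hypothetical confusion matrix for one PTM test set:
mcc = matthews_corrcoef(tp=420, tn=430, fp=80, fn=70)
```

MCC is a common choice here because PTM datasets are typically dominated by unmodified residues, where plain accuracy would be misleading.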

PTMGPT2 uses an attention-driven framework to identify critical sequence determinants of protein modification by extracting attention scores from its final decoder layer. This mechanism provides a granular view of the model's focus on specific amino acids or motifs, summarized in a Position Specific Probability Matrix (PSPM) that reveals the importance of each residue. Analysis of individual attention heads uncovered motifs linked to different PTMs, including motifs for lysine acetylation and for kinase families, which align with previously validated experimental data. This attention-based analysis enables PTMGPT2 to uncover intricate patterns and preferences in protein sequences, offering insight into the underlying mechanisms of protein modification.
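One way to turn per-position attention scores into a PSPM is to distribute each position's attention mass over the amino acid observed there, then normalize per position. This is a minimal sketch under assumed inputs (a pre-extracted attention matrix), not the paper's exact aggregation procedure:

```python
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def position_probability_matrix(attn_scores, sequences, alphabet=AMINO_ACIDS):
    """Build a Position Specific Probability Matrix (PSPM) from attention.

    attn_scores: (n_sequences, seq_len) array of attention weights,
                 e.g. taken from a model's final decoder layer.
    sequences:   equal-length strings over `alphabet`.
    Returns a (len(alphabet), seq_len) matrix whose columns each sum to 1,
    so column j is a probability distribution over residues at position j.
    """
    idx = {aa: i for i, aa in enumerate(alphabet)}
    seq_len = attn_scores.shape[1]
    pspm = np.zeros((len(alphabet), seq_len))
    for scores, seq in zip(attn_scores, sequences):
        for pos, aa in enumerate(seq):
            # Credit this position's attention to the residue seen there.
            pspm[idx[aa], pos] += scores[pos]
    pspm /= pspm.sum(axis=0, keepdims=True)  # normalize each column
    return pspm
```

Columns with a sharply peaked distribution then correspond to conserved, high-attention residues, i.e. candidate motif positions.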

PTMGPT2 effectively identifies mutation hotspots around phosphoserine sites in key genes like TP53, BRAF, and RAF1, crucial for understanding PTM-related mutations and their impact on protein function. Analyzing the TP53 gene, PTMGPT2 highlighted mutation patterns near phosphosites, consistent with dbSNP data [5], indicating significant mutation clusters in this tumor suppressor protein.
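A simple proxy for this hotspot analysis is to compare per-residue PTM probabilities between a wild-type and a mutated sequence and flag large shifts. The probabilities here would come from a model such as PTMGPT2; the values and the threshold below are illustrative placeholders, not data from TP53 or dbSNP:

```python
def flag_mutation_hotspots(wt_probs, mut_probs, threshold=0.3):
    """Return positions where a mutation shifts the predicted PTM
    probability by more than `threshold` — a toy stand-in for locating
    mutation hotspots around phosphosites."""
    return [i for i, (w, m) in enumerate(zip(wt_probs, mut_probs))
            if abs(w - m) > threshold]


# Hypothetical per-residue phosphosite probabilities; the mutation
# collapses the signal at position 1:
wt = [0.05, 0.92, 0.10, 0.88]
mut = [0.06, 0.15, 0.11, 0.85]
hotspots = flag_mutation_hotspots(wt, mut)  # [1]
```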

GPT models have revolutionized NLP with their transformer architecture and pre-training methods, leading to significant advancements in various tasks. PTMGPT2 leverages these models for accurate protein PTM site prediction by reformulating classification as label generation. It outperforms existing methods in most PTM types and sets the stage for future work refining prompt designs.

References:

[1] Hong, X. et al. PTMint database of experimentally verified PTM regulation on protein-protein interaction. Bioinformatics 39, btac823 (2023).
[2] Pray, L. Eukaryotic genome complexity. Nature Education 1, 96 (2008).
[3] Ferruz, N., Schmidt, S. & Höcker, B. ProtGPT2 is a deep unsupervised language model for protein design. Nat Commun 13, 4348 (2022).
[4] Li, Z. et al. dbPTM in 2022: an updated database for exploring regulatory networks and functional associations of protein post-translational modifications. Nucleic Acids Res 50, D471–D479 (2022).
[5] Sherry, S. T. et al. dbSNP: the NCBI database of genetic variation. Nucleic Acids Res 29, 308–311 (2001).


