Proteins are essential to cellular function, and their activity is regulated by more than 400 types of post-translational modifications (PTMs) [1], which expand the human proteome to over a million unique protein species from roughly 20,000 genes [2]. Natural Language Processing (NLP) techniques, particularly generative pre-trained transformer (GPT) models, can help decode and predict the intricate patterns of PTMs, merging computational linguistics with molecular biology.
PTMGPT2 takes a prompt-based approach to predicting PTMs in protein sequences. The framework fine-tunes the PROTGPT2 model [3] on informative prompts, expanding the model's vocabulary with custom tokens so that it can generate sequence labels directly. During training, PTMGPT2 learns token relationships in an unsupervised manner; at inference, it predicts labels by filling in the blanks within the prompts. Reformulating classification as generation in this way preserves the biological context of the protein sequence and yields high prediction accuracy (Fig. 1).
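As a rough illustration of this workflow, here is a minimal sketch of what prompt-based label generation with a GPT-2-style protein model could look like. The base checkpoint is the public ProtGPT2 release, but the prompt template and the custom label tokens are illustrative assumptions, not PTMGPT2's actual vocabulary:

```python
# Minimal sketch of prompt-based PTM label generation with a GPT-2-style
# protein language model. The prompt template and the custom tokens below
# are illustrative assumptions, not PTMGPT2's actual vocabulary.
from transformers import AutoTokenizer, AutoModelForCausalLM

base = "nferruz/ProtGPT2"  # the public base model that PTMGPT2 fine-tunes
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base)

# Hypothetical prompt tokens added to the vocabulary during fine-tuning.
tokenizer.add_tokens(["<startoftag>", "<endoftag>", "POSITIVE", "NEGATIVE"])
model.resize_token_embeddings(len(tokenizer))

# A 21-residue window centered on a candidate modification site (illustrative).
window = "AGLKVEENSLKAQQLVESEMK"
prompt = f"SEQUENCE: {window} LABEL: <startoftag>"

inputs = tokenizer(prompt, return_tensors="pt")
out = model.generate(
    **inputs,
    max_new_tokens=2,
    do_sample=False,
    pad_token_id=tokenizer.eos_token_id,
)
# After fine-tuning, the model would fill the blank with POSITIVE or NEGATIVE.
print(tokenizer.decode(out[0][inputs["input_ids"].shape[1]:]))
```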
PTMGPT2's performance is benchmarked against the dbPTM database [4], which covers a broad spectrum of experimentally verified PTMs. PTMGPT2's ability to identify modified amino acid residues was assessed thoroughly on 19 distinct dbPTM datasets, each containing at least 500 data points. The comparative analysis, with the Matthews Correlation Coefficient (MCC) as the metric, showed improvements across various PTM types: for instance, PTMGPT2 achieved a 7.94% improvement in lysine succinylation, 5.91% in lysine sumoylation, and 12.74% in arginine methylation, and it performed robustly across the other PTMs, often surpassing existing methods. These results establish PTMGPT2 as a leading tool for PTM site prediction in proteomics research.
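For reference, MCC is computed from the confusion matrix as MCC = (TP*TN - FP*FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN)), and ranges from -1 to 1. A small sketch with made-up labels, using scikit-learn's built-in implementation:

```python
# Sketch of the benchmark metric. MCC is computed from the confusion matrix:
#   MCC = (TP*TN - FP*FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN))
# The labels below are illustrative, not dbPTM benchmark outputs.
from sklearn.metrics import matthews_corrcoef

y_true = [1, 0, 1, 1, 0, 0, 1, 0]  # 1 = experimentally verified PTM site
y_pred = [1, 0, 1, 0, 0, 0, 1, 1]  # model predictions for the same residues
print(f"MCC: {matthews_corrcoef(y_true, y_pred):.3f}")
```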
PTMGPT2 also uses an attention-driven framework to identify the sequence determinants critical for protein modification, extracting attention scores from its final decoder layer. These scores provide a granular view of the model's focus on specific amino acids or motifs, summarized in a Position-Specific Probability Matrix (PSPM) that reveals the relative importance of each residue. Analysis of individual attention heads identified motifs linked to different PTMs, including motifs for lysine acetylation and for kinase families, which align with previously validated experimental data. This attention-based analysis lets PTMGPT2 uncover intricate sequence patterns and preferences, offering insight into the underlying mechanisms of protein modification.
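The sketch below shows one simple way to pull final-layer attention out of a HuggingFace causal LM and aggregate it per token; the normalization into a per-position weight vector is a simplified stand-in for the paper's PSPM construction, not its exact procedure:

```python
# Sketch of extracting final-layer attention scores from a GPT-2-style
# protein model and aggregating them per token. The normalization into a
# per-position weight vector is an illustration, not the paper's exact
# PSPM construction. Note that BPE tokens may span several residues.
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("nferruz/ProtGPT2")
model = AutoModelForCausalLM.from_pretrained("nferruz/ProtGPT2", output_attentions=True)

inputs = tokenizer("AGLKVEENSLKAQQLVESEMK", return_tensors="pt")
with torch.no_grad():
    out = model(**inputs)

# out.attentions holds one tensor per layer, shaped (batch, heads, seq, seq).
final_layer = out.attentions[-1][0]             # (heads, seq, seq)
received = final_layer.mean(dim=0).sum(dim=0)   # attention each token receives
weights = received / received.sum()             # normalize to a distribution
for tok, w in zip(tokenizer.convert_ids_to_tokens(inputs["input_ids"][0]), weights):
    print(f"{tok}\t{w:.3f}")
```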
PTMGPT2 also identifies mutation hotspots around phosphoserine sites in key genes such as TP53, BRAF, and RAF1, a capability that is crucial for understanding PTM-related mutations and their impact on protein function. In an analysis of the TP53 gene, PTMGPT2 highlighted mutation patterns near phosphosites that are consistent with dbSNP data [5], indicating significant mutation clusters in this tumor suppressor protein.
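Conceptually, hotspot detection of this kind amounts to asking how many known variants fall near each predicted site. A toy sketch with invented coordinates (not actual TP53 positions) illustrates the idea:

```python
# Toy sketch of flagging mutation clusters near phosphosites: count variant
# positions that fall within +/-5 residues of each predicted phosphoserine.
# All positions below are illustrative, not actual TP53 coordinates.
phosphosites = [15, 33, 46]            # predicted phosphoserine positions
variants = [12, 14, 17, 33, 35, 90]    # variant residue positions (e.g., from dbSNP)

WINDOW = 5
for site in phosphosites:
    nearby = [v for v in variants if abs(v - site) <= WINDOW]
    print(f"phosphosite S{site}: {len(nearby)} nearby variants -> {nearby}")
```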
GPT models have revolutionized NLP through their transformer architecture and pre-training methods, driving advances across a wide range of tasks. PTMGPT2 brings these models to protein PTM site prediction by reformulating classification as label generation, outperforms existing methods on most PTM types, and sets the stage for future work on refining prompt designs.
References: