Understanding the biological functions of proteins and their complex interaction pathways is key to developing effective therapeutics to counteract disease. These functions are a direct consequence of protein structures and conformational changes over time; therefore, we require efficient tools to characterize these properties. Some experimental techniques, such as single-molecule Förster Resonance Energy Transfer spectroscopy, can measure properties determined by conformational averages, yet they cannot provide a detailed microscopic description of the conformational distribution of the system. In contrast, structure determination methods, such as X-ray crystallography, can often only resolve a static structure of a protein in its most stable state. If we could better characterize the full conformational distributions of protein systems, including transitions and interactions with other proteins, we could improve the treatment of many types of diseases.
Computational techniques are an effective way to bridge the gap between experimental observables and the underlying dynamics of biomolecular systems. Recent computational breakthroughs in structure prediction using AI models such as AlphaFold [1] are still limited to predicting stable structures and do not provide information on protein dynamics; however, these predicted conformations, along with experimentally determined protein structures, can be used as input to molecular dynamics (MD) simulation techniques. By modeling protein dynamics at atomistic resolution, MD can accurately characterize protein motions, including conformational changes and interactions with other biological molecules. It does so by computing energies and forces between the system’s atoms using classical interaction potentials that have been developed and refined over the last 50 years. While these methods can yield accurate results for many systems, they are computationally expensive and limited to time- and length-scales much shorter than those relevant to biological functions. Valuable insights can sometimes be gained from small (up to millions of atoms) protein models simulated on short (up to milliseconds) timescales, but many biologically relevant problems require characterizing protein systems on larger length- and longer time-scales.
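For reference, classical atomistic force fields typically take a simple additive functional form, shown here schematically (the exact terms and parameters differ between force fields and are not specified in the article):

\[
U(\mathbf{r}) = \sum_{\text{bonds}} k_b (b - b_0)^2
+ \sum_{\text{angles}} k_\theta (\theta - \theta_0)^2
+ \sum_{\text{dihedrals}} k_\phi \left[1 + \cos(n\phi - \delta)\right]
+ \sum_{i<j} \left[ 4\varepsilon_{ij}\!\left(\frac{\sigma_{ij}^{12}}{r_{ij}^{12}} - \frac{\sigma_{ij}^{6}}{r_{ij}^{6}}\right) + \frac{q_i q_j}{4\pi\varepsilon_0 r_{ij}} \right],
\]

with the forces on the atoms given by \( \mathbf{F} = -\nabla_{\mathbf{r}} U(\mathbf{r}) \). Evaluating the nonbonded sums over all atom pairs at every femtosecond-scale time step is a large part of what makes atomistic MD so expensive.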
There is increasing evidence that not every atom is essential in determining the long-timescale properties of many biomolecular systems [2]. As such, we can greatly simplify our representation of a protein of interest by grouping atoms together into “beads” in a method known as coarse-grained (CG) modeling. Owing to their simplified representation, CG models are orders of magnitude faster than atomistic MD, and can thus be used to simulate much larger systems on longer timescales. Yet, to reproduce the same results, a simplified CG model must be described by an interaction potential significantly more complex than the atomistic one. While each unit in atomistic MD is a single atom, in a CG model the beads consist of multiple atoms grouped together; this reduced representation involves “integrating out” atomistic details, leading to the emergence of multi-body interactions between the CG beads. These multi-body interactions are not straightforward to model using physical intuition, despite decades of research in the field. In fact, while many models have been proposed to study specific systems, a reliable, general-purpose CG model for the efficient simulation of large biomolecules has, up to now, been missing.
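A standard way to see why multi-body terms are unavoidable (the notation here is ours, not taken from the article) is the many-body potential of mean force: up to an additive constant, the CG potential that exactly reproduces the atomistic equilibrium distribution is

\[
U_{\mathrm{CG}}(\mathbf{R}) = -k_B T \,\ln \int e^{-U_{\mathrm{at}}(\mathbf{r})/k_B T}\,\delta\!\big(M(\mathbf{r}) - \mathbf{R}\big)\, d\mathbf{r},
\]

where \( U_{\mathrm{at}} \) is the atomistic potential and \( M \) maps an atomistic configuration \( \mathbf{r} \) onto the CG bead coordinates \( \mathbf{R} \). Because the integral couples all of the degrees of freedom that are averaged out, \( U_{\mathrm{CG}} \) generally does not decompose into pairwise bead–bead terms, even when \( U_{\mathrm{at}} \) does.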
Previous work from our group has shown that it is possible to express a CG energy interaction potential with a deep neural network [3,4]; neural networks are known as “universal function approximators” and can be used to model functions such as the multi-body energy potential describing a CG protein model to a high degree of accuracy. In this article, we show that, using state-of-the-art machine-learning tools such as graph neural networks, the multi-body CG interactions can be learned and a chemically transferable CG protein model can be defined. The CG model is trained on a large dataset of atomistic simulation data and can be used to accurately reproduce the conformational distributions of proteins outside of our training set. This work represents the first successful instance of sequence transferability for an ML-based CG protein model.
The success of our model can be attributed to a few key factors: the selection of training proteins to encompass a wide array of sequences and structure types, so that the model can extrapolate to unseen proteins; the addition of a prior energy function to enforce physical constraints; and the choice of the neural network used as the underlying potential, with an optimized set of hyperparameters.
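To illustrate how these ingredients fit together, the following is a minimal, hypothetical PyTorch-style sketch (not the authors' code; all names are placeholders) of a CG potential that adds a learnable graph-neural-network term to a fixed physics-based prior and is trained by force matching against forces mapped from atomistic simulations, the training strategy used in the group’s earlier CGnet work [3,4]:

```python
import torch

class CGPotential(torch.nn.Module):
    """CG energy = fixed physics-based prior + learnable neural-network correction."""
    def __init__(self, prior_energy, net):
        super().__init__()
        self.prior_energy = prior_energy  # e.g. harmonic bonds + excluded-volume repulsion
        self.net = net                    # e.g. a graph neural network over the CG beads

    def forward(self, coords):            # coords: (n_frames, n_beads, 3)
        return self.prior_energy(coords) + self.net(coords)

def force_matching_loss(model, coords, ref_forces):
    """Match predicted CG forces, -dU/dR, to reference forces from atomistic data."""
    coords = coords.detach().requires_grad_(True)
    energy = model(coords).sum()
    pred_forces = -torch.autograd.grad(energy, coords, create_graph=True)[0]
    return ((pred_forces - ref_forces) ** 2).mean()
```

Here prior_energy and net are placeholders for the two components described above: the prior keeps simulations out of unphysical configurations (e.g. overstretched bonds or steric clashes) in regions poorly covered by the training data, while the network learns the remaining multi-body terms.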
The capabilities of the CG model are demonstrated by accurately reproducing the energy landscape and metastable states of a set of small and medium-sized proteins and by capturing the conformational distribution of a disordered protein consistent with its NMR-resolved structures. The model also succeeds in folding a disordered peptide upon interaction with its correct binding partner, and in accurately predicting the energetic effects of mutations in a protein outside of the training dataset. Altogether, this study provides an exciting advance in the development of a “universal” CG model for proteins and validates the use of machine-learned potentials in the field of biomolecular simulation.
Link to the article: Charron et al., “Navigating Protein Landscapes with a Machine-Learned Transferable Coarse-Grained Model”. https://www.nature.com/articles/s41557-025-01874-0
Contact: Cecilia Clementi
Freie Universität Berlin
Email: cecilia.clementi@fu-berlin.de
1. J. Jumper et al. Nature 596.7873 (2021), pp. 583–589.
2. F. Noé and C. Clementi. Curr. Opin. Struct. Biol. 43 (2017), pp. 141–147.
3. J. Wang et al. ACS Cent. Sci. 5.5 (2019), pp. 755–767.
4. B. E. Husic et al. J. Chem. Phys. 153.19 (2020), p. 194101.