Can protein language models reveal the physical rules of protein structural emergence?

Turning protein language model spaces into laboratories for scientific (biological) discovery.

Universal physical principles govern the deterministic genesis of protein structure

ProtGenesis: A unified methodological framework for quantifying protein genesis in structural space

We recently posted a preprint introducing ProtGenesis:

https://www.biorxiv.org/content/10.64898/2026.02.20.706798v1

ProtGenesis is a framework that mirrors biological processes in protein language model embedding space and uses mathematical and physical analysis to extract interpretable rules of protein genesis and structural emergence.

Core idea: Biological processes such as amino acid assembly, sequence elongation, mutation and design can be represented as quantitative trajectories in a structure-aware protein language model embedding space. By analysing the geometry of these trajectories, ProtGenesis turns latent AI representations into a searchable space for discovering biological rules, moving protein space from an abstract continuum toward a measurable, principle-governed system.

  1. This assembly–perturbation strategy offers a new route to AI interpretability, because every movement in embedding space corresponds to a defined biological operation (the same approach could be extended to other scientific areas).

  2. As a demonstration, we identify three organizing principles of protein structural emergence: Assembly, Emergence and Phase Transition. These principles suggest that protein genesis is not a random walk, but a constrained and measurable process with interpretable coordinates.

  3. More broadly, we hope ProtGenesis stimulates discussion on a larger question: can AI models serve not only as predictors, but also as computational laboratories for scientific discovery?
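To make the core idea concrete, here is a minimal sketch of how sequence elongation and point mutation could be turned into measurable trajectories in an embedding space. It is not the preprint's implementation: `toy_embed` is a hypothetical placeholder that stands in for a structure-aware protein language model, and the geometric summaries (step lengths, path length, net displacement, straightness) are generic illustrations, not the tripartite metrics defined in the paper.

```python
import math
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
DIM = 8  # toy embedding dimension (assumption, chosen for readability)

# ASSUMPTION: fixed random vectors per residue. This is a placeholder,
# NOT a protein language model; swap in a real embedding function to use it.
_rng = random.Random(0)
_RESIDUE_VECTORS = {
    aa: [_rng.gauss(0.0, 1.0) for _ in range(DIM)] for aa in AMINO_ACIDS
}


def toy_embed(sequence):
    """Hypothetical embedding: position-weighted mean of residue vectors."""
    weights = [1.0 / (i + 1) for i in range(len(sequence))]
    total = sum(weights)
    return [
        sum(w * _RESIDUE_VECTORS[aa][k] for w, aa in zip(weights, sequence)) / total
        for k in range(DIM)
    ]


def elongation_trajectory(sequence, embed=toy_embed):
    """Embed every prefix of the sequence: one point per added residue."""
    return [embed(sequence[:n]) for n in range(1, len(sequence) + 1)]


def trajectory_geometry(points):
    """Summarise a trajectory with simple Euclidean quantities."""
    steps = [math.dist(a, b) for a, b in zip(points, points[1:])]
    path_length = sum(steps)
    net_displacement = math.dist(points[0], points[-1])
    # Straightness is 1.0 for a straight line and approaches 0 for a tangled path.
    straightness = net_displacement / path_length if path_length > 0 else 0.0
    return {
        "step_lengths": steps,
        "path_length": path_length,
        "net_displacement": net_displacement,
        "straightness": straightness,
    }


def mutation_displacement(sequence, position, new_residue, embed=toy_embed):
    """Distance moved in embedding space by a single point mutation."""
    mutant = sequence[:position] + new_residue + sequence[position + 1:]
    return math.dist(embed(sequence), embed(mutant))


if __name__ == "__main__":
    seq = "MKTAYIAKQR"  # arbitrary example peptide
    geometry = trajectory_geometry(elongation_trajectory(seq))
    print("steps per added residue:", geometry["step_lengths"])
    print("straightness:", geometry["straightness"])
    print("A4W displacement:", mutation_displacement(seq, 3, "W"))
```

Because every point in the trajectory comes from a named operation (adding one residue, substituting one residue), each step length can be read as the effect of that operation. With a real model in place of `toy_embed`, an unusually large step would be a candidate for the kind of abrupt shift the Phase Transition principle describes.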

Abstract

The origin of functional proteins remains a fundamental biological enigma. Although Anfinsen’s dogma established sequence as the determinant of structure, and deep learning models can predict structures with high fidelity, the physical principles governing protein genesis itself, from prebiotic condensation to functional protein emergence, remain unresolved. This gap leaves a critical disconnect between mechanistic biological insights and artificial intelligence. Herein, we introduce a unified methodological framework, ProtGenesis, that recasts protein genesis as a structured, deterministic navigation within a discrete structural space. We identify three universal principles governing this hierarchical organization: the Assembly Principle directs amino acid condensation into multilayer, fractal-like architectures; the Emergence Principle ensures that nascent peptides emerge along deterministic spatial trajectories; and the Phase-Transition Principle describes how incremental residue accrual or mutation drives precise topological phase shifts from short-range to long-range order. By quantifying these trajectories with novel tripartite spatial metrics, we reveal that protein genesis is not an abstract continuum but a principle-governed physical process with measurable coordinates. ProtGenesis thus provides a universal, interpretable mathematical foundation for decoding the “black box” of deep learning models and establishes a rigorous basis for exploring, understanding, and engineering the molecular blueprint of life.
