Behind the Paper

Exploring the sequence-function space of microbial fucosidases: a dive into the application of protein language models to CAZymes

Published in Cell & Molecular Biology

Jun 25, 2024

Haiyang Wu and Nathalie Juge


A protein sequence is an ordered set of amino acids (like a person’s ID card number) that reveals secrets about the structure and function of the protein (akin to personality). Carbohydrate-active enzymes (CAZymes) are responsible for the synthesis, breakdown and modification of all carbohydrates on earth. CAZymes are classified into families based on amino acid sequence homology (www.cazy.org). Among them, glycoside hydrolase family 29 (GH29) contains α-L-fucosidases that cleave the non-reducing terminal α-L-fucose from fucosylated oligosaccharides and glycoconjugates commonly found in mammalian, insect, microbial and plant glycans. There is therefore great interest in studying the function of these enzymes, both to advance our understanding of microbe-host interactions and for biotechnological or biomedical applications. Furthermore, the tens of thousands of GH29 sequences covering broad substrate preferences make this family a good target for investigating protein sequence-function relationships and, ultimately, predicting function and substrate specificity directly from the amino acid sequence.

The similarity between two protein sequences often reflects how closely related their functions are. Various methods have been developed to analyse sequence similarity, for example phylogenetic approaches, sequence similarity networks (SSNs), or conserved unique peptide patterns (CUPP), in which protein sequences are clustered based on sequence similarity and visualized as trees, networks, or other formats. Among them, the SSN representation is convenient when dealing with thousands of protein sequences. However, given the pace at which new protein sequences accumulate, analysing tens of thousands or more protein sequences at a time is labour-intensive. Moreover, since these methods are based on sequence alignment, any newly discovered protein sequence requires the alignment process to be started all over again to evaluate its closeness to known protein sequences.
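To make the network idea concrete, here is a minimal sketch of how an SSN can be turned into clusters: sequences are linked when their pairwise similarity meets a threshold, and each connected component becomes a cluster. The similarity function, the toy sequences, the names `identity` and `ssn_clusters`, and the 0.8 threshold are all illustrative assumptions; real SSNs are built from alignment scores, not from this position-by-position comparison.

```python
from itertools import combinations

def identity(a: str, b: str) -> float:
    """Toy similarity: fraction of identical positions in two equal-length,
    pre-aligned sequences (a stand-in for a real alignment score)."""
    if len(a) != len(b):
        raise ValueError("toy metric expects pre-aligned sequences of equal length")
    return sum(x == y for x, y in zip(a, b)) / len(a)

def ssn_clusters(seqs: dict, threshold: float) -> list:
    """Link sequences whose similarity meets the threshold, then return
    the connected components of the resulting network as clusters."""
    parent = {name: name for name in seqs}

    def find(x: str) -> str:
        # Follow parent links to the component root, shortening the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # Every pair is compared, which is why the cost grows quickly with dataset size.
    for a, b in combinations(seqs, 2):
        if identity(seqs[a], seqs[b]) >= threshold:
            parent[find(a)] = find(b)

    clusters = {}
    for name in seqs:
        clusters.setdefault(find(name), set()).add(name)
    return sorted(clusters.values(), key=lambda c: sorted(c))

# Hypothetical sequences, invented for illustration only.
toy = {
    "seqA": "MKTAYIAKQR",
    "seqB": "MKTAYIAKQK",
    "seqC": "MKTGYIAKQK",
    "seqD": "GSHWLPEDNV",
}
clusters = ssn_clusters(toy, threshold=0.8)
print(clusters)
```

Note that adding a fifth sequence means comparing it against all existing ones and rebuilding the components, which is the scaling issue described above.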

This prompted us to turn towards artificial intelligence (AI) and machine-learning models. The fast development of natural language processing led to the emergence of protein language models (pLMs), which convert raw sequence information, which is not directly computable, into numerical vector representations. A good pLM is expected to extract as much information as possible from the original protein sequence in a computationally cost-effective way. The effectiveness of a pLM is usually evaluated on downstream learning tasks, such as structure and function prediction. Using pLMs for protein function prediction requires a mathematical representation of the fine specificities of the enzyme, which can be difficult for substrate preferences given the complexity and diversity of the substrates used for biochemical characterization.
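As a rough illustration of what "sequence to vector" means, the sketch below maps sequences of any length to a fixed-length numerical vector. It uses plain amino acid composition, which is a deliberately simple stand-in and not a pLM: a real pLM learns context-dependent representations with a neural network. The function name `composition_vector` and the example sequences are hypothetical.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues

def composition_vector(seq: str) -> list:
    """Map a sequence of any length to a 20-dimensional vector of residue
    frequencies. Toy stand-in for a pLM embedding (assumption, not the
    representation used in the paper)."""
    seq = seq.upper()
    counts = [seq.count(aa) for aa in AMINO_ACIDS]
    total = sum(counts)
    if total == 0:
        raise ValueError("sequence contains no standard amino acids")
    return [c / total for c in counts]

# Two hypothetical sequences of different lengths give vectors of the same size.
short_vec = composition_vector("MKTAYIAK")
long_vec = composition_vector("MKTAYIAKQRQISFVKSHFSRQ")
print(len(short_vec), len(long_vec))
```

Unlike this toy vector, which discards residue order entirely, pLM embeddings are expected to retain information about order and context, which is what makes them useful for structure and function prediction.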

In this paper, we selected 11 novel GH29 fucosidase sequences, which were analysed by SSN along with 85 previously functionally characterized GH29 fucosidases. We then enzymatically characterized the novel GH29 fucosidases and found that the substrate specificity of these enzymes was in line with their SSN cluster allocations. Notably, we determined the structural basis for the specificity of a GH29 fucosidase from Bifidobacterium asteroides towards α1,6 linkages and FA2G2 N-glycan.

Next, we explored the idea of using the allocated SSN cluster IDs as labels for enzyme fine specificity. Based on this, a downstream classification task was set up to evaluate the performance of different pLMs in cluster assignment. We compared two state-of-the-art pLMs, ESM-2 and ProtT5-XL-U50, with GH29BERT, a self-supervised pre-trained model that is 30 times smaller, and found that the two large-scale pLMs achieved >99% accuracy, compared with 98% for GH29BERT. The highest accuracy of 99.64% was achieved by ProtT5-XL-U50, which was then used for cluster assignment of the 34,258 non-redundant GH29 sequences collected so far, providing a function map across this family.
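The general shape of such a cluster-assignment task can be sketched as follows: vectors labelled with cluster IDs are used to fit a classifier, accuracy is measured on held-out examples, and a new vector is then assigned to a cluster without any realignment. The nearest-centroid classifier, the two-dimensional vectors, and the names `fit_centroids`, `predict` and `accuracy` are assumptions chosen for brevity; they are not the classification head or the embeddings used in the paper.

```python
from collections import defaultdict
from math import dist

def fit_centroids(vectors, labels):
    """Average the training vectors of each cluster label into one centroid."""
    grouped = defaultdict(list)
    for vec, label in zip(vectors, labels):
        grouped[label].append(vec)
    return {
        label: [sum(col) / len(vecs) for col in zip(*vecs)]
        for label, vecs in grouped.items()
    }

def predict(centroids, vec):
    """Assign a vector to the cluster whose centroid is closest (Euclidean)."""
    return min(centroids, key=lambda label: dist(centroids[label], vec))

def accuracy(centroids, vectors, labels):
    """Fraction of held-out vectors assigned to their true cluster."""
    hits = sum(predict(centroids, v) == y for v, y in zip(vectors, labels))
    return hits / len(labels)

# Made-up 2-D "embeddings" with hypothetical cluster labels.
train_x = [[0.0, 0.1], [0.2, 0.0], [1.0, 0.9], [0.9, 1.1]]
train_y = ["cluster_1", "cluster_1", "cluster_2", "cluster_2"]
test_x = [[0.1, 0.2], [1.1, 1.0], [0.3, 0.1]]
test_y = ["cluster_1", "cluster_2", "cluster_1"]

centroids = fit_centroids(train_x, train_y)
test_accuracy = accuracy(centroids, test_x, test_y)

# A newly discovered sequence only needs to be embedded and compared to the centroids.
new_assignment = predict(centroids, [0.95, 0.8])
print(test_accuracy, new_assignment)
```

The point of the design is that the expensive step (learning from labelled examples) happens once, after which each new sequence is assigned independently of all the others.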

The advantage of this approach, considering the continuing expansion of the GH29 family, is that any newly discovered GH29 sequence can be assigned to an SSN cluster via a user-friendly interface, which we have made available at https://huggingface.co/spaces/Oiliver/GH29BERT.

Future studies investigating what the GH29BERT model “learned” during task training would help identify the amino acid fingerprints of substrate specificity for each SSN cluster, deepening our understanding of the sequence-function relationship within the GH29 family. Furthermore, expanding this combined SSN/pLM approach to other CAZyme families will greatly help define the substrate specificity of the CAZymes that underpin the fine-tuned carbohydrate metabolic capacity of microbes and microbial communities.


Communications Chemistry


An open access journal from Nature Portfolio publishing high-quality research, reviews and commentary in all areas of the chemical sciences.
