Behind the Paper

From hundreds to thousands of genes: new advances in identifying small proteins in a minimal cell

Published in Microbiology, Protocols & Methods, and Cell & Molecular Biology

Mar 08, 2024

Samuel Miravet Verde

The function and structure of a living organism are determined by its proteome, the full set of its proteins. A proteome can be predicted from the genetic information stored in the organism's genome. The process starts with DNA, where genes are encoded in nucleotide sequences. Genes are transcribed into messenger RNAs (mRNAs), which act as templates for protein synthesis during translation, when amino acids are joined into a protein chain. Advances in sequencing technologies have transformed our ability to study whole genomes and decipher their genetic information, allowing researchers to identify protein-coding genes, predict their functions, and uncover genetic compositions relevant to traits or diseases.

The identification of gene sequences encoding proteins within a whole genome relies on the universal genetic code, where specific 3-nucleotide combinations, namely codons, encode protein starts and stops (e.g., AUG and UAG, respectively). Identifying these initiation and stop positions delineates the Open Reading Frames (ORFs) that correspond to protein-coding genes. Using simple probabilistic models, we can demonstrate that an ORF will likely encode a protein when it exceeds 100 amino acids in length. This assumption forms the basis of the initial step in every genome annotation process, where proteins are mapped. However, one might ask: what about proteins shorter than 100 amino acids? The answer: we typically disregard small ORFs (smORFs) to limit false-positive gene calls. smORFs are far more prevalent than regular ORFs (around 40 smORFs for each ORF), but they are much less often translated into proteins (Figure 1). Therefore, determining whether a given smORF encodes a small protein is far from straightforward.
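The intuition behind the 100-amino-acid rule of thumb can be illustrated with a back-of-the-envelope calculation. The sketch below is not the model used in the paper; it assumes a random sequence in which all 64 codons are equally likely and 3 of them are stops (the function name and the `n_stop_codons` parameter are illustrative). Real genomes deviate from this through nucleotide composition and, in some organisms, alternative genetic codes.

```python
def prob_open(n_codons: int, n_stop_codons: int = 3) -> float:
    """Probability that n_codons consecutive random codons contain no stop.

    Assumes every one of the 64 codons is equally likely (an
    illustrative simplification, not a genome-specific model).
    """
    p_sense = (64 - n_stop_codons) / 64  # chance that one codon is not a stop
    return p_sense ** n_codons


if __name__ == "__main__":
    # Longer stretches without a stop become rapidly less likely by chance.
    for length in (10, 30, 100, 300):
        print(f"{length:>4} codons without a stop: {prob_open(length):.2e}")
```

Under these assumptions, a stretch of 10 codons stays open by chance more often than not, while a stretch of 100 codons does so less than 1% of the time. This is why long ORFs are treated as likely genes and short ones are not.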

Nevertheless, although challenging to validate, smORFs can encode proteins. Initially identified serendipitously in screening assays, smORF-encoded proteins (SEPs) represent a large pool of little-understood molecules with diverse and relevant functions in cells. For example, in bacteria, SEPs play essential roles in processes such as sporulation, influx inhibition, photosynthesis, cell division, stress sensing, and antibiotic resistance. Additionally, secreted SEPs can contribute to communication and competition in microbial communities, synchronizing cellular reactions between microbes or acting as antimicrobial peptides to eliminate other individuals.

**Figure 1.** ORFs and smORFs found in ~5% of the genome of *Mycoplasma pneumoniae*. There are six possible reading frames in a genome (three per strand). While large ORFs (top) almost always encode proteins, there is a plethora of smORFs, of which only a small fraction will produce a small protein or SEP.
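The six-frame scan described in the caption can be sketched in a few lines of Python. This is a minimal illustration, not the annotation pipeline used in the study. It assumes the standard stop codons (TAA, TAG, TGA) and ATG as the only start, takes the first ATG after a stop as the ORF start, and ignores alternative start codons and organism-specific genetic codes. The function names and the 100-codon cut-off parameter are illustrative.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}  # assumption: standard genetic code
START_CODON = "ATG"                  # assumption: no alternative starts
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def find_orfs(seq: str, min_codons: int = 1):
    """Yield (strand, frame, start, end, n_codons) for each ATG...stop ORF.

    Coordinates are 0-based on the scanned strand; `end` is exclusive and
    includes the stop codon. `n_codons` counts sense codons only.
    """
    seq = seq.upper()
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            start = None
            for i in range(frame, len(s) - 2, 3):
                codon = s[i:i + 3]
                if start is None and codon == START_CODON:
                    start = i  # first ATG since the previous stop
                elif start is not None and codon in STOP_CODONS:
                    n_codons = (i - start) // 3
                    if n_codons >= min_codons:
                        yield (strand, frame, start, i + 3, n_codons)
                    start = None


def classify(n_codons: int, threshold: int = 100) -> str:
    """Label an ORF as 'smORF' if shorter than the threshold, else 'ORF'."""
    return "smORF" if n_codons < threshold else "ORF"


if __name__ == "__main__":
    for orf in find_orfs("ATGAAATAG"):
        print(orf, classify(orf[-1]))
```

Running such a scan over a real genome returns many more short hits than long ones, which is the imbalance Figure 1 depicts.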

While specific bioinformatic tools have been developed to predict which smORFs are translated into SEPs (increasing the number of genes by up to 40% depending on the study and organism), these proteins are often overlooked in genomic analysis due to a lack of experimental evidence. Advances in methods like mass spectrometry and ribosome profiling have enabled high-throughput analysis of proteomes. However, mass spectrometry relies on digesting proteins into unique smaller fragments, and small proteins yield few such fragments, which makes them hard to identify with certainty. Ribosome profiling, while useful, can be noisy and may struggle to identify SEPs translated from smORFs that overlap with larger ORFs.

Here we introduce ProTInSeq, a method that uses random transposon mutagenesis and sequencing (Tn-Seq) to explore a proteome of interest. In a regular Tn-Seq experiment, transposons are inserted at random positions in the genomes of a pool of cells, disrupting the sequences where they land. Living cells are then selected based on their growth, as those with a disrupted essential gene will disappear from the population. Notably, a gene's essentiality in an organism can change depending on various factors, such as the genetic background or the environment, for example during an infection. In ProTInSeq, we use special transposons carrying a reporter gene whose initiation codon is mutated, so that the reporter is only expressed when inserted in-frame with a protein-coding ORF (Figure 2).

**Figure 2.** At the cell level (top), a reporter is expressed independently in a Tn-Seq protocol. In contrast, its initiation codon is mutated in ProTInSeq, so that the reporter is only expressed when inserted in-frame with an endogenous protein. At the population level (bottom), individual transposition events occur in the population (orange cells denote insertion in-frame, while blue and purple cells denote insertion in non-coding frames). Only cells expressing the reporter are viable when growing with an antibiotic.
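The selection principle in Figure 2 comes down to a reading-frame check, which can be sketched as follows. This is a simplified illustration rather than the analysis code of the study. It assumes each insertion is recorded as a genomic coordinate plus an orientation, and that the reporter's frame begins exactly at that coordinate. A real transposon would add a fixed design-dependent offset, represented here by the hypothetical `reporter_offset` parameter. The `Orf` class and function name are also hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Orf:
    start: int   # 0-based coordinate on the + strand, inclusive
    end: int     # exclusive
    strand: str  # "+" or "-"


def insertion_in_frame(orf: Orf, position: int, strand: str,
                       reporter_offset: int = 0) -> bool:
    """Return True if a reporter inserted at `position` shares the ORF's frame.

    Requires the same orientation as the ORF and a distance from the ORF's
    first base that is a multiple of 3 (after the assumed offset).
    """
    if strand != orf.strand or not (orf.start <= position < orf.end):
        return False
    if orf.strand == "+":
        distance = position - orf.start
    else:
        # On the - strand the ORF's first base sits at end - 1.
        distance = (orf.end - 1) - position
    return (distance + reporter_offset) % 3 == 0


if __name__ == "__main__":
    orf = Orf(start=100, end=400, strand="+")
    hits = sum(insertion_in_frame(orf, p, s)
               for p in range(orf.start, orf.end) for s in "+-")
    total = 2 * (orf.end - orf.start)
    print(f"{hits} of {total} possible insertions are in-frame")
```

In this toy model only one in six possible insertions within a coding region (one frame out of three, one orientation out of two) would express the reporter. That sparsity is what makes surviving in-frame insertions informative about which frame is translated.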

In the minimal genome of the bacterium Mycoplasma pneumoniae, ProTInSeq identifies 80% of known proteins, along with 153 small proteins (SEPs) and 5 proteins larger than 100 amino acids that other methods missed. We also confirmed the reliability of our method by validating selected examples of SEPs expressed from regions previously thought to be non-coding, as well as SEPs overlapping with larger genes. Our method also revealed a higher proportion of SEPs with predicted functions (32%) compared to other experimental approaches, indicating its effectiveness in identifying valuable targets for further investigation. Notably, many SEPs show antimicrobial properties, suggesting their potential in addressing antibiotic resistance. Furthermore, ProTInSeq allows us to measure protein levels, uncover details about protein stability and half-life, and identify the topology of proteins in the cell membrane.

When we integrate our method with other computational and experimental techniques, we observe that the genome of M. pneumoniae, previously thought to encode 690 proteins, actually encodes 997 proteins, which represents a 43% increase in the number of coding genes (Figure 3). To put this in perspective, Escherichia coli, a commonly used bacterial model, has approximately 4,000 proteins in its proteome, while the human proteome is predicted to contain around 20,000 proteins. If the trends observed in M. pneumoniae extend to these organisms, the number of SEPs awaiting annotation in their proteomes could be in the thousands. Overall, future research on SEPs holds the potential to significantly expand our understanding of the functional capabilities of organisms and address challenges such as antibiotic resistance, while also fostering biotechnological innovation.
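To make "in the thousands" concrete, the extrapolation can be written out as simple arithmetic. The snippet below is a naive illustration only: it takes the 43% figure and the approximate proteome sizes stated above at face value and assumes the same relative increase applies to other organisms, which has not been shown. The function and variable names are illustrative.

```python
# Fractional increase in coding genes reported above for M. pneumoniae.
REPORTED_INCREASE = 0.43

# Approximate proteome sizes quoted in the text.
ANNOTATED_PROTEINS = {
    "Escherichia coli": 4_000,
    "human": 20_000,
}


def extrapolate_new_proteins(n_annotated: int, increase: float) -> int:
    """Naively scale the relative increase to another proteome size."""
    return round(n_annotated * increase)


if __name__ == "__main__":
    for organism, n in ANNOTATED_PROTEINS.items():
        extra = extrapolate_new_proteins(n, REPORTED_INCREASE)
        print(f"{organism}: roughly {extra} additional proteins (if the trend held)")
```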

**Figure 3.** Summary of the increase in both the number and location of genes within the genome of *M. pneumoniae* following the integration of ProTInSeq-identified SEPs, along with complementary experimental techniques and bioinformatics.

In summary, ProTInSeq represents a flexible, cost-effective alternative experimental method for exploring SEPs through DNA sequencing. This technique can be implemented in other living systems, providing a quantitative tool to characterize SEPs and the structural and physical parameters of a proteome of interest. We envision ProTInSeq as a tool to guide the discovery of novel protein sequences: applied under different experimental conditions, it can serve as the first step in prioritizing SEPs for further functional characterization. In an era where the number of available genome sequences grows exponentially, and considering the pivotal roles of SEPs, we believe ProTInSeq can lead to the discovery of novel SEPs with a direct impact on essential microbial processes and the health of animals and plants.

Samuel Miravet Verde (He/Him)

I am a computational biologist with expertise in microbiology and artificial intelligence-based bioinformatic approaches to understand complex microbiological genomic and functional aspects. I study the prevalence, function and importance of small proteins in biological systems such as microbes and microbiomes.

Nature Communications
