From hundreds to thousands of genes: new advances in identifying small proteins in a minimal cell

From hundreds to thousands of genes: new advances in identifying small proteins in a minimal cell

Share this post

Choose a social network to share with, or copy the shortened URL to share elsewhere

This is a representation of how your post may appear on social media. The actual post will vary between social networks

The function and structure of a living organism are determined by its proteome, which consists of its proteins. Predicting a proteome is possible through the genetic information stored in an organism's genome. This process starts with DNA, where genes are encoded in nucleotide sequences. Genes are transcribed into messenger RNAs (mRNAs), which act as templates for protein synthesis during translation, combining amino acids. Advances in sequencing technologies have transformed our ability to study whole genomes and decipher their genetic information, allowing researchers to identify protein-coding genes, predict their functions, and uncover genetic compositions relevant to traits or diseases.

The identification of gene sequences encoding proteins within a whole genome relies on the universal genetic code, where specific 3-nucleotide combinations, namely codons, encode protein starts and stops (e.g., AUG and UAG, respectively). Identifying these initiation and stop positions delineates protein Open Reading Frames (ORFs) corresponding to genes. Using simple probabilistic models, we can demonstrate that an ORF will likely encode a protein when it exceeds 100 amino acids in length. This assumption forms the basis of the initial step in every genome annotation process, where proteins are mapped. However, one might inquire: what about proteins shorter than 100 amino acids? The answer: we typically disregard small ORFs (smORFs) to mitigate false positive gene calls as smORFs are significantly more prevalent than regular ORFs, around 40 smORFs for each ORF, but more frequently not translated into proteins (Figure 1). Therefore, distinguishing whether a smORF will encode a small protein is far from straightforward.

Nevertheless, although challenging to validate, smORFs can encode proteins. Initially identified serendipitously in screening assays, smORF-encoded proteins (SEPs) represent a large pool of little-understood molecules with diverse and relevant functions in cells. For example, in bacteria, SEPs play essential roles in processes such as sporulation, influx inhibition, photosynthesis, cell division, stress sensing, and antibiotic resistance. Additionally, secreted SEPs can contribute to communication and competition in microbial communities, synchronizing cellular reactions between microbes or acting as antimicrobial peptides to eliminate other individuals.

Figure 1. ORFs and smORFs found in ~5% of the genome of Mycoplasma pneumoniae. There are six possible reading frames in a genome (three per strand). While large ORFs (top) almost always encode proteins, there is a plethora of smORFs, of which only a minimum fraction will produce a small protein or SEP.

While specific bioinformatic tools have been developed to predict which smORFs translate to SEPs (increasing the number of genes by up to 40% depending on the study and organism), these proteins are often overlooked in genomic analysis due to a lack of experimental evidence. Advances in methods like mass spectrometry and ribosome profiling have enabled high-throughput analysis of proteomes. However, mass spectrometry relies on digesting proteins into unique smaller fragments, which makes it challenging to identify smaller proteins with certainty. Ribosome profiling, while useful, can be noisy and may struggle to identify SEPs translated from smORFs that overlap with larger ones.

Here we introduce ProTInSeq, a method that uses random transposon mutagenesis and sequencing (Tn-Seq) to explore a proteome of interest. In a regular Tn-Seq experiment, a pool of cells undergoes a process where specific positions in the genome are randomly disrupted. Living cells are then selected based on their growth, as those with an affected essential gene will disappear from the population. Notably, a gene's essentiality in an organism can change depending on various factors, such as genetics or the environment, like during an infection. In ProTInSeq, we use special transposons carrying mutated genes that are only expressed when inserted in-frame to a protein-coding ORF (Figure 2).

Figure 2. At the cell level (top), a reporter is expressed independently in a Tn-Seq protocol. In contrast, its initiation codon is mutated in ProTInSeq, so that the reporter is only expressed when inserted in-frame with an endogenous protein. At the population level (bottom), individual transposition events occur in the population (orange cells denote insertion in-frame, while blue and purple cells denote insertion in non-coding frames). Only cells expressing the reporter are viable when growing with an antibiotic.

In the minimal genome of the bacterium Mycoplasma pneumoniae, ProTInSeq identifies 80% of known proteins, along with 153 small proteins (SEPs) and 5 proteins larger than 100 amino acids that other methods missed. We also confirmed the reliability of our method by selecting examples of SEPs expressed from regions previously thought to be non-coding, as well as those overlapping with larger genes. Our method also revealed a higher number of SEPs with predicted functions (32%) compared to other experimental approaches, indicating its effectiveness in identifying valuable targets for further investigation. Notably, many SEPs show antimicrobial properties, suggesting their potential in addressing antibiotic resistance. Furthermore, ProTInSeq allows us to measure protein levels, uncover details about their stability and half-life, and identify the topology of proteins in the cell membrane.

When we integrate our method with other computational and experimental techniques, we observe that the genome of M. pneumoniae, previously thought to encode 690 proteins, actually encodes 997 proteins, which comprises a 43% increase in the number of coding genes (Figure 3). To put this in perspective, Escherichia coli, a commonly used bacterial model, has approximately 4,000 proteins in its proteome, while a human's proteome is predicted to contain around 20,000 proteins. If the trends observed in M. pneumoniae extend to these organisms, the number of SEPs awaiting annotation in their proteomes could be in the thousands. Overall, future research on SEPs holds the potential to significantly enhance the functional capabilities of organisms and address challenges such as antibiotic resistance, while also fostering biotechnological innovation.

Figure 3. Summary of the increase in both the number and location of genes within the genome of M. pneumoniae following the integration of ProTInSeq-identified SEPs, along with complementary experimental techniques and bioinformatics.

In summary, ProTInSeq represents a flexible, cost-effective alternative experimental method for exploring SEPs through DNA sequencing. This technique can be implemented in other living systems, providing a quantitative tool to characterize SEPs and the structural and physical parameters of a proteome of interest. We envision ProTInSeq as a tool to guide the discovery of novel protein sequences, serving as the first step in prioritizing SEPs for further functional characterization by applying this method in different experimental conditions. In an era where available genome sequences grow exponentially, and considering the pivotal roles of SEPs, we believe ProTInSeq can lead to the discovery of novel SEPs with a direct impact on essential microbial processes and the health of animals and plants.

Read more about this in our recent publication entitled ProTInSeq: transposon insertion tracking by ultra-deep DNA sequencing to identify translated large and small ORFs | Nature Communications.

Please sign in or register for FREE

If you are a registered user on Research Communities by Springer Nature, please sign in

Subscribe to the Topic

Life Sciences > Biological Sciences > Chemical Biology > Biochemistry > Protein Biochemistry > Proteins
Life Sciences > Biological Sciences > Biological Techniques > Genomic Analysis > Sequencing
Bacterial Genes
Life Sciences > Biological Sciences > Microbiology > Bacteria > Bacterial Genetics > Bacterial Genes
Life Sciences > Biological Sciences > Microbiology
Genetics and Genomics
Life Sciences > Biological Sciences > Genetics and Genomics

Related Collections

With collections, you can get published faster and increase your visibility.

Applied Sciences

This collection highlights research and commentary in applied science. The range of topics is large, spanning all scientific disciplines, with the unifying factor being the goal to turn scientific knowledge into positive benefits for society.

Publishing Model: Open Access

Deadline: Ongoing

Cancer and aging

This cross-journal Collection invites original research that explicitly explores the role of aging in cancer and vice versa, from the bench to the bedside.

Publishing Model: Hybrid

Deadline: Jul 31, 2024