Behind the Paper

How Do You Find Patterns in Biological “Noise”? A Journey into Low-Complexity Regions

Proteins may look like orderly code—but hidden within are simple, repetitive stretches once dismissed as “junk.” These low-complexity regions (LCRs) are now known to shape function, disease, and cellular organization. But detecting them isn’t straightforward—and that’s where the real story begins.

Published in Protocols & Methods

Apr 15, 2026

Nagarjun Vijay and Anirjit Chatterjee

If you zoom into a protein sequence, what you see is a long string of letters—amino acids arranged in a precise order. At first glance, it looks like well-organised code.

But look closer, and things get… messy.

You’ll start noticing stretches where the same few letters repeat again and again. Regions that look almost too simple compared to the rest of the sequence. For a long time, scientists weren’t quite sure what to make of these segments.

These are called low-complexity regions (LCRs)—and they’re far more important than they appear.

From “Junk” to Functional Gold

LCRs were once dismissed as evolutionary leftovers—quirks of DNA replication with little real function. But over time, that view has changed dramatically.

We now know that these regions:

Help proteins interact with other molecules
Contribute to structural flexibility
Play roles in phase separation—a process essential for organising cellular components
Are linked to diseases like Huntington’s, where repeat expansions disrupt normal function

In fact, nearly 20% of human proteins contain these regions.

So the real question is no longer “Do LCRs matter?”
It’s “How do we reliably find and study them?”

The Problem: Everyone Detects Them Differently

Here’s where things get tricky.

There isn’t just one way to detect LCRs—there are many computational tools, each built on a different idea of what “low complexity” actually means.

Some tools look for:

Repetitive patterns (like tandem repeats)
Compositional bias (too much of one amino acid)
Statistical deviation from randomness
Even visual self-similarity using dotplots

Naturally, they don’t always agree.

Run different tools on the same protein, and you might get completely different answers.
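To make the disagreement concrete, here is a minimal sketch in Python. It does not reproduce any published tool: the two detectors, their thresholds, and the example sequence are all invented for illustration. One detector follows a "repeat" definition (uninterrupted runs of one residue), the other a "compositional bias" definition (windows dominated by one residue).

```python
import re
from collections import Counter


def homopolymer_runs(seq, min_len=5):
    """Toy 'repeat' detector: runs of a single residue at least min_len long."""
    pattern = r"(.)\1{%d,}" % (min_len - 1)
    return [(m.start(), m.end()) for m in re.finditer(pattern, seq)]


def biased_windows(seq, window=10, min_fraction=0.5):
    """Toy 'compositional bias' detector.

    Flags every window in which the most common residue makes up at least
    min_fraction of positions, then merges overlapping or adjacent windows.
    """
    regions = []
    for start in range(len(seq) - window + 1):
        counts = Counter(seq[start:start + window])
        if counts.most_common(1)[0][1] / window >= min_fraction:
            end = start + window
            if regions and start <= regions[-1][1]:
                regions[-1] = (regions[-1][0], end)  # extend the previous region
            else:
                regions.append((start, end))
    return regions


# Hypothetical sequence: a glutamine run followed by a serine-rich stretch.
seq = "MKTAYIAKQRQQQQQQQQLEDSPSGSASSTSAPSSLVKEM"

print("repeat-based:     ", homopolymer_runs(seq))
print("composition-based:", biased_windows(seq))
```

On this made-up sequence, the run detector reports only the glutamine stretch, while the window-based detector reports a much longer span that also takes in the serine-rich stretch. Neither answer is wrong; they follow from different definitions.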

So which one is right?

Benchmarking the Chaos

This is the core problem our research addresses.

Instead of trusting a single method, we built a comprehensive benchmarking framework to compare multiple LCR detection tools across the entire human proteome (over 20,000 proteins).

The idea was simple:

Let’s not ask which tool is best; let’s understand how they differ.

To do this, we:

Standardised outputs from different tools into a common format
Compared them across multiple dimensions:

Length of detected regions
Coverage within proteins
Amino acid composition
Sequence complexity (via Shannon entropy)

Analysed how much different tools agree with each other
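The steps above can be sketched roughly as follows. This is not the framework's actual code: the `Region` record, its field names, the 0-based half-open coordinates, and the tool labels are assumptions chosen for illustration. The sketch shows one possible common format and how length, coverage, composition, and Shannon entropy could be computed from it.

```python
import math
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Region:
    """One detected region in a hypothetical common format."""
    protein_id: str
    tool: str
    start: int  # 0-based, inclusive
    end: int    # 0-based, exclusive


def shannon_entropy(segment):
    """Shannon entropy (in bits) of the residue frequencies in a segment."""
    total = len(segment)
    return sum((n / total) * math.log2(total / n) for n in Counter(segment).values())


def coverage(regions, protein_length):
    """Fraction of residues covered by at least one region."""
    covered = set()
    for region in regions:
        covered.update(range(region.start, region.end))
    return len(covered) / protein_length


def summarise(region, sequence):
    """Length, dominant residue, and entropy for one region."""
    segment = sequence[region.start:region.end]
    top_residue, top_count = Counter(segment).most_common(1)[0]
    return {
        "length": len(segment),
        "top_residue": top_residue,
        "top_fraction": top_count / len(segment),
        "entropy_bits": shannon_entropy(segment),
    }


sequence = "MKQQQQQQLEDSSSGSSAPK"  # hypothetical protein
regions = [
    Region("P1", "tool_a", 2, 8),    # made-up calls from two made-up tools
    Region("P1", "tool_b", 11, 17),
]
for region in regions:
    print(region.tool, summarise(region, sequence))
print("coverage:", coverage(regions, len(sequence)))
```

Once every tool's output is in one record type, each comparison becomes a function over a list of records, regardless of how the original tool formatted its results.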

What We Found (And Why It Matters)

The results were revealing.

1. There is no “one-size-fits-all” tool

Some tools are very sensitive, detecting large portions of proteins as low-complexity. Others are extremely strict, picking only the most obvious repeats.

This means:

The tool you choose directly shapes the biology you “see.”

2. Agreement between tools tells a story

When multiple tools detect the same region, something interesting happens:

These regions tend to be:

Longer
More repetitive
More compositionally pure
Lower in Shannon entropy (i.e., dominated by fewer residue types)

In other words:

The more tools agree, the more “classical” the LCR becomes.
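One simple way to quantify agreement, sketched here under assumptions, is to count how many tools mark each residue and to compute the Jaccard index between pairs of tools. The tool names and coordinates below are hypothetical, and the published analysis may measure agreement differently.

```python
from itertools import combinations


def covered_positions(regions):
    """Set of residue indices covered by (start, end) half-open intervals."""
    positions = set()
    for start, end in regions:
        positions.update(range(start, end))
    return positions


def jaccard(a, b):
    """Shared positions divided by positions marked by either tool."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def support_profile(calls, length):
    """For each residue, the number of tools that mark it as low-complexity."""
    profile = [0] * length
    for regions in calls.values():
        for pos in covered_positions(regions):
            profile[pos] += 1
    return profile


# Hypothetical calls from three hypothetical tools on a 40-residue protein.
calls = {
    "tool_a": [(10, 18)],
    "tool_b": [(4, 37)],
    "tool_c": [(8, 20), (21, 35)],
}

for name_1, name_2 in combinations(calls, 2):
    score = jaccard(covered_positions(calls[name_1]), covered_positions(calls[name_2]))
    print(name_1, name_2, round(score, 2))

profile = support_profile(calls, 40)
consensus = [i for i, n in enumerate(profile) if n == len(calls)]
print("residues marked by all tools:", consensus)
```

In this toy case, only the core of the region is marked by every tool, which mirrors the pattern described above: the consensus shrinks toward the most clear-cut stretch.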

3. Low complexity isn’t black-and-white

One of the most important insights:

LCRs don’t exist as a strict category—they form a continuum.

Instead of a sharp boundary between “low” and “high” complexity, we see a gradual transition. Some regions sit in a grey area—part structured, part repetitive.

This challenges the idea that we can neatly classify sequences.
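A small sketch can illustrate the continuum. It computes Shannon entropy in sliding windows along a hypothetical sequence and then applies several cut-offs. The window size, thresholds, and sequence are arbitrary choices for illustration, not values from the study.

```python
import math
from collections import Counter


def window_entropy(seq, window=12):
    """Shannon entropy (bits) of each sliding window along the sequence."""
    values = []
    for start in range(len(seq) - window + 1):
        counts = Counter(seq[start:start + window])
        values.append(sum((n / window) * math.log2(window / n) for n in counts.values()))
    return values


def fraction_flagged(values, threshold):
    """Share of windows at or below an entropy threshold."""
    return sum(v <= threshold for v in values) / len(values)


seq = "MKTAYIAKQRQQQQQQQQLEDSPSGSASSTSAPSSLVKEM"  # hypothetical sequence
profile = window_entropy(seq)

print("entropy range:", round(min(profile), 2), "to", round(max(profile), 2))
for threshold in (1.5, 2.5, 3.5):
    print("threshold", threshold, "flags", round(fraction_flagged(profile, threshold), 2))
```

The window entropies spread over a range rather than falling into two clusters, so the share of the sequence called "low complexity" depends on where the cut-off is placed.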

4. Different tools capture different biology

Some methods are better at finding highly repetitive motifs
Others capture subtle, compositionally biased regions
Some focus on structure-like patterns, others on statistical anomalies

So rather than competing, these tools are actually:

Complementary lenses looking at the same biological landscape.

Why This Work Matters

If you’re studying proteins—whether in evolution, disease, or structural biology—LCRs are impossible to ignore.

But using the wrong detection approach can:

Overestimate their presence
Miss biologically relevant regions
Or bias downstream analysis

This benchmarking framework provides:

A unified way to compare tools
Guidelines on which tools to use and when
A deeper understanding of what “low complexity” actually means

The Bigger Picture

Science often progresses by refining definitions.

Low-complexity regions are a perfect example:

Once ignored
Then loosely defined
Now understood as diverse, functionally important, and method-dependent

What looks like “simple repetition” at first glance turns out to be a rich and nuanced feature of biological sequences.

Final Thought

When we look at biological data, we often assume complexity is where the meaning lies.

But sometimes, it’s the simplest patterns—the repeats, the biases, the irregularities—that carry the deepest insights.

The challenge is not just detecting them.

It’s learning how to interpret them correctly.

Scientific Reports
