How Do You Find Patterns in Biological “Noise”? A Journey into Low-Complexity Regions

Proteins may look like orderly code—but hidden within are simple, repetitive stretches once dismissed as “junk.” These low-complexity regions (LCRs) are now known to shape function, disease, and cellular organization. But detecting them isn’t straightforward—and that’s where the real story begins.

Published in Protocols & Methods

If you zoom into a protein sequence, what you see is a long string of letters—amino acids arranged in a precise order. At first glance, it looks like well-organised code.

But look closer, and things get… messy.

You’ll start noticing stretches where the same few letters repeat again and again. Regions that look almost too simple compared to the rest of the sequence. For a long time, scientists weren’t quite sure what to make of these segments.

These are called low-complexity regions (LCRs)—and they’re far more important than they appear.

From “Junk” to Functional Gold

LCRs were once dismissed as evolutionary leftovers—quirks of DNA replication with little real function. But over time, that view has changed dramatically.

We now know that these regions:

  • Help proteins interact with other molecules
  • Contribute to structural flexibility
  • Play roles in phase separation—a process essential for organising cellular components
  • Are linked to diseases like Huntington’s, where repeat expansions disrupt normal function

In fact, by some estimates nearly 20% of human proteins contain these regions, although the exact figure depends on how they are detected.

So the real question is no longer “Do LCRs matter?”
It’s “How do we reliably find and study them?”

The Problem: Everyone Detects Them Differently

Here’s where things get tricky.

There isn’t just one way to detect LCRs—there are many computational tools, each built on a different idea of what “low complexity” actually means.

Some tools look for:

  • Repetitive patterns (like tandem repeats)
  • Compositional bias (too much of one amino acid)
  • Statistical deviation from randomness
  • Even visual self-similarity using dotplots

Naturally, they don’t always agree.

Run different tools on the same protein, and you might get completely different answers.
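
To make the disagreement concrete, here is a minimal sketch of two toy detectors, one based on compositional bias and one based on exact tandem repeats. Neither is any published tool: the function names, window size, and thresholds are all assumptions chosen for illustration.

```python
import re
from collections import Counter


def bias_mask(seq, window=12, threshold=0.5):
    """Flag residues in any window where one amino acid exceeds `threshold`."""
    flagged = [False] * len(seq)
    for start in range(len(seq) - window + 1):
        chunk = seq[start:start + window]
        top_count = Counter(chunk).most_common(1)[0][1]
        if top_count / window > threshold:
            for i in range(start, start + window):
                flagged[i] = True
    return flagged


def repeat_mask(seq, max_unit=3, min_copies=4):
    """Flag residues inside exact tandem repeats of a short unit."""
    flagged = [False] * len(seq)
    # A unit of 1..max_unit residues followed by at least min_copies - 1 copies.
    pattern = re.compile(r"(.{1,%d}?)\1{%d,}" % (max_unit, min_copies - 1))
    for match in pattern.finditer(seq):
        for i in range(match.start(), match.end()):
            flagged[i] = True
    return flagged


# Hypothetical sequence: a serine-rich stretch with no exact repeat.
toy = "ACDEFGHIKL" + "SSASSTSSGSSS" + "MNPRQVWYLE"
print("bias detector flags:  ", sum(bias_mask(toy)), "residues")
print("repeat detector flags:", sum(repeat_mask(toy)), "residues")
```

The serine-rich stretch is biased but never repeats a unit four times in a row, so the two definitions give different answers for the same sequence.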

So which one is right?

Benchmarking the Chaos

This is the core problem my research addresses.

Instead of trusting a single method, I built a comprehensive benchmarking framework to compare multiple LCR detection tools across the entire human proteome (over 20,000 proteins).

The idea was simple:

Let’s not ask which tool is best—let’s understand how they differ.

To do this, I:

  • Standardised outputs from different tools into a common format
  • Compared them across multiple dimensions:
    • Length of detected regions
    • Coverage within proteins
    • Amino acid composition
    • Sequence complexity (via Shannon entropy)
  • Analysed how much different tools agree with each other
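
The steps above can be sketched roughly as follows. This is not the framework's actual code: the `Region` record, its field names, and the assumption that some tools report 0-based half-open coordinates are all hypothetical, used only to show what a common format and the per-region metrics might look like.

```python
import math
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Region:
    """One detected LCR in a common format (1-based, inclusive coordinates)."""
    protein_id: str
    tool: str
    start: int
    end: int

    @property
    def length(self):
        return self.end - self.start + 1

    @classmethod
    def from_half_open(cls, protein_id, tool, start0, end0):
        """Convert a 0-based, end-exclusive interval to the common format."""
        return cls(protein_id, tool, start0 + 1, end0)


def shannon_entropy(segment):
    """Shannon entropy in bits: H = sum(p_i * log2(1 / p_i)) over residues."""
    total = len(segment)
    counts = Counter(segment)
    return sum((n / total) * math.log2(total / n) for n in counts.values())


def coverage(regions, protein_length):
    """Fraction of residues covered by at least one region."""
    covered = set()
    for region in regions:
        covered.update(range(region.start, region.end + 1))
    return len(covered) / protein_length


def composition(segment):
    """Relative frequency of each amino acid in a segment."""
    total = len(segment)
    return {aa: n / total for aa, n in Counter(segment).items()}
```

With every tool's output converted to the same record, length, coverage, composition, and entropy can be computed in the same way regardless of which tool produced the region.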

What We Found (And Why It Matters)

The results were revealing.

1. There is no “one-size-fits-all” tool

Some tools are permissive, flagging large portions of proteins as low-complexity. Others are strict, picking out only the most obvious repeats.

This means:

The tool you choose directly shapes the biology you “see.”

2. Agreement between tools tells a story

When multiple tools detect the same region, something interesting happens:

  • These regions tend to be longer
  • They are more repetitive
  • They are more compositionally pure
  • They have lower entropy (i.e., less randomness)

In other words:

The more tools agree, the more “classical” the LCR becomes.
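
One simple way to quantify agreement, shown here as a sketch rather than the measure used in the study, is to count how many tools flag each residue and to compute pairwise Jaccard similarity between the residue sets. The tool names and positions below are made up.

```python
from collections import Counter
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity of two sets of residue positions."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def support_per_residue(calls):
    """Map each residue position to the number of tools that flagged it."""
    support = Counter()
    for positions in calls.values():
        support.update(positions)
    return support


# Hypothetical calls for one protein: tool name -> flagged positions.
calls = {
    "tool_a": set(range(10, 31)),
    "tool_b": set(range(15, 26)),
    "tool_c": set(range(18, 40)),
}

for left, right in combinations(sorted(calls), 2):
    print(left, right, round(jaccard(calls[left], calls[right]), 2))

support = support_per_residue(calls)
consensus = sorted(pos for pos, n in support.items() if n == len(calls))
print("flagged by all tools:", consensus)
```

Grouping residues or regions by their support count makes it possible to compare length, composition, and entropy at each level of agreement.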

3. Low complexity isn’t black-and-white

One of the most important insights:

LCRs don’t exist as a strict category—they form a continuum.

Instead of a sharp boundary between “low” and “high” complexity, we see a gradual transition. Some regions sit in a grey area—part structured, part repetitive.

This challenges the idea that we can neatly classify sequences.
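
A sliding-window entropy profile illustrates the point. In the sketch below, the window size, the cutoffs, and the toy sequence are arbitrary assumptions. Entropy rises gradually from the repeat into its flanks, so the fraction of windows called "low complexity" depends on where the cutoff is placed.

```python
import math
from collections import Counter


def window_entropies(seq, window=12):
    """Shannon entropy (bits) of every full window along the sequence."""
    values = []
    for start in range(len(seq) - window + 1):
        counts = Counter(seq[start:start + window])
        entropy = sum((n / window) * math.log2(window / n) for n in counts.values())
        values.append(entropy)
    return values


def fraction_below(values, cutoff):
    """Fraction of windows whose entropy falls below `cutoff`."""
    return sum(v < cutoff for v in values) / len(values)


# Hypothetical sequence: a glutamine run flanked by mixed residues.
toy = "ACDEFGHIKL" + "Q" * 12 + "MNPRSTVWYA"
profile = window_entropies(toy)
for cutoff in (1.0, 2.0, 3.0):
    print(f"cutoff {cutoff}: {fraction_below(profile, cutoff):.2f} of windows flagged")
```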

4. Different tools capture different biology

  • Some methods are better at finding highly repetitive motifs
  • Others capture subtle, compositionally biased regions
  • Some focus on structure-like patterns, others on statistical anomalies

So rather than competing, these tools are actually:

Complementary lenses looking at the same biological landscape.

Why This Work Matters

If you’re studying proteins—whether in evolution, disease, or structural biology—LCRs are impossible to ignore.

But using the wrong detection approach can:

  • Overestimate their presence
  • Miss biologically relevant regions
  • Or bias downstream analysis

This benchmarking framework provides:

  • A unified way to compare tools
  • Guidelines on which tools to use and when
  • A deeper understanding of what “low complexity” actually means

The Bigger Picture

Science often progresses by refining definitions.

Low-complexity regions are a perfect example:

  • Once ignored
  • Then loosely defined
  • Now understood as diverse, functionally important, and method-dependent

What looks like “simple repetition” at first glance turns out to be a rich and nuanced feature of biological sequences.

Final Thought

When we look at biological data, we often assume complexity is where the meaning lies.

But sometimes, it’s the simplest patterns—the repeats, the biases, the irregularities—that carry the deepest insights.

The challenge is not just detecting them.

It’s learning how to interpret them correctly.
