Behind the Paper

Rethinking Sparsity from the Bottom Up

This story explains why sparse AI is so hard to accelerate, and how we’re fixing it with brain-inspired hardware.

Sparsity is an intrinsic property of synaptic connections in the human brain. During development, more than half of all synapses are pruned in a fine-grained and unstructured manner. This pruning process is one of the key factors behind the brain’s remarkable energy efficiency.

Inspired by this, the concept of sparse neural networks was proposed as early as the 1990s. In theory, sparsity could improve energy efficiency by up to two orders of magnitude. It has since become one of the mainstream approaches to model compression and efficiency in AI.

But decades later, we still haven’t fully realized its theoretical promise. Why? The main reason lies in the mismatch between sparse neural networks and existing hardware. If we use the game of Jenga as an analogy: the brain is natively sparse — like a structure with a lot of gaps — while our AI hardware is natively dense, like a tightly stacked tower. So mapping sparsity onto dense hardware inevitably involves a cumbersome process: telling the hardware which parts need to be computed, and which can be skipped.

Sparse Human Brain vs. Sparse Neural Network

The Hidden Cost: Indexing

Let’s add some technical detail. This process of telling hardware where the important weights are is called indexing. In sparse neural networks, more than 90% of weights can often be pruned. But because the remaining weights are distributed in irregular, unstructured patterns, accessing them requires repeated memory reads and writes. This indexing process consumes most of the system’s energy and latency budget.
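To make the indexing cost concrete, here is a minimal NumPy sketch (our own illustration, not code from the paper) of how an unstructured sparse layer is typically evaluated on dense hardware: the surviving weights are stored with explicit indices, and every useful multiply-accumulate is preceded by an index fetch and a gather.

```python
import numpy as np

# Minimal sketch of the indexing overhead in unstructured sparsity.
# Surviving weights are kept in a CSR-like format: values plus explicit
# column indices. Every multiply-accumulate needs an index lookup first.

rng = np.random.default_rng(0)
W = rng.standard_normal((256, 256))
mask = rng.random((256, 256)) > 0.9              # keep ~10% of weights, unstructured
W_sparse = W * mask

col_idx = [np.nonzero(mask[r])[0] for r in range(W.shape[0])]   # stored indices
vals    = [W_sparse[r, col_idx[r]] for r in range(W.shape[0])]  # stored values

def spmv(x):
    """Sparse matrix-vector product: the gather x[col_idx[r]] is the
    hidden indexing traffic that dominates energy and latency."""
    y = np.zeros(W.shape[0])
    for r in range(W.shape[0]):
        y[r] = np.dot(vals[r], x[col_idx[r]])
    return y

x = rng.standard_normal(256)
assert np.allclose(spmv(x), W_sparse @ x)        # same result as the dense product
```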

In the era of dense networks, we talked about the von Neumann bottleneck — separation of memory and computation. Now, in the sparse era, indexing has become the new bottleneck.

The indexing bottleneck makes sparse hardware inefficient

The current mainstream industrial approach embeds structured and/or coarse-grained pruning support directly into GPUs. With certain pattern constraints, sparse matrices can be compressed back into dense ones and processed efficiently. But this comes at the cost of accuracy. It’s essentially sacrificing some model precision to reduce indexing overhead.
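To illustrate the two ends of the granularity spectrum, here is a small toy sketch (our own illustration with illustrative names, not any vendor’s actual API) that prunes the same weight matrix in a structured 2-out-of-4 pattern, which can be re-packed into a dense half-width matrix, and in a fully unstructured way, which cannot.

```python
import numpy as np

# Toy comparison of pruning granularities on the same weight matrix.
rng = np.random.default_rng(1)
W = rng.standard_normal((8, 16))

# Structured "2-out-of-4" pruning: in every group of 4 consecutive weights,
# keep only the 2 largest magnitudes. The survivors can be re-packed into a
# dense half-width matrix plus small per-group metadata, so dense hardware
# handles it efficiently, but the pattern constraint can cost accuracy.
W_groups = W.reshape(-1, 4)
keep = np.argsort(-np.abs(W_groups), axis=1)[:, :2]
structured_mask = np.zeros_like(W_groups, dtype=bool)
np.put_along_axis(structured_mask, keep, True, axis=1)
structured_mask = structured_mask.reshape(W.shape)

# Unstructured pruning: keep the globally largest 50% of weights, wherever
# they happen to be. More accurate, but the survivors land in irregular
# positions that require explicit indexing to locate.
threshold = np.median(np.abs(W))
unstructured_mask = np.abs(W) > threshold

print(structured_mask.mean(), unstructured_mask.mean())  # both ~50% of weights kept
```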

This leads us to what we call the granularity dilemma:

  • Coarse-grained, structured sparsity offers high energy efficiency but low accuracy
  • Fine-grained, unstructured sparsity offers high accuracy but low efficiency

Clearly, this is far from how biological brains operate. What we need is not just smarter algorithms, but fundamentally new hardware.

The accuracy-efficiency dilemma of sparse neural networks with different granularities.


A Biological Clue and In-Memory Sparsity

To break this impasse, we turned back to neuroscience. In the brain, the dynamic behaviors of synapse creation, pruning, and regrowth are not managed by the neurons themselves, but by the surrounding astrocytes and microglial cells.

Inspired by this, we believe that in hardware, sparsity information should be stored as close as possible to weight information, ideally co-located in the same computing unit and directly involved in computation.

The idea of In-Memory Sparsity, inspired by neuroscience

Based on this idea, we propose a new hardware architecture: In-Memory Sparsity. We abstract the training process of a sparse neural network as the Hadamard product between a sparsity matrix and a weight matrix (a small numerical sketch follows the list below), and physically integrate both into the same unit. Our hardware is built using ferroelectric transistors based on Hf₀.₅Zr₀.₅O₂, a CMOS-compatible ferroelectric material, and MoS₂, a two-dimensional semiconductor. Each computing unit includes two ferroelectric transistors:

  • One analog ferroelectric transistor to store the weight value
  • One digital ferroelectric transistor to encode sparsity, that is, whether that weight has been pruned
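Here is a minimal numerical sketch of that abstraction, with our own symbol names: S holds the binary states of the digital sparsity transistors, W the analog weights, and the array effectively computes with their Hadamard product, so pruned cells never need to be indexed.

```python
import numpy as np

# Minimal sketch of the abstraction described above (symbol names are ours):
# each cell holds an analog weight and a binary sparsity bit, and the array
# computes with their product, so pruned weights never need external indexing.

rng = np.random.default_rng(2)
W = rng.standard_normal((4, 4))                 # analog ferroelectric transistor states
S = (rng.random((4, 4)) > 0.75).astype(float)   # digital sparsity bits (1 = kept)

W_eff = S * W          # Hadamard product S ⊙ W, realized inside each cell
x = rng.standard_normal(4)
y = W_eff @ x          # matrix-vector product of the sparse layer
```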

Thanks to the back-end compatibility of 2D materials with silicon processes, this design supports monolithic 3D integration with mature CMOS circuits. This overcomes the interconnect density limitations of advanced packaging and further improves energy efficiency in In-Memory Computing architectures.

Of course, readers don’t need to worry about the material details. Just remember that emerging devices like these give us unprecedented fine-grained control over both data and addresses at the device level. The key point is that sparsity is programmed in advance, removing the need for external indexing. That’s what makes this architecture fundamentally different.

The cell design of In-Memory Sparsity hardware

From Cell to Array: VAU

But when we scale up from a single unit to an array, a new challenge appears. In an array, some synaptic units are retained and others are pruned. Yet we want to train the network without knowing the exact locations or magnitudes of these retained weights.

Sounds impossible? It’s not. In fact, this is exactly the core requirement for sparse hardware. If we still need to know positions at the array level, then we’re not truly leveraging the index-free nature of the cell.

From the Index-Free cell to the Index-Free array

Let’s use a metaphor.

Imagine an apartment building where several rooms are on fire. The fires vary in intensity, and our number of firefighters is limited. How do we extinguish all the flames as quickly and efficiently as possible?

In AI training, the burning rooms are the weights (W), the fire size corresponds to the update magnitude (ΔW), and the firefighters represent energy and latency.

  • We could send a firefighter to each window to put out individual fires — precise but extremely slow.
  • Or we could bring in a fire truck and spray water across entire rows — much faster, but many rooms don’t need water (and you’ll get complaints from your wet neighbors).

In AI terms, this is the trade-off between cell-by-cell updates (accurate but slow) and vectorial updates (fast but inaccurate).

Our approach? We first let the residents in unaffected rooms close their windows (local programming). Then we use the fire truck to douse the flames in parallel. Sure, this is an approximation — every room gets the same water pressure regardless of how big the fire is. But we found that it is accurate enough, and dramatically more efficient.

We call this method Vectorial Approximate Updating (VAU).


Three AI training methods

The core of VAU is simple:

  • Instead of updating each weight individually, we apply one ΔW value per row or column.
  • Each weight is updated twice (once for its row, once for its column).

This inevitably leads to a lot of incorrect updates, especially when updates in the same row have opposite signs. But here’s the clever part: in sparse neural networks, most weights have already been pruned. These pruned units are equipped with sparsity transistors, which automatically mask out any erroneous updates. This masking is local, spontaneous, and requires no external control. Thus in our experiments, VAU achieved accuracy comparable to traditional, fully indexed updates — while being orders of magnitude more efficient.
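Here is a minimal NumPy sketch of VAU as described above (the hardware pulse scheme is simplified, the aggregation of ΔW into one value per row and column is shown here as a simple average, and all function and variable names are ours).

```python
import numpy as np

# Minimal sketch of Vectorial Approximate Updating (VAU), following the
# description above. Hardware details are simplified; names are ours.

def vau_update(W, S, dW, lr=0.1):
    """Apply one shared update per row and one per column of dW.

    W  : analog weight matrix
    S  : binary sparsity mask (1 = weight kept, 0 = pruned)
    dW : the ideal per-weight update we would like to apply
    """
    row_dw = dW.mean(axis=1, keepdims=True)   # one value broadcast along each row
    col_dw = dW.mean(axis=0, keepdims=True)   # one value broadcast along each column
    # Each kept weight receives two approximate updates (row pass + column pass);
    # the sparsity bit S locally masks the update at every pruned cell.
    return W + lr * S * (row_dw + col_dw)

rng = np.random.default_rng(3)
W  = rng.standard_normal((8, 8))
S  = (rng.random((8, 8)) > 0.75).astype(float)   # ~75% of weights pruned
dW = rng.standard_normal((8, 8))

W_new = vau_update(W, S, dW)
assert np.allclose(W_new[S == 0], W[S == 0])     # pruned weights are untouched
```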

The details of VAU

Experimental Demo: On-Chip Sparse Training

To validate the architecture, we designed a mini sparse convolutional neural network (CNN) with around 1,000 weights and mapped it onto our fabricated hardware array.

We completed several on-chip sparse training procedures, including:

  • Pre-training
  • Pruning
  • Over-pruning
  • Regrowth

During pruning and regrowth epochs, indexing was still needed to determine which weights to prune or regrow. But in all other sparse training phases — despite fine-grained and unstructured weight distributions — indexing was completely eliminated.
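Schematically, the schedule looks like the toy loop below (a software illustration under our own assumptions, not the chip’s control flow; over-pruning is omitted for brevity): indices are touched only in the epochs that rewrite the sparsity bits, while every other epoch uses index-free masked updates.

```python
import numpy as np

# Schematic of the training schedule described above. Indexing happens only
# when the sparsity bits themselves are rewritten (prune / regrow epochs).

rng = np.random.default_rng(4)
W = rng.standard_normal((8, 8))
S = np.ones((8, 8))                        # start dense (pre-training)

def prune(W, S, sparsity):
    """Rewrite sparsity bits: keep only the largest-magnitude weights."""
    k = int(W.size * (1 - sparsity))
    thresh = np.sort(np.abs(W[S == 1]))[-k]
    return (np.abs(W) >= thresh).astype(float)

def regrow(S, n):
    """Re-enable n randomly chosen pruned cells."""
    S = S.copy()
    off = np.argwhere(S == 0)
    for i, j in off[rng.choice(len(off), size=min(n, len(off)), replace=False)]:
        S[i, j] = 1.0
    return S

for epoch in range(20):
    dW = -W * 0.1                          # stand-in for a real gradient
    if epoch == 10:
        S = prune(W, S, sparsity=0.75)     # indexing needed: sparsity bits rewritten
    elif epoch == 15:
        S = regrow(S, n=4)                 # indexing needed: sparsity bits rewritten
    else:
        W = W + S * dW                     # index-free masked update
```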

At 75% sparsity, our hardware achieved 98.4% accuracy on the EMNIST handwritten letter classification task.

On-chip sparse training based on Index-Free hardware

Is It Scalable? Yes!

Due to lab constraints and the immaturity of emerging devices, our fabricated arrays were small and not yet capable of running large-scale models. To demonstrate scalability, we simulated the classic VGG-8 CNN on three types of hardware:

  1. Dense hardware
  2. Traditional sparse hardware
  3. Our index-free sparse hardware

The result: for the first time, our architecture achieved an order-of-magnitude reduction in both energy and latency, even under ultra-fine-grained and unstructured sparsity.


In-Memory Sparsity delivers the theoretical benefits of sparse neural networks

Conclusion

Inspired by the human brain, we present the first in-memory sparse computing architecture for neural networks. Built on MoS₂ ferroelectric transistors, this design eliminates external indexing, overcomes the accuracy-efficiency trade-off, and supports real-time sparse training.

More importantly, it shows how emerging materials and device-level innovations can enable fine-grained programmability at the hardware level — something traditional silicon-based technologies simply can’t offer.