Dana H. Ballard was a giant in the field when I joined his lab as a post-doc in 1999. He had just published a paper with his graduate student Rajesh P.N. Rao, which was destined to have a huge impact on Neuroscience. It was the origin of theories of Predictive Coding (PC). Their paper, entitled “Predictive coding in the visual cortex: a functional interpretation of some extra-classical receptive-field effects”, was published in the journal Nature Neuroscience in 1999. In 2000 it was cited (only) 23 times according to Google Scholar. But by now it has amassed over 6700 citations. Last year alone it was cited almost 700 times, and the numbers are still going up - over 25 years after the original publication. So what was their revolutionary contribution?
Dana was always very impressed by how the brain could operate so quickly and efficiently. Sure, even today we still marvel at the brain’s efficiency, but keep in mind that 25 years ago computers were orders of magnitude slower. Brain-like intelligence was a holy grail that seemed very far out of reach. The truly bold and brilliant idea of Dana and Rajesh was that if a sensory input can already be predicted, then there’s really no need to encode it anymore. It should be enough to only encode deviations from these predictions, i.e., prediction errors. This could save precious resources and make the brain much more efficient. They realized that this could be done in a processing hierarchy, where “higher” levels of processing predict inputs to “lower” levels of processing. This allowed them to give a new functional interpretation to some well-known neurobiological findings from visual cortex. A little while later, Karl Friston took this beautiful idea and gave it a new twist by “marrying” it to a specific way of performing Bayesian inference and learning. Since then, many others have contributed to the development of these theoretical ideas or how they might be implemented in the brain. But the central idea has remained the same: only transmit prediction errors to higher processing stages and use these errors for learning a hierarchical model of the world.
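To make the mechanics concrete, here is a minimal, purely illustrative sketch of such a predictive coding loop in Python. It is not the actual Rao and Ballard model: the network size, learning rates, and the single-level setup are all made-up assumptions. A higher level holds a representation r, predicts the input x through generative weights W, and only the prediction error is used to update both.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup (all sizes and rates are illustrative assumptions):
# a "higher" level holds a representation r and predicts the "lower"
# level input x through generative weights W.
n_input, n_hidden = 16, 4
W = rng.normal(scale=0.1, size=(n_input, n_hidden))  # generative (top-down) weights
x = rng.normal(size=n_input)                          # sensory input
r = np.zeros(n_hidden)                                # higher-level representation

lr_r, lr_W = 0.1, 0.01
for _ in range(200):
    prediction = W @ r              # top-down prediction of the input
    error = x - prediction          # only this prediction error is "transmitted"
    r += lr_r * (W.T @ error)       # the higher level updates its state from the error
    W += lr_W * np.outer(error, r)  # learning is also driven by the error

print("remaining prediction error:", np.linalg.norm(x - W @ r))
```

The key point of the sketch is that the input x itself never travels up: the higher level only ever receives the error between its own prediction and that input.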
Back in 1999, I had come to Dana’s lab to work on a very different topic (which was a ton of fun and also a success, albeit a much more modest one). Nevertheless, I appreciated what excited him about his work with Rajesh: the idea of only encoding those things that cannot already be predicted is both bold and elegant, two adjectives that I feel very well reflect Dana’s general approach to science. Yet, for some reason I could not fully grasp, I never really warmed up to the idea and never fully embraced it - despite its gigantic success in the years that followed. I remember that at the time, Dana also had his own struggles with the idea. He wanted to understand how it could be efficiently implemented with spiking neurons, i.e., neurons that communicate via individual discrete voltage pulses called action potentials or spikes - like most of the neurons in our brains. A particular challenge was how populations of spiking neurons could calculate and encode both positive and negative prediction errors. Although Dana tried hard during those and subsequent years, publishing a couple of papers in this area, his efforts weren’t met with the same success as his groundbreaking original publication.
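One way to see the difficulty: a firing rate cannot be negative, so a signed prediction error does not map directly onto a single spiking population. A commonly discussed workaround, sketched below purely as an assumption and not as Dana’s actual solution, is to split the signed error across two rectified channels, one carrying positive and one carrying negative errors.

```python
import numpy as np

# Illustrative workaround (an assumption, not Dana's published solution):
# split a signed prediction error into two non-negative "ON" and "OFF"
# channels, since firing rates cannot go below zero.
error = np.array([0.8, -0.3, 0.0, -1.2])   # made-up signed prediction errors

on_channel = np.maximum(error, 0.0)    # population signaling positive errors
off_channel = np.maximum(-error, 0.0)  # population signaling negative errors

# Downstream, the signed error can be recovered as the difference.
assert np.allclose(on_channel - off_channel, error)
```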
It was only years after I left Dana’s lab that I started to develop my own line of research on learning in recurrent spiking neural networks. My lab focused on learning with local synaptic plasticity rules, where synapses change their efficacies based only on locally available information. While such learning with local plasticity rules is considered much more biologically plausible than today’s state-of-the-art machine learning methods, it tends not to work as well in practice - at least if the goal is to engineer, e.g., a functioning computer vision system. Also, learning with such local rules in a recurrent network is a notoriously difficult topic. It tends to render such networks quite unstable, because the most popular learning rules for which there is biological evidence, such as forms of so-called Hebbian learning, embody a positive feedback mechanism that tends to cause instabilities. Often Computational Neuroscience researchers try to “tame” such networks by adding large amounts of noise to their simulations - another idea that I never really warmed up to: adding noise simply doesn’t seem to be a great strategy if the goal is to be highly efficient.
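To illustrate why plain Hebbian learning acts as a positive feedback loop, here is a deliberately minimal sketch with a single synapse; all numbers are arbitrary, and real Hebbian rules and their stabilized variants are of course more elaborate.

```python
# Minimal illustration (arbitrary numbers): a Hebbian update strengthens the
# weight whenever pre- and postsynaptic activity coincide, but a stronger
# weight also produces more postsynaptic activity, so the weight keeps growing.
lr = 0.1
w = 0.5       # single synaptic weight
x_pre = 1.0   # presynaptic activity, held fixed

for step in range(10):
    y_post = w * x_pre        # postsynaptic activity grows with the weight
    w += lr * x_pre * y_post  # Hebbian update: "fire together, wire together"
    print(f"step {step}: w = {w:.3f}")

# Without a stabilizing mechanism (e.g. normalization or inhibition),
# the weight grows geometrically and the network becomes unstable.
```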
So at some point we started to simply leave out recurrent excitatory connections from our networks (“for now”), leaving only purely excitatory feedforward connectivity and purely inhibitory recurrent and top-down connectivity. It had long been argued by some authors that recurrent connections in neocortex tend to have a mostly inhibitory net effect. Obviously, this was a gross over-simplification relative to the Neurobiology of mammalian brains, and it felt and still feels like a step backwards. But it made learning much easier, and we were interested in building neuromorphic vision systems that would work in real applications.
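A rough sketch of this sign-constrained connectivity, under made-up sizes and a simple rate-based update rather than spikes (so it is only a caricature of the actual networks), might look like this:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical sign-constrained connectivity (sizes and dynamics are made up):
# feedforward weights are kept non-negative (excitatory), while recurrent
# weights within the layer are kept non-positive (inhibitory). Top-down
# connections from higher layers would follow the same inhibitory constraint.
n_in, n_hidden = 20, 10
W_ff = np.abs(rng.normal(scale=0.1, size=(n_hidden, n_in)))         # excitatory feedforward
W_rec = -np.abs(rng.normal(scale=0.05, size=(n_hidden, n_hidden)))  # inhibitory recurrent
np.fill_diagonal(W_rec, 0.0)                                         # no self-connections

x = rng.random(n_in)     # input activity
h = np.zeros(n_hidden)   # hidden-layer activity
for _ in range(20):
    # Excitation arrives only via the feedforward path; recurrence only inhibits.
    h = np.maximum(W_ff @ x + W_rec @ h, 0.0)
```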
Initially, we did not think of these networks as having any connection to predictive coding. In fact, they do not attempt to explicitly predict the next input and they do not calculate an explicit prediction error. But after working with these networks for a while, we realized that the inhibitory connections learned to do something rather simple and intuitive: they learned to suppress the most predictable spikes. And that’s when we made the connection: suppressing a spike that’s highly predictable is somewhat analogous to subtracting a predicted input from an actual sensory input to calculate a prediction error. It’s an alternative simple way of making sensory coding more efficient by suppressing predictable information - spike by spike. But it’s not the same as PC. It’s only a “light” variant of PC, because only the most easily predicted spikes are removed. Hence we’ve called it Predictive Coding Light (PCL). There’s an important difference from Dana and Rajesh’s revolutionary idea. While PC only transmits prediction errors to higher processing stages in the hierarchy, PCL transmits a compressed representation of the input to higher areas. This idea is not as new, bold and flamboyant as the seminal work by Dana and Rajesh. It’s a modest new take on their landmark discovery - a quarter century later. But this time I’ve been warming up to it quickly.
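The distinction can be caricatured on a toy spike vector. The predictability scores, thresholds, and binary coding below are all made-up illustrative assumptions, not our actual model: PC, schematically, forwards the signed deviation from a prediction, while PCL forwards the input itself with its most predictable spikes suppressed.

```python
import numpy as np

# Toy contrast between PC and PCL (all numbers are made-up illustrations).
spikes = np.array([1, 1, 0, 1, 1, 0, 1])                          # actual input spikes
predictability = np.array([0.9, 0.2, 0.1, 0.8, 0.3, 0.0, 0.95])   # how expected each spike is

# PC, schematically: transmit only the deviation between input and prediction.
prediction = (predictability > 0.5).astype(int)  # spikes the model expects
pc_signal = spikes - prediction                   # signed prediction error

# PCL, schematically: transmit the input itself, but with the most
# predictable spikes removed by learned inhibition.
pcl_signal = np.where(predictability > 0.8, 0, spikes)  # compressed representation

print("PC transmits:  ", pc_signal)
print("PCL transmits: ", pcl_signal)
```

In the PCL case most of the input still travels up the hierarchy, just with its cheapest-to-predict spikes stripped out.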
(Image credit: Cyber-brain, Kohji Asakawa / Pixabay)