Good Forecast, Missing Physics: Looking Inside of AI Weather Ensembles

"AI tools can make mistakes. Double-check important info." We all know that AI is fallible. But what about AI for science? How much can we trust scientific AI models? This study reveals that today's AI weather forecast models such as GenCast carry a systematic, noise-like bias at the mesoscale.

    Artificial intelligence has begun performing tasks that once required expert judgment: reading medical scans, drafting legal briefs, and, increasingly, predicting the global weather. Yet each new application raises a haunting question: how would we know whether the model is getting it right?

    In the world of large language models, we have a name for it: hallucination. We have learned that a confident answer from an AI can quickly dissolve into fiction. Now, a new study led by Hisu Kim and Jin-Ho Yoon at the Gwangju Institute of Science and Technology (GIST), published in npj Climate and Atmospheric Science, asks what an AI "hallucination" looks like in a weather forecast, and whether our usual scoring metrics would even detect it.

    To find the answer, the research team launched a year’s worth of forecasts throughout 2021. They compared three forecast systems: (1) IFS-HRES, the ECMWF's high-resolution deterministic model; (2) IFS-ENS, the industry-standard operational ensemble; and (3) GenCast, Google DeepMind’s state-of-the-art AI model, which uses a "diffusion process" to generate its ensemble forecasts. The team tracked kinetic energy (KE) at 300 hPa, the pressure level of the jet stream, where tiny errors rapidly amplify into major shifts in the weather, a phenomenon known as the butterfly effect.

The Diagnosis: A Tale of Two Scales

Figure 1: Kinetic energy spectra at 300 hPa for three forecast systems: IFS-HRES (top row), IFS-ENS (middle row), and GenCast (bottom row). The left column shows each system's spectrum at every forecast lead time, color-coded from blue (12 hours) to red (10 days); black lines mark the initial condition. The right column shows how each spectrum changes relative to that starting point. In the numerical models, energy drains from the smallest scales as forecast lead time grows, exactly as atmospheric dissipation requires. In GenCast, energy at the smallest scales drifts upward rather than downward, the visual signature of the physics missing from the model. (Figure 2 from the paper)

    By decomposing the wind fields into different spatial scales, the researchers "turned the forecasts inside out." The results revealed a fundamental divergence in how AI "sees" the atmosphere:

  • Success at the synoptic scale: At large scales, the realm of highs, lows, and typhoons, GenCast is remarkable. Its ensemble spread grows much like that of the real atmosphere, following the established laws of turbulence.
  • Breakdown at the mesoscale: Below the 400-kilometer mark, the physical realism collapses.

    In traditional numerical models, energy drains from the smallest scales as forecast lead time grows, a process called atmospheric dissipation. In GenCast, the energy at these scales drifts upward instead. Its energy spectrum stops following physical laws and settles into a featureless plateau: the mathematical signature of white noise rather than weather.
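
    To make the "featureless plateau" concrete, here is a minimal sketch in Python with NumPy. It is not the paper's method: the study works with global wind fields, while this toy uses a one-dimensional periodic signal, and the grid size, wavenumber band, and function names are all assumptions chosen for illustration. It builds one signal whose power falls off as the -3 power of wavenumber (one of the reference turbulence slopes) and one of pure white noise, then fits the slope of each spectrum on a log-log scale.

```python
import numpy as np

def power_spectrum(signal):
    """Power per wavenumber of a periodic 1D signal, with the mean (k = 0) dropped."""
    coeffs = np.fft.rfft(signal)
    power = np.abs(coeffs) ** 2 / signal.size ** 2
    wavenumbers = np.arange(coeffs.size)
    return wavenumbers[1:], power[1:]

def spectral_slope(wavenumbers, power, k_min, k_max):
    """Least-squares slope of log(power) against log(wavenumber) within [k_min, k_max]."""
    band = (wavenumbers >= k_min) & (wavenumbers <= k_max)
    slope, _intercept = np.polyfit(np.log(wavenumbers[band]), np.log(power[band]), 1)
    return slope

rng = np.random.default_rng(0)
n = 4096  # hypothetical number of grid points around a latitude circle

# Synthetic "turbulence-like" signal: random phases, amplitude ~ k^-1.5, so power ~ k^-3.
k = np.arange(n // 2 + 1, dtype=float)
amplitude = np.zeros_like(k)
amplitude[1:] = k[1:] ** -1.5
phases = rng.uniform(0.0, 2.0 * np.pi, k.size)
steep_signal = np.fft.irfft(amplitude * np.exp(1j * phases), n)

# White noise: independent random values at every grid point.
noise_signal = rng.standard_normal(n)

steep_slope = spectral_slope(*power_spectrum(steep_signal), k_min=10, k_max=1000)
noise_slope = spectral_slope(*power_spectrum(noise_signal), k_min=10, k_max=1000)
print(f"power-law signal slope: {steep_slope:.2f}")
print(f"white-noise slope:      {noise_slope:.2f}")
```

    The first slope comes out at -3 by construction, while the white-noise slope sits near zero: on a log-log plot, noise is a flat line. A spectrum that bends from a steep slope toward a plateau at small scales is therefore a hint that something noise-like has taken over there.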

The "Noise" Fingerprint

Figure 2: Rotational and divergent components of the kinetic energy (KE) spectra. Solid lines indicate rotational components and dashed lines indicate divergent components of KE for (a) IFS-HRES, (b) the first ensemble member of IFS-ENS, and (c) GenCast. Black lines for each model show the decomposed spectra of the initial condition. Gray dashed lines with slopes of -3 and -5/3 are shown as reference turbulence power laws. (Figure 4 from the paper)

    A second test confirmed the suspicion. Using Helmholtz decomposition, the team split the wind into its rotating and diverging parts. In the real atmosphere, the two remain clearly separated in magnitude at certain scales. In GenCast, they collapsed to equal magnitude, exactly what happens in pure, random noise.

    "The moment we saw those two components converge at the mesoscale, we stopped thinking about turbulence," says Hisu Kim, the study's lead author. "We started wondering whether what we were looking at was the diffusion noise itself (the very mechanism that built the ensemble), leaving its fingerprint in the forecast."

    This wasn't just a GenCast quirk. The team found the same "flat mesoscale" in four different versions of GenCast and in AIFS-ENS, the European Centre’s own AI ensemble model. It appears to be a systemic "fingerprint" of current ensemble forecasting methods that rely on injected noise.
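
    Why should random noise split evenly between rotation and divergence? The sketch below illustrates the idea in Python with NumPy. It is a simplified stand-in, not the paper's calculation: it assumes a flat, doubly periodic square grid rather than the sphere, and the function names and grid size are hypothetical. Each Fourier mode of the wind is projected onto its own wavevector; the part along the wavevector is divergent, and the remainder is rotational.

```python
import numpy as np

def helmholtz_split(u, v):
    """Split a doubly periodic wind (u, v) into rotational and divergent parts.

    Assumes equal grid spacing in x and y. Returns ((u_rot, v_rot), (u_div, v_div)).
    """
    ny, nx = u.shape
    kx = np.fft.fftfreq(nx)[np.newaxis, :]
    ky = np.fft.fftfreq(ny)[:, np.newaxis]
    k_squared = kx ** 2 + ky ** 2
    k_squared[0, 0] = 1.0  # the domain-mean wind has no direction to project onto
    u_hat, v_hat = np.fft.fft2(u), np.fft.fft2(v)
    # The part of each Fourier mode that points along its wavevector is divergent.
    along_k = (kx * u_hat + ky * v_hat) / k_squared
    u_div = np.fft.ifft2(kx * along_k).real
    v_div = np.fft.ifft2(ky * along_k).real
    # Whatever remains is non-divergent (it also keeps the domain-mean wind).
    return (u - u_div, v - v_div), (u_div, v_div)

def mean_ke(u, v):
    """Domain-mean kinetic energy per unit mass."""
    return 0.5 * np.mean(u ** 2 + v ** 2)

rng = np.random.default_rng(0)
n = 129  # odd size, so every Fourier mode has a distinct conjugate partner

# Case 1: white noise in both wind components.
u_noise = rng.standard_normal((n, n))
v_noise = rng.standard_normal((n, n))
rot, div = helmholtz_split(u_noise, v_noise)
noise_ratio = mean_ke(*rot) / mean_ke(*div)

# Case 2: a purely rotational flow from the streamfunction psi = sin(6*pi*x) * cos(4*pi*y),
# with u = -d(psi)/dy and v = d(psi)/dx.
x = np.arange(n) / n
xx, yy = np.meshgrid(x, x)
u_vortex = 4 * np.pi * np.sin(6 * np.pi * xx) * np.sin(4 * np.pi * yy)
v_vortex = 6 * np.pi * np.cos(6 * np.pi * xx) * np.cos(4 * np.pi * yy)
_, div_vortex = helmholtz_split(u_vortex, v_vortex)
vortex_div_fraction = mean_ke(*div_vortex) / mean_ke(u_vortex, v_vortex)

print(f"white noise, rotational / divergent KE: {noise_ratio:.2f}")
print(f"vortex flow, divergent share of KE:     {vortex_div_fraction:.1e}")
```

    For white noise the ratio lands close to one, because a random vector has no preferred direction relative to its wavevector. The structured vortex flow, by contrast, carries essentially no divergent energy. Equal rotational and divergent energy is what unstructured noise looks like under this test.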

Physical Consequence: Smearing the Jet Stream

Figure 3: The magnitude of the kinetic energy gradient at 300 hPa for (a) ERA5 reanalysis, (b) IFS-HRES, (c) IFS-ENS, and (d) GenCast forecasts. Red filaments mark the sharp edges of the jet stream. Panels (a) to (c) show a clear filament structure along the midlatitude jet stream, while a noise-like pattern covers the GenCast result in (d). (Figure 8 from the paper)

    The practical result of these artifacts is visible in the KE gradient, essentially a "sharpness filter" for the wind. In the reanalysis and the numerical models, the jet stream has crisp boundaries that mark the edge of high-speed winds. GenCast, however, produces a "static" texture reminiscent of an old TV screen. It fails to produce the sharp, filamentary structures required to model the jet stream core accurately.
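
    The "sharpness filter" idea can be tried on a toy example. The following Python and NumPy sketch is an illustration under stated assumptions, not the paper's diagnostic: it uses a flat grid with a hypothetical 25 km spacing instead of latitude and longitude on the sphere, an idealized Gaussian jet, and an arbitrary 3 m/s noise level. It computes KE as half the squared wind speed, takes the magnitude of its horizontal gradient, and compares a clean jet with the same jet plus grid-scale white noise.

```python
import numpy as np

def ke_gradient_magnitude(u, v, dx, dy):
    """Magnitude of the horizontal gradient of kinetic energy on a flat, regular grid."""
    ke = 0.5 * (u ** 2 + v ** 2)
    dke_dy, dke_dx = np.gradient(ke, dy, dx)  # axis 0 is y, axis 1 is x
    return np.hypot(dke_dx, dke_dy)

rng = np.random.default_rng(0)
ny, nx = 121, 240
dx = dy = 25_000.0  # hypothetical 25 km grid spacing, in meters

# Idealized westerly jet: 60 m/s at its core, Gaussian profile across y.
y = np.arange(ny) * dy
core = ny // 2
jet_profile = 60.0 * np.exp(-(((y - y[core]) / 300_000.0) ** 2))
u_clean = np.repeat(jet_profile[:, np.newaxis], nx, axis=1)
v_clean = np.zeros_like(u_clean)

# Same jet with grid-scale white noise (3 m/s standard deviation) added.
u_noisy = u_clean + 3.0 * rng.standard_normal(u_clean.shape)
v_noisy = v_clean + 3.0 * rng.standard_normal(v_clean.shape)

grad_clean = ke_gradient_magnitude(u_clean, v_clean, dx, dy)
grad_noisy = ke_gradient_magnitude(u_noisy, v_noisy, dx, dy)

# Along the jet axis the clean field is flat, so any gradient there comes from noise.
print("clean jet, mean gradient on the axis:", grad_clean[core].mean())
print("noisy jet, mean gradient on the axis:", grad_noisy[core].mean())
```

    In the clean case the gradient is confined to the two flanks of the jet and vanishes along its axis. Once noise is added, the gradient fills the jet core as well, because KE depends on the square of the wind speed and small perturbations are amplified wherever the wind is strong. That is the toy version of the "static" texture described above.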

A Physical Conscience for AI

    Does this mean AI forecasts are unreliable? Not at all. GenCast’s performance on conventional metrics is a genuine advance and often superior to traditional models. But as corresponding author Jin-Ho Yoon notes, "The diversity we see inside today’s AI ensembles is limited by statistical properties rather than grounded in physical law." The study, titled "A Spectral Test of the Butterfly Effect and Physical Consistency in the Diffusion-Based GenCast’s Ensembles," serves as a diagnosis for the next generation of AI development.

    If large language models taught us that fluency is not truth, this work suggests a parallel for weather AI: at the smallest scales, the shape of variance can outrun the physics beneath it. The road toward trustworthy scientific AI runs through tests like this, which ask a model not just whether it predicts the weather, but whether it is "speaking the language of the atmosphere."
