The Missing Counterfactual for Terrestrial Water Storage Attribution

Detecting the human and warming fingerprint on terrestrial water storage requires knowing how much the land water system could vary on its own. We show that this baseline does not yet exist.

Published in Earth & Environment

The Land Remembers What the Atmosphere Forgets

Terrestrial water storage is the water held on and beneath the land surface: soil moisture, groundwater, snow and ice, rivers, lakes, wetlands, and canopy storage. Its anomaly, TWSA, is inferred from changes in Earth's gravity field and records the state left behind after a basin has stored, routed, evaporated, withdrawn, or released water through time. It is the memory of the land water system.

That memory is what makes TWSA both powerful and difficult. The same atmospheric sequence can produce different storage outcomes depending on basin properties and antecedent conditions. A wet pulse over depleted storage enters recharge. The same pulse over saturated storage becomes a flood anomaly. The atmosphere moves on; the land carries the residue. TWSA therefore compresses drought severity, flood persistence, groundwater depletion, and human water use into one evolving quantity.

Beginning with GRACE in 2002 and continuing with GRACE-FO, satellite gravimetry has provided the global constraint from which basin-integrated TWSA is inferred. That record is one realized history of a nonlinear system shaped by internal climate variability, long-term warming, and human water management whose effects overlap in space and time (Figure 1).

Figure 1. GRACE/GRACE-FO-inferred TWSA patterns used to motivate the attribution problem. Left map shows the long-term TWSA trend over 2002-2025, while the middle and right maps show La Niña and El Niño TWSA composites during 2002-2025 after detrending and deseasonalization. Brown/red indicates storage deficit or drying; blue indicates storage surplus or wetting. The maps show why TWSA is a basin-integrated, memory-bearing variable: long-term redistribution and ENSO-conditioned extremes coexist in the same GRACE-inferred record.

The Baseline Comes First

Every attribution claim begins with a comparison. A temperature rise, flood persistence, drought severity, groundwater depletion trend, rainfall shift, or sea-level acceleration gains causal meaning only after we ask: how large could this change have been under internal variability alone? That question supplies the baseline. Without it, the observation tells us what happened; it cannot tell us what caused it.

For TWSA, this comparison is harder than for most hydroclimate variables. Temperature, precipitation, and sea level have pre-industrial baselines that are imperfect but conceptually clean. TWSA is both a climate outcome and a managed water variable. Pumping removes groundwater. Irrigation redistributes water among rivers, aquifers, soils, and the atmosphere. Reservoirs store, delay, and release runoff. These interventions enter the same GRACE-inferred record that also carries droughts, pluvials, and long-term warming. No preprocessing step separates them cleanly.

The order of reasoning must therefore be strict: first characterize the observed record, then test whether the proposed baseline can contain it, and only then interpret any remaining departure as evidence of external forcing. The prior question is not attribution. It is adequacy.

Single-model initial-condition large ensembles (SMILEs) attempt to supply that baseline (Figure 2). A SMILE runs the same climate model many times under identical external forcing, each time from a slightly different starting state. The chaotic climate system amplifies those differences, and the members diverge into distinct histories of internal variability. For TWSA, each member yields a different sequence of droughts, pluvials, and recoveries. The collection of those histories is the proposed null: the range of storage behavior that internal variability alone could generate. If GRACE falls outside it, something beyond internal variability is required. But that inference holds only if the null is adequate, that is, only if the simulated storage histories are energetic enough, persistent enough, and extreme enough to contain what the land water system actually produces.

Figure 2. Conceptual framing of the TWSA attribution problem. GRACE/GRACE-FO does not measure terrestrial water storage directly; it provides a gravity constraint from which basin-integrated TWSA is inferred. The inferred trajectory is one realized history of the land water system. A single-model large ensemble samples many internally generated histories under the same land physics, producing a model counterfactual envelope. The attribution question is whether the GRACE-inferred history is a typical draw from that model world or falls outside it.

Testing the Null: 18,258 Simulated Years Versus 23 Years of GRACE-Inferred TWSA

We treated the SMILE histories as hypotheses about the missing baseline, asking whether 80 members from CESM2-LENS2 and 18 members from IPSL-CM6A-LR can supply the null distribution required for TWSA attribution. The test was direct: after removing the model-estimated forced component and applying identical preprocessing to both records, the GRACE trajectory should behave like a typical draw from the SMILE internal-variability distribution.
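The forced-component step can be pictured with a minimal sketch. It assumes the forced response is estimated as the ensemble mean at each time step, a common choice that is not stated in this post; the function name `remove_forced_component` and all numbers are hypothetical.

```python
from statistics import fmean

def remove_forced_component(members):
    """Split an ensemble into a forced estimate and internal-variability residuals.

    members: list of equal-length lists, one TWSA series per ensemble member.
    ASSUMPTION: the forced component is the ensemble mean at each time step.
    """
    n_time = len(members[0])
    if any(len(m) != n_time for m in members):
        raise ValueError("all members must have the same length")
    forced = [fmean(m[t] for m in members) for t in range(n_time)]
    residuals = [[m[t] - forced[t] for t in range(n_time)] for m in members]
    return forced, residuals

# Toy ensemble (hypothetical values): shared trend plus member-specific departures.
members = [
    [0.0, 1.0, 2.0, 3.0],
    [1.0, 1.0, 3.0, 5.0],
    [-1.0, 1.0, 1.0, 4.0],
]
forced, residuals = remove_forced_component(members)

# The same model-estimated forced component is subtracted from the observed
# series, so both records receive identical preprocessing.
obs = [0.5, 1.5, 1.0, 6.0]  # hypothetical observed series
obs_residual = [o - f for o, f in zip(obs, forced)]
```

By construction the member residuals sum to zero at every time step, so what remains in each member is its departure from the shared response.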

We ran this test across 184 global river basins, drawing on 18,258 simulated model-years. We compared amplitude, variance, drought depth, pluvial height, persistence, recovery, and spectral power: how large the storage swings are, how severe the extremes become, how long anomalies persist, and how variance is distributed across timescales.
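To make a few of these diagnostics concrete, here is a minimal sketch of how they might be computed from one anomaly series. The definitions are assumptions chosen for illustration (peak-to-trough amplitude, population standard deviation, largest negative and positive departures from the mean, and lag-1 autocorrelation as a stand-in for persistence); the post does not give the exact formulas, and recovery and spectral power are omitted.

```python
from statistics import fmean, pstdev

def storage_diagnostics(series):
    """Return simple storage diagnostics for one TWSA anomaly series.

    ASSUMPTION: these definitions are illustrative, not the study's own.
    """
    mean = fmean(series)
    anomalies = [x - mean for x in series]
    # Lag-1 autocorrelation: covariance of consecutive anomalies over total variance.
    lagged = sum(a * b for a, b in zip(anomalies[:-1], anomalies[1:]))
    total = sum(a * a for a in anomalies)
    return {
        "amplitude": max(anomalies) - min(anomalies),
        "std": pstdev(series),
        "drought_depth": -min(anomalies),   # deepest deficit, reported as positive
        "pluvial_height": max(anomalies),   # highest surplus
        "lag1_autocorr": lagged / total if total > 0 else 0.0,
    }

# Hypothetical series with one wet phase followed by one dry phase.
series = [0.0, 2.0, 4.0, 2.0, 0.0, -2.0, -4.0, -2.0]
diag = storage_diagnostics(series)
```

Applying the same function to the GRACE-inferred series and to every simulated member-window keeps the comparison like for like.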

The null failed repeatedly (Figure 3). A calibrated envelope should contain the GRACE-inferred values in roughly 90% of basins. CESM2 enclosed the GRACE amplitude in 47% of basins and variance in 53%. IPSL enclosed the amplitude in 37% and variance in 42%. The simulated storage worlds were too narrow and too weakly energized.
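The coverage arithmetic behind these percentages can be sketched as follows. The sketch assumes the envelope is the 5th to 95th percentile range of member values in each basin, which is what a nominal 90% coverage would imply but is not spelled out here; basin labels, function names, and values are hypothetical.

```python
from statistics import quantiles

def envelope(values):
    """Return the (5th, 95th) percentile envelope of ensemble values.

    ASSUMPTION: a 5-95% range, consistent with a nominal 90% coverage.
    """
    cuts = quantiles(values, n=20, method="inclusive")  # 19 cut points at 5% steps
    return cuts[0], cuts[-1]

def coverage(obs_by_basin, members_by_basin):
    """Fraction of basins whose observed value lies inside the envelope."""
    inside = 0
    for basin, obs in obs_by_basin.items():
        lo, hi = envelope(members_by_basin[basin])
        inside += lo <= obs <= hi
    return inside / len(obs_by_basin)

# Hypothetical diagnostic values: 21 member-windows per basin, one observation each.
member_values = [float(v) for v in range(1, 22)]
members_by_basin = {name: member_values for name in ("A", "B", "C", "D")}
obs_by_basin = {"A": 10.0, "B": 25.0, "C": 2.0, "D": 0.5}

frac = coverage(obs_by_basin, members_by_basin)
```

If the ensemble were well calibrated, `frac` computed over many basins would sit near 0.9; values near one half or below indicate an envelope that is too narrow.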

The failure sharpened for extremes. CESM2 enclosed only 25% of GRACE drought depths and 26% of pluvial heights. IPSL enclosed 39% and 30% respectively. In several basins, no member-window reached the observed severity at all. The models also distorted memory: GRACE anomalies were persistent, yet the models produced even longer memory with weaker variance, the signature of over-smoothed storage dynamics. Spectral structure was sometimes right in period but consistently damped in power.

The multivariate test gave the plainest verdict. In the joint space of amplitude, standard deviation, maximum drought depth, and maximum pluvial height, 82% of basins were incompatible with CESM2 and 78% with IPSL. The GRACE trajectory sat farther from the model ensemble center than nearly all simulated member-windows. These ensembles do not provide a reliable null for TWSA attribution. That is the prior question, and it must be answered before the forcing question can be asked.
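One way to express "farther from the ensemble center than nearly all member-windows" is a Mahalanobis distance in the joint diagnostic space. The sketch below is an assumption about how such a test could be built, not the method used in the study; all function names are hypothetical, and the toy data use two diagnostics instead of four to stay short.

```python
from statistics import fmean

def covariance_matrix(rows):
    """Return column means and the sample covariance matrix of the rows."""
    n, k = len(rows), len(rows[0])
    means = [fmean(r[j] for r in rows) for j in range(k)]
    cov = [[sum((r[i] - means[i]) * (r[j] - means[j]) for r in rows) / (n - 1)
            for j in range(k)] for i in range(k)]
    return means, cov

def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("singular covariance matrix")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(aug[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (aug[i][n] - tail) / aug[i][i]
    return x

def mahalanobis_sq(point, means, cov):
    """Squared Mahalanobis distance of a point from the ensemble center."""
    diff = [p - m for p, m in zip(point, means)]
    return sum(d * s for d, s in zip(diff, solve(cov, diff)))

def fraction_closer(obs, rows):
    """Fraction of member-windows that sit closer to the center than obs does."""
    means, cov = covariance_matrix(rows)
    d_obs = mahalanobis_sq(obs, means, cov)
    return sum(mahalanobis_sq(r, means, cov) < d_obs for r in rows) / len(rows)

# Hypothetical member-window diagnostics (two columns for brevity).
rows = [(1.0, 1.0), (-1.0, -1.0), (1.0, -1.0), (-1.0, 1.0)]
far = fraction_closer((3.0, 3.0), rows)    # observation well outside the cloud
near = fraction_closer((0.0, 0.0), rows)   # observation at the ensemble center
```

A basin could then be flagged as incompatible when the fraction exceeds a chosen threshold such as 0.95; that threshold is likewise an assumption.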

Figure 3. Basin-scale compatibility of GRACE-inferred TWSA with the model counterfactual envelope. The map classifies 184 river basins by whether the joint GRACE-inferred TWSA diagnostics are compatible, marginal, or incompatible with CESM2-LENS2 and IPSL-CM6A-LR internal variability. The Amazon example illustrates a basin outside the model envelopes: the black GRACE/GRACE-FO trajectory leaves the simulated spread, especially during the recent record. The Danube inset shows a contrasting case where the inferred trajectory remains within the envelopes. Together the panels show that the failure is not isolated to one metric or one basin; it is spatially widespread and tied to variance, extremes, and storage memory.

Implication and What Comes Next

The limitation is structural, not statistical. Multiplying members of the same model cannot recover variance that its land physics suppresses. If groundwater coupling is shallow, floodplain storage absent, and drainage too rapid, no ensemble size repairs that. The null stays too narrow regardless of how many members repeat it.
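A toy illustration of this point, under strong simplifying assumptions: model storage anomalies are drawn from a Gaussian whose spread is half the hypothetical true spread, and the envelope is a 5th to 95th percentile range. None of these numbers come from the study. Adding members sharpens the estimate of the model's own envelope; it does not widen it toward the truth.

```python
import random
from statistics import quantiles

def envelope_width(n_members, sigma, seed=0):
    """Width of the 5-95% envelope from n_members Gaussian draws."""
    rng = random.Random(seed)
    draws = [rng.gauss(0.0, sigma) for _ in range(n_members)]
    cuts = quantiles(draws, n=20, method="inclusive")
    return cuts[-1] - cuts[0]

MODEL_SIGMA = 1.0  # hypothetical damped model variability
TRUE_SIGMA = 2.0   # hypothetical real-world variability

# Envelope width for increasingly large ensembles of the same damped model.
widths = {n: envelope_width(n, MODEL_SIGMA) for n in (50, 500, 5000)}

# Width the envelope would need in order to match the true 5-95% range.
target_width = 2 * 1.645 * TRUE_SIGMA

for n, width in widths.items():
    print(f"{n:>5} members: width {width:.2f} (needed {target_width:.2f})")
```

The widths settle near the model's own 5-95% range (about 3.3 for a unit standard deviation) and stay well short of the target, however many members are drawn.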

A narrow null has a direct attribution cost. When GRACE exceeds the model envelope, the excess is not evidence of forcing. It is evidence of an inadequate baseline. Attributing it to human water use or long-term warming would be a mistake the design of the experiment made inevitable.

The fix is not larger SMILEs. It is better land physics, ensembles that sample land-process uncertainty alongside atmospheric initial conditions, and GRACE used as a calibration target during model development rather than a check applied after. Until simulated storage histories can reproduce the amplitude, variance, persistence, and extremes that GRACE observes, they cannot serve as a defensible null for TWSA attribution. That is the standard CMIP7 should require.
