Behind the Paper - A New Hope: Studying Reproducibility at Scale

Over the past decade, debates about reproducibility have become more visible. A growing number of studies have examined specific disciplines, journals, or research designs, offering valuable insights into how reproducible particular corners of the literature are.

As social scientists watching and participating in these developments, we felt that a systematic, broad analysis was missing.

We wanted to know what would happen if we stepped back and looked across a wide range of recent, leading research in economics and political science using a common framework. Our goal was not to revisit old debates or deliver a simple verdict, but rather to document at scale how reproducible recent published work is.  

However, re-running existing code on the provided datasets was the straightforward part. The real difficulty was organizational and intellectual: how could we assess many papers consistently while still respecting the context in which each study was conducted and analyzed? Scale could not become our sole aim, since a large-scale “box-checking” exercise would quickly become superficial and careless.

At the Institute for Replication, we addressed this problem by developing what we call “Replication Games”. These are intensive, hackathon-style events in which small teams of researchers work on a single published paper within their own area of expertise. Subfields such as comparative politics, health economics, or development economics each have their own preferred analytical approaches, standards, and reporting conventions, which is why matching published papers with reproduction teams that share this expertise is valuable.

To coordinate this effort, we developed reporting templates and provided guidance on coding practices, documentation standards, and tone when writing reproduction reports. This has allowed us to combine the results of over one hundred replication projects into a coherent meta-database.

A crucial and time-intensive component of this process is communication with the original authors. In most cases, we reach out directly to clarify details, request missing materials, or resolve discrepancies uncovered during reproduction. As a rule, we share the report with the (lead) authors of the published paper before disseminating it. Naturally, engagement levels vary, but overall this dialogue has been productive and has often led to improved documentation and replication packages and, occasionally, to corrections to published work.

One of our most striking findings was the relatively high rate of computational reproducibility. In the large majority of cases, reproduction teams were able to reproduce the main tables and figures using the original authors’ data and code. There were hurdles, such as incomplete documentation, minor coding inconsistencies, and ambiguities regarding data processing, but outright failures were rare. To the extent that our sample of reproduced articles reflects rapidly improving data-availability policies and the spread of open-science practices, this is reassuring.
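
For readers curious about what a computational reproducibility check involves in practice, the sketch below shows the basic idea: re-run the replication package and compare the re-computed estimates against the published numbers. It is only illustrative; the file name, variable names, published values, and tolerance are hypothetical and not drawn from any specific paper or package.

```python
# Minimal, illustrative sketch of a computational reproducibility check.
# Assumes a hypothetical replication package whose script has already re-estimated
# the paper's main model and saved the estimates to "reproduced_estimates.csv".
# The "published" values and the tolerance below are placeholders, not real numbers.
import pandas as pd

published = {"treatment_effect": 0.142, "std_error": 0.036}  # hypothetical published estimates
reproduced = pd.read_csv("reproduced_estimates.csv").iloc[0]  # first (and only) row of estimates

tolerance = 1e-3  # allow for rounding and minor software-version differences
for name, value in published.items():
    diff = abs(reproduced[name] - value)
    status = "matches" if diff < tolerance else "DIFFERS"
    print(f"{name}: published={value}, reproduced={reproduced[name]:.4f} -> {status}")
```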

When we moved from computational reproduction to robustness, the picture became more nuanced. Reproduction teams explored how results changed under alternative, defensible analytical choices: slightly different sample definitions, or reasonable adjustments to control variables or estimation strategies. Here, heterogeneity was common: some findings were highly stable across specifications, while others were more sensitive. Often, this variation appeared within the same paper.
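
To make concrete what such a robustness exercise can look like, here is a minimal sketch using simulated data and the statsmodels formula API. The variable names and the particular specifications are hypothetical; the point is simply the pattern of re-estimating one claim under alternative, defensible choices and comparing the estimates.

```python
# Minimal, illustrative sketch of a specification-robustness exercise.
# Uses simulated data; variable names and specifications are hypothetical.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 500
df = pd.DataFrame({
    "treatment": rng.integers(0, 2, n),
    "age": rng.normal(40, 10, n),
    "income": rng.normal(50, 15, n),
})
df["outcome"] = 0.5 * df["treatment"] + 0.02 * df["age"] + rng.normal(0, 1, n)

# Alternative, defensible analytical choices: baseline, an added control,
# and a restricted sample definition.
specs = {
    "baseline": ("outcome ~ treatment + age", df),
    "extra control": ("outcome ~ treatment + age + income", df),
    "restricted sample": ("outcome ~ treatment + age", df[df["age"] >= 30]),
}

for label, (formula, data) in specs.items():
    fit = smf.ols(formula, data=data).fit()
    est, se = fit.params["treatment"], fit.bse["treatment"]
    print(f"{label:18s}: treatment effect = {est:.3f} (SE {se:.3f})")
```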

This nuance is central to how we interpret our findings. Reproducibility (the ability to recreate published results from the original data and code) is not the same as robustness (the stability of results under alternative analytical approaches). Reproducibility thus speaks to transparency and documentation. In contrast, robustness speaks to the sensitivity of an empirical claim. Both are important, but they capture different dimensions of scientific quality.

We are, however, careful about the scope of our conclusions. Our analysis focuses on recent articles in leading journals that require authors to share data and code. As a result, we cannot draw conclusions about older research, about outlets without such policies, or about the full diversity of subfields within economics and political science. Expanding the coverage of our meta-database to other (sub)fields and disciplines remains an ongoing priority.

Why start with this relatively narrow slice of the literature? One reason is normative. We see this initiative as part of raising standards within our own profession. By focusing on recent work and engaging constructively with authors, we aim to strengthen norms around transparency and reproducibility going forward. This forward-looking orientation has helped the project gain broad support and participation.

On a more practical level, our Replication Games provide a unique opportunity for more junior scholars to learn from top researchers in their field: what standards govern how data are cleaned, which methods are used, how empirical decisions are justified, and how results are presented? In that sense, reproduction is not only about verification; it is also about learning and professional development. Making data and code available benefits young researchers and strengthens the discipline as a whole.

Finally, our experience has shaped how we think about reproduction reports themselves. They are rarely entirely positive or entirely negative. Most contain nuance: they reproduce central findings while identifying specific sensitivities, clarifying or challenging interpretations, or extending the analysis to understand mechanisms behind the results. This kind of detailed engagement is a sign of healthy scientific exchange, since it reflects the reality that empirical research involves judgment calls and trade-offs.

We would in fact encourage authors to see reports as complements to their work and to make them visible alongside the original publications on their personal websites. After all, reproducibility and robustness analyses extend the scientific conversation; they do not end it.

Looking ahead, we see our project as a model for how reproductions can be organized at scale while preserving expertise, respect and constructive dialogue. Our broader goal is to contribute to a culture in which reproducibility is an integral and expected part of how economists and other social scientists produce and evaluate knowledge.

~ Abel Brodeur, Derek Mikola, Nikolai Cook, Lenka Fiala
