Behind the Paper

From COVID Crisis to ChatGPT: How Urgent Evidence Needs Led to Rigorous AI Evaluation

Public health institutes aim to deliver reliable, timely evidence to stakeholders—policymakers, practitioners, and the public. Evidence synthesis sits at the heart of this mission. During COVID-19, this challenge became acute.

At the height of the COVID-19 pandemic, systematic reviewers at the Norwegian Institute of Public Health (NIPH) were overwhelmed by health authorities' requests for updated evidence to inform infection prevention and control policies. Traditional review methods couldn't keep pace: health authorities needed to know within a week whether they should close schools, not 12 to 24 months later, the traditional timeframe for important evidence syntheses. That pushed us to try something new.

We began experimenting with machine learning (ML) tools to support evidence synthesis, hoping to work faster without compromising quality. Results were promising enough that we established a dedicated ML team—not only to continue emergency-era innovation, but to embed ML into our routine evidence synthesis work.

This work has continued post-pandemic through several team iterations. While each had slightly different aims, all combined three core activities: horizon scanning to spot promising tools and technologies early; evaluation work to assess automated tools' benefits and harms; and implementation work to help reviewers adopt such tools effectively.

Through horizon scanning, we were quick to identify the potential of generative artificial intelligence (AI) and large language models (LLMs). However, evidence synthesis is highly consequential—NIPH's commissioned work directly influences Norwegian healthcare and welfare systems, and the systematic reviews it publishes through Cochrane and other journals can inform decision making globally. Adopting new tools without understanding how well they work would be unwise. We therefore designed a rigorous comparison of human and LLM performance on a core evidence synthesis task.

Why Risk of Bias Assessment?

When synthesizing evidence, systematic reviewers don't just summarize results—we assess how trustworthy they are. Risk of bias (RoB) assessment examines whether a trial's design, conduct, or reporting could systematically distort its findings. This means asking questions like: Was randomization robust? Were participants and investigators blinded? Were outcome data complete and fully reported? A trial at risk of bias is likely to over- or underestimate benefits and harms, so a RoB assessment helps reviewers and stakeholders use evidence appropriately. Unfortunately, RoB assessment is time-consuming, and best practice requires that at least two highly trained, experienced researchers assess each paper and reach consensus on a trial's risk of bias.

RoB assessment was an ideal test case for comparing human and LLM performance: it's clearly defined, labor-intensive, and could potentially be completely automated—if LLMs can perform well enough. 

We therefore wrote a detailed protocol describing our design, analysis plan, and how we would interpret and report results.

Our Approach

The generative AI field moves rapidly, but academic publishing does not. We posted our protocol as a preprint, committing to publish whatever we found, and submitted it to BMC Medical Research Methodology for peer review, where it was recently published.

We planned to randomly select 100 Cochrane systematic reviews, and from each review one randomized trial that had been assessed for RoB by at least two human reviewers. Using 25 of these review-trial pairs, we would develop the best ChatGPT prompt we could, judging prompts by how well the LLM's RoB assessments agreed with those of the human reviewers. We would then apply the best-performing prompt to the remaining 75 trials, again comparing ChatGPT's assessments with the human assessments, to obtain our final estimate of agreement.
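To make the comparison concrete, here is a minimal sketch of how agreement between human and LLM RoB judgements could be quantified. It is illustrative only, not our protocol's prespecified analysis: the category labels, example data, and the agreement_summary function are hypothetical, and percentage agreement and weighted Cohen's kappa are used simply as familiar agreement measures.

```python
# Illustrative sketch only: quantify agreement between hypothetical human
# consensus RoB judgements and ChatGPT's judgements for the same trials.
from sklearn.metrics import cohen_kappa_score

# Ordinal judgement categories, ordered from lowest to highest risk (illustrative).
LABELS = ["low", "some concerns", "high"]

def agreement_summary(human, llm):
    """Return raw percentage agreement and linearly weighted Cohen's kappa."""
    assert len(human) == len(llm) and len(human) > 0
    percent_agreement = sum(h == m for h, m in zip(human, llm)) / len(human)
    kappa = cohen_kappa_score(human, llm, labels=LABELS, weights="linear")
    return percent_agreement, kappa

# Hypothetical judgements for five trials, purely for illustration.
human_judgements = ["low", "high", "some concerns", "low", "high"]
chatgpt_judgements = ["low", "high", "low", "low", "some concerns"]

pct, kappa = agreement_summary(human_judgements, chatgpt_judgements)
print(f"Percentage agreement: {pct:.0%}, weighted kappa: {kappa:.2f}")
```

A weighted kappa gives partial credit when judgements differ by only one category (for example, "low" versus "some concerns"), which is one reasonable way to handle ordinal RoB ratings; the actual analysis choices are specified in our published protocol.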

The protocol development attracted coauthors from across Norway and researchers with affiliations in Sweden, the UK, Canada, and Colombia.

The Prespecification Dilemma

We weren't alone in spotting this research opportunity. While we carefully followed our protocol, other teams began publishing studies on human-LLM agreement in RoB assessment. Unlike our work, however, these studies were generally not prespecified. Some reported excellent agreement; others the opposite. This heterogeneity may reflect genuine differences in study objectives, but some of it likely results from the lack of prespecification—for example, from iteratively changing methods after seeing the data. Systematic reviewers familiar with the Cochrane RoB2 tool will recognize this as a "risk of bias in selection of the reported result". Prespecification in an openly available protocol is best practice, and its widespread absence means that much research in this area should be treated as exploratory.

Publishing a protocol in a high-interest field is risky because it exposes plans to potential competitors. But it can also encourage collaboration, strengthen the research, and reduce duplicated effort, minimizing research waste.

A Call for Better Incentives

Those relying on evidence syntheses should be confident that synthesis methods are fit for purpose and that the evidence underpinning the adoption of those methods is sound. Unfortunately, incentives to publish first discourage prespecification and peer review. 

More journals should adopt the Registered Report model: fast-track peer review of protocols and guarantee rapid publication of the results papers, whatever the findings, provided authors report and justify any deviations from the published plan. Funders of methodological research could explicitly favor grant proposals that commit to prespecifying and publishing protocols, prioritize applicants with track records of prespecification, and release funding in tranches contingent on protocols being published for subsequent research. Universities and research institutes could weigh prespecification in hiring and promotion.

The stakes are high. If LLMs can reliably perform RoB assessment or other evidence synthesis tasks, they could free human reviewers for higher-value work, accelerate review production, and—most importantly—help get safe, effective, affordable treatments into routine use sooner. But only rigorous evaluations can determine the degree to which LLMs help and harm, and what is traded off when we adopt them.

Our results are now in, and we look forward to sharing them soon. Our work on this study has reinforced our conviction that in fast-moving fields like AI applications in research, the academic community needs better mechanisms to balance speed with rigor—and that prespecification and appropriate research methods are a solid defense against being misled by our own enthusiasm for promising new tools.