From COVID Crisis to ChatGPT: How Urgent Evidence Needs Led to Rigorous AI Evaluation

Public health institutes aim to deliver reliable, timely evidence to stakeholders—policymakers, practitioners, and the public. Evidence synthesis sits at the heart of this mission. During COVID-19, this challenge became acute.

Explore the Research

Using a large language model (ChatGPT) to assess risk of bias in randomized controlled trials of medical interventions: protocol for a pilot study of interrater agreement with human reviewers - BMC Medical Research Methodology

Background: Risk of bias (RoB) assessment is an essential part of systematic reviews that requires reading and understanding each eligible trial and RoB tools. RoB assessment is subject to human error and is time-consuming. Machine learning-based tools have been developed to automate RoB assessment using simple models trained on limited corpuses. ChatGPT is a conversational agent based on a large language model (LLM) that was trained on an internet-scale corpus and has demonstrated human-like abilities in multiple areas, including healthcare. LLMs might be able to support systematic reviewing tasks such as assessing RoB. We aim to assess interrater agreement in overall (rather than domain-level) RoB assessment between human reviewers and ChatGPT in randomized controlled trials of medical interventions.

Methods: We will randomly select 100 individually- or cluster-randomized, parallel, two-arm trials of medical interventions from recent Cochrane systematic reviews that have been assessed using the RoB1 or RoB2 family of tools. We will exclude reviews and trials that were performed under emergency conditions (e.g., COVID-19), as well as public health and welfare interventions. We will use 25 of the trials and human RoB assessments to engineer a ChatGPT prompt for assessing overall RoB, based on trial methods text. We will obtain ChatGPT assessments of RoB for the remaining 75 trials and compare them with the human assessments. We will then estimate interrater agreement using Cohen's κ.

Results: The primary outcome for this study is overall human-ChatGPT interrater agreement. We will report observed agreement with an exact 95% confidence interval, expected agreement under random assessment, Cohen's κ, and a p-value testing the null hypothesis of no difference in agreement. Several other analyses are also planned.

Conclusions: This study is likely to provide the first evidence on interrater agreement between human RoB assessments and those provided by LLMs and will inform subsequent research in this area.

At the height of the COVID-19 pandemic, systematic reviewers at the Norwegian Institute of Public Health (NIPH) were overwhelmed by health authorities' requests for updated evidence to inform infection prevention and control policies. Traditional review methods couldn't keep pace: health authorities needed to know within a week whether they should close schools, not 12 to 24 months later, the typical timeframe for a major evidence synthesis. That pushed us to try something new.

We began experimenting with machine learning (ML) tools to support evidence synthesis, hoping to work faster without compromising quality. Results were promising enough that we established a dedicated ML team—not only to continue emergency-era innovation, but to embed ML into our routine evidence synthesis work.

This work has continued post-pandemic through several team iterations. While each had slightly different aims, all combined three core activities: horizon scanning to spot promising tools and technologies early; evaluation work to assess automated tools' benefits and harms; and implementation work to help reviewers adopt such tools effectively.

Through horizon scanning, we were quick to identify the potential of generative artificial intelligence (AI) and large language models (LLMs). However, evidence synthesis is highly consequential—NIPH's commissioned work directly influences Norwegian healthcare and welfare systems, and the systematic reviews it publishes via Cochrane and in other journals can inform decision-making globally. Adopting new tools without understanding how well they work would be unwise. We therefore designed a rigorous comparison of human and LLM performance on a core evidence synthesis task.

Why Risk of Bias Assessment?

When synthesizing evidence, systematic reviewers don't just summarize results—we assess how trustworthy they are. Risk of bias (RoB) assessment examines whether trial design, conduct, or reporting could systematically distort findings. This means asking questions like: Was randomization robust? Were participants and investigators blinded? Were outcome data complete and fully reported? A trial at high risk of bias is likely to over- or underestimate benefits and harms, so RoB assessment helps reviewers and stakeholders use evidence appropriately. Unfortunately, RoB assessment is time-consuming, and best practice requires that at least two highly trained, experienced researchers assess each paper and reach consensus on a trial's risk of bias.
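
As a rough illustration of how such domain-level questions combine into a single rating, the sketch below mimics, in simplified form, the roll-up logic of the Cochrane RoB 2 tool; the real algorithm also allows "some concerns" across multiple domains to be escalated to an overall "high" judgment.

```python
# Simplified sketch of rolling domain-level RoB 2 judgments up to an overall
# rating. The real Cochrane algorithm can also escalate several "some concerns"
# judgments to "high" when they substantially lower confidence in the result.

DOMAINS = [
    "randomization process",
    "deviations from intended interventions",
    "missing outcome data",
    "measurement of the outcome",
    "selection of the reported result",
]


def overall_rob(judgments: dict[str, str]) -> str:
    """Each domain judgment is 'low', 'some concerns', or 'high'."""
    levels = [judgments[d] for d in DOMAINS]
    if "high" in levels:
        return "high"
    if "some concerns" in levels:
        return "some concerns"
    return "low"


example = {
    "randomization process": "low",
    "deviations from intended interventions": "some concerns",
    "missing outcome data": "low",
    "measurement of the outcome": "low",
    "selection of the reported result": "low",
}
print(overall_rob(example))  # -> some concerns
```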

RoB assessment was an ideal test case for comparing human and LLM performance: it's clearly defined, labor-intensive, and could potentially be completely automated—if LLMs can perform well enough. 
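
The study itself used ChatGPT. Purely as an illustration of what such automation involves (not the authors' prompt, model, or workflow), a call to an LLM API might look like the following minimal sketch using the OpenAI Python SDK; the prompt wording, model name, and response format are placeholder assumptions.

```python
# Hypothetical sketch only: not the study's prompt, model, or workflow.
# Sends a trial's methods text to an LLM and asks for an overall RoB judgment.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment


def assess_overall_rob(methods_text: str, model: str = "gpt-4o") -> str:
    """Return 'low', 'some concerns', or 'high' for a trial's methods section."""
    response = client.chat.completions.create(
        model=model,          # placeholder model name
        temperature=0,        # reduce run-to-run variation
        messages=[
            {
                "role": "system",
                "content": (
                    "You assess risk of bias in randomized controlled trials. "
                    "Reply with exactly one of: low, some concerns, high."
                ),
            },
            {"role": "user", "content": f"Trial methods section:\n\n{methods_text}"},
        ],
    )
    return response.choices[0].message.content.strip().lower()
```

A usable prompt would need far more than this (definitions of the bias domains, decision rules for each judgment, and a fixed response format), which is the kind of refinement the protocol's 25-trial prompt-engineering phase is designed to explore.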

We therefore wrote a detailed protocol describing our design, analysis plan, and how we would interpret and report results.

Our Approach

The generative AI field moves rapidly, but academic publishing does not. We posted our protocol as a preprint, committing to publish whatever we found, and submitted it to BMC Medical Research Methodology for peer review, where it was recently published.

We planned to randomly select 100 Cochrane systematic reviews and, from each, one randomized trial that had been assessed for RoB by at least two human reviewers. Using 25 of these review-trial pairs, we would engineer the best ChatGPT prompt we could, judging candidate prompts by how well the LLM's RoB assessments agreed with the human reviewers'. We would then apply the best-performing prompt to the remaining 75 trials, again comparing ChatGPT's assessments with the human assessments, to obtain our final estimate of agreement.
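
As specified in the protocol, agreement is summarized with Cohen's κ. A minimal sketch of the headline quantities (observed agreement, chance-expected agreement, and κ), computed here on made-up labels rather than study data:

```python
# Minimal sketch of the agreement statistics; labels and data are illustrative.
from collections import Counter


def agreement_stats(human: list[str], llm: list[str]) -> tuple[float, float, float]:
    """Observed agreement, chance-expected agreement, and Cohen's kappa."""
    n = len(human)
    observed = sum(h == m for h, m in zip(human, llm)) / n
    human_freq, llm_freq = Counter(human), Counter(llm)
    labels = set(human) | set(llm)
    expected = sum((human_freq[l] / n) * (llm_freq[l] / n) for l in labels)
    kappa = (observed - expected) / (1 - expected)
    return observed, expected, kappa


# Toy example with 10 trials (not real data)
human = ["low", "high", "low", "some concerns", "high",
         "low", "low", "high", "some concerns", "low"]
llm = ["low", "high", "some concerns", "some concerns", "high",
       "low", "high", "high", "low", "low"]
po, pe, kappa = agreement_stats(human, llm)
print(f"observed={po:.2f}, expected={pe:.2f}, kappa={kappa:.2f}")
# observed=0.70, expected=0.36, kappa=0.53
```

The protocol also calls for an exact 95% confidence interval around observed agreement and a p-value; those sit on top of this basic calculation.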

The protocol development attracted coauthors from across Norway and researchers with affiliations in Sweden, the UK, Canada, and Colombia.

The Prespecification Dilemma

We weren't alone in spotting this research opportunity. While we carefully followed our protocol, other teams began publishing studies on human-LLM agreement in RoB assessment. Unlike our work, however, these studies were generally not prespecified. Some reported excellent agreement; others the opposite. This heterogeneity may reflect genuine differences in study objectives, but some of it likely results from a lack of prespecification—for example, iteratively changing methods after seeing the data. Systematic reviewers familiar with the Cochrane RoB2 tool will recognize this as a "risk of bias in selection of the reported result". Prespecification in an openly available protocol is considered best practice, and its widespread absence means that much of the research in this area should be considered exploratory.

Publishing a protocol in a high-interest field is risky because it exposes plans to potential competitors. But it can also encourage collaboration, strengthen research, and reduce duplicated effort to minimize research waste.

A Call for Better Incentives

Those relying on evidence syntheses should be confident that synthesis methods are fit for purpose and that the evidence underpinning the adoption of those methods is sound. Unfortunately, incentives to publish first discourage prespecification and peer review. 

More journals should adopt the Registered Report model: fast-track peer reviews of protocols and guarantee rapid publication of results papers, whatever the findings, provided authors report and justify any deviations from the published plan. Funders of methodological research could explicitly favor grant proposals that plan to prespecify and publish protocols from applicants with track records of prespecification, and could release funding in tranches, contingent on publishing protocols for subsequent research. Universities and research institutes could weigh prespecification in hiring and promotion.

The stakes are high. If LLMs can reliably perform RoB assessment or other evidence synthesis tasks, they could free human reviewers for higher-value work, accelerate review production, and—most importantly—help get safe, effective, affordable treatments into routine use sooner. But only rigorous evaluations can determine the degree to which LLMs help and harm, and what is traded off when we adopt them.

Our results are now in, and we look forward to sharing them soon. Our work on this study has reinforced our conviction that in fast-moving fields like AI applications in research, the academic community needs better mechanisms to balance speed with rigor—and that prespecification and appropriate research methods are a solid defense against being misled by our own enthusiasm for promising new tools.
