From COVID Crisis to ChatGPT: How Urgent Evidence Needs Led to Rigorous AI Evaluation

Public health institutes aim to deliver reliable, timely evidence to stakeholders—policymakers, practitioners, and the public. Evidence synthesis sits at the heart of this mission. During COVID-19, this challenge became acute.

Explore the Research

Using a large language model (ChatGPT) to assess risk of bias in randomized controlled trials of medical interventions: protocol for a pilot study of interrater agreement with human reviewers - BMC Medical Research Methodology

Background: Risk of bias (RoB) assessment is an essential part of systematic reviews that requires reading and understanding each eligible trial and RoB tools. RoB assessment is subject to human error and is time-consuming. Machine learning-based tools have been developed to automate RoB assessment using simple models trained on limited corpora. ChatGPT is a conversational agent based on a large language model (LLM) that was trained on an internet-scale corpus and has demonstrated human-like abilities in multiple areas including healthcare. LLMs might be able to support systematic reviewing tasks such as assessing RoB. We aim to assess interrater agreement in overall (rather than domain-level) RoB assessment between human reviewers and ChatGPT in randomized controlled trials of medical interventions.

Methods: We will randomly select 100 individually- or cluster-randomized, parallel, two-arm trials of medical interventions from recent Cochrane systematic reviews that have been assessed using the RoB1 or RoB2 family of tools. We will exclude reviews and trials that were performed under emergency conditions (e.g., COVID-19), as well as public health and welfare interventions. We will use 25 of the trials and human RoB assessments to engineer a ChatGPT prompt for assessing overall RoB, based on trial methods text. We will obtain ChatGPT assessments of RoB for the remaining 75 trials and human assessments. We will then estimate interrater agreement using Cohen’s κ.

Results: The primary outcome for this study is overall human-ChatGPT interrater agreement. We will report observed agreement with an exact 95% confidence interval, expected agreement under random assessment, Cohen’s κ, and a p-value testing the null hypothesis of no difference in agreement. Several other analyses are also planned.

Conclusions: This study is likely to provide the first evidence on interrater agreement between human RoB assessments and those provided by LLMs and will inform subsequent research in this area.

At the height of the COVID-19 pandemic, systematic reviewers at the Norwegian Institute of Public Health (NIPH) were overwhelmed by health authorities’ need for updated evidence to inform decisions about infection prevention and control policies. Traditional review methods couldn't keep pace. For example, health authorities needed to know within a week whether they should close schools, not 12 to 24 months later, the traditional timeframe for important evidence syntheses. That pushed us to try something new.

We began experimenting with machine learning (ML) tools to support evidence synthesis, hoping to work faster without compromising quality. Results were promising enough that we established a dedicated ML team—not only to continue emergency-era innovation, but to embed ML into our routine evidence synthesis work.

This work has continued post-pandemic through several team iterations. While each had slightly different aims, all combined three core activities: horizon scanning to spot promising tools and technologies early; evaluation work to assess automated tools' benefits and harms; and implementation work to help reviewers adopt such tools effectively.

Through horizon scanning, we were quick to identify the potential of generative artificial intelligence (AI) and large language models (LLMs). However, evidence synthesis is highly consequential—NIPH's commissioned work directly influences Norwegian healthcare and welfare systems, and the systematic reviews it publishes through Cochrane and other journals can inform decision making globally. Adopting new tools without understanding how well they work would be unwise. We therefore designed a rigorous comparison of human and LLM performance on a core evidence synthesis task.

Why Risk of Bias Assessment?

When synthesizing evidence, systematic reviewers don't just summarize results—we assess how trustworthy they are. Risk of bias (RoB) assessment examines whether trial design, conduct, or reporting could systematically distort findings. This means asking questions like: Was randomization robust? Were participants and investigators blinded? Were outcome data complete and fully reported? A trial at high risk of bias is likely to over- or underestimate benefits and harms, so RoB assessment helps reviewers and stakeholders use evidence appropriately. Unfortunately, RoB assessment is time-consuming, and best practice requires that at least two highly trained, experienced researchers assess each paper and come to a consensus on a trial's risk of bias.

RoB assessment was an ideal test case for comparing human and LLM performance: it's clearly defined, labor-intensive, and could potentially be completely automated—if LLMs can perform well enough. 

We therefore wrote a detailed protocol describing our design, analysis plan, and how we would interpret and report results.

Our Approach

The generative AI field moves rapidly, but academic publishing does not. We posted our protocol as a preprint, committing to publish whatever we found, and submitted it to BMC Medical Research Methodology for peer review, where it was recently published.

We planned to randomly select 100 Cochrane systematic reviews and, from each, one randomized trial that had been assessed for RoB by at least two human reviewers. We would develop the best ChatGPT prompt we could, using 25 pairs of reviews and trials to assess how well the LLM’s RoB assessments agreed with those of the human reviewers. We would then apply that best-performing prompt to the remaining 75 trials, again comparing ChatGPT's assessments with human assessments, to obtain our final estimate of agreement.
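The protocol's primary analysis is human-ChatGPT interrater agreement on overall RoB, summarized as observed agreement with an exact 95% confidence interval, chance-expected agreement, and Cohen's κ. As a rough illustration of what that entails, here is a minimal sketch in Python; it is not the authors' analysis code, the paired labels are hypothetical, and the Clopper-Pearson interval is just one common choice of exact binomial interval.

```python
# A minimal, illustrative sketch (not the authors' analysis code) of the
# planned agreement statistics, using hypothetical paired overall RoB labels.
from collections import Counter
from scipy.stats import beta

# Hypothetical data: one overall RoB judgement per trial from each rater.
human   = ["low", "high", "high", "low", "unclear", "high"]
chatgpt = ["low", "high", "low",  "low", "unclear", "high"]

n = len(human)
labels = sorted(set(human) | set(chatgpt))

# Observed agreement: proportion of trials with identical judgements.
agree = sum(h == c for h, c in zip(human, chatgpt))
p_obs = agree / n

# Exact (Clopper-Pearson) 95% confidence interval for observed agreement.
ci_low = beta.ppf(0.025, agree, n - agree + 1) if agree > 0 else 0.0
ci_high = beta.ppf(0.975, agree + 1, n - agree) if agree < n else 1.0

# Agreement expected by chance, from each rater's marginal label frequencies.
h_marg, c_marg = Counter(human), Counter(chatgpt)
p_exp = sum((h_marg[l] / n) * (c_marg[l] / n) for l in labels)

# Cohen's kappa: agreement corrected for chance.
kappa = (p_obs - p_exp) / (1 - p_exp)

print(f"Observed agreement: {p_obs:.2f} (exact 95% CI {ci_low:.2f} to {ci_high:.2f})")
print(f"Expected agreement: {p_exp:.2f}; Cohen's kappa: {kappa:.2f}")
```

Cohen's κ corrects observed agreement for the agreement two raters would be expected to reach by chance given their marginal label frequencies, which is why the protocol reports both observed and expected agreement alongside κ.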

The protocol development attracted coauthors from across Norway and researchers with affiliations in Sweden, the UK, Canada, and Colombia.

The Prespecification Dilemma

We weren't alone in spotting this research opportunity. While we carefully followed our protocol, other teams began publishing studies on human-LLM agreement in RoB assessment. Unlike our work, however, these studies were generally not prespecified. Some reported excellent agreement; others the opposite. This heterogeneity may reflect genuine differences in study objectives, but some of it likely results from a lack of prespecification—for example, iteratively changing methods after seeing the data. Systematic reviewers familiar with the Cochrane RoB2 tool will recognize this as "risk of bias in selection of the reported result". Prespecification in an openly available protocol is considered best practice, and its widespread absence means that much research in this area should be treated as exploratory.

Publishing a protocol in a high-interest field is risky because it exposes plans to potential competitors. But it can also encourage collaboration, strengthen research, and reduce duplicated effort to minimize research waste.

A Call for Better Incentives

Those relying on evidence syntheses should be confident that synthesis methods are fit for purpose and that the evidence underpinning the adoption of those methods is sound. Unfortunately, incentives to publish first discourage prespecification and peer review. 

More journals should adopt the Registered Report model: fast-track peer review of protocols and guarantee rapid publication of results papers, whatever the findings, provided authors report and justify any deviations from the published plan. Funders of methodological research could explicitly favor grant proposals that commit to prespecifying and publishing protocols, and applicants with track records of prespecification, and could release funding in tranches contingent on protocols being published for subsequent research. Universities and research institutes could weigh prespecification in hiring and promotion decisions.

The stakes are high. If LLMs can reliably perform RoB assessment or other evidence synthesis tasks, they could free human reviewers for higher-value work, accelerate review production, and—most importantly—help get safe, effective, affordable treatments into routine use sooner. But only rigorous evaluations can determine the degree to which LLMs help and harm, and what is traded off when we adopt them.

Our results are now in, and we look forward to sharing them soon. Our work on this study has reinforced our conviction that in fast-moving fields like AI applications in research, the academic community needs better mechanisms to balance speed with rigor—and that prespecification and appropriate research methods constitute a solid defense against the temptation of being misled by our own enthusiasm for promising new tools.
