From COVID Crisis to ChatGPT: How Urgent Evidence Needs Led to Rigorous AI Evaluation
At the height of the COVID-19 pandemic, systematic reviewers at the Norwegian Institute of Public Health (NIPH) were overwhelmed by health authorities' requests for updated evidence to inform infection prevention and control policies. Traditional review methods couldn't keep pace: health authorities needed to know within a week whether to close schools, not 12 to 24 months later, the traditional timeframe for major evidence syntheses. That pushed us to try something new.
We began experimenting with machine learning (ML) tools to support evidence synthesis, hoping to work faster without compromising quality. Results were promising enough that we established a dedicated ML team—not only to continue emergency-era innovation, but to embed ML into our routine evidence synthesis work.
This work has continued post-pandemic through several team iterations. While each had slightly different aims, all combined three core activities: horizon scanning to spot promising tools and technologies early; evaluation work to assess automated tools' benefits and harms; and implementation work to help reviewers adopt such tools effectively.
Through horizon scanning, we were quick to identify the potential of generative artificial intelligence (AI) and large language models (LLMs). However, evidence synthesis is highly consequential: NIPH's commissioned work directly influences Norwegian healthcare and welfare systems, and the systematic reviews it publishes through Cochrane and other journals can inform decision making globally. Adopting new tools without understanding how well they work would be unwise. We therefore designed a rigorous comparison of human and LLM performance on a core evidence synthesis task.
Why Risk of Bias Assessment?
When synthesizing evidence, systematic reviewers don't just summarize results; we assess how trustworthy they are. Risk of bias (RoB) assessment examines whether a trial's design, conduct, or reporting could systematically distort its findings. This means asking questions like: Was randomization robust? Were participants and investigators blinded? Were outcome data complete and fully reported? A trial at risk of bias is likely to over- or underestimate benefits and harms, so a RoB assessment helps reviewers and stakeholders use evidence appropriately. Unfortunately, RoB assessment is time-consuming, and best practice requires that at least two highly trained, experienced researchers assess each trial and reach consensus on its risk of bias.
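To make the task concrete: the Cochrane RoB 2 tool, for example, structures the assessment into five domains, each judged "low", "some concerns", or "high", and combines them into an overall rating. The Python sketch below is illustrative only; it captures the deterministic core of RoB 2's overall-judgment guidance, not the full tool, which also allows assessors to raise several "some concerns" judgments to an overall "high".

```python
# Illustrative sketch of the RoB 2 domain structure and the deterministic
# core of its overall-judgment rules. Not the full tool: RoB 2 also permits
# rating a trial "high" overall when multiple domains have "some concerns".
DOMAINS = [
    "randomization process",
    "deviations from intended interventions",
    "missing outcome data",
    "measurement of the outcome",
    "selection of the reported result",
]

def overall_rob(judgments: dict[str, str]) -> str:
    """Combine per-domain judgments ('low', 'some concerns', 'high')."""
    levels = set(judgments.values())
    if "high" in levels:
        return "high"
    if "some concerns" in levels:
        return "some concerns"
    return "low"

example = {d: "low" for d in DOMAINS}
example["missing outcome data"] = "some concerns"
print(overall_rob(example))  # -> "some concerns"
```

Each of those domain judgments, plus the overall rating, is what a pair of human assessors, or an LLM, must produce for every trial.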
RoB assessment was an ideal test case for comparing human and LLM performance: it is clearly defined, labor-intensive, and could be fully automated if LLMs perform well enough.
We therefore wrote a detailed protocol describing our design, analysis plan, and how we would interpret and report results.
Our Approach
The generative AI field moves rapidly, but academic publishing does not. We posted our protocol as a preprint, committing to publish whatever we found, and submitted it to BMC Medical Research Methodology for peer review, where it was recently published.
We planned to randomly select 100 Cochrane systematic reviews and, from each, one randomized trial that had been assessed for RoB by at least two human reviewers. Using 25 of these review-trial pairs, we would develop the best ChatGPT prompt we could, judging candidate prompts by how well the LLM's RoB assessments agreed with the human reviewers'. We would then apply the best-performing prompt to the remaining 75 trials, again comparing ChatGPT's assessments with the human assessments, to obtain our final estimate of agreement.
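To make the design concrete, here is a minimal Python sketch of the 25/75 split and one common way to quantify agreement. The three-level labels and Cohen's kappa are illustrative assumptions on this page; the published protocol specifies the actual outcome measures, and the assessments below are simulated.

```python
# Minimal sketch: development/test split plus an agreement estimate on the
# held-out trials. Labels, metric, and data are illustrative assumptions;
# see the published protocol for the prespecified analysis plan.
import random
from sklearn.metrics import cohen_kappa_score

LABELS = ["low", "some concerns", "high"]

random.seed(0)
pairs = list(range(100))            # 100 review/trial pairs
random.shuffle(pairs)
dev, test = pairs[:25], pairs[25:]  # 25 for prompt development, 75 held out

# Stand-ins for the human consensus and LLM judgments on the held-out trials:
human = [random.choice(LABELS) for _ in test]
model = [h if random.random() < 0.8 else random.choice(LABELS) for h in human]

raw = sum(h == m for h, m in zip(human, model)) / len(test)
kappa = cohen_kappa_score(human, model, labels=LABELS, weights="linear")
print(f"Raw agreement: {raw:.0%}; linear-weighted kappa: {kappa:.2f}")
```

Holding the 75 trials out of prompt development guards against overfitting: the final agreement estimate comes only from trials the prompt was never tuned on.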
Protocol development attracted coauthors from across Norway, as well as researchers with affiliations in Sweden, the UK, Canada, and Colombia.
The Prespecification Dilemma
We weren't alone in spotting this research opportunity. While we carefully followed our protocol, other teams began publishing studies of human-LLM agreement in RoB assessment. Unlike our work, however, these studies were generally not prespecified. Some reported excellent agreement; others the opposite. This heterogeneity may reflect genuine differences in study objectives, but some of it likely stems from the lack of prespecification, for example through iteratively changing methods after seeing the data. Systematic reviewers familiar with the Cochrane RoB 2 tool will recognize this as a "risk of bias in selection of the reported result". Prespecification in an openly available protocol is considered best practice, and its widespread absence means that much research in this area should be treated as exploratory.
Publishing a protocol in a high-interest field is risky because it exposes plans to potential competitors. But it can also encourage collaboration, strengthen the research, and reduce duplicated effort, minimizing research waste.
A Call for Better Incentives
Those relying on evidence syntheses should be confident that synthesis methods are fit for purpose and that the evidence underpinning the adoption of those methods is sound. Unfortunately, incentives to publish first discourage prespecification and peer review.
More journals should adopt the Registered Report model: fast-track peer review of protocols and guarantee rapid publication of the results papers, whatever the findings, provided authors report and justify any deviations from the published plan. Funders of methodological research could explicitly favor grant proposals that commit to prespecifying and publishing protocols, favor applicants with track records of prespecification, and release funding in tranches contingent on publishing protocols for subsequent work. Universities and research institutes could weigh prespecification in hiring and promotion decisions.
The stakes are high. If LLMs can reliably perform RoB assessment or other evidence synthesis tasks, they could free human reviewers for higher-value work, accelerate review production, and—most importantly—help get safe, effective, affordable treatments into routine use sooner. But only rigorous evaluations can determine the degree to which LLMs help and harm, and what is traded off when we adopt them.
Our results are now in, and we look forward to sharing them soon. This study has reinforced our conviction that in fast-moving fields like AI applications in research, the academic community needs better mechanisms for balancing speed with rigor, and that prespecification and appropriate research methods are a solid defense against being misled by our own enthusiasm for promising new tools.