How can we measure what health research actually is?
Debates on research funding often rely on broad categories such as “basic” or “applied” science. But these distinctions are rarely measured in a systematic and comparable way.
In our recent study, we developed a methodological framework to address this challenge. By combining large-scale text analysis with supervised machine learning, we analyzed more than 26,000 funded projects and their associated scientific publications across European and U.S. funding systems.
A conceptual framework for classifying research
At the core of our approach is a classification system grounded in two dimensions:
(1) the unit of analysis of research—from molecular and cellular mechanisms to population and health systems—and
(2) the orientation of research, from basic to applied.
This framework allows us to distinguish five levels of health research, ranging from basic biomedical science to health policy and management. Importantly, this is not just a keyword-based classification, but a conceptually grounded system aligned with how health research is understood in practice.
From expert knowledge to machine learning
To scale this classification to tens of thousands of projects, we used a supervised machine learning approach.
We first constructed a training set based on expert annotation. These manually classified examples were then used to train a Naïve Bayes classifier, implemented in KH Coder.
The model was iteratively refined and validated, achieving around 82% agreement with expert classifications for projects and up to 95% accuracy for publications.
This approach ensures both scalability and interpretability—two key requirements for policy-relevant analysis.
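The study trained its Naïve Bayes classifier in KH Coder; as a minimal sketch of the same idea, the pipeline below uses scikit-learn with a handful of invented, illustrative training examples (the labels and texts are not from the actual annotation set):

```python
# Sketch of supervised text classification with Naive Bayes.
# KH Coder was used in the study; scikit-learn shown here for illustration.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Invented expert-annotated examples (label = research level, illustrative only)
texts = [
    "protein folding in cell cultures",
    "gene expression in mouse models",
    "randomized clinical trial of a new drug",
    "phase III trial patient outcomes",
    "hospital management and health policy evaluation",
    "cost effectiveness of national screening programmes",
]
labels = ["basic", "basic", "clinical", "clinical", "policy", "policy"]

# Bag-of-words features feeding a multinomial Naive Bayes model
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(texts, labels)

# Classify an unseen project description
print(model.predict(["trial of drug efficacy in patients"]))  # ['clinical']
```

In the study itself, the model was trained on expert-annotated projects and then iteratively refined against the expert classifications rather than fit once on a toy sample.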
Linking projects to publications
A central innovation of the study is the integration of funding data with scientific outputs.
We linked funded projects from CORDIS and NIH RePORTER to their resulting publications.
This required addressing important differences between systems. Whereas European projects could often be linked directly to publications, NIH data required a time-window approach due to the cumulative nature of funding and publication processes.
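The time-window idea can be sketched as follows. All field names, the five-year window, and the author-matching rule below are invented for illustration; the study's actual linking criteria are more involved:

```python
# Hypothetical sketch: attribute a publication to an NIH project if it
# appeared within a fixed window after the project's start date.
from datetime import date

WINDOW_YEARS = 5  # illustrative choice, not the study's actual parameter

projects = [
    {"id": "R01-0001", "start": date(2010, 1, 1), "pi": "Smith J"},
]
publications = [
    {"pmid": "123", "author": "Smith J", "published": date(2012, 6, 1)},
    {"pmid": "456", "author": "Smith J", "published": date(2019, 3, 1)},
]

def link(projects, publications, window_years=WINDOW_YEARS):
    """Pair each project with publications inside its time window."""
    links = []
    for proj in projects:
        start = proj["start"]
        end = date(start.year + window_years, start.month, start.day)
        for pub in publications:
            if pub["author"] == proj["pi"] and start <= pub["published"] < end:
                links.append((proj["id"], pub["pmid"]))
    return links

print(link(projects, publications))  # [('R01-0001', '123')]
```

The second publication falls outside the window and is not attributed, which is the trade-off of this approach: it captures the cumulative output of a grant at the cost of an arbitrary cutoff.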
By combining both datasets, we were able to compare not only what funding agencies aim to support, but also what research is actually produced.
A multi-layered analytical strategy
Our methodology combines three complementary components (figure):
- Keyword-based content analysis
- Supervised classification
- Comparative analysis across funding mechanisms and time periods
The convergence of these approaches increases robustness and allows us to detect consistent patterns across different types of data.
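The first component, keyword-based content analysis, amounts to counting dictionary hits per category. A minimal sketch, with invented keyword dictionaries that stand in for the study's actual coding scheme:

```python
# Hypothetical keyword dictionaries (illustrative, not the study's scheme)
KEYWORDS = {
    "basic biomedical": {"molecular", "protein", "gene", "cell"},
    "clinical": {"trial", "patient", "diagnosis", "treatment"},
    "health systems": {"policy", "hospital", "management", "screening"},
}

def keyword_profile(text):
    """Count keyword hits per category for one project description."""
    tokens = set(text.lower().split())
    return {cat: len(tokens & kws) for cat, kws in KEYWORDS.items()}

profile = keyword_profile("A randomized trial of patient treatment outcomes")
print(profile)  # {'basic biomedical': 0, 'clinical': 3, 'health systems': 0}
```

Profiles of this kind give an interpretable first pass that can be checked against the supervised classifier's output, which is what makes the convergence of the two approaches informative.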
Why this matters
This methodological framework moves beyond descriptive analyses of funding trends. It provides a way to empirically assess how policy priorities are translated into research activity and outputs.
More broadly, it opens new possibilities for studying how research systems evolve—and how funding shapes the direction of science.
Figure. Conceptual and analytical workflow for the classification of health research.
The figure illustrates the integration of funding data and scientific publications through a common text-mining and supervised classification framework. Projects and publications are classified into five levels of research, enabling comparison between funding priorities and research outputs.
David Fajardo-Ortiz, Bart Thijs, Wolfgang Glänzel, Karin R. Sipido; Evolution of public funding for collaborative health research towards higher-level patient-oriented research. Quantitative Science Studies 2026; doi: https://doi.org/10.1162/QSS.a.472