Nearly everyone in the world has encountered the term ‘correlation’. And everyone who has gone through at least a basic statistics course will be familiar with terms such as the Spearman correlation coefficient. Charles Edward Spearman was a giant in statistics and early cognitive psychology. In addition to formulating the famous rank correlation coefficient, he pioneered factor analysis, a crucial statistical method that describes the variability of observed, correlated variables in terms of their shared contribution to an unobserved latent variable. His seminal work also includes attempts to model human intelligence and introduces the infamous general intelligence factor. It was thus sort of an honour to learn, a week before the planned submission of the first version of this paper, that we had been scooped by Spearman and Brown by about 110 years.
As often happens in science, our initial research question had nothing to do with reliability. We initially sought to use factor analysis and other statistical methods to analyse a vast battery of behavioural data, which yielded weaker than expected results. Being aware of the famous GIGO principle (garbage in – garbage out), we started to inspect our input data more closely. We began to ask ourselves whether we could trust the measures in our battery – that is, whether they were reliable – and how reliability affects the correlations between them. Once we ran our split-half reliability analysis and received rather discouraging results (see Fig. 6 in our paper, and refs. [1,2]), we asked ourselves a different question: how much data do we need to make the measures reliable? We had already collected some data, but how much more did we need to collect to reach a certain level of reliability? Looking at all the reliability curves, I started noticing a pattern: they all seemed to follow the same mathematical function. After consulting several colleagues about these findings, we were fairly certain that such a strikingly simple relationship had to have been described before. However, no one could point us to an actual citation for the phenomenon we were observing. At that point, we approached our colleagues in the Physics Department, asking if they could mathematically derive the formula we had found. It was a piece of cake for them, and after running extensive simulations, we were ready to submit our manuscript reporting what seemed to us a ground-breaking discovery. It was then that we scoured the quantitative psychology literature once again, this time with the exact formula, in its various forms, in hand. Fortunately, we came across the so-called ‘Spearman-Brown prophecy’, described independently by Charles Spearman [3] and William Brown [4]: a formula used in psychometrics to predict the reliability of a test if its length is altered.
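For readers meeting it for the first time, the prophecy has a standard textbook form (the notation below is ours, not necessarily that of the original papers): if a test with reliability ρ₁ is lengthened by a factor n, for example by adding parallel trials, its predicted reliability ρₙ is

```latex
\rho_n = \frac{n\,\rho_1}{1 + (n - 1)\,\rho_1}
```

Solving this for n gives the lengthening factor needed to reach a target reliability – exactly the ‘how much more data?’ question we had been asking.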
This discovery gave us pause. The first shock of having completely missed it during our initial search was replaced by an even bigger one when we found an entire field formed around this topic, buried so deep that even current statisticians barely know of it. Once we knew the keywords, it started snowballing: wonderful textbooks [5,6] written in the '50s and '60s entirely on this topic, with details we hadn’t even thought of. It was a humbling experience to read through these ‘ancient’ tale-like articles and textbooks. It was also a time when the central message of our article seemingly crumbled; we needed to recombobulate and reassess the novelty of our contribution. However painful, this process gave us a new perspective on the problem of reliability and allowed us to fully appreciate its breadth.
In our Communications Psychology article, we introduce a new coefficient (C) that can be estimated from simple population statistics of a given task. This C coefficient can be used to predict the necessary number of trials, and it allows tasks to be compared directly in terms of their reliability convergence – and hence their suitability for individual differences studies. We then validate the approach on a large dataset containing over a dozen behavioural tasks spanning several cognitive domains. These data provide a springboard for using the C coefficient in individual differences studies to select optimal cognitive tasks and their lengths.
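As a toy illustration of the kind of question the C coefficient answers (the coefficient itself is defined in the article and not reproduced here), one can invert the Spearman-Brown formula above to ask how much a test must be lengthened to reach a target reliability. This is a minimal sketch under that standard assumption; the function names are ours, not the paper’s or the web tool’s:

```python
# A minimal sketch, assuming only the standard Spearman-Brown relation.
# Function names are illustrative and not taken from the paper or the web tool.

def spearman_brown(rho_1: float, n: float) -> float:
    """Predicted reliability when a test of reliability rho_1 is lengthened n-fold."""
    return n * rho_1 / (1 + (n - 1) * rho_1)

def length_factor_for_target(rho_1: float, rho_target: float) -> float:
    """Solve Spearman-Brown for n: the lengthening factor needed to raise
    reliability from rho_1 to rho_target."""
    return rho_target * (1 - rho_1) / (rho_1 * (1 - rho_target))

if __name__ == "__main__":
    rho_1 = 0.5        # e.g. observed split-half reliability of the current task
    rho_target = 0.9   # desired reliability
    n = length_factor_for_target(rho_1, rho_target)
    print(f"Lengthen the test by a factor of {n:.1f}")       # -> 9.0
    print(f"Check: rho_n = {spearman_brown(rho_1, n):.2f}")  # -> 0.90
```

For instance, a task with a split-half reliability of 0.5 needs roughly nine times as many trials to reach 0.9 – one reason why comparing tasks by their reliability convergence matters before committing to an individual differences study.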
The ping-pong peer review process highlighted an essential omission in our original draft, which led to an important addition to the article. Until then, we had been concerned mainly with split-half reliability and its convergence within a single session; the predictive formulas in our article allow researchers to predict the number of trials needed to achieve a certain reliability in one session. During the review process, we added test-retest reliability convergence and the effect of time. This question concerns how stable a given test is over time: how many sessions, rather than trials, are needed to reliably assess a given cognitive trait, and how the time between sessions affects reliability (Fig. 7).
In the end, we aim to promote the concept of reliability to a broad audience of neuroscientists, psychologists, psychophysicists, statisticians, and other researchers. Our goal was to highlight the importance of reliability, make its calculation accessible, and encourage researchers investigating individual differences to think more deeply and proactively about the reliability of their cognitive task measures. This is especially critical in neuroscience, where brain-behaviour relationships are often investigated with little regard for how reliably either the neural or the behavioural measures are estimated. Even though our article is neither a comprehensive review nor a formal tutorial, we tried to walk the line between being exact and being approachable to anyone. We hope you find the article interesting and valuable, and we wish you a pleasant read. Please also check out our freely available web-based tool (https://jankawis.github.io/reliability-web-app/) for estimating how many trials are needed to achieve a given level of reliability.
References
1. Hedge, C., Powell, G. & Sumner, P. The reliability paradox: Why robust cognitive tasks do not produce reliable individual differences. Behav. Res. Methods 50, 1166–1186 (2018).
2. Rey-Mermet, A., Gade, M. & Oberauer, K. Should we stop thinking about inhibition? Searching for individual and age differences in inhibition ability. J. Exp. Psychol. Learn. Mem. Cogn. 44, 501–526 (2018).
3. Spearman, C. Correlation calculated from faulty data. Br. J. Psychol. 1904–1920 3, 271–295 (1910).
4. Brown, W. Some experimental results in the correlation of mental abilities. Br. J. Psychol. 1904–1920 3, 296–322 (1910).
5. Gulliksen, H. Theory of Mental Tests (Routledge, New York, 1987; first published by Wiley & Sons in 1950). doi:10.4324/9780203052150.
6. Lord, F. M. & Novick, M. R. Statistical Theories of Mental Test Scores (IAP, 2008; first published by Addison-Wesley in 1968).