The problems with evidence in educational practice

What is “effective” may not be desirable or appropriate in the classroom.
Published in Neuroscience

This article is the fourth of a weekly five-part series about how evidence can inform classroom practice. Begin with part one, in which Charlotte argues that an emphasis on evidence-based practice would lead to prescribed practice.


I have deliberately avoided defining evidence in my posts so far, but now it becomes necessary to do so. I appreciate the simplicity of Jose Picardo’s definition: “a sign or indication that something has been shown to work” (emphasis mine). Immediately, however, questions arise: What counts as a sign? Who decides that it counts? How is it perceived? Who perceives it? Works for what?

To the homeopath, homeopathy clearly works. To the teacher who assigns the labels of visual, auditory, or kinaesthetic learner to her students, learning styles pedagogies clearly work. To the proponent of uniforms, or direct instruction, or inquiry, or Montessori, their policies and practices clearly work.

In response to my first post, Gary Jones suggested four sources of evidence: research, school data, practitioner expertise, and stakeholder values. This narrows the definition somewhat, though it introduces new terms that need defining and querying (what is “practitioner expertise”? Is it developed through experience? Through training? Training in what? Through education? Education about what? All of these things?). Gary appears to have a nuanced view of evidence worth pondering, and I suggest you read some of his posts.

Usually, when people talk about evidence, they mean data. Broadly, there are two forms of data (with some messy, in-between forms): quantitative, which is numerical, usually continuous data; and qualitative, which is categorical, descriptive, or narrative data.

Researchers engaged in research that requires qualitative data often collect a great deal of “rich data that reveals the perspectives and understandings of the research participants” (Gay, Mills & Airasian, 2009). Analysis of qualitative data is often very difficult, time-consuming, and challenging, and necessitates a sensibility shaped by recognition of bias, framing, and positionality.

Teachers undertake qualitative data analysis when they sit down to evaluate students’ textual work (writing, posters, artworks, graphs, etc) against any criteria and standards that have been defined, using their professional expertise and experience to interpret, evaluate, and justify their judgment of each student’s evidence of learning.

Analysis of quantitative data usually aims to determine how well the data “fit” an assumed normal population (with most people being “average” and a few outliers at either end); we describe this analysis as statistical.

In experimental research that collects quantitative data, if an intervention caused a change, and if the data collected were valid (they measured the intended outcome) and reliable (the measurement is not affected by time or other factors), which is a pretty big if in education, analysis of the data can also tell us the size of that change for the sample of the population tested (the effect size), and how likely a change at least that large would be if only sampling effects, rather than the intervention, were at work (the p-value, or significance). Analysis can also tell us how much variance there is across the sample of the population (how spread out the data were).
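To make those three quantities concrete, here is a minimal Python sketch. Everything in it is an assumption for illustration: the scores are invented, the function names are mine, Cohen's d is only one of several effect-size measures, and a permutation test is only one of several ways to arrive at a p-value. It is not the analysis used in any study mentioned in this series.

```python
import random
import statistics


def cohens_d(control, treated):
    """Effect size: difference in means divided by the pooled standard deviation."""
    n1, n2 = len(control), len(treated)
    v1, v2 = statistics.variance(control), statistics.variance(treated)
    pooled_sd = (((n1 - 1) * v1 + (n2 - 1) * v2) / (n1 + n2 - 2)) ** 0.5
    return (statistics.mean(treated) - statistics.mean(control)) / pooled_sd


def permutation_p_value(control, treated, n_shuffles=10_000, seed=0):
    """Two-sided p-value: how often a random relabelling of the students
    produces a gap in means at least as large as the one observed."""
    rng = random.Random(seed)
    n_control, n_treated = len(control), len(treated)
    observed = abs(sum(treated) / n_treated - sum(control) / n_control)
    pooled = list(control) + list(treated)
    hits = 0
    for _ in range(n_shuffles):
        rng.shuffle(pooled)
        fake_control = pooled[:n_control]
        fake_treated = pooled[n_control:]
        gap = abs(sum(fake_treated) / n_treated - sum(fake_control) / n_control)
        if gap >= observed:
            hits += 1
    # Add-one correction so the estimate is never exactly zero.
    return (hits + 1) / (n_shuffles + 1)


# Hypothetical test scores (made up for this example, not real data).
control = [52, 48, 55, 60, 47, 51, 58, 49]
treated = [57, 54, 61, 66, 50, 59, 63, 55]

if __name__ == "__main__":
    print("effect size (Cohen's d):", round(cohens_d(control, treated), 2))
    print("p-value (permutation):  ", round(permutation_p_value(control, treated), 3))
    print("variance, control:      ", round(statistics.variance(control), 2))
    print("variance, treated:      ", round(statistics.variance(treated), 2))
```

Note what the numbers do not say: none of them tells us whether the test was a valid measure of what we care about, or whether any particular student in the treated group benefited.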

Teachers collect quantitative data when they sit down to check students’ responses to questions on tests, where responses are either correct or incorrect. Scores can indicate how much is known by a student about the topic or focus of the test, or how consistently students apply particular heuristics or habits to achieve particular outcomes. The scores on tests can be compared with those of other students, other classes, or even other schools and jurisdictions, as is the case in standardised testing.

Teachers regularly collect and evaluate both quantitative and qualitative evidence of student performance with in-class and school-based assessment; this is part of a teacher’s professional role. This evidence is a form of feedback about student performance. A teacher uses this evidence to make decisions about practice, and to make judgments of student performance. Teachers also sit in cohorts and compare their judgments of student work in a process known as moderation, which serves to ensure consistency between judgments, and give feedback to teachers to improve their future judgments.

Teacher judgments might be considered more valid and reliable than standardised measures, as teachers generally evaluate student learning on multiple assessment pieces over time, and in different ways, and with a deep and purposeful awareness of context, allowing them to triangulate the evidence they have collected to justify their judgments. Standards provided by curriculum bodies help teachers to do this according to values held by society – or at least, those formally responsible for making decisions about education policy – about what is expected of our students.

My criticism is of those who insist on using quantitative evidence derived by experimental research, including RCTs, to determine “what works,” to the exclusion of other forms of evidence, and insist on prescribing this practice to teachers. Given that the evidence-based practice movement appears not to trust teacher judgment, teacher grading might be viewed as problematic by evidence-based practice advocates, as it necessitates teacher judgment, and judgment cannot be value free.

What is “effective” may not be desirable or appropriate

Education practice is framed by purpose. Without purpose, education practice is without direction. Questions about effectiveness must be secondary to questions of purpose, which is a value judgment.

Evidence of effectiveness valued by “evidence-based practice” advocates is often from experimental research that collects and statistically analyses quantitative data using standardised, quantitative measures. This criterion narrows the focus of education not to what is desirable or appropriate or even necessarily valued, but to what can be measured or represented and analysed quantitatively, even if doing so might be considered an invalid representation: the number of correct answers on a test as a measure of “numeracy”, or the number of points of improvement from before to after an intervention as an indication of change in ability.

Unfortunately, many learning goals (as described in Part 2) are not easily quantified, measured, or compared. The emphasis on evidence shifts the focus from practices that might help teachers achieve those aims they judge to be desirable or appropriate (as discussed in the second post), to achieving those aims that can be measured and compared, and these may not be desirable or appropriate. Indeed, they may have secondary outcomes, side effects, which make them quite undesirable or inappropriate.

The ongoing controversies around NAPLAN exemplify this tension between what can be measured (a limited proxy of “literacy” and “numeracy”) and the secondary outcomes, and raise questions about the purpose of the program to begin with. The secondary outcomes include reports of students feeling anxious, and becoming ill. Obviously, student anxiety is not a desirable or appropriate outcome of any educational practice. There are also many stories of teachers “teaching to the test” (sometimes at the direction of administrators), and in taking the time and space to do so, reducing opportunities for students to learn in areas not measured by NAPLAN, such as science, social studies, technologies, and the arts. One proposed solution to this undesirable and inappropriate narrowing of curriculum is a NAPLAN test of science. Schools have also been accused of gaming the system by asking low-performing students to sit out, and of educational triaging, where students are given intensive training and attention to bring them up to pass the test at the expense of attention to other students. These activities aim to improve schools’ positions in league tables developed by the media using simplified representations of data sourced from the MySchool website. Finally, the linking of funding to school results in NAPLAN is also arguably an undesirable and inappropriate outcome of NAPLAN. NAPLAN oversamples the population, while undersampling the curriculum, and is a questionable measure of literacy and numeracy (ignoring, for the moment, the social, behavioural, affective and other cognitive outcomes we aim to achieve in education). All of these activities and outcomes potentially invalidate what is not a particularly valid or reliable measure of learning outcomes in the first place.

On the positive side, NAPLAN does generate a lot of data. What do we do with it?

The question of what is desirable and appropriate is as important as, if not more important than, the question of what is evidenced as effective (which defaults to what is quantitatively measurable).

Educational practice is highly contextualised

Claims that classroom practice must be “evidence-based” sometimes cite evidence that has been collected by researchers in other fields, such as psychology or linguistics. Each study, in every field, should be evaluated on its own merits, not on the basis of the field to which it belongs. Such research evidence is useful, but must be considered in light of the context in which it is collected. Research from these fields is often conducted in contexts that are distinctly dissimilar or isolated from those common in education.

Education research identifies possibilities. What evidence from research can tell us is approximately the probability that a practice will effect change, and possibly in what direction that change will be (towards a specified goal, or away from it, usually). Unfortunately, generalising from research in a specific educational context to a different educational context is risky. This doesn’t mean that education research is not worthwhile; it means that the evidence needs to be carefully and thoughtfully interpreted and applied. When applying that evidence to decision-making in a different context, the probability changes.

Evidence from RCTs describes patterns in populations, but may not be relevant to particular individuals

There are suggestions that education research should test hypotheses using best-practice experimental methodologies such as randomised controlled trials (RCTs). RCTs are commonly held to be the “gold standard” for collecting evidence in science, though there are criticisms of this position. According to some commentators, the lack of RCTs makes education research substandard. Dr Ben Goldacre, a physician based in the United Kingdom, is one of many who have been pushing for this form of research in education.

An RCT involves randomly allocating participants to different treatments (including a control group). Where RCTs have been conducted in education contexts, Project Follow Through and the Sheffield phonics study for example, the validity of the results is called into question by contextual factors that cannot be addressed, or by the confounding actions of the teachers themselves, who reflexively act and adapt their practice to assist students to achieve learning goals. RCTs collect useful data that can be used to compare two treatments, but with questions about whether or not causal mechanisms can be identified with any certainty in education, the value of this research is questionable (see Part 3 for a discussion of causality in education).

Even assuming that causality can be determined, there are two issues with reasoning inductively from the evidence generated by an RCT (or any other methodology). Firstly, it is problematic to assume that what has been shown to work by statistical analysis of data collected by experimental research will apply to other students to the same degree. Teacher judgment, made with deep knowledge of individual students, class dynamics, and the learning environment, is needed. Secondly, due to the nature of social science research, we cannot reason inductively with any degree of certainty that an interaction that has been shown to be effective in the past will be effective in the future.

I am concerned that the prescription of practice based on what’s been demonstrated to be “effective” by research assumes what’s best for all students, to the same degree, because those practices have been “validated”. However, those “validations” are based on a one-size-fits-all approach, and must be carefully applied to specific students and contexts.

Evidence can come from various sources, including academic and school-based research. The question of what is desirable and appropriate is a value judgment that must be made in consideration of the purpose(s) of practices, the participating students and the learning environment. Teachers already collect evidence and make judgments about student learning, and use that evidence to make decisions about practice. Basing practice on what is effective, or focusing on attempts to measure quantitatively what is better judged qualitatively, can have undesirable or inappropriate consequences. NAPLAN is an example of this. We need to be cautious about applying evidence collected from one context to another, and from what has worked for a large group, to what will work for a specific student or small set of students.


A version of this article has appeared on Charlotte's blog.

Comment from Phil Cowley (over 7 years ago):
A nice summary of the issues. To illustrate one of the points you made: in the UK there is a Phonics Test taken by Year 1 pupils. When it was first discussed by the profession, it was in the context of teachers having access to a diagnostic test which would inform their practice, and check whether certain grapheme-phoneme correspondences had been learned (to the point where the pupils were able to read 'regular' nonsense words). The test was 'low stakes' and facilitating, giving teachers an idea of national norms, and a direction to go in their future teaching for particular pupils who performed poorly. Once installed, it then became a 'high stakes' test, used to compare teacher with teacher, and school with school. Teachers naturally responded by teaching to the test, with nonsense words being a regular feature of the reading curriculum, thereby removing from the test its diagnostic and facilitating aspects. I understand that Australia is thinking about adopting this test. Caveat emptor!