Why ‘What Works’ Doesn’t: False Positives in Education Research

If edtech is to help improve education research it will need to kick a bad habit—focusing on whether or not an educational intervention ‘works’.

Answering that question through null hypothesis significance testing (NHST), which explores whether an intervention or product has an effect on the average outcome, undermines the ability to make sustained progress in helping students learn. It provides little useful information and fails miserably as a method for accumulating knowledge about learning and teaching. For the sake of efficiency and learning gains, edtech companies need to understand the limits of this practice and adopt a more progressive research agenda that yields actionable data on which to build useful products.

How does NHST look in action? A typical research question in education might be whether average test scores differ for students who use a new math game and those who don’t. Applying NHST, a researcher would assess whether a positive—i.e. non-zero—difference in scores is significant enough to conclude that the game has had an impact, or, in other words, that it ‘works’. Left unanswered is why and for whom.

This approach pervades education research. It is reflected in the U.S. government-supported initiative to aggregate and evaluate educational research, aptly named the What Works Clearinghouse, and frequently serves as a litmus test for publication worthiness in education journals. Yet it has been subjected to scathing criticism almost since its inception, criticism that centers on two issues.

False Positives And Other Pitfalls

First, obtaining statistical evidence of an effect is shockingly easy in experimental research. One of the emerging realizations from the current crisis in psychology is that rather than serving as a responsible gatekeeper ensuring the trustworthiness of published findings, reliance on statistical significance has had the opposite effect of creating a literature filled with false positives, overestimated effect sizes, and grossly underpowered research designs.

Assuming a proposed intervention involves students doing virtually anything more cognitively challenging than passively listening to lecturing-as-usual (the typical straw man control in education research), then a researcher is very likely to find a positive difference as long as the sample size is large enough. Showing that an educational intervention has a positive effect is quite a feeble hurdle to overcome. It isn’t at all shocking, therefore, that in education almost everything seems to work.

But even if these methodological concerns with NHST were addressed, there is a second serious flaw undermining the NHST framework upon which most experimental educational research rests.

Null hypothesis significance testing is an epistemic dead end. It obviates the need for researchers to put forward testable models of theories to predict and explain the effects that interventions have. In fact, the only hypothesis evaluated within the framework of NHST is a caricature, a hypothesis the researcher doesn’t believe—which is that an intervention has zero effect. A researcher’s own hypothesis is never directly tested. And yet with almost universal aplomb, education researchers falsely conclude that a rejection of the null hypothesis counts as strong evidence in favor of their preferred theory.

As a result, NHST encourages and preserves hypotheses so vague, so lacking in predictive power and theoretical content, as to be nearly useless. As researchers in psychology are realizing, even well-regarded theories, ostensibly supported by hundreds of randomized controlled experiments, can start to evaporate under scrutiny because reliance on null hypothesis significance testing means a theory is never really tested at all. As long as educational research continues to rely on testing the null hypothesis of no difference as a universal foil for establishing whether an intervention or product ‘works,’ it will fail to improve our understanding of how to help students learn.

As analysts Michael Horn and Julia Freeland have noted, this dominant paradigm of educational research is woefully incomplete and must change if we are going make progress in our understanding of how to help students learn:

“An effective research agenda moves beyond merely identifying correlations of what works on average to articulate and test theories about how and why certain educational interventions work in different circumstances for different students.”

Yet for academic researchers concerned primarily with producing publishable evidence of interventions that ‘work,’ the vapid nature of NHST has not been recognized as a serious issue. And because the NHST approach to educational research is relatively straightforward and safe to conduct (researchers have an excellent chance of getting the answer they want), a quick perusal of the efficacy pages at leading edtech companies shows that it holds as the dominant paradigm in edtech.

Are there, however, reasons to think edtech companies might be incentivized to abandon the current NHST paradigm? We think there are.

What About The Data You’re Not Capturing?

Consider a product owner at an edtech company. Although evidence that an educational product has a positive effect is great for producing compelling marketing brochures, it provides little information regarding why a product works, how well it works in different circumstances, or really any guidance for how to make it more effective.

Are some product features useful and others not? Are some features actually detrimental to learners but masked by more effective elements?
Is the product more or less effective for different types of learners or levels of prior expertise?
What elements should be added, left alone or removed in future versions of the product?

Testing whether a product works doesn’t provide answers to these questions. In fact, despite all the time, money, and resources spent conducting experimental research, a company actually learns very little about their product’s efficacy when evaluated using NHST. There is minimal ability to build on research of this sort. So product research becomes a game of efficacy roulette, with the company just hoping that findings show a positive effect each time it spins the NHST wheel. Companies truly committed to innovation and improving the effectiveness of their products should find this a very bitter pill to swallow.

A Blueprint For Change

We suggest edtech companies can vastly improve both their own product research as well as our understanding of how to help students learn by modifying their approach to research in several ways.

Recognize the limited information NHST can provide. As the primary statistical framework for moving our understanding of learning and teaching forward, it is misapplied because it ultimatelytells us nothing that we actually want to know. Furthermore, it contributes to the proliferation of spurious findings in education by encouraging questionable research practices and the reporting of overestimated intervention effects.
Instead of relying on NHST, edtech researchers should focus on putting forward theoretically informed predictions and then designing experiments to test them against meaningful alternatives. Rather than rejecting the uninteresting hypothesis of “no-difference,” the primary goal of edtech research should be to improve our understanding of the impact that interventions have, and the best way to do this is to compare models that compete to describe observations that arise from experimentation.
Rather than dichotomous judgments about whether an intervention works on average, greater evaluative emphasis should be devoted to exploring the impact of interventions across subsets of students and conditions. No intervention works equally well for every student and it’s the creative and imaginative work of trying to understand why and where an intervention fails or succeeds that is most valuable.

Returning to our original example, rather than relying on NHST to evaluate a math game, a company will learn more by trying to improve its estimates and measurements of important variables, looking beneath group mean differences to explore why the game worked better or worse for sub-groups of students, and directly testing competing theoretical mechanisms proposed to explain the game’s influence on learner achievement. It is in this way that practical, problem-solving tools will develop and evolve to improve the lives of all learners.