
Best Evidence and the What Works Clearinghouse

September 14, 2016

In the 1990s, governments around the world began to promote evidence-based practice, the idea that social policies should be based on strong scientific research. Many social scientists, myself included, applauded this movement. We hope that our research findings can help make the world a fairer and better place.

Determining Best Evidence in Education: The What Works Clearinghouse

The United States moved quickly to promote evidence-based practice in education. The No Child Left Behind Act of 2001 called for using evidence-based curricula in schools. In 2002 the National Research Council encouraged the movement by issuing a lengthy report on scientific educational research. In the same year the U.S. Department of Education established the What Works Clearinghouse (WWC) to develop summaries of the “scientific evidence for what works in education.”


This article is based on a paper, “The Threshold and Inclusive Approaches to Determining ‘Best Available Evidence’: An Empirical Analysis,” by Jean Stockard and Tim Wood, which appears in the American Journal of Evaluation.

However, in public policy, as with so many parts of life, the devil is in the details, and the WWC got off to a rocky start. In 2003 the Department of Education published an initial statement of WWC procedures. The feedback from researchers and practitioners indicated a high degree of concern. Comments came from both individual researchers and well-respected professional organizations including the American Evaluation Association, the American Educational Research Association, and the National Education Association. Over 90 percent of these comments, including those from the major professional organizations, expressed serious reservations.

Almost all the comments focused on the way in which the proposed procedures differed from the way scientists usually reach conclusions. Generations of social scientists, including those who wrote the National Research Council report, have embraced the idea of a cumulative science. Just as the FDA requires multiple tests of drugs to ensure that they are safe and effective, social scientists, including those who study education, believe that we need multiple tests to make any conclusions regarding effectiveness. Only when these studies are conducted in different settings, with different populations, and with different types of methodologies, can we be confident in our results. We can’t make judgments from results of just one or two research reports. A body of evidence is necessary.

The WWC approach couldn’t be more different. Over the objections of the vast majority of those who submitted feedback, the WWC adopted a “threshold” system: the clearinghouse limits its summaries of best evidence to studies that meet a defined set of criteria and standards. Over time the threshold has risen, with several criteria added beyond those described in the original call for comments.

Not surprisingly, the researchers who commented on the WWC methods worried that these limitations would mean that only a small slice of the available information reached the public. As a result, reports of “best evidence” could be misleading and even harmful.

But, as we social scientists like to say, this concern is an empirical question. In an article forthcoming in the American Journal of Evaluation, my colleague Tim Wood and I describe two studies we conducted to see if these initial worries were justified. What proportion of the available research is summarized in WWC reports? Do the WWC best evidence reports differ from those using traditional, cumulative methods?


Studying the What Works Clearinghouse

Our first study looked at the results of WWC reviews of over 250 literacy programs. The WWC determined that only 93 of these programs had any studies that could pass its thresholds. While it identified more than 4,000 individual research reports, fewer than 100 passed its thresholds without reservation. Because so few studies passed, almost all the best evidence reports published on the WWC website were based on only a sliver of the available data – most often a single study. In addition, over half of the reports were based on data gathered from fewer than 150 students.


Our second study systematically compared conclusions that would be reached by the WWC with conclusions reached using traditional statistical methods. We focused on 131 studies of a reading curriculum called Reading Mastery. We chose this program because other scholars had found strong and consistent evidence of its effectiveness. A number of the studies had been sponsored by the U.S. Department of Education, which also funds the WWC, and most had been published in peer-reviewed journals. But, to our surprise, none of the 131 studies would pass all of the WWC thresholds.

The WWC would have stopped its analysis at that point, reporting that there were no studies that met its evidence standards. But we continued with the analysis. We first asked whether some of the WWC standards could provide useful insights into estimates of how effective a curriculum was. Might the estimates vary if we only looked at studies that passed one or another of the WWC’s criteria? The answer was no. Estimates were the same no matter which standard was examined.

We also compared the accuracy of summary judgments from the cumulative and threshold approaches. If summary judgments were based only on the studies that came closer to passing the WWC thresholds, might the margin of error be smaller? Again, the results favored the cumulative approach. It produced more precise estimates with a smaller margin of error, and estimates that closely matched results from other summaries of the literature.
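To see why pooling across studies narrows the margin of error, it helps to look at inverse-variance pooling, the workhorse of cumulative meta-analysis. The sketch below uses invented effect sizes and standard errors, not our actual data; it simply illustrates how the confidence interval tightens as studies accumulate.

```python
import math

def pooled_estimate(effects, ses):
    """Fixed-effect (inverse-variance) pooling of study effect sizes."""
    weights = [1 / se**2 for se in ses]
    est = sum(w * e for w, e in zip(weights, effects)) / sum(weights)
    se = math.sqrt(1 / sum(weights))
    return est, se

# Invented numbers for illustration only (not from the Reading Mastery data).
effects = [0.9, 0.7, 0.8, 0.6, 1.0, 0.75]   # per-study effect sizes
ses = [0.30, 0.25, 0.35, 0.30, 0.40, 0.28]  # per-study standard errors

# A single study alone: a wide 95% margin of error.
print(f"one study:  {effects[0]:.2f} +/- {1.96 * ses[0]:.2f}")

# All six studies pooled: the same logic, a much tighter margin of error.
est, se = pooled_estimate(effects, ses)
print(f"six pooled: {est:.2f} +/- {1.96 * se:.2f}")
```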


Based on these results, Wood and I concluded that the initial concerns regarding WWC procedures were probably justified. As the researchers and professional organizations feared, the WWC reports of best evidence reflect only a tiny proportion of the available literature, and summary judgments are often based on the results from relatively few students. In addition, they can be misleading, contrasting sharply with conclusions reached through the widely accepted, traditional methods of summarizing evidence.

Are we wrong?

Might Wood and I be wrong in our conclusions? Might the WWC’s threshold process simply weed out studies that are low quality and thus produce a more accurate summary of best evidence?

It is certainly true that some studies are more carefully conducted than others. Yet the traditional methodological literature stresses that there can be no perfect study. All research projects have limitations. Cumulative analyses, such as those we used in our second study, can take these variations into account and adjust for them.

In addition, it is not at all clear that the WWC standards can accurately identify lower quality research. The 131 studies of Reading Mastery that would be rejected by the WWC include a number that were supported by the U.S. Department of Education. Thus, on the one hand, the federal government is spending millions of dollars to support high-quality educational research; on the other, its summaries of best evidence exclude the results of these very studies.

Still, at a more general level, could the threshold approach used by the WWC be better than the traditional cumulative, broad-based approach to science? Might the traditional cumulative approach be wrong? This too is possible. But such a sea change in the logic that underlies generations of science should be subjected to scholarly debate and judgment. To date, Wood and I have not been able to find any systematic theoretical or empirical defense of the superior nature of the threshold approach.

Implications for Our Schools and Students

The magnitude of these differences is not small. Educational researchers and the WWC use effect sizes to indicate the strength of an intervention, with values of .25 or larger seen as educationally significant. Our statistical analysis estimated the effect size of Reading Mastery on students’ achievement. Rather than excluding studies, we statistically controlled for all of the WWC criteria and standards as well as various aspects of each study, such as the age of the students, the setting in which the study occurred, the assessment that was used, how long the students were exposed to the program, and whether the program was well implemented. The 131 studies included over 84,000 students and almost 1,400 comparisons of those who did and did not have the program.
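For readers unfamiliar with effect sizes, the sketch below shows the textbook computation of a standardized mean difference (Cohen’s d), the kind of metric being discussed here. The numbers are invented for illustration and are not drawn from the Reading Mastery studies.

```python
import math

def cohens_d(mean_t, mean_c, sd_t, sd_c, n_t, n_c):
    """Standardized mean difference (Cohen's d) using a pooled SD."""
    pooled_sd = math.sqrt(((n_t - 1) * sd_t**2 + (n_c - 1) * sd_c**2)
                          / (n_t + n_c - 2))
    return (mean_t - mean_c) / pooled_sd

# Illustrative numbers only (not from the study):
d = cohens_d(mean_t=52.0, mean_c=45.0, sd_t=9.0, sd_c=9.5, n_t=75, n_c=75)
print(f"effect size d = {d:.2f}")  # ~0.76, well above the 0.25 benchmark
```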

Overall, we found an average effect size of .79, more than three times the cut-off point of .25 and similar to the effect found by others who have summarized studies of Reading Mastery. Change of this magnitude could benefit tens of thousands of students around the country, especially those in the greatest need of effective interventions: children in poverty, children with disabilities, and children still learning English. Yet the WWC’s system of reporting best evidence would not reveal this finding. It would simply report that no studies met its standards.

Why the Discrepancies? Some Devilish Details

In trying to understand why results in WWC reports would differ so much from those using traditional methods, we looked closely at the WWC’s procedures. We discovered a variety of elements that can lead to misleading conclusions. Here are three examples. The discussion is admittedly wonky, but the examples show how seemingly small procedural elements can have a big effect on the accuracy of best evidence reviews.

Example 1:

Scientists like to have multiple measures of a phenomenon. For instance, within education, the cumulative tradition urges researchers to measure student achievement in a variety of ways. This helps us be more confident in our results. The WWC also encourages multiple measures. But it requires that comparison groups have very small differences on ALL of these measures before any intervention occurs. Unfortunately, differences larger than the WWC criterion can occur by chance. Moreover, the probability that differences will exceed the required level increases rapidly as a researcher uses more measures. Thus, a researcher who opts to follow scientific norms by incorporating more measures faces an increased chance of having a study rejected by the WWC through the laws of statistical probability alone. The net result is the elimination of even moderately sophisticated research and the potential acceptance of more simplistic studies.
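The arithmetic behind this point is simple: if each baseline measure has some fixed chance of exceeding the allowed difference purely by chance, the chance that at least one does grows quickly with the number of measures. The sketch below uses an illustrative 5 percent per-measure probability and assumes the measures are independent; the WWC’s actual criterion differs.

```python
# Probability that at least one of k independent baseline measures exceeds
# a difference threshold by chance, when each does so with probability p.
# p = 0.05 is an illustrative value, not the WWC's actual criterion.
p = 0.05
for k in (1, 3, 5, 10):
    print(f"{k:2d} measures: {1 - (1 - p) ** k:.0%} chance the study is rejected")
# 1 measure: 5%;  3: 14%;  5: 23%;  10: 40%
```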

Example 2:

Valid studies of the efficacy of medications or medical procedures require that patients take the recommended dosage and that doctors implement the procedures in a standard manner, that is, with high fidelity. The same thing applies in education – dosage and fidelity matter. If a program is effective, greater exposure to the program in the way the author intended should logically result in stronger improvement. If a program is ineffective or harmful, greater dosage and/or less fidelity would result in no change or even decline. Yet, while the WWC has very elaborate rules about some things (such as differences on pretests), it does not consider dosage or whether a program was implemented in the way it was designed. Such details may be given in footnotes or back matter, but the typical user would have no way of knowing that this information is available and, most importantly, it does not affect summary judgments of best evidence.
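A dose-response check of this kind is straightforward to run when dosage is recorded. The sketch below fits a simple linear trend of effect size against weeks of exposure, using invented data; a genuinely effective program should show a positive slope.

```python
import numpy as np

# Invented data: weeks of exposure vs. measured effect size.
weeks = np.array([10, 20, 30, 40, 50, 60])
effects = np.array([0.15, 0.30, 0.42, 0.55, 0.62, 0.78])

# Fit a linear trend; a positive slope is what an effective program predicts.
slope, intercept = np.polyfit(weeks, effects, deg=1)
print(f"effect size rises about {slope:.3f} per week of exposure")
```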

Example 3:

The cumulative tradition emphasizes the importance of developing a large body of literature. If we have more studies of a phenomenon, we can be more confident of our conclusions. Yet a larger set of studies will, by simple statistical chance, include some significant contrary findings. This is what the term “margin of error” means. Unfortunately, if a body of literature has even one contrary finding, the WWC will, at best, only give it a “mixed” rating. Thus, a large body of literature, which would be more valued in the scientific community, is less likely to be given a definitive judgment under the WWC’s threshold process.
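The same kind of back-of-the-envelope arithmetic shows how quickly a “mixed” rating becomes almost inevitable for well-studied programs. Even if every study of a genuinely effective program has 80 percent power (an optimistic, illustrative figure), the chance that a literature contains at least one non-supportive result grows rapidly with its size.

```python
# Chance that a literature on a genuinely effective program contains at
# least one study that misses significance, assuming each independent
# study detects the effect with 80% power (an illustrative value).
power = 0.80
for n in (1, 5, 10, 20):
    print(f"{n:2d} studies: {1 - power ** n:.0%} chance of a contrary finding")
# 1 study: 20%;  5: 67%;  10: 89%;  20: 99%
```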

What Can Be Done – Getting Better “Best Evidence”

It is important to emphasize the extent to which social scientists support the intent of the evidence-based practice movement. Most of us wouldn’t be doing this work if we didn’t think it could benefit others. In addition, the WWC procedures were, no doubt, developed in good faith. Each of the detailed criteria and standards, by itself, might appear reasonable and appropriate. Yet, when taken together, the result is far from what we had hoped: WWC’s best evidence reports present only a very limited and misleading view of the research evidence to the public.

This can be changed. The WWC could abandon its current threshold approach and use the well-established and accepted methods for summarizing research literature. It could widen its net and look at all the available research literature. Such steps would bring the WWC’s procedures much closer to those long accepted within the scientific world. They could also go a long way toward helping ensure that teachers, parents, and educational policy makers really do have the best reports of best evidence. Our students and schools deserve no less.

Executive Summary

In the 1990s governments around the world began to promote evidence-based practice, the idea that social policies should be based on strong scientific research. In 2002 the U.S. Department of Education established the What Works Clearinghouse (WWC) to develop summaries of the scientific evidence for what works in education. Generations of scientists have stressed the cumulative nature of science. Effectiveness of a medical procedure or an educational curriculum can only be determined by converging findings from multiple studies with large numbers of participants.

Against the advice of researchers and professional associations, the WWC rejected traditional views and methods of summarizing research. As a result, its reports provide very limited and misleading views of the research. This in turn means students and our communities lose out. A larger question, perhaps, is whether the WWC approach is an outlier or a widespread phenomenon. In our paper we cite examples from other fields, including medicine, public health, and psychotherapy, that show a similar pernicious pattern: solid research on community and region-wide issues that cannot be studied in the same way as lab research is ignored, and so never makes it into “best evidence” reviews.

In the years we have spent trying to understand WWC procedures, we have relayed our concerns to the Department of Education but have not seen any changes made as a result. There is still an important role for WWC, but only if it abandons its current failing method and adopts the traditional methods widely used in the social sciences.


Jean Stockard, PhD, is professor emerita at the University of Oregon and director of research at the National Institute for Direct Instruction.


1 Comment
Casey Wimsatt

On the one hand, they insist on very exacting standards for inclusion, and on the other hand, they completely and steadfastly ignore a key measure of rigor, treatment fidelity. There is no rational way to reconcile these diametrically opposing approaches. The WWC says treatment fidelity is not feasible to measure, but they have no problem insisting on study criteria that are equally if not more difficult to achieve.