
What’s That? The Replication Crisis is Good for Science?

April 10, 2019


Science is in the midst of a crisis: A surprising fraction of published studies fail to replicate when the procedures are repeated.

For example, take the study, published in 2007, that claimed that tricky math problems requiring careful thought are easier to solve when presented in a fuzzy font. The researchers found in a small study that a fuzzy font improved accuracy, and the result was taken to support the claim that encountering perceptual challenges could induce people to reflect more carefully.

However, 16 attempts to replicate the result failed, definitively demonstrating that the original claim was erroneous. Plotted together on a graph, the studies formed a perfect bell curve centered around zero effect. As is frequently the case with failures to replicate, of the 17 total attempts, the original had both the smallest sample size and the most extreme result.

This article by Eric Loken originally appeared at The Conversation, a Social Science Space partner site, under the title “The replication crisis is good for science.”

The Reproducibility Project, a collaboration of 270 psychologists, has attempted to replicate 100 psychology studies, while a 2018 report examined studies published in the prestigious scholarly journals Nature and Science between 2010 and 2015. These efforts find that about two-thirds of studies do replicate to some degree, but that the strength of the findings is often weaker than originally claimed.

Is this bad for science? It’s certainly uncomfortable for many scientists whose work gets undercut, and the rate of failures may currently be unacceptably high. But, as a psychologist and a statistician, I believe confronting the replication crisis is good for science as a whole.

Practicing good science

First, these replication attempts are examples of good science operating as it should. They are focused applications of the scientific method, careful experimentation and observation in the pursuit of reproducible results.

Many people incorrectly assume that, due to the “p<.05” threshold for statistical significance, only 5 percent of discoveries will prove to be errors. However, 15 years ago, physician John Ioannidis pointed to some fallacies in that assumption, arguing that false discoveries made up the majority of the published literature. Replication efforts are confirming that the false discovery rate is much higher than 5 percent.
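To see why the 5 percent threshold does not cap the share of false discoveries, a small back-of-the-envelope calculation may help. The sketch below is a simplified illustration, not a reproduction of Ioannidis's analysis: it assumes that only some fraction of tested hypotheses are actually true (`prior_true`) and that studies detect a true effect with some probability (`power`). Both names and all input values are hypothetical, chosen only to show how the arithmetic behaves.

```python
def false_discovery_rate(prior_true, power, alpha=0.05):
    """Share of statistically significant results that are false positives.

    prior_true: assumed fraction of tested hypotheses that are really true
    power:      assumed probability that a study detects a true effect
    alpha:      significance threshold (0.05 for the usual "p<.05" rule)
    """
    false_positives = alpha * (1.0 - prior_true)  # null effects that pass the threshold
    true_positives = power * prior_true           # real effects that pass the threshold
    return false_positives / (false_positives + true_positives)


if __name__ == "__main__":
    # Illustrative scenarios only; none of these numbers come from the article.
    scenarios = [(0.5, 0.8), (0.1, 0.8), (0.1, 0.5), (0.1, 0.2)]
    for prior_true, power in scenarios:
        fdr = false_discovery_rate(prior_true, power)
        print(f"prior_true={prior_true:.2f}  power={power:.2f}  ->  FDR={fdr:.1%}")
```

Under these assumptions, if one in ten tested hypotheses is true and power is 50 percent, then 0.045 / (0.045 + 0.05), or roughly 47 percent, of significant findings are false, even though every one of them passed the “p<.05” test.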

Awareness about the replication crisis appears to be promoting better behavior among scientists. Twenty years ago, the cycle for publication was basically complete after a scientist convinced three reviewers and an editor that the work was sound. Yes, the published research would become part of the literature, and therefore open to review – but that was a slow-moving process.

Today, the stakes have been raised for researchers. They know that there’s the possibility that their study might be reviewed by thousands of opinionated commenters on the internet or by a high-profile group like the Reproducibility Project. Some journals now require scientists to make their data and computer code available, which makes it likelier that others will catch errors in their work. What’s more, some scientists can now “preregister” their hypotheses before starting their study – the equivalent of calling your shot before you take it.

Combined with open sharing of materials and data, preregistration improves the transparency and reproducibility of science, hopefully ensuring that a smaller fraction of future studies will fail to replicate.

While there are signs that scientists are indeed reforming their ways, there is still a long way to go. Out of the 1,500 accepted presentations at the annual meeting of the Society of Behavioral Medicine in March, only one in four of the authors reported using these open science techniques in the work they presented.

Improving statistical intuition

Finally, the replication crisis is helping improve scientists’ intuitions about statistical inference.

Researchers now better understand how weak designs with high uncertainty – in combination with choosing to publish only when results are statistically significant – produce exaggerated results. In fact, it is one of the reasons more than 800 scientists recently argued in favor of abandoning statistical significance testing.
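The exaggeration that comes from publishing only significant results can be sketched with a short simulation. This is a toy model under stated assumptions, not any published analysis: each study compares two groups with unit variance, the true standardized effect is fixed at a hypothetical 0.2, each study's estimate is drawn directly from its sampling distribution, and a study is “published” only if it passes a two-sided z-test at the 5 percent level. The function name and every parameter value are illustrative.

```python
import math
import random
import statistics


def simulate_published_effects(true_effect, n_per_group, n_studies, seed=0):
    """Simulate effect estimates and keep only the statistically significant ones.

    Assumes two groups with known unit variance, so the standard error of the
    difference in means is sqrt(2 / n_per_group).
    """
    rng = random.Random(seed)
    standard_error = math.sqrt(2.0 / n_per_group)
    all_estimates = []
    published = []
    for _ in range(n_studies):
        # Draw one study's estimate from its sampling distribution.
        estimate = rng.gauss(true_effect, standard_error)
        all_estimates.append(estimate)
        # Significance filter: two-sided z-test at the 5 percent level.
        if abs(estimate / standard_error) > 1.96:
            published.append(estimate)
    return all_estimates, published


if __name__ == "__main__":
    true_effect = 0.2  # hypothetical true standardized effect
    for n_per_group in (20, 500):
        all_estimates, published = simulate_published_effects(
            true_effect, n_per_group, n_studies=20000
        )
        print(
            f"n per group={n_per_group:4d}  "
            f"mean of all estimates={statistics.mean(all_estimates):.3f}  "
            f"mean of published={statistics.mean(published):.3f}  "
            f"share published={len(published) / len(all_estimates):.1%}"
        )
```

With 20 participants per group, an estimate must exceed roughly 0.62 in absolute value to reach significance, so every published estimate is at least three times the assumed true effect of 0.2. With 500 per group the filter removes far fewer studies and the published average sits close to the truth.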

We also better appreciate how isolated research findings fit into the broader pattern of results. In another study, Ioannidis and oncologist Jonathan Schoenfeld surveyed the epidemiology literature for studies associating 40 common food ingredients with cancer. There were some broad consistent trends – unsurprisingly, bacon, salt and sugar are never found to be protective against cancer.

But plotting the effects from 264 studies produced a confusing pattern. The magnitudes of the reported effects were highly variable. In other words, one study might say that a given ingredient was very bad for you, while another might conclude that the harms were small. In many cases, the studies even disagreed on whether a given ingredient was harmful or beneficial.

Each of the studies had at some point been reported in isolation in a newspaper or a website as the latest finding in health and nutrition. But taken as a whole, the evidence from all the studies was not nearly as definitive as each single study may have appeared.

Schoenfeld and Ioannidis also graphed the 264 published effect sizes. Unlike the fuzzy font replications, their graph of published effects looked like the tails of a bell curve. It was centered at zero with all the nonsignificant findings carved out. The unmistakable impression from seeing all the published nutrition results presented at once is that many of them might be like the fuzzy font result – impressive in isolation, but anomalous under replication.
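The “tails of a bell curve” shape can be reproduced with a toy example. The sketch below does not use Schoenfeld and Ioannidis's data or method. It assumes the extreme case in which every true effect is exactly zero, so that each study's z-score is a draw from a standard normal distribution, and it treats a result as published only when the absolute z-score exceeds 1.96. The helper `bin_counts` and all settings are hypothetical.

```python
import random


def bin_counts(values, low, high, width):
    """Count how many values fall in each bin of the given width on [low, high)."""
    n_bins = int(round((high - low) / width))
    counts = [0] * n_bins
    for value in values:
        if low <= value < high:
            index = min(int((value - low) // width), n_bins - 1)
            counts[index] += 1
    return counts


def text_histogram(values, low=-4.0, high=4.0, width=1.0, scale=100):
    """Return one line per bin, with one '#' per `scale` values in that bin."""
    lines = []
    for i, count in enumerate(bin_counts(values, low, high, width)):
        left = low + i * width
        lines.append(f"[{left:+.0f}, {left + width:+.0f})  {'#' * (count // scale)}  {count}")
    return "\n".join(lines)


if __name__ == "__main__":
    rng = random.Random(1)
    # Assumption: the true effect is zero in every study, so z-scores are standard normal.
    z_scores = [rng.gauss(0.0, 1.0) for _ in range(10000)]
    # Assumption: only results with |z| > 1.96 get published.
    published = [z for z in z_scores if abs(z) > 1.96]

    print("All studies:")
    print(text_histogram(z_scores))
    print("\nPublished studies only:")
    print(text_histogram(published, scale=10))
```

The first histogram is an ordinary bell curve centered at zero. In the second, the bins nearest zero are empty and only the two tails remain, which is the pattern described above.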

The breathtaking possibility that a large fraction of published research findings might just be serendipitous is exactly why people speak of the replication crisis. But it’s not really a scientific crisis, because the awareness is bringing improvements in research practice, new understandings about statistical inference and an appreciation that isolated findings must be interpreted as part of a larger pattern.

Rather than undermining science, I feel that this is reaffirming the best practices of the scientific method.


Eric Loken is an associate professor in the Department of Educational Psychology, affiliated with the Measurement, Evaluation and Assessment program, at the University of Connecticut. His interests focus on latent variable models, Bayesian inference, and methods for reproducible science. He received his Ph.D. from Harvard University and studies advanced statistical modeling with applications to large scale educational testing.
