In one of the largest scientific reproducibility investigations to date, a group of psychology researchers has reported results from an effort to replicate 100 recently published psychology studies. Though they were able to repeat the original experiments in nearly all cases, they reproduced the original results in fewer than half.

The authors, part of the Reproducibility Project: Psychology and led by Brian Nosek, emphasize that a failure to reproduce does not necessarily mean the original report was incorrect. They say, however, that their results do provide evidence of the challenges of reproducing research findings, which include identifying what predicts reproducibility and which practices improve it. Nosek and colleagues were motivated to undertake this collaborative project by continued concerns about reproducibility across scientific disciplines. They designed an open study in which teams of psychologists selected studies for replication from the 2008 issues of three leading psychology journals, then conducted the studies, analyzed and summarized the data, and posted their methods on a public website for further scrutiny. Between November 2011 and December 2014, 270 contributing authors worldwide completed 100 replications of studies featuring many different research designs.

Determining what constituted a successful replication was challenging, say the authors, because a replication can take varying forms; Nosek and colleagues assessed success using five complementary indicators. The best predictor of replication success, they found, was not the characteristics of the teams conducting the research (e.g., experience and expertise) but rather the strength of the initial evidence (e.g., the original P value).

A common weakness among psychology papers is to treat P<0.05 as sufficient evidence of a real effect. In practice, depending on the statistical power of the study and on how many of the hypotheses being tested are actually true, results that clear this threshold could be wrong over 30 percent of the time because of false positives. Nosek and colleagues found that original studies with P values very close to 0.05 were much less likely to reproduce than those with P values closer to zero. The efforts of this group increase confidence in the reliability of the original study results in some cases and, in others, suggest that more investigation is needed to establish the validity of the original findings.
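The false-positive figure cited above can be illustrated with a short calculation. The sketch below is not taken from the study; it assumes a hypothetical field in which 10 percent of tested hypotheses are true and studies have 80 percent power, and the function name and parameters are illustrative only. Under those assumptions, roughly 36 percent of results with P<0.05 would be false positives.

```python
def false_discovery_rate(alpha, power, prior_true):
    """Share of 'significant' results that are false positives.

    alpha      -- significance threshold (e.g. 0.05)
    power      -- chance a real effect is detected (assumed, not from the study)
    prior_true -- fraction of tested hypotheses that are really true (assumed)
    """
    # Null effects that cross the threshold by chance.
    false_positives = alpha * (1.0 - prior_true)
    # Real effects that the study succeeds in detecting.
    true_positives = power * prior_true
    return false_positives / (false_positives + true_positives)


if __name__ == "__main__":
    # Hypothetical inputs: 10% of hypotheses true, 80% power, P<0.05 threshold.
    rate = false_discovery_rate(alpha=0.05, power=0.8, prior_true=0.1)
    print(f"Share of significant results that are false: {rate:.0%}")
```

Raising the assumed share of true hypotheses or tightening the threshold lowers this rate, which is consistent with the finding that original results with P values closer to zero were more likely to reproduce.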

In general, surprising effects were less reproducible, say the authors, as were effects for which it was more challenging to conduct the replication.

The authors invite other investigators to develop alternative indicators to explore the roles of expertise and quality in reproducibility on this open dataset, available at https://osf.io/ezcuj/wiki/home/.