Note: After I wrote this blog post, helpful folks at PsychMAP shared some excellent initiatives addressing the same issue: Curate Science is a crowdsourced collection and annotation of replications, and Pub Peer is a browser plug-in providing other researchers’ comments on the article you’re reading. Check them out!

Science is in trouble. We pay lip service to the notion that scientific knowledge can be trusted because it is reproducible. However, as many researchers suspected, too much of what we think we know doesn’t hold up. The Reproducibility Project tried to reproduce 100 papers randomly chosen from top psychology journals, and found that only about a third of the original results replicated. In a survey by Nature, 70% of researchers said they had tried and failed to reproduce other researchers’ experiments, and more than half had failed to reproduce their own. In a series of Registered Replication Reports, in which a large number of labs independently conduct direct replications of the same study, textbook psychology findings have failed to reproduce.

The causes are painfully obvious: Publishing articles is how researchers build reputations and get (and keep) jobs. At the same time, scientific journals traditionally publish papers only if they contain positive and novel findings. As a result, published work very rarely consists of direct replications of previously published findings, only of tweaks and extensions of them (‘conceptual replications’). Similarly, data from experiments that didn’t work never see the light of day.

There used to be good reasons for this. In the era of print journals, it seemed a waste of precious space to publish studies that didn’t work, or that merely reiterated what we already knew. However, the downsides are serious: When failed experiments aren’t published, research effort is wasted on studies that others have already tried and failed at. Worse, the ‘publish or perish’ culture incentivises researchers to dig deep into their data for anything that passes conventional thresholds for statistical significance. Entire careers may have been built on publications in which ‘significant’ findings, which were the result of post-hoc exploratory analyses and likely to be statistical flukes, are presented as confirming initial hypotheses.

Hence, we find ourselves with a mess of publications in which true findings are scattered among false positives, with no obvious way, at present, to tell the difference. Given the vast number of articles that have been (and are being) published, an unacceptably wide range of claims can be made to seem legitimate by cherry-picking the literature for papers in agreement. In fact, using ‘statistical significance’ as a criterion for publication should be expected to produce such a mess, even in the absence of questionable research practices.

Where do we go from here?

Psychologist Michael Inzlicht from the Toronto Laboratory for Social Neuroscience sighed on his blog that “During my dark moments, I feel like social psychology needs a redo, a fresh start”. Is the way forward to ‘purify’ the scientific record by retracting all publications that contain methodological flaws or erroneous statistical analyses? Should we set up automated algorithms to trawl through papers and publicly call out statistical errors? Should authors state on their articles how much money they are willing to bet that their results will replicate? Could these efforts turn into a witch hunt, with (actual or perceived) shaming of researchers whose published results and established theories turn out to be in error?

I think there is a solution that avoids an (actual or perceived) witch hunt on researchers whose published results and/or methodologies have been shady. We need three things to happen:

1. Articles must be linked by their replication relations - we need a ‘Replication Index’.

That is, if I publish a paper that is an attempted replication of someone else’s study, then the original paper must be linked to my paper. To put it bluntly: The front page of every experimental paper should carry a badge listing how many replication attempts its results have undergone, and how many of those have been successful - an at-a-glance indication of how much confidence to place in a given result, given the current state of the evidence. Let’s call this badge a ‘Replication Index’. The first publication of any experiment must keep a continually updated Replication Index.
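To make the idea concrete, here is a minimal sketch in Python of what such a continually updated record might look like. Everything in it (the class name, the fields, the ‘attempts/successes’ badge format) is my own invented illustration, not an existing standard:

```python
from dataclasses import dataclass


@dataclass
class ReplicationIndex:
    """Hypothetical running tally kept with the first publication of an experiment."""
    attempts: int = 0   # direct replication attempts registered so far
    successes: int = 0  # how many of those attempts were judged successful

    def register(self, successful: bool) -> None:
        """Record a newly published direct replication attempt."""
        self.attempts += 1
        if successful:
            self.successes += 1

    def badge(self) -> str:
        """Render the at-a-glance badge, e.g. '8/6' for 8 attempts, 6 successful."""
        return f"{self.attempts}/{self.successes}"
```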

2. When we cite other people’s work, we must include the Replication Index at the time of writing.

That is, if I cite an article in my paper, let’s call it “Brown 1992”, I won’t write “X has been found to cause Y (Brown 1992)”; rather, I will write “X has been found to cause Y (Brown 1992, 8/6)”, meaning that Brown’s finding has been the subject of 8 direct replication attempts, of which 6 have been successful. The point is to give the reader an indication of the confidence we should have in the evidence used to back up a statement.
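As a toy continuation of the sketch above (again, purely illustrative names and format), such a citation could be generated from the index as it stands at the time of writing:

```python
def cite(author_year: str, index: ReplicationIndex) -> str:
    """Format an in-text citation with the Replication Index at the time of writing."""
    return f"({author_year}, {index.badge()})"


brown_1992 = ReplicationIndex(attempts=8, successes=6)
print(f"X has been found to cause Y {cite('Brown 1992', brown_1992)}")
# prints: X has been found to cause Y (Brown 1992, 8/6)
```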

3. We must publish all studies conducted with sound methodology.

For the Replication Index to be meaningful, we must publish all methodologically sound studies, no matter their outcome. That is, common practice must be to preregister our studies, have the methodology and our planned analyses peer-reviewed (peer review should safeguard against poor methodology, not against ‘boring work’), and then publish the studies regardless of their results.

The devil of this vision is in the detail. Several practical and theoretical obstacles stand in the way: First, when does something count as a replication? When are the methods and population similar enough to count as a replication? When is an effect size large enough, or a p-value small enough, for it to count as a ‘successful’ replication? (This is a substantial problem, but it’s one that we’ve been discussing for a while, and we face the same challenge when we conduct meta-analyses.)
Second, how do we weigh the differential contributions of different studies? A small, underpowered study should not earn as many ‘points’ on a Replication Index as a study with a large number of participants (a rough sketch of one possible weighting follows this list).
Third, what should be the ‘reference version’ of a particular finding? What if a paper contains five different variations of an experiment? Fundamentally, we do not replicate ‘papers’, but rather procedures in the world that reliably lead to certain outcomes. In an ideal world, our databases would not be databases of articles, but databases of experimental procedures.
Fourth, how should articles practically be linked together by their replication relationship, and who should be responsible? Should authors of a direct replication of a previous study notify the editors of the journal the original article was published in? Should a paper’s Replication Index be shown only in its original publication venue?
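On the second obstacle, one possibility would be to weight each attempt by its sample size instead of counting all attempts equally. The sketch below is a thought experiment with made-up numbers, not a worked-out scoring rule:

```python
def weighted_replication_index(attempts: list[tuple[int, bool]]) -> float:
    """Each attempt is (sample_size, successful). Returns the
    sample-size-weighted proportion of successful replications."""
    total_n = sum(n for n, _ in attempts)
    if total_n == 0:
        return 0.0
    return sum(n for n, ok in attempts if ok) / total_n


# A large successful replication counts for more than a small failed one.
attempts = [(40, False), (250, True), (500, True)]
print(f"{weighted_replication_index(attempts):.2f}")  # 0.95
```

How exactly sample size, study design and analysis quality should be traded off against one another is, of course, itself a question for methodologists.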

Curate Science’s crowdsourced collection and annotation of replications is one impressive attempt at creating transparent overviews of replication studies and a best estimate of the confidence to place in a given finding. A continuously evolving ‘Replication Index’, showing the state of the evidence rather than just however many references an author chose to cherry-pick in support of a claim, would be the most important project for us to work towards if we are to fulfil the ambition of science as continuous knowledge accumulation.