It’s impossible to discuss preregistration with statisticians since their statistical indoctrination tells them to conflate two different things:

(A) Policy X will improve the percentage of published claims which are true.
(B) Policy X will improve individual inferences.

That (A) and (B) are different is easy to illustrate. Suppose a theory of inference implies we should use a random number generator to arbitrarily refuse to publish 10% of claims. To see the effect of this policy, imagine a journal reviewing 100 papers, only 20% of which are right. Then about 68% of the possible 10-paper removals will leave the journal with a success rate of at least 20%. In other words, even the most irrelevant policy imaginable can improve the success rate of published research.
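The 68% figure can be checked exactly with the hypergeometric distribution. A minimal sketch in Python, using the numbers from the example above (100 papers, 20 true, 10 removed at random):

```python
from math import comb

# 100 papers under review, 20 of them true; remove 10 at random.
# The remaining 90 papers have a success rate of at least 20%
# exactly when at most 2 true papers were removed (18/90 = 20%).
N, K, n = 100, 20, 10
p = sum(comb(K, k) * comb(N - K, n - k) for k in range(3)) / comb(N, n)
print(round(p, 2))  # 0.68
```

The random removal can't see which papers are true, yet the published success rate goes up more often than not.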

Take another example: suppose researchers game preregistration by sticking to hypotheses that prior information already says are likely. The result would be that preregistered research has a higher rate of successes, even though the quality of individual inferences hasn’t improved at all.

Indeed, if increasing the percentage of correct research is your goal, you could determine whether blonds or brunettes do better research and simply refuse to publish the group with the lower average. Or you could take the safest paper written each year and make it the only paper published in the world. That ought to give nearly 100% reliability in published results.

The percentage of successes goes up, but individual inferences haven’t improved one bit. Another more subtle method for doing this is given below, but first step back and consider a related question.

Why are posthoc explanations less convincing than predictions?

This isn’t always the case, of course; it’s just a heuristic. But it’s such a useful one that it’s worth understanding how it comes about.

Imagine you have a datum D already in hand and wish to advance a favored theory H_0 explaining it. You can cheat Bayes Theorem,

(1)   \begin{equation*} P(H_0|D) = \frac{1}{1+\sum_{i>0}\frac{P(D|H_i)P(H_i)}{P(D|H_0)P(H_0)}} \end{equation*}

by ignoring other plausible (i.e. high P(D|H_i)) explanations H_i, since doing so drives P(H_0|D) close to 1. That’s exactly what people do in real life. If some lunatic believes we’re being visited by ice-cream-eating space aliens, they’ll interpret a discarded ice cream wrapper as strong evidence for their theory by ignoring the other explanations.
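The wrapper example can be made numerical. A sketch with hypothetical numbers (the priors and likelihoods below are invented for illustration, not taken from anywhere):

```python
# Hypothetical numbers: D = a discarded ice cream wrapper,
# H0 = ice-cream-eating space aliens, H1 = a person littered.
priors = {"H0": 1e-6, "H1": 1 - 1e-6}
likelihood = {"H0": 0.9, "H1": 0.9}  # both hypotheses explain D equally well

def posterior_h0(hypotheses):
    """P(H0|D) via equation (1), over whichever hypotheses we deign to consider."""
    z = sum(likelihood[h] * priors[h] for h in hypotheses)
    return likelihood["H0"] * priors["H0"] / z

honest = posterior_h0(["H0", "H1"])  # ~1e-6: aliens stay absurd
cheat = posterior_h0(["H0"])         # 1.0: certainty, by ignoring H1
print(honest, cheat)
```

Dropping H1 from the sum changes nothing about the evidence, yet it turns a one-in-a-million posterior into certainty.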

When making a prediction, however, cheating this way is hard to do successfully. If you ignore plausible hypotheses in your predictive equation,

(2)   \begin{equation*} p(D) = \sum_{i \geq 0} P(D|H_i)P(H_i) \end{equation*}

you’re very likely to get the prediction wrong unless D was something which would have occurred almost no matter what.

Frequentism exacerbates this cheating to an extraordinary degree, because the p-value depends only on P(D|H_0) and makes no use of the alternatives P(D|H_i) (technically the p-value uses P(T(data) \geq T(D)|H_0) for some statistic T, which introduces even more irrelevancies, but that’s not germane to this post).

Thus p-values don’t just allow this kind of cheating, they make it virtually mandatory. Consequently, posthoc data dredging with p-values tends to be worse than other uses of p-values. Which leads finally to the key point: because of this correlation with worse results, preregistered results will tend to have a higher percentage of successes even though the quality of the individual inferences hasn’t improved at all.

To summarize:

Here’s how to destroy science in 5 easy steps:

Step 1: Use a bogus method of inference (p-values) with barely any connection to the truth.

Step 2: Generate as many hypotheses as possible.

Step 3: Apply the defective method to those hypotheses, creating a mass of false claims. Publish any that you find.

Step 4: Avoid doing all the normal checks, such as aggressively considering all alternative hypotheses, double-checking the result, running additional and better experimental tests, and so on.

Step 5: When the failures eventually become too embarrassing, refuse to admit the method of inference is crap (Step 1); instead, claim p-values are being used too often and devise tricks to limit their use (Step 2).

The Frequentist Explanation:

Of course, this is not how Frequentists understand the situation. In a turn so hilarious our descendants will be laughing at us for the next 2000 years, they’ve adopted a radically subjective theory of inference.

According to Frequentists, if two researchers have the same data for the same physical situation, they can arrive at different conclusions depending on how many questions they thought to ask. Moreover even if they ask a single question, the inference can depend on when they thought to ask it.

That’s not the worst of it. There is a widely held belief called the “Garden of Forking Paths” which says even if both researchers ask one question of the data which they thought of at the same time, they can still be led to different results if one of the researchers would have done something different in a different universe, yielding different data.

No word yet on what happens when two collaborators decide to ask different questions at different times. Or if a researcher forgets when they thought of the question, or dies before telling anyone. Don’t even bother asking how we’re supposed to know what we’d do in a different universe (meditation? yoga? tarot cards?). As Jaynes remarked in a similar context:

This would really cause trouble in a high energy physics laboratory, where a dozen researchers may collaborate on carrying out a big experiment. Suppose that by mutual consent they agree to stop at a certain point; but they had a dozen different private reasons for doing so. According to the principles expounded by Armitage, one ought then to analyze the data in a dozen different ways and publish a dozen different conclusions, from what is in fact only one data set from one experiment!

Frequentism, if taken seriously enough, always leads to this kind of stuff.


There are consequences to getting this right. One is that if you do inference right, you don’t need to limit how hard the data is interrogated.

It’s a historical fact that most of our big scientific successes were achieved without preregistration. An embarrassingly large number of them were achieved using pure posthoc data dredging of the kind preregistration is designed to stop. That’s how the structure of DNA was determined, for example. A stream of hypotheses, each an educated guess based on the results of previous ones, was tried out on the data until one worked.

Even on an everyday level we see this. Radio astronomers don’t need to preregister what they’ll do if they get a strange signal. Nor do they put out a press release saying “p<.05 therefore space aliens exist” when one does arrive. Rather, they go through a laborious process of considering all the down-to-earth (literally) explanations for the signal. This can take months or years and involve dozens of competing hypotheses. It may be done using only existing data (entirely posthoc), or, as Bayes Theorem implies, by collecting data which separates the true hypothesis from the others (i.e. makes P(D|H_{true}) high and simultaneously makes all the others low).

A second consequence is that preregistered studies won’t be as successful as Frequentist “guarantees” claim they should be. For example, when a Frequentist makes inferences which should be right most of the time under a “Gaussian process”, and they’ve conducted a test which affirms their process is Gaussian, then they should achieve a very high rate of success.

These are the infamous “guarantees” and the “at least we can verify our assumptions” that Frequentists obnoxiously boast about all the time. It’s about time they were held accountable for them. If the percentage of “successes” for preregistered studies improves from 20% to 40–60%, that is not the same as “rarely in error”. Either those “guarantees” need to show up after a century of trying, or Frequentists need to admit their viewpoint is snake oil mixed with hokum.

Will preregistration make things better or worse?

In the medium term, it’s hard to say. Perhaps the effects of researchers gaming preregistration will cancel out the positive effects of openness and the whole thing will be a wash. Maybe it will have positive consequences or lead to more productive reforms. Three things are clear though:

(1) It won’t by itself improve individual inferences.
(2) It will tend to hinder lots of perfectly valid posthoc reasoning.
(3) A belief that inference depends on irrelevant psychological factors will poison future generations of statisticians.

It’s hard to see how (1-3) will improve statistics in the long run.