A pre-emptive warning to the reader: the article below is too long to publish as a single post, so I have broken it up into four installments. After reading the text below you should continue with part II, part III, and part IV (which includes a summary).
---
During the last few years science popularization magazines, physics blogs, and other outreach agents have been very busy, and in general successful, at explaining to the public the idea that a scientific discovery in physics research requires that an effect be found with a statistical significance exceeding five standard deviations.
Alas, I regret it. Science outreach has succeeded at explaining and making well known a practice which is not terribly smart, and one which in my opinion should be used with quite some caution. It is the purpose of this article to explain, to those of you who are willing to hear, what is wrong with the "five sigma" criterion.
[Note that as far as this blog is concerned, I have tried several times to bring to the attention of my twenty-three readers the real meaning and impact on physics measurements of statistical fluctuations, look-elsewhere effect, and systematic biases, and to teach some critical thinking on the real significance of observed physics effects; so I call myself out of the pack and plead innocent of the above count...]
Whatsa Statistical Significance ?
First of all let us all get on the same page. Statistical significance is a measure of how small the probability of some observed data is under a given "null hypothesis", which in physics is the one describing the currently accepted and established theory. If one observes data that disagree with the null hypothesis, that is data having a small probability of being observed if the null hypothesis is true, one may convert that probability into the corresponding number of "sigma", i.e. standard-deviation units, from a Gaussian mean. A 16% probability is a one-standard-deviation effect; a probability of about 0.13% is a three-standard-deviation effect; and a 0.000027% probability corresponds to five standard deviations - "five sigma" for insiders.
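For readers who like to check such numbers themselves, the conversion between a tail probability and a number of sigma can be sketched in a few lines of Python. This is a minimal sketch that assumes the one-sided Gaussian tail convention; the function names are made up for illustration.

```python
import math
from statistics import NormalDist


def p_value_from_sigma(z):
    """One-sided Gaussian tail probability beyond z standard deviations."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))


def sigma_from_p_value(p):
    """Number of standard deviations whose one-sided tail probability is p."""
    return -NormalDist().inv_cdf(p)


if __name__ == "__main__":
    for z in (1, 3, 5):
        print(f"{z} sigma -> p = {100 * p_value_from_sigma(z):.7f}%")
```

Run as is, it gives roughly 15.9%, 0.13% and 0.00003% for one, three and five sigma, so the percentages quoted in the text should be read as rounded values.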
The above mentioned "five sigma" criterion was publicized especially heavily as the Higgs boson discovery came together: it was clear from the start that a claim of observing the goddamn particle could only be put forth if one had data whose probability under the "no Higgs" hypothesis -the null- was below 0.000027%. That was a pure convention, and an ill-founded one to boot. But it worked well to set a reference point.
Note that as far as believing that the Higgs boson was indeed in CMS and ATLAS data, the practice of waiting for five sigma significance (by each experiment!) was diligently followed only by the press, by educated bystanders (who had been instructed to do so by the media), and by some of the less smart particle theorists (you know who you are). Most experimental physicists with some experience were already convinced that the Higgs boson was there six months before the official announcement, after CMS and ATLAS showed coincident evidence, globally above three standard deviations, in multiple search channels.
Also note that the "five sigma" criterion is not an accepted standard in other branches of science. Much less extreme evidence is required in medicine or biology, or even in astrophysics. Indeed, there is nothing magical in the number five, nor in the even more obscure number 0.000027%.
So what is the reason for the five sigma criterion ? Where does it come from ? Has it been a standard for Nobel prize discoveries in the past ? Is it a reasonable one ?
First of all, in order to understand where the requirement comes from, we need to know some past history of particle physics discoveries. We need to go back 50 years, to an age when new hadrons were discovered on a weekly basis, and when particle physicists were struggling to make sense of an overwhelming amount of experimental evidence for which there was not yet a clear theoretical picture, at least as far as the dynamics of particle interactions is concerned.
I believe it was the large number of new particles discovered in the fifties and sixties that lowered the general level of scepticism that physicists naturally have in their genes, and made many of them willing to accept even modest statistical effects in their data as evidence for something new, when such effects were often just due to fluctuations, to unknown or ill-determined systematic effects, or just to the sheer number of places where one had looked for discrepancies. So along with many true discoveries, several mistaken claims were made: new particles were published, only for other experiments to refute them and prove them non-existent; or worse, new resonances were kept "unconfirmed" for years in the listings of the Particle Data Group, generating headaches for theorists and a flurry of useless papers offering improbable interpretations. That was a problem !
At some point Gerry Lynch (as reported in Rosenfeld's article quoted below) came up with the idea of showing sets of 100 random mass histograms to colleagues -ones drawn from smooth distributions not containing any real signal-, asking them to judge "by eye" whether a new resonance (a bump in an otherwise smooth mass spectrum) was present anywhere. Usually, the average physicist would find several "clear peaks" in those histograms ! The exercise demonstrated not only that a "look-elsewhere effect" was at work (the more places you looked for a discrepancy, the easier it was to see one): it also showed how easy it was for a particle hunter in those days to get enamoured with a two- or three-bin bump. The particle was in the eye of the beholder !
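The flavour of that exercise is easy to reproduce on a computer. The snippet below is only a rough sketch, not Lynch's actual setup: it fills 100 histograms from a perfectly flat spectrum and counts how many contain at least one bin that sticks out by three "local" standard deviations. The number of bins, the number of events, the threshold and the function name are all arbitrary assumptions chosen for illustration.

```python
import random


def count_fake_bumps(n_hist=100, n_bins=40, n_events=1000,
                     threshold=3.0, seed=1):
    """Count signal-free histograms with at least one bin above threshold.

    Every number here is an illustrative assumption, not Lynch's setup.
    """
    rng = random.Random(seed)
    expected = n_events / n_bins          # flat spectrum: same mean in each bin
    cut = expected + threshold * expected ** 0.5
    n_with_bump = 0
    for _ in range(n_hist):
        bins = [0] * n_bins
        for _ in range(n_events):
            bins[rng.randrange(n_bins)] += 1   # throw one event at random
        if max(bins) >= cut:
            n_with_bump += 1
    return n_with_bump


if __name__ == "__main__":
    print(count_fake_bumps(), "of 100 signal-free histograms show a bump")
```

No signal is present anywhere, yet a noticeable fraction of the histograms contains a bin that would look exciting if taken in isolation: the more bins one inspects, the more likely it is that one of them fluctuates high.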
Non-Gaussian Systematics At Work
Another clear source of mistaken discovery claims was, as mentioned above, the presence of large systematic effects, ones which produced non-Gaussian distributed biases in the measurement. Let me explain what I mean. Usually, in the absence of knowledge about the detailed mechanism by means of which a particular systematic source produces an effect on a measurement, the physicist assumes that this has a Gaussian distribution, and estimates the width of the Gaussian as well as he or she can. The Gaussian assumption has some merit, but it is of course just a guess, and it may fail miserably in particular cases, leading the researcher to believe that a large observed effect cannot be the result of that "small" systematic bias. I will give a concrete example below, loosely based on my recollection of Carlo Rubbia's claim of the top quark discovery at the SppS in the eighties (please forgive me for being vague: I am writing this on a ship, with no internet connection!).
Suppose you are searching for a new particle in the decay of the W boson. W decays involving the appearance of that new particle may manifest as events with a charged lepton, missing energy, and two hadronic jets; however that signature can in principle also be produced by strong interactions, when a W boson is produced together with gluon radiation. Based on your state-of-the-art knowledge of strong interactions, you estimate the background from those processes as yielding 3 events in your dataset, and you assign a systematic uncertainty of 20% to that prediction.
Your 20% guess comes from an assessment of the precision of theoretical knowledge of the background processes, which is the input to the Monte Carlo simulation program you use to study them. The assessment may be loosely based on how good similar theoretical predictions have proven to be in the past, but of course you know that this process is different from previously studied ones. Also, being unable to collect ensembles of theorists and compare their predictions, you have no way to determine what the "probability distribution" of the error on your background prediction is. So you assume it is a Gaussian.
Hence your background prediction is "3 ± 0.6" events, or G(3.0,0.6), a Gaussian centered at 3 with a sigma of 0.6. In that case the probability that the background actually yields an average of, say, at least 4.8 events in your dataset, because you underestimated it by "three standard deviations" (3 + 3 × 0.6 = 4.8), is naturally given by the tail of a Gaussian, and is readily computed as about 0.13%.
Upon observing 17 events in the data you get excited and believe you have definitely observed a signal of the new particle: for the probability that a background process, one whose average unknown yield has a Gaussian probability distribution centered at 3 events with a sigma of 0.6, fluctuates to yield 17 or more events is very small ! You can compute it with a few lines of code to be p=0.00000025, i.e. a more than five-sigma effect. But if your knowledge of the strong interactions were much poorer than you believed, and the uncertainty on your prediction were actually 200% and not 20%, the probability of the data would rather be p=0.025, and the effect could be easily ascribed to a statistical fluctuation !
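The "few lines of code" might look something like the sketch below. It averages the Poisson probability of seeing 17 or more events over a Gaussian uncertainty on the background mean, using a plain midpoint rule. Two choices here are assumptions of mine rather than anything stated above: background means below zero are taken to contribute nothing (a background of zero cannot produce events), and the integral is cut off eight sigma above the central value. The function names are hypothetical.

```python
import math


def poisson_tail(n_min, mu, n_terms=200):
    """P(N >= n_min) for a Poisson variable of mean mu, by direct summation."""
    if mu <= 0.0:
        return 1.0 if n_min <= 0 else 0.0
    # First term P(N = n_min), computed in log space to avoid overflow.
    term = math.exp(-mu + n_min * math.log(mu) - math.lgamma(n_min + 1))
    total = 0.0
    for k in range(n_min, n_min + n_terms):
        total += term
        term *= mu / (k + 1)   # P(N = k + 1) from P(N = k)
    return min(total, 1.0)


def smeared_p_value(n_obs, bkg, rel_unc, steps=2000):
    """Average the Poisson tail over a Gaussian uncertainty on the mean.

    Assumption of this sketch: only means in [0, bkg + 8 sigma] are
    integrated, and the Gaussian mass below zero contributes nothing.
    """
    sigma = rel_unc * bkg
    width = (bkg + 8.0 * sigma) / steps
    norm = width / (sigma * math.sqrt(2.0 * math.pi))
    p = 0.0
    for i in range(steps):
        mu = (i + 0.5) * width   # midpoint rule
        weight = norm * math.exp(-0.5 * ((mu - bkg) / sigma) ** 2)
        p += weight * poisson_tail(n_obs, mu)
    return p


if __name__ == "__main__":
    for rel_unc in (0.2, 2.0):
        p = smeared_p_value(17, 3.0, rel_unc)
        print(f"uncertainty {rel_unc:.0%}: p = {p:.2e}")
```

With these choices the two results should land close to the values quoted above; a different treatment of the part of the Gaussian below zero (for instance renormalising the truncated curve) would shift the second one somewhat, without changing the conclusion.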
A similar thing happened to Rubbia, who believed his backgrounds to be small because simulations of strong interactions at the time were insufficiently precise. His 1984 top quark discovery paper, based on a dozen top-like events with allegedly small backgrounds, is a striking example of the damage that systematic uncertainties may produce.
A final note on the effect of this kind of systematic uncertainty concerns the "independent confirmation" picture. Usually, when two or more experiments independently observe the same physical effect, the level of scepticism of a physicist toward the claimed effect lowers dramatically. The rationale is that different experiments are operated by different scientists who analyze their data differently, and the data themselves are subject to different systematic sources (they may come from entirely different apparatuses). So the effect must be real, otherwise one would have to accept that the same fluctuation has occurred in two places.
However, you can see how the independent confirmation may itself be a deceptive input: if the other experiment along the SppS ring, UA2, had performed a similar search for the top quark, it is quite likely that the UA2 experimentalists would have also observed an excess of those "W + jets" events, because they, too, would have relied on the same state-of-the-art knowledge of strong interactions. The same systematic effect would have caused both experiments to err in the very same direction !
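To put a rough number on this, one can compare two ways of computing the probability that two experiments both see 17 or more events. The sketch below is a toy under loudly stated assumptions: the second experiment is purely hypothetical (no real UA2 numbers are used), it is given the same expected background of 3 events and the same observed count, and the true uncertainty on the background is taken to be the 200% of the example above. In one case the two experiments are treated as independent; in the other they share the same unknown background mean. Function names are made up for illustration.

```python
import math


def poisson_tail(n_min, mu, n_terms=200):
    """P(N >= n_min) for a Poisson variable of mean mu, by direct summation."""
    if mu <= 0.0:
        return 1.0 if n_min <= 0 else 0.0
    term = math.exp(-mu + n_min * math.log(mu) - math.lgamma(n_min + 1))
    total = 0.0
    for k in range(n_min, n_min + n_terms):
        total += term
        term *= mu / (k + 1)
    return min(total, 1.0)


def joint_excess_probabilities(n_obs, bkg, rel_unc, steps=2000):
    """Return (p_single, p_both_independent, p_both_shared).

    Assumptions of this toy: two identical experiments, a Gaussian
    uncertainty on the background mean, and no contribution from the
    part of the Gaussian below zero.
    """
    sigma = rel_unc * bkg
    width = (bkg + 8.0 * sigma) / steps
    norm = width / (sigma * math.sqrt(2.0 * math.pi))
    p_single = p_shared = 0.0
    for i in range(steps):
        mu = (i + 0.5) * width   # midpoint rule
        weight = norm * math.exp(-0.5 * ((mu - bkg) / sigma) ** 2)
        tail = poisson_tail(n_obs, mu)
        p_single += weight * tail
        p_shared += weight * tail * tail   # both experiments see the same mu
    return p_single, p_single ** 2, p_shared


if __name__ == "__main__":
    single, independent, shared = joint_excess_probabilities(17, 3.0, 2.0)
    print(f"one experiment:            p = {single:.2e}")
    print(f"two, treated independent:  p = {independent:.2e}")
    print(f"two, shared systematic:    p = {shared:.2e}")
```

The shared-systematic probability can never be smaller than the naive product, and in this toy it comes out considerably larger: a second experiment relying on the same background model adds much less evidence than the "independent confirmation" argument suggests.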