    Demystifying The Five-Sigma Criterion
    By Tommaso Dorigo | August 11th 2013 04:30 PM | 13 comments

    A pre-emptive warning to the reader: the article below is too long to publish as a single post, so I have broken it up into four installments. After reading the text below you should continue with part II, part III, and part IV (which includes a summary).

    ---

    During the last few years science popularization magazines, physics blogs, and other outreach agents have been very busy, and in general successful, at explaining to the public the idea that a scientific discovery in physics research requires that an effect be found with a statistical significance exceeding five standard deviations.

    Alas, I regret it. Science outreach has succeeded at explaining, and making well known, a practice which is not terribly smart, and which in my opinion should be used with quite some caution. It is the purpose of this article to explain, to those of you who are willing to hear it, what is wrong with the "five sigma" criterion.

    [Note that as far as this blog is concerned, I have tried several times to bring to the attention of my twenty-three readers the real meaning and impact on physics measurements of statistical fluctuations, the look-elsewhere effect, and systematic biases, and to teach some critical thinking on the real significance of observed physics effects; so I call myself out of the pack and plead innocent of the above count...]

    Whatsa Statistical Significance ?

    First of all let us all get on the same page. Statistical significance is a measure of how small the probability of some observed data is under a given "null hypothesis", which in physics is the one describing the currently accepted and established theory. If one observes data that disagree with the null hypothesis, that is data having a small probability of being observed if the null hypothesis is true, one may convert that probability into the corresponding number of "sigma", i.e. standard deviation units, from a Gaussian mean. A 16% probability is a one-standard-deviation effect; a 0.13% probability is a three-standard-deviation effect; and a 0.00003% probability corresponds to five standard deviations - "five sigma" for insiders.
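
    (If you want to play with these conversions yourself, translating between a one-sided Gaussian tail probability and a number of sigma is a one-liner in any statistics package. Below is a minimal sketch of my own in Python with scipy, not any official HEP tool.)

        from scipy.stats import norm

        # One-sided Gaussian tail probability ("p-value") for a given number of
        # sigma, and the inverse conversion from a p-value back to sigma.
        def pvalue_from_sigma(n_sigma):
            return norm.sf(n_sigma)      # survival function: P(Z > n_sigma)

        def sigma_from_pvalue(p):
            return norm.isf(p)           # inverse survival function

        for n in (1, 3, 5):
            print(f"{n} sigma -> p = {pvalue_from_sigma(n):.3g}")
        # 1 sigma -> p = 0.159    (about 16%)
        # 3 sigma -> p = 0.00135  (about 0.13%)
        # 5 sigma -> p = 2.87e-07 (about 0.00003%)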

    The above mentioned "five sigma" criterion was publicized especially heavily when the Higgs boson discovery came together: it was clear from the start that a claim of observing the goddamn particle could only be put forth if one had data whose probability under the "no Higgs" hypothesis -the null- was below 0.00003%. That was a pure convention, and an ill-founded one to boot. But it worked well to set a reference point.

    Note that, as far as believing that the Higgs boson was indeed in the CMS and ATLAS data goes, the practice of waiting for a five-sigma significance (from each experiment!) was followed diligently only by the press, by educated bystanders (who had been instructed to do so by the media), and by some of the less smart particle theorists (you know who you are). Most experimental physicists with some experience were already convinced that the Higgs boson was there six months before the official announcement, after CMS and ATLAS showed coincident evidence, globally above three standard deviations, in multiple search channels.

    Also note that the "five sigma" criterion is not an accepted standard in other branches of science. Much less extreme evidence is required in medicine or biology, or even in astrophysics. Indeed, there is nothing magical in the number five, nor in the even more obscure number 0.00003%.

    So what is the reason for the five sigma criterion ? Where does it come from ? Has it been a standard for Nobel prize discoveries in the past ? Is it a reasonable one ?

    Some History

    First of all, in order to understand where the requirement comes from, we need to know some past history of particle physics discoveries. We need to go back 50 years, to an age when new hadrons were discovered on a weekly basis, and when particle physicists were struggling to make sense of an overwhelming amount of experimental evidence for which no clear theoretical picture yet existed, at least as far as the dynamics of particle interactions was concerned.

    I believe it was the large number of new particles discovered in the fifties and sixties that lowered the general level of scepticism that physicists naturally have in their genes, and made many of them willing to accept even modest statistical effects in their data as evidence for something new, when such effects were often just due to fluctuations, to unknown or ill-determined systematic effects, or to the sheer number of places where one had looked for discrepancies. So along with many true discoveries, several mistaken claims were made: new particles were published and then proved non-existent by refutations from other experiments; or worse, new resonances were kept "unconfirmed" for years in the listings of the Particle Data Group, generating headaches for theorists and a flurry of useless papers offering improbable interpretations. That was a problem !

    At some point Gerry Lynch (as reported in Rosenfeld's article quoted below) came along with the idea of showing sets of 100 random mass histograms to colleagues (ones drawn from smooth distributions containing no real signal), asking them to judge "by eye" whether a new resonance (a bump in an otherwise smooth mass spectrum) was present anywhere. Usually, the average physicist would find several "clear peaks" in those histograms ! The exercise demonstrated not only that a "look-elsewhere effect" was at work (the more places you looked for a discrepancy, the easier it was to see one): it also showed how easy it was for a particle hunter in those days to get enamoured with a two- or three-bin bump. The particle was in the eye of the beholder !
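
    (Lynch's exercise is easy to reproduce today. The sketch below is my own illustration, not his original program: it generates signal-free "mass spectra" from a smooth exponential distribution and counts how often at least one bin, somewhere in the histogram, fluctuates upward by more than three local standard deviations. The number of histograms, events, bins, and the exponential shape are arbitrary illustrative choices.)

        import numpy as np

        rng = np.random.default_rng(0)
        # Illustrative choices, not taken from the original exercise.
        n_histograms, n_events, n_bins = 100, 1000, 40

        n_with_fake_peak = 0
        for _ in range(n_histograms):
            # A smooth, signal-free "mass spectrum": a falling exponential.
            masses = rng.exponential(scale=1.0, size=n_events)
            counts, edges = np.histogram(masses, bins=n_bins, range=(0.0, 4.0))
            # Expected counts per bin from the (known) parent distribution.
            expected = n_events * np.diff(-np.exp(-edges))
            # Local significance of each bin's upward fluctuation (Gaussian
            # approximation to the Poisson bin contents).
            local_sigma = (counts - expected) / np.sqrt(expected)
            if local_sigma.max() > 3.0:
                n_with_fake_peak += 1

        print(n_with_fake_peak, "of", n_histograms,
              "signal-free histograms contain a local >3 sigma bump")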

    Non-Gaussian Systematics At Work

    Another clear source of mistaken discovery claims was, as mentioned above, the presence of large systematic effects, ones which produced non-Gaussian distributed biases in the measurement. Let me explain what I mean. Usually, in the absence of knowledge about the detailed mechanism by means of which a particular systematic source produces an effect on a measurement, the physicist assumes that this has a Gaussian distribution, and estimates the width of the Gaussian as well as he or she can. The Gaussian assumption has some merit, but it is of course just a guess, and it may fail miserably in particular cases, leading the researcher to believe that a large observed effect cannot be the result of that "small" systematic bias. I will give a concrete example below, loosely based on my recollection of Carlo Rubbia's claim of the top quark discovery at the SppS in the eighties (please forgive me for being vague: I am writing this on a ship, with no internet connection!).

    Suppose you are searching for a new particle in the decay of the W boson. W decays involving the appearance of that new particle may manifest as events with a charged lepton, missing energy, and two hadronic jets; however, that signature can in principle also be produced by strong interactions, when a W boson is produced together with gluon radiation. Based on your state-of-the-art knowledge of strong interactions, you estimate the background from those processes as yielding 3 events in your dataset, and you assign a systematic uncertainty of 20% to that prediction.

    Your 20% guess comes from an assessment of the precision of the theoretical knowledge of the background processes, which is the input to the Monte Carlo simulation program you use to study them. The assessment may be loosely based on how good similar theoretical predictions have proven to be in the past, but of course you know that this process is different from previously studied ones. Also, being unable to collect an ensemble of theorists and compare their predictions, you have no way to determine the "probability distribution" of the error on your background prediction. So you assume it is a Gaussian.

    Hence your background prediction is "3 +- 0.6" events, or G(3.0, 0.6), a Gaussian centered at 3 with a sigma of 0.6. In that case the probability that the background is actually yielding an average of, say, at least 4.8 events in your dataset, due to your having underestimated it by "three standard deviations" (3 plus three times 0.6 = 3 + 1.8 = 4.8), is naturally related to the tail of a Gaussian, and is readily computed as 0.13%.

    Upon observing 17 events in the data you get excited and believe you have definitely observed a signal of the new particle: for the probability that a background process, one whose average unknown yield has a Gaussian probability distribution centered at 3 events with a sigma of 0.6, fluctuates to yield 17 or more events is very small ! You can compute it with a few lines of code to be p=0.00000025, i.e. a more-than-five-sigma effect. But if your knowledge of the strong interactions were much poorer than you believed, and the uncertainty on your prediction were actually 200% and not 20%, the probability of the data would instead be p=0.025, and the effect could easily be ascribed to a statistical fluctuation !
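
    (Those "few lines of code" amount to convolving the Poisson probability of seeing 17 or more events with the Gaussian describing your belief about the background mean. Here is a minimal sketch of my own; the exact numbers depend slightly on how one treats the unphysical negative-mean tail of the Gaussian, which below is simply discarded.)

        import numpy as np
        from scipy.stats import norm, poisson

        def tail_probability(n_obs, b_mean, b_sigma):
            # P(N >= n_obs) for a Poisson count whose unknown mean is assigned
            # a Gaussian G(b_mean, b_sigma); negative means are simply dropped.
            mu, step = np.linspace(1e-6, b_mean + 12 * b_sigma, 20000,
                                   retstep=True)
            weight = norm.pdf(mu, loc=b_mean, scale=b_sigma)
            tail = poisson.sf(n_obs - 1, mu)     # P(N >= n_obs | mean mu)
            return np.sum(weight * tail) * step

        print(tail_probability(17, 3.0, 0.6))    # roughly 2.5e-7: "five sigma"
        print(tail_probability(17, 3.0, 6.0))    # roughly 0.025: unremarkable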

    A similar thing happened to Rubbia, who believed his backgrounds to be small because the simulations of strong interactions available at the time were insufficiently precise. His 1984 top quark discovery paper, based on a dozen top-like events with allegedly small backgrounds, is a striking example of the damage that systematic uncertainties may produce.

    A final note on the effect of this kind of systematic uncertainty concerns the "independent confirmation" picture. Usually, when two or more experiments independently observe the same physical effect, a physicist's level of scepticism toward the claimed effect drops dramatically. The rationale is that different experiments are operated by different scientists who analyze their data differently, and the data themselves are subject to different systematic sources (they may be entirely different apparatuses). So the effect must be real, otherwise one would have to accept that the same fluctuation has occurred in two places.

    However, you can see how the independent confirmation may itself be a deceiving input: if the other experiment along the SppS ring, UA2, had performed a similar search for the top quark, it is quite likely that the UA2 experimentalists would have also observed an excess of those "W + jets" events, because they, too, would have relied on the same state-of-the-art knowledge of strong interactions. The same systematic effect would have caused both experiments to err in the very same direction !

    ---

    Continue to part II

    Comments

    Mr T - slightly off topic. I took a few friends to RHIC last week - they have public events during the summer. Not everything was RHIC though, there was a lecture by Howard Gordon of Atlas giving a LHC update - I'm told he is somewhat senior in the US Atlas contingent. Anyhow, during the q&a I asked who was the Atlas equivalent to CMS and CDF's Tommaso. To my surprise and dismay, he had no idea who you were and yet here he is doing public outreach!

    All in all, it was a good day. BNL does a series of events every summer and I look forward to poking around NLS II when it opens. I was really shocked how crowded the original NLS is - the tour groups had so little room to move about without fear of messing something up!

    Cheers

    dorigo
    Hi AS, not surprised I am not known - I don't know everybody in CMS, much less in ATLAS. And I do not know the person you mention. I'd say a good counterpart as a blogger in ATLAS (and DZERO) is Gordon Watts, "Life as a physicist". Cheers, T.
    Are you familiar with Stephen T. Ziliak and Deirdre N. McCloskey's The Cult of Statistical Significance: How the Standard Error Costs Us Jobs, Justice, and Lives (2008)? They make much the same argument regarding medical and economic use, misuse, and abuse of "significance".

    And of course, there is now an almost endless supply of gnashing over the reliance of Gaussian models in pricing derivatives, when of course most of the relevant assets were highly correlated.

    The same effect has been seen in finance. It's all very well to calculate the probability of your derivative losing money when the distribution of outcomes is accurately known, but it's possible (and has happened) to put yourself into a situation where your probability of an unfavorable outcome was indeed correctly calculated as low --- given the mean and standard deviation you were working with --- however sadly that mean was very poorly characterized and inaccurate...

    It does seem like the combination of what undergrads are taught as recipe and the constant refrain of "statistical significance" and "sampling error" blinds most people to the meaning and IMPORTANCE of systematic error, even among those (like on Wall Street) who really should know better, and who should automatically be calculating outcome spreads based on not just standard deviation but also the risk inherent in the knowledge (or lack thereof) of the mean.

    dorigo
    Hi William,

    no, I don't know that book. I'll try to find a copy and read it though - sounds interesting.

    Cheers,
    T.
    Hi Tommaso,

    I would be wary of the Ziliak & McCloskey book. From my skimming through reviews of their work, there seems to be a lot of spleen-venting about p-values and significance testing and little else. Active researchers in statistics and applied probability will readily admit there is sloppy use of significance tests in a variety of fields, as you no doubt would agree. But it doesn't rise to the level of "This is all crap! You people have no clue how to do anything!" so common in Ziliak's writing.

    Back in June, Ziliak penned this horrid article in the Financial Post claiming "Statistical significance is junk science, and its big piles of nonsense are spoiling the research of more than particle physicists." Note, his jumping off point was the claim of 5σ detection of the Higgs boson last year. So yeah, probably not worth your time.

    Cheers

    I would agree that Ziliak does not have clue about particle physics. Yet his criticisms of the Higgs, while meaningless, are in the same general ballpark as Tommaso's on Rubbia. Reducing physics to 5σ is perverse.

    dorigo
    Ahem... Sorry to disagree. I am making specific statements about the credibility of a claim and the possible issues with a scientific result. And I am very clearly making a distinction between a robust result and one on shaky ground. Rubbia's top did not stand the test of time; the Higgs is now (after one more year of studies, 2.5 times more data, etc.) as sure as death and taxes.

    Cheers,
    T.
    I don't see the disagreement. You are saying the good work consists of solid foundations behind the significance claims. Ziliak claims that large parts of medical and economic publishable claims consist of significance assertions without any foundations, and without any reproducibility. His foray into criticizing the Higgs was ignorant, not just because he doesn't know any physics, but also because he doesn't understand the physicists. As you point out, physicists have kept on accumulating supportive data. Yet had Ziliak written a similar article, say about the Θ pentaquark in its first year of "discovery", he'd have been hailed as all-wise and all-knowing.

    Rubbia also flopped with the "alternating" neutral current. But when he had a clear-cut target with known properties (W/Z) he found them. Does it even mean anything to talk about the p-value of that first Z they rushed to publish?

    dorigo
    Hi W&E,

    I know too little about Ziliak's point of view to elaborate... Of course targeting the Higgs was a way to raise more interest, though. So he picked something which unfortunately is too well established to deserve that kind of criticism.

    Cheers,
    T.
    Ziliak is, unfortunately, quite justified in his rage. Last week I read through a paper that was clearly produced by someone cranking the SPSS machine without a clue as to what they were doing.
    "Let's throw indicator variables into a linear regression; oh p is below above 5%; must be true.
    No need to draw a graph for example, to see if what I am doing makes the slightest bit of sense"

    But, you say, that was an undergrad thesis. That doesn't happen in the real world. Ahem:
    http://delong.typepad.com/sdj/2013/08/journal-refereeing-practices-and-p...

    Hi Tommaso,

    I am looking forward to part 2 of your demystification. Part 1 is a good reprise of your regular "beware statistical fluctuations" message along with some interesting history. It was entertaining to go through the CMS & Atlas discovery presentations with a bunch of colleagues and realize the significances quoted were local p-values. We had been under the impression that the "look elsewhere effect" had been folded into the final result in some sophisticated but non-obvious manner.

    If you want to read a statistician/philosopher who is honestly trying to grapple with the quoted Higgs significance from an outsider's perspective, my recommendation is Deborah Mayo's posts (part 1 & part 2) from this March.

    Cheers,
    hμν

    dorigo
    Hello,
    I'll give that a look. I'll post a second part of this article (maybe there'll be three) tomorrow.
    Cheers,
    T.