Demystifying The Five-Sigma Criterion - Part IV And Summary
    By Tommaso Dorigo | August 19th 2013
    Note: this is the fourth, and last, part of a four-part article (see part I, part II, part III) on the five-sigma criterion for discovery claims in particle physics. If you haven't read the first three installments, the text below may or may not make much sense to you... Below I first add some information about the dijet resonance of CDF which was the subject of part III, then I deal with one last topic to be considered, and then summarize the discussion.

    The last word on the CDF dijet mass bump at 150 GeV

    To complement the story I told in the third part of this article I must produce an addendum. A reanalysis of the CDF data, including 8.9 inverse femtobarns of collisions, has finally put an end to all speculations concerning the dijet mass bump that had been found at 150 GeV in events containing a W boson.

    As I discussed a few days ago, scientists were already sure that there was no resonance after parallel investigations by DZERO, CMS, and ATLAS had observed no effect and put stringent limits on its existence. However, the question remained: if CDF sees that effect, what is the cause? And if the cause is some mismeasurement or systematic effect, is the same effect affecting other measurements, e.g. in Higgs boson searches?

    The cause, in short, was understood to be the combination of two different effects. First of all, CDF found that while the jet energy corrections had been derived from datasets rich in quark jets, the jets accompanying the W boson had a much larger gluon contribution. Gluon-originated jets behave differently, so the different composition of data and W+jets simulation caused a shift to higher dijet masses in the data.

    A different effect was observed with electrons: some particular kinematic configurations caused the non-W background (three-jet events from strong interactions) to contribute significantly to the electron plus jets dataset, when the electron was in fact a jet with a large electromagnetic component. That background contributed to the bump at 150 GeV more than it did elsewhere in the spectrum.

    Below you can see the "corrected" dijet mass distribution of CDF with 8.9/fb of data. You can see that the bump, after accounting for the jet correction differences and the previously unaccounted-for background, has totally disappeared!

    So we learn something here as well: sometimes systematic uncertainties do add their effects linearly, although we usually assume that they add in quadrature...
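The difference between the two combination rules is easy to see numerically. A minimal sketch, with made-up shift values, assuming two sources of uncertainty of sizes 3 and 4 in arbitrary units:

```python
import math

# Hypothetical shifts from two systematic sources, in arbitrary units.
s1, s2 = 3.0, 4.0

# Independent sources are combined in quadrature...
quadrature = math.sqrt(s1**2 + s2**2)   # -> 5.0
# ...but fully correlated, same-direction shifts add linearly.
linear = s1 + s2                        # -> 7.0

print(f"quadrature: {quadrature}, linear: {linear}")
```

The quadrature rule assumes the sources fluctuate independently; when two effects conspire in the same direction, as in the CDF case, the total displacement approaches the larger linear sum.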

    Scientific impact - or the cost of a wrong claim

    Let us go back to the discussion of the various issues dealt with earlier. One last thing to consider, when discussing how high the bar should be set for a discovery claim, is the cost of making a wrong claim.

    The cost of falsely claiming the discovery of a new 4-GeV hadron decaying to charmed mesons - a particle whose existence does not change our general understanding of strong interactions or the Standard Model in any significant way - is of course minor: our reputation as scientists is not much affected. That is because the claim of a new particle does not make the headlines of newspapers around the world. The impact of the claim is small, and the possible backfiring if it is found faulty is also small.

    One may compare the above with the cost of claiming that neutrinos are superluminal: such a claim has an enormous impact in the media; by putting it forth, an experiment voluntarily steps onto hot coals, with the prospect of being kept there for as long as it takes to ascertain whether the claim is true or false.

    That is indeed what happened with OPERA: it took seven months to find out that the observed effect was due to an experimental error. At the end of the process one usually does not get away with it easily, and in fact the OPERA spokesperson had to resign, and the reputation of the experiment was at least in part affected.

    I have argued that the whole affair did not negatively affect the perception of basic science by outsiders: the value of drawing interest to fundamental physics is in my opinion very high, so even if laymen for once got the impression that scientists can fail and that their claims should not always be believed, they at least got to know what neutrinos are, how hard it is to detect them, and that there are facilities that shoot neutrinos through the rock of the Earth to other places in the world. But even if I am right, the subjective cost, for an OPERA scientist, of the faulty claim was rather large.

    An even larger cost would have been entailed by a faulty claim of a Higgs boson discovery by CERN. The Higgs boson had been in the newspapers for years, and the Large Hadron Collider the subject of speculation on whether big science really costs taxpayers too much money (it doesn't, by the way - one year of the money spent in Italy on visits to magicians and tarot readers could finance the construction of a second LHC ring!). Claiming discovery of the Higgs and then having to retract it would have seriously damaged the image of CERN, and of particle physics in general. That is, in a nutshell, why the definitive announcement of a discovery was made relatively late in the game (five-sigma excesses by two independent experiments, in July 2012).

    So we are led to the conclusion that the cost function connected to a false discovery claim depends a lot on what is being claimed, as well as on who is behind it. Large experiments have more to lose, because they are under the spotlight of media attention. And claims of new physics are of course riskier.

    Whether the cost of being wrong should enter the discovery threshold, however, is not a given: from a scientific purist's point of view one should disregard it, and be concerned only with the claim itself. But in practice, a wise spokesperson should consider it very carefully!

    In Summary

    To summarize this long article: in general I am very happy that more and more outsiders are getting to know what five standard deviations are and what they signify for scientific investigations in fundamental physics; I am also happy to observe similar instances of the language of scientific investigation being made less arcane. We all win if we communicate more easily, and the language of science becomes the language of every day.

    However, I am a bit disappointed in the particular case at hand, observing how an arbitrary, fixed convention has crystallized into an immovable requirement. It looks as if the popularization of the concept has made it even harder to replace it with something smarter. The strict "five sigma" criterion is liable to make us claim false discoveries before we have done our homework carefully (as was the case with OPERA), or to wait years before we can confirm a model, when trials factors and systematics are not a concern and when we are studying an effect that we know must exist.

    Despite the careful accounting of Rosenfeld, times have changed as far as bump-hunting is concerned: we no longer look at millions of histograms with all possible combinations of invariant masses, simply because there is little left to discover, alas. We do look at hundreds of mass distributions in modern-day collider experiments, though, so the trials factor still exists; but nowadays we know well how to account for it, using the power of computer simulations. Toy simulations allow us to estimate the trials factor in every practical situation, so that we can simply quote a local and a global p-value for any fluctuation we observe.
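The toy-simulation procedure described above can be sketched in a few lines. All numbers here (number of bins, background mean, observed local p-value) are invented for illustration, and a Gaussian approximation stands in for the exact Poisson tail; a real analysis would use the full background model of the experiment:

```python
# Toy-simulation sketch of the trials factor ("look-elsewhere effect"):
# how often does a background-only experiment produce, ANYWHERE in the
# spectrum, a fluctuation at least as extreme as the observed bump?
import math
import numpy as np

rng = np.random.default_rng(42)

N_BINS = 100        # hypothetical number of mass bins scanned for a bump
MU_BKG = 50.0       # expected background count per bin
P_LOCAL_OBS = 1e-3  # made-up local p-value of the observed fluctuation
N_TOYS = 2000

def local_p(n):
    """Approximate local p-value of an upward fluctuation in one bin
    (Gaussian approximation to the Poisson tail)."""
    z = (n - MU_BKG) / math.sqrt(MU_BKG)
    return 0.5 * math.erfc(z / math.sqrt(2.0))

# For each background-only pseudo-experiment, find the most significant
# bin anywhere in the spectrum; the global p-value is the fraction of
# toys whose best bin fluctuates at least as strongly as the observed one.
toys = rng.poisson(MU_BKG, size=(N_TOYS, N_BINS))
best_p = np.array([min(local_p(n) for n in toy) for toy in toys])
p_global = float(np.mean(best_p <= P_LOCAL_OBS))

print(f"local p-value:  {P_LOCAL_OBS:.1e}")
print(f"global p-value: {p_global:.3f}")
```

With a hundred bins searched, a locally impressive fluctuation becomes globally unremarkable: the global p-value comes out roughly a hundred times larger than the local one, which is exactly the penalty the trials factor encodes.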

    On the other hand, unknown or ill-known systematic effects always play a role, and the possibility of a mistake is always there in the very difficult measurements we routinely perform at particle colliders and other particle physics facilities. Of course, some measurements are harder to perform than others: the sources of systematic uncertainty to be accounted for in the OPERA timing measurement numbered in the dozens, and some took quite some ingenuity to estimate; yet a very basic one was missed... a loose cable. It sucks, but such errors are always possible.

    What I believe should be the guiding idea in changing the five-sigma criterion, or tailoring it to different situations, is a combination of our belief in the soundness of the evaluation of systematic uncertainties, in the robustness of the analysis, and in the surprise factor connected to the studied effect. Asking blindly for five sigma in widely different cases is a very, very silly thing to do!

    A proposal

    Hence a proposal can be put forth. One can simply put together a list of past, present, and future measurements in particle physics and evaluate separately the different components that should inform the decision on what constitutes a reasonable threshold for the observed significance of the data - the threshold which, once crossed, grants the right to put forth a discovery claim.

    Of course the entries of such a table are somewhat arbitrary. Each of us can try and assess the various inputs independently, and a discussion may or may not converge on an agreed-upon value. However, the compilation of the table is in itself a very useful learning exercise.

    Let us see what the columns of such a table could be. Above we discussed the trials factor, known and unknown systematics, the surprise factor, a-posterioriness, and cost functions.

    We might add a column specifying how beneficial it is that the result of the analysis be known outside the collaboration: if a discovery is not claimed because the experiment needs more data to reach the agreed-upon significance threshold, and a long publication delay results, then one must consider the cost, to the scientific community as a whole, of that delay.

    Imagine what would have happened if the Planck collaboration had found a significant departure of their data from the standard cosmological model in the power spectrum of the cosmic microwave background: they might have decided either to publish their finding, adding caveat sentences to the publication to explain the existence of some still imperfectly studied systematic effects, or to keep the data private and enter a long investigation of those effects, hiding their finding from the scientific community for six more months. Those six months might be considered wasted time by a cosmologist who had conceived some pet model predicting exactly those modifications to the power spectrum!

    Louis' table

    So let us see what we come up with. The table below is just a first attempt, compiled by my colleague Louis, to stimulate a discussion on the topic of the discovery threshold in physics discoveries.

    In the table you see a list of discoveries, and an assessment - qualitative in most cases - of the various things one should consider when setting a proper discovery threshold. In order, we see the surprise factor, the scientific impact, the size of the trials factor ("LEE", for look-elsewhere effect), the systematic uncertainties that may affect the measurement, and finally the proposed number of standard deviations that an effect must reach in order to be considered a discovery.

    (Again, I should mention that some of these evaluations are arguable - indeed, for single top it has already been pointed out in the comments thread of one of the previous parts of this article that systematics were not irrelevant, so three sigma would have been too little for a discovery claim...).

    The table includes both past discoveries and possible future ones. You may play the game and add your own set of discoveries - it would be nice to start an open discussion on the issue in the comments thread!

    Finally, let me thank you if you managed to read these four posts in full. I realize I have written more than I should have, but the topic is so varied and there is so much to discuss... I might write a book on this one day!


    This is a nice series of articles calling attention to several details.
    I have recently found myself explaining to a summer student why people tend to invert hypothesis tests when the result disfavours the alternative hypothesis - i.e. going from concluding that the alternative is disproved to concluding that the null is affirmed - and these aspects all featured in that discussion.

    Alas, it is hard to separate what the mathematical significance of a result is and what its importance (for lack of a better word) is.
    Perhaps the title of that upcoming book is "Assessing the importance of scientific results - a decision theory approach to assessing significance"?

    Tommaso, you say "... Imagine what would have happened if the Planck collaboration had found a significant departure of their data from the standard cosmological model ...
    they might have decided to either publish their find, adding caveat sentences ...
    or rather keep the data private and enter a long investigation of those effects ...".

    In fact, it seems that Planck has done both:
    - the basic data, with some interpretation, has been released, showing some differences between Planck and WMAP;
    - the Planck results XI paper on "Consistency of the data" is "still in preparation" and has not yet been released.

    The situation is followed by Francis (th)E mule Science's News in entries dated 17 August 2013 and 6 April 2013. In the earlier entry, Francis said (translated from Spanish):
    "... In almost every talk by members of the Planck Collaboration, a 'black sheep' in the audience asks whether the reason for this discrepancy is known yet, and when the paper XI on the consistency of the Planck results will be made public; the answer is always the same: the reason is not known, but it is not a Planck systematic error, and paper XI will be published soon. We are still waiting ...".


    Hi Tommaso,
    Thanks for the interesting set of articles. Does Louis's table come from a talk, or have some other context? I'd be interested to take a closer look.


    The single most important extra step that could be taken to increase the chance of a new discovery would be making LHC data public, following the standard practice of other fields. Having some wrong claims matters less than the possibility that analysing the data in novel ways would lead to a real discovery that would otherwise be missed.

    Ciao Alessandro,

    I agree that the data should at some point be made public. I however have strong doubts that this would foster new discoveries; or rather - it would certainly produce many new FALSE ones.

    The problem is that in order to understand the data one needs an enormous amount of expertise - more than a single physicist can master. That is why our analyses proceed through the scrutiny of the whole 3000-strong collaboration: 6000 eyes see better than two. The possibilities of making a mistake, or of cutting corners that affect the understanding of the data, are so many that it is quite hard to believe that an outsider, or a small group of outsiders, with no insider knowledge of the detector details etcetera, could do a good enough job.

    What is possible and should be encouraged is that outsiders with good ideas be allowed to collaborate with groups of physicists from the collaboration. This would be a much better working model...

    hi Tommaso,

    for sure many claims would be wrong, but still something good could come out of it, as happened when outsiders looked at data from Fermi, WMAP, etc.
    Big collaborations of 3000 people are needed and do excellent work, but they have the limitation of a big social inertia. I think that having 3001 people instead of 3000 would not make a difference.

    What could make a difference is independent crazy attempts.
    After all, there are things that mice can do and that elephants cannot do...


    I agree in part, and I am not against the idea in general. However most of my colleagues are... And since it is very hard to change a single physicist's mind about the properties of his data, imagine 3000.

    In cases like single top or Bs->μμ - or, in general, expected SM processes awaiting confirmation - what is really interesting is a deviation from the expected rate (an extreme case being a non-observation). Even for the Higgs the same could be argued.
    How many sigmas would you require to be convinced of their exclusion?

    Hi Andrea,

    I'd say it would be a high surprise factor / high-impact find. No LEE. As for systematics, it of course depends on the details of the measurement. But I believe that "excluding the SM" by finding a rate significantly different from expected would require at least 6-7 sigma to be taken seriously...

    I mentioned in a previous comment the "single Z" publication, not "single top". In fact the original publication (Physics Letters 162B) had 4 Zs decaying to two electrons, and 1 Z decaying to two muons (the abstract itself mentions four events). Apparently UA1 sent out a preprint after their first Z. I say that because there is a title page with abstract of just such a one-Z claim that was added to the paperback reissue of a then-current introduction to particle physics textbook. (The main body of the text had said the expected discovery signal was going to be dramatic and unmistakable.)

    The single top is a little murky since, unlike the Z, the top sits on noticeable background, and "evidence" for the top was published piecemeal. DØ for the longest time claimed only a single top.

    Two other notable single event publishable discoveries have been the Ω-, which was immediately accepted as genuine and has been amply confirmed over the years, and the magnetic monopole, which has never been accepted and has never been confirmed. Amusingly, there are theories that state, roughly speaking, that there exists exactly one magnetic monopole per visible universe.

    Two other prominent examples were (1) Weber bars--it's still beyond belief that they were all down at the time of SN1987a--and (2) some of the early exoplanet discovery claims were actually artifacts.

    Hi Tommaso,

    As a graduate student working on search software development for LIGO, I can't help but be curious how "Grav Waves" got its scores. Since the surprise factor is deemed low, I guess the significance requirement comes from the presumed enormous LEE and large potential systematic errors. Is this correct?

    It seems that according to this criterion, a search for gravitational waves could 'never' claim a first detection if an 8σ p-value is required. On the face of it, this seems ridiculously high and unrealistic. How much more data would ATLAS or CMS have had to collect for their Higgs discovery to hit 8σ?

    Other than my issues with the table of searches, I have very much enjoyed the series of articles on the problems of a naive 5σ.


    Dear Hmn,

    ATLAS and CMS now have over 8 sigma by combining their signals (I think ATLAS alone is close to that actually).

    I understand your concern, but the trials factor for gravitational waves is indeed very large. However, if one experiment found, say, a 4- or 5-sigma effect, and a second one concentrating on that frequency alone found another 3- or 4-sigma effect, that would probably be sufficient.

    Needless to say, your list of possible appealing discoveries could be much richer...
    My two cents, considering only LHC possibilities:
    - extra dimensions: I think a big surprise and a very big impact, large LEE due to the lack of exact predictions, systematics from small to sizable depending on the signature; sigma bar above 5-6
    - violation of discrete symmetries (B number, L number, CPT): I think a very big surprise and a very big impact, no LEE but large systematics, all of this requiring a sigma bar maybe larger than 7-8

    Hi Leonardo,

    the list indeed was just a stub, and the values were not mine but Louis' choice. Indeed it would be good to include more signatures and effects... As you suggest, some of the discoveries the LHC could make must be well above 5 sigma to be credible. At some point, though, the LEE is wiped out by a two-experiment find, while correlated systematics (especially theoretical assumptions) remain the culprit in that case.