In the previous installment of this longish article, I introduced some of the issues that may affect the correct interpretation of a statistically significant effect. Below is the second part, which deals with the Rosenfeld paper and tries to address the questions I posed a few days ago; a third and a fourth part, to be read after this one, discuss some additional factors and draw some conclusions.

Enter Rosenfeld

In 1968 Arthur Rosenfeld wrote a paper titled "Are There Any Far-out Mesons or Baryons?". In it, he demonstrated that the number of claims of discovery of such exotic particles published in scientific journals agreed reasonably well with the number of statistical fluctuations that one would expect.

I am not going to describe in detail what "far-out mesons" and baryons are here, since it would be heavily off-topic; suffice it to say that these hypothetical particles can be defined as ones that would not fit the picture of being composed of quark-antiquark pairs or quark triplets, as we now know to be the case, with no exception, for hadrons. In 1968 quarks were not yet accepted as real entities, and since the "Eightfold way" of Gell-Mann allowed a very successful categorization of all known hadrons (i.e. mesons or baryons) into multiplets of the SU(3) group, the question of the existence of exotic hadrons not fitting Gell-Mann's schemes was a very important one.

[For the record, the "Eightfold way" is the core of the static quark model of hadrons. It is a group-theory-based scheme suggesting that hadrons are composed of three different kinds of quarks: the up, the down, and the strange quark. The up and down quarks form the proton and the neutron, and the strange quark helps to construct strange mesons. Together the trio allows several combinations that exhibit different properties quite like what one would expect from their composition.]

Rosenfeld examined the literature and pointed his finger at the large trials factors that came into play through the massive use of combinations of observed particles to derive mass spectra containing potential discoveries. Over the previous twenty years physicists had learnt that to discover a new resonance they could combine the four-momenta of observed tracks in bubble chambers: if state A decayed into particles B, C and D, by adding the four-momenta of B+C+D one could derive the mass of A. With a clear idea of what to search for, one would do this only with specific combinations of observed particle tracks (or even neutral particles inferred from the combination of the two charged ones that the neutrals had decayed into); but when searching for unknown hypothetical new states whose properties were not predicted by a model, one could start combining any possible set of observed tracks, creating a multiplication of possible search channels:

"[...] This reasoning on multiplicities, extended to all combinations of all outgoing particles and to all countries, leads to an estimate of 35 million mass combinations calculated per year. How many histograms are plotted from these 35 million combinations ? A glance through the journals shows that a typical mass histogram has about 2,500 entries, so the number we were looking for, h is then 15,000 histograms per year (Our annual surveys also tells you that the U.S. measurement rate tends to double every two years, so things will get worse)."

He derives a trials factor as follows:
"[...] Our typical 2,500 entry histogram seems to average 40 bins. This means that therein a physicist could observe 40 different fluctuations one bin wide, 39 two bins wide, 38 three bins wide... This arithmetic is made worse by the fact that when a physicist sees 'something', he then tries to enhance it by making cuts..."

This is also painfully true: enthusiastic experimentalists may get excited by a bump, convince themselves they have a discovery in their hands, and from that point on do whatever it takes to make the peak more convincing. They will fool themselves into believing that a particular selection cut has a solid theoretical basis if they see that the cut enhances the peak; they will find the best possible choice of binning to make the peak appear more prominent; and they will refuse to accept that their own behaviour is the source of most of the statistical significance of their own find. I have been a particle experimentalist for long enough to have seen several such attacks of compulsive behaviour seize otherwise straight-thinking and esteemed colleagues.
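
Rosenfeld's bin-counting argument lends itself to a quick numerical sketch. The Python snippet below counts the contiguous windows in a 40-bin histogram and turns that count into a rough yearly tally of fake four-sigma bumps. It is not Rosenfeld's own calculation: the function names, the cap of three bins on the width of a bump, and the treatment of overlapping windows as independent Gaussian trials are all assumptions made for illustration, and the last one overestimates the tally. The effect of tuning cuts after seeing a bump is not modelled at all.

```python
import math

def count_windows(n_bins, max_width):
    """Count contiguous bin windows of width 1..max_width in an n_bins histogram."""
    return sum(n_bins - width + 1 for width in range(1, max_width + 1))

def one_sided_p(z):
    """Upper-tail Gaussian probability of a fluctuation of at least z sigma."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))

# Figures taken from the quotes above.
COMBINATIONS_PER_YEAR = 35_000_000
ENTRIES_PER_HISTOGRAM = 2_500
N_BINS = 40
HISTOGRAMS_PER_YEAR = 15_000  # the quoted figure; 35e6 / 2500 is 14,000

# Assumptions of this sketch (not from the paper).
MAX_WIDTH = 3      # consider only bumps up to three bins wide
Z_THRESHOLD = 4.0  # what counts as a "4-sigma peak"

windows = count_windows(N_BINS, MAX_WIDTH)  # 40 + 39 + 38
fakes_per_year = HISTOGRAMS_PER_YEAR * windows * one_sided_p(Z_THRESHOLD)

print(f"histograms per year (unrounded): {COMBINATIONS_PER_YEAR / ENTRIES_PER_HISTOGRAM:.0f}")
print(f"windows per histogram: {windows}")
print(f"expected fake {Z_THRESHOLD:.0f}-sigma bumps per year: {fakes_per_year:.0f}")
```

With these inputs the tally comes out at several tens of fake bumps per year, but the number depends directly on the assumed cap on the window width: counting windows of every width (820 of them) would raise it considerably.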

Rosenfeld then explains in detail the program "Game", which I have already mentioned:

"My colleague Gerry Lynch has instead tried to study this problem 'experimentally' using a 'Las Vegas' computer program called Game. Game is played as follows. You wait until an unsuspecting friend comes to show you his latest 4-sigma peak. You draw a smooth curve through his data (based on the hypothesis that the peak is just a fluctuation), and punch this smooth curve as one of the inputs for Game. The other input is his actual data. If you then call for 100 Las Vegas histograms, Game will generate them, with the actual data reproduced for comparison at some random page. You and your friend then go around the halls, asking physicists to pick out the most surprising histogram in the printout. Often it is one of the 100 phoneys, rather than the real '4-sigma' peak."

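The procedure Rosenfeld describes is what we would now call a toy Monte Carlo, and it is easy to sketch in Python. The snippet below is a reconstruction from the quoted description only, not Lynch's actual program: the function names, the Poisson fluctuation of each bin around the smooth curve, and the flat 40-bin example are assumptions, and a crude single-bin score stands in for the physicists polled in the hallway.

```python
import math
import random

def poisson(rng, mean):
    """Draw one Poisson variate (Knuth's multiplication method, fine for modest means)."""
    limit = math.exp(-mean)
    k, prod = 0, rng.random()
    while prod > limit:
        k += 1
        prod *= rng.random()
    return k

def play_game(smooth_curve, real_data, n_fakes=100, seed=None):
    """Return (pages, real_index): n_fakes fluctuated copies of the smooth curve,
    with the real histogram hidden at a random position among them."""
    rng = random.Random(seed)
    pages = [[poisson(rng, mu) for mu in smooth_curve] for _ in range(n_fakes)]
    real_index = rng.randrange(n_fakes + 1)
    pages.insert(real_index, list(real_data))
    return pages, real_index

def bump_score(hist, smooth_curve):
    """Largest single-bin upward excess, in units of sqrt(expected count)."""
    return max((n - mu) / math.sqrt(mu) for n, mu in zip(hist, smooth_curve))

# Hypothetical example: a flat curve with 2,500 entries in 40 bins,
# and "real" data with one bin roughly 4 sigma above the curve.
N_BINS = 40
smooth = [62.5] * N_BINS
real = [62] * N_BINS
real[17] = 95

pages, real_index = play_game(smooth, real, n_fakes=100, seed=1968)
fake_scores = [bump_score(page, smooth)
               for i, page in enumerate(pages) if i != real_index]

print(f"real peak: {bump_score(real, smooth):.2f} sigma")
print(f"largest excess among the phoneys: {max(fake_scores):.2f} sigma")
```

With 100 pages of 40 bins each there are 4,000 single-bin chances for a fluctuation, so the largest phoney excess typically exceeds three sigma; scanning wider windows, as a human eye does, only adds more chances.
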
As I transcribe Rosenfeld's description of Game I can't help chuckling at the "punching" of the input. He is talking of a long-gone era when computers still took punched cards as inputs! But let's follow the reasoning of Rosenfeld, who summarizes his calculation thus:

"In summary of all the discussion above, I conclude that each of our 150,000 annual histograms is capable of generating somewhere between 10 and 100 deceptive upward fluctuations [...]".

With the backing of the hard data described above he is basically accusing, without explicitly naming them, the authors of 10 to 100 unconfirmed past claims of exotic particles of having been fooled by their combinatorics-based searches and by a huge trials factor. Brilliant!

Back to the original questions

The above summary of Rosenfeld's paper is a possible answer to the question of the origin of the five-sigma criterion in particle physics. However, we should note that although the article was written in 1968, the strict criterion of five standard deviations for discovery claims was not adopted in the seventies and eighties. For instance, no such thing as a five-sigma criterion was used for the discovery of the W and Z bosons, which earned Rubbia and Van der Meer the Nobel Prize in physics in 1984. Rubbia was an exceptional speaker, and when he presented his finds he did it so convincingly that nobody could doubt that the handful of W and Z events he had collected were proof enough of the existence of those particles.

In contrast, I can recall that five standard deviations were a required threshold for the top quark discovery in 1994-1995. Because of it, the very strong and convincing result of the 1994 CDF search - a three-sigma numerical excess, but one complemented by a similarly sized effect from the kinematical distributions of the events - was only called "evidence", and the official discovery was only claimed in 1995, jointly with DZERO. Note that the 1994 measurement of the top quark mass by CDF, pre-discovery, was M_top = 174 GeV, with a 10-GeV-ish error (for comparison, the first DZERO measurement in their 1995 paper was 199 GeV, with a 20-GeV-ish uncertainty). It is almost unbelievable that after 19 years the first CDF measurement still stands anchored to the value we believe to be true, as the world average of the top quark mass is now known with sub-GeV accuracy at M_top(WA) = 173.3 ± 0.8 GeV!

More recently, the 2002 ALEPH claim of a Higgs boson at 114 GeV did not reach three standard deviations; once averaged with the three other experiments along the LEP II ring, the effect was finally assessed as a 1.7 sigma one. A false but weak signal, and yet one which threatened the schedule of the Large Hadron Collider as a few believers tried to advocate an extended run of LEP II - which would have significantly delayed the program of the LHC construction.

What is certain is that the five-sigma threshold today appears very hard to modify. But is a fixed threshold a reasonable choice? I argue that it is not, and you will not be surprised to know that I am in good company. Many of the most statistics-savvy physicists in the CMS and ATLAS experiments concur that a fixed threshold is a rather silly thing. Let me explain my own perspective on this.

First of all let us consider the original motivation (compare Rosenfeld's quotes above) of the strict criterion: the trials factor. Nowadays we don't make mass histograms of all conceivable N-body combinations, so the trials factor usually comes from the width of the mass range where we search rather than from the large number of distributions we study. Our trials factors are occasionally large, but most often they are small or even absent. Let us take the observation of single top production as a clear case where there was no trials factor involved.

In hadron-hadron collisions the top quark can be produced in pairs or singly. Pair production occurs through processes mediated by strong interactions; it was observed in 1995, as already discussed above. Single top production is instead an electroweak process, and thus a bit less frequent (not terribly so, because the smaller energy needed to produce only one massive body partly compensates for the weakness of the interaction); also, the fact that there is only one top gives rise to less striking signatures in the final state, so that the extraction of a signal is much more problematic. At the Tevatron the conclusive observation of single top production, entailing the collection of a "five-sigma" effect, took painstaking searches that lasted for several years during the last decade. Yet it can be argued that it should not have: five sigma were really not necessary there. Why? Because there was only one place to look for single top: the top quark mass was already very well known, and the kind of excess sought in the data could manifest itself in only one possible way. The trials factor was 1.0!

[Also worth mentioning is that in the single top search the presence of large, unknown systematic uncertainties could be ruled out, since all the backgrounds were very well known and studied: the most important of them are the same processes that one has to account for when looking for top quark pairs.]

Finally, one should note that the "surprise potential" of the observation of single top production was practically null: I know no colleague who would have bet a dime, fifteen years ago, on the chance that the process does not occur. Electroweak theory is too well verified experimentally, while it would need to be subverted too radically in order to make single top production non-existent.

For all the good reasons above, a genuine three-sigma excess of events with the required topology of the single top signature should have convinced anybody that the process had been conclusively discovered. Yet five sigma were needed.

Let us now consider a very different measurement: that of neutrino speeds by the OPERA experiment. In that case we were confronted, in September 2011, with a six-standard-deviation effect! Was that enough to convince physicists of the genuine nature of the observed effect? The answer is a resounding NO. As in the single top observation, there is no trials factor involved here; but systematic uncertainties are a different matter. What weighs the most, however, in making the measured statistical effect unconvincing despite its large magnitude, is the "surprise factor": so steep was the claim (neutrinos are superluminal) that probably not even a 10-sigma effect would have convinced most physicists. "Extraordinary claims require extraordinary evidence"!

Indeed, the source of the problem of the OPERA measurement, as we now know, was systematic. An unaccounted bias of 60 nanoseconds due to a loose cable was the source of the observed offset. That embarrassing mistake should help teach us something: the five-sigma criterion is a rather silly one, if used without a grain of salt. A five-sigma discovery can be wrong, and it is much more likely to be wrong if the discovered effect is a really, really hard one to believe.

[Note that despite being a frequentist, here I am straying into Bayesian territory: but there is no contradiction. Scientific results should be reported in a frequentist manner, but their interpretation can and should be left open to Bayesian reasoning - thus allowing one to evaluate cost functions and use decision theory, which frequentist tools do not provide. I admit that one's "prior beliefs" have an important role in deciding what to make of a scientific claim, because the beliefs are sometimes on much more solid ground than the claim!]
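
The interplay between significance, prior belief and possible mistakes can be made concrete with a toy application of Bayes' theorem. The Python sketch below is an illustration under stated assumptions, not an analysis of any of the measurements discussed here: the two-hypothesis model, the function names, and all the numerical priors and mistake probabilities are invented for the example.

```python
import math

def one_sided_p(z):
    """Upper-tail Gaussian probability of a fluctuation of at least z sigma."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))

def posterior_real(prior, z, power=1.0, p_mistake=0.0):
    """Posterior probability that an effect is real, given an excess of >= z sigma.

    prior     -- belief that the effect exists before seeing the data
    power     -- chance of seeing such an excess if the effect exists
    p_mistake -- chance that an unnoticed systematic error fakes the excess
    """
    p_fluke = one_sided_p(z)
    # An excess with no real effect behind it is either a mistake or a fluke.
    p_fake = p_mistake + (1.0 - p_mistake) * p_fluke
    numerator = prior * power
    return numerator / (numerator + (1.0 - prior) * p_fake)

# Invented numbers, chosen only to show the trend.
cases = [
    ("expected process, 3 sigma, no mistakes", 0.99, 3.0, 0.0),
    ("outlandish claim, 6 sigma, no mistakes", 1e-6, 6.0, 0.0),
    ("outlandish claim, 6 sigma, 1% chance of a mistake", 1e-6, 6.0, 0.01),
    ("outlandish claim, 10 sigma, 1% chance of a mistake", 1e-6, 10.0, 0.01),
]
for label, prior, z, p_mistake in cases:
    post = posterior_real(prior, z, p_mistake=p_mistake)
    print(f"{label}: posterior = {post:.6f}")
```

Once a nonzero chance of an unnoticed mistake is allowed, adding more sigmas barely moves the posterior, because the mistake term dominates the probability of a fake excess.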

Finally, let us now examine the LHC Higgs discovery of last year again. The trials factor (due to looking for a signal in a wide mass range without knowing the true mass of the Higgs boson in advance) was readily evaluated and included in the "global p-values" that the two experiments quoted. Systematics were either much smaller than statistical errors (in the case of the H->ZZ searches) or not relevant (due to data-driven background extraction methods in the case of the H->gamma-gamma searches). As for the surprise factor, it was rather low: most physicists would bet that the Standard Model Higgs was there to be found. No extraordinary claim, in other words.
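
The relation between a local and a global p-value can be sketched in a few lines of Python. The snippet assumes the simplest possible picture, namely n_trials independent places where a fluctuation could have appeared; the function names and the trials factors used are illustrative assumptions, and this is not the procedure ATLAS and CMS used to compute their global p-values.

```python
import math
from statistics import NormalDist

def p_from_z(z):
    """One-sided Gaussian p-value of a z-sigma excess."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))

def z_from_p(p):
    """Significance in sigma for a one-sided p-value (inverse of p_from_z)."""
    return -NormalDist().inv_cdf(p)

def global_p(local_p, n_trials):
    """Chance that at least one of n_trials independent searches fluctuates
    as much as observed: 1 - (1 - local_p)**n_trials, computed stably."""
    return -math.expm1(n_trials * math.log1p(-local_p))

local_z = 5.0
for n_trials in (1, 10, 100):  # hypothetical trials factors
    p_glob = global_p(p_from_z(local_z), n_trials)
    print(f"trials factor {n_trials:>3}: local {local_z:.1f} sigma "
          f"-> global {z_from_p(p_glob):.2f} sigma (p = {p_glob:.2e})")
```

With a trials factor of 1, as argued above for single top, local and global significance coincide; for small p-values the global p-value is close to the local one multiplied by the trials factor.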

Crucially, though, what should have convinced everybody that the Higgs boson had been found already in December 2011, when ATLAS and CMS presented their interim results, was that both experiments were coherently observing the same effects. These were not five-sigma ones, but rather three-sigma or so. Still, once two independent experiments find a particle signal at the same mass, the trials factor is automatically eliminated. The two detectors brought in systematic uncertainties that were at least in part different, and the analyses were not exactly equal; so their coherent find was a very convincing argument.
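
As a rough illustration of why two coherent three-sigma effects carry so much weight, the Python sketch below combines independent significances with Stouffer's unweighted Z method. This is a textbook shortcut chosen for the example, not the combination procedure of the experiments; it assumes fully independent measurements of equal weight testing the same mass hypothesis, and the function name is made up.

```python
import math

def combine_z(z_values):
    """Stouffer's method: combined significance of independent, equally
    weighted z-scores, i.e. their sum divided by sqrt(number of scores)."""
    z_values = list(z_values)
    return sum(z_values) / math.sqrt(len(z_values))

z_combined = combine_z([3.0, 3.0])
p_combined = 0.5 * math.erfc(z_combined / math.sqrt(2.0))

print(f"two independent 3-sigma effects -> {z_combined:.2f} sigma "
      f"(one-sided p = {p_combined:.1e})")
```

Two independent three-sigma effects at the same mass combine to about 4.2 sigma, and since the second experiment is testing a mass already singled out by the first, no further trials factor applies to it.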


Continue to part III