In High-Energy Physics, the small p-value of an observation may be the first hint of a discovery about to be made. By "p-value" I simply mean a probability; the term is just a fancier (or briefer) way of saying it. Because we rely on assessing the rarity of observations to decide whether we have discovered something or not, we physicists are (or should be) really careful with p-values. Today's article aims at demonstrating how easy it is to get carried away and give an observation more relevance than it deserves.

The problem is that, upon observing an odd phenomenon, we are often unable to correctly take into account the a-posteriori nature of our attempts at estimating the probability of our observation. In some instances this is called the "look-elsewhere effect", but the issue is more general.

**Two basic concepts**

To clarify what I am getting at, let me first introduce two concepts. The first is the "flatness" of a p-value. If you choose at random a number between zero and one, and your random number generator is unbiased, the number has the same probability to be in the range [0,0.01] as to be in the range [0.49,0.50] or [0.99,1.00]: this probability is in fact one hundredth in all cases. We call this a "flat distribution", or more correctly "uniform distribution".

The second idea is that usually, when we get some data, we perform some test on it which returns a p-value: this again is a number between zero and one, which says how likely it is to obtain data at least as discrepant from the model as those observed, if our model is true. By "model" I am referring to any underlying physical theory which we call our "null hypothesis". It could be the Standard Model, and the data could be some reconstructed particle mass distribution which, in the presence of new physics, should depart from the model we fit to it.
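The two concepts meet in a simple way: if the data really are drawn from the null hypothesis, the p-value of a test is itself flatly distributed between zero and one. Here is a minimal sketch of that, under assumptions of mine that are not in the text above: the test statistic is a standard Gaussian variable, the p-value is its one-sided upper tail, and the function names, number of pseudo-experiments and seed are arbitrary illustrative choices.

```python
import math
import random

def one_sided_p_value(z):
    """Probability that a standard Gaussian variable exceeds z."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))

def p_value_fractions(n_experiments=100_000, n_bins=10, seed=1):
    """Fraction of null-hypothesis p-values falling in each of n_bins equal bins."""
    rng = random.Random(seed)
    counts = [0] * n_bins
    for _ in range(n_experiments):
        z = rng.gauss(0.0, 1.0)           # pseudo-data drawn from the null model
        p = one_sided_p_value(z)
        index = min(int(p * n_bins), n_bins - 1)
        counts[index] += 1
    return [c / n_experiments for c in counts]

fractions = p_value_fractions()
for i, f in enumerate(fractions):
    print(f"[{i / 10:.1f}, {(i + 1) / 10:.1f}): {f:.4f}")
```

Each of the ten bins should collect close to one tenth of the pseudo-experiments, which is what "flat" means here.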

**P-values in HEP**

Now, what we generally do upon getting a p-value from some fit to some data is to compare it with a pre-defined threshold value: if it is below 0.00135 we call the observed effect a "three-sigma" one; if it is below 0.00000029 we call it a "five-sigma" one (the jargon is due to the correspondence of those values with tail integrals of a Gaussian distribution from 3-sigma to infinity, or from 5-sigma to infinity). It is clear from this that a three-sigma effect can sometimes occur in data that are indeed drawn from our model, while a five-sigma effect is usually a sufficient indication to claim that the model is false, and that we have, e.g., discovered a new particle not included in it.
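The correspondence between "n-sigma" and a p-value can be checked numerically. The sketch below assumes the one-sided convention described in the parenthesis above (the integral of a unit Gaussian from n-sigma to infinity); the function name `gaussian_tail` is my own.

```python
import math

def gaussian_tail(n_sigma):
    """Integral of a unit Gaussian from n_sigma to infinity (one-sided tail)."""
    return 0.5 * math.erfc(n_sigma / math.sqrt(2.0))

for n in (3, 5):
    print(f"{n}-sigma one-sided p-value: {gaussian_tail(n):.3g}")
```

With this convention the tails come out at about 1.35e-3 for three sigma and 2.87e-7 for five sigma. A two-sided convention would double both numbers.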

What do we do, then, if instead of a single p-value our experiment is designed to produce several (as is the case if, e.g., we are testing not just one mass distribution but several independent kinematic distributions together)? We need a prescription to combine them, because the number of questions we may end up asking ourselves upon observing these p-values grows geometrically with their number!

**Fun with p-values**

Among the questions one might be tempted to ask, upon getting several p-values, are the following:

- what is the probability that the smallest of them is as small as the one I got?

- what is the probability that the largest one is as I observed it to be?

- what is the probability that the product of the numbers is as small as this?

Please note: *your inference on the data at hand strongly depends on the check you perform, for a given set of data*!

In case you believe that each of your, say N, p-values tells you something about the truth of the null hypothesis you are testing, you may want to avoid concentrating on any subset of the set, and just use them all together. A reasonable (but not the optimal!) thing to do is then to check how small the product of the N numbers is. There is a very handy formula that provides the cumulative distribution of x = p1*p2*...*pN. This can be derived by induction (see e.g. Byron Roe, "Probability and Statistics in Experimental Physics", Springer-Verlag 1992, p.129) and is equal to

$$P(x|N) = x \sum_{j=0}^{N-1} \frac{(-\ln x)^j}{j!}$$

Above, N is the number of observations; j runs from 0 to N-1 (that is, from 0 to 4 in the five-value examples below); and x is the product of the N observations. The exclamation mark is the symbol of the factorial [j! = j*(j-1)*...*2*1, with 0! = 1]. Note that the formula is blind to the individual values, as it should be: it only cares about how many of them there are and about their product.
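For readers who want to play with the numbers, here is a minimal sketch of the product test in Python. It assumes the standard closed form for the cumulative distribution of the product of N independent uniform numbers, P(x|N) = x * sum over j from 0 to N-1 of (-ln x)^j / j!; the function name `product_p_value` is my own.

```python
import math

def product_p_value(p_values):
    """Probability that the product of len(p_values) independent uniform
    numbers is as small as (or smaller than) the observed product."""
    n = len(p_values)
    x = math.prod(p_values)               # requires Python 3.8 or later
    if x <= 0.0:
        return 0.0                        # a zero p-value forces a zero product
    log_term = -math.log(x)               # non-negative, since 0 < x <= 1
    series = sum(log_term ** j / math.factorial(j) for j in range(n))
    return x * series

flat = [0.1, 0.3, 0.5, 0.7, 0.9]
print(f"product = {math.prod(flat):.5f}, P = {product_p_value(flat):.4f}")
```

The function only looks at the number of p-values and at their product, so any reshuffling of the list gives the same answer. For a single p-value it simply returns that p-value.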

Let us check this formula with a simple case. We take five really uniformly distributed p-values for a start:

**p1=0.1, p2=0.3, p3=0.5, p4=0.7, p5=0.9** (note that the ordering is inessential here). The product of the five numbers above is 0.00945, and the formula above gives the probability of obtaining a product as small as this: P(0.00945|N=5) = 0.5017. As expected, the numbers are distributed flatly, and we find a combined probability which does not scream at anything abnormal.

Now let us take instead

**p1=0.00001** and the other numbers as above. The product is 0.000000945, and P(0.000000945|N=5) is 0.00198. So there are about two chances in a thousand that the product is as small as that. Note that two in a thousand is a rather large probability, given that one of the numbers is as small as a hundred-thousandth! One might (correctly) think that the chance of getting one in five numbers as small as 10^-5 is only a few times 10^-5. But that is not what we tested! We are testing the product, not how small the smallest number is.

Finally, let us take

**p1=0.05, p2=0.10, p3=0.15, p4=0.20, p5=0.25**. The test of the product gives P(0.0000375|N=5)=0.0258. This is shown pictorially on the right: the lower graph shows the cumulative distribution of the product of five numbers, the x axis giving the product and the y axis giving the global p-value; the upper graph shows the probability density function of the product of five numbers distributed flatly between zero and one.

Upon getting a 2.58% probability that the five numbers are as given above, you may be tempted to say that they do not give strong evidence against the null hypothesis (which, I remind you, is that the numbers distribute uniformly between zero and one). You might want to compare this conclusion with the one you would get if you had been asking "what is the chance that five numbers between zero and one are all smaller than 0.25?", which is a totally different question from the one on the product. The answer to that question is P=(0.25)^5 = 0.00098, a much smaller probability! This is a simple demonstration that the a-posteriori choice of the test is to be avoided at all costs. Alas, it is a lesson which is still occasionally needed in HEP!
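To see how much the answer depends on the question, here is a small sketch that puts the three questions listed earlier to the same five numbers. The function names are mine, and I am assuming the p-values are independent and uniform under the null hypothesis: the smallest of N is then as small as observed with probability 1-(1-p_min)^N, the largest is as small as observed with probability p_max^N, and the product follows the closed form for N uniform numbers.

```python
import math

def p_smallest(p_values):
    """Chance that the smallest of N uniform numbers is <= the observed minimum."""
    n = len(p_values)
    return 1.0 - (1.0 - min(p_values)) ** n

def p_largest(p_values):
    """Chance that all N uniform numbers are <= the observed maximum."""
    return max(p_values) ** len(p_values)

def p_product(p_values):
    """Chance that the product of N uniform numbers is <= the observed product."""
    x = math.prod(p_values)
    log_term = -math.log(x)
    return x * sum(log_term ** j / math.factorial(j) for j in range(len(p_values)))

observed = [0.05, 0.10, 0.15, 0.20, 0.25]
results = {
    "smallest": p_smallest(observed),
    "largest": p_largest(observed),
    "product": p_product(observed),
}
for name, value in results.items():
    print(f"{name:>8}: {value:.5f}")
```

The same data give roughly 0.23, 0.001 and 0.026 for the three questions: a spread of more than two orders of magnitude, depending only on which test one picks after looking.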

Behold the mighty dinosaur,

Famous in prehistoric lore,

Not only for his power and strength

But for his intellectual length.

You will observe by these remains

The creature had two sets of brains—

One in his head (the usual place),

The other at his spinal base.

Thus he could reason 'A priori'

As well as 'A posteriori'.

No problem bothered him a bit

He made both head and tail of it.

So wise was he, so wise and solemn,

Each thought filled just a spinal column.

If one brain found the pressure strong

It passed a few ideas along.

If something slipped his forward mind

'Twas rescued by the one behind.

And if in error he was caught

He had a saving afterthought.

As he thought twice before he spoke

He had no judgment to revoke.

Thus he could think without congestion

Upon both sides of every question.

Oh, gaze upon this model beast

Defunct ten million years at least.

at the Carnegie Museum, 1934.

This (on the right) may well have been the creature in question, the Apatosaurus with the skull of a Camarasaurus, the constructed and now sadly lost Brontosaurus. It was widely spoken of as having a “bigger brain in its backside”.

At Reading, we had a number of PhD students with one supervisor in Physics and one in Chemistry. When the two supervisors disagreed over something, the student must have felt like the brain in the buttocks of a two-headed Brontosaurus.