On Combining Probabilities And A-Posterioriness
    By Tommaso Dorigo | December 10th 2012
    In High-Energy Physics the small p-value of an observation may be the first hint of a discovery about to be made. By p-value I mean here the probability, computed under the null hypothesis, of obtaining data at least as extreme as those actually observed. Because we rely on the assessment of the rarity of observations to decide whether we have discovered something or not, we physicists are (or should be) really careful with p-values. Today's article aims at demonstrating how easy it is to give an observation more relevance than it deserves.

    The problem is that, upon observing an odd phenomenon, we are often unable to correctly take into account the a-posteriori nature of our attempts at estimating its probability. In some instances this is called the "look-elsewhere effect", but the issue is more general.

    Two basic concepts

    To clarify what I am getting at, let me first introduce two concepts. The first is the "flatness" of a p-value. If you choose at random a number between zero and one, and your random number generator is unbiased, the number has the same probability to be in the range [0,0.01] as to be in the range [0.49,0.50] or [0.99,1.00]: this probability is in fact one hundredth in all cases. We call this a "flat distribution", or more correctly "uniform distribution".

    The second idea is that usually, when we get some data, we perform some test on it which returns a p-value: again a number between zero and one, which says how likely it is to observe data at least as extreme as ours if our model is true. By "model" I am referring to the underlying physical theory which we take as our "null hypothesis". It could be the Standard Model, and the data could be some reconstructed particle mass distribution which, in the presence of new physics, would depart from the model we fit through it.
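A quick toy simulation (my own illustration, not from the post) ties the two concepts together: when the null hypothesis is true, the p-value of a continuous test statistic is uniformly distributed. Here I assume a standard Gaussian test statistic and use its one-sided tail as the p-value.

```python
import math
import random

random.seed(1)

# Under the null hypothesis, the p-value of a continuous test statistic
# is uniform on [0,1]. Toy check: draw a standard Gaussian z and convert
# its one-sided tail probability into a p-value, then bin the results.
counts = [0] * 10
for _ in range(100_000):
    z = random.gauss(0.0, 1.0)
    p = 0.5 * math.erfc(z / math.sqrt(2.0))  # P(Z > z) for a standard normal
    counts[min(int(p * 10), 9)] += 1

# each of the ten bins should hold roughly 10,000 entries
print(counts)
```

With 100,000 trials each bin fluctuates by only about one percent around 10,000, so the flatness is evident at a glance.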

    P-values in HEP

    Now, what we generally do upon getting a p-value from some fit to some data is to compare it with a pre-defined threshold value: if it is below 0.00135 we call the observed effect a "three-sigma" one; if it is below 0.00000029 we call it a "five-sigma" one (the jargon is due to the correspondence of those values with tail integrals of a Gaussian distribution from 3-sigma to infinity, or from 5-sigma to infinity). It is clear from this that a three-sigma effect can sometimes occur in data that are indeed drawn from our model; a five-sigma effect, instead, is usually sufficient indication to claim that the model is false, and that we have e.g. discovered a new particle not included in it.
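The one-sided Gaussian tail integrals behind this jargon are easy to compute directly; a minimal Python sketch using the complementary error function (function name is mine):

```python
import math

def one_sided_p(n_sigma):
    """One-sided Gaussian tail integral from n_sigma to infinity."""
    return 0.5 * math.erfc(n_sigma / math.sqrt(2.0))

print(one_sided_p(3.0))  # ~1.35e-3, the "three-sigma" threshold
print(one_sided_p(5.0))  # ~2.9e-7, the "five-sigma" threshold
```

The same function inverted (e.g. with a root finder) lets you translate any observed p-value back into a number of sigmas.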

    What do we do, then, if instead of a single p-value our experiment is designed to produce several (as is the case if, e.g., we are testing not just one mass distribution but several independent kinematic distributions together)? We need a prescription to combine them, because the number of questions we may end up asking ourselves upon observing these p-values grows geometrically with their number!

    Fun with p-values

    Among the questions one might be tempted to ask, upon getting several p-values, are the following:
    - what is the probability that the smallest of them is as small as the one I got?
    - what is the probability that the largest one is as large as I observed it to be?
    - what is the probability that the product of the numbers is as small as this?

    Please note: for a given set of data, your inference strongly depends on which check you perform!

    In case you believe that each of your, say, N p-values tells you something about the truth of the null hypothesis you are testing, you may want to avoid concentrating on any subset of the set, and just use them all together. A reasonable (but not the optimal!) thing to do is then to check how small the product of the N numbers is. There is a very handy formula that provides the cumulative distribution of the product x = p1*p2*...*pN. It can be derived by induction (see e.g. Byron Roe, "Probability and Statistics in Experimental Physics", Springer-Verlag 1992, p.129) and is equal to

    P(x|N) = x * Sum_{j=0}^{N-1} (-ln x)^j / j!

    Above, N is the number of observations; j runs from 0 to N-1; and x is the product of the N observations. The exclamation mark is the symbol of the factorial [j! = j*(j-1)*...*2*1]. Note that the formula is blind to the individual values, as it should be: it only cares for how many they are and for their product.
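The combination formula is straightforward to put in code; a minimal Python sketch (the function name is mine):

```python
import math

def product_pvalue(x, n):
    """P(p1*p2*...*pn <= x) for n independent Uniform(0,1) p-values:
    x * sum_{j=0}^{n-1} (-ln x)^j / j!  (Roe 1992, p.129)."""
    t = -math.log(x)
    return x * sum(t**j / math.factorial(j) for j in range(n))
```

For instance, product_pvalue(0.00945, 5) returns about 0.50 and product_pvalue(3.75e-5, 5) about 0.026, matching the worked examples in this post.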

    Let us check this formula with a simple case. We take for a start five p-values fully compatible with a uniform distribution: p1=0.1, p2=0.3, p3=0.5, p4=0.7, p5=0.9 (the ordering is inessential here). The product of the five numbers above is 0.00945, and the formula above gives P(0.00945|N=5) = 0.5017 as the probability of obtaining a product at least as small as this. As expected for such unremarkable inputs, the combined probability does not point to anything abnormal.

    Now let us take instead p1=0.00001 and the other numbers as above. The product is 0.000000945, and P(0.000000945|N=5) is 0.00198. So there are about two chances in a thousand that the product is as small as that. Note that this is a rather large probability, given that one of the numbers is as small as one hundred-thousandth! One might (correctly) note that the chance of getting one of five uniform numbers as small as 10^-5 is only about 5x10^-5. But that is not what we tested!!! We are testing the product, not how small the smallest number is.

    Finally, let us take p1=0.05, p2=0.10, p3=0.15, p4=0.20, p5=0.25. The test of the product gives P(0.0000375|N=5)=0.0258. This is shown pictorially on the right: the lower graph shows the cumulative distribution of the product of five numbers, the x axis giving the product and the y axis giving the global p-value; the upper graph shows the probability density function of the product of five numbers distributed flatly between zero and one.

    Upon getting a 2.58% probability that the five numbers are as given above, you may be tempted to say that they do not provide strong evidence against the null hypothesis (which, I remind you, is that the numbers distribute uniformly between zero and one). You might want to compare this conclusion with the one you would reach had you asked "what is the chance that five numbers between zero and one are all smaller than 0.25?", which is a totally different question from the one on the product. The answer to the latter is P = (0.25)^5 = 0.00098, a much smaller probability! This is a simple demonstration that the a-posteriori choice of the test is to be avoided at all costs. Alas, a lesson which is still needed in HEP occasionally!
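The gap between the two questions can also be checked by brute force; a quick Monte Carlo sketch (the number of trials and the seed are my own choices):

```python
import random

random.seed(7)

trials = 200_000
n_small_product = 0  # sets whose product is <= 3.75e-5
n_all_below = 0      # sets whose five numbers are all <= 0.25

for _ in range(trials):
    ps = [random.random() for _ in range(5)]
    prod = ps[0] * ps[1] * ps[2] * ps[3] * ps[4]
    if prod <= 3.75e-5:
        n_small_product += 1
    if max(ps) <= 0.25:
        n_all_below += 1

print(n_small_product / trials)  # ~0.026: the product question
print(n_all_below / trials)      # ~0.001: the "all below 0.25" question
```

The same five numbers look twenty-five times rarer under the second question than under the first, which is exactly why the test must be chosen before looking at the data.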


    The title immediately made me think of this:

    Behold the mighty dinosaur,
    Famous in prehistoric lore,
    Not only for his power and strength
    But for his intellectual length.
    You will observe by these remains
    The creature had two sets of brains—
    One in his head (the usual place),
    The other at his spinal base.
    Thus he could reason 'A priori'
    As well as 'A posteriori'.
    No problem bothered him a bit
    He made both head and tail of it.
    So wise was he, so wise and solemn,
    Each thought filled just a spinal column.
    If one brain found the pressure strong
    It passed a few ideas along.
    If something slipped his forward mind
    'Twas rescued by the one behind.
    And if in error he was caught
    He had a saving afterthought.
    As he thought twice before he spoke
    He had no judgment to revoke.
    Thus he could think without congestion
    Upon both sides of every question.
    Oh, gaze upon this model beast
    Defunct ten million years at least.

            at the Carnegie Museum, 1934.

    This (on the right) may well have been the creature in question, the Apatosaurus with the skull of a Camarasaurus, the constructed and now sadly lost Brontosaurus.  It was widely spoken of as having a “bigger brain in its backside”. 

    At Reading, we had a number of PhD students with one supervisor in Physics and one in Chemistry. When the two supervisors disagreed over something, the student must have felt like the brain in the buttocks of a two-headed Brontosaurus.
    Robert H. Olley / Quondam Physics Department / University of Reading / England
    This is great Robert, and very funny. Thanks for sharing it!

    I should also add a quote from Fred James, from a 2008 PhyStat talk writeup:

    The Early History of Bayesian Ideas

    "As I am often accused of being an incorrigible frequentist, I thought I should do some more studying of the Bayesian methodology and in particular the early history and foundations of Bayesian thinking.

    I found to my great surprise that many of the ideas and even the terminology I thought originated in the 18th, 19th and 20th centuries can actually be traced back long before Reverend Thomas Bayes' famous paper on the Doctrine of Chances.

    The earliest traces I could find are from the 12th and 13th centuries. In those days in Europe, there were many ways in which one could lead a religious life. Probably the most devoted servants of the faith were the monks, who lived in monasteries (as their name implies), and the friars, who led humble lives in the outside world. The name of the latter group derives from the French frère, or brother.

    Although we tend to have a romantic view of monasteries now, by all accounts life in the monasteries was not very comfortable. Everything was in stone or hard wood, and meals were taken while seated on long wooden or even stone benches. Although quite uncomfortable, these benches were very important, for it was here that one encountered the highest posterior densities.

    The friars, on the other hand, were not concentrated in monasteries, so the friar density was much more spread out and uniform, as it should be. Not completely uniform, of course, because friars were believers, so the friar density reflected the degree of belief, or faith, for a given region.
    In the beginning, friars were supposed to have no possessions and live from begging alone, but this was not an entirely workable arrangement, so most of them eventually took on regular jobs. Those who worked in the library were known as reference friars, and those who did ironing were called flat friars.

    As it must happen with any social group, some friars, and indeed those most often encountered
    in public, were accused of improper behaviour. These improper friars soon became a source of scandal, starting with their provocative dress often characterized by considerable undercoverage. There was some discussion about how much coverage a friar should have, and it was decided that, at the very least, their posteriors should have adequate coverage.
    Finally, one case was reported of a friar with no coverage at all! This friar was arrested, but was
    later released for lack of evidence.

    There was a suggestion to organize the friars into groups, with leaders that would oversee the
    behaviour of the group members, but the idea of hierarchical friars was not well received. Finally, a physicist friar by the name of Jeffrey decided to form his own group of friars known as Jeffrey's friars, who would promise to behave themselves better. Most importantly, their behaviour was to be invariant.

    But even if it was invariant, the behaviour of some of Jeffrey's friars was still improper. Moreover,
    they were accused of being unprincipled, since they refused to obey an obscure religious dogma known as the Principle of Likelihood.

    Just at this time an additional problem arose with those friars who (like many monks) had taken
    vows of silence. These friars were called noninformative friars, and there was considerable discussion about just how noninformative they were. An influential author in Valencia even published a pamphlet with the provocative title "Noninformative Friars do not Exist!"

    Further problems arose as an unexpected group, the Multidimensional Friars, exhibited a new form of unacceptable behaviour: inconsistency. But by this time, the Reformation was in full swing in much of Europe, and in the confusion that followed, the trail of early Bayes history has been lost."
    Hi Tommaso,

    Great post! The method you describe is also equivalent to Fisher's method, where a test statistic is constructed from n (independent and uniformly distributed) p-values by multiplying them together, taking the logarithm, and then multiplying by -2. The resulting test statistic follows the chi2 distribution with 2n dof, which gives exactly the same probability as the formula above. But I'm sure you knew this. Of course, it gets much more difficult whenever the p-values are correlated.
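For even degrees of freedom the chi-square tail integral has a closed form, which makes it easy to sketch Fisher's method in a few lines of Python and see the equivalence (a toy sketch; function name is mine):

```python
import math

def fisher_combined(pvalues):
    """Fisher's method: -2 * sum(ln p_i) follows a chi-square distribution
    with 2n dof under the null. For even dof the survival function has the
    closed form Q(c; 2n) = exp(-c/2) * sum_{j=0}^{n-1} (c/2)^j / j!"""
    t = sum(-math.log(p) for p in pvalues)  # = chi2 statistic / 2
    n = len(pvalues)
    return math.exp(-t) * sum(t**j / math.factorial(j) for j in range(n))

# exp(-t) is just the product p1*...*pn, so this reduces to the
# product formula of the post term by term
print(fisher_combined([0.1, 0.3, 0.5, 0.7, 0.9]))  # ~0.50
```

Since exp(-t) equals the product of the p-values, the two prescriptions are algebraically the same expression, not merely numerically close.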

    Hello Kyle,

    thanks! Yes, I know about Fisher's method. Indeed I've bumped into the matter a few times over the last few years, so I did end up reading a couple of interesting sources on the subject.

    One is by Thomas Loughin, "A systematic comparison of methods for combining p-values from independent tests", Comp. Stat.&Data Anal. 47 (2004) 467;

    Another is by O. Davidov, "Combining p-values using order-based methods", Comp. Stat. & Data Anal. 55 (2011) 2433.

    I like also Bob Cousins' short review "Annotated Bibliography of Some Papers on Combining Significances or p-values", arxiv:0705.2209v2.

    Why don't CMS and ATLAS update the result of the Higgs gamma gamma decay channel in the SM Higgs boson search at 8 TeV?
    Da Liu

    Hello DL,

    I am unable to answer this question in a meaningful way. You should expect results
    at the winter conferences.