    Those Deceiving Error Bars
    By Tommaso Dorigo | December 22nd 2011 04:26 AM | 17 comments
    Have you ever looked at a histogram with the data displayed as counts per bin in the form of points with error bars, and wondered whether those fluctuations and departures from the underlying hypothesized model (usually superimposed as a continuous line or histogram) were really significant or could safely be ignored?

    The subject is one of the topics that eat up the most time in the discussions arising during talks at internal meetings of HEP experiments. Physicists in the audience are always happy to compare their eye-fitting abilities and to argue about whether there's a bump here or a mismodeling there. It is just as if we came with a built-in "goodness-of-fit" co-processor in the back of our mind, and that was connected with our mouth without passing through those other parts of our brain handling the "think first" commandment. We do lose a lot of time in such discussions.

    I am not overstating it: it happens day in and day out. For instance, it happened at a meeting I attended yesterday. We were approving a physics result and somebody started arguing that most of the data points were above the fit in a certain region of the money plot of the analysis. This turned out to be false: in the specific case, the questioner had forgotten to take into account bins where the data had fluctuated down to zero entries; the histogram had a logarithmic y scale, which made those points disappear below the lower edge of the plot. (The histogram on the right is taken from the paper described here and is a hypothetical one, so it has nothing to do with the episode I am mentioning!)

    Besides the issue of deceiving zero-entry bins, there are several other reasons why one should be careful with such eyeballing comparisons, but by far the most important one is that the data, when they consist of event counts per bin, are universally shown as points with error bars, and by default the error bars are drawn symmetrically above and below the observed count, extending from N - sqrt(N) to N + sqrt(N) if N is the bin content. In other words, the default exploits the fact that an event count, being a random variable drawn from a Poisson distribution, has a variance equal to its mean.

    Here I should explain to the least knowledgeable readers what a Poisson distribution is. Any statistics textbook explains that the Poisson is a discrete distribution describing the probability of observing N counts when an average of m is expected. Its formula is P(N|m) = [exp(-m) * m^N] / N!, where ! is the symbol for the factorial, such that N! = N*(N-1)*(N-2)*...*1, and P(N|m) should be read as "the probability that I observe N given an expectation value of m".
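
    For readers who prefer code to formulae, here is a minimal Python sketch of that same formula (the function name poisson_prob is mine, chosen just for illustration):

        from math import exp, factorial

        def poisson_prob(N, m):
            # P(N|m) = exp(-m) * m^N / N! : probability of observing N counts when m are expected
            return exp(-m) * m**N / factorial(N)

        print(poisson_prob(16, 16))   # probability of seeing exactly 16 counts when 16 are expected, about 0.10
        print(poisson_prob(9, 16))    # probability of seeing exactly 9 counts when 16 are expected, about 0.02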

    So what's the problem with Poisson error bars? The problem is that those error bars do not represent exactly what we would want them to. A "plus-or-minus-one-sigma" error bar should "cover" the true value of the unknown quantity we have measured 68% of the time, on average: 68% is the fraction of the area of a Gaussian distribution contained within -1 and +1 sigma. For Gaussian-distributed random variables a 1-sigma error bar is always a sound idea, but for Poisson-distributed data it typically is not. What's worse, we do not know the true variance: we only have an estimate of it (N), while the variance is actually equal to the true expectation value (m). In some cases this makes a huge difference.

    Take a bin where you observe 9 event counts while the true value was 16: the standard deviation is sqrt(16)=4, so the appropriate error bar on that data point would be +-4. But you do not know the true value, so you estimate it as best you can, with N=9, whose square root is 3. You thus proceed to plot a point at 9 and draw an error bar of +-3. Upon visually comparing your data point (9+-3) with the expectation from the true model, drawn as a continuous histogram and having value 16 in that bin, you are led to believe you are observing a significant negative departure of event counts from the model, since 9+-3 is over two "sigma" away from 16; yet 9 is less than two sigma away from 16+-4. So the practice of plotting +-sqrt(N) error bars deceives the eye of the user.
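
    To put numbers on this (a small sketch; scipy is used only for convenience), one can compare the two "pulls" with the exact Poisson probability of fluctuating that low:

        from scipy.stats import poisson

        N_obs, m_true = 9, 16

        # pull computed with the estimated error sqrt(N) versus with the true one sqrt(m)
        pull_estimated = (m_true - N_obs) / N_obs**0.5    # (16-9)/3, over 2.3 "sigma"
        pull_true      = (m_true - N_obs) / m_true**0.5   # (16-9)/4, 1.75 sigma

        # exact probability of observing 9 or fewer counts when 16 are expected
        p_low = poisson.cdf(N_obs, m_true)                # about 0.04: less striking than the naive 2.3-sigma pull suggests

        print(pull_estimated, pull_true, p_low)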

    Worse still, the Poisson distribution is not really symmetric for small expected counts (m<50 or so). This causes the +-sqrt(N) bar to misrepresent the situation very badly for small N. Let us see this with an example.

    Imagine you observe N=1 event count in a bin, and you want to draw two models on top of that observation: one model predicts m=0.01 events there, the other predicts m=1.99. Now, regardless of whether m=0.01 or m=1.99 is the expectation of the event counts, if you see 1 event you are going to draw an error bar extending from 0 (i.e., 1-sqrt(1)) to 2 (1+sqrt(1)), thus apparently "covering" the expectation value in both cases. But while for m=1.99 the probability to observe 1 event is very high (and thus the error bar around your N=1 data point should indeed cover 1.99), for m=0.01 the probability to observe 1 event is very small: P(1|0.01) = exp(-0.01)*0.01^1/1! = 0.01*exp(-0.01) = 0.0099. N=1 should definitely not belong to a 1-sigma interval if the expectation is 0.01, since almost all the probability is concentrated at N=0 in that case (P(0|0.01)=0.99)!
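
    Those probabilities are quick to verify numerically (again, just a minimal Python check of the numbers quoted above):

        from math import exp

        p1_small = exp(-0.01) * 0.01**1   # P(1|0.01), about 0.0099 (the 1! in the denominator is 1)
        p0_small = exp(-0.01)             # P(0|0.01), about 0.99
        p1_large = exp(-1.99) * 1.99**1   # P(1|1.99), about 0.27, close to the most probable count

        print(p1_small, p0_small, p1_large)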

    The solution, of course, is to try and draw error bars that correspond more precisely to the 68% coverage they are meant to convey. But how can we do that? We simply cannot: as I explained above, we observe N, but we do not know m, so we do not know the variance. Rather, we should realize that the problem is ill-posed. If we observe N, that measurement has NO uncertainty: that is what we saw, with 100% probability. Instead, we should apply a paradigm shift, and insist that the uncertainty be drawn around the model curve we want to compare our data points to, and not around the data points!

    If our model predicts m=16, should we then draw an uncertainty bar, or some kind of shading, around that histogram value, extending from 16-sqrt(16) to 16+sqrt(16), i.e. from 12 to 20? That would be almost okay, were it not for the asymmetric nature of the Poisson. Instead, we need to work out some prescription which counts the probability of the different event counts for any given m (where m, the expectation value of the event counts, need not be an integer!), finding an interval around m which contains 68% of it.

    Sound prescriptions do exist. One is the "central interval": we start from the largest integer not exceeding m, and proceed to move right and left, at each step adding whichever of the two adjacent counts (the one just above or the one just below the range collected so far) has the larger probability given m. We continue to sum until we exceed 68%: this gives us a continuous range of integer values which includes m and "covers" precisely as it should, given the Poisson nature of the distribution.
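
    As a rough Python sketch of that prescription (the function name and the use of scipy are my own choices, purely for illustration):

        from scipy.stats import poisson

        def central_interval(m, coverage=0.68):
            # Start from the largest integer not exceeding m, then repeatedly include whichever
            # neighbouring count (just above or just below the range collected so far) has the
            # larger Poisson probability, until the requested coverage is exceeded.
            lo = hi = int(m)
            total = poisson.pmf(lo, m)
            while total < coverage:
                p_up   = poisson.pmf(hi + 1, m)
                p_down = poisson.pmf(lo - 1, m) if lo > 0 else 0.0
                if p_up >= p_down:
                    hi += 1
                    total += p_up
                else:
                    lo -= 1
                    total += p_down
            return lo, hi, total

        print(central_interval(16))   # for m=16 this yields roughly the interval [12, 19], covering about 68-69%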

    Another prescription is that of finding the "smallest interval" which contains 68% or more of the total area of the Poisson distribution for a given m. But I do not need to go into that kind of detail here. It suffices to point the interested reader to a preprint recently produced by R. Aggarwal and A. Caldwell, titled "Error Bars for Distributions of Numbers of Events". The paper also deals with the more complicated issue of how to include the systematics on the model prediction in the display of model uncertainty, finding a Bayesian solution to the problem which I consider overkill for the problem at hand. I would be very happy, however, if particle physics experiments turned away from the sqrt(N) error bars and adopted the method of plotting box uncertainties in different colours, as advocated in the cited paper. You would get something like what is shown in the figure on the right.

    Note how the data points can now be classified immediately, and more soundly, according to how much they depart from the model, which is no longer a line but a band conveying the extra dimensionality of the problem (the model's probability density function, colour-coded with green for 68% coverage, yellow for 95% coverage, and red for 99% coverage). I would be willing to do away with the red shading: 68% and 95% coverages would suffice, and are more in line with the current style of showing expected values adopted in many recent search results (the so-called "Brazil bands").

    Despite the soundness of the approach advocated in the paper, though, I am willing to bet that it will be very hard, if not impossible, to convince the HEP community to stop plotting Poisson error bars as sqrt(N) and start using central intervals around model expectations. An attempt to do so was made in BaBar, but it resulted in failure: everybody continued to use the standard sqrt(N) prescription. There is a definite amount of inertia in the behaviour of the average HEP experimentalist, I gather!

    Comments

    The OPERA results for superluminal neutrinos as presented in their graphs would certainly look different if they followed the approach you advocate above.

    rholley
    Mention the Poisson distribution and I think of what I read when I was studying it years ago, decades before the appearance of the Wikipedia article from which this extract comes:
    The distribution was first introduced by Siméon Denis Poisson (1781–1840) and published, together with his probability theory, in 1837  ...

    The first practical application of this distribution was done by Ladislaus Bortkiewicz in 1898 when he was given the task to investigate the number of soldiers of the Prussian army killed accidentally by horse kick; this experiment introduced Poisson distribution to the field of reliability engineering.
    Over sixty years to find an application!

    P.S. I have just noticed how ohwilleke has mentioned
    a Poisson distribution or other discrete probability distribution that does well with low numbers
    One of Bortkiewicz’s publications was Das Gesetz der kleinen Zahlen (The Law of Small Numbers), of special significance to Roulette players.
    Robert H. Olley / Quondam Physics Department / University of Reading / England
    Dear Tommaso, all your warnings are valid and I have experienced most of them in practice. But it's still true that they're only important if the statistics is poor and so is the confidence level.

    When you have enough data, "0" in a bin doesn't occur; the shape of the Poisson distribution reduces to the Gaussian; sqrt(N) and sqrt(N-sqrt(N)) become indistinguishable as error bars, and so on.

    There are situations in which the non-Gaussianity and other things are important but they're pretty rare. That's why e.g. for the Higgs Brazil graphs, Gibbs-like Gaussian treatment of all the error bars is just fine and I actually prefer it over some excessively complicated formulae that elevate the risk of errors.

    Good point. In almost every situation where the distinction matters, the data don't have the statistical power to make a strong prediction anyway.

    dorigo
    Hi Lubos,

    the problem is that no matter how much data you have, in HEP you will always end up looking in the tails, where you have few events! That is what happens with new particle searches: we easily exclude resonances in the part of a mass spectrum where there is a lot of data and the new particle would produce a significant, visible contribution; but we are usually turned on by the small fluke occurring right in the tail, where statistical inference is hardest to make, both because of the issues I have discussed above and because the tails are the least well known part of the underlying background model.

    So that is why the arguments in the paper I discussed are indeed relevant in HEP.

    Cheers,
    T.
    FWIW, I am far more unhappy when I see physics papers that use a continuous normal distribution to model a discrete quantity that is ideal for a Poisson distribution or other discrete probability distribution that does well with low numbers (e.g. studies of how many kinds of neutrinos or strong force colors there are, how many jets a collision has produced, how many quarks are in an unidentified hadron, or what the charge of a quantum-physics-scale entity is) than I am when error bars are off, simply because error bars are routinely poor estimates of error that hindsight reveals to be almost always too small in published work due to publication bias anyway. It shows me an unwillingness of the researcher to think seriously about what is going on in a statistical model in the context of the experiment in question.

    dorigo
    I might agree with what you say, ohwilleke, but the point of my post is that the typical judgement of the physicist sitting in the back row at a physics analysis meeting usually drags the whole audience into pointless discussions, because error bars do deceive the eye. So a different style of plotting them would indeed ease matters in that department.

    Cheers,
    T.
    rholley
    “Those Deceiving Error Bars!”
     
    Imagine those words as the translation of an Italian madrigal, set to a hitherto undiscovered work of Francesco Petrarca.
     
     
     
    Robert H. Olley / Quondam Physics Department / University of Reading / England
    logicman
    It is just as if we came with a built-in "goodness-of-fit" co-processor in the back of our mind, and that was connected with our mouth without passing through those other parts of our brain handling the "think first" commandment.

    A brilliant use of the English language!
    dorigo
    Patrick, thank you so much for this. I love English and I do not always treat it the way I should, so I am very sensitive to similar words of appreciation!

    Cheers and merry xmas
    T.
    rholley
    ’Tis poetry, that’s what it is!  (Or, to be more precise, Poetic Diction.)

    But more than that, I think that such a system may indeed develop in the brains of those much accustomed to gazing on graphs – especially those presented at conferences using PowerPoint.

    Meanwhile, for yourself, Patrick, just now on Meridian News they were recounting what happened in Lewes 175 years ago today:
    On 27 December 1836, an avalanche occurred in Lewes, the worst ever recorded in Britain. A large build-up of snow on the nearby cliff slipped down onto a row of cottages called Boulters Row (now part of South Street). About fifteen people were buried, and eight of these died. A pub in South Street is named The Snowdrop in memory of the event.
    From the map, it is not the sort of place one associates with avalanches.


    Robert H. Olley / Quondam Physics Department / University of Reading / England
    As an old timer, I cannot approve this proposal, mainly because it inherently assumes that a model is correct and should take precedence over data.

    Brazil plots are the ultimate magician's trick, imo, introducing unprecedented hidden levels of modeling uncertainty. At least simple square-root error bars are not obfuscating: they are what they are, and they tell us what we might find if we repeated the experiment and looked at the data the prescribed number of times, purely from the statistics, irrespective of underlying models, imo. OK, when the numbers are small one can qualify the statistics, but introducing models into the error bars is like wagging the dog by the tail.

    If you look at the plot in your blog http://www.science20.com/quantum_diaries_survivor/cms_bosons , it is evident that there is no need for a model to gauge the importance of the signal from the data.

    dorigo
    Hi Anna,

    thanks for your opinion, which I understand but in part disagree with. The purpose of those coloured shadings, as described in the first part of the preprint I linked, is only to give the viewer a more visual description of the possible range of statistical uncertainty of the model prediction, if the model is correct. When we compare data to a model we know the model could be wrong, and we certainly cannot do a goodness-of-fit by eye that folds that uncertainty into our assessment. We usually just stop at asking ourselves whether the data and the model appear compatible as they are drawn. These shadings can address the issue, making it a bit easier for us.
    Also, I insist that putting a Gaussian sqrt(N) bar on an observation is very unprincipled: we observed that datum, so there is no uncertainty.

    Cheers,
    T.

    What do you think of the 'Pearson chisq' motivated error bars discussed for CDF here:

    http://www-cdf.fnal.gov/physics/statistics/notes/pois_eb.txt

    'upper error = 0.5 + sqrt(n+0.25), lower error = -0.5 + sqrt(n+0.25)' ?

    It's somewhat of a hybrid between the default square root and a more thorough interval evaluation: plot the interval of true values mu for which the observed number would be within sqrt(mu) of the true value.
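
    For concreteness, here is a small sketch of that construction (the interval is just the pair of roots of (n - mu)^2 = mu; the Python function name is mine, for illustration only):

        from math import sqrt

        def pearson_interval(n):
            # True means mu for which the observed n lies within sqrt(mu) of mu,
            # i.e. the two solutions of (n - mu)**2 = mu.
            half_width = sqrt(n + 0.25)
            return n + 0.5 - half_width, n + 0.5 + half_width   # errors: -0.5 + sqrt(n+0.25), +0.5 + sqrt(n+0.25)

        print(pearson_interval(1))   # roughly (0.38, 2.62)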

    There is a long discussion of different possible strategies and their 'coverage' here : http://www-cdf.fnal.gov/publications/cdf6438_coverage.pdf

    Of course, if one has a theoretical curve, then the task of evaluating whether the observations agree with it can well be expressed in terms of fluctuations around the theory. But the frequentist's first task is to produce meaningful error bars that are right the right fraction of the time...

    As to "... forgetting to take into account bins where the data had fluctuated down to zero entries ...": would that indicate that you should ignore "... three ... ATLAS ... ZZ candidates ... at 126 GeV [that] is the source of the significance that ATLAS quotes at that mass ...", in light of the fact that Figure 4(a) of ATLAS-CONF-2011-162.pdf shows those 3 events all sitting in a single bin (120-125 GeV) of their ZZ-4l histogram for 4.8/fb, with the two adjacent bins (115-120 and 125-130) being empty, and with the 3-event total being consistent with background for the region 115-130 GeV?

    Tony

    Actually I was looking at Figure 4(b) with respect to my previous comment.

    Tony