    Need Statistics Lessons For HEP Data Analysis? Look Here.
    By Tommaso Dorigo | January 31st 2012 09:08 AM
    Last week I spent a few interesting days in the pleasant winter setting of Engelberg, a mountain resort in the Swiss Alps. There I attended the CHIPP 2012 winter school, an event organized by Vincenzo Chiochia from the University of Zurich and Gabriella Pazstor from the University of Geneva. They invited me to give a three-hour mini-course in Statistics for data analysis in High-Energy Physics, which was a new experience for me. It took me the better part of the last couple of months to prepare, but I was glad I did. In the end, the material I put together could have filled five or six hours of lectures profitably, but by skipping some of the topics I got to the end without rushing absurdly.

    Now, the purpose of this article is threefold. First of all, I wish to publicize the slides (which you can freely access in three separate files: part 1, part 2, part 3), in the hope that some of you will actually read them and report any mistakes or inconsistencies; this is in the interest of improving the material for the next time I get to use it. The slides should be beneficial to anybody with a PhD in particle physics who does not consider herself an expert in statistics, as well as to undergraduates and graduate students who have an interest in not remaining ignorant in these matters forever.

    The second reason for mentioning my Statistics course here is shameless self-promotion. I had fun giving the course, and I would be happy to be given another chance: to improve, one has to practice! So if you are a graduate student or a young postdoc and you think it would be nice if your university or institute featured such a set of lectures, consider suggesting to your advisor that they be organized! I think there is too little awareness of the subtleties of statistical practice in particle physics data analysis, and I would be happy to contribute to raising the level of education on the subject.

    Third, I want to give an example of the pitfalls you can avoid if you know statistics, by pasting here the result of a study of two test statistics used to fit a data histogram: Pearson's chisquare and Neyman's chisquare. Neyman's chisquare is the sum of the squared differences between the measured values and the fit function, each divided by the individual measured variance, while Pearson's chisquare substitutes the denominator with the best-fit variance. Note that both test statistics therefore use an estimate of the variance of the true value to "normalize" the observed squared deviations between observations and model.
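    In formulas, with n_i the observed content of bin i and f_i the value the fit function takes there (for Poisson-distributed bin counts the measured variance is estimated by n_i itself):

    $$ \chi^2_{\rm Neyman} = \sum_i \frac{(n_i - f_i)^2}{n_i}, \qquad \chi^2_{\rm Pearson} = \sum_i \frac{(n_i - f_i)^2}{f_i}. $$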

    The fact that the variances in the denominator are estimates rather than the true values causes the minimization of the resulting chisquares to produce different results. Let us take the simple case of the fit to a constant of a 100-bin histogram, where each bin contains Poisson-distributed event counts. While the maximum likelihood estimate of the constant is just the sum of the bin entries divided by the number of bins (the arithmetic average!), the Pearson and Neyman chisquares, when minimized, produce biased estimates of the constant.
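    To see the effect numerically before looking at the figure, here is a minimal toy sketch (in Python with NumPy/SciPy; the number of trials and the test values of the mean are my own choices, not those behind the figure below):

    ```python
    # Toy check (assumed setup, not the code behind the figure): fit a constant
    # to a 100-bin histogram of Poisson counts by numerically minimizing the
    # Neyman and Pearson chisquares, and compare with the ML estimate
    # (the arithmetic mean of the bin contents).
    import numpy as np
    from scipy.optimize import minimize_scalar

    rng = np.random.default_rng(0)
    nbins, ntrials = 100, 2000

    def fitmin(chisq, hi):
        # One-dimensional bounded minimization in the fitted constant k
        return minimize_scalar(chisq, bounds=(1e-3, hi), method="bounded").x

    for mu in (2.0, 5.0, 10.0, 20.0):
        est = {"ML": [], "Neyman": [], "Pearson": []}
        for _ in range(ntrials):
            n = rng.poisson(mu, nbins).astype(float)
            nz = n[n > 0]  # Neyman's 1/n weights are undefined for empty bins
            est["ML"].append(n.mean())
            est["Neyman"].append(fitmin(lambda k: np.sum((nz - k) ** 2 / nz), 5 * mu))
            est["Pearson"].append(fitmin(lambda k: np.sum((n - k) ** 2 / k), 5 * mu))
        print(f"mu = {mu:4.1f}:  " + "   ".join(
            f"{name} bias {np.mean(v) / mu - 1:+.4f}" for name, v in est.items()))
    ```

    Neyman's fit comes out low and Pearson's high, with both biases shrinking as the mean grows.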

    The figure on the right shows the amount by which the two chisquare fits (Pearson's, in blue, and Neyman's, in red) under- or overestimate the true value of the constant (in black, which coincides with the maximum likelihood estimate). On the horizontal axis is the true value of the Poisson mean from which the 100 bins were populated; on the vertical axis is the fractional bias of the fit result.

    To understand why the two test statistics are biased (note, though, that they are asymptotically unbiased in the large-sample limit) we need to put ourselves in the shoes of the minimization routine.

    What we are doing when we fit a constant through a set of 100 bin contents is to extract the common, unknown, true value μ from which the entries were generated, by combining the 100 measurements, each of which is a Poisson estimate of this true value. Each should have the same weight in the combination, because each is drawn from a Poisson of mean μ, whose true variance is also equal to μ (a property of the Poisson distribution!). But, having no μ to start with, we must use estimates of the variance as (inverse) weights. So Neyman's chisquare gives the different observations different weights 1/N, where N is the bin content. Since negative fluctuations (N < μ) get larger weights, the result is biased downward!
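    In fact, for the constant fit Neyman's minimum can be written in closed form (a short derivation; here \hat{\mu} is the fitted constant, n_i the content of bin i, and N the number of bins):

    $$ \frac{\partial}{\partial\hat{\mu}} \sum_{i=1}^{N} \frac{(n_i-\hat{\mu})^2}{n_i} = -2 \sum_{i=1}^{N} \frac{n_i-\hat{\mu}}{n_i} = 0 \quad\Longrightarrow\quad \hat{\mu}_{\rm Neyman} = \frac{N}{\sum_{i=1}^{N} 1/n_i}, $$

    the harmonic mean of the bin contents, which can never exceed their arithmetic mean.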

    What Pearson's chisquare does is different: it uses a common weight for all measurements, but this weight is of course also an estimate of the true variance V = μ: the denominator of Pearson's chisquare is the fit result for the average, call it μ*. Since we minimize that chisquare to find μ*, larger denominators are preferred, and we get a positive bias: μ* > μ!
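    The same exercise for Pearson's chisquare (same notation as above) gives

    $$ \sum_{i=1}^{N} \frac{(n_i-\mu^*)^2}{\mu^*} = \frac{\sum_i n_i^2}{\mu^*} - 2\sum_i n_i + N\mu^* \quad\Longrightarrow\quad \frac{d\chi^2}{d\mu^*} = 0 \;\;\Leftrightarrow\;\; \mu^* = \sqrt{\frac{\sum_i n_i^2}{N}}, $$

    the quadratic mean of the bin contents, which can never fall below their arithmetic mean.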

    The bottom line is the following: whenever possible, and especially when the statistics are poor, use a likelihood to fit your data points!

    Comments

    I will read your slides with great interest!

    Tiny correction to the above article:
    "Neyman, in black" should be "Neyman, in red"

    Hi Tommaso,
    I'm really enjoying your slides, but they have reminded me of a seemingly simple statistical point that I have seen stated in a number of places (including the Review of Particle Physics, if I'm not mistaken) but have never fully understood. It would be great if you could shed some light on it.

    It concerns chi-squared distributions and should apply to many scenarios; however, I'm thinking of it in terms of the chi-squared values of fitted particle tracks in a typical HEP charged-track reconstruction scheme.

    Let's say you have a sample of 1000 fitted tracks; each has a chi-squared value and a number of degrees of freedom, from which you can calculate the chi-squared probability.

    Basically, the point is that for a correct hypothesis (that the hits in a fit were created by a single charged particle moving in a magnetic field) with correctly estimated hit errors, the chi-squared probability distribution should be flat. I don't understand a priori why it should be flat, although I do see how incorrect errors can cause slopes in the distribution and incorrect hypotheses can cause peaks at low probabilities.

    Any clues or even a point in the right direction is appreciated!



    dorigo
    Hi Se',

    for any normalized distribution, you can define the probability that an observed value x belongs to it as the integral of the distribution from that x to infinity. This is just one way of doing it, of course.
    Once you do that, if you sample values from the distribution (i.e., use their relative frequency according to that distribution as the generating PDF), the integrals from x to infinity will by construction be distributed uniformly on [0,1].
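    A minimal numerical illustration (a Python sketch; the number of degrees of freedom and the sample size are assumptions for the example):

    ```python
    # Sample chi-squared values under a correct hypothesis and check that their
    # upper-tail probabilities ("chi-squared probabilities") are uniform on [0,1].
    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(1)
    ndf = 10                                 # assumed degrees of freedom per fit
    chi2 = rng.chisquare(ndf, size=100_000)  # values drawn from the correct PDF
    p = stats.chi2.sf(chi2, ndf)             # integral from each chi2 to infinity

    # Each of the ten probability bins should hold ~10% of the entries,
    # up to Poisson fluctuations.
    counts, _ = np.histogram(p, bins=10, range=(0.0, 1.0))
    print(counts)
    ```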

    Cheers,
    T.
    Can you annotate the references at the end of the slides? Which one is best as a reference, which for self-teaching, which one covers what best, and so on...

    dorigo
    Hi Rdm,

    the references are there only because the corresponding texts are cited in the slides. What you propose is a significant amount of work, which I may take on if I give the course again... Not now, unfortunately!
    A clue, however, is to look at where each reference is cited in the slides.

    Best,
    T.
    If you are going to write about the latest Higgs search papers, could you please address why CMS has chosen to reduce the LEE by restricting the search to the 110-145 GeV range only? What's special about 145?

    dorigo
    Hi Brett,
    I think 145 GeV is the upper limit of the previous search. In other words, CMS could claim they consider the restricted 115-145 GeV range for an observation because they had excluded the rest of the spectrum earlier.

    Cheers,
    T.
    Hi Tommaso,

    indeed, brilliant slides, with plenty of interesting and uncommon examples. Since you've asked for comments on the slides themselves :), I have a couple of minor ones:
    - slide numbers are missing, whereas you refer to a particular slide at least once;
    - the slides are very good for offline reading, but might look overloaded with text during a presentation;
    - in general it took me ~3-4 hours to go through most of the material thoroughly and to digest some of your ideas. It might be too much to fit into 3 hours.

    Cheers,
    Misha.

    Hi Tommaso,

    I've discovered the F-test from your slides, and it looks like a very useful tool. Still, I have a couple of questions to make sure that my understanding is correct:

    - I had to spend some time clarifying what nu1 and nu2 mean in terms of n, p1 and p2 (bottom right of slide 16, part 3), and what p1 and p2 themselves mean. Definitions on the slide would help.
    - I'm also puzzled by the numbers in the example you provide on this issue (slides 17-18, part 3). Calculating F myself (using the chi2 numbers read from the plots on slide 17), I got:
    ((10.775-4.164)/(2.-1.)) / (4.164/(10.-2.)) = 12.70
    ((4.164-2.637)/(3.-2.)) / (2.637/(10.-3.)) = 4.053
    These numbers disagree with the plots on slide 18, as well as with the p-value for the F distribution (calculated as 1.-TMath::FDistI(F,p2-p1,n-p2) in ROOT 5.26). This probably means that I do not fully understand how F is calculated from the input set of parameters. Could you clarify where the error comes from?
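    (In code, the computation above reads as follows; this is a SciPy sketch of the same arithmetic, with scipy.stats.f.sf playing the role of 1.-TMath::FDistI:)

    ```python
    # F-test between two nested fits: chi2_p1 from the fit with p1 parameters,
    # chi2_p2 from the larger model with p2 > p1 parameters, on n data points.
    from scipy import stats

    def ftest(chi2_p1, p1, chi2_p2, p2, n):
        F = ((chi2_p1 - chi2_p2) / (p2 - p1)) / (chi2_p2 / (n - p2))
        return F, stats.f.sf(F, p2 - p1, n - p2)  # (F statistic, p-value)

    print(ftest(10.775, 1, 4.164, 2, 10))  # F = 12.70
    print(ftest(4.164, 2, 2.637, 3, 10))   # F = 4.053
    ```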

    Best regards,
    Misha

    dorigo
    Hi Misha,

    I have no way of checking those results at the moment, but I do not think your calculation is wrong. Maybe the fitter needs to be initialized in the same way to reproduce the same results. See the F-test definition on Wikipedia (under the "regression" section).

    Cheers,
    T.