The Puzzling Weighted Average
    By Tommaso Dorigo | November 21st 2011
    For you today, here is a test of whether you should trust your intuition when confronted with an apparently simple problem. Incidentally, this article is the answer to the "Guess The Plot" riddle I posted a few days ago here.

    Suppose you are given two measurements of the same physical quantity. Make it something easy to visualize, such as the length of a stick. They tell you that when measured with method 1 the result was x1=10 cm, with an estimated uncertainty s1=0.1 cm, and when measured with method 2 the result was x2=11 cm, with an estimated uncertainty s2=0.5 cm. Here is a question for you today: What is your best guess of the length of the stick?

    To make things interesting, let me give you a few possible answers from which to pick. We will discuss later which one is correct, and this will allow us to shed some light on the whole matter. Unless your knowledge of basic Statistics is above that of an average graduate student in Physics, you will be surprised, I guarantee it.

    1. Anything between 10 and 11 cm; the exact calculation is obviously trivial and I can't be bothered with such details.
    2. 10 cm, the result with the best stated accuracy
    3. 10.5 cm, the linear average of the two measurements
    4. The best estimate is L*=[10/(0.1^2)+11/(0.5^2)]/[1/(0.1^2)+1/(0.5^2)] cm, which is of course the weighted mean of the two measurements, with weights equal to the inverse variances.
    5. Not possible to answer, unless some more information is provided about the stated accuracies.
    6. Something between 10 and 11 cm, whose exact value depends on details which were not disclosed in the statement of the problem.

    Which answer would you pick? Let me guess that you (most of you, I mean) are oscillating between (4) and (6). That is, you have been taught that the weighted average is the correct way to go, but you suspect that there would be no point in this article if that were the correct answer. So you might actually end up choosing (6), which appears to be the most fool-proof answer.

    However, the correct answer is (5). You cannot make the correct average of the two measurements unless you know the amount of correlation between the two uncertainties - in layman's terms, how much the two measurement errors depend on one another. And here is the "surprise" bit: you are not even guaranteed that the best estimate of the true stick length lies between the two measurements!
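For reference, here is what answer (4), the familiar uncorrelated inverse-variance weighted mean, would give for the numbers in the problem. This Python sketch is mine, not part of the original post; it merely evaluates the formula from point (4):

```python
# Answer (4) in code: the familiar inverse-variance weighted mean,
# valid only when the two measurement errors are uncorrelated.
x1, s1 = 10.0, 0.1   # method 1: 10 cm +- 0.1 cm
x2, s2 = 11.0, 0.5   # method 2: 11 cm +- 0.5 cm

w1, w2 = 1 / s1**2, 1 / s2**2            # weights = inverse variances
L_star = (w1 * x1 + w2 * x2) / (w1 + w2)
sigma = (w1 + w2) ** -0.5                # uncertainty of the combination

print(f"L* = {L_star:.3f} +- {sigma:.3f} cm")   # L* = 10.038 +- 0.098 cm
```

Note how the more precise measurement dominates: the combination sits much closer to 10 than to 11. Whether this is actually the best estimate is the subject of the rest of the post.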

    To see this, we should fiddle with the math behind the method of least squares - or, equivalently, with a maximum-likelihood formulation of the problem. I am going to explain the bare bones of the method in the section below, but if you loathe math or are not in the mood to follow complicated expressions, you are advised to skip it completely and go to the following section, where you will be convinced that the true answer is (5), and get up close and personal with even more startling results, by means of a purely geometrical construction. Further down you will also find a practical proof that even a fifth grader could understand. In other words, this article is built upside-down: from hard to easy! I hope it is not too confusing...

    The method of least squares

    In the case we are considering, the method of least squares allows you to find the best estimate L* of the true value L (the length of the stick) when you have a set of n measurements y_i of it, which are assumed to be distributed according to an n-dimensional Gaussian (if you do not know what an n-dimensional Gaussian is, do not worry: we will only deal with the two-dimensional case here, and in two dimensions you can easily visualize a Gaussian as the hump shown in the graph on the right). The method also assumes that you know the covariance matrix of the n measurements, V_ij (the covariance matrix contains the squares of the errors of the measurements - the individual variances - as diagonal terms, and terms describing how the measurements depend on one another in the off-diagonal positions; we will get to that later). The expression

        \chi^2(L) = \sum_{i,j=1}^{n} (y_i - L)\,(V^{-1})_{ij}\,(y_j - L)

    is then "asymptotically distributed like a chisquared". What the jargon between quotes means is that for large n the sum approximates a functional form called the chisquare distribution: ignore the specification, it is of no use to us today. What is important is that the minimum of the expression is reached at a value L* which is called the least-squares estimator of L: it is an estimate of the true value of the stick length which has several good properties (which, again, I won't discuss here).

    Let us write down the general expression of V in the case of two measurements y1, y2:

        V = \begin{pmatrix} \sigma_1^2 & \rho\,\sigma_1\sigma_2 \\ \rho\,\sigma_1\sigma_2 & \sigma_2^2 \end{pmatrix}

    The letter rho in the expression indicates the correlation coefficient, and the sigmas are the standard deviations - their squares are called variances. If you find the inverse of V (i.e., the matrix which, multiplied by V, gives the identity), plug it into the expression of the chisquared above, and find the point where the derivative of the chisquared with respect to L vanishes, you will obtain the value of L*:

        L^* = (1-w)\,y_1 + w\,y_2\,, \qquad w = \frac{\sigma_1^2 - \rho\,\sigma_1\sigma_2}{\sigma_1^2 - 2\rho\,\sigma_1\sigma_2 + \sigma_2^2}
    This is still called a weighted average, although it is more complicated than the expression you are probably familiar with (the one at point (3) in the list of answers above). The weights are w and 1-w. The (inverse) variance of this weighted average can be written as

        \frac{1}{\sigma_{L^*}^2} = \frac{1}{1-\rho^2}\left(\frac{1}{\sigma_1^2} + \frac{1}{\sigma_2^2} - \frac{2\rho}{\sigma_1\sigma_2}\right)

    and from this expression we see that once we take a measurement y1 of variance v1, any second measurement of the same quantity will reduce the variance of the average (i.e., our knowledge of the unknown stick length will improve by taking the average of the two measurements) unless the correlation coefficient rho equals the ratio sigma_1/sigma_2.

    But what happens if rho > sigma_1/sigma_2? In that case the weight w becomes negative, and the weighted average L* written above goes outside the "psychological" bound [y1,y2]. The reason for this behaviour is that with a large positive correlation the two results y1, y2 are likely to lie on the same side of the true value! On which side? The one where the measurement with the smallest variance lies!
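Here is the effect in a short Python sketch (the function and variable names are mine; the formulas are the correlated weighted average and its variance discussed above). With the measurements from the beginning of the post and rho = 0.5, which exceeds sigma_1/sigma_2 = 0.2, the weight of the less precise result turns negative and the combination falls below both measurements, while its uncertainty still improves on the better one:

```python
# The least-squares average of two correlated measurements.
# Function and variable names are illustrative choices.
def correlated_average(y1, s1, y2, s2, rho):
    d = s1**2 - 2 * rho * s1 * s2 + s2**2
    w = (s1**2 - rho * s1 * s2) / d          # weight of y2
    L = (1 - w) * y1 + w * y2
    var = s1**2 * s2**2 * (1 - rho**2) / d
    return L, var**0.5, w

# rho = 0.5 exceeds sigma_1/sigma_2 = 0.2: a negative weight is expected.
L, sL, w = correlated_average(10.0, 0.1, 11.0, 0.5, rho=0.5)
print(round(w, 3))    # -0.071: the less precise result gets a negative weight
print(round(L, 3))    # 9.929: the combination lies BELOW both measurements
print(round(sL, 3))   # 0.094: yet it is more precise than 10 +- 0.1 alone
```

So the second, worse measurement is not useless at all: it pulls the estimate outside [y1, y2] and still shrinks the uncertainty.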

    How can that be ?

    It seems a paradox: you take a measurement of the stick length, get y1, and then somebody tells you:

    "Here's another method to measure the length. It will give a result, y2, partly correlated with the one you got. Please try it, but beware: if the stick is shorter than y1, this other method will most likely give you y2>y1, i.e. tell you that the stick is longer than y1; while if the stick is longer than y1, you will most of the time get y2<y1."

    Your reaction might be "WTF -your other measurement makes no sense to me!". But it does make sense, and averaging (in the correct way!) y1 and y2 will improve the accuracy of your combined result.

    The problem is that we are not accustomed to dealing with measurements that have a large correlation between them: it goes beyond our intuition. Here is a dialogue between John and Jane which should shed some light on the issue.

    John vs Jane

    John: "I took a measurement and got x1. I am now going to take a second measurement x2, which has a larger variance than the first. Do you mean to say I will more likely get x2>x1 if L<x1, and x2<x1 if L>x1?"
    Jane: "That is correct. Your second measurement 'goes along' with the first, because your experimental conditions made the two highly correlated and x1 is more precise."
    John: "But that means my second measurement is utterly useless!"
    Jane: "Wrong. It will in general reduce the combined variance, except in the very special case when the correlation coefficient is equal to the ratio of the standard deviations of the two measurements. And in any case the weighted average will converge to the true L: least-squares estimators are consistent!"

    John: "I still can't figure out how on earth the average of two numbers can be outside of their range. It just conflicts with my common sense."
    Jane: "You need to think in probabilistic terms. Look at this error ellipse (see graph on the left): it is thin and tilted (high correlation, large difference in variances)."

    John: "Okay, so ?"
    Jane: "Please, would you pick a few points at random within the ellipse?"
    John: "Done. Now what ?"

    Jane: "Now please tell me whether they are mostly on the same side (orange rectangles) or on different sides (pink rectangles) of the true value."
    John: "Ah! Sure, all but one are on orange areas! (See graph on the right)".

    Jane: "That's because their correlation makes them likely to 'go along' with each other. And I can actually make it even easier for you. Take a two-dimensional plane, draw axes, draw the bisector: the latter represents the possible values of L. Now draw the error ellipse we have just seen around a point of the diagonal. Any point, we'll move it later."

    John: "Done. What next ?"
    Jane: "Now enter your measurements x=a, y=b. That corresponds to picking a point P(a,b) in the plane. Suppose you got a>b: you are in the lower right triangle of the plane. To find the best estimate of L, move the ellipse by keeping its center on the diagonal, scaling it as needed, until it intercepts the measurement point P."
    John: "But there's an infinity of ellipses that fulfil that requirement!"
    Jane: "That's correct. But we are only interested in the smallest such ellipse! This is the one whose tangent at P(a,b) is parallel to the diagonal of the graph. Its center gives us the best estimate of L, given (a,b), the ratio of their variances, and their correlation."
    John: "Oooh! Now I see it! It is bound to be outside of the interval!"
    Jane: "Well, that is not always true: it is outside of the interval here only because the ellipse you have drawn is thin and its angle with the diagonal is significant. In general, the result depends on how correlated the measurements are (how thin the ellipse is) as well as on how different the variances are (how big the angle of its major axis with the diagonal is)."
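Jane's ellipse argument can also be checked numerically. The sketch below (my own, with illustrative numbers: rho = 0.95 and the sigmas from the post) draws correlated measurement pairs around a true value via a hand-rolled Cholesky construction and counts how often both land on the same side of it:

```python
import math
import random

# Draw correlated pairs (x1, x2) around a true length and count how
# often both fall on the same side of it. rho and the sigmas are
# illustrative choices; the pair is built by Cholesky decomposition.
random.seed(1)
L_true, s1, s2, rho = 10.5, 0.1, 0.5, 0.95

n = 20000
same_side = 0
for _ in range(n):
    u, v = random.gauss(0, 1), random.gauss(0, 1)
    x1 = L_true + s1 * u
    x2 = L_true + s2 * (rho * u + math.sqrt(1 - rho**2) * v)
    if (x1 - L_true) * (x2 - L_true) > 0:
        same_side += 1

frac = same_side / n
print(round(frac, 2))   # close to 0.90 = 1/2 + arcsin(rho)/pi
```

For a bivariate Gaussian the same-side probability is 1/2 + arcsin(rho)/pi, so at rho = 0.95 roughly nine pairs out of ten fall on orange areas, just as John found by eye.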

    Try it yourself!

    If the above discussion does not convince you that the best estimate of an unknown quantity you have measured with two correlated methods may fall outside the range of the two results, I have another way to try to force it into your brain, and it might actually be fun. You will need a square table on which to perform your measurement, a stick to be measured, a ruler, and another long rod.

    First of all, measure the stick as accurately as you can with the ruler (using a ruler with smaller divisions would be a good idea), and call the result x. This will be your "true value" of the stick length. Then you proceed to set up two correlated measurements of the stick length in a rather funny way. First, place the rod on the table such that it looks orthogonal to the edge of the table on your side, at a distance from the right side of the table a few inches larger than the stick length. Please do not make the orthogonality requirement too strict: a small tilt will make things more interesting. Fix the rod with tape in that position. Now measure as accurately as you can (use the more precise ruler if you have it) the distance along your side of the table from the right edge of the table to the point where you fixed the rod, and call it y.

    Using a pencil and the ruler, now mark on the table two sets of points at, say, five and fifteen inches from the edge of the table, which will let you position the stick parallel to the edge of the table.

    Finally you perform your measurements by placing the stick (in yellow in the figure above) such that it passes precisely by the two "five inch" marks and then the "fifteen inch" marks. Each time, you move the stick toward the left until it hits the rod (in red), and then fix it there. Then with the ruler you measure the distances, d1 and d2, between the right edge of the stick and the right edge of the table. Don't make these measurements very precise, for added fun.

    From the way you have performed your measurements, your estimated stick lengths will be x1=y-d1, x2=y-d2. These two measurements are subject to some degree of random error inherent in the process of reading off the length on the ruler, and other methodological issues; but the largest effect is probably that they are correlated because the rod on your left is not exactly orthogonal to the edge of the table. And of course, x1 will be more precise than x2, because it suffers less from the non-orthogonality.

    Now compare x1 and x2 with x, the true value of your stick's length. You should either find x<x1<x2 or x>x1>x2. Which one depends on whether your "orthogonal" rod is tilted toward the left or toward the right. In both cases, however, x1 is closer to the true value x, as we have been discussing: the large correlation between the two measurement errors makes the true value most likely to lie on the side of the higher-accuracy measurement.
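A quick numerical sketch of the tabletop setup shows the same ordering. The tilt angle, mark positions, and read-off noise below are illustrative choices of mine, not measured values:

```python
import math
import random

# The "orthogonal" rod is actually tilted by a small unknown angle, so
# each measurement picks up a bias proportional to the distance of its
# guide marks from the table edge. All numbers are illustrative.
random.seed(7)
L_true = 10.0                # true stick length (inches)
tilt = math.radians(1.5)     # unknown tilt of the rod

def measure(mark):
    # bias grows with the mark distance; add some ruler read-off noise
    return L_true + mark * math.tan(tilt) + random.gauss(0, 0.02)

x1 = measure(5)    # marks at five inches: small bias
x2 = measure(15)   # marks at fifteen inches: three times the bias

print(L_true, round(x1, 2), round(x2, 2))   # L_true < x1 < x2 for this tilt
```

The tilt pushes both results the same way, x2 three times as hard as x1, so the true value ends up on the side of the more precise measurement, exactly the correlated situation discussed above.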

    And if that, too, does not convince you... I give up!


    "uncertainty s1=0.1 cm, and when measured with method 2 the result was x2=11 cm, with estimated uncertainty s2=0.5cm."
    Well, got to disappoint you, but I immediately went for answer number 5. Why? Because the error ranges do not even overlap a little, which tells us that at least one of the experimenters totally underestimated the error in her setup. If at least one of the results is obviously wrong (true value not even close to the error range, let alone overlapping it), one should not take any average.
    This is correct in principle Sascha, I totally agree. However the two results do "overlap" as stated, in the sense that their consistency is at the level of less than two standard deviations, something which happens more often than once in twenty cases. Physicists do average results which are at this level of agreement routinely.

    consistency is at the level of less than two standard deviations, something which happens more often than once in twenty cases
    Not in my lab! ;-)

    Seriously, you just gave an example where the true value is actually smaller than 10 cm. So, it is more than two s2 out from 11cm. That person needs to do better error estimation.

    Error estimation does not come down to statistics only - that is just how you guys (according to your again insulting side stab, "physicists") are doing it. We "non-physicists" doing proper physics know that although the math uses a bell curve, I can use my ruler with mm accuracy a bazillion times and still never once get an error of a whole meter! (Except as an outlier because I was stoned that day, but that error will not be distributed according to a Gauss curve with 1 mm sigma either!)
    Sascha, here is a direct stab if you don't like side ones: you appear to have such a large ego that it is hard not to step on it one way or another. Only you can see something insulting in the sentence I wrote above.

    About errors, we are not talking about meter-sized errors in mm-accuracy measurements, but rather of 2-mm errors in mm-accuracy measurements. But I am not interested in arguing about this, which is off-topic, so I'll concede that I should have made it 10+-0.1 and 11+-0.7; then you would have had to read the meat of the post to find something arguable. The math works the same way.

    "less than two standard deviations, something which happens more often than once in twenty cases
    Not in my lab! ;-)"

    But it does at MINOS.

    I was a little more 6 than 4, but immediately understood why 5 was better than 6.

    My day job is reformatting lots of data, I don't know statistics, but I have gotten a feel for the statistics of large sets of human generated data, and that should have led me to 5, but I've spent the week installing a wood floor, and my inner carpenter got involved.
    Never is a long time.
    This is just silly, Tommaso and Sascha. The sentences labeled 5 and 6 are not even answers to the original question, which asks for "the best guess". If one knew what the correlation is, or had any other additional information, one could make a better calculation; but given the information at hand and no other data, the only right thing is to assume that the two measurements are independent, and 4 is the right answer. Of course, with extra information, if it is extreme, the best guess may be different or even far from the interval.
    The sentences 5 and 6, with their "negativistic" message, are really spreading fuzzy anti-science attitudes: "nothing can be said", "we don't need to learn any formulae", and so on. If you could say absolutely nothing, you shouldn't have measured it. You should have shut up from the beginning. The two measurements make it extremely likely that the answer won't be too far from the interval 10-11, and formula 4 is one of the things whose logic people should be led to understand and reproduce when they need it, instead of learning to say "we cannot ever say".
    So much for my 'feel'....................
    Never is a long time.
    Hi Lubos,

    I respectfully disagree. There is no "negativistic" message in explaining that sometimes inferences from the data must be made with caution. I perhaps agree that the question was ill-posed (and should have said "best estimate" instead of "best guess"), but this post is didactic and the question posed at the beginning only has the purpose of raising interest in the reader about a non-trivial issue.

    Dear Tommaso,...

    OK, but wouldn't you agree that inference, especially from experimental data, must *always* be done with caution? Does it mean that we should always say, after we do experiments, "we can't say anything specific unless we get some extra information"?

    And if you happened to agree that this is not the right reaction, do you really want to claim that measuring the stick twice is more subtle than other experimental approaches to questions, like deducing the properties of the Higgs sector from the CMS data?

    The formula in 4 is pretty, under some circumstances exact, and this is the kind of formula that all of us should celebrate and that readers of our blogs - and others - should try to internalize. Of course there are (always) possible extra twists that invalidate or correct anything we learn, but promoting the idea that we can always find an excuse not to answer is just not the right direction.


    Thor Russell
    Your column made me think, so it should count as a success :) I agree that the question is pretty ill-posed and open to interpretation, so you can't really give an answer; it just raises an issue that many people probably haven't considered. I assumed the measurements followed a normal distribution, otherwise you just have to tell one of the guys they are wrong.
    On a different note, I was assuming the question was of course a trap, so I was trying to think of everything that could influence it. So what about a priori beliefs about the length of the stick? (If you measure a stick with a ruler you are already making an assumption that it is about the same length as the ruler.) Does it follow a rectangular distribution from 0 to infinity cm, whatever that means, or a log-normal distribution, and can this make a significant difference? (I don't see how you can sensibly take it into account given all the other uncertainty, etc.)
    Thor Russell
    Dear Lubos, dear T!

    You both forget that the audience is not just HEP guys who are so far removed from the measurements by intervening transformations and "proxy" data that it would indeed be almost impossible to do anything better than say equation 4.

    In other branches of physics and science in general, the message should be that you use your head/experience/intuition first and then math!

    I have many years of experience with data taking that are closer to TD's table/stick example than what the LHC is doing, see for example
    Golden Rules of Error Calculation
    No Mysterious Symmetry In Ultracold Helium Nanodroplets
    and I assure you that if you get two measurements like the above, at least one is totally messed up, possibly both. Sure, if you really need to decide right away and have no time to investigate a little into who took the data and so on, then the experience with so-called crowd wisdom (Wisdom of crowds) suggests taking equation 4, and Lubos is correct; but if the issue is important and you have a little time, it is 100% answer 5, i.e. Sascha the Megaego is as usual correct, because you are better off spending the time looking into who did what than cranking out something like equation 4 and then ending up with stuff that does not work and messes up everything down the road.
    Bonny Bonobo alias Brat
    Here is a question for you today: What is your best guess of the length of the stick? To make things interesting, let me give you a few possible answers from which to pick.
    Well I would like to add a seventh option for my guess, to make things less interesting but more likely :-

    7. Something between 9.5 and 12.5 cm, which is a guess that covers 3 standard deviations or at least 99.7% of the measurements for both methods, regardless of there being any more information about the stated accuracies or variances.
    Surprisingly insightful. The first time that I agree with Helen on anything.
    The Mad Hatter
    OPERA takes the mean by method 4. of 6 results - namely the time delays from extractions 1&2 from 2009, 2010, 2011 - see Figure 12 in the preprint.

    Your article suggests that if the six measurements are somehow correlated, the correct mean might be outside the range of the 6 results.

    In any case, OPERA or otherwise, how do we go about finding correlations between what are seemingly independent measurements?

    For OPERA, by measuring Figure 12 screen coordinates on the computer,  I extract for the 6 points
    Mean(ns)        Standard error(ns)
    1015.9    30.0   - 2009 extract 1
    1084.8    21.7   - 2009 extract 2
    1035.3    14.2   - 2010 extract 1
    1027.3    12.3   - 2010 extract 2
    1056.7    19.5   - 2011 extract 1
    1053.0    18.3   - 2011 extract 2

    If you plot these as mean versus standard error, the first point looks like an outlier.
    The remaining 5 points sorted by the mean are
    1027.3    12.3   - 2010 extract 2
    1035.3    14.2   - 2010 extract 1
    1053.0    18.3   - 2011 extract 2
    1056.7    19.5   - 2011 extract 1
    1084.8    21.7   - 2009 extract 2

    The 5 points give a least squares fitted line of error = 0.167*mean - 158.37, with R2=0.924.
    I mention this as a curiosity, I don't think it means anything.  But if we had a lot of measurements and the mean and standard error were correlated, could we deduce anything from it?
    if the six measurements are somehow correlated, the correct mean might be outside the range of the 6 results.
    Precisely, like for example if something below the velocity of light is not inside the range. ;-)
    Hi Mad,
    if the mean is correlated with the uncertainty, generally this means that a significant part of the uncertainty is coming from a scale. Id est, one estimates that a certain part of the uncertainty is due to a scale error, such as sigma_x = k*x with k a constant. It is not uncommon in experimental physics to have a scale uncertainty... In this particular measurement I however do not see immediately from where it could arise (but have no time right now to check in the table of systs of the paper).

    Surely it depends on who is doing the measuring. If it's a scientist or engineer then, despite the disagreement, the true value is probably close to 10.5. If it's an economist or a banker then the true value is probably rather less than that and if it's a politician then there is no true value. Hope this helps.

    Hmmm not sure it does Raphe, but thanks for the contribution.


    sorry, but even your question is malformed. there is little value in a one-number guess of the stick's length. the true question to ask is about the confidence interval, which at least includes an estimate of the systematic error.

    Dear Chris,

    I concur - and in fact I often warn my students that an estimate without an error is worse than an error without an estimate. However, this post is didactic and I do not care about this kind of criticism, since there is a clear lesson to learn in it. Please let us discuss the lesson rather than the prologue of the post.

    Fantastic post about handling error-ridden measurements. I learnt a lot from it and also, being a visual kind of guy, love the graphical construction method. One practical question: how do you find out how correlated measurements are? How do you determine your covariance matrix?

    Sacha, every time I read your comments I smile. Seriously, I wouldn't change you for anything! You are almost always right, and absolutely always argumentative.

    You are almost always right, and absolutely always argumentative.
    Well, how can I argue with that???
    Hi AG,
    the determination of the covariance matrix depends on the details of the experiment. In some cases it can be determined from the data directly, such as when two measurements use some, but not all, of the same data. In other cases one resorts to toy Monte Carlo generation of pseudo-data.
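As a toy illustration of the shared-data case (the setup, sample sizes, and sigmas below are invented for the sketch), one can repeat the two "measurements" on many pseudo-experiments and compute the correlation of the results directly:

```python
import random
import statistics

# Two "measurements" that share part of their data: repeat the analysis
# on toy pseudo-experiments and compute the correlation of the results.
random.seed(3)
L_true, sig, n_toys = 10.0, 0.5, 2000

a, b = [], []
for _ in range(n_toys):
    shared = [random.gauss(L_true, sig) for _ in range(50)]
    only_a = [random.gauss(L_true, sig) for _ in range(50)]
    only_b = [random.gauss(L_true, sig) for _ in range(10)]
    a.append(statistics.mean(shared + only_a))   # result of "method 1"
    b.append(statistics.mean(shared + only_b))   # result of "method 2"

ma, mb = statistics.mean(a), statistics.mean(b)
cov = sum((x - ma) * (y - mb) for x, y in zip(a, b)) / n_toys
rho = cov / (statistics.pstdev(a) * statistics.pstdev(b))
print(round(rho, 2))   # sizable positive correlation from the shared events
```

The estimated rho can then be fed into the correlated weighted average of the post; here the shared events induce a correlation of roughly 0.65, computable analytically from the overlap fractions.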
    Sasha Bonkers is a pain in the ass. Hope this insight helps.

    Sascha Bonkers? Wash your mouth anon, you are insulting the Alpha Mule!

    This was the perfect example of a strawman, a.k.a. an artificial image to attack.

    You're in the news ... .

    Physicist Tomasso Dorigo, who works at CERN, the European Organization for Nuclear Research, and the U.S. Fermilab near Chicago, said in a post on the website Scientific Blogging that the ICARUS paper was "very simple and definitive."

    Hi John,

    thanks - yes, I know: there are at least a dozen newspapers and newscaster sites that have quoted me directly or indirectly in the last couple of days, and all for having written in a post a month ago that ICARUS had a result in conflict with the OPERA one.

    It asked for best guess and I thought 10.1 sounded like a good guess based on the smaller error range but weighted to include the other measurement.

    Excellent post.
    The first time I got a negative weight by applying the BLUE method, I was so convinced of its impossibility that I spent a ridiculous amount of time checking my (simple) code for bugs...