    The Jeffreys-Lindley Paradox
    By Tommaso Dorigo | February 22nd 2012 09:02 AM

    About a month ago I held a three-hour course on "Statistics for Data Analysis in High-Energy Physics" in the nice setting of Engelberg, a mountain location just south of Zurich. Putting together the 130 slides of that seminar was a lot of work, and no small amount of fun; in the process I was able to collect some "simple" explanatory cases of the application of statistical methods and related issues. A couple of examples of this output are given in a post on the fractional charge of quarks and in an article on the weighted average of correlated results. But there is more material to be dug out.

    Among the many things I had marked down for subsequent discussion in this blog is a very interesting statistical "paradox", which highlights the opposite conclusions that a researcher might reach if she were to adopt either of the two "schools of thought" of theoretical Statistics. Examples of this kind are important because we use statistical procedures for most of our scientific results, and as scientists we are accustomed to thinking of the data as giving unequivocal answers. Statistics, however, messes things up, since the experimenter's choices have an impact on the final result.

    The paradox in question is called "Jeffreys-Lindley" and arises in cases where one tests a hypothesis with a high-statistics dataset. Rather than discuss the generalities, I think it is most profitable to go straight to a practical example, leaving the discussion to the end.

    Imagine you analyze data from one of the LEP experiments: these were four detectors (L3, ALEPH, DELPHI, OPAL) that looked at electron-positron collisions at a center-of-mass energy of 91 GeV. Such collisions are very clean - the projectiles are structureless, and their interaction is rare enough that only one collision is produced every few seconds. Further, the final state has zero total electric charge, and the "hermeticity" of detectors for e+e- collisions allows the detection of a large fraction - close to 100% - of the charged particles that get produced. So the researcher might decide to study with high precision the possible charge bias of her detector by just counting tracks with positive and negative curvature. The track curvature is due to the strong axial magnetic field existing at the center of these detectors (in L3 the solenoidal field was a relatively weak 0.5 tesla).

    After counting many tracks, an asymmetry significantly different from zero would clearly indicate that the detector or the software has a different efficiency for reconstructing tracks curving in one or the opposite direction under the action of the axial magnetic field.

    Let us say that she collects n_tot=1,000,000 tracks, and gets n_pos=498,800 positive and n_neg=501,200 negative tracks. The hypothesis under test is that the fraction of positive tracks is R=0.5 (i.e., no charge bias), and let us add that she decides to "reject" the hypothesis if a result at least as extreme as the observed one has a probability of less than 5% under that hypothesis: this is referred to as a "size 0.05" test.

    A Bayesian researcher will need a prior probability density function (PDF) to make a statistical inference: a function describing the pre-experiment degree of belief in the value of R. From a scientific standpoint, adding such a "subjective" input is questionable, and indeed the thread of arguments is endless; what can be agreed upon is that, if one is doing things in a Bayesian way, a prior PDF which contains as little information as possible is the lesser evil.

    A "know-nothing" approach might then be to choose a prior PDF by assigning equal weights to the possibility that R=0.5 and to the possibility that R is different from 0.5, with the latter half spread evenly over all other values. Then the calculation goes as follows: the observed fraction x=n_pos/n_tot is distributed as N(R,σ), with σ^2=x*(1-x)/n_tot (we are in a regime where the Gaussian approximation holds, and the variance written above is indeed the Gaussian approximation to the variance of a binomial ratio). Here N(R,σ) is just a Gaussian distribution of mean R and variance σ^2.
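
    For the numerically minded, here is a minimal Python sketch (it assumes scipy is available) of the size of σ for these numbers, and a check that the Gaussian approximation to the binomial spread is indeed excellent here:

        import numpy as np
        from scipy.stats import binom

        n_tot, n_pos = 1_000_000, 498_800
        x = n_pos / n_tot
        sigma = np.sqrt(x * (1 - x) / n_tot)    # Gaussian approximation to the spread of x
        print(sigma)                            # ~0.0005
        # exact binomial standard deviation of the ratio n_pos/n_tot, for comparison
        print(binom.std(n_tot, x) / n_tot)      # essentially identical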

    If we have a prior and the data, we can use Bayes' theorem to obtain the posterior probability of the two hypotheses; in formulas,

        P(H0 | x) = [0.5 * P(x | H0)] / [0.5 * P(x | H0) + 0.5 * P(x | H1)] ≈ 45 / (45 + 1) ≈ 0.98

    where P(x | H0) is the value of N(0.5,σ) at the observed x=0.4988 (about 45), and P(x | H1) is the likelihood averaged over the flat half of the prior, i.e. the integral over R in [0,1] of the value of N(R,σ) at x, which is very close to 1.

    (If the above does not make sense to you, you might either want to check Bayes' theorem elsewhere, or believe the calculation as is and try to make sense of the rest of this post - the math is inessential).

    From the above value, higher than the size α=0.05 and actually very close to 1, a Bayesian researcher concludes that there is no evidence against R=0.5; the obtained data strongly support the null hypothesis.
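
    For readers who want to reproduce that number, here is a minimal Python sketch of the Bayesian calculation (it assumes scipy and uses the Gaussian approximation and the half-and-half prior described above):

        import numpy as np
        from scipy.stats import norm

        n_tot, n_pos = 1_000_000, 498_800
        x = n_pos / n_tot                        # observed fraction, 0.4988
        sigma = np.sqrt(x * (1 - x) / n_tot)     # ~0.0005

        # likelihood of the observed fraction if R = 0.5 exactly (H0)
        L0 = norm.pdf(x, loc=0.5, scale=sigma)                                       # ~45
        # marginal likelihood under H1: R spread uniformly over [0, 1]
        L1 = norm.cdf(1.0, loc=x, scale=sigma) - norm.cdf(0.0, loc=x, scale=sigma)   # ~1

        posterior_H0 = 0.5 * L0 / (0.5 * L0 + 0.5 * L1)
        print(posterior_H0)                      # ~0.98: the data favor H0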

    Frequentists, on the other hand, will not need a prior, and they will just ask themselves how often a result as "extreme" as the one observed arises by chance, if the observed fraction is indeed distributed as N(0.5,σ), with σ^2=0.5*(1-0.5)/n_tot this time. One then has:

        P(x ≤ 0.4988 | H0) ≈ 0.008     (the observed fraction lies 2.4 σ below the mean of 0.5)
        P' = 2 * 0.008 ≈ 0.016

    (in the second line we multiply the probability by two, since if H0 holds we are just as surprised to observe an excess of positive tracks as a deficit!).

    From the above expression, the Frequentist researcher concludes that the tracker is indeed biased, and rejects the null hypothesis H0, since there is a less-than-2% probability (P'<α) that a result as extreme as the one observed could arise by chance! A Frequentist thus draws, strongly, the opposite conclusion from the Bayesian, using the same set of data. How do we solve the riddle?
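
    The Frequentist number is just as easy to reproduce; a minimal Python sketch (again assuming scipy and the Gaussian approximation):

        import numpy as np
        from scipy.stats import norm

        n_tot, n_pos = 1_000_000, 498_800
        x = n_pos / n_tot
        sigma0 = np.sqrt(0.5 * 0.5 / n_tot)      # spread of x under H0, 0.0005

        z = (x - 0.5) / sigma0                   # -2.4 standard deviations
        p_two_sided = 2 * norm.cdf(-abs(z))      # ~0.016, below the 0.05 size of the test
        print(z, p_two_sided)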

    There is no solution: the two statistical methods answer different questions, so if they agreed on the answer it would be by sheer chance. The strong weight given by the Bayesian to the hypothesis of an unbiased tracker is questionable, but not unreasonable: the detector was built with that goal. Notice, however, that it is only the high statistical power of the large dataset that allows the contradiction with the Frequentist result to emerge. One might decide to play with the prior PDF and make it such that the Bayesian and Frequentist answers coincide; but what would we learn from that?

    Comments

    Well... there are some important points swept under the rug here. The Wikipedia article on the paradox is actually pretty clear. [Disclosure: I'm a stats prof who works with Bayesian methods, so I may be a tad biased.] First, it's important to realize two things: (1) that frequentists ALSO use priors, they just don't talk about them (specifically, a flat prior in whatever parameterization they happen to be working with, which is often a terrible assumption for binomial probabilities); and (2) the "equal weight" in the problem is nothing of the kind: the Bayesian is putting half the weight on 1/2 exactly, and the other half EVENLY DISTRIBUTED OVER ALL OTHER VALUES. This places a ludicrously heavy prior weight on one specific value, 1/2 - a point mass, in fact.

    In any case, this is why statisticians came up with maximum entropy priors, and other methods that are relatively impervious to parameterization. A Bayesian analysis should give you the same answers whether you are estimating a binomial probability or, for example, its log-odds, whereas a frequentist one may not. So, this "paradox" is less a paradox than an observation that, if you stack the odds in favor of one hypothesis over the other, you'll be unlikely to abandon it when data come in.
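
    To make the "stacking the odds" point concrete, here is a minimal Python sketch (it assumes scipy and the Gaussian approximation used in the post) of how the posterior probability of H0 depends on how widely the alternative's share of the prior is spread around 0.5:

        import numpy as np
        from scipy.stats import norm

        n_tot, x_obs = 1_000_000, 0.4988
        sigma = np.sqrt(x_obs * (1 - x_obs) / n_tot)   # ~0.0005

        def posterior_H0(width):
            """P(H0 | data) when the alternative spreads R uniformly over
            an interval of the given width centred on 0.5."""
            L0 = norm.pdf(x_obs, loc=0.5, scale=sigma)
            lo, hi = 0.5 - width / 2, 0.5 + width / 2
            # marginal likelihood under H1: average of the likelihood over the interval
            L1 = (norm.cdf(hi, loc=x_obs, scale=sigma)
                  - norm.cdf(lo, loc=x_obs, scale=sigma)) / width
            return 0.5 * L0 / (0.5 * L0 + 0.5 * L1)

        for w in (1.0, 0.1, 0.01, 0.003):
            print(w, posterior_H0(w))    # P(H0 | data) drops as the alternative gets narrower

    The wider the alternative is spread, the more the point mass at 0.5 wins; concentrate the alternative's prior near the observed value and the posterior support for H0 melts away.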

    dorigo
    Dear Anon,

    I am not used to sweeping things under the rug. I make simplifications sometimes, because this is a popularization site and I try to not be pedantic.

    Your points are valid, but I fail to see how they "solve" the paradox. The paradox is a good example showing how the methodology one uses affects the conclusions one draws.

    About frequentists using flat priors: this is unknown to high-energy physics. Could you give me some reference? In HEP, frequentist measurements sometimes include hybrid methods to estimate the effect of nuisance parameters. We also use Bayesian techniques, often with a flat prior in the signal strength, which is quite annoying since the choice of the prior is nothing but a choice of a preferred metric, and in particle physics we would like our results to be insensitive to our preconceptions.

    Best,
    T.
    Hi Tommaso,

    My choice of phrasing was unfortunate. "Swept under the rug" would have been more accurately put as "a key element of this analysis isn't suitably highlighted here." And that element is that the choice of prior in the example is directly driving the results. So, it is less a question of "The paradox is a good example showing how the methodology one uses affects the conclusions one draws" than of "The paradox is a good example showing how the PRIOR one uses affects the conclusions one draws." In this particular example, the paradox comes about because half the entire prior is concentrated at the value 1/2. This seems uniform or flat, but isn't, since no other value gets that much weight. One could take this to absurd lengths and even have an IMPROPER prior, like 1/2 on the integer zero, and 1/2 spread out across all other integers (suppose the null was that a scale added n pounds to your weight, and you've concentrated the mass on zero).

    When I said frequentists "use" flat priors, it was like saying that everyone wears glasses, just some people's have zero refractive index: a metaphor to get across that the frequentist analysis is a *special case* of the Bayesian, one where priors are ignored entirely.

    In any case, I do agree that there are paradoxical elements here, but one has to choose the prior carefully to get such diametric differences in conclusion. Jeffreys is (was) a whole lot sharper than I, and I don't mean to contradict his interpretation (although he himself, I believe, didn't append the "paradox" term to it).

    I wrote a fairly long post to Tommaso in reply about an hour ago. Is it gone, or is there a delay? I didn't save it, and am loath to type it all out again.

    Hi, I'm a physics student at an Italian university and I attended a "high-energy statistics course" last semester. I'd like to know if it is possible to have the presentation you used for your lecture? I think it would be interesting to see some good examples from someone who works in the high-energy field.

    Regards
    Alessandro

    dorigo
    Hi Alessandro,
    the links are in a post of three weeks ago. Scroll down...
    Cheers,
    T.
    As pointed out by the previous commenter, this is a classic case of "rubbish in = rubbish out". Beyond that, I'm not sure what else this paradox shows.

    The final question, "what would we learn from that", has an important answer. We learn what is required of the prior to make this happen, which, as pointed out above, is something quite radical and not, on the face of it, very sensible. So I really don't understand why you disregard this sort of question as meaningless - surely it's the most important lesson of the paradox.

    Ah, I see my prior post did show up.

    It's a bit misleading to portray the Lindley paradox as an argument between Bayesians vs Frequentists;

    * Use of Bayes Factors (or posterior probabilities of the null) is not the only Bayesian approach to testing. Other Bayesian tests will give almost exact agreement with use of p-values, without fiddling with the prior.

    * Bayesian testing procedures have Frequentist properties too. If you want to use the posterior probability of the null as a test statistic in a Frequentist test, go right ahead, you'll get great agreement between those Bayesian and Frequentist methods.

    dorigo
    Thanks for your points Anon, food for thought.
    Cheers,
    T.
    This discussion has led to another nice discussion at Andrew Gelman's blog on statistics. Since he literally wrote the book on some modern Bayesian modelling, it goes into some depth, but the conclusion is similar to the point made just above: I think it is a bit off to frame this as a Bayesian vs Frequentist split.

    > The hypothesis under test is that the fraction of positive tracks is R=0.5

    And THAT is really where you have gone wrong, and there is indeed no solution to your riddle without moving back a step.

    Can I change the example slightly? Assume the sample was CONSISTENT with R=0.5 under a classical hypothesis test. But then someone comes along with another analysis method - no, not Bayes, that's too controversial - let's call this method #2, "physical reasoning". I'm not a physicist, so the details of what I say may be nonsensical, but I hope the spirit is right. I'm imagining one might say the following: "The equipment, and the environment, encompasses many charged particles exerting electromagnetic force. By and large there's always a -ve very near and symmetrically surrounding a +ve, but this is not 100% always so. And the slightest failure of this inevitably means that the environment has a different effect on +ve vs -ve in any given spot. Things are expected to balance on the whole, so the difference might be unbelievably minuscule - maybe 1/(10^10^10^10^10^10^10^10) - but it is physically implausible that things balance out _exactly_ (i.e. truly, absolutely, zero at every single decimal place). Any electronic device (e.g. a lab worker using an iPod 1 mile away) breaks perfect symmetry." Therefore, "physical reason" says, you'd best go with "R != 0.5". "Physical reason" goes further to point out that the sample is irrelevant, since the imbalances that are almost certainly present lead to effects so inconceivably tiny that no practical sample size could rebut them, yet it is almost inconceivable that they don't exist to some minuscule degree. So we conclude: it is nearly inconceivable that anything other than R != 0.5 holds.

    So now there's a conflict between hypothesis testing and this new method, "physical reason". Same riddle! Less emotion! And here it's easier to resolve, since it's "obvious" that "physical reasoning" isn't really answering the question that we wanted to ask, even though it _is_ addressing the question we did ask. (R = 0.5 or not? It's almost certainly not, whatever the sample says).

    I intend this example to be convincing that the original question ("hypothesis... R = 0.5?") is unclear, ambiguous, or maybe just flat-out wrongly stated. Step #1 in any resolution to this is to go back and be rather more precise about what one actually wants to know.

    (N.b. my belief is that hypothesis testing is a good method in response to certain possible "clarifications" of the "what do you really want to know" question, but that it's not plausible that anyone would really ask these questions unless their only purpose was to find uses for hypothesis testing. Bayes? That's murkier - but the fact that you can't just turn an automated crank to get an "answer" to an incompletely described question is a small point in its favor).

    dorigo
    Hello bxg,

    thanks for your interesting comment. It is of course true that R=0.5 is an idealization. But it is also true that frequentist statistics allows you to test precisely to what extent that idealization holds, given the size of the dataset you have to deal with. For a physicist, at least, that is a very, very important question to ask. If the R=0.5 hypothesis (exactly 0.5) cannot be ruled out, for a physicist it means she won't have to worry about a process under study actually suffering a significant systematic uncertainty in its rate determination from an asymmetric tracker.
    (There are many examples where a CP violation effect, just to name one, would need to be seen as a difference between two charge-conjugate processes, so you see the question is truly relevant.)

    So, to summarize: idealizations are important to test, for a physicist. It is because of this, I think, that the frequentist paradigm is so deeply rooted in what we do. Otherwise, we would have been quicker than the rest of the pack to jump onto Bayesian statistics thirty years ago, when computer power allowed the needed calculations.

    Cheers,
    T.
    > But it is also true that frequentist statistics allows you to test precisely to what extent that idealization holds, given the size of the dataset you have to deal with

    I disagree with the details of your claim ("precisely"?) but accept the spirit - frequentist hypothesis testing sort of does what you say. But when would we ever care? It doesn't tell us whether the idealization is true or false (we can be fairly sure it's false). It doesn't tell us whether it's practically true or not (it has no thresholds, no utility function, by which we can tell it what is important or not - so it surely isn't a useful tool for getting at practical significance). It's like an artificial courtroom: you need to decide on X, never mind why, and the only evidence you are allowed to consider is this data "D" (nor can you use any external knowledge or reasoning; you must just consider this as an abstract maths equation) - now what say you? But what I want to know is where - and especially in physics of all places - is this very constrained question of genuine interest? Whether we think X is true or not - sure! Whether X is "practically" true - sure! But what significance testing will answer - ???

    dorigo
    I tried to make it explicit in my answer before: we need to know whether we can trust the approximation that R=0.5, to the extent that it does not affect our measurement of other physical quantities, which rely on a similarly sized set of data (if not the same set).

    Idealizations (as you call them) or approximations (as we call them) are normal in experimental physics. If we found that R=0.5 does not hold up, given the size of the dataset we test it with, we would have to assign a systematic uncertainty to whatever we measure that relies on R being exactly 0.5.

    You raise the point of decision theory and utility functions. These are things that Bayesian theory gives us and that we do not have in frequentist statistics (which is not a complete decision theory). But we do not make decisions based on our data - or we very seldom do; and when we do, it is definitely outside the scope of pure science (for instance, we may decide to ask for more funding based on a hint of a new signal in the data, or to build a new detector, etc.). What we do with the data is usually aimed at getting a result, without using the latter to take a decision. It is the sum of many different experiments and confirmations from other analyses that allows our field to progress, and knowledge to grow.

    Cheers,
    T.
    The paradox isn't. It, like any other paradox, is a failure of assumptions. I examined Dorigo's example on my webpage (should be clickable on my name). I'm also a Bayesian statistician, and agree with Anonymous that the Bayesian in the example is nuts.

    dorigo
    Welcome here, William. I am not questioning the fact that the Bayesian assumption is less than appropriate; I am just making a didactic example here.

    Cheers,
    T.
    > I tried to make it explicit in my answer before: we need to know whether we can trust the approximation that R=0.5, to the extent that it does not affect our measurement of other physical quantities, which rely on a similarly sized set of data (if not the same set).

    That's nearly nonsensical to me. We know (as an almost moral certainty) that R != 0.5 - I suspect you don't really disagree. So I agree the question is whether we can trust the approximation. But we are relying on an approximation FOR A PURPOSE, and your method has no knob that lets us dial in what our purpose is. Maybe 0.1 precision is fine; maybe 10^-10 precision is needed, maybe something vastly smaller still - but you won't even show us how to take that into account. It doesn't have to be utilities or decision theory, but if your method doesn't have a control for what amount of difference is or is not important, there's clearly a bug. Physicists - who sometimes care about tiny orders of magnitude that any other discipline would judge as de minimis - should see this most clearly.

    Ok, yes, I see your restriction "measurements ... which rely on a similarly sized set of data" - but where does this show up in your original question? If your thought experiment showed that the tested hypothesis passed, and a later student published a result based on 0.1x as many samples, please tell us where in the publishing or advisory process one would get to stand up and shout: we haven't validated accuracy for _that_ quantity of data. Or is it:
    "accuracy hypothesis passed", please proceed to "go"?

    If you are trying to test whether an idealization is "good enough", and you don't have some sort of knob controlling or informing what "good enough" means for your situation, IMO there is self-evidently a problem.

    dorigo
    Dear Bxg, I am afraid you might know little of the review process in high-energy physics, which is the tightest of all the hard sciences. Suffice it to say that we are the ones blocking our own papers from publication most of the time - we do not need to rely on an external referee to do that for us in most cases.

    To be more specific, a (frequentist) test like the one described above is done whenever one wants to publish a measurement which may be affected by a charge asymmetry. We do not rely on past history, but remeasure all systematics every time. The test discussed above would indicate that we need to assign a systematic uncertainty to the measurement due to the assumption of charge symmetry, and possibly decide to correct for the bias.

    Best,
    T.
    I wonder if the disagreement between the two methods would be reduced if one were to repeat the analysis testing the hypothesis Pr[R < 0.5 | x,n] > Pr[R > 0.5 | x,n] with a uniform prior on R. (Equivalently, check the hypothesis that Pr[R < 0.5] - Pr[R > 0.5] > 0.) In this case, the Bayesian analysis gets something like 98.3% chance that the detector is biased low, if one uses a flat prior on R.
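
    (A minimal Python sketch of this alternative calculation, assuming scipy and a flat Beta(1,1) prior on R, so that the posterior is Beta(n_pos + 1, n_neg + 1); the ~98.3% quoted above appears to correspond to the difference Pr[R < 0.5] - Pr[R > 0.5], while Pr[R < 0.5] itself comes out near 0.99:)

        from scipy.stats import beta

        n_pos, n_neg = 498_800, 501_200
        post = beta(n_pos + 1, n_neg + 1)     # posterior for R under a flat prior

        p_low = post.cdf(0.5)                 # Pr[R < 0.5 | data], ~0.99
        print(p_low, p_low - (1 - p_low))     # the difference is ~0.98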


    dorigo
    Interesting. Indeed, the result depends strongly on the assumption one makes, despite the high discriminating power of the large dataset.
    Cheers,
    T.
    Reading this, I was intrigued by the "high discriminating power of the dataset" phrase that you used. I wondered how "discriminating" the data is, in terms of point estimates, so I compared PDF(R = 0.5 | counts) and PDF(R = MLE | counts). MLE in this case is 0.4988, and PDF is the Beta(n_pos + 1, n_neg + 1) function.

    PDF(R=0.5) ~= 45
    PDF(R=MLE) ~= 800
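
    (These two values are easy to reproduce; a minimal Python sketch, assuming scipy and the Beta(n_pos + 1, n_neg + 1) posterior mentioned above:)

        from scipy.stats import beta

        n_pos, n_neg = 498_800, 501_200
        a, b = n_pos + 1, n_neg + 1           # Beta posterior from a flat prior
        mle = n_pos / (n_pos + n_neg)         # 0.4988

        print(beta.pdf(0.5, a, b))            # ~45
        print(beta.pdf(mle, a, b))            # ~800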

    So it's not really that discriminatory for comparing those two point estimates. If we were just trying to choose between 0.5 and the MLE point estimates, the normalized posterior for R=0.5 would be around 0.053, not even below the traditional cutoff of 0.05.

    Given the rather fast falloff around the peak of the Beta, it's not that surprising that when we compare the posterior of H_0 and H_1 in the Bayesian formulation above, the former seems much more likely. If we make a back of the envelope approximation, then the posterior of H_1 is going to be proportional to PDF(MLE) * width_of_PDF * 0.5. (note that this is actually biasing it a bit in favor of H_1.)

    Call the width around 0.005 (the PDF falls to below 0.01 if we're 0.0025 from the MLE, compared to 800 at the peak). 800 * 0.5 * 0.005 = 2. Compare to the (unnormalized) H_0 posterior, 0.5 * 45, and I think it becomes clear what's happening.

    - The likelihood at 0.5 is not that much smaller than the MLE's likelihood (PDF(0.5) ~= 1/18 of PDF(MLE))
    - The likelihood for R is very peaked around a small region around the MLE.
    - The combination of a point prior for H_0 versus a diffuse prior for H_1 means that the posterior for H_1 takes the width of the "effectively nonzero" portion of the PDF into account in its calculation, but the posterior for H_0 does not.

    So it's a combination of the data not actually being that discriminatory for the two point estimates (MLE vs. 0.5), and the posterior for R being very narrow around the MLE, while H_1 is actually being evaluated for R over all of [0, 1].

    dorigo
    Very interesting comment. I will need to ponder over it... Thanks.
    I wish people who leave constructive comments also left their name, though!
    Cheers,
    T.
    Sorry, I overlooked changing the posting name. Chalk it up to laziness, not trying to remain hidden.

    I spent some time over the last few days revising the Wikipedia article to (hopefully) make it clearer what is going on in the two methods. (If I've made any errors, I would truly appreciate someone letting me know.) Another thing I'd like to add to the article is a plot of the posterior PDF for R that results from the peak-plus-flat prior. I think it might be instructive. Note that this would be the prior times the likelihood for each value of R, and not the same shape as the prior.