In general, whenever you are measuring an unknown quantity, be it the mass of the top quark or the correlation between chocolate consumption and the rate of Nobel prizes in your country, you have a choice to make: you can use a Classical (aka "Frequentist") formalism or a Bayesian one. This choice determines how you derive your "confidence interval" (or "credible interval" in Bayesian jargon) for the quantity you have measured. And if you choose the Classical setup, as is commonplace in the exact sciences, you have to deal with p-values - probabilities defined as success ratios in the limit of a large number of repetitions of your experiment.
I will skip the discussion on the merits and problems of Frequentist and Bayesian practice here, as the focus is the former, with the ubiquitous appearance of p-values to size up the significance of observed effects as compared with a given hypothesis. And I will start by stating that there is nothing wrong in using p-values to report results, as long as we speak with scientists who know what we are talking about. The problem is that the same language, when it reaches non-scientists or more generally an audience that is not equipped with the required knowledge, generates all sorts of harmful side effects.
Or maybe I should say that the above is one problem. There is in fact another problem, maybe a more nagging one. The fact that there are arbitrary thresholds in what is considered "significant" by the scientific community distorts what is viewed as an important or an unimportant result. Null results - ones where the p-value is above 0.05, in most fields of research - are less sexy than "significant observations" (p<0.05) of something not predicted by the reference model. This causes a bias (a proven one) in the results that get published, and also all sorts of bad habits among scientists. They may be tempted to repeat an experiment whose null result came close to threshold more often than they would if they had found a much larger p-value.
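To see how much that last habit matters, here is a small toy simulation. It is only a sketch under assumptions of mine: every "study" measures a pure-null Gaussian quantity, the p-value is the one-sided Gaussian tail, and a hypothetical experimenter repeats the study (up to three times) whenever the p-value lands between 0.05 and an arbitrary "close to threshold" cut of 0.20. The function names and the cut are invented for illustration.

```python
import math
import random


def one_sided_p_value(z):
    """Tail probability of a standard normal above z."""
    return 0.5 * math.erfc(z / math.sqrt(2))


def significant_fraction(n_studies, max_repeats,
                         near_threshold=0.20, alpha=0.05, seed=1):
    """Fraction of null studies that end up 'significant'.

    Assumed (hypothetical) behaviour: a study is repeated, at most
    max_repeats times, only while its p-value falls in [alpha, near_threshold).
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_studies):
        for _attempt in range(1 + max_repeats):
            # Pure null: the measured z-score is just Gaussian noise.
            p = one_sided_p_value(rng.gauss(0.0, 1.0))
            if p < alpha:
                hits += 1      # "significant observation": stop and publish
                break
            if p >= near_threshold:
                break          # clearly null: give up on this study
            # otherwise p was tantalizingly close: try again
    return hits / n_studies


honest = significant_fraction(200_000, max_repeats=0)
biased = significant_fraction(200_000, max_repeats=3)
print(f"no repeats:         {honest:.4f}")
print(f"repeat near misses: {biased:.4f}")
```

With no repeats the fraction of false "observations" sits near the nominal 0.05. With the repeat rule the expected rate is 0.05 × (1 + 0.15 + 0.15² + 0.15³) ≈ 0.059, so the nominal threshold no longer means what it claims to mean even though no single analysis was done incorrectly.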
Now the ASA (American Statistical Association) has decided to voice its warnings. In a recently released document they state several important principles and a set of do's and don'ts. I think this is a step in the right direction, but I feel it will take much more than a statement to change matters for good. In any case I am happy to provide here some excerpts of the most important stated principles from the document.
- P-values can indicate how incompatible the data are with a specified statistical model.
- P-values do not measure the probability that the studied hypothesis is true, or the probability that the data were produced by random chance alone.
- Scientific conclusions and business or policy decisions should not be based only on whether a p-value passes a specific threshold.
- Proper inference requires full reporting and transparency.
- A p-value, or statistical significance, does not measure the size of an effect or the importance of a result.
- By itself, a p-value does not provide a good measure of evidence regarding a model or hypothesis.
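The point that a p-value does not measure the size of an effect can be illustrated with a few lines of arithmetic. The sketch below assumes the simplest possible setting, a two-sided z-test on a sample mean with known Gaussian resolution; the function name and the numbers are made up for the example.

```python
import math


def two_sided_p_value(observed_shift, sigma, n):
    """Two-sided z-test p-value for a sample mean.

    observed_shift: difference between sample mean and null expectation
    sigma:          known per-measurement standard deviation (assumed)
    n:              number of independent measurements
    """
    z = observed_shift / (sigma / math.sqrt(n))
    return math.erfc(abs(z) / math.sqrt(2))


# A tiny effect measured a million times...
p_tiny_effect = two_sided_p_value(0.01, sigma=1.0, n=1_000_000)
# ...versus an effect fifty times larger measured only ten times.
p_large_effect = two_sided_p_value(0.5, sigma=1.0, n=10)

print(f"shift 0.01, n=1e6: p = {p_tiny_effect:.3g}")
print(f"shift 0.5,  n=10:  p = {p_large_effect:.3g}")
```

The tiny shift comes out overwhelmingly "significant" (z = 10), while the much larger one does not even pass p<0.05. The p-value mixes effect size with sample size, which is why it cannot stand in for either.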
Since this blog focuses on particle physics, I am compelled to say that I have discussed the matter of statistical significance (a number connected by a one-to-one map to a corresponding p-value) in excruciating detail in a seminar that I gave in Kolimbari, Athens, Louvain-la-Neuve, and Warsaw. This also resulted in a paper. In a nutshell, in the paper I explain that particle physicists have stuck to the five-sigma significance threshold for discovery claims, and I argue that there are reasons why we should be more flexible. Some more discussion is also offered in this blog post.
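For readers who want to play with the one-to-one map between significance and p-value mentioned above, here is a minimal sketch. It assumes the one-sided Gaussian tail convention that is customary in particle physics; the function names are mine, and only the Python standard library is used.

```python
import math
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # mean 0, standard deviation 1


def p_value_from_sigma(z):
    """One-sided tail probability of a unit Gaussian beyond z sigma."""
    return 0.5 * math.erfc(z / math.sqrt(2))


def sigma_from_p_value(p):
    """Inverse map: number of sigma whose one-sided tail equals p (0 < p < 1)."""
    # By symmetry the upper-tail quantile is minus the lower-tail one;
    # this avoids computing 1 - p for very small p.
    return -_STD_NORMAL.inv_cdf(p)


for z in (1, 2, 3, 5):
    print(f"{z} sigma -> p = {p_value_from_sigma(z):.3g}")
```

In this convention the five-sigma discovery threshold corresponds to a p-value of about 2.9 × 10⁻⁷, and three sigma to about 1.3 × 10⁻³.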