This week's edition of the CMS Times features a short piece by A. Rao that makes a few points on the issue of correct statistical analysis of high-energy physics data.

If you read the piece, you will find a few quotes from me. Since January 1st I have in fact been the chair of the CMS Statistics Committee, the body charged with overseeing and recommending good statistical practice in the production of our physics results. This should not come as a big surprise to long-time readers of this blog, since I have expressed my interest in statistics several times, and I have also tried in various ways to draw attention to often overlooked subtleties, which at times catch professionals just as unprepared and unaware as laymen.

The task of leading the Statistics Committee of CMS, a 3000-strong experiment that publishes O(100) new physics results per year, will take up most of my research time for the next couple of years. But I will be little more than a coordinator: luckily, I can count on a crew of ten super-selected members (why, I did select them!). Most of them know more statistics than you and I do, and three in particular are probably on a par with the most statistics-learned physicists on the planet. I feel (and am) a dwarf in comparison to them.

Among the first few challenges we will take on is the issue of unfolding. Unfolding is a statistical technique whereby one infers the original distribution of a quantity from the observed one, undoing the smearing effect of the detector. Imagine, for instance, that a physics process produces two jets with a flat distribution of the angle between them: that means you should see as many jet pairs separated by 30 degrees as by 60 degrees, 90 degrees, and so on. Now, if your detector is not perfect, jet pairs produced at some particular angle might be collected with less-than-100% efficiency. You will then observe a dip in the angle distribution. Unfolding techniques may allow you to infer the original distribution.
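To make the idea concrete, here is a toy sketch in Python (invented numbers, not CMS code): a flat true spectrum is distorted by a detector response matrix that includes bin-to-bin migrations and an inefficiency in one bin, producing exactly the kind of dip described above; the crudest possible unfolding, a plain matrix inversion, then recovers the flat truth in this noiseless example.

```python
# A toy illustration with invented numbers (not a CMS recipe). R[i, j] is the
# probability that an event generated in truth bin j is reconstructed in
# observed bin i; bin 3 suffers an extra efficiency loss, producing the dip.
import numpy as np

n_bins = 6
truth = np.full(n_bins, 1000.0)            # flat true angular spectrum

R = np.zeros((n_bins, n_bins))
for j in range(n_bins):
    R[j, j] += 0.8                         # 80% reconstructed in the right bin
    if j > 0:
        R[j - 1, j] += 0.1                 # 10% migrate to each neighbour
    if j < n_bins - 1:
        R[j + 1, j] += 0.1
R[:, 3] *= 0.6                             # 40% of events in truth bin 3 are lost

observed = R @ truth                       # the distorted spectrum the detector sees
unfolded = np.linalg.solve(R, observed)    # naive unfolding by matrix inversion

print("observed:", observed.round(1))      # shows the dip around bin 3
print("unfolded:", unfolded.round(1))      # recovers the flat truth (no noise here)
```

With real data the observed counts fluctuate statistically, and a plain inversion tends to amplify those fluctuations wildly, which is precisely why more sophisticated (and more easily misused) techniques exist.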

One should ask: what is the purpose of unfolding the data? Can't one take the theoretical model one wants to compare to, and pass its jet pairs through a simulated detector response? The resulting distribution would be directly comparable with the observed one!

The answer is yes, one can, and that is what is usually done; but imagine that one day some other theorist comes up with a different model. If the experiment is over, that theorist will have a hard time figuring out how to smear his jet pairs in order to compare them with the published data. Publishing the unfolded angular distribution allows a future theorist to make much better use of the experimental result.
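For comparison, this is what the forward-folding strategy looks like in the same toy language (again with invented numbers): the theory prediction is smeared with the detector response and compared directly to the observed counts, with no attempt to undo the detector effects.

```python
# A minimal forward-folding sketch with made-up numbers: smear a hypothetical
# theory spectrum with a toy response matrix and compare it to "observed" counts.
import numpy as np

n_bins = 4
# 85% of events reconstructed in the right bin, 15% spill into the next one
R = 0.85 * np.eye(n_bins) + 0.15 * np.eye(n_bins, k=-1)

theory = np.array([120.0, 100.0, 80.0, 60.0])    # hypothetical model prediction
observed = np.array([105.0, 110.0, 80.0, 68.0])  # invented measured counts

folded = R @ theory                              # prediction at detector level
chi2 = np.sum((observed - folded) ** 2 / folded) # crude goodness-of-fit measure
print("folded prediction:", folded.round(1), " chi2/ndf =", round(chi2 / n_bins, 2))
```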

The problem with the unfolding techniques used in high-energy physics is that they often produce meaningless results if one is not very, very careful. In particular, there is a nasty method called "bin-by-bin correction" which ignores the correlations between the different measurements, and may at times produce unfolded spectra with bogus error bars. Other methods, based on iterative procedures, may also lead to an underestimate of the statistical uncertainties. But these are just examples: the range of techniques is vast, and there is no agreed-upon, recommendable technique that we may enforce in data analysis. I hope this will soon change.
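As an illustration of the first of these pitfalls, the sketch below (invented numbers again) derives bin-by-bin correction factors from a flat simulated spectrum and applies them to data whose true shape is not flat; comparing with an exact matrix inversion shows how the method can go astray when migrations between bins matter.

```python
# A toy sketch of the bin-by-bin correction method, with invented numbers.
import numpy as np

n_bins = 4
# Toy response with bin-to-bin migrations: 70% stay, 15% go to each neighbour
R = 0.7 * np.eye(n_bins) + 0.15 * np.eye(n_bins, k=1) + 0.15 * np.eye(n_bins, k=-1)

truth_mc = np.array([100.0, 100.0, 100.0, 100.0])   # flat truth assumed in simulation
reco_mc = R @ truth_mc                              # simulated reconstructed spectrum
C = truth_mc / reco_mc                              # bin-by-bin correction factors

truth_data = np.array([60.0, 90.0, 120.0, 150.0])   # what nature actually did
data = R @ truth_data                               # observed spectrum

print("bin-by-bin       :", (C * data).round(1))
print("matrix inversion :", np.linalg.solve(R, data).round(1))
print("true spectrum    :", truth_data)
# The bin-by-bin result is biased wherever the data shape differs from the
# simulated one, and errors obtained by simply scaling sqrt(N) by C ignore the
# correlations that the migrations induce between neighbouring bins.
```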

Among the other issues that the Statistics Committee of a large HEP experiment needs to address is how to compute upper limits on searched-for processes. This is another complex topic, on which views differ. Not only does it pit physicists of the Frequentist school of thought against those of the Bayesian one: there are many subtleties within each of the two classes of methods. Putting order in the matter is a tough job.
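To give a flavour of the divide in the simplest possible setting, a Poisson counting experiment with a known background, here is a hedged sketch (the observed counts and backgrounds in the loop are arbitrary): the classical limit is obtained by finding the signal rate at which observing as few events as we did becomes improbable, while the Bayesian limit integrates a posterior built with a flat prior on the signal rate.

```python
# A minimal sketch (assumed setup, not CMS machinery): 95% CL upper limits on a
# signal rate s in a Poisson counting experiment with n observed events and a
# known background expectation b, comparing a classical Frequentist limit with
# a Bayesian one that uses a flat prior on s >= 0.
import numpy as np
from scipy.stats import poisson
from scipy.optimize import brentq
from scipy.integrate import quad

S_MAX = 200.0   # safely above any limit of interest in these examples

def frequentist_upper_limit(n_obs, b, cl=0.95):
    """The s for which P(N <= n_obs | s + b) = 1 - cl (classical construction)."""
    return brentq(lambda s: poisson.cdf(n_obs, s + b) - (1.0 - cl), 0.0, S_MAX)

def bayesian_upper_limit(n_obs, b, cl=0.95):
    """The s below which a fraction cl of the flat-prior posterior lies."""
    likelihood = lambda s: poisson.pmf(n_obs, s + b)
    norm = quad(likelihood, 0.0, S_MAX)[0]
    return brentq(lambda s_up: quad(likelihood, 0.0, s_up)[0] / norm - cl,
                  0.0, S_MAX)

for n_obs, b in [(0, 0.5), (3, 1.2), (10, 8.5)]:   # arbitrary toy inputs
    print("n =", n_obs, " b =", b,
          " frequentist:", round(frequentist_upper_limit(n_obs, b), 2),
          " bayesian:", round(bayesian_upper_limit(n_obs, b), 2))
```

Even in this textbook case the two numbers differ for the same inputs, and matters become considerably more delicate once systematic uncertainties and more refined constructions enter the game.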

The most important task of the Statistics Committee is anyway not to deal with the "massimi sistemi" (the grand questions) above, but rather to handle the day-to-day questions raised by the collaborators, and to check that analyses do not fall into some of the typical traps of bad statistical practice. Physicists are autarchic: because they are smart, they typically tend to believe they can solve whatever problem by themselves. But this is a misconception, and it flies in the face of the fact that there is a huge literature in professional statistics where the same problems are solved in better ways. Bringing order to the autarchy of my colleagues is by far the most arduous task!