Yesterday I worked from scratch on a problem which others have certainly already solved in the past. I have mixed feelings about such situations: on one side I hate to reinvent the wheel, especially if there is an easy way to access a good solution; on the other I love to invent new ones...


Anyway, this time I have decided to ask you for some help, as collectively we may have a better idea of the optimal solution to the specific problem I am trying to address. But before I explain the problem, let me give you some background on the general context.

Searches for new physics at the LHC

We are of course talking about hadron collider physics at the high-energy frontier: the analysis of LHC proton-proton collisions collected by the CMS experiment. And we are talking about searches for new physics.

Now, there is a new physics model I want to investigate, which depends on several parameters. These model parameters influence the phenomenology of the produced signal in LHC collisions, such that the observable kinematics of the events of interest depend on them. But they do so in a non-trivial manner: as opposed to an easy (and not atypical) situation where, say, increasing the hypothetical mass of an unknown new particle makes all its decay products acquire proportionally larger momenta, here a linear change of one parameter produces non-linear modifications of the kinematics.

If one wants to conduct a broad search for the observable signatures of the new physics model, one needs to "scan" the model parameter space. In principle one would like to design a single analysis of the data which is sensitive to the unknown process regardless of the model parameters, but in this case that is not possible: the model produces quite different situations depending on the values of those parameters. One then has to try and identify "macro-regions" of the parameter space such that a small set of different data selection strategies can "cover" as many of the new physics manifestations as possible.

What we are arriving at is the concept of "benchmarks": these are specific choices of the model parameters that identify typical situations for the resulting signatures. Benchmark models are routinely used to study the sensitivity of an experiment to the model, e.g. in comparison with other experiments or past searches; but here we are more concerned with categorizing the model's parameter space so that we can design the analysis in an effective way.

Cluster analysis

What is described above is a problem best addressed by what statisticians call "cluster analysis". There is a huge variety of techniques to group the elements of a set such that the resulting subsets contain elements similar to one another, and choosing the most appropriate method for one's specific application is no trivial matter. So let us look a bit more into the details of my specific task.

I have 52 different computer simulations of the characteristics of the events resulting from the new physics model; they correspond to 52 different choices of the model parameters. Of course I could generate more of them, but let us concentrate on these data samples as they are a good starting point.

By running a multi-dimensional hypothesis test (for the time being this is Zech's Energy test, if you care to know - the test statistic is the "potential energy" that a set of positive and negative charges would have, in a configuration where the positive charges belong to set 1 and the negative ones to set 2) on each of the 1326 pairs (51+50+49+...+1) of simulated datasets, I can produce 1326 p-values.
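For the curious, below is a minimal sketch of what such a test can look like in practice, written in Python with numpy and scipy. It is a permutation-based energy test in the spirit of the one described above, with the logarithmic distance weighting R(r) = -ln(r); the function names, the small-distance cutoff, and the number of permutations are illustrative choices of mine, not the actual implementation used in the analysis.

```python
import numpy as np
from scipy.spatial.distance import cdist, pdist

def energy_statistic(x, y, eps=1e-6):
    """Energy statistic for two samples x (n, d) and y (m, d): the "potential
    energy" of positive charges (set 1) and negative charges (set 2), with the
    logarithmic weight R(r) = -ln(r) and a small cutoff to regularize r -> 0."""
    n, m = len(x), len(y)
    R = lambda r: -np.log(np.maximum(r, eps))
    phi_11 = R(pdist(x)).sum() / n**2          # positive-positive pairs
    phi_22 = R(pdist(y)).sum() / m**2          # negative-negative pairs
    phi_12 = R(cdist(x, y)).sum() / (n * m)    # positive-negative pairs
    return phi_11 + phi_22 - phi_12            # large values = dissimilar samples

def energy_test_pvalue(x, y, n_perm=1000, seed=None):
    """Permutation p-value: the fraction of random relabellings of the pooled
    events that give a statistic at least as large as the observed one."""
    rng = np.random.default_rng(seed)
    observed = energy_statistic(x, y)
    pooled = np.vstack([x, y])
    n = len(x)
    exceed = 0
    for _ in range(n_perm):
        perm = rng.permutation(len(pooled))
        if energy_statistic(pooled[perm[:n]], pooled[perm[n:]]) >= observed:
            exceed += 1
    return (exceed + 1) / (n_perm + 1)
```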

Technically, each of those p-values tells us how probable it would be to observe two datasets at least as discrepant as the ones at hand if they came from the same parent distribution - i.e. it quantifies how distinguishable the two corresponding parameter space points are. What I want to do is to use those p-values to identify a few among the 52 datasets - say ten - which are the "most typical", in the sense that taken together they describe as large a part of the whole parameter space as possible.

Mind you, here you should not be misled by the term "representative". There is no democracy at work here: it's not like I need to elect congressmen in the right proportion to the different political opinions of the voters. Here one single representative for all people thinking the same way would suffice, even if they were the large majority of the voters.

If done properly, the choice of those 10 "representative" datasets is such that each of the 42 left out is similar to at least one of the 10 selected ones. And now let us consider how we could turn this wish into a well-defined criterion: that is the question I want to discuss with you.

Different criteria, different answers

If we let the index "i" run from 1 to 10 and the index "j" run from 11 to 52, and agree that samples 1 to 10 are the chosen ones (by a procedure which is still to be determined) and 11 to 52 are the ones left out, we can concentrate on criteria employing the set of p_ij values connecting the chosen samples to the left-out ones (p_ij indicating the p-value returned by the two-sample test run on samples i and j), and then discuss their respective merits. For instance:

1) One could search for the selection which makes min_j(max_i(p_ij)) the highest. The pro of this choice is that "no parameter space point is left behind"; the con is that by concentrating on the outlier (the sample with the worst match to any of the ten selected ones) we lose touch with the collective behaviour of the 42 samples left out.

2) One could instead search for the selection which makes median_j(max_i(p_ij)) the highest; or concentrate on, say, the 10th percentile.

3) Or one could decide to fix a minimum acceptable p-value (say, 5%) and find the selection which maximizes the number of left-out sets which have max_i(p_ij)>0.05.

Other choices are of course possible. Whether one picks criterion 1, 2, 3, or something else depends on how important it is to scan the whole parameter space, as opposed to identifying benchmarks with good overall properties. It would be nice if you could suggest other choices for the selection criterion, or comment on the above.
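Just to fix ideas, here is a small sketch of how the three figures of merit above could be evaluated for a given candidate selection, assuming the 1326 p-values have been arranged into a symmetric 52 by 52 matrix P (the code and the names in it are only illustrative):

```python
import numpy as np

def figures_of_merit(P, chosen, threshold=0.05):
    """Evaluate the three candidate criteria for a given choice of benchmarks.
    P is a symmetric (52, 52) matrix of pairwise p-values; 'chosen' lists the
    indices of the ~10 selected benchmark samples."""
    chosen = np.asarray(chosen)
    left_out = np.setdiff1d(np.arange(P.shape[0]), chosen)
    # for each left-out sample j, its best match among the chosen ones: max_i p_ij
    best_match = P[np.ix_(chosen, left_out)].max(axis=0)
    return {
        "criterion_1": best_match.min(),                      # min_j max_i p_ij
        "criterion_2": float(np.median(best_match)),          # median_j max_i p_ij
        "criterion_3": int((best_match > threshold).sum()),   # left-out sets above 5%
    }
```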

The other issue is of course how to implement in practice the selection of the ten benchmarks, given any one of the above choices (or others). Once you ask yourself the question you realize you are in trouble: there is a huge number of possible ways to pick ten sets out of 52 (the number of distinct choices, "52 choose 10", is about 15.8 billion, if you care to know). We certainly cannot afford to evaluate any of the figures of merit above sixteen billion times. Or maybe there are shortcuts? I will let you ponder on this.
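To give an idea of the kind of shortcut I have in mind, one naive possibility is a greedy forward selection: build the set of benchmarks one at a time, each time adding the sample which most improves (say) criterion 1. A sketch is below; I make no claim that it lands anywhere near the true optimum - which is exactly why I am asking for better ideas.

```python
import numpy as np

def greedy_benchmarks(P, k=10):
    """Greedy forward selection (a cheap shortcut, with no optimality guarantee):
    at each step add the sample that most improves the worst-case coverage,
    i.e. criterion 1, the minimum over left-out samples of max_i p_ij."""
    n = P.shape[0]
    chosen = []
    for _ in range(k):
        best_score, best_cand = -1.0, None
        for cand in range(n):
            if cand in chosen:
                continue
            trial = chosen + [cand]
            left_out = [j for j in range(n) if j not in trial]
            score = P[np.ix_(trial, left_out)].max(axis=0).min()
            if score > best_score:
                best_score, best_cand = score, cand
        chosen.append(best_cand)
    return chosen
```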

So, there you are. Can you suggest a smart algorithm to implement one of the above criteria in a CPU-manageable way? I must say I have preferred to spend my time writing this post rather than working on the problem itself, as I find this entertaining... If you suggest a good procedure and we choose to implement it, I promise I will acknowledge you in the internal CMS document we will produce on the analysis!