Following my strong belief that science dissemination, and open-borders science, is too important a goal to be constrained by fears of being stripped of good ideas and scooped by fast competitors, I am offering here some ideas on a research plan I am going to follow in the coming months.

The benefits of sharing thoughts early on are evident: by reading about them below, you may be struck with a good idea which further improves my plan, and decide to share it with me; you might become a collaborator - which would add to the personpower devoted to the research. You might point out problems or issues to address, or mention that some or all of the research has already been done by somebody else, and published - which would save me a lot of time!

So, here is what I plan to study. You might have read in this blog (here or here) about RanBox, an unsupervised learning algorithm I developed to perform anomaly detection searches in standardized multidimensional spaces. I recently published a preprint describing the algorithm and detailing its performance on a set of different use cases, ranging from HEP to neutrino physics to fraud detection - the preprint is here, and it has been submitted for publication to Computer Physics Communications. Now, the idea is to extend the method of RanBox to semi-supervised searches.

While in an unsupervised search the algorithm is shown some data and tasked with making sense of it without knowing the origin of each data example - signal or background, in typical applications - in semi-supervised learning you provide it with information on the identity of a part of the data. In our case, this will be the signal component.

The task I am considering is that of searching for a signal of well-known properties in a background which, while generally accessible (large amounts of background-dominated data exist), cannot be precisely modeled by simulation or other means. I have a specific use case in hand, coming from a search for a rare b-physics process in CMS data, but that is irrelevant here. Let us consider what we could do using the technology of RanBox - but first, let me summarize what RanBox does in an unsupervised setup.

What RanBox does


The original version of the algorithm takes into account that HEP data - particularly at the Large Hadron Collider (LHC), the accelerator hosting the CMS experiment for which the algorithm was originally designed - come with features that are highly non-uniform: the momenta of particles and the reconstructed masses and energies of processes are heavily asymmetric, with densities that decrease by orders of magnitude as you go from one end of the spectrum of a measured feature to the other. This is due to the fact that the protons we collide at the LHC are constituted by partons that carry a fraction of the proton's energy - and that fraction is most likely very small; the corresponding distribution functions are called "parton distribution functions", and they drive much of the phenomenology of proton-proton collisions.

So if you want to design an algorithm capable of focusing on a localized overdensity of the data in a multi-dimensional space, you first have to standardize the features, such that the algorithm does not immediately focus on the low-momentum, low-energy part of the spectra. This can be accomplished by the integral transform, a stretching and shrinking of each of the observable features of the data such that each transformed variable looks perfectly flat - it has a uniform distribution between 0 and 1. What you obtain if you apply the integral transform to all the features of multidimensional data is what is called the "Copula space". The information about the interrelation of the features has not disappeared: it is now entirely contained in a unit hypercube.
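
To make this concrete, here is a minimal sketch of the empirical integral transform for a single feature, written in C++ like the RanBox code (the function name and signature are my own, not taken from the actual implementation): each value is replaced by its rank in a reference sample, divided by the sample size. When the reference sample is the data itself, the output is flat in (0,1) by construction.

```cpp
#include <algorithm>
#include <vector>

// Empirical integral transform: map values of one feature to (0,1) using the
// empirical cumulative distribution of a reference sample (here, the data).
std::vector<double> integral_transform(const std::vector<double>& reference,
                                       const std::vector<double>& x) {
  std::vector<double> sorted(reference);
  std::sort(sorted.begin(), sorted.end());
  std::vector<double> u(x.size());
  for (size_t i = 0; i < x.size(); ++i) {
    // Number of reference values below x[i], i.e. the empirical CDF at x[i].
    size_t rank = std::lower_bound(sorted.begin(), sorted.end(), x[i]) - sorted.begin();
    u[i] = (rank + 0.5) / sorted.size();   // flat in (0,1) when x == reference
  }
  return u;
}
```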

RanBox executes this transformation, and a few more preprocessing tricks, and then searches for overdensities in the multi-D space by casting a small "box" and predicting how many data points the box is expected to contain under a uniformity hypothesis. The prediction is simple: if you have N events in the total space, the predicted number of events in a box of volume V is just N*V, since the total space has unit volume! The ratio between observed and predicted events becomes the test statistic that the algorithm maximizes, in search of the most striking event excess.
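
In code, the counting logic of this figure of merit could look like the following minimal sketch (the Box structure and function names are my own illustration, not the actual RanBox implementation):

```cpp
#include <vector>

// A multidimensional interval in the unit hypercube: one [lo, hi] range per feature.
struct Box { std::vector<double> lo, hi; };

double box_volume(const Box& b) {
  double v = 1.0;
  for (size_t d = 0; d < b.lo.size(); ++d) v *= (b.hi[d] - b.lo[d]);
  return v;
}

bool inside(const std::vector<double>& point, const Box& b) {
  for (size_t d = 0; d < b.lo.size(); ++d)
    if (point[d] < b.lo[d] || point[d] > b.hi[d]) return false;
  return true;
}

// Test statistic of the unsupervised search: observed events in the box over
// the N*V expectation from the uniformity hypothesis.
double obs_over_exp(const std::vector<std::vector<double>>& data, const Box& b) {
  size_t n_obs = 0;
  for (const auto& point : data)
    if (inside(point, b)) ++n_obs;
  double n_exp = data.size() * box_volume(b);
  return (n_exp > 0.0) ? n_obs / n_exp : 0.0;
}
```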

Going semi-supervised

Now, the idea of the semi-supervised version of RanBox (which I might end up calling "FlatBox", for reasons that will become clear in a minute) is to use the real data to define the integral transformation, deriving a data-driven Copula space which constitutes the sample where a box is sought, as in RanBox. At variance with RanBox, however, the box does not try to maximize the ratio of observed versus predicted events. In a given box, one may instead determine what fraction of simulated signal events is captured.

A test statistic may then be constructed by considering the ratio of the captured signal to the square root of the predicted number of contained data events (in reality I will use a more accurate function than that, but it's a detail). The logic is that such a test statistic is a proxy for the significance of the excess of events due to the signal in the box.
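
In its simplest form, that figure of merit is just the following one-liner (a sketch only; as said above, a more accurate significance approximation will be used in practice):

```cpp
#include <cmath>

// Proxy for the significance of a signal excess: captured signal over the
// square root of the predicted background in the box.
double significance_proxy(double captured_signal, double predicted_background) {
  if (predicted_background <= 0.0) return 0.0;   // guard against an empty prediction
  return captured_signal / std::sqrt(predicted_background);
}
```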

Note an important point: the metric change that defines a Copula space using real data (which are background-rich) is not going to produce a Copula space for signal events, because signal events have different marginal densities than the background! Hence, the transformation (which we learn from real data) will produce non-flat signal distributions, and it will be easy for the box to locate high-density signal regions by maximizing a test statistic like the one above.
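
As a hypothetical usage of the integral_transform sketch shown earlier (data_feature and signal_feature are placeholder vectors holding one feature of the real data and of the simulated signal, respectively), the map is learned from the data and applied to both samples; the transformed data are flat by construction, while the transformed signal is not:

```cpp
// The same data-derived transform applied to data (flat) and to signal (non-flat).
std::vector<double> u_data   = integral_transform(data_feature, data_feature);
std::vector<double> u_signal = integral_transform(data_feature, signal_feature);
```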

The semi-supervision allows us to locate by gradient descent the multidimensional interval - the box - which contains a large amount of signal while containing the smallest possible amount of background. Now, you might stop me and say "well, doh, isn't the background expectation proportional to the box volume anyway?" Well, yes, but here comes a trick.

Since the multi-dimensional features in the background are correlated, there will be regions of the space which are more densely populated with background, and regions that are more sparsely populated. So it is more accurate to predict the background in a box by looking at the multi-dimensional "sidebands" that you may construct by considering a box of twice the volume of the search box, centred on it. This is also done by RanBox (you may specify a parameter to enable it in the code).
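
A minimal sketch of such a sideband prediction, under my own reading of the idea (the structure and function names are hypothetical, and the actual RanBox implementation may differ): a concentric box of twice the volume is obtained by scaling each side by the D-th root of two, and the data count in the shell between the two boxes, rescaled by the volume ratio, predicts the background inside the search box.

```cpp
#include <algorithm>
#include <cmath>
#include <vector>

struct Box { std::vector<double> lo, hi; };   // one [lo, hi] interval per feature

double box_volume(const Box& b) {
  double v = 1.0;
  for (size_t d = 0; d < b.lo.size(); ++d) v *= (b.hi[d] - b.lo[d]);
  return v;
}

bool inside(const std::vector<double>& p, const Box& b) {
  for (size_t d = 0; d < b.lo.size(); ++d)
    if (p[d] < b.lo[d] || p[d] > b.hi[d]) return false;
  return true;
}

// Concentric box with twice the volume: each side is scaled by 2^(1/D),
// clipped to the unit hypercube.
Box enlarge_to_double_volume(const Box& b) {
  const size_t D = b.lo.size();
  const double scale = std::pow(2.0, 1.0 / D);
  Box big = b;
  for (size_t d = 0; d < D; ++d) {
    double centre = 0.5 * (b.lo[d] + b.hi[d]);
    double half   = 0.5 * (b.hi[d] - b.lo[d]) * scale;
    big.lo[d] = std::max(0.0, centre - half);
    big.hi[d] = std::min(1.0, centre + half);
  }
  return big;
}

// Background prediction for the search box: count data in the surrounding
// shell and rescale by the ratio of box to shell volumes.
double sideband_prediction(const std::vector<std::vector<double>>& data, const Box& box) {
  Box big = enlarge_to_double_volume(box);
  size_t n_shell = 0;
  for (const auto& p : data)
    if (inside(p, big) && !inside(p, box)) ++n_shell;
  double v_shell = box_volume(big) - box_volume(box);
  return (v_shell > 0.0) ? n_shell * box_volume(box) / v_shell : 0.0;
}
```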

Here, though, a problem arises. If the algorithm maximizes, by moving the box around, a figure of merit which has the background prediction in the denominator, it is certain that the background prediction will end up being underestimated - this is an intrinsic bias arising from the multiple testing you perform as you move the box around.

So what you can do is design a second sideband, and thus have three boxes nested one inside the other. You will use the first sideband for the original background prediction, and for the maximization of the test statistic as the box moves around. The second will provide an independent sample used for the actual prediction you will perform at the end, once you have identified the most promising search box.
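
Here is a minimal sketch of that nesting, as I read it (the volume factors and names are my own illustration, and clipping to the unit hypercube is omitted for brevity): three concentric boxes define two disjoint shells, with the inner shell used during the optimization and the outer shell reserved for the final background prediction.

```cpp
#include <cmath>
#include <vector>

struct Box { std::vector<double> lo, hi; };

bool inside(const std::vector<double>& p, const Box& b) {
  for (size_t d = 0; d < b.lo.size(); ++d)
    if (p[d] < b.lo[d] || p[d] > b.hi[d]) return false;
  return true;
}

// Concentric box whose volume is `factor` times the original one.
Box enlarge(const Box& b, double factor) {
  const double scale = std::pow(factor, 1.0 / b.lo.size());
  Box out = b;
  for (size_t d = 0; d < b.lo.size(); ++d) {
    double centre = 0.5 * (b.lo[d] + b.hi[d]);
    double half   = 0.5 * (b.hi[d] - b.lo[d]) * scale;
    out.lo[d] = centre - half;
    out.hi[d] = centre + half;
  }
  return out;
}

// Data counts in the two disjoint shells around the search box: the inner one
// feeds the figure of merit during the optimization, the outer one is kept
// untouched for the final, unbiased background prediction.
void shell_counts(const std::vector<std::vector<double>>& data, const Box& box,
                  size_t& n_inner_shell, size_t& n_outer_shell) {
  Box mid   = enlarge(box, 2.0);   // illustrative volume factors
  Box outer = enlarge(box, 3.0);
  n_inner_shell = n_outer_shell = 0;
  for (const auto& p : data) {
    if      (inside(p, mid)   && !inside(p, box)) ++n_inner_shell;
    else if (inside(p, outer) && !inside(p, mid)) ++n_outer_shell;
  }
}
```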

The details are still sketchy, but the idea is clear. I will code it into the original RanBox algorithm's C++ implementation in the next few weeks, and then test it on some real physics signals that I know exist in CMS data, to validate the potential of the technique. And then we will be ready to search for our real signal... But we will also need to demonstrate that the multiple testing that the gradient descent algorithm performs as it moves the search box around does not bias the results and the background predictions. This will be tricky to pull off, so I expect this work will take a few months to complete.

So that's it. I hope you will now come up with suggestions, objections, scepticism, or encouragement. Your call!


---

Tommaso Dorigo (see his personal web page here) is an experimental particle physicist who works for the INFN and the University of Padova, and collaborates with the CMS experiment at the CERN LHC. He coordinates the MODE Collaboration, a group of physicists and computer scientists from eight institutions in Europe and the US who aim to enable end-to-end optimization of detector design with differentiable programming. Dorigo is an editor of the journals Reviews in Physics and Physics Open. In 2016 Dorigo published the book "Anomaly! Collider Physics and the Quest for New Phenomena at Fermilab", an insider view of the sociology of big particle physics experiments. You can get a copy of the book on Amazon, or contact him to get a free pdf copy if you have limited financial means.