Anomaly Detection: When Old Statistics School May Still Beat Super-Duper Machine Learning

One of the most surprising results of the "Machine Learning for Jets" workshop (really a workshop on machine learning for particle physics in general), which I attended in New York City two weeks ago, was the outcome of a challenge the organizers had proposed to the participants: find a hidden signal of some new physics process in a dataset otherwise made up of physics background, given no information on either the new physics or the model of the background.

In statistical terms, the problem is one of anomaly detection. In other words, you have an otherwise homogeneous dataset (with many different features per event; okay, per "example" if you are a statistician), which may or may not be contaminated by a small number of extraneous events drawn from a different multi-dimensional probability density function.

The problem of detecting a small contamination in a "regular" dataset is a very common one, and it does not arise only in physics research: on the contrary, it is ubiquitous. Quality control in a production line, fraud detection, and spam filters are just three examples of applications of anomaly detection in the outside world. Physicists, however, believe their anomalies are more interesting, rarer, and have peculiar characteristics, and so they want to develop customized tools. On top of that motivation, of course, there is the fact that trying to outsmart the rest of the world because, you know, physicists know better, is FUN!

I discussed in a couple of recent posts an anomaly finder I created last spring, RanBox, to find small new physics signals in large amounts of collider data. You might imagine that I would not miss the chance to try my new tool on the challenge put forth by the ML4Jets folks. But no: after I presented the algorithm at a CERN machine learning workshop last June, I put it back in the drawer, as I had no more time to devote to it, nor did I have interested students who could pick it up. Both things have now changed, and in the next few months I will dig it up and work on it with a couple of bright young collaborators.

But this post is not specifically about my RanBox algorithm. In fact, I want to discuss the results of the challenge (dubbed "LHC Olympics"), which were surprising for many of the participants in the workshop, but not for me or for those who know a little bit of statistics.

The signal that had been injected in an otherwise "normal" data sample of events with hadronic jets was a new physics process that created resonant decays, which were however only evident in some of the kinematic distributions one could produce with the event features.

A dozen groups tried their hand at the data, most of them approaching the problem with machine learning algorithms of different sorts. One tool which was supposed to perform well is the "variational autoencoder", an unsupervised neural network which "encodes" the information of every example in the data in a way that summarizes, as well and as economically as possible, all the relevant information. The variational autoencoder looks like a neural network with a decreasing number of hidden nodes in each successive layer, until a bottleneck is reached; after that, the subsequent hidden layers symmetrically replicate the structure of the first half, ending with an output that has the same structure as the input. The goal of the machine is to make the output resemble the input, so that the compression and decompression cause as little loss of information as possible. When you give such a machine an event of a kind it has not seen before, belonging to a different class, the output reproduces the input poorly, and the event can be easily categorized as anomalous.
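To make the compress-and-decompress idea concrete, here is a minimal sketch. It is not a variational autoencoder and not what any of the challenge participants ran: it swaps in the simplest possible stand-in, a linear autoencoder with two input features and a one-node bottleneck, whose optimal weights coincide with the first principal axis of the data and can therefore be computed in closed form. The toy data, the function names, and the choice of inspecting the 20 worst-reconstructed events are all assumptions made for illustration.

```python
import math
import random

def fit_linear_autoencoder(events):
    """Fit a 2 -> 1 -> 2 linear autoencoder in closed form.

    For a linear network the best bottleneck direction is the first
    principal axis of the training data, so no gradient descent is needed.
    """
    n = len(events)
    mx = sum(x for x, _ in events) / n
    my = sum(y for _, y in events) / n
    cxx = sum((x - mx) ** 2 for x, _ in events) / n
    cyy = sum((y - my) ** 2 for _, y in events) / n
    cxy = sum((x - mx) * (y - my) for x, y in events) / n
    theta = 0.5 * math.atan2(2.0 * cxy, cxx - cyy)  # angle of the principal axis
    return (mx, my), (math.cos(theta), math.sin(theta))

def reconstruction_error(event, mean, axis):
    """Squared distance between an event and its encoded-then-decoded copy."""
    dx, dy = event[0] - mean[0], event[1] - mean[1]
    code = dx * axis[0] + dy * axis[1]        # encode: one number per event
    rx, ry = code * axis[0], code * axis[1]   # decode: back to two features
    return (dx - rx) ** 2 + (dy - ry) ** 2

random.seed(1)
# Hypothetical toy data: background events lie close to the line y = 2x,
# while a 1% contamination ignores that correlation.
background = []
for _ in range(2000):
    x = random.gauss(0.0, 1.0)
    background.append((x, 2.0 * x + random.gauss(0.0, 0.1)))
anomalies = [(random.gauss(0.0, 1.0), random.gauss(0.0, 1.0)) for _ in range(20)]
events = background + anomalies

# Train on the contaminated sample, as one would have to in the challenge.
mean, axis = fit_linear_autoencoder(events)
errors = [reconstruction_error(e, mean, axis) for e in events]

# Rank events by reconstruction error; the worst-reconstructed are the candidates.
ranked = sorted(range(len(events)), key=lambda i: errors[i], reverse=True)
top = ranked[:20]
n_true = sum(1 for i in top if i >= len(background))
print(f"{n_true} of the 20 worst-reconstructed events are injected anomalies")
```

The sketch only works because the background has a structure the bottleneck can capture and the anomalies do not share it; a signal that happened to lie along the same line would be reconstructed just as well as the background and would go unnoticed.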

Among the many submitted solutions to the challenge, one stood out for two reasons. The first is that it was the only one submitted by a group of outsiders: astrophysicists from Berkeley, who had found the challenge interesting enough to spend their time on it. The other reason why their solution stood out was that they had used a "simple" density estimation tool, based on the principle of standardizing the data in a way extremely similar to what my own algorithm, RanBox, does.

Lo and behold, the statistical method based on simple density estimation ended up providing by far the most accurate answer: not only could the algorithm find most of the anomalous events and estimate the mass of the injected particle signal, it also correctly assessed the size of the anomalous dataset, something the other groups did not come close to doing.
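The flavor of a density-based search can be conveyed with a toy sketch. This is not the Berkeley group's method, nor RanBox; it assumes, purely for illustration, that "standardizing" means replacing each feature by its rank, so that every marginal distribution becomes flat on the unit interval. If the background features are independent (a strong assumption that holds in this toy by construction), the density on the unit square is then flat too, and a localized signal shows up as an overfull cell. All names and numbers below are hypothetical.

```python
import math
import random

def rank_transform(values):
    """Map values to (0, 1) by their rank, so the marginal becomes flat."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    u = [0.0] * len(values)
    for rank, i in enumerate(order):
        u[i] = (rank + 0.5) / len(values)
    return u

def densest_cell(u, v, bins):
    """Histogram the unit square and return the fullest cell and its count."""
    counts = {}
    for a, b in zip(u, v):
        cell = (int(a * bins), int(b * bins))
        counts[cell] = counts.get(cell, 0) + 1
    best = max(counts, key=counts.get)
    return best, counts[best]

random.seed(2)
n_bkg, n_sig, bins = 5000, 150, 10
# Hypothetical toy: two independent background features, plus a narrow
# cluster standing in for a resonance.
xs = [random.expovariate(1.0) for _ in range(n_bkg)]
ys = [random.gauss(0.0, 1.0) for _ in range(n_bkg)]
xs += [random.gauss(1.07, 0.02) for _ in range(n_sig)]
ys += [random.gauss(0.40, 0.02) for _ in range(n_sig)]

u, v = rank_transform(xs), rank_transform(ys)
cell, observed = densest_cell(u, v, bins)
expected = len(xs) / bins ** 2        # flat density if features are independent
excess = observed - expected          # crude estimate of the signal size
z = excess / math.sqrt(expected)      # rough Gaussian significance
print(f"cell {cell}: observed {observed}, expected {expected:.1f}, z = {z:.1f}")
```

The excess is only a crude estimate of the signal size, since the signal events themselves distort the ranks, and the fixed grid is a shortcut: a real search would have to scan over cell positions and sizes and account for the number of places it looked.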

So, a big success for the Berkeley group and a little embarrassment for the physicists who saw outsiders beat them at their own game. But was it really a surprise? To many, it indeed was. Why, we have been told that neural networks can outsmart everything... But to me it was not really a surprise.

The reason is that anomaly detection is an unsupervised learning task. What this means is that you are faced with a poorly defined statistical problem. In supervised classification, for example, you know the features of the signal and the background, and all you have to do is figure out the best way to discriminate between them; in unsupervised learning tasks, by contrast, the signal is not known.

In statistics there is a lemma, the Neyman-Pearson lemma, which explains that for "simple" hypothesis testing (when e.g. you want to compare the "null" hypothesis that your data is drawn only from a background distribution with an "alternative" hypothesis that the data contains both background and a specified signal) the most powerful test statistic is the likelihood ratio of the two densities (describing the alternative and the null). No machine learning or god-given algorithm can do better than that. On the other hand, if you do NOT know the density of the signal, then the alternative hypothesis is unspecified. In that situation no test statistic can claim to be the most powerful in distinguishing the null and alternative hypotheses, as the power of any given test statistic will depend on the unknown features of the signal. In other words, it does not matter how fast you run if you don't know where you are going.
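In symbols, the lemma can be sketched as follows for the background-only versus signal-plus-background case. The notation is an assumption introduced here, not taken from the challenge: $p_B$ and $p_S$ are the background and signal densities, $f$ is the signal fraction, $\alpha$ is the chosen size of the test, and $k_\alpha$ is the corresponding threshold (taking $\Lambda$ to be continuous, so that the size can be matched exactly).

```latex
% H_0: background only; H_1: background plus a fully specified signal.
H_0:\; x \sim p_B(x), \qquad
H_1:\; x \sim (1-f)\,p_B(x) + f\,p_S(x), \quad 0 < f \le 1 .

% Likelihood ratio for n independent events x_1, \dots, x_n.
\Lambda(x_1,\dots,x_n)
  = \prod_{i=1}^{n} \frac{(1-f)\,p_B(x_i) + f\,p_S(x_i)}{p_B(x_i)}
  = \prod_{i=1}^{n} \left[ (1-f) + f\,\frac{p_S(x_i)}{p_B(x_i)} \right].

% Neyman-Pearson: among all tests of size \alpha, rejecting H_0 when
% \Lambda > k_\alpha has the largest power, where k_\alpha is fixed by
P\!\left( \Lambda > k_\alpha \mid H_0 \right) = \alpha .
```

Every factor of $\Lambda$ contains $p_S$ and $f$. When neither is specified, $\Lambda$ cannot even be written down, and any statistic used in its place is optimal only for the particular signals it happens to resemble.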

So, the win of a basic statistical learning tool over complex deep learning tools should not surprise you. And, since the future of machine learning is in unsupervised learning problems, as many experts in the field have been pointing out for some time now, you can well see how good old statistics practice remains very much in the spotlight, and will stay there in the future.
