The third "Machine Learning for Jets" workshop is taking place these days at the Kimmel Center of New York University, a nice venue overlooking Washington Square Park in downtown Manhattan. I came to attend it and stay up to date with the most advanced new algorithms that are being used for research in collider physics, as I did last year. The workshop is really well organized and all talks are quite interesting, so this is definitely a good time investment for me.
The first thing I noticed as I joined the 100-strong audience yesterday was that I am at least 20 to 25 years older than the vast majority of the participants, and I am probably the third oldest person in the room. This is not surprising, and it did not depress me; in fact the opposite is true: I am happy to be working in a field full of young new scientists, and to compete and collaborate with them. And there is just so much to learn in this field, at the crossroads between Statistics and Computer Science, that one cannot get bored.
Also, as a senior person in a young audience, I feel I have something to offer: the perspective and the broader view that come from past experience, which young grad students and postdocs still lack. In fact, many of the talks I am listening to, while very interesting, are very technical. The authors usually have been given a tough analysis problem and have attacked it with one of these fantastic new tools, but did not bother thinking about the bigger picture. That is something on which we old-schoolers can give a bit of input. I will give a couple of examples of what I mean below.
Yesterday afternoon a session was devoted to discussing solutions to the problem of producing huge simulated datasets in view of the next running phase of the Large Hadron Collider, when ATLAS, CMS and LHCb foresee that they will need more CPU than they can really afford if they are to continue doing data analysis as they did in the past. As most of the time taken by the full simulation of collision events is spent modeling the interaction of the produced particles with the detector elements, many have attacked the problem by using Generative Adversarial Networks, in the hope that these instruments can provide an effective shortcut.
Generative adversarial networks (GANs) work as follows. A neural network, the generator, is trained to produce artificial events that look like examples from a training dataset. In so doing, it is helped by having the goal of fooling a "critic", a separate network that tries to find differences between the artificial events and the original examples. The interplay of the two networks converges to produce artificial events that model the real thing very well. Since generation is very fast once the training is done, this could be an effective way to provide a fast simulation of collider data.
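To make the generator-versus-critic interplay concrete, here is a deliberately tiny sketch in plain Python. It is not one of the architectures shown at the workshop: the "events" are single numbers drawn from a Gaussian, the generator has one parameter (a shift applied to random noise), and the critic is a logistic function of the event value. All names and settings (`train_toy_gan`, `shift`, the learning rate, the starting point) are assumptions made for illustration.

```python
import math
import random


def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))


def train_toy_gan(real_mean=1.0, steps=3000, batch=64, lr=0.05, seed=0):
    """Toy 1D GAN: the generator learns a shift, the critic is logistic.

    "Real" events are drawn from N(real_mean, 1); fake events are
    N(0, 1) noise plus the generator's single parameter `shift`.
    """
    rng = random.Random(seed)
    shift = -2.0      # generator starts far from the real data
    w, c = 0.0, 0.0   # critic: D(x) = sigmoid(w * x + c)
    for _ in range(steps):
        real = [rng.gauss(real_mean, 1.0) for _ in range(batch)]
        fake = [rng.gauss(0.0, 1.0) + shift for _ in range(batch)]

        # Critic gradient: ascend log D(real) + log(1 - D(fake)).
        grad_w = grad_c = 0.0
        for x in real:
            d = sigmoid(w * x + c)
            grad_w += (1.0 - d) * x
            grad_c += 1.0 - d
        for x in fake:
            d = sigmoid(w * x + c)
            grad_w -= d * x
            grad_c -= d

        # Generator gradient: ascend log D(fake), i.e. try to fool the critic.
        grad_shift = sum((1.0 - sigmoid(w * x + c)) * w for x in fake)

        # Simultaneous update of both players.
        w += lr * grad_w / batch
        c += lr * grad_c / batch
        shift += lr * grad_shift / batch
    return shift, w, c
```

Calling `train_toy_gan()` returns the learned shift together with the critic parameters; in this toy the shift is expected to drift towards the mean of the real sample while the critic loses its ability to tell the two samples apart. A one-parameter generator can only match the mean of the data, which is a cartoon version of the point made above: whatever the generator has no handle on, or has seen few examples of, is not modeled.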
The state of the art, as I perceived it from the talks I listened to yesterday, is that the technique works but is still not perfect. Usually the outputs of these GANs are good enough to model the bulk of the probability density function of the data in most of the parameter space, but some modeling imperfections still exist and are hard to beat down. A large effort by several groups is ongoing to address those issues. So at the end of the session I made a comment, which I report below.
I think the approach of using GANs is valuable and extremely interesting, but I do not believe that the fast simulated samples will be useful to model the tails of the kinematical distributions, where we perform most of our searches for new particle signals. The reason is that those tails are poorly determined, since they are dominated by large modeling uncertainties: we do not know them well, as we have not studied them in detail yet.
On the other hand, there is no reason to believe that we need to constrain ourselves to keep doing searches and measurements the way we have until now. In other words, while the approach of solving CPU demands with the use of GANs is valuable and can be a good solution, it is time for us to also invest time and thinking in the task of freeing ourselves from the slavery of having to produce huge simulation samples. The new machine learning tools available on the market today could be used to invest in this direction. By being creative we can sidestep the need to scale up the size of simulation samples to match the larger datasets we will collect in the high-luminosity phase of LHC running. We can use real data to model the bulk of distributions, rather than large simulation samples.
One example comes from my own research. A few years ago I had the problem of modeling the background from multijet events from quantum chromodynamics (QCD), for a search for the rare process of Higgs boson pair production (when both Higgs bosons decay to b-quark pairs). Rather than wrestling with the generation of huge QCD samples, I devised a fully data-driven technique based on the idea of "hemisphere mixing". Let me briefly explain what that is.
You can think of a QCD event as two quarks or gluons kicked out in different directions, which later independently emit additional radiation and produce multiple jets. If you take a data event and identify the two "hemispheres" created by this process, you can cut each data event in two, and then remix these hemispheres subject to some physical constraints. This way you can produce huge amounts of synthetic data that model extremely well the multidimensional features of real data, with practically no CPU consumption. The search for Higgs pair production was published last year and you can find more information on the technique in the article. So the conclusion is that by being creative you can sidestep the need for huge CPU resources.
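The cut-and-remix idea can be sketched in a few lines of Python. This is a strongly simplified illustration and not the procedure of the published analysis: jets are just `(pt, phi)` pairs in the transverse plane, the event is split about the direction of the leading jet, and the "physical constraints" are reduced to matching the jet multiplicity and the summed transverse momentum of each hemisphere. The function names and the choice of matching variables are assumptions made for this example.

```python
import math


def split_event(jets):
    """Split (pt, phi) jets into two hemispheres about the leading-jet axis.

    Angles are stored relative to each hemisphere's own direction, so a
    hemisphere can later be placed on either side of another event.
    """
    axis = max(jets, key=lambda j: j[0])[1]
    halves = ([], [])
    for pt, phi in jets:
        side = 0 if math.cos(phi - axis) >= 0.0 else 1
        halves[side].append((pt, phi - axis - side * math.pi))
    return halves


def features(hemi):
    """Matching variables: jet multiplicity and summed pt."""
    return len(hemi), sum(pt for pt, _ in hemi)


def mix(events):
    """Replace each hemisphere by its closest match from a different event."""
    library = []  # entries are (index of source event, hemisphere)
    for i, jets in enumerate(events):
        for hemi in split_event(jets):
            library.append((i, hemi))

    mixed = []
    for i, jets in enumerate(events):
        new_event, donors = [], []
        for side, hemi in enumerate(split_event(jets)):
            n, ht = features(hemi)

            def distance(entry):
                m, other_ht = features(entry[1])
                # Match multiplicity first, then summed pt.
                return (abs(m - n), abs(other_ht - ht))

            candidates = [e for e in library if e[0] != i]
            donor_index, donor = min(candidates, key=distance)
            donors.append(donor_index)
            new_event += [(pt, dphi + side * math.pi) for pt, dphi in donor]
        mixed.append((new_event, donors))
    return mixed
```

Given a list of at least two events, `mix(events)` returns one synthetic event per input event, together with the indices of the events that donated each hemisphere. The brute-force search over the library is quadratic in the number of events, which is fine for a toy; a real application would need a faster nearest-neighbour lookup and a more careful choice of matching variables.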
The second example comes from this morning's session, when several contributions discussed ways to decorrelate the output of classification algorithms from one of the variables characterizing the events. This is an old problem: in a new particle search you want to study the distribution of the invariant mass that you can reconstruct from the observed decay products of the hypothetical particle.
Physicists like to see the "bump" in an otherwise smooth mass distribution which is produced when a resonant signal is present in the data in addition to the non-resonant background, and decades of particle hunts have established this as the decisive proof of a new particle's existence. If you have a classifier that can reject background-like events based on their characteristics, you may use it to make the signal more prominent; but the problem is that the classifier will use the signal features to distinguish it, and among them there is of course the mass of the particle. This has the unwanted effect that the background mass distribution gets "sculpted" and looks more similar to the signal after the selection. Overall the classifier has helped the discrimination, but if one is not completely confident of the modeling of the mass distribution for background events, one may be unhappy with the result.
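The sculpting effect is easy to reproduce in a toy. In the sketch below, background events have a flat mass spectrum between 50 and 250 (arbitrary units) and a classifier score that is, by construction, higher near a hypothetical signal mass of 125. A fixed cut on the score then selects background preferentially near 125. For comparison, the code also derives a mass-dependent threshold that keeps the background efficiency constant in each mass bin, which is one very simple way to remove the correlation; it is shown only as an illustration, not as one of the methods presented at the workshop. All numbers and function names are assumptions.

```python
import math
import random


def toy_background(n, rng):
    """Flat-mass background whose classifier score peaks near mass 125."""
    events = []
    for _ in range(n):
        mass = rng.uniform(50.0, 250.0)
        bias = 2.0 * math.exp(-0.5 * ((mass - 125.0) / 20.0) ** 2)
        events.append((mass, rng.gauss(0.0, 1.0) + bias))
    return events


def bin_index(mass, lo=50.0, width=20.0, nbins=10):
    return min(int((mass - lo) // width), nbins - 1)


def efficiencies(events, thresholds, nbins=10):
    """Fraction of events passing score > threshold, per mass bin."""
    passed, total = [0] * nbins, [0] * nbins
    for mass, score in events:
        b = bin_index(mass)
        total[b] += 1
        passed[b] += int(score > thresholds[b])
    return [p / t for p, t in zip(passed, total)]


def flat_thresholds(events, target_eff, nbins=10):
    """Per-bin score thresholds giving the same background efficiency."""
    per_bin = [[] for _ in range(nbins)]
    for mass, score in events:
        per_bin[bin_index(mass)].append(score)
    thresholds = []
    for scores in per_bin:
        scores.sort()
        k = int((1.0 - target_eff) * len(scores))
        thresholds.append(scores[min(k, len(scores) - 1)])
    return thresholds


rng = random.Random(1)
background = toy_background(20000, rng)
fixed_cut = efficiencies(background, [1.5] * 10)
flat_cut = efficiencies(background, flat_thresholds(background, 0.1))
```

Here `fixed_cut` shows a pronounced peak in the bin containing 125, which is the sculpted bump, while `flat_cut` is flat by construction because the thresholds are derived from the same sample they are applied to. In a real analysis the thresholds would have to come from simulation or from a control sample, which brings back exactly the modeling question raised above.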
Many different approaches were discussed on how to let the classifier (typically a boosted decision tree or a neural network) do its job without messing with the mass distribution, thus producing a score uncorrelated with that variable. So at the end of the session I commented that while I find these attempts quite interesting and valuable, I think they keep us doing business as we did many years ago, in spite of the much more advanced tools we have nowadays, which can give us the confidence we need to do statistical inference without insisting on seeing a bump in a flat background.
So why not instead invest more effort in the task of modeling the full inference process, end to end? What we want is to prove that a new particle exists in the data, given some mismodeling uncertainties and other nuisance parameters. The technology now exists to incorporate the effect of those nuisances in the loss function of a neural network, constructed such that the network is tasked with optimizing our signal discovery chances (or minimizing the uncertainty on a parameter to be measured with the data) rather than just producing the best "systematics-blind" signal discrimination. We should simply invest more in that direction, rather than conforming to the old-fashioned bump hunt business. Otherwise it looks like we are taking a step in the right direction, but a much shorter one than we could take.
Last year I published an article in Computer Physics Communications with my student Pablo de Castro. In it, we described precisely the outcome of producing a neural network architecture that included knowledge of the use of the classification output in the inference problem (the estimate of the fraction of signal in the data) and of all systematic uncertainties that affected the estimate. The network, properly trained, was shown to learn how to directly minimize the uncertainty in the parameter of interest, rather than focusing on the intermediate goal of just producing the highest discrimination (which would come at the price of large systematic effects). So the technology exists... In fact, this year I will be working more on this topic.
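A much simpler stand-in can show why the objective matters. The sketch below is not the network described in the article; it is a one-parameter counting experiment in which the only thing to optimize is a cut on a classifier score. Signal and background scores are assumed Gaussian, the expected yields (100 signal, 10000 background events) are invented, and the uncertainty on the signal strength is approximated by the usual error propagation with a fractional uncertainty `f` on the background normalization, sigma_mu = sqrt(s + b + (f b)^2) / s.

```python
import math


def tail_prob(x):
    """P(Z > x) for a standard normal variable Z."""
    return 0.5 * math.erfc(x / math.sqrt(2.0))


def yields(cut, n_sig=100.0, n_bkg=10000.0):
    """Expected signal and background passing score > cut.

    Assumed toy model: signal score ~ N(1, 1), background score ~ N(0, 1).
    """
    return n_sig * tail_prob(cut - 1.0), n_bkg * tail_prob(cut)


def sigma_mu(cut, f_bkg):
    """Approximate uncertainty on the signal strength for a counting
    experiment with fractional background uncertainty f_bkg."""
    s, b = yields(cut)
    return math.sqrt(s + b + (f_bkg * b) ** 2) / s


def best_cut(f_bkg):
    """Scan the cut value and return the one minimizing sigma_mu."""
    grid = [-2.0 + 0.05 * i for i in range(121)]
    return min(grid, key=lambda cut: sigma_mu(cut, f_bkg))


cut_stat_only = best_cut(0.0)   # ignores the systematic uncertainty
cut_with_syst = best_cut(0.1)   # 10% background normalization uncertainty
```

In this toy the systematics-blind optimum sits at a fairly loose cut, while including a 10% background uncertainty moves the optimum to a much tighter one; evaluating `sigma_mu(cut_stat_only, 0.1)` against `sigma_mu(cut_with_syst, 0.1)` shows how much precision is lost by optimizing the wrong objective. An inference-aware network generalizes the same idea from one cut value to all the weights of the model.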
This afternoon the session is on anomaly detection for new physics searches. I hope I will have the stamina to report in this blog on those techniques, too... Stay tuned if you are interested, or just visit the workshop web site.