The title of this post is the same as that of a non-technical presentation I gave today at the 2021 USERN Congress. The USERN (Universal Scientific Education and Research Network) is an organization that fosters the diffusion of science, awards prizes to researchers who distinguish themselves through their scientific advancements, and promotes science across borders. As a member of its advisory board I was invited to give a presentation in the first session of the virtual congress, which dealt with human versus artificial intelligence.
Having only 20 minutes to say something meaningful, I decided to explain how unsupervised learning is making its way into research practice in particle physics, and how the rigorous way we handle the problem of multiple testing may be a good example to other disciplines, which are affected by the grievous problem of "p-hacking" (tampering with data in search of a significant effect, so that the results look more impactful) and by a general reproducibility crisis.

I spent a few minutes giving a very short introduction to the history of artificial intelligence - I paste below a few of the slides I helped myself with (as usual, my slides are full of text, which makes them awkward to use online but helps offline consumption).







Then I introduced supervised learning by explaining how classification is at the heart of our cognitive processes, as summarized below:





This allowed me to introduce and explain what machine learning and deep learning are; e.g., for the latter:





I then described what we do in particle physics, helping myself with some visual descriptions of the LHC and the experimental apparatus, and explained the uses of ML in our field:




After touching on the issue of reproducibility and p-hacking in other sciences, I described what we do in particle physics to address those issues. In particular, we have long recognized the problem of multiple testing, so we set the p-value threshold for significant effects to a very small value - 3x10^-7! I have discussed this topic in detail here in the past, so I don't think we need to go through it again...
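
For readers who wonder where that peculiar number comes from: it is simply the one-sided tail probability of a Gaussian distribution five standard deviations away from its mean, and the usefulness of such a tight threshold becomes clear once one counts how many tests get performed. Below is a minimal sketch of the arithmetic (not part of the talk; the numbers of tests are purely illustrative).

```python
# Sketch of the arithmetic behind the 3x10^-7 ("five sigma") threshold.
# The numbers of tests below are purely illustrative.
from scipy.stats import norm

# One-sided tail probability of a standard Gaussian at z = 5:
p_5sigma = norm.sf(5.0)                       # survival function, 1 - CDF
print(f"p-value at 5 sigma: {p_5sigma:.2e}")  # ~2.9e-07, usually quoted as 3x10^-7

def global_false_alarm(alpha, n_tests):
    """Probability that at least one of n independent tests fluctuates past threshold alpha."""
    return 1.0 - (1.0 - alpha) ** n_tests

# With a "conventional" 5% threshold, a thousand tests virtually guarantee a fake discovery;
# with the 5-sigma threshold, even a million tests keep the global false-alarm rate modest.
for alpha in (0.05, p_5sigma):
    for n in (1, 1_000, 1_000_000):
        print(f"alpha={alpha:.1e}, n_tests={n:>9}: P(fake discovery) = {global_false_alarm(alpha, n):.3f}")
```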

I eventually described one search done by CMS which, rather than focusing on specific regions of interest in the data, looks at all of them together. This analysis builds the distribution of p-values of many different subclasses of events (regions of interest), so that one may compare it with what one expects given the multiple testing applied. A new physics signal would produce a tail at low p-values; the data, instead, conform to expectations.
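
To make the logic concrete: if only background processes are present, the p-values computed in the many regions of interest should be distributed roughly uniformly between 0 and 1, while a genuine signal concentrated in a few regions shows up as an excess of very small p-values. The toy sketch below mimics that comparison (it is my own illustration with made-up numbers, not the CMS analysis code).

```python
# Toy illustration of the logic (not the actual CMS analysis): under the
# background-only hypothesis the p-values of the many regions of interest are
# roughly uniform in [0, 1]; a signal concentrated in a few regions produces
# an excess of very small p-values.
import numpy as np
from scipy.stats import kstest

rng = np.random.default_rng(0)

# Background-only pseudo-experiment: p-values uniform by construction.
p_bkg_only = rng.uniform(0.0, 1.0, size=5000)

# Pseudo-experiment with an injected "signal": a handful of regions get tiny p-values.
p_with_signal = p_bkg_only.copy()
p_with_signal[:20] = rng.uniform(0.0, 1e-4, size=20)

for label, p in (("background only", p_bkg_only), ("with signal", p_with_signal)):
    ks_stat, ks_pvalue = kstest(p, "uniform")   # compatibility with the uniform expectation
    tail = np.mean(p < 1e-3)                    # population of the low-p-value tail
    print(f"{label:16s}: KS p-value = {ks_pvalue:.3g}, fraction below 1e-3 = {tail:.4f}")
```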




Finally, I managed to quickly mention my own contribution to unsupervised learning for new signal searches, describing what the RanBox algorithm does:
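
For the technically minded, here is a very stripped-down sketch of the general idea behind a RanBox-style search: map each feature to a uniform marginal, then hunt, in randomly chosen low-dimensional subspaces, for boxes that contain more events than a featureless dataset would predict. It is my own toy illustration under simplifying assumptions (a uniform-density expectation and a crude excess ratio as the score), not the actual algorithm or its code.

```python
# Stripped-down sketch of a RanBox-style search (a toy illustration, not the real code):
# rank-transform the features, then rank random boxes by their observed/expected event counts.
import numpy as np

rng = np.random.default_rng(42)

def to_uniform_marginals(X):
    """Rank-transform each feature so that its marginal distribution is uniform in [0, 1]."""
    ranks = X.argsort(axis=0).argsort(axis=0)
    return (ranks + 0.5) / X.shape[0]

def scan_random_boxes(U, n_boxes=5000, subspace_dim=3):
    """Return (score, dims, lo, hi, n_obs, n_exp) for the most overdense random box."""
    n_events, n_features = U.shape
    best = None
    for _ in range(n_boxes):
        dims = rng.choice(n_features, size=subspace_dim, replace=False)
        lo = rng.uniform(0.0, 0.85, size=subspace_dim)
        hi = np.minimum(lo + rng.uniform(0.1, 0.25, size=subspace_dim), 1.0)
        inside = np.all((U[:, dims] >= lo) & (U[:, dims] <= hi), axis=1)
        n_obs = int(inside.sum())
        n_exp = n_events * np.prod(hi - lo)     # expectation for a featureless (uniform) sample
        score = n_obs / (n_exp + 1.0)           # crude regularized excess ratio
        if best is None or score > best[0]:
            best = (score, dims, lo, hi, n_obs, n_exp)
    return best

# Toy data: a featureless background plus a small localized cluster playing the role of a signal.
background = rng.normal(size=(3000, 8))
signal = rng.normal(loc=1.5, scale=0.05, size=(150, 8))
U = to_uniform_marginals(np.vstack([background, signal]))

score, dims, lo, hi, n_obs, n_exp = scan_random_boxes(U)
print(f"most overdense box, features {dims}: {n_obs} observed vs {n_exp:.1f} expected (score {score:.2f})")
```

The real algorithm is considerably more refined in how it defines the test statistic and optimizes the box boundaries, but the spirit is the same: look everywhere, in an unsupervised way, for regions of feature space that are more populated than they should be.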




All in all, I am not sure my listeners were able to follow everything I said, but I think they got at least a couple of important points, which I also highlighted in the concluding slide:







---

Tommaso Dorigo (see his personal web page here) is an experimental particle physicist who works for the INFN and the University of Padova, and collaborates with the CMS experiment at the CERN LHC. He coordinates the MODE Collaboration, a group of physicists and computer scientists from eight institutions in Europe and the US who aim to enable end-to-end optimization of detector design with differentiable programming. Dorigo is an editor of the journals Reviews in Physics and Physics Open. In 2016 Dorigo published the book "Anomaly! Collider Physics and the Quest for New Phenomena at Fermilab", an insider view of the sociology of big particle physics experiments. You can get a copy of the book on Amazon, or contact him to get a free pdf copy if you have limited financial means.