Choosing Your Bets: The Selection Bias

As some of the long-time readers of this blog know, in this column I have occasionally discussed probability calculations in the context of gambling and betting. A long time ago I also famously won a $1000 bet on the LHC not discovering any new physics. Below I will mention a similar bet that ended up not being agreed upon by the parties, for the sake of discussing a subtle effect one has to worry about when placing bets: the selection bias.

A successful gambler must be capable of sensing whether the conditions for placing a bet are favorable or unfavorable, and acting accordingly. Think about it: you might be an excellent blackjack player, capable of counting cards faultlessly, and still fail to gain an edge over the dealer, if the game is rigged! Now, it is true that when you are placing or accepting bets on scientific research or similar human-driven activities, you usually do not run such a risk: there are no bad actors and the game is fair. Even so, you can walk into traps in certain situations.

The common pitfall called "selection bias" in gambling occurs when the choice of placing a bet depends in some way on the state of the system under study. Consider the following example, which will hopefully clarify what I mean.

Where are they all going anyway?

John and Jane are on the highway, and a shiny red Ferrari overtakes them. While the car disappears into the distance, John mentions that it’s quite rare to see cars of that brand on the road. Jane is not convinced, and a discussion ensues. They decide to settle it by betting on the matter: if they don’t see another car like that during their journey John will win, otherwise Jane will. John’s confidence in winning easy money is shattered as he sees not just one, but three similar Ferraris overtaking him in the following twenty minutes. What is going on? Very simply, a get-together of Ferrari owners is taking place somewhere nearby!

While John feels fully justified in blaming his bad luck, he has a share of responsibility, as he could have considered that in probabilistic terms the presence of the first Ferrari affects the odds of seeing others in the area, precisely because some Ferrari owners like to attend gatherings and meet others who share their passion for that famous and expensive brand. The bottom line of the previous example is clear: always be on the lookout for effects that might correlate your decision to accept or propose a betting arrangement with the system you are betting on. The correct identification of those effects may be all that is needed to land you on the right or wrong side of the bet!

One should note that John and Jane decided to bet on the appearance of an additional Ferrari on the road based solely on their own gut feeling about the rarity of that kind of car: no precise calculation can be carried out on such an ill-defined system. The actual odds are hard to assess by objective means, as they depend on unknown factors – e.g., the traffic conditions along the road. Even if John worked in the car industry and thus had a pretty good idea of the number of Ferrari cars in circulation, it would be very difficult to calculate the odds of seeing another one. Because of that, the existence of a selection bias could be argued to be irrelevant – it is just another “fudge factor” on top of others, making the system only slightly more undetermined than it already is.

But then, let us leave John and Jane to their trip and consider the following situation: a racetrack during a Formula 1 event. Two cars of each brand are competing, so there are two Ferraris, two McLarens, two Renaults, and so on. In total, 20 cars are taking part in the race. You are sitting in the stands placed halfway along the track, and soon after the race starts, you see a Ferrari pass by. You had no previous knowledge of the starting order of the cars. What is the chance that the next car is also a Ferrari? The “neutral” estimate would be one in nineteen, as you have to exclude one of the two Ferraris both in the numerator (as a positive occurrence) and in the denominator (the total pool of racing cars trailing the first Ferrari is 20-1=19 cars).

On the other hand, despite differences in the skills of the competing drivers, the two cars of each brand are identical: there must then be some correlation between their performances. One must then admit that the chance of seeing another Ferrari next is higher than 1/19: the selection bias again produces a correlation between the two observed passages. The apparently “fair bet” priced at a probability of 1/19 (that is, odds of 18 to 1 against) is, in fact, biased in favor of the party betting on seeing another Ferrari next. You do not need to know how large that bias really is: you only need to know it exists to understand where to put your money!
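The racetrack argument can be illustrated with a small simulation. The sketch below is a toy model of my own making, not a description of real Formula 1 data: it assumes that the Ferrari you saw was the first car to pass the stands, that each car's pace is the sum of a shared team term and an independent driver term (both Gaussian), and that all teams are statistically alike, so that by symmetry it is enough to count how often the first two cars are teammates. The function name and the parameter `team_spread` are hypothetical.

```python
import random

def teammate_follows_rate(team_spread, n_teams=10, trials=20000, seed=1):
    """Estimate how often the second car to pass is the leader's teammate.

    Toy model (an assumption, not real racing data): a car's pace is
    a shared team term plus an independent driver term.
    team_spread = 0 means teammates are uncorrelated.
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        cars = []
        for team in range(n_teams):
            # Shared contribution: identical for the two cars of a team.
            team_pace = team_spread * rng.gauss(0.0, 1.0)
            for _car in range(2):
                # Individual contribution: differs from driver to driver.
                cars.append((team_pace + rng.gauss(0.0, 1.0), team))
        cars.sort()  # lowest time passes the stands first
        if cars[0][1] == cars[1][1]:
            hits += 1
    return hits / trials

print("no team correlation  :", teammate_follows_rate(0.0))
print("with team correlation:", teammate_follows_rate(2.0))
print("neutral estimate 1/19:", 1 / 19)
```

With `team_spread` set to zero the estimate should hover around the neutral 1/19; any positive value pushes it higher. How much higher depends entirely on the assumed spread, which is exactly the point: the size of the bias is unknown, but its direction is not.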

A selection bias like the one of Ferrari get-togethers or racing events can indeed be identified if you carefully consider the objective situation. Yet there are cases when you are required to consider factors that may be extraneous to the system, such as the potential for foul play. We all like to trust the honesty of the people we interact with, but in a probabilistic sense we must factor in the possibility of foul play as well.

A bet that never materialized

In the summer of 2016 I was asked by the scientific collaboration called CMS, a particle physics experiment at the CERN Large Hadron Collider, to take part in the team in charge of the internal refereeing of a physics analysis. The authors of the analysis had found a very significant signal of a previously unknown physical process, and they had grown convinced that what they were seeing was due to a new particle, most likely a new kind of Higgs boson. That would have been a huge discovery!

I carefully scrutinized their work, looking for some flaw, but I could not find any obvious shortcoming. Having discarded mistakes or systematic effects as a source of the observed signal, I became convinced it was simply due to a statistical fluctuation: an oddity of the data, which can sometimes produce something resembling a new signal although there isn’t one. In forming that opinion, a big role was played by my firm belief in the theory called “Standard Model”, which very successfully describes the phenomenology of subnuclear physics and does not predict the existence of new particles such as the hypothetical one seen by the authors: given my prior belief, any evidence in favor of some new phenomenon faces quite an uphill struggle to convince me.

In discussions with the main author of the study, my colleague Alexander Nikitenko, we found we had diametrically opposite opinions on the nature of the signal: while I claimed it was spurious, he was certain it was due to a new particle. To me, that looked like the perfect setup to place a bet with high confidence! I thus offered a $1000 wager to my colleague, with a stipulation to the effect that “if the signal is proven to be the real manifestation of a new physics phenomenon beyond the Standard Model, T. Dorigo will pay S. Nikitenko the sum of $1000; otherwise, S. Nikitenko will pay T. Dorigo the same amount”. And Sasha was quick to say he was very interested in accepting it!

My interest in the gambling aspect was, I have to say, not only financial: I have observed through years of scientific blogging how these wagers attract the interest of laypersons, drawing them to learn more about fundamental science. I was thus planning to build a story around the wager and write extensively about the relevance that the discovery of a new Higgs boson, one not predicted by our currently accepted theory, could have had on our understanding of the universe. In addition, I have always tried to support with new evidence the correctness of my innate skepticism toward extraordinary claims based on inconclusive statistical evidence.

We soon started exchanging emails to see if we could agree on a precise stipulation for our wager. The exact phrasing in such a situation is very important, as nobody wants to end up in a grey area where it is not immediately clear who is the winner. All the while, I made myself ponder the soundness of betting on the spurious nature of Nikitenko’s signal as objectively as I could. First of all, in general terms the fact that somebody is very happy to accept your bet offer is always a very precious piece of information: evidently they see the matter differently from you; understanding their line of reasoning is then very useful if you want to be objective. This can be tricky, as human beings are often irrational, and oftentimes their bets reflect their wishes more than a rational assessment of the information they have at their disposal.

In the case at hand, however, there was more at play. While the 2016 analysis had been performed on data collected during the previous years, the experiment had already collected additional data from an equivalent number of particle collisions during the past few months. Under normal circumstances, the new data would not have been looked at yet, as prior to analyzing them the authors would want to finalize and publish the earlier result; but in this particular case, the authors may have had a strong reason to “peek” at them, to verify whether the alleged new physics signal could be seen there, too. Since particle physics experiments customarily perform their searches for new particles in a “blind” way (hiding the data until the last moment, to avoid influencing their own conclusions), the act of looking at the new data would have been illegitimate, at least from a methodological point of view; but the temptation to do so would be very strong. I would in fact not rate it as scientific misconduct, but rather as a consequence of a scientist’s unstoppable urge to acquire more knowledge about Nature, when he or she is able to do so.

Let us examine the problem from a probabilistic perspective. Imagine that there is a 50% chance that the signal observed in the old data is genuine: an even bet on its reality is then apparently fair. Further, imagine that there is a 20% chance that my colleague had indeed taken a peek at the new data, without informing his peers. If we consider 1000 independent scenarios like the one above, in 200 of them (20%) my colleague peeked at the new data in search of confirmation, with a chance of obtaining firm, if not conclusive, evidence of the genuine or spurious nature of the claimed signal.

To simplify the calculation we may imagine that the evidence from the new data is clear-cut, so that among those 200 cases my colleague obtained a confirmation in 100 of them, and disconfirming evidence in the remaining 100. Now, fast-forward to the moment when I offer him the wager: for sure my colleague would not agree to bet on the genuine nature of the signal if he had seen no similar effect in the new data! Instead, he would definitely accept the bet in the 100 other cases. As for the remaining 800 scenarios, he would be deciding whether to accept the bet according to his own belief, without cheating. Let us say that he would do so 50% of the time, so in 400 of those 800 cases. Adding these to the previous ones we find that he would be willing to bet in 500 cases out of 1000.

Now, what are his overall chances to win? Of course, these largely depend on what Nature chose for our Universe – whether, that is, there are indeed more Higgs bosons around than the one the LHC found in 2012. For the sake of argument, let us imagine that the chance that the signal seen in the 2016 data is real is indeed 50%, i.e. consistent with our agreed-upon even payoffs. In that case, my colleague would win all of the 100 bets placed when he had cheated and found a signal in the new data; and he would also win 50% of the remaining 400, or 200 bets. Overall, he wins 300 of his 500 bets, or 60% of the time! His edge is guaranteed by having cheated in a fraction of the cases.
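The bookkeeping of the 1000 scenarios can be written down as a short calculation. What follows is a minimal sketch under the same simplifying assumptions used in the text (clear-cut evidence from the new data, and a colleague who never bets after seeing no signal); the function and parameter names are hypothetical labels I introduce for illustration only.

```python
from fractions import Fraction

def opponent_win_rate(p_real, p_peek, p_accept_blind):
    """Return (share of scenarios where the bet is accepted,
    opponent's win rate among the accepted bets).

    p_real         -- chance that the signal is genuine
    p_peek         -- chance that the opponent peeked at the new data
    p_accept_blind -- chance that he accepts the bet when he did not peek
    """
    # Peeked and saw a confirmation: he always accepts and always wins.
    accept_informed = p_peek * p_real
    # Did not peek: he accepts on belief alone, regardless of the truth.
    accept_blind = (1 - p_peek) * p_accept_blind
    accepted = accept_informed + accept_blind
    # Blind bets are won only when the signal happens to be real.
    wins = accept_informed + accept_blind * p_real
    return accepted, wins / accepted

accepted, win_rate = opponent_win_rate(
    Fraction(1, 2), Fraction(1, 5), Fraction(1, 2)
)
print(accepted, win_rate)  # 1/2 3/5
```

Setting `p_peek` to zero brings the win rate back to an even 1/2, while raising it increases the opponent's edge: the even-odds bet is only fair if nobody has looked.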

What should we conclude from the above calculation? Of course, placing a bet against somebody who has potential access to privileged information is a bad idea – the bet will more likely be accepted in cases which favor him over us. Yet if we reasoned from first principles without considering the cheating action, it would be hard to change our mind: “The signal exists in 50% of the cases under these conditions, so betting with even odds cannot be unfair or rigged”. This reasoning does not account for the fact that there is a selection bias at play: the bets that get played out are but a subset of all the possible situations, and they are more likely to be situations where the advantage is with your opponent.

I should not fail to mention, in discussing this particle physics example, that in the end Sasha and I did not finalize our wager. Indeed, we agreed that we were too involved with the situation to be able to use it as a gambling pretext. It was a wise decision, as from a professional standpoint a scientist should not mix up pastimes such as recreational betting with his or her duties – in my case, the review of the analysis; in Sasha’s case, the presentation of its conclusions and the writing of a scientific publication on behalf of the collaboration. The signal, by the way, did turn out to be spurious in the end; as often happens in the complex research field of particle physics, though, that realization took several additional years of investigation, and much further internal debate and refereeing within the collaboration.
