I got an email from an analytics group pitching an article about Valentine's Day movie results. 

It promised:

"If you’re planning to celebrate Valentine’s Day by watching a romantic film you’ll probably end up watching Isn’t It Romantic, according to the latest" blah blah blah (which) "analyzed Google Trends data of IMDBs list of ‘100 romantic films for Valentine's Day’ to reveal which films were the most popular in February 2019.

"The most popular Valentine’s Day film by search interest in the United States was Isn’t it Romantic tied with Two Weeks Notice, followed by Harold and Maude, You’ve Got Mail, and Gone with the Wind. Globally, the film The Fault in Our Stars ranked number one, followed by Love, Rosie, Titanic, Me Before You and 500 Days of Summer.

"Who is the leading man of romance? Hugh Grant.  Grant appears in seven of the 100 films, the most of any actor. Meg Ryan is the next most likely to be on viewers' screens this month with five appearances."

"The Fault in Our Stars"? Hoo boy.

Prior to founding Science 2.0 my career was in physics software - we made tools that used Maxwell's equations to solve complex problems on things like circuit boards. 

One thing you learn early on is that you can arrive at the wrong answer with complete confidence. We see it happen all of the time today, as in epidemiology studies that can link any food or chemical to any benefit or harm. It is easy to see how you can fool yourself. Picture that you want to figure out where a table might break. If your analysis focuses on the middle of the table and you dive down to the granular level there, you can arrive at a weight, precise to the gram, that you can be 99.999% confident the table will hold before breaking. The problem is that the table being designed might not carry its weight exactly in the middle.

High-energy physicists spend a lot of time arguing about "convergence" on results to avoid having a solid confidence interval around a completely wrong answer.
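The point about a solid confidence interval around a wrong answer can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the true value, the systematic bias, and the noise level are all made-up numbers, and `biased_estimate` is a hypothetical helper, not anything taken from a real analysis. It shows that collecting more readings shrinks the interval without moving it any closer to the truth.

```python
import random
import statistics

def biased_estimate(true_value, bias, noise, n, seed=0):
    """Return (mean, half_width) of a ~95% interval from n noisy, biased readings."""
    rng = random.Random(seed)
    # Every reading carries the same systematic offset plus random noise.
    readings = [true_value + bias + rng.gauss(0, noise) for _ in range(n)]
    mean = statistics.fmean(readings)
    half_width = 1.96 * statistics.stdev(readings) / n ** 0.5
    return mean, half_width

TRUE_VALUE = 100.0  # assumption: the quantity we actually care about
BIAS = 15.0         # assumption: systematic error from analyzing the wrong thing

for n in (10, 1000, 100000):
    mean, hw = biased_estimate(TRUE_VALUE, BIAS, noise=5.0, n=n)
    covers = abs(mean - TRUE_VALUE) <= hw
    print(f"n={n:>6}  estimate={mean:7.2f} +/- {hw:.3f}  covers truth: {covers}")
```

The interval gets narrower as `n` grows, which looks like rising confidence, but it stays centered on the biased value rather than the true one.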

The HEP community would have done a different analysis here and determined that the top Valentine's Day movie is "Sleepless in Seattle." And they would be right. Because search engine results are not behavior.


"Sleepless in Seattle" is synonymous with Valentine's Day. Any data mining result should be calibrated by that accordingly.

Misunderstanding the difference between behavior and polling is what happened here. I believe their confidence in the answer, but not their methodology. People who search for Valentine's Day movies are either bored with the ones they have or don't know any at all. That has already contaminated your sample. If I am one of them, maybe I am more likely to have clicked on "Isn't It Romantic". It's a recent movie, and it stars the delightful Rebel Wilson. I own that movie. But it has nothing to do with Valentine's Day and little to do with romance. It is instead about a woman who hates romantic comedies, gets mugged, and wakes in a fantasy world where she's stuck in one.
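The contaminated-sample argument above can be made concrete with a toy example. Everything in it is an assumption for illustration: the audience of 100 people, the split between those who search and those who don't, and the placeholder titles "familiar classic" and "recent release" are invented, and `top_by_search_and_by_viewing` is a hypothetical helper. It only shows how ranking by searches and ranking by what people watch can disagree when the searchers are a self-selected group.

```python
from collections import Counter

def top_by_search_and_by_viewing(viewers):
    """Return (top title among searchers, top title among everyone who watched)."""
    searches = Counter()
    views = Counter()
    for v in viewers:
        views[v["watches"]] += 1          # behavior: what was actually watched
        if v["searched"]:
            searches[v["watches"]] += 1   # only searchers leave a search trail
    return searches.most_common(1)[0][0], views.most_common(1)[0][0]

# Made-up audience: most people already know what they will watch and never search.
viewers = (
    [{"watches": "familiar classic", "searched": False}] * 70
    + [{"watches": "familiar classic", "searched": True}] * 10
    + [{"watches": "recent release", "searched": True}] * 20
)

by_search, by_viewing = top_by_search_and_by_viewing(viewers)
print("Top by search interest:", by_search)
print("Top by actual viewing: ", by_viewing)
```

In this invented audience the search data crowns the recent release, while the viewing data, which counts everyone, crowns the classic.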

So data mining for searches doesn't tell us much for the same reason surveys don't. Behavior is all that counts. When I noted in Science Left Behind that there were anti-science positions held by the left and the right, academics (on the left) wrote blog posts claiming I was wrong in stating that more Democrats were anti-vaccine and opposed GMOs. I was using CDC data, shopping data, and legislative efforts, and they were using ... surveys.

On surveys, almost no one claims to be anti-vaccine. Because in their minds they are pro-choice about vaccines. Or they want more studies done. It's the same with GMOs. My CDC data showed California, Oregon, and Washington ruled the nation in vaccine denial. What they claim on surveys was irrelevant.

So it goes with using search results to pick the top Valentine's Day movie. "Sleepless in Seattle" was not on their list, even though it stars Meg Ryan, who is the top actress showing up in searches, and even though another movie she was in, "You've Got Mail", was a remake green-lit solely so they could pair her with her co-star from "Sleepless in Seattle", Tom Hanks, due to the $126 million that romantic comedy made in 1993.

The ending is literally on Valentine's Day, at the top of the Empire State Building. In the image she's even holding a teddy bear. It is without question the top Valentine's Day movie of all time, not "The Fault in Our Stars" or "500 Days of Summer."

Data mining has been getting this kind of thing wrong for a long time. Back when Alexa was a data mining algorithm to rank websites and not an Amazon device recording everything you say, it consistently ranked Google as the number 2 website in the world. They had an answer that was completely wrong and yet their algorithm said it must be accurate. They finally fixed it by hardcoding Google in at #1. And that's what these companies need to do with Valentine's Day movies as well. If "Sleepless in Seattle" is not number one, the analysis is ignoring the real world and needs to be re-calibrated.