    The Dangers Of Data Mining
    By Tommaso Dorigo | June 14th 2014 09:48 AM | 15 comments

    On my first day at the Erice School of Science Journalism this past week I attended a lecture by Alessio Cimarelli, who discussed "When Data Journalism meets Science: a Hackathon". The speaker (who runs the site "dataninja") showed several examples of how to mine the web to construct databases and display the results on a variety of topics. It was quite interesting to see the techniques he used, but I felt compelled to interrupt him at one point, in the interest of the school participants.

    The fact is, he was presenting his results as if they were accurate measurements of the feature under study, whereas for a journalist it should be very important to keep a critical attitude toward whatever "result" is being discussed - especially one extracted by oneself through automatic means.

    I opened the web page where he was showing data on migrants killed while trying to reach Europe, and in a few clicks I got to see some of the original "data" entries on which the map his software had produced was based.

    The third one I browsed was an article in the French newspaper "Liberation", which described how a migrant had slipped on a stone while washing himself on the beach, and died.

    So I could point out to the speaker - and to the audience - that as a scientist I have a reverence for the data, and that I take extreme care to avoid any spurious entries in a dataset I use for an analysis. The Liberation piece, caught by the automated search for "migrant killed", was an example of how such a search, collecting data on killed migrants, is liable to produce a biased result, and of how spurious data can easily make it into the analysis.

    In the end, it boils down to the fact that a scientist values (or should value) the error bar around an estimate more than the estimate itself, as an estimate without an error bar is more useless (and potentially deceiving) than an error bar without a central value: the latter at least says something precise about the accuracy of the measurement, while the former says nothing at all.
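
    To make the point of the last two paragraphs concrete, here is a toy calculation (all numbers are invented for illustration, not taken from any real dataset) of how a mined count could be quoted with both a statistical and a systematic error bar:

        import math

        # Invented example: a keyword-based collection returns 150 entries in a month.
        n_collected = 150
        stat_error = math.sqrt(n_collected)    # Poisson-like statistical uncertainty, ~12 entries

        # Suppose a hand check of a random sample suggests ~10% of the entries are spurious,
        # with an assumed +-5% uncertainty on that fraction (both numbers are hypothetical).
        spurious_fraction = 0.10
        corrected = n_collected * (1 - spurious_fraction)
        syst_error = n_collected * 0.05

        print(f"{corrected:.0f} +/- {stat_error:.0f} (stat) +/- {syst_error:.0f} (syst)")

    Quoted this way, a reader immediately sees how much (or how little) the number can be trusted; the bare count, reported alone, says nothing of the sort.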

    I do not know whether my point was understood by the audience - I played the arrogant scientist, and I know I do not excel in sympathy when I do that. But it was on purpose: if they took home the message that they should be more skeptical of what is erroneously or deceptively called "raw data" by their peers (or even by scientists who should move on to some more suitable occupation in their lives), then I did not waste my listeners' time.

    A couple of days later it was the turn of Ayelet Baram-Tsabari, who discussed "Using the web to analyse and increase people's interest in science". She explained in detail how Google Trends can be used to extract information on internet users' interest in scientific topics from their search terms and from the graphs that the site provides. It was again my turn to play the hard-nosed professor of statistics, as I interrupted her when she was showing one of those trend graphs, which had a large peak coinciding with an important news event and a small secondary peak very close to it.

    Although the proposed explanation for the secondary peak looked quite plausible, I felt compelled to explain that since Google Trends only provides relative-frequency graphs - the original absolute numbers are hidden - one can hardly attach an error bar to the curve, and thus a "feature" observed in the graph cannot, in general, be taken as proof that something in particular occurred that caused people to search for a given term.
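
    A minimal numerical sketch of the problem (the weekly counts below are invented, since Google does not publish them): the same relative bump can correspond to wildly different statistical significances depending on the hidden absolute numbers.

        import math

        def bump_significance(baseline, peak):
            """Rough Poisson significance of an excess over a flat baseline."""
            return (peak - baseline) / math.sqrt(baseline)

        # Two hypothetical scenarios producing the same +20% relative bump:
        for baseline in (100, 100000):    # weekly searches - unknown in reality
            peak = int(baseline * 1.2)
            print(f"baseline {baseline}: +20% bump -> {bump_significance(baseline, peak):.1f} sigma")

        # With 100 searches/week the bump is about 2 sigma and could well be noise;
        # with 100000 it is about 63 sigma and certainly real. Google Trends rescales
        # both cases to the same 0-100 curve, so the graph alone cannot tell them apart.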

    (By the way, can anybody guess what search was used in Google Trends to produce the graph above?)

    My comment was not understood by the speaker, who kept referring to the graph as "the data", while I tried to explain to her that it was not raw data but a statistic (a statistic is a function of the data). She insisted that it was not a statistic, and I decided that her punishment would be to live with her ignorance... But I believe the discussion did allow the listeners to take home the point, which was again the one above: data without an error estimate can only be taken as a qualitative indication, and proves nothing in general.
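
    For readers unfamiliar with the terminology: a statistic is any quantity computed from the raw observations. A few toy examples (with made-up numbers):

        # A statistic is any function T(x_1, ..., x_n) of the raw observations.
        counts = [120, 95, 130, 400, 110, 105]    # hypothetical daily search counts

        mean_count = sum(counts) / len(counts)              # the sample mean is a statistic
        peak_day = counts.index(max(counts))                # so is the position of the maximum
        rescaled = [100 * c / max(counts) for c in counts]  # and so is a Trends-style 0-100 curve

    The rescaled curve in the last line is exactly the kind of object Google Trends displays: a statistic derived from the data, not the raw data themselves.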

    Comments

    This is the graph without error bars:
    http://www.google.ca/trends/explore#q=higgs%20boson

    Michael Martinez
    I think this chart is more colorful and has LOTS of correlations to play with.
    http://www.google.com/trends/explore#q=higgs%20boson%2Cgod%20particle%2Ccern

    Michael J. McFadden
    Hmmm... that peak is about halfway through 2012, so I'd guess it had to do with the US presidential election. It would have to be someone fairly unknown, but who strongly peaked back during the early primaries. Or it could be a news item like Benghazi or Gaddafi.
    I'd lean toward Ron Paul and the Iowa Straw Poll, or maybe just the Straw Poll itself since it also had a bump in late 2008.

    It *could* be the name of the Paul campaign bigwig who dared to air a commercial in which he was ::gasp!:: smoking a cigarette (I'm sure THAT produced a similar spike in that rough time period!) but I don't think the preliminary spike would be there, nor the one back in 2008.  Same sort of thinking leads me to think it's not Benghazi: I don't remember a late 2008 story though.

     :?
    MJM
    dorigo
    Hi Michael,

    hint: look at previous comments above ;-)
    T.
    Michael J. McFadden
    Ahhh!  Thank you Tommaso!   Next time I'll have to remember to say a prayer before making a guess... :) MJM
    Hi Tommaso,
    first of all thank you for attending the course and contributing with your lecture and comments.
    As you know I was chairing both sessions you refer to, and I will try to add my perception of the interesting and useful exchanges (although maybe sometimes a bit over the line).
    One important premise is that the School focuses on science journalism and communication, and is meant to bring together journalists, communicators and scientists from diverse disciplines and backgrounds.
    Another premise is that journalism has to deal, by definition, with things happening now, around us.

    Getting into the core of the matter, I find your post extremely interesting, because it allows us to explore in more depth two crucial issues that you have just touched very superficially, in my view.

    Jokingly, I might say that, as a scientist, you repeated here that you'd just discard most if not all of the topics we have been discussing at the School in Erice, because of the lack of reliable data.
    I wrote an article last year titled "The Number Needed to Inform - What we talk about when we talk of science journalism" (http://ebph.it/article/view/8816/8002), and used this quote to introduce it:
    «Science values detail, precision, the impersonal, the technical, the lasting, facts, numbers and being right. Journalism values brevity, approximation, the personal, the colloquial, the immediate, stories, words and being right now. There are going to be tensions.» - Quentin Cooper, BBC Radio 4's Material World.
    In addition, journalism also covers (I'd say mostly covers) issues that are not the subject of any rigorous scientific discipline.
    In particular, it seems clear to me - and I think of clinical research when it comes to statistics - that you apply two very powerful lenses: not only that of statistics, but also that of physics.
    As we said in Erice, the best statistics used in biomedical research has proven inadequate (see John Ioannidis' work at Stanford), but the scientific community still *assumes* that systematic reviews and meta-analyses are *somewhat* better (and much more scientific) than the available alternatives.
    I am sure that the work by Cimarelli on the migrants has little to do with science, but it has an enormous social value: maybe some scientists will pick up on his work and study the same issue with more rigour. That's the way science has been improving in recent centuries, after all.

    Getting to the second point, I am sincerely disappointed by your depiction of the exchange with Baram-Tsabari, and I fear that you missed one of the important messages I have been trying to convey in Erice: we all have to learn how to cope with *our own* ignorance, which becomes evident as soon as a scientist – any scientist - tries to get out of his own quite small world and talk to - and ultimately take relevant decisions with - the people outside.
    Decisions are actually taken on totally irrational and unscientific grounds. Every day. In every context. In every country.
    Baram-Tsabari brought to Erice the perspective, the tools and the reflections of a social scientist on the search data that are currently made available by Google, which are full of limitations that we listed and discussed.
    The society we live in (well, the one I'd like to live in, at least ;-) ) values the knowledge produced by physics AND the knowledge produced by social sciences. I have my personal view about the reliability and meaningfulness of some social research, and I have similar concerns about most scientific disciplines.
    But I learned that calling those I don't understand or agree with ignorant doesn't help the other people in the discussion.

    dorigo
    Dear Fabio,

    thanks for your comment. Of course my post above was superficial, as all my blog entries - I am doing journalism here! ;-)

    The question of rigour in the analysis of internet-mined data is of course quite important, and can't be discussed here any more than it could at the conference. What I cared to instill in the listeners of Cimarelli's talk is that they have to decide whether they are scientists or journalists before they embark on that business, while to my perception - maybe a biased one - the speaker was confusing the two roles, perhaps driven by his own background. So I hope we agree on the bottom line, which is that data mining can be very useful but also quite dangerous if mishandled, and that the way it is done has to be tailored to the goals.

    As for my depiction of the exchange with Tsabari: I believe I described it accurately above. If I stressed the fact that she does not know what a statistic is, it is because I found it quite surprising - she is a professor and applies statistics in her studies. Alas, readers of this blog are more aware than others that ignorance of statistics (and even of what a statistic means!) is widespread, and is one of the sources of the communication gap between laymen and scientists.

    If a social scientist does not apply scientific rigour to her analysis, or if she fails to explain the limitations of some tool she describes, I do not think she is doing a great job as a researcher or a great service to students of journalism. I did not call her ignorant, at the school or here; I only said she does not know the definition of a statistic (which is a fact).

    Cheers,
    T.



    What a great depiction, in a few lines, of the top-down approach and related "deficit model"!

    You'll be surprised to learn that sentences like yours «the ignorance of statistics (and even of what statistics means!) is widespread and one of the sources of the communication gap between laymen and scientists» were widespread in many variants several years ago, and were later identified by many scholars as one of the main sources of the lack of effective communication between the scientific community and the rest of us.
    Or better: between each highly specialised sub-group of scientists and the rest of us, including the majority of the scientific community (we are all ignorant, and that is a fact, too).
    I found really enlightening an exchange I had right after Baram-Tsabari's talk with Umberto Dosselli (as you know, co-director of the school and a particle physicist at INFN) and the other speaker, Donatella Lucchesi (also a physicist from INFN): I told them that we all have to deal with uncertainty and approximation, which grow dramatically in size the more complex the subject we try to observe and describe becomes. To which Lucchesi immediately replied - almost as a reflex: «Yes, but we approximate with method!». Which was exactly my point: each discipline excludes the parts of reality for which its methods are inadequate, but journalists still need to rely on someone's "guesstimates", frequently based on a loose application of the scientific method, but still - presumably - the best we can have on earth, the alternative being just: we don't know.

    So, I agree with you - and I wrote it several times - that there are instances in which the only honest scientific answer to a complex question is «We can't say» (and showing graphs can be misleading), but then we also need to distinguish those who try to apply some variants of the scientific method (legitimate scholars who accept the scrutiny of their peers) from the crooks who want to sell their snake oil.

    Hi Tommaso,

    first of all, thank you for your attention to this topic. In general I agree with your post, but as a co-author of the investigation you cited, The Migrants' Files, I have to point out some important issues I think you have missed or misunderstood.

    Here you can find the official page of the investigation, with a discussion of the methodology and in particular of the known errors: http://www.detective.io/detective/the-migrants-files. Very important point: the original data were collected by hand from journalistic sources by Gabriele Del Grande and United for Intercultural Action (and a very small subset from the Puls project), not at all from an automated search.

    Indeed, you can read there that we used a computer-assisted workflow, not a fully automatic approach, to harvest and analyze data about the dead and missing in migration towards Europe. In many phases of this process we checked subsets of the data by hand, especially after massive editing of the original data (e.g. extraction of geographic entities from unstructured textual descriptions). So your point that such a search, collecting data on killed migrants, "is liable to produce a biased result" and that "spurious data can easily make it into the analysis" is quite true, but we did not use an automated search, so it cannot refer to The Migrants' Files investigation.

    You cited an example of an event we analyzed, I think this one: http://www.detective.io/detective/the-migrants-files/event/54537, linked to this original article on the Lille blog of Liberation: http://lille.blogs.liberation.fr/saberan/2009/06/interviewcelinedallery..... For this death you use the term "spurious data"... did you read the description, and not only the title? The original news is in French, ok, but the description is in plain English: "A man died because he wanted to take a shower. As there was no shower in Calais, migrants washed in a canal. One of them slipped and died". In our opinion this 20-year-old man was a victim of migration. Not a spurious datum, but a victim.

    So, do we have no spurious data at all? Of course we have errors: we missed a lot of unknown victims, some linked sources are no longer available, etc. And if you or other people find some errors, please report them to us via email (debug@themigrantsfiles.com) or twitter (#migrantsfiles)! :)

    Finally, I totally agree with your discussion of error bars, so I refer you back to the text about the methodology on our website: http://www.detective.io/detective/the-migrants-files (the third paragraph). If you prefer, a note in Italian is also available at the end of the article on L'Espresso: http://speciali.espresso.repubblica.it/interattivi-2014/migranti/index.h....

    dorigo
    Hi Alessio,

    indeed, if the data are mined by hand the conclusions change - thanks for pointing out the information about your study. In your talk it was not at all clear - or maybe I was sleeping - that the collection was done by hand.

    About the migrant: that is indeed the datum I took as an example. The point is irrelevant once it is specified that the collection was not automated - but it is indeed debatable whether it belongs in the statistics or not. To me it doesn't, as I perceived that graph as showing the result of the migration, not of the condition of being a migrant. A migrant remains a migrant for months or years before he or she integrates with the community; if he or she dies in the process, do we count that datum? I wouldn't, if I were interested in the effect of the migration phenomenon per se; I would, if I were also interested in the aftermath of the migration. Probably I'd construct two separate statistics.

    Cheers,
    T.
    Our original sources (Gabriele and United) collected news by hand; we processed and merged the two datasets in a computer-assisted manner, with some sample checks by hand.

    We are interested in the dead and missing related to the migration process, not simply to the status of being a migrant. We also have events in which migrants died in hospital of a heart attack: would they have died anyway at home, or not? Impossible to say, maybe... so we chose to exclude cases like that. But the case you cited was related to the European policies for managing migrants (if there had been showers, he would not have died in the canal), so we chose to include it.

    Ok, I apologize for my lack of clarity: the lesson was not about The Migrants' Files, but I used it as a recent example and I should have been more specific.

    Cheers

    Stellare
    Excellent and entertaining post, Tommaso! This "or even by scientists who should move on to some more suitable occupations in their lives" made me laugh out loud. :-) or should I say LOL!

    For some reason statistics seems to be very hard to understand for the majority of people! It is in itself amazing that this is so. It even goes for hard scientists too, I am afraid. Although, I remember we amused ourselves with tons of misunderstood statistics jokes when I was at the university. 


    You are brave to try to educate the educators - keep up the good - and for us readers - entertaining work that you do! 

    Bente
    Bente Lilja Bye is the author of Lilja - A bouquet of stories about the Earth
    Fred Phillips
    The goals of news and science are at odds, no statistical pun intended. I tell my students, usually in the context of forecasting, that the purpose of forecasting is not to be right, but to be ready. "Your forecast will be wrong," I say. The error bar helps you guess how wrong you're likely to be, and how much being wrong is likely to cost you.
    News media like to lionize the guy whose forecast was right. Out of millions of bloggers and pundits, a few will be right on any occasion purely by chance. But then they're "news," regardless.

    dorigo
    Hi Fred,

    yes, yours is a sound way to put things in the right perspective... However, there is a class of journalists who have a strong scientific background (e.g. a former PhD in physics) and who fall in between the two categories. They will sometimes forget that the lack of rigor of their analyses (driven by time constraints, e.g., as you pointed out) invalidates the results; those results are still good for journalistic purposes, but the trouble is that it is not easy to discern where they come from at the end of the day...

    Cheers,
    T.