George Dyson. Credit: edge.org
If you read about Big Data for very long, a quote from science historian George Dyson is sure to come up: "Big data is what happened when the cost of keeping information became less than the cost of throwing it away."
It usually serves as a launching point for talking about the challenges of Big Data.
But there is a bigger problem, and it shows the real challenge of Big Data - that isn't what Dyson said. As with the Einstein quotes about bees, in a Google world, where accuracy is measured by how often you are repeated and thus rise to the top of search engines, the Big Data problem is accuracy, not volume.
George Dyson was quoted by Tim O'Reilly, considered the father of Web 2.0, at a conference. I say the father of Web 2.0, but there is just one problem with that characterization - the term Web 2.0 had been in use for 5 years before he mentioned it. Now, I came up with the term Science 2.0 after seeing an O'Reilly talk about Web 2.0 in 2004, but there is just one problem with that too: other people say I was not first at all. Who was? No one seems to know, but in a world where Wikipedia is considered accurate, if some marketing person on Wikipedia declares that Scientific American invented Science 2.0 in 2008, years after this actually started, and reverts any corrections, at some point mythology becomes reality. That is the real peril of Big Data.
Dyson's actual statement was "Big data is what happened when the cost of storing information became less than the cost of making the decision to throw it away." It's actually a dramatic difference, because the original is about the cost of deciding what to discard, not the cost of discarding it. But the offhand paraphrase was immediately retweeted 789 times and then picked up by OpenTracker as one of the definitions of the term "Big Data." It became a definition by virtue of nothing but the cultural authority of the person who got the quote wrong. Last week, over a year later, a Business Spectator columnist was still using the wrong quote.
It's the curse of Big Data as much as it is the curse of Google. In that kind of world, Kim Kardashian could declare Kanye West the greatest American since Abe Lincoln and it would be considered authoritative in a Big Data algorithm.
Budd Illic, the Business Spectator writer, was making a point about clutter. If you take a photo, you save it somewhere, maybe edit it, upload it to social media, and then have it backed up. Even if no one else stores it, that is multiple copies of the same data. Illic quotes IDC, which estimates that 60 percent of stored data is actually copy data - and that it costs $46 billion.
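To make the photo example concrete, here is a minimal sketch of how copy data could be spotted by hashing content. Everything in it is hypothetical: the file names, the byte strings standing in for photos, and the helper names `find_copy_data` and `copy_fraction` are mine, not Illic's or IDC's, and it only catches byte-for-byte copies, so an edited photo counts as new data.

```python
import hashlib
from collections import defaultdict


def find_copy_data(blobs):
    """Group the names of stored items whose bytes are identical."""
    by_digest = defaultdict(list)
    for name, content in blobs.items():
        by_digest[hashlib.sha256(content).hexdigest()].append(name)
    # Only groups with more than one member are copies of each other.
    return [sorted(names) for names in by_digest.values() if len(names) > 1]


def copy_fraction(blobs):
    """Fraction of stored bytes that are redundant copies."""
    total = sum(len(content) for content in blobs.values())
    if total == 0:
        return 0.0
    # Keep one size per distinct content hash.
    unique_sizes = {
        hashlib.sha256(content).digest(): len(content)
        for content in blobs.values()
    }
    return 1 - sum(unique_sizes.values()) / total


# Hypothetical stand-ins for one photo and its edited version.
photo = b"raw pixels"
edited = b"raw pixels, cropped"

stored = {
    "camera/img.jpg": photo,
    "laptop/img.jpg": photo,
    "backup/img.jpg": photo,
    "laptop/img_edit.jpg": edited,
    "social/img_edit.jpg": edited,
}

for group in find_copy_data(stored):
    print("same content:", ", ".join(group))
print(f"redundant share of stored bytes: {copy_fraction(stored):.0%}")
```

Counting copies is the easy part. Nothing in a hash tells you which copy, if any, is the accurate one.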
Is that any more accurate than the Dyson quote? I have no idea, but that is the problem that Big Data should be able to solve, yet doesn't.
So far, Big Data is being stumped by simple problems, like accurate quotes, because it is brute-force programming. Imagine the issues in Science 2.0, where a lot of very complex data - 10 inverse femtobarns and up in particle physics - longs to be understood using more than a Monte Carlo analysis. There is ongoing work in dimensionality reduction and sub-linear algorithms, and those are all important, but that is still only the top layer that has to be addressed before Science 2.0 can really take off for Collaboration and Participation; one of the real struggles will be making sure the data itself is accurate.