"They're data mining our children," notes Politico writer Stephanie Simon. She is talking about education technology startup Knewton and its use of data analytics to learn how kids think, with the goal of predicting which students will struggle with fractions next week.

Exciting, right? Obviously this can be misused, and the potential problems are so obvious (if they can forecast behavior, they can manipulate it) that policymakers will address them. The real brilliance will be what this sort of capability can do for science.

First, we have to calibrate what they mean by invoking '10 million data points', since a data point can be almost anything, even a 0 or a 1. Technically, the ancient game Star Raiders was 4,000 data points, since it ran in 4K of memory. Brady Fukumoto at EdSurge.com does that calibration. Even if by 10 million they only mean 10 MB:
"Take a school of 300 students and assume they use the aforementioned phonics software 20 days every month. So 10 MB × 300 students × 20 days = 60,000 MB (60 GB) of data transferred and stored every month. Over a nine-month school year, that comes out to 540 GB of storage."
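To make that back-of-the-envelope math easy to poke at, here is a minimal sketch of the same estimate. The per-student figure of 10 MB per school day is Fukumoto's reading of '10 million data points', and the variable names are just illustrative:

```python
# Back-of-the-envelope storage estimate for one school,
# following Fukumoto's assumptions.
MB_PER_STUDENT_PER_DAY = 10   # "10 million data points" read as 10 MB
STUDENTS = 300                # one school
DAYS_PER_MONTH = 20           # school days the software is used
MONTHS_PER_YEAR = 9           # length of a school year

monthly_mb = MB_PER_STUDENT_PER_DAY * STUDENTS * DAYS_PER_MONTH
yearly_gb = monthly_mb * MONTHS_PER_YEAR / 1000  # decimal GB, as in the quote

print(f"Per month: {monthly_mb:,} MB ({monthly_mb / 1000:.0f} GB)")
print(f"Per school year: {yearly_gb:.0f} GB")
# Per month: 60,000 MB (60 GB)
# Per school year: 540 GB
```

Swap in different assumptions (more students, a richer data stream per session) and the totals balloon quickly, which is exactly Fukumoto's point.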
The more we want to know, the harder the data problem gets. He likens it to a bridge that needs to get both wider and longer.

What may help? Technologists at the big data platform company Cloudera have rallied other companies behind a new idea: Cloudera, IBM, Intel, Databricks, and MapR want to port Apache Hive onto Apache Spark. Spark is a cluster computing system created at UC Berkeley, and Hive is data warehouse software for querying data stored in Hadoop using a SQL-like language.
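For a flavor of what Hive-style SQL over Hadoop data looks like, here is a minimal PySpark sketch using the modern API. The table name and columns are hypothetical, and this uses Spark's built-in Hive metastore support to run the query on Spark's engine, which illustrates the same SQL-on-Hadoop idea the Hive-on-Spark port is after rather than the port itself:

```python
from pyspark.sql import SparkSession

# Start a Spark session that can read tables registered in a Hive metastore.
spark = (SparkSession.builder
         .appName("hive-style-query")
         .enableHiveSupport()
         .getOrCreate())

# A Hive-style (SQL-like) query over data stored in Hadoop.
# 'student_events' is a hypothetical table of per-student activity logs.
result = spark.sql("""
    SELECT student_id, COUNT(*) AS events
    FROM student_events
    WHERE event_date >= '2014-09-01'
    GROUP BY student_id
    ORDER BY events DESC
    LIMIT 10
""")
result.show()
```

The appeal is that analysts keep writing the same familiar SQL while the execution moves to Spark's faster in-memory engine instead of batch MapReduce jobs.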

Why is that good? At the beginning of a movement, there need to be a lot of options competing in a 'survival of the fittest' way. But then things need to consolidate, or there will be a lot of tiny players accomplishing little. Plenty of people had MP3 players before the iPod came out, but Apple put digital music into the mainstream, and by shucking off private fiefdoms for the benefit of all, these companies are improving the ecosystem. Science 2.0 is going to need open source tools that have already passed the acid test; scientists do not want to be pioneers taking arrows in the back when it comes to data usage. If Hive is going to be the standard for batch processing on Hadoop right now, then it's good to start embracing it.