Big data is the current trendy phrase covering many different areas. It describes equally well having a huge volume of data generated in a short period of time (like molecular simulations of DNA), having a huge volume of data that needs to be indexed and archived (like PubMed or Web of Science), and wanting to analyze different types of data that weren't collected for a given purpose (the CI-BER project uses a variety of data types collected over the years to study a neighborhood in Asheville, NC).
The Virtual School of Computational Science and Engineering (VSCSE) exists to fill the gap in people's training. Many people facing big data problems were trained as physical, life, or social scientists, not computer scientists. Even those trained as computer scientists, librarians, or archivists were not prepared for this volume of data if their training was completed more than a couple of years ago. VSCSE sponsors workshops and short courses in targeted areas for motivated people who know they need more education but are not going to enroll in a formal college or graduate degree program.
The virtual part refers to the fact that these folks are extremely comfortable with technology. The Data Intensive summer school held 8-10 July 2013, for example, was teleconferenced to more than 400 participants at 15 listed sites. I sat in a room at the National Center for Supercomputing Applications in Urbana, IL, but the speakers broadcast from their home sites (e.g., San Diego, Chicago, North Carolina), and questions and comments came from all the sites. As one speaker put it, it's kind of weird to speak to people you can't always see, but I'm ok with that.
Three main problems exist when dealing with big data: storage, interaction, and analysis. The summer school covered all three aspects. I will report here on things that may help others who did not sit through the summer school. I do not have answers to all the questions I encountered, but I am going to share some interesting questions that have made me think recently.
Several of the talks touched on just the scope of having too much data. Even with terabyte drives being cheap and readily available, the storage problems are still the same as when I was in school: data can be generated far faster than one can analyze it and will exceed local storage if one is not careful. Even a petabyte can fill quickly with multiple users who are each generating terabytes.
In fact, the volume problem can be worse than it used to be because of the size of individual files, the number of files, or both. Ending up with several thousand files, each tens if not hundreds of gigabytes, is becoming a common problem for people who do the kinds of simulations that I do. Worse, those files are probably generated on one supercomputing cluster but need to be analyzed or stored somewhere else, because that's how things were set up years ago. The system admins become downright rude about people filling up scratch space, but, in terms of researcher time management, leaving generated files on the scratch drives for a few weeks is often the best solution. Dropbox, cloud storage, and Google documents are great for people who only need to share a couple of reports or presentations, but they are not at all useful for researchers routinely dealing with thousands of files that are each huge.
The good folks in the Chicagoland area (Argonne, University of Chicago) have built Globus, a way to efficiently transfer files for users of supercomputers and similar clusters that takes care of the babysitting big files otherwise need during transfer. I have not used it, but I will try it the next time my data are being generated offsite and need to be transferred to storage.
Several of the talks at the summer school described the difficulty of interacting with huge amounts of data. Automated searches are nice, but how are things indexed in the first place? What about data sets that consist of widely varying things that are hard to describe? For example, many databases handle formatted records well, but what about data that is unformatted, free-form, or in wildly different formats (not just Excel, plain text, and Kaleidagraph, but movies, pictures, maps, and pointers to books or physical objects)? Limiting oneself to things that are easily described and indexed by traditional means leaves out perhaps as much as 70-85% of the data currently in electronic form.
I was particularly taken by one speaker who mentioned that much of the easily handled data is misleading. For example, a numerical poll may tell you much less about what people are thinking than the free-form responses do, yet tabulating a million numerical poll results is computationally easy while dealing with a thousand free-form written responses is not. Searching something like PubMed by keyword is easy if you know the right keywords, but extremely difficult if the words you want were not what someone else considered important enough to index, especially if the records are not searchable text. The novel Mr. Penumbra's 24-Hour Bookstore has a very nice example of the protagonist searching a database for the record of a physical object that must be in storage somewhere and coming up empty because he can't figure out the proper keywords. I will not spoil how he solves the problem by hacking the carbon interface instead of the silicon interface, but it is very clever and a good example of how, even now, computers are not as smart as people.
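The keyword problem is easy to see in miniature. Here is a hedged sketch, in Python, of a toy inverted index over free-form text; the documents and queries are invented for illustration, but the failure mode is the one described above: a search only finds what the indexer's vocabulary anticipated.

```python
# Toy inverted index: map each word to the set of documents containing it.
# All document text below is made up for this example.
from collections import defaultdict

docs = {
    1: "molecular dynamics simulation of DNA strands",
    2: "archived survey responses from the Asheville neighborhood study",
    3: "trajectory files from a protein folding run",
}

index = defaultdict(set)
for doc_id, text in docs.items():
    for word in text.lower().split():
        index[word].add(doc_id)

def search(term):
    """Return sorted ids of documents whose text contains the exact term."""
    return sorted(index.get(term.lower(), set()))

print(search("simulation"))  # finds document 1
print(search("modeling"))    # a synonym the indexer never used: finds nothing
```

A real system would add stemming, synonyms, or full-text search, but each of those is another layer someone had to build and tune; the raw index knows only the exact words it was given.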
I still have more questions than answers, because that is the state of the field, but I look forward to trying some things for myself and reading/talking with people who are pushing hard to find solutions for their mixed huge data set problems.
People who have data usually want to perform analysis to turn data into information. This is also far trickier than one might expect if one is doing anything other than simply scaling up an analysis to a bigger data set of coherently formatted records. Even simple scaling can become tricky when machine memory is exceeded or a program has a hard limit on the number of entries. For example, I have overwhelmed Excel by having too many rows of data to read in, and I am told limits exist for many other programs. Once people start talking about databases with millions of entries and tens of thousands of possible keys for every entry, the unwieldiness is evident. That is indeed big data without good solutions, and we haven't even gotten to what you do when a data set is not formatted in a way that can be automated and you aren't quite sure how to use it anyway.
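One standard way around row limits and memory limits is to stream the file rather than load it. Below is a minimal sketch in Python using only the standard library; the column names and the in-memory stand-in for a large file are invented, but the pattern (one record in memory at a time, running aggregates instead of a full table) is the general technique.

```python
# Streaming a CSV one row at a time keeps memory constant, whereas a
# spreadsheet must hold everything (Excel tops out at 1,048,576 rows).
# The "file" here is an in-memory string so the sketch runs end to end;
# a real file would be opened with open(path) instead of io.StringIO.
import csv
import io

rows = "id,value\n" + "\n".join(f"{i},{i / 100}" for i in range(1, 10001))

total = 0.0
count = 0
for record in csv.DictReader(io.StringIO(rows)):
    total += float(record["value"])  # running aggregate, not a full table
    count += 1

print(count, total / count)  # row count and mean of the "value" column
```

The same loop works unchanged on a file of a hundred million rows; only the running totals live in memory. The trade-off is that you must decide up front which summaries you want, since the raw rows are gone once the loop moves past them.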
In the summer school, we were introduced to some functions in R that might help people plot data and learn something from it. My favorite is get_map, from the ggmap package, which lets you pull in map data from Google Maps (a coherent description with examples is here). Another interesting tool was the Google motion chart, which I suspect is how Michael Marder makes his interesting plots on what matters for science education.
More flexible tools in R come in the ggplot2 package (a nice overview of the functionality is here). If you want slick-looking, complex plots, this is for you. I am continuing to experiment with facet_grid because I often want several small plots that slice through my data by extra variables, and having a program generate them for me saves a lot of time.
Big data is the latest buzzword that affects some of us (you do know about the requirement for a data management plan in your next NSF proposal, don't you?), but it is a real problem for people doing cutting-edge science. If you find yourself overwhelmed by your data, then I highly recommend attending a workshop or seminar series to share ideas, even with people far outside your field. It's time well spent.