    The Petabyte Problem
    By Alex "Sandy" Antunes | August 25th 2009 03:11 PM | 6 comments

    What do you do with a petabyte of data?

    The question came up during a lunch today with two NASA computing people, one in IT and the other in supercomputing.  Modern satellites are returning petabytes of data, and there are many satellites.  This is far more than any human can expect to personally look at, and in fact more than they can fit into their local machine.  How do we make these huge amounts of data useful?

    We can't ship it to the user's desktop-- there's no room, it'd take forever, and the user doesn't have tools that can browse massive data sets.

    One concept in cloud computing is that the user runs the program where the data is, rather than bringing their data to their program.  This works well for cases where the scientist knows what the data is, and simply needs to run on it.

    For example, Facebook lets anyone write a program that runs on the entire Facebook population.  A user can write a program to fetch a subset of the data that fits certain criteria, then do interesting stuff with it.  The data resides on Facebook's servers, and the user only accesses that subset which they need.  This works because Facebook has very few data items available to the programmer, and a simple relational database fetches all relevant data.

    But in science data sets, you don't know what the relevant data is until after you've solved the problem.  You have to first do data exploration.  And NASA has (for example) Earth data in multiple wavelengths and sensor types, resolutions, timing.  If you're not an earth science expert, you may not yet know which data is important.

    We can't ask the user to run their analysis remotely as a 'batch' job, because then the scientist has to predecide which data they need.  You can't explore the data in a series of request-and-wait runs; that'd take forever.  It'd be like watching a movie in slide show format, one slide an hour.

    There needs to be a way for a user to a) explore, b) select, then c) work with massive data sets in an interactive fashion.

    Let's look at the example of the starship Enterprise from Star Trek.  It has a nigh omniscient computer and a vast array of sensors.  Commander Data wants to find a hidden Romulan ship.  He talks to the computer to use it.  Here is his technique [paraphrased]:

    "Computer, visual onscreen now."  [no ship seen]
    "Computer, show thermal signatures."  [still no ship seen]
    "Computer, show subspace anomalies." [voila!  the outline of a Romulan ship appears!]

    This is a terrible model for attacking a large data set!  A good system would work like this:

    "Computer, show me all ship-like objects, in any profile.  Ah, there it is."

    Why should a scientist have to drill down through all possibilities?  The problem becomes worse when scientists access data from different domains.  A sociologist may want to explore 'temperature' with 'crime rate' for different regions of a city, to see if there is a connection.

    But where is 'local temperature' in the petabytes of the Earth Observing System (EOS)?  There are many temperatures... water temperatures, air temperatures, buildings radiating heat.  There's humidity to factor in ("it's not the heat, it's the humidity").  You need your temperature data to match the timing of your crime stats.  If you have only one crime figure per day, what do you use as a daily temperature-- an average, a peak, a sampling?
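    To make the average/peak/sampling choice concrete, here is a minimal sketch.  The hourly readings are made-up numbers for a single day, and the function name and the noon sampling hour are assumptions for illustration, not anything from EOS.

```python
from statistics import mean

# Hypothetical hourly air temperatures (deg C) for one day, hours 0..23.
hourly_temps = [14, 13, 13, 12, 12, 13, 15, 17, 19, 21, 23, 25,
                26, 27, 27, 26, 24, 22, 20, 18, 17, 16, 15, 14]


def daily_summaries(temps, sample_hour=12):
    """Return three candidate 'daily temperature' values for one day."""
    return {
        "average": mean(temps),       # smooths out the afternoon heat
        "peak": max(temps),           # hottest reading of the day
        "sample": temps[sample_hour],  # single reading at a fixed hour
    }


summaries = daily_summaries(hourly_temps)
for name, value in summaries.items():
    print(f"{name}: {value:.1f}")
```

    Each of the three numbers is a defensible "daily temperature", and each could give a different answer when set against a once-a-day crime figure.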

    To answer these, you need to fully understand all the possible data sets, and then be able to make a judgement of which are relevant.  But you may want to try different assumptions, because (again) you don't know the answer until you get it.  Having so much data is a blessing, but it creates a very large task.  A task made worse if computers and IT aren't your field.

    And really, the user doesn't need a petabyte of data.  They just need a subset of that.  The problem is, they don't know which subset they need until they explore it.  Then, and only then, they need an analysis run on that particular subset-- a distilled product.

    We've talked about how there is too much data to just explore it all frame by frame, so some additional structure is needed.

    One approach is semantic data.  This is metadata, a description language placed alongside the actual data (images and whatnot) that describes what the data is and what it can be used for.  In theory, you can do a semantic query and not have to specify any dataset per se.
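    As a rough illustration of the idea, here is a toy sketch of a semantic query.  The catalog entries, tag vocabulary, and function name are all hypothetical; a real semantic layer would be far richer than a set of keywords.

```python
from dataclasses import dataclass, field


@dataclass
class DatasetRecord:
    """A dataset name plus the descriptive tags stored alongside it."""
    name: str
    tags: set = field(default_factory=set)


# Hypothetical catalog: these names and tags are invented for illustration.
CATALOG = [
    DatasetRecord("sea_surface_ir", {"temperature", "water", "infrared"}),
    DatasetRecord("urban_air_station", {"temperature", "air", "humidity"}),
    DatasetRecord("cloud_cover_visible", {"clouds", "visible"}),
]


def semantic_query(catalog, *concepts):
    """Return names of datasets whose tags cover every requested concept."""
    wanted = set(concepts)
    return [record.name for record in catalog if wanted <= record.tags]


# The user asks for a concept, not for a specific dataset.
print(semantic_query(CATALOG, "temperature"))
print(semantic_query(CATALOG, "temperature", "air"))
```

    Note that the query only works as well as the tags do: whoever wrote them decided in advance what each dataset is "about".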

    The difficulty with semantic data is that it tags the data from one perspective, which may not match what users (particularly from other fields) want.  In the end, semantic data is still a data-based schema imposed from outside.  It may not map to how people think.

    Another approach is what I'll call the liaison approach, and is based on library science (or systems analysis, if you remember that subfield).  At the library, you can search directly or browse the shelves.  If you need help at a library, you ask a person who understands both what people want, and how libraries are set up.  These three approaches are:

    1) search the catalog for titles and keywords that match = semantic data
    2) browse the shelf that has books on that topic = data exploration
    3) ask a librarian  = liaison

    The advantage of the librarian is they know books, and they know people.  So they can connect the two.  So if I am looking for a book on, say, Asperger's Syndrome, a librarian can help me narrow down whether I want to read about possible causes (medical), diagnose someone (psychology), find out if there is legal accommodation (law), read about people who have it (biography), or read fiction with aspie protagonists (fiction).  None of these are directly addressed by a keyword search and there isn't a single shelf that has all these categories.  The research librarian provides the interface between need and data.

    The final method is what I'll call building a map.  For this, a dedicated scientist formulates a problem, gets up to speed on all the possible data, decides what is relevant, then solves the problem.  She publishes her method and results, and from that point on, others can do similar analyses using her map.

    This is how science works at present.  It requires individual scientists blessed with wisdom who are (by magic, usually) able to secure funding for unproven cross-disciplinary work.  It also has a time lag, since a research project can easily take years from funding to publication.  But we are acquiring data faster than we are creating new maps for analyzing it.

    If this were an academic paper, I could end now, having raised the questions and offered possible solutions.  But this is ScientificBlogging, and we are allowed opinions.  So here goes.

    To solve the petabyte problem, I would set up a data researcher position, akin to reference librarian.  Then I would ship those data-savvy souls to conferences in different fields, to jam with their colleagues on what new problems could be solved using cross-disciplinary approaches.  Together they'd define how to approach and tackle the problems.  Said colleagues could then get proposals funded for the approach and create the maps for others to follow.

    And, I want a pony.

    Alex, the Daytime Astronomer

    The Daytime Astronomer, Tues&Fri here, via RSS feed, and twitter @skyday


    Great discussion.

    A similar trend is occurring in brain imaging research, where more advanced technology is starting to give researchers too much data, increasing the odds of spurious findings (which then fuel grant support).

    I think you've tapped the curse of empirical science: We wanted data, now we have data.

    Data libraries and librarians might help us sort out what we already have. But the larger issue, which gets at the root of empiricism, is that pointless data is just...pointless. When looking at data, quantity is no substitute for quality.

    My knee-jerk reaction is: Why did we bother collecting data that we didn't know what to do with? That seems as pointless as cryogenically freezing cancer patients for the day that we have a cure for cancer.

    Was the idea that someone will find it useful someday somehow? But if you don't know what you're looking for, then how do you know where to find it? Moreover, if you then stumble on something that looks interesting, how do you know it's not a spurious finding?

    It seems like the solution would've been to have a better theoretical reason for choosing what to collect. Because if we don't know why we're collecting it, then how do we know that we're collecting the right stuff? It's like planning to build a house and ordering 10 billion bolts.
    I agree, sometimes people go for bigger is better just out of habit.  Not always-- it's not always "data for data's sake".  Sometimes, we need ultra high resolution at very fast frame rates to see something interesting.  To pull from a recent result in solar physics, they finally had high enough resolution to see nanoflaring sites, which potentially solves the mystery of why the corona is so hot.

    But do we always need full resolution?  We can be clever and trigger high capture rates during significant events, then drop to something more manageable for the rest.  But this usually happens against the wishes of the science team, who want more data all the time.  We can only downlink so much data via TDRSS and the DSN, and other satellites compete for that resource.  Otherwise, we'd have even more of a data flood.
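    Here is a minimal sketch of that kind of event-triggered capture.  The threshold test, the stride of 10, and the synthetic signal are assumptions for illustration; a real instrument would use its own trigger logic.

```python
def adaptive_capture(samples, threshold, quiet_stride=10):
    """Keep every sample during events, and every Nth sample otherwise."""
    kept = []
    for index, value in enumerate(samples):
        in_event = value >= threshold
        # Full rate while the signal is interesting, sparse rate when quiet.
        if in_event or index % quiet_stride == 0:
            kept.append((index, value))
    return kept


# Synthetic signal: quiet background with a short burst in the middle.
signal = [1.0] * 50 + [9.0] * 5 + [1.0] * 45
kept = adaptive_capture(signal, threshold=5.0)

print(f"kept {len(kept)} of {len(signal)} samples")
```

    The burst survives at full resolution while the quiet stretches are thinned out, which is the trade the science team tends to resist.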

    Ultimately, I think the detector folks have outstripped the data design folks.  Hardware is easier to get funded, everyone wants bigger and better, and there's always been the thought that the database and programming people will figure out a way.  A little catch-up is needed before the next big data jump, but of course you can't fund infrastructure until you need it, so infrastructure will always be playing catch-up.
    I think your solution is a good one. It would provide employment for the next generations of young scientists who have an interest in this kind of archival research.

    That said, I really jumped in here to point out that even though we have satellites that are dumping petabytes of data, a lot of that data stream is of no interest to the user community. For example, consider a 24-bit data packet coming down from the mission I'm currently working on: What with encryption, CCSDS packet headers, randomization, etc., the total number of bits required to get those 24 bits of actual raw data to the ground is slightly over 48 kbits, representing a factor of 2000 in data formatting and encoding overhead. So your petabyte has become something more like half a terabyte of actual science data. It's still a huge amount of data, and it still needs the sort of solution you're proposing, but it's not quite as bad as it appears at first blush.
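    The arithmetic behind those figures can be checked in a few lines.  This sketch rounds "slightly over 48 kbits" down to exactly 48,000 bits and assumes decimal units (1 petabyte = 10^15 bytes); both are assumptions made here for the calculation.

```python
PAYLOAD_BITS = 24            # actual raw science data per packet
TRANSMITTED_BITS = 48_000    # "slightly over 48 kbits", rounded down (assumption)

# How many bits are sent for each bit of science data.
overhead_factor = TRANSMITTED_BITS / PAYLOAD_BITS

PETABYTE = 10**15            # bytes, decimal units (assumption)
TERABYTE = 10**12            # bytes

# Science content left in one petabyte of downlinked data.
science_terabytes = (PETABYTE / overhead_factor) / TERABYTE

print(f"overhead factor: {overhead_factor:.0f}")
print(f"science data per petabyte: {science_terabytes} TB")
```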


    The Economist just ran one of their big reports on this:


    Way to see the story 6 months ahead of time!
    Surely the way to start is to analyse the software put onboard the satellite and see just what it is asking the satellite to return as data? Is it open source? There must be different data streams from different sensors, which may yield different interpretations if taken separately or together.
    I would have thought this to have been planned in, when writing the initial design strategy??

    Don't worry about petabytes of data, it's only lots of terabytes, and we have them already - wink
    What we don't have yet are home cluster computers, or communal networked hard drive arrays at libraries, for example...but it's getting there...
    & Home AI can't be far away....?
    A pony may take a little longer, the replicator's on the blink again ;-)