Each cell in the body contains a whole genome, 3 billion "letters" known as bases, so the data packed into a few DNA molecules could fill an entire hard drive.
Researchers are no longer limited to a single reference genome for study: more and more people are having their DNA sequenced. The result is a truly massive amount of data that will require computational and storage capabilities beyond anything previously anticipated, says a new assessment from computational biologists and computer scientists at the University of Illinois and Cold Spring Harbor Laboratory.
The team of experts compared the data needs of genomics with those of three of the biggest players in big data: astronomy, Twitter and YouTube. They projected growth in each area through the year 2025 and found that genomics is poised to be a leader in data acquisition, storage, distribution and analysis.
"As genome-sequencing technologies improve and costs drop, we are expecting an explosion of genome sequencing that will cause a huge flood of data," said Gene Robinson, a professor of entomology at University of Illinois. "The only way to handle this data deluge will be to improve the computing infrastructure for genomics.
"Astronomy, Twitter and YouTube represent three diverse domains that generate and use a huge amount of data, albeit with huge differences in computing needs. The diversity of these three forms of Big Data provides an excellent framework for comparative analyses with genomics," he said.
Like data from YouTube and Twitter, genomics data are highly distributed, coming from many different sources. However, both Twitter and YouTube have standard formats for their entries, while genomic data can assume many different formats, making sharing and storing more complex.
The authors estimate that genomic sequencing so far, of different organisms and a number of humans, has produced data on the petabyte scale (a petabyte is a million gigabytes). However, over the last decade, genomic sequencing data doubled about every seven months, and the authors expect it to grow at an even faster rate as personal genome sequencing becomes more widespread. The researchers estimate that by 2025, genomics data will explode to the exabyte scale - billions of gigabytes. That would surpass even YouTube, the current title holder among the domains studied for most data stored.
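To get a feel for what a seven-month doubling time implies, the arithmetic can be sketched in a few lines of Python. This is only an illustration, not the authors' projection method: it assumes a starting point of exactly one petabyte, a target of exactly one exabyte, and a constant doubling time, none of which the article states precisely. The function name `months_to_grow` is made up for this example.

```python
import math

PETABYTE_GB = 1e6    # a petabyte is a million gigabytes
EXABYTE_GB = 1e9     # an exabyte is a billion gigabytes
DOUBLING_MONTHS = 7  # doubling time quoted in the article (assumed constant here)


def months_to_grow(start_gb, target_gb, doubling_months):
    """Months needed to grow from start_gb to target_gb at a fixed doubling time."""
    doublings = math.log2(target_gb / start_gb)  # number of doublings required
    return doublings * doubling_months


months = months_to_grow(PETABYTE_GB, EXABYTE_GB, DOUBLING_MONTHS)
print(f"{months:.0f} months (~{months / 12:.1f} years)")
```

Under these assumptions, a thousandfold increase takes about ten doublings, or roughly 70 months, which is a little under six years.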
Yet the sequences are only one element of genomic data.
"The DNA sequence in itself is not particularly useful for realizing all the great possibilities that genomics technology promises," said co-author Saurabh Sinha, a professor of computer science at Illinois.
"The sequence data have to be analyzed through sophisticated and often computationally intensive algorithms, which find patterns in the data and make connections between those data and various other types of biological information, before they can lead to biologically or clinically important insights. All of this makes the goal much more challenging than just sequencing DNA and storing that information."
The need for complex analysis is similar to astronomy, but with an important difference, the authors say. Astronomy generates vast amounts of data but incorporates several processing technologies at the time of data collection, requiring less time and computational power later on. The researchers suggest that integrating similar processing methods could cut down on the storage needs for genomic data as well. But there's a catch: The whole genome may offer insights not yet anticipated, and new understandings may emerge as more people are sequenced, so discarding raw data early risks losing information that later proves valuable.
"In the future, we may have to take the hard decision of storing only the processed form and not the original, and that, too, in heavily compressed forms, to drastically reduce the storage needs," Sinha says.
The authors urge new technology development to handle the expected explosive growth in genomics data beyond what is predicted for social media and astronomy.
"Genomics will soon pose some of the most severe computational challenges that we have ever experienced," Robinson said. "If genomics is to realize the promise of having a transformative positive impact on medicine, agriculture, energy production and our understanding of life itself, there must be dramatic innovations in computing. Now is the time to start."
Published in PLOS Biology.