Big Data Summit panelists: Petabyte data stores fuel data science

Data scientists who study topics such as disease and climate change require robust storage systems to hold petabytes of crucial research data.

WASHINGTON, D.C. -- If you're still coming to grips with big data and how to best use it, you're not alone. Even people whose jobs depend on analyzing data for national security and healthcare are still searching for definitions.

"I never understood the term big data," said Tony Scott, CIO for the U.S. federal government, last week at the Federal Big Data Summit. "Is it fat data, is it skinny data, or is it just a lot of it? But it is a good term to describe good work going on in data science or data analytics."

One panelist at the Big Data Summit, Hoot Thompson, advanced technology lead for the NASA Center for Climate Simulation (NCCS), presented a perfect example of what big data looks like.

Thompson said his group has 30 PB of rotating storage and 40 PB of tape storage connected to its supercomputer. NCCS bought 20 PB of storage in 2014 alone, he said, although he pointed out that was more than in a typical year.

The center runs models that predict climate patterns over 50 years. A single model can generate between 3 PB and 4 PB of data.

"We're not consumers of the data," Thompson said. "We're trying to put the data on the table so other organizations can make decisions. All of our data is shareable."

Scientists can watch simulations in the NCCS Data Exploration Theatre, which includes 15 monitors on a nine-foot-high wall at the Goddard Space Flight Center in Greenbelt, Md.

The NCCS team has 20 PB on a DataDirect Networks SFA12K Fibre Channel SAN with 6 TB drives and 10 PB on six racks of NetApp E-Series systems. For performance, it includes 48 TB of solid-state drives in a Cisco Invicta SAN that stores metadata for the IBM General Parallel File System. Data is backed up and archived on Oracle StorageTek T10000D tape drives.

Thompson said he is an open source fan. He has been running a version of Gluster for two years and is planning an OpenStack-based storage cloud built on 14 bricks, each consisting of a 2U server and a 4U JBOD connected over InfiniBand.

The center's 3,000-node supercomputer mixes SGI and IBM nodes, combining for more than 3 petaflops of processing power and 138 TB of memory. A Hadoop cluster handles analytics.

"How do we bust this data up and make it useful?" Thompson said. "We built our own Hadoop cluster."

Analyzing the data is the real key to big data, and not just for studying climate change. Another Big Data Summit panelist, National Institute on Aging health scientist administrator Suzanna Petanceska, outlined her group's project studying data from thousands of brains to try to prevent Alzheimer's disease. That requires not only storing the data, but also finding the important information inside it.

Federal CIO Scott said three kinds of people are required to efficiently sift through and analyze data. "You need people who are good at framing an issue," he said. "They say, 'What if we could … if you just had the data to understand this, here's a problem we could solve.'

"The second group of people is good at the application and manipulation of the data. They're good at figuring out how to do it. The third group is made up of people who can interpret the data. That's why I like a diverse team of people with different skill sets."
