Isilon was one of two vendors EMC acquired last year under its big data banner. Analytics vendor Greenplum was the other. Isilon customers commonly store close to a petabyte or more of file data, and several said their data capacity grows by terabytes daily.
“I’ve been with the company 10 years now. We’ve never not been dealing with big data,” said Paul English, director of IT at 3Tier, which provides weather forecasts and historical patterns that help renewable energy companies around the world determine the best sites for wind farms.
“From Day 1 we’ve had big data sets,” English said. “We’re generating two terabytes a day. We use analytics to pluck out what’s useful, and have the luxury of being able to throw a lot of data away.” Still, English said that in a few decades, 3Tier could have millions of petabytes of climate data in its repository.
Jim Lowey, director of network and computer systems at Translational Genomics Research Institute (TGen), said he has approximately 500 TB of Isilon storage, and it grows about 10 TB a week. “I’ve been with TGen for nine years and it’s been fascinating to watch how quickly the amount of data has grown,” he said.
TGen creates genomic sequences and other biological data to help develop treatments for diseases. Lowey said initial data that comes off a sequencer is 2 TB in size and grows over time.
“And we can’t throw data away,” he said. “When I built the first supercomputer at TGen, we had a 1 TB file system. Now, we say ‘a terabyte here, a terabyte there, it’s not a big deal.’ Unfortunately, storage technology hasn’t advanced at as quick a pace as genomic sequencing. It’s led to all kinds of challenges.”
The challenges go beyond merely storing the data. Like 3Tier, TGen runs analytics to get value from its scientific data. LiveOffice CEO Nick Mehta said he has approximately 4 PB of Isilon storage to support his firm’s cloud-based email hosting service. LiveOffice offers unlimited storage at a flat rate, but the value of its service lies in its search and discovery capabilities. LiveOffice also must encrypt customer data.
“We have to keep all our customers’ data as if it was on their own storage,” he said. “Storage is still important, but there’s been a psychological shift to ‘How do I do what I thought was impossible before?’”
TGen’s Lowey said other challenges with big data include providing metering and chargeback for multiple internal customers, and making the research data available across multiple sites.
“The challenge is how to do chargebacks and measure the true amount of consumption,” he said. “The biggest thing is to have a data management tool that sits on top of all this storage to do that, [and] we don’t have that now.”
The challenge of moving data for TGen comes from having its sequencers in Phoenix and its supercomputer about 12 miles away in Tempe. “I had a 1 gig [Ethernet] link and that didn’t last,” he said. “Now we have a 10 gig link and we’re playing with InfiniBand over that link.”
Role of storage clouds in big data
Storage clouds are another option for managing big data.
“We’re getting a push from senior management to go to the cloud,” Lowey said. “I’m exploring a hybrid cloud model. That gives me the elasticity to expand my computational needs on a whim.”
Lowey said he’s tested Penguin Computing’s public cloud for sequencing, and is considering object-based systems such as EMC Atmos, DataDirect Networks’ Web Object Scaler (WOS) and Dell’s DX for a private cloud. “I think everything will move to object-based,” he said. “Block-based storage is dead, it’s like caveman technology at this point.”
Lowey said there’s still a major hurdle to object-based storage, though. “None of them have a true open API to move data back out,” he said. “They’re going to hit a wall because people like me are telling them, ‘We’re not going to make an investment until we know we can take it out easily.’”
3Tier’s English said a public cloud may be in the forecast for his firm, for data that customers need in a rush — for instance, when the government offers an occasional credit for wind power and 3Tier customers need data quickly to map potential wind farms.
“We’re dancing with it at the moment,” he said of the cloud. “The research we’ve done so far shows the cloud is more expensive for storage and more expensive per CPU cycle. Our goal is basically to stay in-house for the baseline load that we always have, and then take the spikes and put those in the cloud as much as possible, where we can potentially deliver results to customers much faster than they would expect.
“When the government puts out a big energy credit, everybody wants to build, and they all want information yesterday,” English noted. “We say, ‘We don’t have any more computers than we did yesterday.’ But if everybody is asking for the same thing, we can deliver it faster to them because we can ship that out to the cloud.”