Choosing storage for streaming large files in big data sets
A comprehensive collection of articles, videos and more, hand-picked by our editors
With more enterprises needing to store growing volumes of data, storage vendors and administrators alike need to determine best practices for storage in a big data environment. With this in mind, we're hearing more and more about technologies such as Hadoop and the cloud. But according to Jon Toigo, chief principal partner at Toigo Partners International, those approaches aren't necessarily best for a big data environment.
In this podcast, Toigo discussed with associate site editor Sarah Wilson how the storage market is evolving to handle big data. Listen to the audio or read the transcript below for his thoughts on how storage, the cloud, and backup and disaster recovery might adapt to a big data environment.
How do you think the storage market is evolving to better accommodate big data?
Jon Toigo: Well, I think you're seeing two different trends here. One is that Hadoop is sort of joined at the hip with the concept of big data. We're talking about Hadoop-style clustering. The industry is basically dissing shared storage at this point -- SAN [storage area network] and NAS [network-attached storage] -- and preferring direct-attached storage [DAS], particularly DAS that uses flash. IBM has gone up on stage -- the director of their storage group -- and specifically said, 'We see flash as the future of everything.' So they're trying to push flash-based storage, direct-attached to clustered units, as the modality for storing data that will be used for big data analytics.

I don't know [if] that's the best approach, and I think we're going to find out, and we're going to spend a ton of money finding out. It sets the clock back to pre-1999 in terms of storage architecture, and it reintroduces issues that we had two decades ago and have since forgotten about regarding how you protect your data in a world of isolated islands of storage. You're going to have to have a whole lot of replication going on from one node to another to provide protection to those isolated storage components, and nobody yet knows how much bandwidth that's going to consume. It does, however, sell more gear, which is something the industry likes to see given the general trend toward slowing storage sales.
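To give a feel for the bandwidth concern Toigo raises, here is a back-of-envelope sketch. The ingest rate and replica count below are hypothetical illustrations, not figures from the interview; three replicas is simply a common default in Hadoop-style clusters.

```python
# Back-of-envelope estimate of the network bandwidth consumed by
# node-to-node replication in a DAS-style cluster (hypothetical figures).

def replication_bandwidth_mbps(daily_ingest_gb: float, replicas: int) -> float:
    """Average Mb/s of replication traffic for a given daily ingest rate.

    Each new byte written locally must be copied to (replicas - 1)
    other nodes to protect the isolated islands of storage.
    """
    copies = replicas - 1                        # extra copies sent over the wire
    bytes_per_day = daily_ingest_gb * 1e9 * copies
    bits_per_second = bytes_per_day * 8 / 86_400  # 86,400 seconds per day
    return bits_per_second / 1e6                  # megabits per second

# Example: 2 TB/day of new data, three replicas
print(round(replication_bandwidth_mbps(2_000, 3), 1))  # prints 370.4 (Mb/s)
```

Even this simplified arithmetic shows a sustained several-hundred-megabit background load just to keep the copies in sync, before any analytics traffic.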
The other side of the house is trying to look at this from a holistic perspective. They're saying, 'We spent the last ten years deploying shared storage, deploying Fibre Channel fabrics. We had the potential to grow that technology going forward in significant directions, whether it was using InfiniBand, or SAS, or whatever the next generation of that is going to be. Why would we want to aggregate all our stuff for ten years and then segregate it again?' It doesn't make any sense. So you look at companies like DataCore Software, or to some extent IBM with SAN Volume Controller, and a few others who are trying to virtualize that storage infrastructure so they can present virtual volumes up to the servers that behave as though they're direct-attached storage for those servers. That makes a heck of a lot more sense, and it also gives you the ability to manage all the infrastructure associated with storage holistically. I think a lot more work needs to be done in that space rather than segregating all the storage and doing a bunch of direct attachment on a bunch of servers, and I think we're eventually going to get to that virtualized model.
Cloud seems to be a popular option for storage in a big data environment, too. Do you think cloud providers are adapting their services to work better with large volumes of data?
Toigo: You know, I'm of a mixed mind as far as clouds go. I'm not a big advocate of cloud technology generally -- of the public variety. However, I did think that maybe one of the better models for cloud going forward -- a sustainable business model for cloud -- would be cloud that specializes in holding huge repositories of certain kinds of data. I asked Jeff Jonas at IBM about this: would it make sense for a cloud service provider to stand up big data so I don't have to buy the infrastructure myself? I thought that might make sense for a company that doesn't want to spend big bucks on the infrastructure to support Hadoop for a business analysis project that's only going to be run once, or very infrequently -- like voter registration analysis. Why would you want to stand up a multi-million-dollar infrastructure to analyze one aspect of data and then go home and basically turn it off? It doesn't make any sense to me.
[Jonas] didn't think much of that idea, though, and I was kind of scratching my head about that, but he explained his view. He said the amount of time it takes to position data in a cloud, the bandwidth you have to pay for to access that data, the initial security and resiliency issues associated with data in the cloud, and many other aspects of cloud operations mean clouds are not necessarily the best places to host data for big data analytics.
Now, I think that, assuming some of these problems can be worked out -- and that's a big assumption -- you might find a cloud provider that says, 'We handle all the data for the national institute that holds all the oncology treatment data. We've properly stripped out all references to the patients themselves, and the original data is all here.' Now, if Johns Hopkins wants to run a big data analysis on a new drug trial they're doing, they should be able to plug into that data set as a service and include it in their analytical model. And that would make sense, because then you've got multiple customers in need of that kind of data.
Would I put my own data up in a cloud? Probably not. I don't today, and I wouldn't going forward, because a cloud service provider is hampered by the fact that he doesn't own the network that provides the connectivity to my shop. So how can he promise me a service level with a straight face? He doesn't control the mechanism by which I access the server. If my phone system goes up and down several times a month, it doesn't matter how stable the cloud service is -- I'm not going to be able to access it. So I don't understand why I should believe anything a cloud service provider tells me. I also have a hard time believing my information is secure if it's in the cloud. Now, that may not be a big issue if I adopt some sort of one-way hash like [IBM's Jonas has suggested] and depersonalize all the data that's up there; then I don't have anything to worry about. But for my mission-critical business processes and my business transactions -- my financial information, credit card information, whatever -- I'm sure as heck not going to put it up there. Bottom line: I have issues with cloud, and I'm not sure it's everything it's made out to be. And [on] the numbers on cloud: I read an article recently that claimed a 340% increase in cloud adoption, but the survey behind it had only 19 respondents.
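The one-way-hash depersonalization Toigo mentions can be sketched in a few lines. This is only a minimal illustration of the general idea, not Jonas's actual technique; the field names and key are hypothetical. A keyed hash is used so that someone holding the cloud copy can't simply re-hash a guessed identifier to confirm a match.

```python
import hashlib
import hmac

# Minimal sketch of one-way-hash depersonalization: replace direct
# identifiers with a keyed hash before the data leaves the premises.
# The secret key stays on-site, so the cloud copy can still be joined
# on the hashed token but can't be reversed to the original identity.

SECRET_KEY = b"keep-this-on-premises"   # hypothetical; manage like any secret

def depersonalize(identifier: str) -> str:
    """Return a stable, irreversible token for a patient identifier."""
    return hmac.new(SECRET_KEY, identifier.encode("utf-8"),
                    hashlib.sha256).hexdigest()

# Hypothetical oncology record: the raw patient ID never leaves the shop,
# but the same input always maps to the same token, so analyses in the
# cloud can still correlate records belonging to one patient.
record = {"patient_id": "JH-001234", "diagnosis": "C50.9", "treatment": "AC-T"}
record["patient_id"] = depersonalize(record["patient_id"])
```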
How do backup and disaster recovery methods change in a big data environment?
Toigo: If you follow the Hadoop model, which is basically to break up your shared storage and deploy it on individual nodes in direct-attached configurations, you run into a huge, huge problem with how you're going to replicate and safeguard that data. That's a major issue. We're already encountering it in shops that have adopted VMware, because VMware doesn't perform well at all with traditional shared storage. What VMware is asking you to do is break up your SAN and deploy your storage in direct-attached configurations right next to each VMware server in the cluster.

That creates an issue where you have to rely on back-end replication and mirroring from one box to another, and the problem with mirrors is nobody ever checks them. It's a pain to shut a mirror down: quiesce the application, flush the data out of the cache onto the disk, copy that disk over to the secondary mirror, then shut the whole operation down and do a file-by-file compare, and then cross your fingers, restart everything and pray to God you don't have a career-limiting event where things won't synchronize again. For that reason, nobody ever checks mirrors. And that's a key Achilles' heel of the way big data is architected over Hadoop.
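The "file-by-file compare" step Toigo describes can be sketched as a checksum walk over both copies. This is a simplified illustration under the assumption that the application has already been quiesced and caches flushed; the directory paths and function name are hypothetical, and real mirror verification would normally run against a quiesced snapshot rather than a live volume.

```python
import hashlib
from pathlib import Path

def _digest(path: Path) -> str:
    """SHA-256 of a file, read in 1 MB chunks to handle large files."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def compare_mirrors(primary: str, mirror: str) -> list[str]:
    """Return relative paths that are missing from, or differ on, the mirror."""
    primary_root, mirror_root = Path(primary), Path(mirror)
    problems = []
    for src in primary_root.rglob("*"):
        if not src.is_file():
            continue
        rel = src.relative_to(primary_root)
        dst = mirror_root / rel
        if not dst.is_file() or _digest(src) != _digest(dst):
            problems.append(str(rel))
    return sorted(problems)
```

An empty result means every file on the primary has a byte-identical counterpart on the mirror; anything else is exactly the kind of silent divergence that goes unnoticed when, as Toigo says, nobody ever checks.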
Now, I will say that using big data analytics to model and monitor an expansive storage infrastructure makes sense: I've got a whole lot of information coming in in tidbits from things like S.M.A.R.T. [Self-Monitoring, Analysis and Reporting Technology] on disks and the various element managers on the various storage components I've got out there. Being able to correlate all of that and spot problems in real time, or even proactively, would let me avoid a lot of disasters. So on one hand, I like big data for what it can offer from a disaster recovery standpoint -- better information about, and management of, the infrastructure so I can avoid those kinds of risks. On the other hand, I don't like what Hadoop infrastructure means in terms of complicating my life from a data protection standpoint.
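The proactive-monitoring idea above can be sketched as a trivial rule over per-drive telemetry. Everything here is hypothetical: the drive names, the sample values, and the threshold are illustrations of the general approach of correlating S.M.A.R.T.-style counters over time, not any real monitoring product.

```python
# Toy sketch of proactive storage monitoring: watch a per-drive counter
# (e.g. reallocated sectors, hypothetical values here) over time and
# flag drives trending toward failure before they actually die.

THRESHOLD = 10  # hypothetical: flag any jump of 10+ reallocated sectors

def flag_failing_drives(samples: dict[str, list[int]]) -> list[str]:
    """samples maps drive ID -> time-ordered reallocated-sector counts."""
    flagged = []
    for drive, counts in samples.items():
        if len(counts) >= 2 and counts[-1] - counts[0] >= THRESHOLD:
            flagged.append(drive)
    return sorted(flagged)

telemetry = {
    "sda": [0, 0, 1],    # stable
    "sdb": [2, 9, 25],   # climbing fast: replace before it fails
}
print(flag_failing_drives(telemetry))  # prints ['sdb']
```

A real system would correlate many signals across many components, but the principle is the same: act on the trend before the failure, rather than recovering after it.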