Analyst Randy Kerns discusses the state of file-based storage

Big data analytics and cloud storage are just two technologies that are fueling file-based storage requirements. In this 2012 Chicago Storage Decisions TechTalk interview, Evaluator Group Inc. senior analyst Randy Kerns discusses the effects of these new data center trends on file-based storage, as well as machine-generated data and its effect on storage.

Why are files growing so much faster than block data?

Kerns: Two reasons. First, there are an awful lot of people running particular applications of their own or [keeping] personal data, and they're storing it as email attachments or, more usually, in their home directories. They keep everything, in massive amounts.

Another reason is that there are a lot more applications available now than there used to be [using] strictly file-based, as opposed to database, accesses. So, you see a plethora of applications that are all using files.

Are traditional network attached storage filers [NAS filers] still the most effective way to deal with growing file data?

Kerns: Up to a point, [and then] you're scaling the support and operations. If you start at a particular point and say, "I have this many files," and then you say, "I need more NAS storage," a lot of times you'll buy another NAS system. A lot of that has to do with the amortization schedules of the product because, the way most of them are scaled, you have to add more capacity into a particular platform.

A lot of people will add another NAS system. What you end up with then is some isolation. I don't have the global namespace across those platforms, so I have to administer two different entities, and then three and four and five. It just keeps growing, so the operational expense goes up and the headache of where data is bouncing becomes a problem.

Adding more NAS is absolutely the correct answer. But it's how you do it. Do you do it as individual platforms? It doesn't take long before that becomes too expensive. Then you look and say, "Well, maybe I need a scale-out NAS or clustered NAS system, or I need another NAS file virtualization offering to federate multiple NAS systems together."

When we talk about big data, are we talking mostly about file-based storage?

Kerns: Correct. A lot of big data, it's really big data analytics -- let's keep that in mind because you can get confused. But it's the information that's machine-to-machine transfer data. That comes from an area of the industry called pervasive computing, where there are lots of things producing information. The information coming in for analytics is file based. So, it's very different, and it's file-based data because of the way it's produced. Then you work with different systems that can do analytics on the data in real time.

The issue is what do you do with that data? You have both the original information coming in, and then you have the intermediate and final results. Obviously, the intermediate and final results you want to keep. That may or may not be file-based data. It may be database updates or things like that. But the original data is file-based data, and it's massive.

You have to decide. There's one camp that says, "I've already looked at it, so let's just get rid of it." Most people, especially those in IT, say, "Wait a minute, there may be more value in that data. Let's keep it. We may want to do more data mining against that data." So, now I have a massive amount of file data that I've already maybe run through my analytics processing, but I want to keep that.

We're starting to hear a lot about 'machine-generated data.' How does that impact storage environments?

Kerns: It's pretty simple. A machine can generate data faster than a human can. There have been some studies that say that this amount of data in a data analytics environment -- where you're introducing it into traditional IT -- instantly has four times the amount of storage required and a faster growth rate than what IT had without the big data analytics or machine-to-machine data coming in. It's like a step function in capacity, a very large step, and then an increase in the scale. So, it's a big deal for IT.

How important is clustered or scale-out NAS for handling large sets of file-based data?

Kerns: Well, it's like we were talking about earlier. You have the situation where I've got to continue to grow with capacity. If I don't have the ability to manage it as a single entity, to maybe have a global namespace across it, the management of that becomes overwhelming and my operational expense gets very high. So, clustered NAS, or however it's implemented, scale-out NAS is really critical.

The other part of it is the financial aspect. I want to continue to scale, but as I buy a new storage system, it has a finite lifespan. So, maybe I buy a new system and it has a four- or five-year life span. It's on an amortization schedule of four or five years. As I scale and want more, I want to bring another platform in. When I bring it in, I've got a new starting point for its life span each time, and I have a new amortization schedule.
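
The staggered schedules described here can be illustrated with a minimal sketch. It assumes straight-line amortization, which the interview does not specify, and the platform names, costs and purchase years are entirely hypothetical. Each platform starts its own schedule when it is bought and drops out when its life span ends.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Platform:
    """One NAS platform with its own amortization schedule (hypothetical values)."""
    name: str
    purchase_year: int
    cost: float
    life_years: int

    @property
    def retire_year(self) -> int:
        # First year in which the asset is fully amortized and can be retired.
        return self.purchase_year + self.life_years

    def annual_expense(self, year: int) -> float:
        # Assumption: straight-line amortization, an equal share of cost per year of life.
        if self.purchase_year <= year < self.retire_year:
            return self.cost / self.life_years
        return 0.0


def schedule(platforms, first_year, last_year):
    """Total amortization expense per year across all platforms."""
    return {
        year: sum(p.annual_expense(year) for p in platforms)
        for year in range(first_year, last_year + 1)
    }


# Hypothetical purchases: a second platform is added two years after the first.
platforms = [
    Platform("nas-a", purchase_year=2012, cost=100_000.0, life_years=5),
    Platform("nas-b", purchase_year=2014, cost=120_000.0, life_years=4),
]

if __name__ == "__main__":
    for year, expense in schedule(platforms, 2012, 2018).items():
        print(year, expense)
```

With these made-up figures, the two schedules overlap from 2014 through 2016, and each platform reaches its retirement year independently (2017 for the first, 2018 for the second).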

The ability to scale performance and capacity with a new starting point for the evaluation of that becomes very critical. Just throwing more disk drives into the current platform causes me a whole lot of other problems. Setting aside the performance issues and the imbalance that can be created, the economics of managing life span become a real problem. That's why it's very important to have a scale-out or clustered NAS offering where I can scale independently and then retire assets independently as they reach their financial end-of-life.

Do you anticipate more businesses using online file-sharing services?

Kerns: I think there will be more users. The question, more importantly, is what information are you going to have there? There are a lot of companies that are very security-conscious, and a lot of them that have bandwidth challenges to be able to do that. There are a lot of different reasons it won't be as universal as a lot of people would like to think. But there are great instances for global distribution, global sharing of data and for putting fixed content that has maybe a low demand cycle that can make a lot of sense. The problem, again, is the price of bandwidth and the availability of bandwidth.

Is object storage a good way to handle unstructured data?

Kerns: It is, but there have to be some caveats here. A lot of people have a preconception of what object storage means, and a lot of it is equated with the earlier success EMC saw with Centera and their view of object [storage]. What people are talking about today is a little different. They're really talking about file-based storage that then has additional metadata added to it.

That additional metadata is where the real value is. I can put things in there with the metadata about the retention period, the type of data protection required, authorizations for deletion and compliance requirements. I can add that metadata associated with that file, and that's what object storage is today.

So, it's still file-based data, but it's getting to its object storage location differently, and then the key is that additional metadata. How does that get there? You'd like that to come from the application and then obviously by proxy from the user. Applications change slowly, so you may have to put an intermediary system that looks like a NAS device that can, by a template or some other rules, add that metadata and then put it onto an object storage device.
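
As a rough illustration of that intermediary idea, here is a minimal sketch in which incoming files are matched against templates and tagged with metadata before being handed to an object store. Everything in it is an assumption made for illustration: the `StoredObject` type, the `TEMPLATES` rules, the `ingest` function and the metadata values are hypothetical and do not describe any particular product.

```python
from dataclasses import dataclass, field
from fnmatch import fnmatchcase


@dataclass
class StoredObject:
    """A file plus the extra metadata that makes it an 'object' in this sketch."""
    path: str
    data: bytes
    metadata: dict = field(default_factory=dict)


# Hypothetical templates: the first pattern that matches the file path wins.
# The fields mirror the kinds of metadata mentioned in the interview.
TEMPLATES = [
    ("*.pdf", {
        "retention_years": 7,
        "protection": "replicated",
        "delete_authorized_by": "compliance-officer",
        "compliance": "records-policy",
    }),
    ("*", {
        "retention_years": 1,
        "protection": "single-copy",
        "delete_authorized_by": "owner",
        "compliance": "none",
    }),
]


def ingest(path: str, data: bytes, templates=TEMPLATES) -> StoredObject:
    """Attach metadata from the first matching template, as an intermediary might."""
    for pattern, template in templates:
        if fnmatchcase(path, pattern):
            # Copy the template so later edits to one object do not change the rule.
            return StoredObject(path, data, dict(template))
    return StoredObject(path, data)


if __name__ == "__main__":
    obj = ingest("reports/q1.pdf", b"example bytes")
    print(obj.path, obj.metadata)
```

The application keeps writing ordinary files; only the intermediary knows the rules. If applications later supply their own metadata, the template step could be bypassed.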
