
Storage considerations for a big data infrastructure

Jon Toigo discusses what big data is, the popularity of object storage and how to determine the best storage for a big data infrastructure.

According to Jon Toigo, CEO and managing principal of Toigo Partners International and chairman of the Data Management Institute, big data analytics once referred to the process of mining large amounts of data to find specific pieces of information, but the term big data is now commonly used in a broader sense to describe large volumes of growing data.

Toigo believes object storage is one of the best ways to achieve a successful big data infrastructure because of the level of granularity it allows when managing storage. He even sees it as the "future of storage." But when determining how to store big data, he said, administrators must first consider what the big data is being used for; for example, capacity demand might be the main concern for one big data infrastructure, while privacy matters more for another.

In this podcast with associate site editor Sarah Wilson, Toigo shares his thoughts on what big data is, the best ways to store it and some of the problems storage administrators of big data infrastructures might come across. Listen to the podcast or read the transcript below.

What are some of the storage challenges IT pros face in a big data infrastructure?

Jon Toigo: Well, first of all, I think we have to figure out what we mean by big data. The first usage I heard of the term -- and this was probably four or five years ago -- referred to the combination of multiple databases and, in some cases, putting unstructured data into some kind of framework to mirror real-time analysis. The point was we were going to gather all this data together. We were going to associate it with each other, and we were going to allow the data, in many cases, to inform us of changes and give us information that was useful.

The classic case was finding a terror suspect based on what you know about the potential terrorist maybe [using] the documentation he has from various countries, the airline booking databases that are out there for major airlines and the routes he might take, vehicle renting information if he's going to rent a vehicle and fill it with explosives -- all these things could be correlated. It was finding that proverbial needle in a haystack, and that was what big data analysis was all about. Another use was spotting potential voter fraud by looking at death records and drivers' licenses, white pages directory listings and voter registration databases, and combining all that stuff to look for folks who registered to vote who may already be deceased.
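The voter-registration example comes down to correlating records across separate data sets. Here is a minimal sketch of that idea, assuming each source can be reduced to a name and a birth date; the `Person` type, the `normalize` key and the sample records are hypothetical and are not taken from any real system. A match is only a lead for further checking, since two different people can share a name and a birth date.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Person:
    name: str
    birth_date: str  # ISO date string such as "1931-04-12" (assumed format)


def normalize(person):
    """Build a crude match key: lower-cased name with collapsed spaces, plus birth date."""
    return (" ".join(person.name.lower().split()), person.birth_date)


def flag_deceased_registrants(voter_rolls, death_records):
    """Return registered voters whose match key also appears in the death records."""
    deceased_keys = {normalize(p) for p in death_records}
    return [v for v in voter_rolls if normalize(v) in deceased_keys]


# Hypothetical sample data, for illustration only.
voter_rolls = [
    Person("Ada  Example", "1931-04-12"),
    Person("Bob Sample", "1975-09-30"),
]
death_records = [Person("ada example", "1931-04-12")]

flagged = flag_deceased_registrants(voter_rolls, death_records)
print([p.name for p in flagged])
```

Real record linkage would need fuzzier matching and more fields (addresses, license numbers and so on); the set lookup above only shows the basic shape of the correlation.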

The bottom line is that these were the classic examples that we were going to use big data analytics for. And it expresses both what big data is and what big data analytics are. Big data is simply the collection of stuff to be analyzed. Today, I'm hearing a lot of vendors use the term big data to refer to just about anything. It's a reference to the reality that we all confront -- we've got a lot of data and it's growing, mostly in the form of files, and we have problems curating that data, storing it, and using it efficiently and cost-effectively. So like most terms these days, big data seems to have been co-opted by marketers, and now it means whatever the marketing department articulating the value says it means.

So problem No. 1 is to define what you mean by big data. And then we get into a whole other set of problems when figuring out how to store a growing volume of data for a really long time versus finding a way to stand up many data sources that can be used collectively to achieve some analytical purpose.

So if you understand what I'm saying, the challenge here is muddied by the fact that nobody has a good definition of what big data is.

In the case of really large volumes of data, what makes object storage so popular?

Toigo: First of all, I think object storage is the future of storage, and I don't know anybody who would disagree with that. There are a lot of vendors who are kind of ahead of the curve, and they're evangelizing it. Object storage is like the next evolution of how we're going to store data. It's the only way to wrangle files, which now constitute over half of the data we're storing. And they're mostly controlled by users so we don't have a lot of information about what's going on inside that file. It's sort of an anonymous piece of data. If we want to create some sort of organized methodology for storing them for a period of time -- maybe migrate them through a tiering process and pay attention to what their relevance is from a business standpoint -- we're going to need a more granular way to do data management, and that's what object storage has offered to do for a very long time.

Theoretically, certain types of big data analysis processes could be facilitated by object storage. For those who focus on metadata -- counting operations, for example -- object storage removes some of the problems associated with unstructured data, depending on how you implement the object-oriented system itself. It could allow a mix and match of certain kinds of files, plus reorganization for comparative purposes and so forth. A company called Caringo is one I've been following for about five years now, and it has a great story to tell in this space; [it's] been evangelizing object storage with zeal for quite some time.

The one caveat is that there are a lot of different protocols for doing object storage that started to appear in the market, partly because of the popularity and trendiness of things like big data. And like so many technologies, the industry seems absolutely hell-bent on creating exclusive stovepipe object storage methodologies that are going to create issues downstream for mixing and matching data organized using different object storage paradigms. I think that's going to hurt, for example, the use of clouds as object storage repositories, because you're going to have multiple clouds, and each cloud might be organized around a different object storage paradigm and it won't be able to share its data with another cloud. That could be a big impediment going forward.

Object storage basically is a way to put some additional value -- think of it as an additional metadata construct -- over the top of the file and give it a unique identifier so it doesn't get stepped on, and [you can] include it in some sort of a database structure so you're able to move it around, use it and reference it. If you think about the way the NSA [National Security Agency] is supposedly using telephone record data and all these conflicts that have come out, they're not interested in the contents of the telephone conversation specifically; they're interested in the network of contacts [or] the relationships between data. That's something object storage would be ideally suited for: a mechanism to connect the dots and show metadata relationships between data as opposed to the contents of the data itself. So in terms of the number of people who visit a website, it's not as important for me to know which people visited that website as [it is to know] the number of people who came there after a certain article was posted. These are places where simple counting operations may substitute for a detailed analysis of what's going on inside the file. So I see object storage definitely as the future. Will we get there in my lifetime? I don't know; it might just be like how long we've been waiting on holographic storage.
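To make the description above concrete, here is a toy, in-memory sketch of the idea: each stored item gets a unique identifier and a metadata record, and a counting question is answered from the metadata alone without opening the payload. The `ToyObjectStore` class, its method names and the sample visits are all hypothetical; this is not the API of any real object storage product.

```python
import uuid
from datetime import datetime


class ToyObjectStore:
    """Hypothetical in-memory store: unique ID -> payload plus metadata."""

    def __init__(self):
        self._objects = {}

    def put(self, data, **metadata):
        """Store a payload with its metadata and return a new unique identifier."""
        object_id = str(uuid.uuid4())
        self._objects[object_id] = {"data": data, "metadata": dict(metadata)}
        return object_id

    def get(self, object_id):
        """Return the payload stored under the given identifier."""
        return self._objects[object_id]["data"]

    def count_where(self, predicate):
        """Count objects whose metadata satisfies the predicate; payloads are never read."""
        return sum(1 for obj in self._objects.values() if predicate(obj["metadata"]))


store = ToyObjectStore()
article_posted = datetime(2024, 1, 1, 12, 0)  # made-up timestamp

# Three made-up page visits; only the metadata says when each one happened.
for hour, minute in [(11, 0), (12, 30), (13, 15)]:
    store.put(
        b"<request log entry>",
        kind="page_visit",
        visited_at=datetime(2024, 1, 1, hour, minute),
    )

visits_after_post = store.count_where(
    lambda m: m.get("kind") == "page_visit" and m["visited_at"] > article_posted
)
print(visits_after_post)
```

The count comes entirely from the metadata layer, which is the kind of simple counting operation described above as a possible substitute for analyzing file contents.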

In addition to object storage, what other types of storage are best for a big data infrastructure?

Toigo: Now, again, this comes down to what you want to do with your big data. There are some practical issues obviously to consider. For example, you may need to have a method for reducing the amount of space that's occupied by big data to constrain storage capacity demand and the associated expense, and you'll need a way to do that that doesn't impact your object storage method. So if you use object storage, your ability to derive information expeditiously might be confused if you take these object-oriented data entries and compress them or deduplicate them, or something of that nature. That might spoil that data and make it unusable from an analytical perspective. You need to be very circumspect with what you're going to do with the data, how you're going to store it and, if you're going to reduce it, what the impact of that reduction will be.

Another example is the problem of privacy, which enters into a lot of big data analytics processes. There may be a desire to share data sets. For example, in health care you might want to share treatment modalities -- how many people have had [a certain] kind of cancer and what was the effectiveness of a particular drug treatment on that cancer -- but there are constraints on sharing the data that belongs to those health records. HIPAA [Health Insurance Portability and Accountability Act] prevents you from disclosing private patient health care [information]. You can't have the patient's name, Social Security number or patient identification associated with the data itself. That's also one of the problems behind the big NSA surveillance program, and it's a big deal for health care's big data efforts in the future.

So how do you minimize the alteration to data to redact sensitive stuff, but don't dilute or harm the data in such a way that it would reduce its value from the standpoint of analytics? Contemporary encryption, which is a technology some people would like to use to protect privacy, is probably not going to work or play well with big data analytics as we understand it today. I chatted with IBM's Jeff Jonas, a chief scientist who deals with their big data stuff, about this at IBM's Edge conference about a month ago, and he expressed a need for what he would call a one-way hash. And for those people who aren't familiar with what a one-way hash is, it's a mechanism that preserves data usefulness and integrity but excludes the sensitive details from the data. He described it this way: You can give somebody some pork and a grinder and they can use it to make sausage, but if you give them sausage and a grinder, they can't reverse-engineer it to make a pig. That would be an ideal way that one-way hashing could be used to retain the value of data without revealing any of the details that are better not shared. So I think about that and say there's still some room for technology improvement here to determine how we're going to share the data and how we're going to store it.
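One common way to approximate the one-way hash idea is a keyed hash (HMAC) over the identifying field, so records from the same person still line up for counting while the identifier itself is not shared. The sketch below is an illustration under assumptions, not the mechanism Jonas had in mind: the key, the field names and the sample records are hypothetical, and a keyed hash alone does not amount to full de-identification.

```python
import hashlib
import hmac

# Hypothetical secret held only by the data owner. Without a secret key,
# short identifiers could be recovered by hashing every possible value.
SECRET_KEY = b"replace-with-a-real-secret"


def pseudonymize(identifier, key=SECRET_KEY):
    """Map an identifier to a fixed-length token that cannot be reversed without the key."""
    return hmac.new(key, identifier.encode("utf-8"), hashlib.sha256).hexdigest()


# Made-up treatment records, for illustration only.
records = [
    {"patient_id": "P-001", "drug": "drug-A", "responded": True},
    {"patient_id": "P-001", "drug": "drug-B", "responded": False},
    {"patient_id": "P-002", "drug": "drug-A", "responded": True},
]

# The shareable copy keeps the analytical fields and replaces the identifier.
shared = [
    {
        "patient_token": pseudonymize(r["patient_id"]),
        "drug": r["drug"],
        "responded": r["responded"],
    }
    for r in records
]

# Analysts can still count distinct patients without seeing who they are.
distinct_patients = len({r["patient_token"] for r in shared})
print(distinct_patients)
```

The same patient always maps to the same token, so correlations survive, but the remaining fields can still identify someone in a small population, which is part of why this remains an open problem.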

And finally, continuity planning is sort of my shtick. I find myself asking, if this big data complex -- the infrastructure we're storing this big data on -- is so important and so mission-critical, how are we going to protect it from loss or corruption when all the devices it's stored on are subject to a broad range of problems, whether they be man-made threats or natural threats? I wouldn't deploy a big data complex without figuring out first how to provide its continuity, availability and integrity. So in other words, I'm not going to give you a prescription for the best options on storing data as much as I'm trying to express that there are a number of considerations that you'd better have in your head before you just start to throw data out there on storage.
