Do you know when you need to keep certain data, or when you're clear to send it on its merry way to be deleted? Storing everything for an indeterminate amount of time is hardly practical, but knowing what's safe to get rid of and what you should keep is an issue many organizations struggle with when dealing with the growth of unstructured data.
The problems associated with the growth of unstructured data aren't new to many people, and some may even convince themselves that it's not an issue. But that's likely not the case, according to Randy Kerns, senior strategist and analyst at Evaluator Group.
"Almost everybody has some percentage of growth," Kerns explained. "What we typically see is somewhere between 20% and 35% growth rate, organically. Then we also see a significant number of our clients that have [the] introduction of some big data projects that came from outside. IT wasn't part of it -- that just got dumped on them -- and so they end up with some huge spike they have to deal with."
In his Storage Decisions session, "Scale-up, scale-out: A survival guide for coping with capacity growth," Kerns discussed the growth of unstructured data, and how those who have to deal with it are probably not completely prepared.
So, how do you deal with unstructured data? Defensible deletion can be effective, but it's not always an option. Many in IT are reluctant to delete data because they don't know who it belongs to, or if it will be needed down the line.
"They keep it, and, primarily because they're not sure, or who owns the data hasn't authorized it, deleting data isn't really an option for most," Kerns explained. "And so, we get all this data in, and we just keep it, and it costs us."
If you can't delete data, and it keeps growing consistently, the situation might seem hopeless. However, Kerns outlined a few best practices. First, deal with it early: as capacity grows, problems will develop, so face issues from the start.
Having a strategy in place will help you deal with concerns that arise from the growth of unstructured data, and protect you from surprises. Adopting a scale-up or scale-out approach as more capacity is needed, and taking advantage of practices such as external tiering, can help with both capacity and performance.
Transcript - How to survive the growth of unstructured data and capacity limits
Editor's note: The following is a transcript of a video clip from Randy Kerns' presentation, "Scale-up, scale-out: A survival guide for coping with capacity growth," at Storage Decisions 2016 in Chicago. The transcript has been edited for clarity.
How many of you get tired of seeing all the vendors put up these charts about how much data [there is] and how fast it's growing? I get so sick of it. I'm not going to do it again. We've seen them. A lot of people tell me that those numbers don't apply to their environment. Well, a lot of them do, and you've got to think about it.
Almost everybody has some percentage of growth. What we typically see is somewhere between 20% and 35% growth rate, organically. Then, we also see a significant number of our clients that have [the] introduction of some big data projects that came from outside. IT wasn't part of it -- that just got dumped on them -- and so they end up with some huge spike they have to deal with, and that happens at some irregular intervals.
But the issue is that they do have growth of some type. The organic growth being between 20% and 35% is very normal, so that's really what you've got to plan for.
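To see what planning for that organic growth rate implies, here is a quick back-of-the-envelope projection. The 20% and 35% figures come from Kerns' numbers; the starting capacity and five-year horizon are illustrative assumptions, not anything from the presentation:

```python
# Project storage capacity at the organic growth rates Kerns cites
# (20%-35% per year), compounding annually. The 100 TB starting
# point is a hypothetical example.

def project_capacity(start_tb, annual_growth, years):
    """Return projected capacity in TB after compounding annual growth."""
    return start_tb * (1 + annual_growth) ** years

start = 100  # TB today (hypothetical)
for rate in (0.20, 0.35):
    future = project_capacity(start, rate, 5)
    print(f"{rate:.0%} growth: {start} TB -> {future:.0f} TB in 5 years")
# 20% growth: 100 TB -> 249 TB in 5 years
# 35% growth: 100 TB -> 448 TB in 5 years
```

Even at the low end of the range, capacity roughly two-and-a-half times over in five years; at the high end, it more than quadruples, before any "big data project" spikes land on top.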
Almost all the growth is in unstructured data, and unstructured data means data that's in files or object format. Data that's in databases actually grows very slowly, and there are a couple of very good reasons for it. Most structured data, the data that's in databases, is in SQL databases. They get very slow as they get very large, and so people control that much better. They don't dump in a lot of data that shouldn't be there. Meanwhile, we have all these other things generating files besides users -- machine data, log files, etc., are very, very common.
So, when we get this growth, it creates a lot of problems. We have to spend money to acquire storage, and we have a certain amount that has to be primary. Maybe a lot of it doesn't really belong on primary [storage]. Certainly, a lot of the file-based data we get should probably never end up on primary. A lot of times, we put it there, and that's just maybe the practice we have, but its actual location, maybe, should be different from the start.
We may not be as discerning as we should be about what that information is, where its value lies and where it should land to start with. Maybe we start with primary, then, over a period of time, we'll have some automated policies that recognize that this data hasn't been accessed in X amount of time, so we move it to another tier.
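The kind of policy Kerns describes, demoting files that haven't been accessed in X amount of time, can be sketched in a few lines. This is an illustrative sketch only, not any vendor's tiering implementation; the 90-day threshold and the directory-per-tier layout are assumptions:

```python
import os
import shutil
import time

DAYS_UNTOUCHED = 90  # hypothetical policy threshold

def demote_cold_files(primary_dir, archive_dir, max_age_days=DAYS_UNTOUCHED):
    """Move files not accessed within max_age_days from the primary
    tier to a cheaper archive tier. Returns the names moved."""
    cutoff = time.time() - max_age_days * 86400
    moved = []
    for name in os.listdir(primary_dir):
        path = os.path.join(primary_dir, name)
        # Compare last-access time against the policy cutoff.
        if os.path.isfile(path) and os.path.getatime(path) < cutoff:
            shutil.move(path, os.path.join(archive_dir, name))
            moved.append(name)
    return moved
```

In practice, real tiering products track access metadata in a catalog rather than relying on file-system atime, which many systems mount with relaxed update semantics, but the decision logic is the same: compare last access against a policy threshold and relocate what falls below it.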
And one of the guys at our company, Evaluator, wrote this article about defensible deletion, and that turned out to be one of the more popular things for people to read because most people in IT won't delete data. They keep it, and, primarily because they're not sure, or who owns the data hasn't authorized it, deleting data isn't really an option for most. And so, we get all this data in, and we just keep it, and it costs us.
Now, here's probably the biggest problem. Let me give you an example. One of the hospitals I was working with ... Their PACS [picture archiving and communication system] gets all these MRIs, and each generation of PACS modality gets more and more dense. They take these images, and they take more and more as time goes on. They basically store this data, and they charge at the time they do that MRI.
This data's stored and, in some cases, by law or by policy, they'll keep that data for 20 years, 30 years or whatever. But the charge is at the time the MRI was taken. The cost for storing that is ongoing. How do they account for the extra cost of taking larger and more images, if you will, for 30 years? Think about that.
Most companies, organizations and, certainly, hospitals haven't come to grips with the lifespan of data, or with how to allocate funds for keeping that information around.