Depending on the analyst you consult, data growth is driving data storage capacity demand within businesses at a rate of 40% to 650% annually. If that strikes you as an extraordinarily wide range for an analyst estimate, it is. And there are two explanations.
First, no one really knows how fast data is growing. Second, capacity demand trends have little to do with actual data growth trends. They are based instead on estimates of how much capacity consumers buy year over year, not on how fast data is growing.
That means planners who want to work out a capacity management strategy are starting with little more than a mandate from management to bend the storage cost curve -- recognition of the fact that storage now accounts for between 33 cents and 70 cents of every dollar spent on IT hardware. The heavy lifting of identifying real capacity requirements, growth drivers, and procedural and technological approaches for reducing capacity demand is entirely on them.
In 2011, Framingham, Mass.-based IDC projected that there were 21.2 exabytes of external storage deployed worldwide. This was used to store not only production data (roughly 55% of which are files, according to IDC), but also data duplicates and dreck. According to IDC, we used about half of our disk to store copies of the data written on the other half. And our reluctance to throw away anything has made our storage infrastructure into something approximating the kitchen junk drawer.
Disk isn't the only storage modality. The industry has defined at least two kinds of disk -- low-capacity, high-speed Tier 1 and lower-cost, high-capacity Tier 2 -- and acknowledges an entirely separate tape tier (Tier 3) used primarily to store backups and archival data.
Recently, with the introduction of flash memory-based storage devices, so-called silicon storage devices, a "new" Tier 0 has been introduced into the storage hierarchy. Technically, silicon storage has always been a part of storage tiering architecture. IBM's hierarchical storage management (HSM) paradigm -- in existence since the earliest days of mainframe computing -- typically included system memory; direct access storage devices (DASDs), which are essentially disk arrays; and tape.
The purpose of multiple storage tiers, and the software functionality inherent in HSM to move data between tiers, was simply to manage storage capacity and cost. The scheme was predicated on data access frequency and data modification frequency characteristics. Data that was accessed and updated with high frequency used silicon storage. However, this storage was extremely costly and limited, so data was migrated as quickly as possible from Tier 0 to Tier 1 (DASDs), where access and update could be accommodated at fairly high rates. In a classic HSM strategy -- articulated when DASDs were the size of refrigerators, provided limited capacity, and required their own buildings (DASD farms) to handle power and HVAC requirements -- the pressure was on to migrate data as quickly as possible from disk to tape, which was the storage capacity tier (then Tier 2) optimized for storing data that was much less frequently accessed or modified.
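The placement rule behind classic HSM, where the most recently used data sits on the fastest tier and idle data drifts toward tape, can be sketched in a few lines. This is an illustration only, not IBM's implementation: the function name choose_tier and both idle-time thresholds are hypothetical, and real policies are tuned per site.

```python
from datetime import datetime, timedelta

# Illustrative thresholds only (assumptions, not values from any HSM product).
TIER0_MAX_IDLE = timedelta(hours=1)   # very hot data stays in silicon
TIER1_MAX_IDLE = timedelta(days=30)   # warm data stays on disk (DASD)


def choose_tier(last_access: datetime, last_modified: datetime, now: datetime) -> int:
    """Return 0 (silicon), 1 (disk) or 2 (tape) based on the most recent activity."""
    idle = now - max(last_access, last_modified)
    if idle <= TIER0_MAX_IDLE:
        return 0
    if idle <= TIER1_MAX_IDLE:
        return 1
    return 2
```

The sketch uses whichever of the two timestamps is newer, so a file that is read often but rarely changed still counts as active.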
Without belaboring the point, tiered architecture and HSM provide a straightforward methodology for capacity management, but one that, unfortunately, did not make the transition into the distributed computing environments deployed in many firms. Part of the reason is historical and technical: Early distributed computing environments relied on low-speed LANs to interconnect minicomputers (servers) and microcomputers (PCs), and those LANs could not handle the burden of HSM data movements. Moreover, the industry sought to expand disk products to provide specialized capacity storage that would compete with tape. High-capacity, low-cost SATA disk arrays, some featuring "data reduction" value-added software (so-called deduplicating virtual tape library [VTL] appliances), were among the first. They were followed by tiered storage arrays that provided trays of both Tier 1 and Tier 2 disks, as well as HSM software to automatically move data from one tier to the other. Finally, massive arrays of idle disks (MAID) were tested in the market as a new capacity storage tier.
But the cost of specialty disk appliances, especially with the price acceleration generated by value-added software embedded on the array controller, has limited adoption. Where products such as deduplicating VTLs have been adopted, they've mostly been relegated to a niche role -- augmenting rather than replacing tape, which continues to store roughly 80% of the world's data.
What's needed to manage data storage capacity isn't an appliance that crams more data onto the same number of spindles, but a strategy that leverages the right storage tier to store the right data. Instead of focusing narrowly on capacity allocation efficiency -- which is the point of data reduction technologies such as compression and deduplication -- planners need to consider capacity utilization efficiency. That's a fancy way of saying that an effective capacity management strategy includes not only tactical space management (deduplication and compression), but strategic data management (archiving, for example).
The process begins by analyzing your situation. Using a storage management reporting tool such as SolarWinds' Storage Manager (formerly Tek Tools Storage Profiler), you can run a report that identifies files that haven't been accessed or modified in the last 30, 60 or 90 days. Sorting these files by their owners (also in the file metadata) will provide a way to begin a dialog with the user (or his or her manager) who owns the files so that those files can be moved into an archive or deleted.
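For readers without a commercial reporting tool, the same kind of report can be approximated with a short script. The sketch below is not how Storage Manager works; it is a standard-library stand-in that walks a directory tree and groups files idle for a given number of days by owner ID. The function name stale_files_by_owner and its parameters are hypothetical, and it assumes the file system records access times reliably (volumes mounted with noatime do not).

```python
import os
import time
from collections import defaultdict

DAY = 86400  # seconds in a day


def stale_files_by_owner(root, min_idle_days=90, now=None):
    """Group files under `root` that were neither accessed nor modified
    in the last `min_idle_days` days, keyed by owner UID."""
    now = time.time() if now is None else now
    cutoff = now - min_idle_days * DAY
    report = defaultdict(list)
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                st = os.stat(path, follow_symlinks=False)
            except OSError:
                continue  # file vanished or is unreadable; skip it
            # A file is stale only if BOTH timestamps are older than the cutoff.
            if max(st.st_atime, st.st_mtime) < cutoff:
                report[st.st_uid].append((path, st.st_size))
    return dict(report)
```

Running it three times with min_idle_days set to 30, 60 and 90 gives the three views described above. Summing the sizes per owner shows where the conversation about archiving or deletion is likely to pay off first.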
As much as 40% of the data currently stored on disk could be more cost-effectively hosted in an archive platform, whether disk-based, tape-based or in a cloud service. Archiving that data and returning 40% of your capacity to productive use may save enough to pay for your entire data storage capacity management strategy going forward.
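The arithmetic behind that claim is simple enough to check against your own numbers. The helper below is a rough sketch: the function name archive_savings is hypothetical, and every input other than the 40% figure cited above is a placeholder you would replace with your own capacity and per-terabyte costs.

```python
def archive_savings(total_tb, archivable_fraction, primary_cost_per_tb, archive_cost_per_tb):
    """Estimate capacity returned to productive use and the net cost difference
    of hosting that data on an archive platform instead of primary disk."""
    reclaimed_tb = total_tb * archivable_fraction
    net_saving = reclaimed_tb * (primary_cost_per_tb - archive_cost_per_tb)
    return reclaimed_tb, net_saving
```

As a purely hypothetical example, 500 TB of primary disk at $1,000 per TB, with 40% moved to an archive costing $200 per TB, would free about 200 TB and avoid roughly $160,000 in primary storage cost.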
1. Making the most of disk storage: How not to waste space
According to Jon Toigo, CEO and managing principal of Toigo Partners International, and chairman of the Data Management Institute, one of the biggest problems with data storage capacity is wasted space. This is largely due to enterprises storing stale data, such as duplicates, data with low re-reference rates or orphan data. Furthermore, many enterprises don't have a method in place to determine which data can be deleted or moved to an archive.
2. Refuting common capacity containment methods
The idea behind thin provisioning is that storage admins know well in advance when they'll have to add more capacity to an environment, and can avoid purchasing excess capacity by waiting until it's actually needed. The problem, Toigo says, is that thin provisioning does nothing to reduce capacity directly; rather, it alleviates the cost of additional disk arrays.
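A toy model makes the distinction concrete. In the sketch below, which is an illustration rather than any vendor's implementation, volumes are promised more space than physically exists and physical capacity is consumed only when data is written. The class name ThinPool, its methods and the 80% alert threshold are all hypothetical.

```python
class ThinPool:
    """Toy thin-provisioned pool: space is promised up front, consumed on write."""

    def __init__(self, physical_gb, alert_threshold=0.8):
        self.physical_gb = physical_gb
        self.alert_threshold = alert_threshold
        self.provisioned = {}  # volume -> size promised to the host
        self.written = {}      # volume -> space actually consumed

    def create_volume(self, name, size_gb):
        # No physical space is reserved; only the promise is recorded.
        self.provisioned[name] = size_gb
        self.written[name] = 0

    def write(self, name, gb):
        if self.written[name] + gb > self.provisioned[name]:
            raise ValueError("write exceeds the volume's provisioned size")
        if self.used_gb() + gb > self.physical_gb:
            raise RuntimeError("pool exhausted: physical capacity must be added")
        self.written[name] += gb

    def used_gb(self):
        return sum(self.written.values())

    def needs_capacity(self):
        # Early warning so disk can be bought before the pool runs out.
        return self.used_gb() / self.physical_gb >= self.alert_threshold

    def oversubscription(self):
        return sum(self.provisioned.values()) / self.physical_gb
```

Note that nothing in the model shrinks the data itself: every gigabyte written still lands on physical disk. The pool only postpones the purchase until needs_capacity() fires.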
3. Dedupe and compression: Some capacity fixes only work short-term
Deduplication and compression are seen as surefire ways to reduce data storage capacity, and there's no denying they can make a difference -- to an extent. Deduplication can eliminate copies of data that aren't needed, but arrays with a deduplication process built in are often more expensive. Compression allows the same amount of data to be stored on a smaller amount of capacity, but it's not clear whether the capacity savings are worth the price tag.
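To get a feel for what the two techniques do to a data set, the following sketch applies fixed-size chunk deduplication and then compresses each unique chunk. It is a simplified illustration, not how any particular array or VTL works; the function name dedupe_and_compress and the 4 KB chunk size are assumptions, and commercial products often use variable-size chunking and other compressors.

```python
import hashlib
import zlib


def dedupe_and_compress(blobs, chunk_size=4096):
    """Return (logical_bytes, unique_bytes, compressed_bytes) for a list of bytes objects."""
    store = {}  # chunk digest -> compressed chunk
    logical = unique = 0
    for blob in blobs:
        for i in range(0, len(blob), chunk_size):
            chunk = blob[i:i + chunk_size]
            logical += len(chunk)
            digest = hashlib.sha256(chunk).digest()
            if digest not in store:
                # First time this content is seen: keep one compressed copy.
                unique += len(chunk)
                store[digest] = zlib.compress(chunk)
    compressed = sum(len(c) for c in store.values())
    return logical, unique, compressed
```

Comparing the three numbers on a sample of your own files shows how much of the reduction comes from removing copies and how much from compression, which is the figure to weigh against the price premium of an array that does this in its controller.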
4. How the right information lifecycle strategy can save capacity
Data lifecycle management (DLM), also referred to as information lifecycle management (ILM), isn't a new concept, but it's often overlooked when it comes to keeping data storage capacity requirements under control. Creating policies that automate the movement of data is the basis for DLM. For example, all data created by a certain department within an organization can be tagged as such in the metadata, and from there can be directed to specific storage. This is a big benefit to storage professionals when it comes to determining which data is stored where, and controlling the amount of capacity on a given array.
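A policy of this kind can be expressed as a small lookup table keyed on the department tag. The sketch below is hypothetical throughout: the department names, storage target names, idle-day limits and the placement function are invented for illustration and do not come from any DLM product.

```python
from dataclasses import dataclass


@dataclass
class FileRecord:
    path: str
    department: str          # tag carried in the file's metadata
    days_since_access: int


# Hypothetical policy table:
# department -> (primary target, archive target, archive after N idle days)
POLICIES = {
    "finance": ("tier1-array", "tape-archive", 90),
    "engineering": ("tier2-array", "cloud-archive", 180),
}
DEFAULT_POLICY = ("tier2-array", "tape-archive", 365)


def placement(record: FileRecord) -> str:
    """Return the storage target a record should live on under the policy table."""
    primary, archive, max_idle = POLICIES.get(record.department, DEFAULT_POLICY)
    return archive if record.days_since_access > max_idle else primary
```

Because the rules live in one table, storage staff can see at a glance which data lands where and adjust a department's archive window without touching the rest of the policy.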