Tip 2 in Jon William Toigo's tip series on maximizing efficiency in data storage environments covers capacity utilization efficiency, including data archiving technologies.
The objective of capacity allocation efficiency, as noted in the previous tip, is to realize a “just-in-time” inventory management process, leveraging the decreasing cost-per-GB and increasing capacity trends in disk drives and ensuring the customers your IT shop serves never run out of storage space.
Capacity utilization efficiency means something very different: It's a measure of how well your disk -- or more precisely the data being stored on your disk -- is being “groomed” so that useless data is removed from storage, while other data is hosted on the most appropriate type of media from the standpoint of access characteristics, business value and storage cost.
Disk grooming is a somewhat arcane term that dates back to the first business mainframes. If you're as long in the tooth as I am, you may recall that mainframe storage consisted of three distinct storage tiers: a limited amount of system memory, refrigerator-sized direct-access storage devices (DASD) and magnetic tape. “Grooming” data was a very important matter given the reality of early mainframe operations; data needed to be moved swiftly after creation from limited and expensive memory-based storage onto DASD, then from DASD to tape as quickly as possible to forestall the need to grow the DASD “farm” beyond the confines of a physical equipment room.
Times have changed, of course. But while disk arrays have smaller footprints than 1970s DASD, they're also considerably more expensive to acquire and operate. Plus, external disk arrays consume significant electrical power that's not only increasing in cost, but also increasingly hard to come by in some parts of the U.S.
At the same time, the volume of data being generated and stored is many times greater than was the case in the early days of corporate computing, as witnessed by the 20-plus exabytes of external disk storage worldwide reported by analysts last year. This reflects a largely successful effort by the storage industry to promote a “disk storage for all data storage” model for at least the past decade.
Bottom line: The combination of unmanaged data growth, a preference for “everything on disk” and spiking energy costs has created a perfect storm. Once again, we find ourselves in a space/cost conundrum with respect to data storage. The industry has mostly approached the challenge of ungroomed data growth by promoting higher-capacity disks and offering technologies for compressing or reducing data so that more bits can be stored on the same number of spindles. This is, at best, a stop-gap measure -- a way to grow the “junk drawer” of storage that ultimately impacts the performance and resiliency of the data storage infrastructure.
To tackle the challenge of storage performance and cost, we need to go to the root cause of the dilemma: the data and, perhaps at the same time, our attitudes toward alternative storage media such as tape.
Cleaning up the 'storage junk drawer' with data archiving
An important component of capacity utilization efficiency is the practice of data archiving. Archiving is a simple idea: take older and less frequently accessed data, cull the duplicates and dreck, and place it on media that delivers high capacity and reliability but at the lowest possible cost. But it's a practice that encounters significant resistance from many storage mavens. Typically, no one wants responsibility for building an archive practice or deciding what data is appropriate for the archive. “That involves decision making that's above my pay grade,” is the all-too-common refrain among data storage professionals.
In the absence of any sort of archive strategy, the storage junk drawer continues to grow. An assessment of the data stored on disk infrastructure in more than 3,000 companies shows that only about 30% of data is active (used frequently in support of mission-critical operations); another 40% must be retained but is rarely accessed; and the balance is duplicates, contraband and orphan data whose owner in metadata is no longer part of the company or its infrastructure. A bit of archiving, combined with data hygiene (culling the worthless data), could therefore reclaim up to 70% of every spindle you currently own: the 40% that belongs in an archive plus the roughly 30% that is junk. That would significantly bend the storage capacity demand curve, which is estimated to grow by between 300% and 650% over the next three years, depending on the analyst you consult.
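To make the arithmetic behind that 70% figure concrete, here is a minimal Python sketch. The percentages are the ones quoted above; the function names and the 100 TB array are hypothetical, chosen only for illustration.

```python
# Percentages are the ones quoted in the text; the 100 TB array is hypothetical.
ACTIVE_PCT = 30     # frequently used, mission-critical data
RETAINED_PCT = 40   # must be kept but is rarely accessed (archive candidates)


def capacity_breakdown(active_pct=ACTIVE_PCT, retained_pct=RETAINED_PCT):
    """Return (junk_pct, reclaimable_pct) for a disk estate."""
    junk_pct = 100 - active_pct - retained_pct   # duplicates, contraband, orphans
    reclaimable_pct = retained_pct + junk_pct    # archive it or delete it
    return junk_pct, reclaimable_pct


def reclaimable_tb(raw_tb, reclaimable_pct):
    """Capacity that could be freed on an array of raw_tb terabytes."""
    return raw_tb * reclaimable_pct / 100


junk, reclaimable = capacity_breakdown()
print(f"junk: {junk}%, reclaimable: {reclaimable}%")
print(f"on a 100 TB array: up to {reclaimable_tb(100, reclaimable):.0f} TB")
```

Treat the result as a ceiling rather than a forecast: your own mix of active, retained and junk data will differ, which is exactly why it pays to measure it.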
There are many ways to build an archive and to realize greater capacity utilization efficiency in storage. A basic approach involves leveraging storage resource management (SRM) software to report files that haven’t been accessed in 90 days, sorted by their user/creator/owner. You then distribute the report to each business unit manager so they can take responsibility for getting their staff to identify files ready for archive and junk that can be deleted. Archiving then becomes a much simpler migration of marked files to an inexpensive repository like tape.
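If you don't have an SRM product at hand, the report described above can be approximated with a short script. What follows is a rough Python sketch, not a substitute for SRM software. It assumes a POSIX-style file system where last-access times are actually being recorded (volumes mounted with access-time updates disabled will give misleading results) and it groups files by numeric owner ID rather than by name. The function name and the `/srv/shares` path are hypothetical.

```python
import os
import time
from collections import defaultdict

SECONDS_PER_DAY = 86400


def stale_files_by_owner(root, days=90, now=None):
    """Group files under root that were last accessed more than `days` ago.

    Returns {owner_uid: [(path, size_in_bytes), ...]}.
    """
    now = time.time() if now is None else now
    cutoff = now - days * SECONDS_PER_DAY
    report = defaultdict(list)
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                st = os.stat(path)
            except OSError:
                continue  # file vanished or is unreadable; skip it
            if st.st_atime < cutoff:
                report[st.st_uid].append((path, st.st_size))
    return dict(report)


if __name__ == "__main__":
    # "/srv/shares" is a hypothetical file share; substitute your own.
    for uid, files in sorted(stale_files_by_owner("/srv/shares").items()):
        total = sum(size for _path, size in files)
        print(f"owner uid {uid}: {len(files)} stale files, {total} bytes")
```

The per-owner totals are what you would hand to each business unit manager; the decision about what is archive-worthy and what is junk still belongs to the people who own the data.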
A better approach is to group files at the point of creation by the business process they support. Numerous products work with Active Directory to classify files by creator so appropriate storage rules can be applied. Microsoft Corp. also provides the File Classification Infrastructure (FCI) with Windows Server, an underutilized tool that can be leveraged to categorize file data for archiving.
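The idea of classifying by creator and then applying a storage rule can be sketched in a few lines of Python. This is only an illustration of the logic: it does not call Active Directory or FCI, and every user name, business process, tier name and retention period below is a made-up placeholder. In a real shop the creator-to-process mapping would come from your directory service and the rules from your records managers.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StorageRule:
    tier: str          # where the file should live once it goes inactive
    retain_years: int  # how long it must be kept


# Hypothetical lookup tables for illustration only.
PROCESS_BY_CREATOR = {
    "alice": "finance",
    "bob": "engineering",
}
RULES_BY_PROCESS = {
    "finance": StorageRule(tier="tape-archive", retain_years=7),
    "engineering": StorageRule(tier="capacity-disk", retain_years=3),
}
# Files from unknown creators are flagged for a human to look at.
DEFAULT_RULE = StorageRule(tier="manual-review", retain_years=0)


def classify(creator):
    """Return (business_process, StorageRule) for a file's creator."""
    process = PROCESS_BY_CREATOR.get(creator, "unclassified")
    return process, RULES_BY_PROCESS.get(process, DEFAULT_RULE)


for user in ("alice", "bob", "mallory"):
    process, rule = classify(user)
    print(f"{user}: {process} -> {rule.tier}, keep {rule.retain_years} years")
```

The point of doing this at creation time is that the archive decision is already made by the time the file goes cold; nobody has to be chased for an answer later.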
The emphasis here is on file-based data. This isn't intended to exclude database output, but is instead a recognition of the fact that, since the mid-1980s, the preponderance of data created has taken the form of files. This so-called “unstructured” data actually carries significant structural attributes (file systems and file metadata) that can be extremely useful in building a solid archive process.
Tape technology delivers better density, cost and energy metrics
The ongoing preference for file-based output also opens the door for a rethinking of archive -- not as a rarefied data container that is itself subject to the problems of change over time, but as a simple file system extension that can leverage a well-understood storage meme: the common, network-attached file server. For reasons of cost, the best platform to support such a file server-based archive is tape.
The latest developments in tape technology include partitioned media that enable the indexing of file pointers for rapid access to specific files on the tape media. This has been leveraged by new file systems, including the Linear Tape File System (LTFS), to create a mass storage platform for file data that can be front-ended by a simple server with an NFS or CIFS/SMB mount to deliver “network-attached storage (NAS) on steroids” or tape NAS. Users interface with the platform as they would any other network file store and receive performance in terms of data access that's on par with the Internet -- more than adequate for rarely re-referenced archival data. Plus, for long block files, like surveillance video or human genome output, the speed with which files are delivered to the user is actually superior to the speed of disk-based storage delivery.
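Because a tape NAS presents itself as an ordinary network file share, migrating a marked file can use nothing more exotic than standard file operations. Here is a minimal Python sketch under that assumption. The function name and mount points are hypothetical, and the size comparison is only a crude sanity check; a production migration would verify checksums before deleting anything.

```python
import os
import shutil


def archive_file(src, source_root, archive_root):
    """Copy src into archive_root, preserving its path relative to
    source_root, then remove the original. Returns the new path."""
    rel = os.path.relpath(src, source_root)
    dest = os.path.join(archive_root, rel)
    os.makedirs(os.path.dirname(dest), exist_ok=True)
    shutil.copy2(src, dest)  # copies contents plus timestamps where possible
    # Crude check that the copy landed before the disk copy is deleted.
    if os.path.getsize(dest) != os.path.getsize(src):
        raise OSError(f"size mismatch after copying {src} to {dest}")
    os.remove(src)
    return dest


if __name__ == "__main__":
    # Both paths are hypothetical: a disk-based share and a tape NAS
    # export mounted on the same host.
    source_root = "/srv/shares"
    archive_root = "/mnt/tape_nas"
    marked = os.path.join(source_root, "finance", "2009", "ledger.xls")
    if os.path.exists(marked):
        print("archived to", archive_file(marked, source_root, archive_root))
```

Keeping the relative path intact means users who go looking for an archived file find it in the same folder structure they remember, just under a different share.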
Given the improvements in tape cartridge resiliency (30 years to 70 years), tape capacity (32 TB barium ferrite tape cartridges have been demonstrated by IBM and Fujifilm) and library design, there's no comparable storage platform available in the market that delivers better density, cost or energy metrics. Tape NAS-based archive is another use case for tape systems in companies already equipped with a suitable library, and for those that don’t own a library, it's a pretty good reason to buy one.
At the end of the day, capacity utilization efficiency is the ultimate determinant of overall storage efficiency, whether from an operational or budgetary perspective. Given the growing cost of disk-based storage, storage administrators ignore it at their own peril.
BIO: Jon William Toigo is a 30-year IT veteran, CEO and managing principal of Toigo Partners International, and chairman of the Data Management Institute.