Archiving data is more important than ever in today's environments; it ensures proper data retention, saves space on storage systems and eases the backup burden.
You'd be hard-pressed to find an industry analyst today who isn't predicting unprecedented acceleration in the rate of data storage capacity growth. Whether driven by big data projects, unbridled data replication supporting virtual server failover strategies, the need to store increasingly rich (and large) digital media files, or simply regulations and laws that require retaining data for decades or more, the appetite for storage capacity continues to grow unabated.
Unfortunately, IT budgets don't.
For the last few years, storage array purchases have been consuming an increasing percentage of overall IT hardware spending -- between 33 cents and 70 cents of every dollar of a typical annual IT hardware budget, depending on which industry pundit you consult. Yet for all the capacity growth, those toiling in the IT trenches can't escape the feeling that, despite their efforts, they're merely rearranging the deck chairs on the Titanic. There's simply too much unmanaged data to properly steward across mostly unmanaged storage infrastructures. A data disaster isn't just easy to visualize; it seems well within the realm of the possible and maybe even likely.
Higher capacity disks aren't the solution
The challenge of data growth isn't just about capacity allocation, despite what some vendors say. Until recently, the industry was doubling disk capacity approximately every 18 months, while cutting the cost per gigabyte of disk drives by 50% each year. But those trends have slowed substantially over the past year or two, partly because the industry wants to press flash storage into service, but also because non-trivial technical and financial hurdles have appeared in the methods used to implement new capacity-increasing technologies on current disk drive production lines. For years, we've been hearing about heat-assisted magnetic recording (HAMR), bit-patterned media (BPM) or acoustically assisted magnetic recording -- technologies that can dramatically increase drive capacities. But they have yet to be implemented on commercial disk drives. The latest capacity improvement, which achieves 6 TB of capacity using a helium-filled drive case and extra platters, has elicited little more than a yawn.
Of course, as disk capacity expansion trends have slowed, we've seen the introduction of many software tools for squeezing more data into the same amount of spindle space. Choices include in-line compression and many flavors of data deduplication. But these expensive software features on storage appliances never provided much more than tactical fixes to the problem of spiraling capacity demand. The challenge posed by data growth, most agree, won't be solved by some sort of value-add deus ex machina descending to the stage at the last minute.
To cope with data growth, we need to focus less on improved capacity allocation efficiency and more on capacity utilization efficiency. We need to place the right data on the right storage media. That means data with high rates of re-reference is stored on faster (and usually more expensive) media like solid-state or enterprise-class disk, while data with lower rates of access and modification is hosted on low-cost, high-capacity disk or even lower cost tape.
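As a rough illustration of "the right data on the right media," the placement rule can be written as a lookup from access rate to storage tier. This is a minimal sketch, not a product feature: the tier names, the 30-day window and the thresholds are all hypothetical values chosen for the example.

```python
from dataclasses import dataclass

# Hypothetical tiers, fastest first, each with the minimum number of
# accesses in the last 30 days that justifies keeping data there.
# The thresholds are illustrative assumptions, not recommendations.
TIERS = [
    ("solid_state", 100),
    ("enterprise_disk", 10),
    ("capacity_disk", 1),
    ("tape", 0),
]


@dataclass
class DataSet:
    name: str
    accesses_last_30_days: int


def choose_tier(data_set: DataSet) -> str:
    """Return the fastest tier whose access threshold the data set meets."""
    for tier_name, min_accesses in TIERS:
        if data_set.accesses_last_30_days >= min_accesses:
            return tier_name
    # Fall back to the cheapest tier for any unexpected (negative) count.
    return TIERS[-1][0]


if __name__ == "__main__":
    for ds in (DataSet("orders_db", 250), DataSet("2009_invoices", 0)):
        print(ds.name, "->", choose_tier(ds))
```

In practice the thresholds would come from the data classification work described later, not from fixed constants.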
To do this effectively, data centers need to rediscover an old idea: archive.
Dusting off archive's image
Data archiving suffers from several challenges today. For one, it's too often viewed as an arcane practice, such as backup or hierarchical storage management: busy work that doesn't fit among the innovative projects that are seen as contributing to IT agility or dynamism.
There is some truth to this view. Archive is a maintenance process usually enabled by third-party software that applies retention and migration policies to move data over time from one type of storage to another. Ultimately, most data is headed for cold storage, whether on tape, optical media or capacity disk, where it will likely remain un-accessed until the media itself turns to dust.
While IT operators may see such data placement simply as a way to move inactive data out of production storage to reclaim space, there are usually other valid business reasons for archiving, including intellectual data preservation and compliance regulations. Still, it's difficult to dislodge the view held by many IT operators that data, once archived, has no value in day-to-day operations.
Bottom line: Archiving data is like taking out the trash; a required task but not one that generates a lot of enthusiasm.
Implementing archiving is a step-by-step process
Reinforcing those perceptions is the fact that archive is typically a challenge to implement. Doing it right requires much more than just purchasing data mover software. The heavy lifting in archiving data is the up-front work: analyzing and classifying data, then monitoring data accesses on an ongoing basis to help determine when data should be moved.
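The access-monitoring part of that work can be approximated, at its crudest, by scanning a file tree for files nobody has read in a long time. The sketch below assumes that file system access times are trustworthy, which isn't always so (volumes mounted with options such as noatime don't update them). The function name and the idle threshold are hypothetical.

```python
import time
from pathlib import Path


def archive_candidates(root, max_idle_days, now=None):
    """Return files under root not accessed within max_idle_days.

    Relies on st_atime, so results are only as good as the file
    system's access-time tracking.
    """
    now = time.time() if now is None else now
    cutoff = now - max_idle_days * 86400  # seconds per day
    candidates = []
    for path in Path(root).rglob("*"):
        if path.is_file() and path.stat().st_atime < cutoff:
            candidates.append(path)
    return sorted(candidates)


if __name__ == "__main__":
    # Example: list files under the current directory idle for a year.
    for path in archive_candidates(".", max_idle_days=365):
        print(path)
```

A real archive product would track accesses continuously and apply policy by data class. A one-off scan like this is only a starting point for sizing the problem.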
In addition to classifying data, storage platforms need to be analyzed and classed based on criteria such as performance, capacity, access speed and cost. This analysis helps planners to create a storage pecking order through which data is migrated.
Over time, data is typically moved from fast disk to capacity disk, and then down to tape or optical technology for long-term storage. However, according to analysts, disk has been pressed into service as a long-term or deep archive platform despite the poor economics of using disk in such a role. Recently, Stamford, Conn.-based Gartner Inc. produced a Magic Quadrant describing the enterprise information archiving market that envisioned the archive platform as a deduplicating disk array: two technologies, disk and deduplication, that are anathema to an archiving traditionalist.
With data and storage targets classified, next comes the task of instrumenting the storage infrastructure for policy-based data migration. This is usually a role for archive software, often instantiated on its own server appliance. CommVault, IBM, Symantec and others have data management software products that include archive functionality, but at the price of committing your data assets to their proprietary data management schemes.
Some planners simply use their disaster recovery data backup software as an archive utility, since it provides a container for data and a time stamp identifying the vintage of the container's contents. However, such an approach has been widely criticized. Backup differs significantly from archive, especially in the key areas of granular data selection and presentation, and support for data class-centric policies and triggers (events for moving data around infrastructure). The importance of such differences often becomes an issue only when archives need to be accessed and used efficiently as part of litigation discovery or to support research into historical business activity.
Meanwhile, the quest continues for a simpler and more open long-term data archiving scheme. Mortgage banking, media and entertainment (M&E), healthcare imaging and some governmental research projects have become the "laboratories" where archive definitions and processes are being refined. In most firms, planners simply want the archive function to be a "fire-and-forget" process that operates in the background, is self-maintaining and capable of providing periodic status reports. But in businesses where data itself is the business, archive has taken on the more important role of providing a low-cost, read-optimized storage platform for active but unchanging data.
It's in these crucibles of data asset preservation -- such as video and audio post-production shops, and broadcast networks that store their products as digital information -- that real advances have been made in archive technology. Such advances have included improvements in and diversification of the media for archival storage, the future-proofing of data formats and the delineation of different kinds of archive practices.
Improved archive media led by resurgent tape
In terms of archive media, tape is re-emerging as the sine qua non of archive, dispelling concerns regarding its resiliency, durability and capacity that were raised over the past decade. With a media life expectancy of more than 30 years, with new Barium Ferrite (BaFe) media coatings that will soon enable the storage of 32 TB of data uncompressed on a single LTO cartridge, and with data streaming rates in excess of 250 MBps (exceeding the capabilities of most disk storage platforms), tape is making a compelling case for its own engineering excellence.
Put simply, tape blows away disk-based archive from the standpoint of capacity, speeds and feeds, and cost. For disk-based storage to get near the capacity of tape, even with technologies such as deduplication and compression, a much more costly and complicated platform is required. Plus, technologies for squeezing more data onto the same amount of disk capacity introduce an unwanted variable into archival repositories: proprietary data container formats that may not withstand the test of time.
Important innovations in tape technology have lately included the Linear Tape File System (LTFS), a file system for tape created by IBM. LTFS facilitates the use of a tape library as a very large capacity network-attached storage system. Data can be written to and retrieved from an active archive repository leveraging LTFS-formatted tape, which is presented simply and affordably as just another HTTP, NFS or CIFS/SMB file share, without requiring third-party archive software. For those who don't want to cobble together their own LTFS repository, vendors are offering preconfigured LTFS "heads" such as Crossroads Systems' StrongBox that work with any tape library back-end.
Active archive via LTFS represents an interesting mix of traditional file storage and archive concepts, but it's also limited in one way. LTFS is optimized for "long-block" files such as video, audio, or rich telemetry or visualization output, but the technology isn't well suited for use with many shorter length, small-byte user files.
That restriction is a blessing in some use cases and accounts for why LTFS is so appealing to oil and gas exploration, human genome research, pharmaceutical research, and motion picture and television production -- all of which tend to work with long-block files. In those vertical markets, a storage medium capable of delivering fast and consistent read performance is preferred. With LTFS, long files, once accessed and started, stream to the requesting user in a manner that's more consistent and jitter-free than is the case when the data is hosted on spinning disk.
Cost and energy consumption are also important factors in active archive. With current tape capacities, petabytes of stored files can be stood up in a platform that occupies only a couple of raised floor tiles and consumes only a few light bulbs' worth of electricity.
Object storage well positioned for archive
Not all archival data is long-block data or even based in hierarchical file systems. Web data, for example, tends to use a flat-file system in which a single object such as a page layout may comprise many small data components.
LTFS, at first glance, would make for a poor solution for archiving large collections of small files. This has kept disk-based platforms in service for archive applications despite their performance and cost-of-ownership restrictions. Last year, the Information Storage Industry Consortium (INSIC) joined The Clipper Group in evaluating the cost of ownership of archive repositories using disk-based platforms and tape-based platforms. Comparing the five-year total cost of ownership (TCO) for a 500 TB disk archive vs. the same capacity tape platform, INSIC found that both acquisition costs and power costs differed significantly. The tape platform consumed approximately $4,500 in utility power compared to $110,000 for the disk platform, and its $150,000 acquisition cost was a fraction of the $1.3 million price tag for disk. The Clipper Group performed a 12-year TCO comparison and discovered that disk was 500 times more expensive than tape based on energy costs alone.
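To put the INSIC figures side by side, the arithmetic can be worked through in a few lines. The sketch uses only the four dollar amounts and the 500 TB capacity quoted above. It assumes the power figures cover the same five-year period as the acquisition figures, and it adds up only those two line items, so the subtotal is not a full TCO.

```python
CAPACITY_TB = 500

# Dollar figures as quoted from the INSIC comparison.
COSTS = {
    "tape": {"acquisition": 150_000, "power": 4_500},
    "disk": {"acquisition": 1_300_000, "power": 110_000},
}


def subtotal(platform: str) -> int:
    """Acquisition plus power only; other TCO items are not included."""
    return sum(COSTS[platform].values())


def disk_to_tape_ratio(item: str) -> float:
    """How many times more disk costs than tape for one line item."""
    return COSTS["disk"][item] / COSTS["tape"][item]


if __name__ == "__main__":
    for item in ("acquisition", "power"):
        print(f"{item}: disk costs {disk_to_tape_ratio(item):.1f}x tape")
    for platform in ("tape", "disk"):
        per_tb = subtotal(platform) / CAPACITY_TB
        print(f"{platform}: ${subtotal(platform):,} total, ${per_tb:,.0f} per TB")
```

On these numbers, disk costs roughly nine times as much to acquire and about 24 times as much to power.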
However, a new technology has been introduced by Spectra Logic that promotes the use of LTFS even with archives comprising a mix of short and longer block files. Spectra Logic's BlackPearl server is another LTFS head end but with a twist: It supports object storage. BlackPearl works in conjunction with a new protocol from the vendor called Deep Simple Storage Service (DS3). The name may be evocative of another protocol, Amazon Web Services' Simple Storage Service (S3) protocol, used to transport files into the Amazon storage cloud … at least, that's what Spectra Logic is banking on. DS3 leverages Amazon S3 concepts but adds important innovations that are required for archive, namely BULK GETs and BULK PUTs that move many file or data objects at once from application workflows to the BlackPearl server. Once at the server, these data objects are moved collectively onto LTFS-formatted tape as one large block file structure.
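The idea behind a bulk operation, gathering many small objects and handing them to tape as one large sequential structure, can be illustrated with nothing more than a tar container. To be clear, this is an analogy only: it doesn't use or reproduce the DS3 protocol or BlackPearl's on-tape layout, and the function names are hypothetical.

```python
import io
import tarfile


def bundle_objects(objects):
    """Pack a mapping of name -> bytes into a single in-memory tar blob.

    Stands in for the "many small objects become one large block" idea;
    it is not the DS3 wire format.
    """
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w") as tar:
        for name, payload in objects.items():
            info = tarfile.TarInfo(name=name)
            info.size = len(payload)
            tar.addfile(info, io.BytesIO(payload))
    return buffer.getvalue()


def unbundle_objects(blob):
    """Recover the name -> bytes mapping from a blob made by bundle_objects."""
    with tarfile.open(fileobj=io.BytesIO(blob), mode="r") as tar:
        return {
            member.name: tar.extractfile(member).read()
            for member in tar.getmembers()
            if member.isfile()
        }


if __name__ == "__main__":
    page_parts = {"layout.html": b"<html></html>", "logo.png": b"\x89PNG"}
    blob = bundle_objects(page_parts)
    print(len(page_parts), "objects bundled into", len(blob), "bytes")
```

The payoff is that tape sees one long streaming write instead of many tiny ones, which is the access pattern tape handles best.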
Spectra's object storage approach is gaining considerable interest not only in media and broadcast markets where applications used to edit and store video and audio avail themselves of object content storage output, but in high-performance and big data computing environments. The first "client" developed by Spectra interfaces BlackPearl and DS3 with Hadoop architecture, while other clients are being developed to connect DS3 to application workflows created by specialty software in the M&E space.
With each new client rolled out to support object content, archive moves further away from proprietary software and file-system-oriented archive and more toward open archive approaches that will enable simpler integration of archive and data management with application workflow. In a few months, we may even be talking about flape, a storage architecture being discussed by engineers at some prominent storage vendors today in which data is written concurrently to flash and tape (flape, get it?). When accesses and modifications to data objects and files fall off, the data is erased from the flash components, making them available for new writes. However, the tape archive copy continues to be available for as long as retention rules require.
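Since flape is still a talking point, not a shipping product, any code can only model the behavior described here: write to both tiers at once, drop the flash copy when accesses fall off, and keep the tape copy. The toy model below does that with two dictionaries. The class name, the idle limit and the use of a simple numeric clock are all assumptions made for the example.

```python
class FlapeStore:
    """Toy model of concurrent flash-plus-tape writes with flash eviction."""

    def __init__(self, idle_limit):
        self.idle_limit = idle_limit  # max idle time before flash eviction
        self.flash = {}  # key -> (data, time of last access)
        self.tape = {}   # key -> data, kept for the retention period

    def write(self, key, data, now):
        # Every write lands on both tiers at the same time.
        self.flash[key] = (data, now)
        self.tape[key] = data

    def read(self, key, now):
        if key in self.flash:
            data, _ = self.flash[key]
            self.flash[key] = (data, now)  # refresh last-access time
            return data
        return self.tape[key]  # slower path: the archive copy

    def evict_idle(self, now):
        """Free flash space held by data nobody has touched recently."""
        idle = [k for k, (_, last) in self.flash.items()
                if now - last > self.idle_limit]
        for key in idle:
            del self.flash[key]
        return idle


if __name__ == "__main__":
    store = FlapeStore(idle_limit=30)
    store.write("clip.mov", b"frames", now=0)
    store.evict_idle(now=45)
    print("in flash:", "clip.mov" in store.flash,
          "| on tape:", "clip.mov" in store.tape)
```

A real design would also have to decide whether a read from tape repopulates flash. This sketch leaves the data on tape only.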
Another alternative implementation of flash-plus-tape-based archive may be as a cloud service, which some market watchers have taken to calling a floud. Tape archiving has already begun to find expression as a cloud service with Fujifilm's Dternity and Permivault services leading the way. Last year's collapse of Nirvanix, a major cloud storage vendor, reinforced the requirement to select a cloud storage provider that uses tape in its operation so you can get your data back in a short timeframe. Nirvanix's customers had little warning and only a short time to retrieve their data from the cloud across a WAN.
Archive might seem stodgy, but there are plenty of important developments happening now. Watch this space.
About the author:
Jon William Toigo is a 30-year IT veteran, CEO and managing principal of Toigo Partners International, and chairman of the Data Management Institute.