The enterprise archive of tomorrow

Regulatory compliance, litigation and corporate governance, which all mandate that data be secured for years, if not decades, are responsible for creating a new approach to archiving. Archives are no longer just a place to store old data. The enterprise archive of the future will free up massive amounts of primary storage capacity, while streamlining backup processes.

When it comes to archived data, many companies today are facing a crisis. Primary storage is at capacity, while old (static) data on backup tapes is shelved and kept indefinitely, just in case that data needs to be recovered. Ironically, these two information repositories -- primary and backup storage -- are where most archived data currently rests.

But enterprise archiving is undergoing a transformation. Outside factors, such as regulatory compliance, litigation and corporate governance, are all mandating that data be secured for years, if not decades. This is creating a new approach to archiving.

In the new approach to archiving, archives are no longer just a place to store old data. The primary and backup storage repositories contain only the day-to-day operating data a business needs. The static data sits in an easily accessible archive, stored at much lower cost, yet protected, secured and always available.

The enterprise archive of the future will free up massive amounts of primary storage capacity, delaying or even eliminating upgrades. It will also streamline backup processes by reducing the amount of static data repeatedly backed up.

Primary disk storage as the archive

Many companies still use primary disk for storing static or persistent files. This is why demand for primary disk grows by 75% to 100% per year, and why 60% to 80% of the information stored on that disk is static. Although few IT professionals would claim primary storage as their archive, in most cases, the data resides on the most expensive storage in the data center.
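The growth and static-data figures above can be turned into a rough back-of-the-envelope calculation. The sketch below is illustrative only: the 100 TB starting capacity and the function names are assumptions, not figures from any particular data center.

```python
def project_capacity(start_tb, annual_growth, years):
    """Capacity needed at the end of each year under compound growth."""
    return [start_tb * (1 + annual_growth) ** y for y in range(years + 1)]


def reclaimable_tb(total_tb, static_fraction):
    """Capacity freed if all static data moved to an archive."""
    return total_tb * static_fraction


# Hypothetical 100 TB primary array, using the 75%-100% growth range
# and the 60%-80% static-data range quoted above.
low_growth = project_capacity(100, 0.75, 3)
high_growth = project_capacity(100, 1.00, 3)
print([round(tb) for tb in low_growth])
print([round(tb) for tb in high_growth])
print(reclaimable_tb(100, 0.60), reclaimable_tb(100, 0.80))
```

Under these assumptions the array needs five to eight times its original capacity within three years, while more than half of what it holds today could live on cheaper storage.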

Primary disk is expensive compared to archive storage, but it is used for archiving because it appears to be the easiest way to handle it. Just add more disk, expand volumes, and you've got an archive.

Of course, there's more to it. Acquiring the disk, installing it in the storage system, allocating it and actually provisioning those volumes is very time-consuming. Keeping data on primary storage that should be archived places an undue burden on the backup process. Each full backup becomes incrementally more challenging, more time-consuming and more costly. Solving this problem necessitates continual investment in the backup architecture -- faster disk, faster tape drives, faster networks, new backup applications -- creating an endless cycle of upgrades. Lastly, in many cases, primary disk lacks both the ability to scale to petabytes and the option to secure data through encryption and WORM.

Backup storage as an archive

Even companies that leverage the backup process as an archive have concerns about tape's long-term reliability and efficient access. More and more data centers are augmenting tape with disk to attain faster backups or faster recoveries, but more often to achieve greater reliability. Tape-based backup entails numerous problems, and tape media failure is a constant concern in the backup process. Using tape as an archive is even riskier. Tape is no more reliable for storing archive data than backup data, yet the archive data may be more critical to the organization and may need to be stored for many years before recovery is needed.

Most IT professionals would point to backup media (most often, tapes) as their archive, but because they lack confidence in that archive, data also remains on primary storage. This is why disk is being used in the backup process as an intermediate storage area for recovering data. Traditional disk-to-disk technologies can hold backups for only a few days or weeks. Data deduplication has extended the retention of disk-based backups to months. Before deduplication, disk was merely a cache for backup data on its way to tape; now it can serve as a legitimate storage area, though not as an archive without further enhancements to address security, immutability and accessibility. Most disk-to-disk backup systems also lack the ability to scale and to provide high availability, so they are not suitable for archives.

The oft-used approach is to get rid of this old (static) data, protect the data with special value to the organization and retain the information that the organization needs in order to adhere to compliance regulations. In short, archive it. Most users would like to move this data from primary storage, but the traditional archive targets (tape and optical) are hard to manage, unreliable, inefficient and difficult to interact with when rapid retrieval is needed.

You can build a cost-effective disk-based archive by coupling SATA drive technology with data deduplication. A disk archive is easy to access and is often presented as a network mount point. Creating the archive can be as simple as copying or moving data from one network share to the new one.

The challenges with a disk archive are scalability, security and redundancy. The goal of an archive should be to relieve primary storage and streamline the backup process while moving old files or maintaining secure copies of important files. The archive will often become the final resting point, and last known copy, of data that is important to the enterprise. If tape is not suited for the dual role of backup and archiving, neither are disk-to-disk backup solutions.

The enterprise archive of tomorrow must be capable of retaining the last, and only, copy of data for decades, and of doing so in a manner that gives the user 100% confidence that the data will be accurately recoverable the moment it's needed. This requires an enterprise archive built on the capabilities of disk archiving -- easy to access and significantly less expensive than primary storage -- while adding the critical functions of scalability, security and availability.

Requirements for the enterprise archive

With today's data retention requirements at a minimum of five years and 100-year archives now being mentioned, an enterprise archive must scale to multiple petabytes today and to hundreds of petabytes in the next few years. Such scaling will require a grid storage architecture, which has discrete components for capacity and performance called nodes. As more storage is needed, a capacity node is added to the grid. Theoretically, an infinite number of such nodes can be added.

Also important is the ability to scale granularly. With disk drive prices continuing to plummet, more capacity will be delivered for less money. The ability to scale granularly allows for adding only the storage required at that point in time, delaying the next purchase to optimize capacity and cost. This granular scaling will also require that the architecture be able to accommodate mixed nodes. Mixed scaling of nodes allows the use of the largest capacity drives available at the time of capacity expansion.
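To make the idea of a grid built from mixed capacity nodes concrete, here is a minimal sketch. The class names, node names and the "most free space" placement rule are all assumptions chosen for illustration; real grid storage products use their own placement and protection schemes.

```python
from dataclasses import dataclass, field


@dataclass
class CapacityNode:
    name: str
    capacity_tb: float
    used_tb: float = 0.0

    @property
    def free_tb(self) -> float:
        return self.capacity_tb - self.used_tb


@dataclass
class ArchiveGrid:
    nodes: list = field(default_factory=list)

    def add_node(self, node: CapacityNode) -> None:
        """Grow the grid by one node; nodes need not be the same size."""
        self.nodes.append(node)

    @property
    def total_tb(self) -> float:
        return sum(n.capacity_tb for n in self.nodes)

    def store(self, size_tb: float) -> str:
        """Place data on the node with the most free space; return its name."""
        target = max(self.nodes, key=lambda n: n.free_tb, default=None)
        if target is None or target.free_tb < size_tb:
            raise RuntimeError("not enough free space; add a capacity node")
        target.used_tb += size_tb
        return target.name


grid = ArchiveGrid()
grid.add_node(CapacityNode("node-a", capacity_tb=12))  # older, smaller drives
grid.add_node(CapacityNode("node-b", capacity_tb=48))  # newer, larger drives
print(grid.total_tb)
print(grid.store(10))
```

Because each node carries its own capacity, a later purchase can use whatever drive size is current without forcing the earlier nodes to match.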

The longer data is accessible, the more important it is to secure it. Enterprise archives should be able to encrypt data to limit access and allocate portions of their storage to WORM to ensure chain of custody. This would also require flexibility in adapting to changing regulations and guidelines.

Data requiring retention for decades must be recoverable. Such data must be scanned constantly to make sure it has not degraded or is not resting on a failing drive. If a drive failure is occurring, data should be automatically relocated to ensure its recoverability prior to total drive failure.
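The scan-and-relocate behavior described above might look something like the following sketch. It assumes each stored object keeps a SHA-256 digest recorded at ingest and that the system exposes some drive health signal; the dictionary layout, the "failing" status string and the `scrub` function are hypothetical.

```python
import hashlib


def checksum(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


def scrub(objects, drive_health, healthy_drive):
    """Verify every object and move data off drives reported as failing.

    objects: dict name -> {"data": bytes, "digest": str, "drive": str}
    drive_health: dict drive -> "ok" or "failing" (hypothetical signal)
    Returns (corrupted_names, relocated_names).
    """
    corrupted, relocated = [], []
    for name, obj in objects.items():
        if checksum(obj["data"]) != obj["digest"]:
            corrupted.append(name)        # must be repaired from a redundant copy
        elif drive_health.get(obj["drive"]) == "failing":
            obj["drive"] = healthy_drive  # relocate before the drive fails outright
            relocated.append(name)
    return corrupted, relocated


objects = {
    "contract.pdf": {"data": b"signed", "digest": checksum(b"signed"), "drive": "d1"},
    "ledger.csv": {"data": b"bit-rot", "digest": checksum(b"original"), "drive": "d2"},
}
print(scrub(objects, {"d1": "failing", "d2": "ok"}, healthy_drive="d3"))
```

In this example the first object is intact but sits on a failing drive, so it is relocated; the second no longer matches its recorded digest, so it is flagged for repair.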

The physical size of the archive data set (petabytes now, and soon hundreds of petabytes) demands data protection beyond RAID 5 or RAID 6 and simple backup procedures. Data sets of this size, built on hard disk drives of 1 TB and larger, will be nearly impossible to recover after a drive failure using those methods alone. The technology must leverage all the capacity nodes to rebuild failed volumes quickly. With multiple drives and nodes involved, a rebuild not only completes faster but also demands fewer resources from each component, making it seamless and almost invisible to end users.
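A simple calculation shows why spreading a rebuild across many drives matters. The sketch below assumes the work divides evenly across the participating drives and that each sustains the same rebuild rate; the 50 MB/s figure and the count of 20 drives are hypothetical, and real rebuilds also compete with user I/O.

```python
def rebuild_hours(drive_tb, rate_mb_s, parallel_drives=1):
    """Hours needed to re-create the contents of one failed drive.

    Assumes the work splits evenly across `parallel_drives` writers,
    each sustaining `rate_mb_s`.
    """
    total_mb = drive_tb * 1_000_000  # decimal TB -> MB
    seconds = total_mb / (rate_mb_s * parallel_drives)
    return seconds / 3600


single = rebuild_hours(1, 50)                       # one spare drive does all the work
spread = rebuild_hours(1, 50, parallel_drives=20)   # 20 drives share the work
print(round(single, 1), round(spread, 2))
```

Under these assumptions a 1 TB rebuild drops from several hours to well under one, and each participating drive handles only a twentieth of the data.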

Upgrading primary storage to a completely new platform commonly occurs every three to five years. The rip-and-replace-then-migrate approach used in primary storage upgrades will not be effective with an enterprise archive. Because of the size of the data set (100 PB), moving the data from one platform to another could take months, and the likelihood of drive failures during this transfer would be high.
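The "months" estimate is easy to reproduce. The sketch below assumes a sustained end-to-end migration rate of 10 GB/s, an archive built from 1 TB drives and a 3% annualized drive failure rate; all three numbers are hypothetical and chosen only to illustrate the scale of the problem.

```python
def migration_days(dataset_pb, throughput_gb_s):
    """Days needed to copy a data set at a sustained throughput."""
    total_gb = dataset_pb * 1_000_000  # decimal PB -> GB
    return total_gb / throughput_gb_s / 86_400


def expected_drive_failures(drive_count, annual_failure_rate, days):
    """Expected number of drive failures over a period, assuming a constant rate."""
    return drive_count * annual_failure_rate * days / 365


days = migration_days(100, 10)        # 100 PB at an assumed 10 GB/s
drives = 100 * 1_000                  # 100 PB on assumed 1 TB drives
failures = expected_drive_failures(drives, 0.03, days)
print(round(days), round(failures))
```

Even at that generous transfer rate the copy takes close to four months, and with this many drives hundreds of failures would be expected while it runs.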

The archive must utilize a rolling upgrade architecture of mixed nodes, which allows new nodes to be integrated with old ones and, as time dictates, the older nodes to be decommissioned. A rolling upgrade architecture never has a single upgrade point; instead, the upgrade happens gradually and seamlessly as new nodes are added to the archive.
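One way to picture a rolling upgrade is as draining an old node onto the rest of the grid before removing it. The following is a minimal sketch under that assumption; the dictionary layout, node names and the `decommission` function are hypothetical, and a real system would move individual objects and preserve redundancy rather than shift raw terabyte counts.

```python
def decommission(nodes, retiring):
    """Drain `retiring` onto the remaining nodes, then drop it from the grid.

    nodes: dict name -> {"capacity_tb": float, "used_tb": float}
    Data goes to whichever remaining node has the most free space.
    The remaining node records are updated in place and returned.
    """
    remaining = {n: v for n, v in nodes.items() if n != retiring}
    to_move = nodes[retiring]["used_tb"]
    free = sum(v["capacity_tb"] - v["used_tb"] for v in remaining.values())
    if to_move > free:
        raise RuntimeError("add new nodes before retiring old ones")
    while to_move > 0:
        target = max(remaining.values(),
                     key=lambda v: v["capacity_tb"] - v["used_tb"])
        chunk = min(to_move, target["capacity_tb"] - target["used_tb"])
        target["used_tb"] += chunk
        to_move -= chunk
    return remaining


grid = {
    "old-node": {"capacity_tb": 12, "used_tb": 10},
    "new-node-a": {"capacity_tb": 48, "used_tb": 20},
    "new-node-b": {"capacity_tb": 48, "used_tb": 44},
}
grid = decommission(grid, "old-node")
print(sorted(grid), grid["new-node-a"]["used_tb"])
```

The archive stays online throughout: capacity is added first, the old node is emptied in the background, and only then is it removed.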

Delivering an enterprise archive will require new leadership. It is unlikely that the traditional disk and tape vendors will do anything to drive down the cost of capacity significantly or quickly. Look for vendors that deliver a cost-effective grid storage platform while still offering excellent service and support.

About the author: George Crump has had 20 years of experience designing storage solutions for IT decision makers across the U.S. He has held executive positions at Palindrome, Legato Systems Inc. and SANZ. He now heads up his own independent consultancy known as Storage Switzerland, which provides unbiased advice and strategy services to help storage professionals solve their storage management challenges.
