This article can also be found in the Premium Editorial Download "Storage magazine: Improve data storage efficiency with archiving technology."
A core best practice for effective storage management is archiving technology that frees up storage resources, improves performance and protects data that must be retained.
Data archiving has typically fallen into the realm of storage infrastructure, largely by default. That made sense originally, as the first use case was clearing older data from expensive disk. Usually, that meant moving data to tape and then forgetting about it. Seven years was the typical retention limit, but recovery was often problematic: tapes degraded with time, applications became obsolete and data formats changed. Organizations struggled to respond to court-ordered discovery motions, having to retrieve, recover and read potentially hundreds or thousands of tapes to find the slice of data demanded, usually within a short period of time.
Archiving technology's dual role
Even now, the sheer volume of data drives an economic incentive to move older data to lower cost media, but archiving is increasingly bifurcating into a storage management task and a business-driven application. As a business application, the primary use case is data retention for regulatory reasons; move-and-forget is not enough. Data recovery in some form is a virtual certainty and that form is unpredictable, given the whims of regulators and the courts. Moreover, some data, such as health care information, may need to be retained and recovered decades from now. Email, SharePoint and other file-system data are problem areas for almost all organizations because they consume inordinate amounts of capacity and may be subject to legal holds.
Because of these new requirements, IT managers need to take a collaborative approach and work with the business units and legal department on archive implementations. IT staff can hardly be expected to know what legal policies are needed, but they should know the technological options that will help match the archive implementation to business requirements. We'll run down some archiving technology options so storage managers can get a sense of the breadth of alternatives in the marketplace as well as the capabilities they should be looking for.
Archive solution essentials
- Data classification
- Data mover
- Data indexing
- Discovery tool
Nice to haves
- Data destruction
- Single-instance store
- Integrity checking
- Immutable (as required)
Archivers morphing into management apps
As the purpose for archiving data has shifted from storage management to include data management, archiving solutions have taken on the characteristics of broader data management applications. Consequently, the key user constituencies have also shifted. Rather than storage managers alone, key users of archive applications include CIOs, compliance officers and attorneys. User concentrations have skewed toward more heavily regulated industries, especially finance and health care.
Archive solutions range from general purpose to specialized. However, most will have a set of features that classify, move, index and discover data. Many will also include features that facilitate long-term data recovery, data destruction, data deduplication and compression, single-instance storage and integrity checking. Which combination of these features is included may be determined by the target user and use case.
Because early archiving technology efforts were limited to moving backup tapes offsite, organizations should not make the mistake of treating a new archive application as a "green field" opportunity. In most cases, years of legacy tapes remain in the vault, all with their own retention and expiration policies. Storage managers need to ensure that backup policies don't conflict with archiving policies. Destroying data prematurely could put the organization at risk of noncompliance with court orders. On the other hand, data retained unnecessarily is fair game for legal discovery, even though it's not strictly required for a given order. Either way, the result can cost organizations staggering sums in penalties or awards.
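The retention trade-off described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `ArchivedItem` record, its fields and the `disposition` function are hypothetical names, not part of any product mentioned in this article, and a real policy engine would take its rules from the legal and compliance teams.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass
class ArchivedItem:
    """Hypothetical record for one archived object (illustrative only)."""
    name: str
    archived_on: date
    retention_days: int
    legal_hold: bool = False


def disposition(item: ArchivedItem, today: date) -> str:
    """Return 'hold', 'retain' or 'delete' for a single item."""
    if item.legal_hold:
        # A legal hold overrides normal expiration: never delete held data.
        return "hold"
    expires_on = item.archived_on + timedelta(days=item.retention_days)
    # Past its retention period, the item becomes a deletion candidate,
    # which avoids keeping discoverable data longer than required.
    return "delete" if today >= expires_on else "retain"
```

The point of the sketch is the ordering of the checks: the legal hold is evaluated before the expiration date, so an expired item under hold is never destroyed prematurely.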
Horizontal application providers
CommVault Systems Inc. is an example of a company that targets both backup and archive from a single point. The company's Simpana OnePass feature is designed to scan, copy, index, store, report on and create a synthetic full in a single pass. The data is moved to the Simpana ContentStore, which is the virtual back-end repository for all backup and archive metadata. ContentStore facilitates a global view of all data, where it can be searched, discovered and deduplicated. Policies regarding retention, legal holds and "defensible deletion" can be applied to this single repository. CommVault also positions this solution for big data applications regardless of data source. However, social media, instant messages (IM), blogs and the like are not presently within the scope of the product.
Two products specifically targeted at the email and file system problem are EMC Corp.'s SourceOne archive suite and Symantec Corp.'s Enterprise Vault, though both also provide litigation support. SourceOne includes components for Microsoft Exchange, IBM Lotus Notes, SharePoint and file systems. In addition, the product includes an Email Supervisor that monitors inbound and outbound email for compliance with policies; this supervisor facilitates Financial Industry Regulatory Authority regulatory compliance. The SourceOne Discovery Manager searches the SourceOne repository for relevant information and can output the data into an Electronic Discovery Reference Model (EDRM) XML format.
SourceOne is built upon EMC's Data Domain platform, which the company indicates is evolving into a "protection storage platform" for consolidated backup and archive. While this doesn't currently imply a melding of the backup catalog and archive metadata, it does yield the benefits of deduplication and a single physical target for both purposes.
Symantec Enterprise Vault is designed for both storage optimization and e-discovery. Symantec indicates that e-discovery is now the predominant use case in the U.S., although usage remains more mixed between optimization and discovery in Europe. Although single-instance storage was removed as a feature of Exchange 2010, Enterprise Vault still deduplicates Exchange data. The biggest benefit of deduplication shows up in backup and archive operations, in both physical space savings and reduced operational time. Enterprise Vault dedupes across all data sources and does so upon ingestion into the archive. This includes not only email and SharePoint files, but also social media. By virtue of Symantec's acquisition of Clearwell Systems Inc., Enterprise Vault includes a self-service e-discovery capability suitable for attorneys and other non-IT users. The result is removal of IT from the discovery process and lower attorney costs.
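To make the idea of single-instance storage at ingestion concrete, here is a minimal Python sketch. It is a generic content-hash approach, not a description of how Enterprise Vault or any other product works internally; the `SingleInstanceStore` class and its method names are assumptions made for illustration.

```python
import hashlib


class SingleInstanceStore:
    """Toy in-memory store that keeps one physical copy per unique content."""

    def __init__(self):
        self._blobs = {}  # SHA-256 digest -> content bytes
        self._refs = {}   # logical item name -> digest

    def ingest(self, name: str, content: bytes) -> bool:
        """Archive an item; return True only if a new physical copy was written."""
        digest = hashlib.sha256(content).hexdigest()
        is_new = digest not in self._blobs
        if is_new:
            self._blobs[digest] = content
        # Every logical item just points at the shared copy.
        self._refs[name] = digest
        return is_new

    def read(self, name: str) -> bytes:
        return self._blobs[self._refs[name]]

    def physical_bytes(self) -> int:
        return sum(len(blob) for blob in self._blobs.values())

    def logical_bytes(self) -> int:
        return sum(len(self._blobs[d]) for d in self._refs.values())
```

If the same attachment arrives from two mailboxes, the second `ingest` call writes nothing new, so `physical_bytes()` stays at one copy while `logical_bytes()` counts both. That gap is the space, and by extension the backup time, that deduplication saves.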
Specialized application providers
Two of the more specialized archiving providers are Patrina Corp. and Hewlett-Packard (HP) Co.'s Autonomy. Patrina focuses on the financial industry and goes so far as to be located on Wall Street, near the epicenter of its key market. Patrina offers a Software-as-a-Service-based records management solution that encompasses typical unstructured data and email as well as social media, blogs, IM and chats. Patrina differentiates its offering largely through customization, and estimates that 90% of its users have some amount of customization.
For Patrina users, the key is being able to discover and aggregate data. Because of the unforeseen nature of the slices of data requested by regulators, not to mention changing regulations, Patrina offers both self-service data discovery and support teams to assist its customers. Patrina uses a Windows platform with data stored on Windows-readable WORM CDs, ensuring long-term readability of media.
HP's Autonomy product is also primarily focused on the compliance market for regulated industries with an emphasis on archiving only data that is truly necessary. This means having robust policies that govern the data throughout its lifecycle, including data deletion, or "deletion by design" as HP terms it. HP uses Autonomy's analytics engine as a key differentiator. This analytics engine is designed to manage data using pattern matching and context to filter out "noise" and provide extensive information on unstructured data. It can also simultaneously search text, video and audio files. In addition to unstructured data archives, HP is noticing a trend toward archiving structured data, such as when an application is retired. Autonomy indexes across all data sources while applying both compression and single-instance storage.
Storage platform considerations
The storage platform attributes needed to support archive operations must include scalability, data integrity and security. Security can include both encryption and immutability. Although some archive applications perform single-instance storage and deduplication at the software level, companies like EMC take advantage of the native capabilities of the Data Domain hardware.
In addition to its Data Domain platform, EMC positions its Isilon line of scale-out arrays as an archive platform. Isilon arrays are designed to accommodate hundreds of terabytes of data, so searching for relevant data should be simplified by using a single platform. Although Isilon can certainly support traditional archive workloads, the company positions it specifically for big data and large files, such as geophysical and medical imaging files. Additionally, Isilon includes an InsightIQ management platform to give storage administrators reports of trends, performance attributes and other information to optimize the system. EMC also has its Centera content-addressable storage for immutable requirements and Atmos for geographically distributed cloud environments.
HP matches its Autonomy app with the HP StoreAll family of arrays. One unique aspect of StoreAll is its Constant Validation feature to perform constant integrity checks. Given the massive scale anticipated by HP, the company feels that checking data integrity after writing is essential to proactively avoid problems and ensure the files have not been unexpectedly altered. When combined with Autonomy, HP StoreAll's Express Query feature can scan metadata, not the actual files, at a rate the firm claims is 100,000 times faster than traditional file scans.
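Post-write integrity checking of the kind described above generally comes down to recording a fingerprint when data is written and recomputing it later. The Python sketch below shows that general technique only; it is not HP's Constant Validation implementation, and the function names and in-memory dictionaries are hypothetical stand-ins for a real archive and its metadata catalog.

```python
import hashlib


def fingerprint(data: bytes) -> str:
    """Compute a SHA-256 fingerprint of an object's content."""
    return hashlib.sha256(data).hexdigest()


def find_altered(stored: dict, recorded: dict) -> list:
    """Return the names of objects whose current content no longer
    matches the fingerprint recorded at write time.

    stored:   name -> current content bytes (stand-in for the archive)
    recorded: name -> fingerprint captured when the object was written
    """
    return sorted(
        name
        for name, data in stored.items()
        if fingerprint(data) != recorded.get(name)
    )
```

An object with no recorded fingerprint is reported as altered as well, which is the conservative choice for an archive that must prove its contents are unchanged.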
Tape still plays a part in archiving
No discussion of archiving technology platforms would be complete without mentioning tape. In this regard, the LTO Consortium's Linear Tape File System (LTFS) transforms tape from passive media into an integral part of an archive offering. This file system spans both near-line tape and disk. As promoted by the Active Archive Alliance industry consortium, archiving combined with automated tiering software allows data to be moved to the lowest cost media automatically. Tape remains the lowest cost media for long-term storage, and LTFS makes it easier to integrate into archiving systems because it presents a familiar file system. Cleversafe Inc. (scale-out storage), HP, Scality (large-scale unstructured storage) and XenData Ltd. (video archive) are among the companies that have recently joined the alliance. Spectra Logic Corp., a traditional tape vendor and Active Archive Alliance member, continues to expand its use of "archive-grade" disk to front-end tape and facilitate better performance and integrity checking.
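Automated tiering of this kind can be pictured as a simple age-based rule. The Python sketch below is an assumption-laden illustration: the tier names, the 30-day and 365-day thresholds and the `choose_tier` function are all hypothetical, and real tiering software would apply whatever policies an organization defines.

```python
from datetime import date

# Hypothetical rules: (maximum days since last access, tier name),
# ordered from most to least expensive media.
TIER_RULES = [
    (30, "primary disk"),
    (365, "archive disk"),
]
FALLBACK_TIER = "tape (LTFS)"  # lowest cost media for everything older


def choose_tier(last_access: date, today: date) -> str:
    """Pick the cheapest tier whose age limit the data still satisfies."""
    age_days = (today - last_access).days
    for max_age, tier in TIER_RULES:
        if age_days <= max_age:
            return tier
    return FALLBACK_TIER
```

Because LTFS exposes tape through a file system interface, the move to the fallback tier can in principle be an ordinary file copy rather than a tape-specific operation.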
About the author:
Phil Goodwin is a storage consultant and freelance writer.