Optimizing data archiving operations is all about pain avoidance in most organizations. Few derive a competitive advantage from their ability to archive data, with the possible exception of companies that can dip into research-oriented repositories and pull out some useful information. Nevertheless, data archiving is an important competence for nearly all IT organizations, as a failure in this discipline can be quite painful: intrusive, time-consuming and expensive data discovery searches, legal sanctions, fines and the like.
Thus, the über use case for archiving is regulatory compliance. One might say there are two kinds of storage managers: those who have spent evenings and weekends assisting in a data discovery operation, and those who will.
Traditional data archives are little more than big data dumps, usually stored on thousands or tens of thousands of tapes stashed in a third-party vault. Recalling, reading and sifting through hundreds or thousands of tapes to ensure an exhaustive search takes time, not to mention pain, and is no longer a best practice or even acceptable for most discovery orders. IT organizations need facilities that provide faster access to relevant data, while ideally allowing them to relinquish the role of the middleman in the data retrieval process.
Archive needs: Four questions to consider
1. Service-level agreement (SLA): What is the required SLA for data recovery, and does it vary by application?
2. Discoverability: What tools are available to find the needed data, and can users run searches on a self-service basis?
3. Regulatory compliance: Does the archived data have special regulatory requirements, such as retention requirements, destruction requirements or immutability?
4. Application integration: To what degree does the application need to know about archived information, thus requiring an application-specific solution?
Cost is always a major consideration with any solution, and even more so when no competitive advantage can be gained. It would seem then that tape would continue to dominate as an archive medium, as it still boasts the lowest cost per stored gigabyte (GB). However, cloud archive provider Proofpoint Inc. cites industry statistics that say organizations spend $18,000 to $20,000 per GB for legal document review during an e-discovery order. But only 2% to 5% of the documents retrieved are ever used in court, which means organizations spend between $200 and $300 per GB to filter out unneeded documents.
Clearly, the major cost is data management, not the underlying media. That doesn't mean that all data needs to be in a high-performance repository. Organizations should weigh the costs and benefits for each use case. For data that must be retained but is highly unlikely to be accessed before obsolescence, managers may make the conscious decision to put that data in a very low-cost repository and suffer the discovery pain in the unlikely event of a discovery requirement. For other cases, data with a high recovery probability may be stored at higher cost, but with more data management capabilities. In practice, most organizations will have some of both.
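The cost-benefit weighing described above can be sketched as a simple expected-cost comparison. This is only an illustration under stated assumptions: the tier names, prices and probabilities are hypothetical inputs, not figures from any provider, and the model ignores factors such as egress fees or legal sanctions.

```python
from dataclasses import dataclass


@dataclass
class ArchiveTier:
    """One archive option. Every figure here is a hypothetical input."""
    name: str
    storage_cost_per_gb_month: float  # what it costs to hold 1 GB for a month
    discovery_cost_per_gb: float      # cost to recall, search and review 1 GB


def expected_cost_per_gb(tier: ArchiveTier, months: int,
                         discovery_probability: float) -> float:
    """Storage cost over the retention period plus the
    probability-weighted cost of a discovery exercise."""
    if not 0.0 <= discovery_probability <= 1.0:
        raise ValueError("discovery_probability must be between 0 and 1")
    storage = tier.storage_cost_per_gb_month * months
    return storage + discovery_probability * tier.discovery_cost_per_gb


def cheaper_tier(a: ArchiveTier, b: ArchiveTier, months: int,
                 discovery_probability: float) -> ArchiveTier:
    """Return whichever tier has the lower expected cost per GB."""
    return min((a, b),
               key=lambda t: expected_cost_per_gb(t, months, discovery_probability))
```

With made-up tiers such as ArchiveTier("low-cost repository", 0.01, 300.0) and ArchiveTier("managed archive", 0.50, 20.0) over an 84-month retention period, the low-cost tier comes out ahead when discovery is very unlikely, and the managed tier comes out ahead as the probability of discovery rises. That is the trade-off the paragraph above describes.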
Four key cloud archiving considerations
Organizations looking for cloud archive solutions should weigh four aspects in particular:
Service-level delivery. Just like any other solution evaluation, IT managers need to define their service-level requirements. Because many cloud services are based on delivering low cost first and foremost, high-capacity/low-performance media can be expected. Cloud providers may or may not offer any particular service-level agreement other than data availability. In that regard, many will offer very high reliability with unspecified performance.
Discoverability. You've placed hundreds of terabytes of data in an archive and the cost per GB is a killer deal. How do you find the needle in the haystack that a legal discovery order demands? Purpose-built solutions may provide the data management discovery tools and perhaps even the services to find the desired data, but at a higher cost per GB. The lowest-cost, general-purpose services may require you to "roll your own" for data discovery. The lowest-cost media might not equate to the lowest-cost solution.
Regulatory compliance. Just retaining data is not enough to satisfy compliance. First, the correct data must be selected for retention. Just as importantly, data deletion and destruction are needed to prune data that has passed the required retention period. Legally, it can be as damaging to have unrequired data as it is to not have the required data. Data may also need to be secured in such a way as to ensure that it hasn't been tampered with, using WORM (write once, read many) or immutable storage.
Application integration. Just copying data to the cloud is a de minimis archive solution. To facilitate recovery, integration with the data's original application may be necessary. Organizations should understand how data is ingested into the archive solution.
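The retention and tamper-evidence points under regulatory compliance can be illustrated with a small sketch. It assumes a hypothetical catalog record (ArchivedRecord) holding an archive date, a retention period in days and a SHA-256 digest captured at ingest; none of these names come from a specific product. Note that a digest check only detects alteration after the fact. It is not a substitute for WORM or immutable storage.

```python
import hashlib
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass(frozen=True)
class ArchivedRecord:
    """Hypothetical catalog entry for one archived item."""
    record_id: str
    archived_on: date
    retention_days: int
    sha256_at_ingest: str  # digest recorded when the item was archived


def digest(content: bytes) -> str:
    """SHA-256 hex digest of the item's content."""
    return hashlib.sha256(content).hexdigest()


def is_past_retention(record: ArchivedRecord, today: date) -> bool:
    """True once the required retention period has fully elapsed."""
    return today > record.archived_on + timedelta(days=record.retention_days)


def split_by_retention(records, today: date):
    """Partition records into those to keep and those due for destruction."""
    keep = [r for r in records if not is_past_retention(r, today)]
    destroy = [r for r in records if is_past_retention(r, today)]
    return keep, destroy


def is_unaltered(record: ArchivedRecord, content: bytes) -> bool:
    """Compare the current content against the digest taken at ingest."""
    return digest(content) == record.sha256_at_ingest
```

In practice the destroy list would feed whatever deletion process the archive supports, and any legal hold would need to override it. That logic is left out here.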
Basic cloud archive services
Cloud archiving solutions have evolved to address both general-purpose and specific use-case scenarios. General-purpose solutions are basically inexpensive data repositories. Amazon Glacier, for example, advertises a starting price of just 1 cent per GB per month. Microsoft Azure also provides on-demand data storage for any data type.
Glacier is an extension of Amazon Simple Storage Service (S3); data can be moved from an S3 repository to Glacier using S3 lifecycle policies. Amazon includes a number of features with Glacier aimed at archive requirements. First of all, the company advertises 11 nines of durability. This is enabled by data being copied to multiple devices within multiple physical locations. Archives are stored as unique data sets up to 40 TB in size. TAR and ZIP files can also be ingested, so backup datasets can be archived individually. Data stored in Glacier is encrypted as well as immutable. Glacier doesn't provide e-discovery tools, and archive sets must be downloaded to either an S3 bucket or an in-house data repository for searching. In addition, data is deleted as an entire set, so data elements can't be selectively destroyed. In concept, this is similar to the expired backup sets that organizations are accustomed to.
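As a rough illustration of the S3 lifecycle policies mentioned above, the sketch below builds a rule that transitions objects under a given prefix to the GLACIER storage class after a set number of days. The helper name, rule ID, prefix, bucket name and 90-day threshold are all hypothetical. The dictionary follows the shape accepted by boto3's put_bucket_lifecycle_configuration call, which is shown only in a comment so the block runs without AWS credentials.

```python
def glacier_lifecycle_configuration(rule_id: str, prefix: str,
                                    days_before_transition: int) -> dict:
    """Build an S3 lifecycle configuration that moves objects under
    `prefix` to the GLACIER storage class after the given number of days."""
    if days_before_transition < 0:
        raise ValueError("days_before_transition must not be negative")
    return {
        "Rules": [
            {
                "ID": rule_id,
                "Filter": {"Prefix": prefix},
                "Status": "Enabled",
                "Transitions": [
                    {"Days": days_before_transition, "StorageClass": "GLACIER"}
                ],
            }
        ]
    }


# Hypothetical values, for illustration only.
config = glacier_lifecycle_configuration("archive-logs-to-glacier", "logs/", 90)

# Applying the rule requires boto3 and AWS credentials (bucket name is made up):
#   import boto3
#   boto3.client("s3").put_bucket_lifecycle_configuration(
#       Bucket="example-archive-bucket",
#       LifecycleConfiguration=config,
#   )
```

Keeping the rule as plain data makes it easy to review against the organization's retention policy before it is applied to a bucket.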
Azure doesn't provide a separate cloud storage facility specifically for archive, so organizations will need to set their own policies and organize their data accordingly. Azure does copy the data to at least three different physical locations to assure durability.
Specialized cloud archive services
In scenarios where data discovery is a matter of "when" rather than "if," IT managers will want to examine more specialized solutions. With regard to cloud, this fundamentally means SaaS solutions for specific products or use cases. Common examples would be email, documents, instant messaging and SAP. Email is one of those applications that all organizations must manage, but that requires special attention in regulated industries.
Proofpoint is a 100% cloud-based provider that focuses on communications in the financial industry, including email, documents, instant messages and any social media that transits the corporate network. The company's Enterprise Archive services are U.S. Securities and Exchange Commission (SEC) and Financial Industry Regulatory Authority (FINRA) compliant. For financial organizations, both security and fast discovery searches are paramount. Enterprise Archive encrypts the data at the customer site during data ingestion. It uses a "double blind" encryption key architecture, which ensures that only the client can read the encrypted data. Proofpoint itself can't access the data under any circumstances. Proofpoint does allow its customers to search the encrypted data, but the data can be decrypted only at the customer's site.
The company uses a grid-based architecture in its cloud to provide extremely fast searches, often returning results in seconds. One of Proofpoint's key differentiators is that it provides a search SLA. In fact, the company touts that its search capabilities can be used directly by the legal discovery team, making archive searches a self-service function. This can significantly shorten the discovery process, which is where the biggest cost savings can be achieved.
Another cloud archive provider, Mimecast, has a different slant on the market. It also focuses on regulated industries, but more on professional services (e.g., healthcare, legal) and some manufacturers than on financial services. Because the company has clouds in North America, Europe and Asia, Mimecast can cater to the needs of multinationals. Legal retention requirements vary by country, and Mimecast helps multinational companies cope with these differences even when data transits globally, including email and instant messages. Mimecast focuses on large enterprises only. It currently manages more than 25 PB of data and indexes more than 100 million messages per day. It even has a facility in the Channel Islands for organizations that need to retain data offshore.
Mimecast allows the repository to be managed by its customers or offers services to handle the management for them. Data is ingested in the form of .PST files, off hard drives, from third-party services or directly from Exchange servers. Mimecast's search capabilities can find communication patterns that form a "corporate memory" that's beneficial for European discovery exercises. European retention requirements tend to call for archiving data for as long as possible, whereas North American procedures tend to prune data as soon as possible. Thus, the company offers data destruction capabilities while warning that it's almost impossible to eliminate emails from every source.
Some of the more traditional in-house archiving solutions are adapting to cloud-based delivery. EMC InfoArchive is making that transition, largely by creating a third-party ecosystem of related services. InfoArchive has some interesting capabilities that IT managers may find instructive for archiving. For example, it ties data and content together while converting them to an XML database. This can include emails, voice recordings or transactional data. The value of the XML conversion is that it guarantees long-term data readability even if the original apps are no longer available. Thus, some key use cases for the product are decommissioning old applications and consolidating data from mergers and acquisitions. Data ingestion methods include creating an XML schema to match the source data tables and using third-party APIs and customized extract, transform, load (ETL) applications to convert data. In this way, InfoArchive can be adapted to application-specific archiving.
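The general idea of converting application tables to XML so they stay readable without the original application can be sketched with Python's standard library. This is not InfoArchive's API or schema. The function name, element names and sample data are assumptions made purely for illustration.

```python
import xml.etree.ElementTree as ET


def rows_to_xml(table_name: str, rows: list) -> str:
    """Serialize table rows (a list of dicts) into a self-describing XML
    string. Column names are stored as attributes, so any column name
    works even if it would not be a valid XML element name."""
    root = ET.Element("table", attrib={"name": table_name})
    for row in rows:
        record = ET.SubElement(root, "record")
        for column, value in row.items():
            field = ET.SubElement(record, "field", attrib={"name": column})
            field.text = "" if value is None else str(value)
    return ET.tostring(root, encoding="unicode")


# Hypothetical rows extracted from an application being decommissioned.
xml_text = rows_to_xml("orders", [
    {"order_id": 1001, "customer": "Acme & Co.", "total": "249.90"},
    {"order_id": 1002, "customer": "Globex", "total": "75.00"},
])
```

Because the output carries its own table and column names, a generic XML parser can read it years later. A production conversion would also need a schema, data typing and handling for binary content such as voice recordings.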
Data retrieval determines archive costs
Intuitively, it would seem that cloud archiving is a great way to reduce the cost of long-term data storage by shaving pennies per GB from the budget. However, the real cost of archiving is data retrieval, especially during a court-ordered discovery exercise. Thus, the savings opportunity may be best found by leveraging cloud-based archive services that can offer an economy of scale as well as the expertise IT organizations would have difficulty acquiring on their own.
A large, one-size-fits-all repository is tempting for simplicity's sake, but large-scale IT organizations will need a portfolio of archive capabilities to meet varying business needs. Cloud providers may be the simplest way to meet regulatory requirements at the lowest possible cost.
About the author:
Phil Goodwin is a storage consultant and freelance writer.