

What is the best strategy for archiving data in place?

Archiving data in place is becoming more attractive to enterprises because it allows data to be archived on the storage it already resides on.

For starters, understand that the archive-in-place strategy means leaving the data where it is physically located -- and somehow marking that data so it is recognized by its archival class or category. That is a big shift from where we are today.

In a disciplined IT environment, we already apply special protection to certain data. Data that supports a mission-critical, always-on application may be included in a data protection service that replicates the data synchronously to a secondary location, from which it can be accessed instantly if the primary location becomes unavailable. That sort of high availability is one of a range of data protection services, and usually the most expensive. It is therefore applied only to the most important and interruption-sensitive applications and data. Other protective services are applied to data that supports less critical applications, or applications that can sustain a temporary interruption in access without dire consequence to the organization.

When talking about archiving data in place, it helps to set data protection policies side by side with data preservation policies: both are services applied selectively, according to what the data is worth to the organization.

Just as those data protection services are applied by smart data storage managers and planners in a judicious way, data preservation (archive) services can also be applied granularly to the data that requires them -- data that belongs in an archive for historical, legal or regulatory reasons. As with data protection services, not all data requires the same caliber of data preservation services. More to the point, it is not necessary or even desirable to deploy or leverage a standalone storage platform for archive; archiving data in place is probably the goal going forward.

Archiving in place involves first selecting, classifying and marking data that has archival value. There are many ways to accomplish this, including adding metadata to a file at the point of creation. With file systems, metadata additions are possible but restricted by practical issues such as metadata header space and metadata processing efficiency. Microsoft, for example, has opened up its File Classification Infrastructure (FCI) facility to enable administrators to create metadata classes that can be stored with files. Right-clicking a file in a Windows environment shows attributes such as HIDDEN, ARCHIVE, SECURITY and so on. With FCI, admins can add new categories such as ACCOUNTING (referring to a department with special data archiving rules), HIPAA (a specific regulation), SEC (to identify data required in Securities and Exchange Commission filings), and so forth. If these tick boxes are checked on a given desktop system or for a specific user identified as part of an Active Directory group, all data saved to storage by that desktop system or user will be marked with that metadata attribute.
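As a rough illustration of the marking step, the Python sketch below tags a file record with archive classes based on the groups its owner belongs to. It does not use FCI or Active Directory; the group names, the GROUP_TO_CLASSES table and the FileRecord type are hypothetical stand-ins chosen for this example.

```python
from dataclasses import dataclass, field

# Hypothetical mapping from a user's directory groups to archive classes.
# The class names echo the examples above; the group names are made up.
GROUP_TO_CLASSES = {
    "accounting-staff": {"ACCOUNTING", "SEC"},
    "clinical-records": {"HIPAA"},
}


@dataclass
class FileRecord:
    path: str
    owner_groups: set
    classes: set = field(default_factory=set)


def classify(record, rules=GROUP_TO_CLASSES):
    """Attach every archive class implied by the owner's groups."""
    for group in record.owner_groups:
        # Groups with no rule contribute no classes.
        record.classes |= rules.get(group, set())
    return record


record = classify(FileRecord("/finance/q3-ledger.xlsx", {"accounting-staff", "everyone"}))
print(sorted(record.classes))
```

The point of the sketch is that the class travels with the record from the moment it is saved, so later policy decisions do not depend on where the file happens to be stored.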

Marking the data makes it easier to apply a policy consistently to data of a given class or type. In the case of archive, policies might include replication, or the creation of erasure-coded objects and their distribution around the infrastructure, to ensure the file is preserved even in the event of a partial storage failure; periodic integrity checks of original and replicated instances; identification of deletion dates; and exclusion from processes like compression or deduplication that might conflict with requirements such as SEC rules regarding original and unaltered data.
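A minimal Python sketch of such per-class policies follows. The PreservationPolicy type, the POLICIES table and every number in it are assumptions made for illustration, not actual regulatory retention periods, and a SHA-256 digest stands in for whatever integrity check a real product would use.

```python
import hashlib
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass(frozen=True)
class PreservationPolicy:
    replicas: int            # number of copies to keep
    retention_days: int      # how long before deletion is allowed
    allow_dedup: bool
    allow_compression: bool


# Hypothetical values for illustration only; real retention rules differ.
POLICIES = {
    "SEC": PreservationPolicy(3, 2190, False, False),
    "ACCOUNTING": PreservationPolicy(2, 1825, True, True),
}


def strictest(classes):
    """Merge the policies of all classes on a file, keeping the strictest terms."""
    chosen = [POLICIES[c] for c in classes if c in POLICIES]
    if not chosen:
        raise ValueError("no preservation policy for these classes")
    return PreservationPolicy(
        replicas=max(p.replicas for p in chosen),
        retention_days=max(p.retention_days for p in chosen),
        allow_dedup=all(p.allow_dedup for p in chosen),
        allow_compression=all(p.allow_compression for p in chosen),
    )


def fingerprint(data: bytes) -> str:
    """Digest recorded at archive time for later integrity checks."""
    return hashlib.sha256(data).hexdigest()


def is_intact(data: bytes, expected: str) -> bool:
    """Periodic check: does this instance still match the recorded digest?"""
    return fingerprint(data) == expected


def deletion_date(created: date, policy: PreservationPolicy) -> date:
    return created + timedelta(days=policy.retention_days)
```

A file marked with both classes would get the merged policy from strictest({"SEC", "ACCOUNTING"}), which here keeps three replicas and rules out deduplication and compression, because at least one of its classes forbids them.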

In short, it may not be necessary to move data to a specialty archive platform if we can archive data in place. This fact, in turn, changes traditional views of archive and information lifecycle management that posited a four-part design for an archive system:

  1. A data classification system (maintained in the archive-in-place strategy)
  2. A storage classification system
  3. A policy engine to identify what data to preserve and how
  4. A data mover to move the data into the archive platform based on policy

With archive in place, only the data classification system and the policy engine are needed. Server-side storage flattens hierarchical storage design, so there is nothing to differentiate one tier of storage hardware from another. And since there is no storage platform differentiation, there is no need for a data mover to ingest data into a specialty archive platform.

What will likely be needed to enable a true archive-in-place strategy is an object-based storage environment. File systems are already flattening in Web storage environments, where complex tree-like structures for storing binary objects are routinely discarded in favor of a "horizontal" and infinitely scalable storage tableau. The eventual elimination of the file system structure seems the inevitable outcome of simplified server-side and Hadoop storage architecture and the growing interest in self-describing data objects that provide extensive metadata constructs for use by data management software to facilitate the policy-based allocation of placement, preservation and protection services.
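To make the idea of a flat namespace of self-describing objects concrete, here is a toy in-memory sketch in Python. FlatObjectStore and its methods are hypothetical names invented for this example and do not reflect any vendor's API; the object ID is simply the SHA-256 digest of the content, which is one common convention rather than a requirement.

```python
import hashlib


class FlatObjectStore:
    """Toy flat namespace: no directories, just IDs mapped to data plus metadata."""

    def __init__(self):
        self._objects = {}

    def put(self, data: bytes, metadata: dict) -> str:
        # Content-derived ID; the metadata is stored alongside the object itself.
        object_id = hashlib.sha256(data).hexdigest()
        self._objects[object_id] = (data, dict(metadata))
        return object_id

    def get(self, object_id: str):
        return self._objects[object_id]

    def find(self, key: str, value) -> list:
        """Return IDs of objects whose metadata matches, as a policy engine might."""
        return sorted(
            oid for oid, (_, meta) in self._objects.items()
            if meta.get(key) == value
        )


store = FlatObjectStore()
ledger_id = store.put(b"Q3 ledger", {"class": "SEC", "department": "ACCOUNTING"})
store.put(b"lunch menu", {"class": "NONE"})
print(store.find("class", "SEC") == [ledger_id])
```

Because each object carries its own metadata, management software can select data by class with a query like find("class", "SEC") instead of relying on where a file sits in a directory tree.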

The bottom line is that for companies pursuing federated storage models and big data analytics, implementing object storage and archiving data in place is a must. Burgeoning technologies are coming from numerous storage companies such as Dell, EMC, HP, IBM, NetApp, Quantum and Spectra Logic. The real bet for dealing with future unknowns, however, is likely to be a hardware-agnostic approach, and Caringo's SWARM is a good place to start looking for one.
