IT planners tend to focus on performance and scale when selecting a primary storage system, but data protection is another important consideration, and it is no longer the sole domain of the backup process.
Modern primary storage systems can go a long way to protecting themselves with built-in data protection systems that potentially replace backup or at least lighten its load. For IT professionals, the key is to understand the capabilities of these features and decide just how much of the backup load they would like primary storage to carry.
Protection from media failures
Media failure protection has been available on storage arrays since their inception. The goal of media protection is to keep data available if an individual storage device fails. But all media protection comes at a cost. The first cost is capacity overhead: how much additional storage capacity is required to maintain protection. At first, all media protection required mirroring, which has a 100% capacity overhead. RAID protection -- RAID 3, RAID 4 and RAID 5 -- soon followed, which required less capacity for protection but exacted another cost: processing power. RAID protection calculates parity to provide data redundancy, and that calculation consumes compute resources. The most sophisticated data protection systems consume the most compute resources.
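To make the capacity and compute trade-off concrete, here is a minimal Python sketch of single-parity protection in the style of RAID 5. The three-data-drive layout, the block contents and the helper name xor_blocks are assumptions chosen for illustration; real arrays rotate parity across drives and work on much larger stripes.

```python
from functools import reduce

def xor_blocks(blocks):
    """XOR equal-length byte blocks together, one byte position at a time."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

# Assumed layout: three data blocks on three drives, plus one parity block.
data = [b"AAAA", b"BBBB", b"CCCC"]
parity = xor_blocks(data)  # this calculation is the compute cost of parity RAID

# Simulate losing the second drive and rebuilding it from the survivors.
survivors = [data[0], data[2], parity]
rebuilt = xor_blocks(survivors)

# Capacity overhead expressed as extra capacity relative to usable capacity.
mirror_overhead = 1 / 1          # one extra drive per data drive: 100%
parity_overhead = 1 / len(data)  # one parity drive per three data drives: ~33%
```

In this sketch the rebuilt block equals the lost one because XOR-ing the parity with the surviving data cancels everything except the missing block. Note that the rebuild has to read every surviving block in the stripe.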
The final cost of protecting from media failure is measured by the time to return to a protected state. If there is a media failure, a mirrored arrangement can return to a protected state quickly. A RAID setup, because it has to recalculate all that parity, can take longer. As drive capacities increase, so does the time that it takes to recover from a drive failure. An array configured with 6 TB drives with a basic RAID implementation can take days to return a system to a fully protected state.
Failure protection falls short
It's clear that media failure protection has to improve. Recently, some vendors have introduced advanced RAID controllers that don't have to read the entire drive to recover data when doing a drive rebuild; they only need to rebuild the data that was actually on that drive. Considering that most drives run at about 35% of their capacity, intelligent RAID should reduce recovery times by 60% or more.
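The arithmetic behind that estimate can be sketched in a few lines of Python. The 50 MB/s effective rebuild rate is purely an assumption for illustration (rebuild rates vary widely with array load and RAID implementation), and rebuild_hours is a hypothetical helper, not a vendor formula.

```python
def rebuild_hours(capacity_tb, used_fraction, rate_mb_per_s):
    """Estimate hours needed to rebuild the used portion of a drive."""
    megabytes = capacity_tb * 1_000_000 * used_fraction
    return megabytes / rate_mb_per_s / 3600

RATE = 50  # MB/s; assumed effective rebuild rate while serving production I/O

full = rebuild_hours(6, 1.0, RATE)    # basic RAID: rebuild the whole 6 TB drive
smart = rebuild_hours(6, 0.35, RATE)  # intelligent RAID: rebuild only the 35% in use
reduction = 1 - smart / full          # fraction of rebuild time saved
```

Under these assumptions the full rebuild takes roughly 33 hours, and rebuilding only the used 35% cuts that by about 65%, which is consistent with the "60% or more" figure. A slower effective rate pushes the full rebuild into multiple days.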
An alternative to advanced RAID recovery is erasure coding, which uses parity-based data protection similar to RAID. Erasure coding is typically used in scale-out storage environments, which are built from a cluster of storage nodes. It provides better granularity than RAID and writes both data and parity across the storage nodes. The advantage from a failure perspective is that all the nodes in the storage cluster can participate in the replacement of a failed node. As a result, the rebuilding process does not become as CPU-constrained as it might in a traditional storage array with RAID.
For scale-out storage, an alternative to erasure coding is replication. Essentially, data from one node is mirrored to another node or to multiple nodes. Replication is simpler to calculate than erasure codes, but it does consume at least twice the capacity of the protected data.
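The capacity difference between the two approaches is easy to quantify. The following Python sketch compares raw capacity requirements; the 8+2 erasure coding layout and the 100 TB data set are assumptions picked for illustration, and both function names are hypothetical.

```python
def erasure_coded_capacity(usable_tb, data_fragments, parity_fragments):
    """Raw capacity needed when data is split into data and parity fragments."""
    return usable_tb * (data_fragments + parity_fragments) / data_fragments

def replicated_capacity(usable_tb, copies):
    """Raw capacity needed when every block is stored `copies` times."""
    return usable_tb * copies

usable = 100  # TB of protected data (assumed)

ec_raw = erasure_coded_capacity(usable, 8, 2)  # assumed 8 data + 2 parity layout
two_copies = replicated_capacity(usable, 2)    # mirror to one other node
three_copies = replicated_capacity(usable, 3)  # mirror to two other nodes
```

With these inputs, erasure coding needs 125 TB of raw capacity, while two-way and three-way replication need 200 TB and 300 TB respectively. The trade-off is that replication involves no parity calculation at all.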
Protection from data corruption
RAID protects from media failure but will not help an organization recover from data corruption caused by a user or software error. But with modern primary storage systems, administrators don't have to rely on recovering corrupted data from a recent backup if one of those errors occurs. Instead, they can leverage snapshots. On an array, the physical location of data is mapped to a table. When there's a request for data, the table determines the correct location of the data and routes the request appropriately. A snapshot, instead of making a copy of the data, makes a copy of the table. Data associated with the copied version of that table is then frozen or set to read-only for as long as the copied version of the table exists.
A snapshot creates a point-in-time copy of the data without actually copying data. Data growth only occurs when updates are made to data under snapshot or when new data is added. When this occurs, the original table is also updated. The application uses the original table to gain access to the live data set. Other processes, such as backup or replication, use the snapshot table to access the point-in-time copy of the data.
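The table-copy mechanism can be illustrated with a toy Python model. This is a sketch under stated assumptions, not how any particular array implements snapshots: it uses a redirect-on-write approach (new writes go to a fresh location), whereas some systems use copy-on-write instead, and the Volume class and its methods are hypothetical names.

```python
class Volume:
    """Toy volume that maps logical blocks to physical locations via a table."""

    def __init__(self):
        self.blocks = {}     # physical location -> data
        self.table = {}      # live table: logical block -> physical location
        self.snapshots = {}  # snapshot name -> frozen copy of the table
        self._next = 0       # next free physical location

    def write(self, logical, data):
        # Redirect-on-write: new data lands in a fresh location, so blocks
        # still referenced by a snapshot table are never overwritten.
        self.blocks[self._next] = data
        self.table[logical] = self._next
        self._next += 1

    def snapshot(self, name):
        # Only the table is copied; no data blocks are duplicated.
        self.snapshots[name] = dict(self.table)

    def read(self, logical, snapshot=None):
        table = self.snapshots[snapshot] if snapshot else self.table
        return self.blocks[table[logical]]


vol = Volume()
vol.write(0, "v1")
vol.snapshot("before-change")
vol.write(0, "v2")  # the update consumes one new block; the live table changes

live = vol.read(0)                             # application view
frozen = vol.read(0, snapshot="before-change")  # backup or replication view
```

Here the application reads the updated data through the live table while the snapshot table still resolves to the original block. Capacity grew by exactly one block, the one written after the snapshot was taken.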
Snapshot technology has been available on storage systems for more than a decade but over the last few years, it has become among the most valuable data protection systems. First, most storage systems can track hundreds of snapshots without any major impact on performance. Second, most snapshot features within storage systems have the ability to interface directly with applications such as Oracle and Microsoft SQL Server to capture a clean copy of data while the snapshot is occurring. These two advances mean that snapshots can be taken frequently and stored for fairly long periods of time with the assurance that the data they store is usable for recovery.
Snapshots are valuable when there is data corruption or an accidental deletion. In those cases, the snapshot can be mounted and data copied back to the production volume, or the snapshot can simply replace the existing volume. In both scenarios, data loss is minimal and time to recover is almost instant.
Protection from storage system failure
The type of failure that used to force a recovery from the backup software's data was a failure of the storage system itself, caused by multiple drive failures, a bug in the storage software or firmware, or some other crippling event. Now, data centers can leverage replication technology that builds on top of snapshots to deliver protection from a storage system failure.
Snapshot replication leverages the snapshot's granular understanding of data, and only copies changed blocks of data from the primary storage system to a secondary storage system. Typically, snapshot replication is used to create an off-site disaster copy but the reality is that most "disasters" are not data center-wide; they often involve just one critical server that has failed. Snapshot replication can be used to replicate data to a secondary storage system that may also be on-site. This secondary storage system can be used as a recovery point if the primary storage system fails. Of course, snapshot replication can -- and should -- still be used to update a third storage system off-site.
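The changed-block idea can be sketched in Python as follows. This is a simplified illustration: snapshots are modeled as plain dictionaries of block contents, whereas a real array would compare snapshot table entries rather than data, and deleted blocks are ignored. The function names changed_blocks and replicate are hypothetical.

```python
def changed_blocks(previous_snapshot, current_snapshot):
    """Return only the blocks that are new or differ since the last snapshot."""
    return {
        block: data
        for block, data in current_snapshot.items()
        if previous_snapshot.get(block) != data
    }

def replicate(target, delta):
    """Apply the changed blocks to the secondary storage system's copy."""
    target.update(delta)

# Assumed state: the secondary already holds the previous snapshot.
previous = {0: "a", 1: "b", 2: "c"}
current = {0: "a", 1: "B", 2: "c", 3: "d"}
secondary = dict(previous)

delta = changed_blocks(previous, current)  # only blocks 1 and 3 are sent
replicate(secondary, delta)
```

Only two of the four blocks cross the wire in this example, yet the secondary ends up identical to the current snapshot. The same delta could then be forwarded to a third, off-site system.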
A common concern about this approach is cost. At one time, the target replication system had to match the originating system, but now most storage system vendors allow replicating snapshots to a lower-cost storage system within their product portfolio. In other words, a data center can replicate from a tier 1 storage system to a tier 2 storage system. Alternatively, a third-party replication application or a software-defined storage product can be used to replicate from any storage system to any other storage system.
Another alternative is to leverage inline deduplication to create a full copy of the data with zero data capacity growth. With products that provide that capability, a copy of the data is made and is deduplicated as it is copied. In other words, no data is written since the data already exists; only deduplication metadata is updated. These deduplicated copies can be more useful than snapshots but are still exposed to metadata vulnerabilities.
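A toy content-addressed store in Python shows why such a copy consumes no additional data capacity. This is a sketch under assumptions, not any product's design: the DedupStore class is hypothetical, blocks are fingerprinted with SHA-256, and real systems add reference counting and metadata protection.

```python
import hashlib

class DedupStore:
    """Toy inline-deduplicating store: each unique block is kept once."""

    def __init__(self):
        self.chunks = {}   # fingerprint -> block data, stored once
        self.volumes = {}  # volume name -> list of fingerprints (metadata)

    def write_volume(self, name, blocks):
        refs = []
        for block in blocks:
            fingerprint = hashlib.sha256(block).hexdigest()
            # Inline deduplication: store the block only if it is new.
            self.chunks.setdefault(fingerprint, block)
            refs.append(fingerprint)
        self.volumes[name] = refs

    def read_volume(self, name):
        return [self.chunks[fp] for fp in self.volumes[name]]

    def stored_bytes(self):
        return sum(len(chunk) for chunk in self.chunks.values())


store = DedupStore()
store.write_volume("prod", [b"alpha", b"beta", b"alpha"])
before = store.stored_bytes()

# Making a "full copy" writes no new data; only metadata is added.
store.write_volume("copy", store.read_volume("prod"))
after = store.stored_bytes()
```

The stored byte count is unchanged after the copy because every block already exists in the store. It also illustrates the vulnerability noted above: both volumes depend on the same chunks and on the integrity of the fingerprint metadata.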
The value of having a secondary storage system on-site cannot be overstated. It protects the data center from one of the worst disasters possible -- a primary storage system failure. The secondary system can be used to feed other processes like analytics, reporting and, of course, the backup process itself.
Data center failure
Minor disasters like application data corruption or storage system failure are far more common, but a full-site disaster tends to capture the most attention. While most IT professionals will never experience a site disaster firsthand, the consequences of a site loss are so severe that a data center must have a disaster recovery plan. Again, as with the other failure scenarios, there are multiple DR options. The first is leveraging the snapshot replication described above to replicate data to a secondary site. The problem is that the costs associated with equipping, powering and staffing a secondary site can be overwhelming, especially when you consider that the chances of needing the secondary site are relatively slim.
For larger organizations that already have a secondary data center, the costs may be manageable. Many businesses, however, don't have secondary sites, so they should consider the measured use of cloud services for disaster recovery. "Measured use" means that the cloud service should be used to store only the most recent copies of data. The recent data copies are the ones most likely to be needed in the event of a major disaster. There are replication and cloud backup products and services that provide the ability only to keep the latest versions of data copied in the cloud.
The cloud can then be used to instantiate application images in the provider's cloud. The result is a rapid (not instantaneous) recovery in the event of a data center loss.
Ready for failure
Primary storage can better protect itself from data failures than ever before, but those internal data protection systems don't add up to a total answer for adequate data protection. While most storage systems can provide more frequent data copies and more rapid data recoveries, at some point, storage managers should make sure that a copy of data that is entirely separate from the primary storage system is made and safely tucked away. In addition, IT planners need to consider that increasingly disasters are caused by outside intrusions such as ransomware, so it's important that data centers have a disconnected, off-line copy of data in addition to their secondary online copies.