Information and data are key assets for all businesses, and are a major responsibility for IT departments. As part of normal business processes, IT replicates and copies data every day for a wide range of needs. Unfortunately, this copy data consumes excessive amounts of storage, which constrains capacity and adds cost.
One way to reduce this data sprawl is to use copy management systems. Although copy data management (CDM) is a relatively new technology, its roster of vendors already includes a handful of small companies, while large storage system vendors have added CDM capabilities to their existing products.
Data is replicated in IT organizations for multiple purposes. Excluding disaster recovery, where data is faithfully copied to another platform, most copies are point-in-time, meaning they represent a static image of data at a particular time interval. Snapshots, for example, are taken hourly or daily to provide IT organizations with quick recovery from data corruption or data loss, such as users deleting files.
Data copies can also be used to seed test environments for application development. In this instance, a separate copy of the data is created to provide isolation from the primary copy for obvious reasons, such as compliance or the risk of corrupting the production image.
Before server virtualization, applications used dedicated development and user acceptance testing (UAT) systems. Development environments were used to test code, while UAT systems tested system load.
Now, as applications are becoming more virtual, and with the adoption of containers, the process of implementing application changes is taking on a more DevOps approach of rapid iteration and rollout. This means developers want multiple copies of test data to be available at the same time, even if many copies may only be kept for a few hours or days.
Maintaining multiple copies of data becomes an organizational headache. Each copy must be tracked against an owner so it can be released at some point. Within the backup world, there are processes already in place to manage this lifecycle. Backups are typically rotated through a time-based cycle automatically.
But this is not appropriate for test or development data, so new processes need to be developed. This is one of the opportunities for copy data management.
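To make the lifecycle idea concrete, here is a minimal sketch of tracking each copy against an owner and a retention period, so expired copies can be identified for release. The class and field names are illustrative assumptions, not any vendor's actual data model.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class DataCopy:
    """One tracked copy of a data set (hypothetical model)."""
    name: str
    owner: str          # every copy is tracked against an owner
    created: datetime
    ttl: timedelta      # retention period agreed with that owner

    def expired(self, now: datetime) -> bool:
        # A copy is releasable once its retention window has lapsed.
        return now >= self.created + self.ttl


def expired_copies(copies, now):
    """Return the copies whose retention has lapsed and that can be released."""
    return [c for c in copies if c.expired(now)]
```

A test/dev copy might carry a TTL of days, while a backup copy rotates on a longer, time-based cycle; the same mechanism covers both.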
With the move to server virtualization, all the secondary data uses we have already discussed are derived from images of virtual machines. Modern hypervisors provide interfaces and APIs to extract virtual machine (VM) data at the block level, making the backup process relatively simple.
Hypervisors typically offer snapshot capabilities, too, but these are not without penalty. Keeping multiple snapshots and, more importantly, consolidating those snapshot updates at a later time, can have a significant performance impact on the applications running in a VM. This is another issue that copy management systems can address.
There are many internal IT processes and systems taking point-in-time data copies for a range of purposes. With high penetration of server virtualization, most of the various processes taking data extracts via APIs are using the same or similar interfaces to extract their data. It makes sense to consolidate these functions into a single platform.
Consolidation can provide the following significant benefits:
- Cost savings. Data for a range of purposes (archive, backup, test/development) can be collapsed onto a single set of hardware, reducing the need to run multiple platforms, as well as the associated cost of deploying, maintaining and upgrading each one. This saves money on hardware, power, space and cooling.
- Reduced operational impact. Moving operational tasks to another platform reduces their impact on production. It eliminates snapshot management (and the performance overhead). You can manage data recovery on the secondary platform instead of using the production system. This reduces the risk of accidentally overwriting a production system.
- Reduced security impact. As an extension of the operational benefits, putting secondary data onto another platform allows easy segmentation of security permissions. Teams that require access to production images, for whatever purpose, can be isolated from a security perspective and audited separately.
Why has the ability to implement CDM only come around now? There are a few technical innovations that make copy management systems attractive compared to running separate platforms.
The first innovation is data deduplication. This is the process of removing physical data redundancy from a data set by eliminating repeated pieces of data, usually at the block level. Instead, a single physical copy is retained, with metadata and pointers used to map the logical to physical relationship of the data.
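The mechanism can be sketched in a few lines: hash each fixed-size block, keep one physical copy per unique hash, and use a pointer table to map the logical layout back to the physical blocks. The block size and class name below are assumptions for illustration, not a real product's design.

```python
import hashlib

BLOCK_SIZE = 4096  # assumed fixed block size for this sketch


class DedupStore:
    """Toy block-level deduplicating store: one physical copy per unique
    block, plus metadata pointers mapping logical blocks to it."""

    def __init__(self):
        self.blocks = {}     # block hash -> physical block data
        self.pointers = []   # logical block index -> block hash

    def write(self, data: bytes):
        for i in range(0, len(data), BLOCK_SIZE):
            block = data[i:i + BLOCK_SIZE]
            digest = hashlib.sha256(block).hexdigest()
            self.blocks.setdefault(digest, block)  # store only the first copy
            self.pointers.append(digest)

    def read(self) -> bytes:
        # Reassemble the logical data set from the pointer table.
        return b"".join(self.blocks[h] for h in self.pointers)

    def physical_size(self) -> int:
        return sum(len(b) for b in self.blocks.values())
```

Writing the same image twice doubles the logical size but leaves the physical footprint unchanged, which is exactly why highly redundant secondary data deduplicates so well.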
Secondary data is highly redundant, with multiple copies created of the same virtual machines and the same underlying VM images. This makes the savings from deduplication considerable, and even more so when combining multiple point-in-time sources, such as backups and data images.
The increased processing power of today's hardware platforms, most of which are now based on Intel x86 architecture, means you can perform techniques such as deduplication, zero-detect and compression without additional custom hardware. This allows copy management technology to focus on adding value through software, even if it is sold as a hardware/software combination.
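As a software-only illustration of two of those techniques, the sketch below applies zero-detect (an all-zero block is recorded by length alone) and falls back to compression, using Python's standard zlib. The function names and the tagged-tuple format are assumptions for the example.

```python
import zlib


def reduce_block(block: bytes):
    """Apply zero-detect, then compression; return a (kind, payload) pair."""
    if not any(block):                    # zero-detect: all-zero block
        return ("zero", len(block))       # store only the length
    compressed = zlib.compress(block)
    if len(compressed) < len(block):      # keep compression only if it helps
        return ("zlib", compressed)
    return ("raw", block)


def restore_block(kind, payload) -> bytes:
    """Invert reduce_block for any of the three stored forms."""
    if kind == "zero":
        return b"\x00" * payload
    if kind == "zlib":
        return zlib.decompress(payload)
    return payload
```

On modern x86 CPUs, this kind of per-block work runs comfortably in software, which is the point of the paragraph above: the value moves into the software layer even when the product ships as an appliance.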
We can also add flash storage to this innovation cycle. Flash provides high-performance access to random data (typically the sort created through deduplication), with device capacities increasing as prices continue to fall.
Separation of hardware
Copy management systems are typically implemented as separate platforms, rather than directly with production systems. This design choice helps follow the standard backup rule of keeping data on physically separate platforms from production. This isolates the primary and secondary data from logical corruption, and enables the data to be placed remotely, if required.
Secondary platforms also treat data differently than primary storage. With production data, the aim is to deliver I/O to the application as fast as possible, with snapshots taken infrequently. With copy management technology, data continually changes, with application updates stored continuously, and the requirement to access historical data a part of normal operation.
As a result, data storage and retrieval in copy management systems must be structured so that accessing data from six months ago incurs as little performance penalty as accessing data from five minutes ago. This means structuring data internally differently from production systems. This data structure and the associated metadata are needed to provide advanced features such as search, which adds to the overall value of CDM as a backup and archive platform.
An extension to the cloud
Copy management systems can take advantage of the flexibility offered by the public cloud. Hyperscale services, such as Amazon Web Services, Microsoft Azure and Google Cloud Platform, provide unlimited compute and storage resources for a monthly charge based on consumption. Public cloud turns what would have been capital purchases into operational ones, charging for only the resources consumed, rather than the ones purchased.
Copy management systems that extend to the public cloud enable organizations to offload older data that is much less likely to be needed immediately for data recovery or testing. It means this CDM system can effectively become an archive for the application (which is why search becomes such an important feature).
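An offload policy of this kind can be as simple as an age threshold: copies older than some cutoff are unlikely to be needed for immediate recovery and become candidates for cloud object storage. The 90-day threshold and function name below are hypothetical.

```python
from datetime import datetime, timedelta

OFFLOAD_AGE = timedelta(days=90)  # assumed policy threshold, not a product default


def offload_candidates(copies: dict, now: datetime) -> list:
    """Given {copy_name: creation_time}, pick copies old enough to
    move to cheaper public cloud object storage."""
    return [name for name, created in copies.items()
            if now - created >= OFFLOAD_AGE]
```

Real CDM products would layer access-frequency and recovery-objective rules on top of simple age, but the tiering decision has the same shape.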
As CDM cloud support matures, we can imagine the ability to instance applications directly in the public cloud for test/dev work, rather than having the data on premises. This could result in significantly reduced costs and could, of course, move those costs to an operational model.
What the leading copy data management vendors offer
Now that we have a better understanding of what CDM can offer, let's take a brief look at the leading copy management systems available today. These products were selected based on extensive research into the market share leaders and which products best fit the buying criteria presented.
Both Rubrik and Cohesity are tackling copy management with combined hardware and software offerings. These copy management systems are typically scale out, and they interface with the public cloud.
Actifio and Druva offer software-only products, both capturing data from existing hardware platforms, including virtual server environments and traditional applications.
Catalogic Software takes advantage of the snapshot capabilities of the underlying storage platform, managing snapshots from EMC, IBM and NetApp storage arrays.
Hitachi Data Instance Director manages snapshot and image copies on Hitachi Data Systems' enterprise Virtual Storage Platform and Hitachi NAS Platform, with support for traditional applications such as Oracle, Exchange, SQL Server and SAP HANA.
Dell EMC offers Enterprise Copy Data Management, which manages data across Dell EMC VMAX, XtremIO and Data Domain platforms.
Delphix software is focused on solving copy management issues for databases.
Commvault offers a range of CDM-like features in its all-encompassing data management platform.