Whether you call it data gravity or data inertia, moving data from one piece of storage infrastructure to another can be a painful process. At least that's how it used to be. These days, with the right tools and infrastructure, many of the pain points and headaches associated with a traditional data migration process can be eliminated. All it takes is a little forward planning and the right technology.
A recent Hitachi Data Systems report details research from IDC and The 451 Group. The data presented shows that migrations represent 60% of large enterprise IT projects and that nearly half of all IT budgets are devoted to operational costs, a convincing indication that migrations can consume a significant portion of an IT budget. With an estimated cost of $15,000 per terabyte of migrated data, it's not surprising that migrations are seen as a daunting prospect by many IT departments. And there are plenty of reasons why data migrations have become such an issue.
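To put the per-terabyte figure in context, here is a minimal sketch that scales the $15,000 estimate quoted above to a few estate sizes. The estate sizes and the function name are illustrative assumptions, not figures from the report, and real costs are unlikely to scale perfectly linearly.

```python
COST_PER_TB_USD = 15_000  # per-terabyte estimate quoted in the report above


def migration_cost_usd(terabytes: float, cost_per_tb: float = COST_PER_TB_USD) -> float:
    """Return a rough linear cost estimate for migrating `terabytes` of data."""
    if terabytes < 0:
        raise ValueError("terabytes must be non-negative")
    return terabytes * cost_per_tb


if __name__ == "__main__":
    # Hypothetical estate sizes, chosen only to show the scale of the numbers.
    for tb in (10, 100, 500):
        print(f"{tb:>4} TB -> ${migration_cost_usd(tb):,.0f}")
```

At that rate, even a modest 100 TB migration would come to a seven-figure sum.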
Typical causes of migration headaches include:
Complexity. Today's monolithic storage arrays are complicated beasts, supporting many thousands of LUNs or volumes connected to hundreds of hosts across Fibre Channel (FC), iSCSI and FC over Ethernet networks. Deployments include local replication (snapshots, clones) and remote replication (both synchronous and asynchronous) with inter-system application dependencies that need to be taken into consideration. Modern arrays are implemented with many tiers of storage, and use performance management features such as dynamic tiering to deliver optimum I/O response time.
Technical dependencies. Systems with many hosts will have been deployed over many years, so hardware and software devices, firmware and device drivers may vary widely. All of these components may need upgrading or refreshing prior to a migration occurring. In some cases, devices may not be supported, representing a risk point or the need to spend money to replace hardware.
Operational dependencies. Most enterprise IT environments operate on a 24/7 basis, making it difficult or nearly impossible to create an outage for a planned migration. This situation is especially true where complex server dependencies exist and business continuity/disaster recovery (BC/DR) service-level objectives need to be maintained. A huge amount of time can be spent simply planning and re-planning migrations and negotiating with change teams.
Scale. Storage arrays are capable of storing vast amounts of data. The latest monolithic arrays from EMC and Hitachi scale to more than 4 PB of capacity. The limited rate at which data can be moved between physical locations means petabyte volumes of data will take a long time to transfer, and throughout that time you must ensure there's little or no performance impact on production applications.
Cost. Data migrations take serious planning and need many resources to be executed effectively, from project managers to storage architects and application owners who need to be on hand to validate that the migration has been completed successfully. There's also the cost of maintaining two sets of equipment during the migration process, so the longer migrations take, the higher the run rate is for maintaining additional duplicate hardware.
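The scale problem is easy to quantify with some back-of-the-envelope arithmetic. The sketch below estimates how long a bulk copy would take, assuming decimal petabytes, a single 10 Gbps link and a hypothetical "efficiency" factor that represents throttling the copy to protect production I/O. All of those values are assumptions chosen for illustration.

```python
PB = 10**15    # bytes in a decimal petabyte (assumption: decimal units)
GBIT = 10**9   # bits per second in one gigabit per second


def transfer_days(capacity_bytes: float, link_bits_per_sec: float,
                  efficiency: float = 1.0) -> float:
    """Days needed to move `capacity_bytes` over a link of the given speed.

    `efficiency` is the fraction of the link the migration may actually use
    (a hypothetical throttle to limit impact on production workloads).
    """
    if not 0 < efficiency <= 1:
        raise ValueError("efficiency must be in the range (0, 1]")
    seconds = capacity_bytes * 8 / (link_bits_per_sec * efficiency)
    return seconds / 86_400  # seconds per day


if __name__ == "__main__":
    full = transfer_days(4 * PB, 10 * GBIT)
    throttled = transfer_days(4 * PB, 10 * GBIT, efficiency=0.5)
    print(f"4 PB at 10 Gbps, full rate: {full:.1f} days")
    print(f"4 PB at 10 Gbps, 50% rate:  {throttled:.1f} days")
```

Under these assumptions a fully populated 4 PB array needs roughly 37 days at full line rate and about 74 days when throttled to half the link, and both sets of hardware must be paid for throughout.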
Of course, these issues only reflect the problems of device-to-device migrations. The considerations are different if an application or data is moved into the public cloud. And if a company has built and manages large-scale pools of data based on new technologies such as Hadoop, there are other concerns. Clearly, across the enterprise, data mobility is an issue.
Ultimately, the aim of any migration strategy is improving data mobility. For each of the broad categories already discussed, we'll look at some of the migration techniques and architecture designs that can help mitigate some common migration issues.
One of the most common requirements in data migration is to move data between storage arrays or appliances. If we consider block-based protocol data for a moment, typical approaches include:
Host based. Data is migrated at the host level by copying it from volume to volume, with both the old and new volumes presented to the host. The copy process can be quite basic (e.g., tools such as Robocopy) or more sophisticated (using logical volume managers). Host-based migrations also offer an opportunity to change how data is laid out across the physically presented volumes.
Array based. Data is moved between arrays using array-level migration tools. For homogeneous transfers (where the source and target devices are from the same manufacturer and are like models), this can be achieved with native replication tools, albeit with some restrictions. Heterogeneous migrations are more complex, although tools such as EMC's Open Migrator or HP 3PAR Online Import will allow data to be moved from third-party storage arrays.
Hypervisor based. One of the benefits of server virtualization is the ability to live migrate virtual machines (VMs) through features such as VMware's vSphere Storage vMotion and Microsoft Hyper-V Live Migration. Both of those tools enable a VM to be moved between storage on disparate arrays, with the added benefit of being able to move cross-protocol, for example from block to NFS in the case of vSphere. Using the hypervisor to perform data migrations incurs licensing costs, but it does significantly reduce the operational issues of data migration.
Virtualization appliance based. One highly effective way to perform data migrations is to abstract the storage from the host by means of an appliance, such as IBM's SAN Volume Controller or EMC's VPLEX. These products virtualize the underlying storage arrays and offer migration tools to move data between physical locations while keeping the logical data available and with no impact to the host. An outage may be required to install the appliance into the data path, although in many cases this is achieved with minimal interruption compared to the time required to move data. Many IT departments choose to run with a virtualization appliance permanently in place, providing the capability to seamlessly manage future migrations or perform data rebalancing for capacity and performance needs.
Virtual array based. The latest version of Hitachi's Virtual Storage Platform, the G1000, provides the facility to virtualize external storage and seamlessly move data between arrays. This takes the virtual appliance features one step further by allowing data to be moved into the array and between homogeneous arrays without an outage. The G1000 (HP's OEM version is the XP7) is unique in being able to migrate data between monolithic arrays with only limited planned downtime and, if the data already resides on the G1000, without any outage at all.
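Of the approaches above, the host-based one is the simplest to illustrate. The following is a minimal sketch of a file-level copy between two mounted volumes with a checksum comparison afterward. It assumes both volumes are mounted on the same host and that the data is quiesced during the copy; the function names are hypothetical, and this is not how Robocopy or a logical volume manager works internally.

```python
import hashlib
import shutil
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1024 * 1024) -> str:
    """Hash a file in chunks so large files are not read into memory at once."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def copy_and_verify(source_root: Path, target_root: Path) -> list[Path]:
    """Copy every file under source_root to target_root.

    Returns the relative paths whose checksums differ after the copy.
    Empty directories are not recreated in this simplified sketch.
    """
    mismatches = []
    for src in sorted(source_root.rglob("*")):
        if not src.is_file():
            continue
        rel = src.relative_to(source_root)
        dst = target_root / rel
        dst.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(src, dst)  # copies file contents plus timestamps/permissions
        if sha256_of(src) != sha256_of(dst):
            mismatches.append(rel)
    return mismatches
```

An empty list from copy_and_verify suggests the copy is consistent, which is the kind of validation application owners are asked to sign off on before the old volume is retired.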
With file-based protocols, data migrations can be equally complicated. In most cases, the main issue for data migration isn't the transfer of the data, but making that data available to the user during and after the migration process with no change of logical location. File shares are typically mapped by server name or Universal Naming Convention (UNC) path, either of which can change as data is moved to a new filer. Here the solution is to abstract the data location as part of or before the migration process. This can be achieved using technology such as Microsoft's Distributed File System, a software solution integrated into Active Directory, or Avere Systems' FXT filer appliance (a hardware solution). In both instances, abstracting the file share name and using a global namespace allows future seamless data migrations to take place.
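The idea behind a global namespace can be sketched in a few lines. The class below is a toy illustration only: it is not the Distributed File System API or Avere's implementation, and the class name, methods and UNC paths are all hypothetical. It shows why clients that use a stable logical name are unaffected when the share is retargeted to a new filer.

```python
class GlobalNamespace:
    """Maps stable logical share names to the filer currently hosting them."""

    def __init__(self) -> None:
        self._targets: dict[str, str] = {}

    def publish(self, logical_path: str, physical_path: str) -> None:
        """Register a logical name that clients will use from now on."""
        self._targets[logical_path.lower()] = physical_path

    def resolve(self, logical_path: str) -> str:
        """Return the physical location currently behind a logical name."""
        return self._targets[logical_path.lower()]

    def retarget(self, logical_path: str, new_physical_path: str) -> str:
        """Point a logical name at a new filer; return the previous target."""
        key = logical_path.lower()
        previous = self._targets[key]
        self._targets[key] = new_physical_path
        return previous


if __name__ == "__main__":
    ns = GlobalNamespace()
    # Hypothetical names: clients only ever see the logical path.
    ns.publish(r"\\corp.example.com\shares\finance", r"\\filer-old\finance")
    print(ns.resolve(r"\\corp.example.com\shares\finance"))
    # After the data has been copied, switch the target in one step.
    ns.retarget(r"\\corp.example.com\shares\finance", r"\\filer-new\finance")
    print(ns.resolve(r"\\corp.example.com\shares\finance"))
```

Because the indirection exists before the migration, the cutover is a metadata change rather than a remapping exercise on every client.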
Migrating object stores
Object storage is gaining popularity as a repository for archive information and binary large objects (BLOBs) such as media and medical imaging. Data is moved in and out of object stores using REST-based application programming interfaces and, in most cases, the application tracks the location of the items it adds to the object store. This means that if data is moved between object stores, the application needs to know about it. Products such as Cleversafe's object store use data dispersal methods (such as erasure coding) to store objects across multiple hardware components. One benefit of erasure coding is the ability to reconstruct data across physical nodes over distance, allowing components of the object store to be replaced or physically relocated without data interruption.
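To show why erasure coding lets a node be replaced or relocated without interrupting access, here is a deliberately simplified sketch using a single XOR parity slice. It tolerates the loss of only one slice. Production dispersal schemes typically use stronger codes (Reed-Solomon, for example) that survive multiple losses, and nothing here represents Cleversafe's actual algorithm; the function names are hypothetical.

```python
from functools import reduce
from typing import Optional


def xor_bytes(a: bytes, b: bytes) -> bytes:
    """XOR two equal-length byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))


def encode(data: bytes, k: int) -> list[bytes]:
    """Split data into k equal-size slices (zero-padded) plus one parity slice."""
    slice_len = -(-len(data) // k)  # ceiling division
    padded = data.ljust(slice_len * k, b"\0")
    slices = [padded[i * slice_len:(i + 1) * slice_len] for i in range(k)]
    return slices + [reduce(xor_bytes, slices)]


def reconstruct(slices: list[Optional[bytes]]) -> list[bytes]:
    """Rebuild at most one missing slice (marked None) from the survivors."""
    missing = [i for i, s in enumerate(slices) if s is None]
    if len(missing) > 1:
        raise ValueError("single parity can recover only one lost slice")
    repaired = list(slices)
    if missing:
        survivors = [s for s in slices if s is not None]
        repaired[missing[0]] = reduce(xor_bytes, survivors)
    return repaired


def decode(slices: list[bytes], original_len: int) -> bytes:
    """Join the data slices (dropping parity) and strip the padding."""
    return b"".join(slices[:-1])[:original_len]
```

If each slice lives on a different node, any one node can be taken offline for replacement or relocation and the object is still readable from the remaining slices.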
Migrating data to the public cloud
The use of public storage clouds is increasing and that proliferation brings with it the need to provide facilities to move data into cloud provider environments. There are a number of ways this may be achieved, including moving entire VMs into the cloud or moving data at the block or file level.
VM import is provided as a feature by many cloud providers. For example, Amazon Web Services' VM Import feature allows VM images to be imported from existing vSphere, Hyper-V and Citrix hypervisors. Unfortunately, there are many restrictions on the types of VMs that may be imported; as a result, it may be more practical to consider simply importing the data into a fresh VM.
Zerto's suite of BC/DR replication products allows entire VMs to be migrated into cloud environments. This feature could be used to move entire VMs as a migration task, rather than simply for backup.
File-based data can be migrated into cloud providers using gateway appliances such as those from Nasuni and Avere Systems. Both platforms abstract the presentation of file data, allowing the back-end storage to be managed by the appliance. With Nasuni, the gateway appliance can be a physical server or VM, both of which can take the identity of a virtual file appliance, providing resilience and the ability to restore access to data with minimum disruption. Avere Systems' appliance allows data to be both moved and replicated between local storage and a cloud provider, enabling mirroring and data mobility functionality.
Microsoft's StorSimple platform allows block-based data to be moved to its Azure cloud storage platform, expanding the capacity of local storage resources. Nasuni provides similar functionality, as does AWS with its Storage Gateway. However, only Amazon provides the ability to access that data as block devices within its cloud computing platform.
Scale-out storage and big data
Cloud and traditional storage aren't the only data storage platforms these days. We're seeing the emergence of scale-out storage solutions and data repositories (sometimes known as data lakes) for storing large quantities of data.
Open source platforms such as Ceph and Gluster provide scale-out file and block capabilities and are maturing to a level where data migration will be relatively easy to achieve. Scale-out storage solutions from the likes of SolidFire and Nimble Storage enable clusters to grow and shrink on demand.
Hadoop is one of the best-known and most popular big data platforms. It has a built-in tool called DistCp that can be used to copy data between Hadoop clusters. Of course, Hadoop wasn't really built with data mobility in mind, so moving data in and out of a Hadoop cluster isn't a case of simply presenting a file system or a LUN to the user.
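As a rough illustration, a DistCp copy between two clusters is launched from the command line as hadoop distcp with a source and a target URI. The sketch below only builds that command. The NameNode host names, port and paths are hypothetical, and the helper function is an assumption made for illustration rather than part of Hadoop.

```python
import shlex
from typing import Optional


def build_distcp_command(source_uri: str, target_uri: str,
                         update: bool = True,
                         max_maps: Optional[int] = None) -> list[str]:
    """Build the argument list for a DistCp run between two clusters."""
    cmd = ["hadoop", "distcp"]
    if update:
        # -update copies only files that are missing or differ at the target.
        cmd.append("-update")
    if max_maps is not None:
        # -m caps the number of simultaneous map tasks doing the copying.
        cmd += ["-m", str(max_maps)]
    cmd += [source_uri, target_uri]
    return cmd


if __name__ == "__main__":
    # Hypothetical cluster addresses and paths.
    cmd = build_distcp_command("hdfs://nn1.example.com:8020/data/logs",
                               "hdfs://nn2.example.com:8020/data/logs",
                               max_maps=20)
    print(shlex.join(cmd))
```

On a host with a configured Hadoop client, the resulting list could be passed to subprocess.run. Limiting the number of maps is one way to keep the copy from saturating the network between clusters.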
New architectures can equal new migration issues
As Facebook's experience shows, large data lakes can grow to a size where physical data center space is an issue, and moving clusters requires a serious amount of planning and effort. In some respects this takes us back full circle to our original discussion on data mobility and migration of traditional block storage. The issues of data migration have been well described, and solutions are available to do it simply and to reduce costs. However, new storage technologies are still immature when it comes to managing data mobility, and that will be a significant area of focus as those technologies become more popular.