Published: 10 Feb 2017
Primary data storage and primary workloads are well understood by now -- their secondary counterparts, not so much. Consequently, the secondary side has been slow to benefit from the many architectural advances realized on the primary side over the last few years.
By convention, primary data storage directly interacts with an application when reading and writing data. Primary workloads, also by convention, are applications that directly interact with primary storage and often, but not always, support end users. Tier 1 or tier 2 applications such as ERP, customer relationship management, Microsoft Exchange and SQL Server, and so on are all considered primary applications. On the secondary side, we typically think of data protection, archiving, replication, data deduplication, compression, encryption, analytics and test/dev as individual secondary applications that work on and with data created by primary applications and stored in secondary storage.
The problem is this secondary world has been disparate, confusing and chaotic, made up of a large number of vendors who often solve only a single piece of the puzzle. In fact, it's only recently that the industry's started to see all these different workloads holistically through a common lens of "secondary" storage applications.
These definitions aren't definitive -- how could they be? -- and the lines between primary and secondary storage and applications remain somewhat blurry. But they do serve as a basis for further discussion.
It all starts with primary storage
As with primary data storage, secondary storage is starting to undergo a tectonic shift. I'll summarize what's happening with primary storage first, as it applies to the revolution beginning with secondary storage.
The chaos created by silos of primary storage and their impact on cost, time and application performance has been well-documented. Thankfully, we've seen advances on this side brought about by using principles of virtualization; scale-out; convergence; and, more recently, hyper-convergence.
Virtualization abstracted compute hardware, changing the way we provision and manage compute resources. Convergence combined compute, storage, networks and server virtualization in a more integrated fashion to make the infrastructure stack easier to buy, provision and manage. And hyper-convergence flipped the industry on its ear by essentially melding compute, storage and storage networking (SAN, NAS) into a single entity and scale-out architecture.
Because hyper-convergence reduces so many of the pressure points IT has been struggling with over the past three decades, its growth shows no end in sight. Suffice it to say, hyper-convergence reduces or eliminates the burden of provisioning, fine-tuning, managing and arbitrating compute and storage resources, and will soon do the same for networking.
The question is ...
If hyper-convergence is so good for primary data storage and workloads, has it played a comparable role on the secondary-storage side? While there has been plenty of innovation, the overarching answer so far is a loud "no."
At a vision level, however, shouldn't we be able to apply the principles of hyper-convergence, scale-out file systems, virtualization, software-defined storage (SDS) and more to gain some semblance of order in the secondary world, just as we did on the primary side? From the perspective of technology available, the answer seems to be a resounding "yes."
The challenge is huge, though, and one has to wonder if it's possible to develop a comprehensive product for all secondary workloads without starting from scratch. Before we address this, let us draw some boundaries and define what this new architecture for secondary workloads should achieve.
Hyper-converging secondary storage
We at Taneja Group have labeled this new architecture hyper-converged secondary storage. The reason for the name is simple: Products in this category use hyper-converged principles, but do so 100% in service of secondary workloads.
As you read the prerequisites we've developed for the new hyper-converged category of secondary storage, keep in mind that just because a product is missing a specific function today doesn't mean it wouldn't belong in this category. The question you have to ask yourself is, "Is it architected so that this new functionality can be added without fundamentally redoing the product?" While no vendor may currently meet the full definition, some will come close and others may never make it. All of which is just as true of hyper-convergence on the primary-storage side.
In the end, you have to decide if what's missing is important to your requirements or not. Considering this, here are the fundamental requirements for hyper-converged secondary storage:
- The storage must be "infinitely" scalable in a scale-out fashion, using a nodal architecture. Practically speaking, "infinite" means the same as a public cloud does today; i.e., it scales as far as you need to for a given set of applications, without performance dropping or latency increasing.
- The storage can handle multiple workloads with varying performance requirements without manual tuning. Hyper-convergence principles applied on the primary-application side are equally applicable here.
- The storage is software-defined at the core, separates the control plane and data plane and allows the use of most commercial off-the-shelf hardware. With no hardware dependencies, it can run on premises and in the cloud.
- The storage tightly integrates with public or private clouds. More than sending data to the cloud, this means having the ability to manage and protect data once it's in the cloud. It must be able to use the cloud as a tier in a seamless fashion.
- The storage can handle all secondary workloads, today and in the future. Today, that means data protection, archiving, disaster recovery, replication, data migration, deduplication, compression, encryption, test/dev, copy data management (CDM) and analytics.
- The storage supports multiple block, file and object protocols, including iSCSI, Fibre Channel, NFS, SMB and REST at a minimum, and has the ability to store files and objects within the same storage pool.
- The storage is based entirely on policy. Set the policies for a workload at the outset, possibly using a predefined template, and the system manages the entire data lifecycle and workflow thereafter -- including spinning up and tearing down infrastructure resources -- without operator involvement.
- The storage has built-in quality of service (QoS). Since a multitude of secondary applications will, by definition, be running on the infrastructure, the system must allocate resources according to set policies, and there must be a way to ensure compliance.
- The storage can support both physical and virtual workloads, so the primary data source can be either.
- The storage can index metadata and content, and has built-in custom analytics. It has a sophisticated search capability, including the ability to search on data within files. The results are available to applications via standard APIs.
- The storage has one web-based management console for the entire secondary infrastructure and is fully manageable from anywhere there's internet -- designed for a global namespace if the customer so chooses. And it has the ability to integrate with common management platforms such as VMware vRealize Automation.
- The storage has a self-healing architecture that doesn't require failed parts to be replaced immediately. It can withstand multiple disk and node failures without losing resiliency or availability and without requiring data migration, and IT can dial in the level of resiliency required.
- The storage has built-in data virtualization principles to ensure one data copy can serve many application workloads.
- The storage has enterprise-grade security that includes encryption for data at rest and data in motion and access control. It may optionally include external key managers.
- The storage offers recovery point objectives measured in seconds and minutes, with instantaneous recovery time objectives (no rebuilds and no rehydration).
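To make the policy and QoS requirements above a little more concrete, here is a minimal sketch in Python of what a per-workload policy template might look like. It is an illustration under stated assumptions, not any vendor's API: the WorkloadPolicy class, its field names, and the rpo_compliant and allocate_qos helpers are all hypothetical, and a real product would track far more state.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class WorkloadPolicy:
    """Hypothetical policy template; every field name is an assumption."""
    name: str
    rpo: timedelta               # maximum tolerated age of the newest recovery point
    retention: timedelta         # how long recovery points are kept
    cloud_tier_after: timedelta  # age at which data moves to a cloud tier
    qos_share: float             # relative share of cluster resources


def rpo_compliant(policy, last_recovery_point, now):
    """Return True if the newest recovery point is within the policy's RPO."""
    return now - last_recovery_point <= policy.rpo


def allocate_qos(policies, total_iops):
    """Split an IOPS budget across workloads in proportion to their QoS shares."""
    total_share = sum(p.qos_share for p in policies)
    if total_share <= 0:
        raise ValueError("at least one policy needs a positive qos_share")
    return {p.name: int(total_iops * p.qos_share / total_share) for p in policies}


# Illustrative policies for three secondary workloads sharing one cluster.
backup = WorkloadPolicy("backup", timedelta(minutes=5), timedelta(days=30),
                        timedelta(days=7), qos_share=0.5)
testdev = WorkloadPolicy("test-dev", timedelta(hours=1), timedelta(days=14),
                         timedelta(days=3), qos_share=0.25)
analytics = WorkloadPolicy("analytics", timedelta(hours=4), timedelta(days=90),
                           timedelta(days=1), qos_share=0.25)

budget = allocate_qos([backup, testdev, analytics], total_iops=100_000)
```

The point of the sketch is the division of labor: the operator states intent once (recovery point objective, retention, cloud tiering, resource share), and the system both allocates resources from those settings and can report whether it is actually meeting them.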
There is a common misconception in the industry that a hypervisor is a prerequisite for hyper-convergence. Since hyper-convergence as a technology was first used on the primary side and applied to virtual workloads, the need for a hypervisor was evident. But as the industry moves toward containers, that need will disappear. The situation is no different on the secondary side.
As you can see, this is an extensive list of requirements. I believe, though, we are at the point when all the right technologies are available to implement this vision. Several vendors are very close, while others are scrambling to get there in the next 18 months or so, and some -- depending on their current architectures -- may never succeed.
Hyper-ready, or not
At a casual level, many existing vendors will say they meet the above criteria when in fact they don't. So let's examine different categories of secondary products as they exist in the marketplace right now to learn more.
It is a fact of computer science that your architecture, once set, defines what you can or cannot do (effectively) in the future. In that vein, most vendors that started out producing a data protection product, or a data protection and replication product, a few decades ago will likely not have a scale-out nodal structure, virtualization and many other technologies that surfaced in the past five years.
If a vendor started out as a CDM player, for example, it is possible data would need to move to its repository before copies are managed. Data protection may or may not be an integral part of the product (that is, it may require a third-party data protection application) and the product may or may not scale out.
The level of analytics available varies among vendors, with most still focused on storage utilization and data protection metrics. If a vendor only delivers archiving, it is likely a silo unto itself. Cloud integration may simply be a data transfer function. Most data deduplication products are just that: specialized devices for reducing copies of data that work in concert with a data protection application supplied by the same or a different vendor. Scale-out is present in some vendors' products and absent in others.
In the meantime, certain replication products only perform that one function and, hence, create a separate silo of functionality. Most lack true QoS capabilities. And while there may be a service-level agreement and policy-based control plane, there's usually no way to determine compliance. By design, object storage purveyors have scale-out, single namespace, very secure and available platforms (due to erasure coding), but often lack integral file support and don't support all the secondary workloads mentioned above.
Developed over the past three decades, the sweet spots for this arsenal of current secondary products vary all over the place. Most do what they were designed to do, and they do it well. But the world has changed, and data protection alone isn't sufficient. We must rethink the whole paradigm of protection and availability in the realm of massive content, consumerization, compliance, social media, mobility, cloud and big data analytics.
Although it's difficult to fundamentally transform a product architected two decades ago, vendors are desperately trying to do so, sometimes through development and often through partnerships. How they get there, or whether they get there, remains to be seen.
At present, two companies started with visions closest to the one defined in this column. Cohesity meets most of the criteria, whereas Rubrik comes close but is missing some key pieces. CDM players such as Actifio and Catalogic also have much to offer, and will likely be reaching out to fill in the gaps. IBM Spectrum Copy Data Management utilizes Spectrum Protect, and IBM will (most likely) integrate it more tightly with IBM Cloud Object Storage to provide a scale-out capability. And lest we forget, both SimpliVity (recently acquired by Hewlett Packard Enterprise) and Scale Computing have publicly stated visions to apply hyper-convergence to both primary and secondary storage applications in a single infrastructure.
Led by public cloud vendors such as Amazon Web Services, the battle to transform primary data storage over the past decade has driven a staid industry to action. I believe hyper-convergence directly resulted from applying public cloud principles to on-premises infrastructures. I also believe we should apply a fresh approach based on hyper-convergence to the equally staid secondary-storage side of the data center. The need is even greater, as 80% of all data resides in secondary storage.
Either we choose to light secondary storage up and use it for better business decision-making or let it stagnate and become a bigger and murkier swamp. Yesterday's architectures will not get us there. As with primary, a fresh approach based on hyper-convergence is needed.