The basic idea of a storage array is straightforward: It's a collection of hard drives aggregated into single or multiple logical volumes that can be written to from a server. The capabilities of these systems have evolved from simple volume management to include advanced enterprise features such as snapshots, data deduplication, compression, tiering and replication. Over time, the software that provides these features has become as important as the hardware it runs on.
That's the premise of software-defined storage (SDS), which has emerged as an effective alternative to traditional storage systems. SDS abstracts those management features from the physical hardware and allows users to select their own combination of hardware and software, effectively creating do-it-yourself data storage arrays.
Software-defined storage can be delivered in several ways and is better understood as a product category than as a specific group of products. This category includes storage virtualization, converged storage and hyper-converged storage, each with its own permutations. We'll cover each member of the SDS category and its permutations, and provide some guidance to help you decide whether an SDS-type solution makes sense for your organization.
Storage virtualization was the first attempt at abstracting storage software from the storage hardware. Initially, storage virtualization products consisted of fairly high-powered hardware appliances that would run the storage software. Logically, the appliance would sit between the servers and the storage with all I/O routing through it. These appliances could work with storage hardware from a variety of vendors, which was pooled and controlled by a common suite of storage management software.
Some of these products could aggregate the individual storage appliances into one logical unit or provide transparent movement among different storage systems. For example, a volume could be non-disruptively migrated between a hard disk-based system and a flash-based system during a peak load period. Three examples of products in this space are DataCore SANsymphony-V, FalconStor Network Storage Server (NSS) and IBM SAN Volume Controller (SVC). There are also other software products, and most major storage vendors offer some type of storage virtualization.
DataCore and FalconStor have both been around for more than a decade, and they both offer software solutions that users can run on the server hardware of their choosing. DataCore, in particular, has a rich feature set that equals or surpasses the capabilities of many of the top-tier storage vendors. Both companies have also evolved their solutions to enable participation with the other SDS types we'll describe.
IBM's SVC product is a dedicated appliance that can be bought separately from IBM; the company also integrates SVC with some of its other storage products, such as the Storwize line of products and the FlashSystem family of all-flash storage arrays. While limiting storage server flexibility, this route does eliminate a support variable; IBM's feature set is comparable to the software products and includes its Real-Time Compression technology. IBM SVC has the unique ability to allow third-party software applications to run on the appliance.
Virtualized storage virtualization
The next logical step for storage virtualization vendors was to allow the software to run as a virtual machine (VM) in a server virtualization environment. This level of virtualization removed the need to acquire separate hardware, but wasn't quite on par with converged or hyper-converged storage. The virtualized software could typically provide shared access to storage directly attached to the server host on which the VM was running, or add features to shared data storage arrays elsewhere in the storage infrastructure.
Converged or hyper-converged storage seems to be the form of SDS attracting the most attention. This type of SDS leverages the internal storage of a number of the servers in a virtual cluster. It's like storage virtualization that provides storage features such as volume management, snapshots, caching, deduplication, compression and replication. Many companies offer this type of solution, including VMware, Atlantis Computing, StarWind Software and Maxta.
All these products promise to lower the cost of storage dramatically because they can leverage the internal storage in the physical servers making up the virtual cluster. The vendors also claim their products will simplify the storage infrastructure because, by creating a server-side storage network, they require neither a dedicated storage network nor shared storage arrays. One of the core differentiators among these solutions is how the storage is shared, and how the data on the storage is protected.
Replication and witness data protection
The first form of converged/hyper-converged storage leverages a "replication and witness" model for data protection and data sharing, both of which are critical in a virtualized environment. With this technique, the data on a VM is kept 100% intact on the server on which it resides and is then replicated to a user-defined number of other servers, typically three, with the third copy serving as the "witness." This allows the VM to be migrated to either of the two full-replica servers without losing touch with its data. If one of the physical servers fails, or its internal drives fail, the VM can be restarted on one of the surviving nodes.
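The placement logic described above can be sketched in a few lines of Python. This is an illustrative model only, assuming a user-configurable replica count; the `Cluster` class, its methods and the node names are hypothetical and don't represent any vendor's API.

```python
# Sketch of "replication and witness" data placement (hypothetical API).

class Cluster:
    def __init__(self, nodes):
        self.nodes = nodes     # physical servers in the virtual cluster
        self.placement = {}    # vm_id -> (local node, full replicas, witness)

    def place(self, vm_id, local_node, replica_targets=3):
        """Keep the VM's data 100% intact on its local node, then replicate
        it to replica_targets other nodes; the last target holds only the
        lightweight witness copy used to arbitrate failures."""
        others = [n for n in self.nodes if n != local_node]
        targets = others[:replica_targets]
        full_replicas, witness = targets[:-1], targets[-1]
        self.placement[vm_id] = (local_node, full_replicas, witness)
        return self.placement[vm_id]

    def failover_targets(self, vm_id):
        """The VM can migrate to, or restart on, any full-replica node."""
        _, full_replicas, _ = self.placement[vm_id]
        return full_replicas
```

With four nodes and the default of three replica targets, a VM placed on the first node keeps full copies on two others, with the fourth acting as witness; those two full-replica nodes are its failover targets.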
One benefit of this technique is that it's inherently simple; there are no complex RAID calculations to be made by servers that are also responsible for running VMs. It also means that all read operations don't require network I/O, because they all come from the local physical server. Some of the solutions in this category will leverage a flash storage area in each node for active data. In those cases, data is read directly from flash and, because it doesn't have to come across the network, the read suffers almost no latency.
The downside of the replication/witness model is that it increases storage capacity requirements roughly threefold. While some of that additional cost is offset by the savings that internal server-class storage provides, not all data centers will be able to handle a threefold increase in capacity requirements. Another challenge is that the network connecting these servers is critical to the operation, yet it isn't a purpose-built storage network. Because it must nonetheless transport storage traffic, it has to be finely tuned for that task.
An alternative technique is to aggregate the internal storage within the clustered servers so a virtual shared volume can be presented to the connecting hosts. With that architecture, a VM's data is striped across all the drives within the cluster and parity data is generated so that a single drive or node failure won't result in data loss. This form of data dispersal is known as erasure coding, and is similar to RAID 5 and RAID 6 in terms of data protection. This technique has the advantage of being more efficient with regard to storage capacity requirements, but it's not as efficient with network utilization since every read and write has to come across the network from the aggregated pool of storage. It also allows a VM to be moved to any physical server on the network, not just the pre-designated targets allowed by the replication model.
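The recovery principle behind single-parity erasure coding can be shown with a minimal sketch. This assumes equal-size byte-string chunks striped across nodes and uses simple XOR parity, RAID 5 style; production SDS products use far more sophisticated codes, but the math for surviving one failure is the same.

```python
# Minimal single-parity erasure coding sketch (RAID 5-style XOR parity).
from functools import reduce

def xor(a, b):
    """Bitwise XOR of two equal-length byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))

def encode(chunks):
    """Return the data chunks plus one parity chunk; the stripe can then
    survive the loss of any single chunk (drive or node)."""
    return chunks + [reduce(xor, chunks)]

def reconstruct(stripe, lost_index):
    """Rebuild the lost chunk by XOR-ing every surviving chunk together."""
    survivors = [c for i, c in enumerate(stripe) if i != lost_index]
    return reduce(xor, survivors)

# A 4-node stripe: three data chunks, plus parity on the fourth node.
stripe = encode([b"aaaa", b"bbbb", b"cccc"])
```

Note the capacity advantage over replication: this 3-data-plus-1-parity layout uses 1.33x the raw data size, versus roughly 3x for the replication/witness model.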
Blended software-defined storage
A final form of converged/hyper-converged SDS leverages a blend of techniques. In this design, a VM's data is written 100% intact to the internal storage of the physical server it resides on. Then it's also dispersed to an aggregated volume created from the internal drives of the physical servers in the cluster. Often, the internal drive that stores the intact copy of the VM is a flash drive, and the aggregated volume is built from hard disk drives. Reads come from this local drive, and writes go to both the local drive and the aggregated volume. This design allows for better data efficiency than the replication/witness model described above and also reduces the demand on the network during read operations compared to the aggregated model.
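The blended read/write path can be sketched as follows. Everything here is hypothetical naming for illustration: `AggregatedVolume` stands in for the cluster-wide dispersed pool, and the dictionaries stand in for real flash and disk devices.

```python
# Hedged sketch of the blended write/read path (hypothetical classes).

class AggregatedVolume:
    """Stand-in for the cluster-wide dispersed pool built from the
    internal hard drives of every server in the cluster."""
    def __init__(self):
        self.chunks = {}
    def disperse(self, block_id, data):
        self.chunks[block_id] = data   # would be erasure-coded across nodes
    def gather(self, block_id):
        return self.chunks[block_id]   # would reassemble chunks over the network

class BlendedStore:
    def __init__(self, aggregated_volume):
        self.local_flash = {}               # intact copy on the VM's own server
        self.aggregated = aggregated_volume

    def write(self, block_id, data):
        # Writes go to both tiers: local flash plus the protected pool.
        self.local_flash[block_id] = data
        self.aggregated.disperse(block_id, data)

    def read(self, block_id):
        # Reads prefer local flash, avoiding network I/O entirely.
        if block_id in self.local_flash:
            return self.local_flash[block_id]
        return self.aggregated.gather(block_id)  # fallback after local failure
```

The design choice to mirror every write locally is what lets reads skip the network; the aggregated copy exists purely for protection and for serving a VM after it moves or its local drive fails.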
The converged downside: Unpredictability
While converged SDS is very appealing, it introduces a unique downside that IT planners should consider: unpredictability. The advantage storage virtualization and legacy shared storage have is that they dedicate CPU power to the storage software. In the converged model, the CPU is also used to run the hypervisor and the VMs that support the applications consuming the software-defined storage. There could be situations where, under peak loads, contention arises between the applications and the storage software, causing both to experience poor performance.
Choosing your SDS
Before deciding which SDS method is right for your data center, you should determine if SDS in any form is a good fit for your data center. Despite SDS vendors' claims of reduced costs and simplified management, dedicated, shared storage devices continue to be the dominant choice for data centers.
However, most shared storage products also run on off-the-shelf hardware, similar to what you might choose for your SDS environment. So why are they dominant? First, they have the value of being turnkey solutions: an IT planner doesn't have to evaluate both storage software and storage hardware. Second, they typically bring some unique features that may not be available from SDS vendors. Third, many IT professionals find that the combined cost of an SDS solution doesn't end up delivering significant savings over a turnkey approach.
Storage virtualization is the simplest form of SDS to grasp. The design is similar to legacy shared storage, but the storage controller responsibility has been abstracted from the legacy storage system and placed on a dedicated appliance. This approach should have immediate appeal in environments with multiple storage systems from multiple vendors. The ability to add a common interface and feature set could save significant operational dollars and provide flexibility in future purchases. Additional storage purchases can be made solely on the basis of the hardware's capabilities, without having to factor in the bundled software.
For a converged/hyper-converged SDS product to be successful, it has to provide performance comparable to or better than legacy shared storage, at a lower price. Most SDS solutions can provide similar features and, with proper tuning, better performance, especially in designs where data access doesn't require a network transfer. There is the very real issue of a storage professional having to allocate time to evaluate the flash and hard drive storage that goes into these designs, as well as the time required to properly tune the network for this type of I/O. And, once again, there's the issue of performance predictability.
If the appropriate time can be allocated to the selection process and the design of the server network, then a converged software-defined solution should be able to deliver a significant cost savings. But it's important to note that some of those savings should probably be invested in more powerful server CPUs and more server RAM to compensate for potential predictability problems.
About the author:
George Crump is president of Storage Switzerland, an IT analyst firm focused on storage and virtualization.