
The struggle between virtual machine performance and storage

Storage has been painted as the cause of sluggish VM performance, but it might be your hypervisor's fault.


A lot of ink has been spilled by vendors and their evangelists extolling the benefits of various "preferred" solutions for speeding up slow-going virtualized workloads. Truth be told, the entire discussion has taken on a kind of Bizarro World characteristic that has mostly gone unnoticed and undocumented by the trade press. If, like me, you long for fact-based analyses of your IT problems, here's a stab at a discussion of storage in virtual server environments that doesn't conflate, confuse or otherwise draw false causal relationships between disparate data points.

Almost as quickly as server virtualization came into vogue as more than just a useful test-bench tool -- that is, to facilitate the consolidation of servers through single-server multi-tenancy -- we started hearing about the evil impact of legacy storage. Storage, as we've been told repeatedly, is directly responsible for the slower performance of workloads once they're virtualized and instantiated in hypervisor environments.

Originally, a lot of the villainization of data storage focused on known deficits of contemporary storage products and topologies. Evil storage vendors had long insisted on deploying their boxes of disk drives with proprietary controllers hosting proprietary software functionality designed as much to lock in the consumer and lock out the competitor as to deliver any sort of superior capabilities. Combine that with the industry's unwillingness to work and play well together on a common management approach that would enable the infrastructure to be maintained, scaled and configured holistically rather than on a box-by-box basis, and you have all the ingredients for cooking up a flawed and costly infrastructure.

The above points are hardly debatable, of course. Things got so bad in the early 2000s that analysts actually encouraged their clients to source all storage from a single vendor in the hope that homogeneity would enable coherent management -- a linchpin, together with data management, of any cost-containment strategy.

So, we can all agree that unmanageable storage was the root of many evils in IT. It meant oversubscription with underutilization, driving the need for more capacity and bigger Capex spends. And it required more IT personnel with specialized skills in storage architecture and administration to manage the gear and interconnects, so Opex spending was high and to the right.

Is storage really to blame?

While we can agree that these characteristics of storage were undesirable and in need of remedy, they didn't explain the problem of slow-performing virtual machines (VMs). Yet VMware and other hypervisor vendors insisted on drawing a false correlation -- and a false causal relationship -- between virtual machine performance and proprietary storage. That resulted in approaches like VMware's vaunted vStorage APIs for Array Integration and, more recently, its "software-defined storage" play, Virtual SAN, which continue to address the problem of slow VMs by attacking evil storage.

There may be some instances where storage latency -- the speeds and feeds of storage devices and of the networks or fabrics that connect them with servers -- can slow application performance. This is well understood and typically addressed through a combination of caching and parallelism: the former collects writes at a fast storage layer, making the slower storage invisible to the application, while the latter increases the number of actuators working a task (such as I/O processing) so more work gets done in less time. We try these strategies after we've determined that application I/O is hitting a logjam somewhere on the path to storage -- the combination of software (APIs, command languages and protocols) and hardware (host bus adapters, cables, switch ports and device connections) that connects the app to its stored data. A simple indicator of a problem in this path is a greater-than-expected I/O queue depth: a backlog of operations waiting to be serviced by the storage device.
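To make the caching-and-parallelism point concrete, here's a minimal sketch in Python -- a toy, not any vendor's implementation -- in which writes are absorbed by a fast in-memory layer and a pool of worker threads drains them to the slow device in parallel. The depth of that buffer is exactly the logjam indicator described above.

```python
import queue
import threading
import time

class WriteBackCache:
    """Toy write-back cache: absorb writes in fast memory,
    then flush them to slow storage with several parallel workers."""

    def __init__(self, backing_store, workers=4):
        self.backing_store = backing_store      # callable(block_id, data); the slow device
        self.pending = queue.Queue()            # the "fast layer" that absorbs writes
        for _ in range(workers):                # parallelism: more actuators draining the queue
            threading.Thread(target=self._flush_loop, daemon=True).start()

    def write(self, block_id, data):
        # Returns immediately; the slow device is invisible to the caller.
        self.pending.put((block_id, data))

    def _flush_loop(self):
        while True:
            block_id, data = self.pending.get()
            self.backing_store(block_id, data)  # the slow, high-latency write
            self.pending.task_done()

    def queue_depth(self):
        # A steadily growing depth here is the "logjam" indicator discussed above.
        return self.pending.qsize()


def slow_disk_write(block_id, data):
    time.sleep(0.01)                            # simulate a slow spinning-disk write

if __name__ == "__main__":
    cache = WriteBackCache(slow_disk_write, workers=8)
    for i in range(1000):
        cache.write(i, b"x" * 4096)             # the application sees sub-millisecond writes
    print("queue depth right after the burst:", cache.queue_depth())
    cache.pending.join()                        # wait for the workers to drain the backlog
    print("queue depth after draining:", cache.queue_depth())
```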

Only, with most of the slow VM performance I've encountered at client sites and in our labs, storage queue depths are pretty shallow. The problem of slow VMs, in other words, cannot logically be laid at the doorstep of storage latency. At the same time, in most of these situations, the rate of processor cycling on the server hosting these slow-performing VMs tends to be extraordinarily high. When processors run hot like that, it usually indicates a struggle to resolve a problem that exists in the application inside the VM or in the hypervisor software itself. In short, the logjam exists above the layer of the storage I/O path.

Your hypervisor could be the bottleneck

So, evil, proprietary legacy storage may not be to blame for your slow VMs. It's likely the problem is your hypervisor or virtualized application. That raises the question of why you would want to unplug all your existing storage -- in many cases an infrastructure you've spent considerable energy and budget consolidating into a SAN or carefully deploying as NAS file servers in your network -- only to replace it with direct-attached JBODs operating under your hypervisor vendor's latest software controller kit.

The I/O blender effect is real. When you stack a bunch of VMs and simply let them run wild, blending lots of random I/O into an incoherent mess of writes that will, in short order, burn out your flash and clutter up your disk, you will likely have VM slowdowns. Again, this isn't a problem with storage per se, but with your hypervisor strategy. Assuming you want to stick with your hypervisor, you may want to consider a more efficient way to organize and write data -- whether using a log-structuring approach from a vendor like StarWind Software, an efficient software controller from PernixData or an uber-controller from DataCore Software -- that doesn't require you to change your underlying storage hardware infrastructure.
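For what it's worth, here's a bare-bones sketch of the log-structuring idea in Python -- a toy illustration, not StarWind's, PernixData's or DataCore's actual code: every write, however random its logical address, is appended sequentially to a single log, and a small in-memory index remembers where the newest copy of each block lives.

```python
import os

class LogStructuredStore:
    """Toy log-structured block store: random logical writes become
    sequential appends to one log file, indexed in memory."""

    BLOCK = 4096

    def __init__(self, path):
        self.log = open(path, "wb+")    # truncates on each run; fine for a demo
        self.index = {}                 # logical block id -> offset of newest copy in the log

    def write(self, block_id, data):
        data = data.ljust(self.BLOCK, b"\0")[: self.BLOCK]
        self.log.seek(0, os.SEEK_END)
        offset = self.log.tell()
        self.log.write(data)            # sequential append; no random seek on the device
        self.index[block_id] = offset   # remember where the latest version lives

    def read(self, block_id):
        offset = self.index.get(block_id)
        if offset is None:
            return None
        self.log.seek(offset)
        return self.log.read(self.BLOCK)

if __name__ == "__main__":
    store = LogStructuredStore("blender.log")
    # A "blender" of random writes from many VMs...
    for block_id in (907, 12, 443, 12, 88):
        store.write(block_id, f"vm data for block {block_id}".encode())
    # ...still lands on disk as one sequential stream; reads follow the index.
    print(store.read(12)[:24])
```

The point is simply that the random-write blender can be tamed in software, above the storage layer, without touching the arrays underneath.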

As for ripping and replacing your legacy storage for a server-side approach, the choice is yours. But be clear on why you're doing it. My research has discovered little supporting data to suggest that displacing legacy storage with DAS will do anything to address slow virtual machine performance. Do yourself a favor and use some simple meters, available on every operating system, to examine processor activity and queue depth before you settle on a remediation strategy.
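On Linux, for example, those simple meters amount to a couple of files in /proc. The sketch below -- which assumes a device named sda and the standard /proc layout; iostat, perfmon or esxtop will give you the same numbers more conveniently -- samples CPU busy time and the count of in-flight I/Os so you can see which side of the I/O path is actually working hard.

```python
import time

def cpu_times():
    # First line of /proc/stat: aggregate jiffies per CPU state (Linux only).
    with open("/proc/stat") as f:
        fields = [int(x) for x in f.readline().split()[1:]]
    idle = fields[3] + fields[4]        # idle + iowait
    return idle, sum(fields)

def inflight_ios(device="sda"):
    # Field 9 after the device name in /proc/diskstats: I/Os currently in progress.
    with open("/proc/diskstats") as f:
        for line in f:
            parts = line.split()
            if parts[2] == device:
                return int(parts[11])
    return None                         # device not present on this host

if __name__ == "__main__":
    idle0, total0 = cpu_times()
    time.sleep(1)
    idle1, total1 = cpu_times()
    busy = 100.0 * (1 - (idle1 - idle0) / (total1 - total0))
    depth = inflight_ios("sda")
    print(f"CPU busy: {busy:.1f}%   in-flight I/Os on sda: {depth}")
    # Rule of thumb from the discussion above: shallow queue plus hot CPU points
    # above the storage I/O path (hypervisor or guest), not at the storage itself.
```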

BIO: Jon William Toigo is a 30-year IT veteran, CEO and managing principal of Toigo Partners International, and chairman of the Data Management Institute.

This was last published in March 2015



Join the conversation

2 comments


Full disclosure: Yes, I work for Tintri.

What's needed is insight and awareness throughout the infrastructure in a hypervisor-based architecture, and the ability to act on it in real time. How about a platform that provides multi-hypervisor shared storage, predicts and reacts on a per-VM basis, and gives an application a guaranteed performance SLA? Sound right?
Regards, john@tintri
I noticed a serious performance difference in I/O-bound applications running in VMware on a machine, even with SSDs. It seems to be getting better over time, but it is a problem, and trying to run off a storage area network just adds network propagation delay. My advice for now is to run as much as you can in RAM and on SSD.
