Server performance can be a moving target. Whether you're running physical servers or virtual machines, there are a number of places where data can be held up and, accordingly, there are different ways to address each type of bottleneck. It doesn't really matter whether you're dealing with a single physical server running one OS, or a big server supporting a hypervisor and multiple VMs.
There are only a few places where you're likely to find bottlenecks, and in each place, you'll find cache, ingestion of data, transport, writing of data, reading of data, and so on. Each of these can be optimized, with more cache, faster ingestion, faster transport or faster storage devices. But fixing one will often reveal limitations in another part of the process; add flash for faster storage reads and writes, and you're apt to find the limitations in your storage network. Add a faster storage network, and your caching could become the performance culprit.
Multiple VMs can cause multiple problems
When you're dealing with multiple VMs, you need to be aware of some special cases related to actions that take place on several VMs simultaneously. Updates, backups or just random I/O can all have their effects multiplied dramatically when they're occurring across many VMs at the same time, producing I/O storms. Addressing these bottlenecks may be as simple as scheduling backups so that they start sequentially rather than at the same time on many VMs, using a backup agent that works through the hypervisor rather than each individual VM, or scheduling updates at staggered intervals.
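As a rough illustration of staggered scheduling, the sketch below assigns each VM its own backup start time rather than launching them all at once. The VM names, the window start and the 20-minute gap are made-up values, and a real deployment would feed the result into whatever scheduler or backup product is in use.

```python
from datetime import datetime, timedelta

def staggered_starts(vm_names, window_start, gap_minutes):
    """Give each VM a start time spaced gap_minutes after the previous one."""
    gap = timedelta(minutes=gap_minutes)
    return {name: window_start + i * gap for i, name in enumerate(vm_names)}

# Hypothetical VM names and backup window, for illustration only.
schedule = staggered_starts(
    ["vm-web01", "vm-db01", "vm-app01"],
    datetime(2024, 1, 1, 1, 0),
    gap_minutes=20,
)

for name, start in schedule.items():
    print(name, start.strftime("%H:%M"))
```

The gap should be at least as long as a typical backup takes on one VM; otherwise the jobs overlap and the storm is only softened, not avoided.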
There's a special kind of I/O storm that requires more effort to identify and a more sophisticated technique to solve: the I/O blender effect. While individual VMs and their OSes may optimize storage traffic by queuing reads and writes, when there are dozens of VMs all reading and writing simultaneously, the resulting traffic tends to become very random and difficult to optimize. There are specialized products, however, that can effectively address the I/O blender effect.
Let's look at how to track down and identify bottlenecks and performance-robbing configuration issues, including what tools to use, matching apps to specific storage environments, determining if hardware or software upgrades are in order, and identifying and alleviating data storage problems.
General principles of troubleshooting
At the most basic level, troubleshooting data storage problems involves breaking a system down into parts, isolating a problem to one part, and then drilling down until you put your finger on the actual problem. Depending on whether the problem is sudden or a growing trend, specific tools such as logging or fault alerts can give you advance warning, and more useful data than just a user's complaint that an application is slow. For experienced administrators, the process may happen at an intuitive level; for example, knowing that a problem affects multiple users automatically eliminates some possibilities, but that intuition can also cause them to overlook uncommon problems such as a cable with an intermittent fault. That's when troubleshooting tools can save the day.
Troubleshooting and management tools
The actual tools you use will vary depending on the applications and brands of storage. Both VMware and Microsoft have basic tools built in and add-ons that can be very helpful, such as VMware's vSAN and vCenter, or Microsoft's Systems Management Server (SMS). Many storage systems either include troubleshooting tools in their toolkits or offer them as add-ons. There are also many good third-party tools that can help in more heterogeneous environments, such as SolarWinds Storage Resource Monitor or Dell's Foglight for Storage Management.
In addition to monitoring and recording performance data, and sending alerts if a metric such as latency or utilization on a specific link exceeds a set threshold, these applications can record historical data over time, which can be very helpful in identifying trends and showing whether these data storage problems have been steadily growing or are new issues.
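To make the distinction between a growing trend and a new issue concrete, here is a minimal sketch that works on a list of daily latency readings. The sample values, the 6.8 ms limit and the `min_slope` cutoff are invented for illustration; commercial monitoring tools use their own, more sophisticated methods.

```python
from statistics import mean

def trend_slope(samples):
    """Least-squares slope of the samples against their position (units per sample)."""
    n = len(samples)
    x_mean = (n - 1) / 2
    y_mean = mean(samples)
    num = sum((i - x_mean) * (y - y_mean) for i, y in enumerate(samples))
    den = sum((i - x_mean) ** 2 for i in range(n))
    return num / den

def describe(samples, limit, min_slope=0.1):
    """Classify the latest reading using the history that came before it."""
    if samples[-1] <= limit:
        return "ok"
    earlier = samples[:-1]
    return "growing trend" if trend_slope(earlier) >= min_slope else "sudden change"

# Hypothetical daily average latencies in milliseconds.
steady_growth = [4.0, 4.5, 5.0, 5.5, 6.0, 6.5, 7.0]
sudden_jump = [4.0, 4.0, 4.0, 4.0, 4.0, 4.0, 12.0]

print(describe(steady_growth, limit=6.8))
print(describe(sudden_jump, limit=6.8))
```

Both series end above the limit, but only the first has a history that was already climbing, which is the kind of context a single alert cannot provide.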
Cache uses faster storage to respond to requests for data. The cache may be in the form of volatile or non-volatile solid-state memory, or even disk or tape. There may be a dozen types of cache in the data path from an application to where the data resides: three types of cache in the CPU, one on the host bus adapter (HBA) to the storage or the network, another on the storage system's RAID controller, several on the storage controller, and possibly several levels of auto-tiering or caching on the storage, plus more on the individual disks that make up the storage.
This may seem trivial since many of these are automatically set up, but optimizing an application and its data so that the data is served from cache can make an enormous difference in the overall speed of the application. For example, if the battery backup on the RAM cache on a RAID controller dies, the RAID controller will generally drop to a safe protocol that serves data from disk only, since a power outage could cause corruption of data stored in the RAM cache. This can cause a 10X degradation in response times from the storage for both reads and writes.
Tuning storage systems so most data requests are served from cache can also make a huge difference. If the amount of data used by a system is too large for the cache, or if the data is very random, performance can drop substantially. Your storage system should be able to tell you the percentage of data served from cache. If that figure is below 90%, the fix may be as simple as adding RAM to the storage controller. Doubling the RAM in a storage controller from 32 GB to 64 GB could result in a 10X improvement in latency.
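The arithmetic behind the cache hit percentage can be sketched as a weighted average of cache latency and back-end latency. The 0.1 ms and 10 ms figures below are assumptions chosen only to show the shape of the curve, not measurements from any particular system.

```python
def hit_ratio(cache_hits, total_requests):
    """Fraction of requests answered from cache."""
    return cache_hits / total_requests

def effective_latency_ms(ratio, cache_ms, backend_ms):
    """Average latency when `ratio` of requests hit cache and the rest go to disk."""
    return ratio * cache_ms + (1.0 - ratio) * backend_ms

CACHE_MS = 0.1     # assumed latency of a cache hit
BACKEND_MS = 10.0  # assumed latency of a request served from disk

for hits in (80_000, 90_000, 99_000):
    r = hit_ratio(hits, 100_000)
    avg = effective_latency_ms(r, CACHE_MS, BACKEND_MS)
    print(f"{r:.0%} from cache -> {avg:.2f} ms average")
```

With these assumed numbers, the misses dominate the average, so moving from 90% to 99% served from cache cuts average latency by roughly a factor of five.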
Beating the blender effect
Whether your virtualized server environment is supported by traditional shared networked storage, storage designed specifically for VMs or software-defined or hyper-converged systems, there are performance issues specific to virtualization. These primarily revolve around multiple guest OSes in a single server creating an I/O blender because hypervisors typically aren't set up to aggregate I/Os from the many different operating systems running at once. While a single server OS can put data requests in order so that nearby or adjacent blocks of data are pulled down as needed, when a dozen or more OSes are doing this, the effect becomes very random, and can overwhelm caching systems not designed to cope with it.
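The blender effect is easy to reproduce in a toy model. The sketch below assumes each VM reads its own region of the disk strictly in order, then interleaves a dozen such streams at random, the way a hypervisor might pass them along without coordination. The block numbers, region size and the "next block follows the previous one" measure are simplifications made for illustration.

```python
import random

def sequential_fraction(blocks):
    """Share of requests that target the block right after the previous request."""
    hits = sum(1 for prev, cur in zip(blocks, blocks[1:]) if cur == prev + 1)
    return hits / (len(blocks) - 1)

def vm_stream(vm_index, length, region=1_000_000):
    """One VM reading `length` consecutive blocks from its own disk region."""
    start = vm_index * region
    return list(range(start, start + length))

def blend(streams, seed=0):
    """Interleave streams at random while keeping each VM's own order intact."""
    rng = random.Random(seed)
    cursors = [0] * len(streams)
    pending = [i for i, s in enumerate(streams) if s]
    merged = []
    while pending:
        i = rng.choice(pending)
        merged.append(streams[i][cursors[i]])
        cursors[i] += 1
        if cursors[i] == len(streams[i]):
            pending.remove(i)
    return merged

single = vm_stream(0, 1000)
blended = blend([vm_stream(i, 1000) for i in range(12)])

print(f"one VM:  {sequential_fraction(single):.0%} sequential")
print(f"12 VMs:  {sequential_fraction(blended):.0%} sequential")
```

Every individual stream is perfectly sequential, yet the storage system sees a mix in which only a small fraction of requests follow on from the previous one, which is what defeats read-ahead and caching logic that expects ordered access.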
There are third-party products that can alleviate the I/O blender effect, such as Atlantis Computing's ILIO and Infinio's Accelerator. These products use some of the RAM in the hypervisor to create a RAM disk to accelerate data requests, and coordinate data from multiple guest OSes so that the storage system sees a more easily cached set of requests. VMware's vSAN and Microsoft's Hyper-V 2016 are also adding similar features.
In addition to caching and the speed and type of disks in use, the network -- whether Fibre Channel, iSCSI or FCoE -- is the third leg of storage performance. Bumping up the speed of the network is an obvious thing to try if performance begins to lag, but it's important to look at performance characteristics such as network saturation and throughput before doing something as radical as upgrading a network.
For example, a big storage system was configured with mostly hard disk drives (HDDs) and performance was lagging. The administrators upgraded the network from 1 Gbps to 10 Gbps Ethernet, but there was very little difference in performance after the upgrade. They then replaced the HDDs with solid-state drives (SSDs) and saw a huge increase in performance. The 10 Gbps network enabled them to get the most out of the faster SSD storage.
If they had replaced the HDDs with SSDs first, it probably wouldn't have made much of a difference either, since the much faster SSD storage performance would have been masked by the latency and throughput of the 1 Gbps Ethernet. Once those two areas were upgraded, the focus shifted to a third area, the ability of the server running the applications to ingest data. Server utilization rose from about 10% to nearly 70%, indicating that a CPU and RAM upgrade would likely be the next bottleneck remedy as the utilization of the application continued to grow.
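The upgrade story can be modeled, very roughly, by treating the data path as a chain whose throughput is set by its slowest stage. The network figures below are raw line rates (1 Gbps is about 125 MB/s, ignoring protocol overhead); the HDD and SSD array figures are assumed purely for illustration.

```python
def bottleneck(stages):
    """Return (stage_name, throughput) for the slowest stage in the path."""
    name = min(stages, key=stages.get)
    return name, stages[name]

# Throughput in MB/s. Storage figures are hypothetical.
NET_1G, NET_10G = 125, 1250
HDD_ARRAY, SSD_ARRAY = 110, 1000

scenarios = {
    "original (1 Gbps + HDD)": {"network": NET_1G, "storage": HDD_ARRAY},
    "network upgraded only": {"network": NET_10G, "storage": HDD_ARRAY},
    "SSD upgraded only": {"network": NET_1G, "storage": SSD_ARRAY},
    "both upgraded": {"network": NET_10G, "storage": SSD_ARRAY},
}

for label, stages in scenarios.items():
    name, rate = bottleneck(stages)
    print(f"{label}: limited by {name} at {rate} MB/s")
```

Under these assumptions, either upgrade on its own leaves throughput close to where it started, and only the combination delivers a large gain. This simple model ignores latency and server-side limits, which is where the example above found its next bottleneck.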
VMs and storage in the cloud
Cloud storage was originally focused mainly on backups and off-line storage. However, many organizations may find that they have one or more blocks of primary storage in the cloud. Storage for VMs located in a cloud such as AWS or Microsoft Azure, for instance, will typically be in the same cloud. Since many of the big cloud vendors have very high-speed, low-latency connections to other vendors, it's entirely possible that an app in one cloud, set up by one app dev group, might be configured to connect to storage originally intended as local storage for an app created by a different group in another cloud. Some organizations are even using caching appliances to serve data residing in the cloud as if it were local in the data center. It's very simple to add data sprawl to VM sprawl.
The end result of the ease with which data can be moved around is that it can become increasingly difficult to manage, both in terms of determining which copy of a given file is the most up to date and also for troubleshooting connectivity and performance issues. Data management applications are available to help solve these issues, but the first steps may be more political than technical: ensuring that individual departments aren't simply setting up their own VMs and developing their own apps without adhering to corporate standards for data integrity or which cloud they should be using. Creating and promulgating a standards document can go a long way toward simplifying the administrator's job.
It might be tempting to simply put the fastest available all-flash storage system in place to avoid performance bottlenecks. But that will not necessarily resolve all types of data storage problems -- and it can get very expensive. A properly configured tiered storage system, with a very fast RAM cache, a tier of fast SSDs and a tier of high-capacity HDDs, might be faster than an all-flash system, and it will be much less expensive in terms of raw capacity. You'll notice that many all-flash systems are marketed based on their capacity with deduplication and compression enabled, which typically assumes a 2x to 3x (or higher) data reduction ratio.
With properly configured caching algorithms for the RAM cache, and an appropriately sized tier 1 of SSDs, you're likely to find that more than 90% (as high as 99%) of requests are being served from the RAM cache, and 99% of the requests that are not served from cache are being served from the SSD layer. Since the RAM cache may be 10% of the size of the SSD tier, while the SSD tier is 10% to 20% of the size of the HDD tier, you gain a very large performance increase for relatively little cost. Keeping an eye on the percentage of requests served from cache and tier 1 is important. If that figure drops significantly, the performance of the system as a whole can drop by orders of magnitude, since each lower layer is 5 to 20 times slower than the one above it.
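A short calculation shows why the percentage served from the top tiers matters so much. The sketch below walks down the tiers, giving each one its share of the requests that reach it. The latencies are assumptions that simply make each tier 10 times slower than the one above, within the 5x to 20x range mentioned here.

```python
def expected_latency_ms(tiers):
    """Average latency for a list of (share_served_here, latency_ms) tiers.

    Each share applies to the requests that reach that tier; the last tier
    must have a share of 1.0 so every request is served somewhere.
    """
    remaining = 1.0
    total = 0.0
    for share, latency_ms in tiers:
        total += remaining * share * latency_ms
        remaining *= 1.0 - share
    return total

# Assumed latencies: RAM cache 0.05 ms, SSD tier 0.5 ms, HDD tier 5 ms.
healthy = [(0.95, 0.05), (0.99, 0.5), (1.0, 5.0)]
degraded = [(0.50, 0.05), (0.99, 0.5), (1.0, 5.0)]  # RAM cache hit rate has fallen

print(f"healthy:  {expected_latency_ms(healthy):.3f} ms")
print(f"degraded: {expected_latency_ms(degraded):.3f} ms")
```

With these made-up figures, a fall in the RAM cache hit rate from 95% to 50% roughly quadruples average latency even though the SSD and HDD tiers are unchanged. A similar drop at the SSD tier would push far more traffic onto the slowest layer.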