Storage performance measurement (SPM) has always been a difficult proposition for IT organizations, and the job has become even more challenging now that virtual servers are so popular.
Take RiskMetrics Group Inc., for example. The New York City-based financial services firm has 30 VMware Inc. ESX Servers spread across six locations, including data centers in the U.S. and Switzerland. Each ESX Server typically runs 10 to 15 virtual machines (VMs).
Many of those virtual machines were formerly physical machines that ran on local disk, so they were of no concern to the storage team. Now the staff must not only ensure the VMs perform as they would on physical servers, but also plan for their potentially rapid growth.
The physical host is easy to analyze, and when a problem surfaces in a physical server environment, it's usually on the host or the storage, said Ed Delgado, storage architect at RiskMetrics. But in virtual server environments, the storage team can't depend only on the performance numbers from the host or the storage, because of the number of other virtual machines sharing the same datastore.
"Has one VM gone haywire with writes and is it now throttling the other 14 VMs on that datastore? How do you know the other 14 VMs aren't doing the same thing?" wrote Delgado in an email. "On a physical host you can check the read and write MB/sec of a host and trust that number, but on VMware you basically have to add the numbers from the 15 VMs to see how you're actually performing."
RiskMetrics uses Tek-Tools Software Inc.'s Profiler tool to chart the read/write KBps numbers for a specific ESX node and narrow down any issue to a specific datastore. Profiler runs in one of RiskMetrics' U.S.-based data centers, pulls information from VMware's vCenter Server (formerly known as VMware VirtualCenter) and displays all 30 ESX instances in one panel, Delgado said. If three different people call to complain about slow-running VMs, he can see whether all three VMs happen to be on the same datastore.
"It just helps with the problem identification," Delgado said. "It doesn't offer solutions for fixing, but it is helpful in the troubleshooting aspect."
Tools for virtual machine environments
Marc Staimer, president of Beaverton, Ore.-based Dragon Slayer Consulting, said tools such as Tek-Tools' Profiler, Akorri Inc.'s BalancePoint, NetApp's SANscreen (through its acquisition of Onaro Inc.), SANpulse Technology Inc.'s SANlogics and Veeam Software's Veeam Monitor can help administrators ensure there's not "too much" oversubscription between the application and the storage.
"Things are very different between the physical and virtual world," Staimer said. "You need software tools to help evaluate your environment to make sure that you're not going to shoot yourself in the foot."
Staimer recommends tools from a third party rather than those supplied by storage vendors, who typically don't measure beyond their own systems. He said the third-party tools also provide coverage "all the way to the virtual server and the application.
"You need end-to-end monitoring, not point monitoring," Staimer noted.
Because virtual machines are so easy to deploy, users create them at a rapid pace and move them around on a fairly regular basis, creating the potential to overload shared resources, noted Rich Corley, chief technology officer at Akorri. He said his company's tools analyze the shared resources -- array, network and servers -- to help figure out which component is being overutilized.
Brian Radovich, lead product manager at Tek-Tools, estimated that 70% to 80% of his company's customers have deployed at least some virtual servers. By the time Tek-Tools released its VMware module a year ago, it had witnessed an explosion in production usage of virtual servers, he said.
"What's different between three years ago and now is that the roles in managing the application, the server and the storage are merging because you have this concept of virtualization and these shared resources," Radovich said. "You can look at the array and identify the standard problems, but does that translate into better performance for the end-user experience?"
Virtual server performance management best practices
RiskMetrics follows some general principles to achieve better storage performance with its virtual server environment. The storage team, for instance, assigns no SATA disks to the ESX cluster, instead devoting 15K rpm Fibre Channel (FC) disks to its VMware environment.
"VMware has definitely added some complexity to our storage environment," wrote Delgado, "but we are dealing with it by using the best quality disk we have and by segregating it from other applications that we run. All our clusters have their own dedicated RAID groups and LUNs [logical unit numbers], and so far, we have been able to stay ahead of troubled waters."
With its email archiving system, RiskMetrics decided to bypass VMware's internal disk management system and present a 1 TB LUN directly to the VM that runs its busy Symantec Corp. Enterprise Vault.
"We did it for performance reasons," said Delgado. "We did it because nobody can now touch that LUN. They can't put virtual machines on it because it's not part of the VMware environment."
VMware advises customers to reserve space for virtual machine memory swap files, so RiskMetrics allocated 20%, or 200 GB, of the 1 TB LUN for that purpose. Its Profiler tool sends an email alert to the VMware team if the datastore exceeds the 80% usage threshold.
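The threshold check behind that alert is simple arithmetic. The sketch below assumes usage is queried in GB; the numbers are illustrative, not RiskMetrics' actual figures.

```python
# Sketch: the 80% datastore-capacity alert described above.
# With 20% of a 1 TB LUN reserved for VM swap files, an alert should
# fire once data usage crosses 800 GB.

LUN_TOTAL_GB = 1000           # the 1 TB LUN
SWAP_RESERVE_FRACTION = 0.20  # 20% held back for swap headroom

def needs_alert(used_gb, total_gb=LUN_TOTAL_GB,
                reserve=SWAP_RESERVE_FRACTION):
    """True once data usage eats into the reserved swap space."""
    threshold_gb = total_gb * (1 - reserve)  # 800 GB on a 1 TB LUN
    return used_gb >= threshold_gb

print(needs_alert(650))  # → False
print(needs_alert(815))  # → True
```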
When RiskMetrics first launched Enterprise Vault, the storage team expected the application to exhaust its 1 TB LUN in eight months. Instead, it used up the LUN in about six months. The team put in a second 1 TB LUN in the hope that it might last another six months.
RiskMetrics plans to switch Enterprise Vault from a virtual machine to a high-availability cluster in the next couple of months. When the move is complete, the primary instance of Enterprise Vault will run on a physical server, although the failover will continue to be a VM. Enterprise Vault currently fails over to a VM on a different ESX Server.
"It's a very busy system, doing similar work to what Exchange does, and it's constantly growing because it's an archive," Delgado said.
Delgado offered advice to storage architect peers who need to measure or fine-tune storage performance in a virtual server environment. Because most virtual machines, once created, are rarely deleted, administrators will find themselves with more and more VMs running on the cluster, he warned.
"Make the time to check on the performance at least once a week," he advised, suggesting that administrators take snapshots of the key disk performance attributes (read/write MBps, read/write access count) at the same day and time to provide data points that can indicate the crossing of a performance threshold. "You may be able to map that back to the introduction of a VM that has been hogging resources," he said.
"Honestly," Delgado concluded, "my best advice is to make sure you have a close work relationship with the virtual machine admins. They are the ones creating VMs and [they] can keep the admin in the loop as to what may be coming down the line that could have a tremendous impact in your VMware environment," such as a virtual machine for a Microsoft Corp. SQL Server or Oracle Corp. database.