The trouble with measuring SAN performance

A common system administrator's nightmare involves the telephone call at 4:00 in the morning from your boss. He's calling to tell you there's a problem with the SAN. Here's how to survive and prevent this from happening again.

This article can also be found in the Premium Editorial Download: Storage magazine: Low-cost storage pieces fall into place.

Fine-tuning your SAN
Isolating the cause of a performance issue in a networked storage environment requires collecting performance metrics along the data path. Here's a sampling of touch points a storage administrator may want to capture metrics on when troubleshooting a performance problem.

Server

  • CPU: speed and number
  • Memory: amount
Operating System
  • I/O
  • File system activity
Database
  • Number of threads or users
  • Type and complexity of queries being run
Storage Array
  • Disk speed
  • Back-end disk connectivity (Fibre Channel or SCSI)
  • Cache or memory
  • Front-end port count
Connectivity
  • Switches/Directors
  • Number of hops
  • Zoning
Protocol
  • Fibre Channel
  • SCSI
  • TCP/IP or iSCSI
Channels
  • Bandwidth
  • Number of channels
  • Speed
Host bus adapters
  • Firmware
  • Driver
  • Backplane
  • Number
  • Connection count
It's the early morning telephone call every storage area network (SAN) administrator dreads. The system administrator, DBA and their managers are already on the phone. The issue? One of the mission-critical application databases is running slow and they think it's a SAN performance problem.

The SAN admin does a quick review of the SAN topology and finds that three other servers running Oracle are each accessing the same Fibre Channel (FC) controllers on the same storage array. A deeper look into the performance and event logs for the host bus adapters (HBAs), switches and array turns up no glaring performance problems or equipment failures. It's still early in the morning and the SAN admin's headache is getting worse.

The storage admin needs to gather performance statistics from the HBAs, switches and the storage array in the data path and correlate them with the performance stats gathered from each of the three servers' operating systems and databases to pinpoint the problem. Complicating the matter, each of these components has its own performance management tool that stores the data in a different format. His next step? Collect, synchronize, interpret and report on the data from all of these components in the data path as quickly as possible.

Successfully diagnosing a complex SAN performance problem is more of a task for a team of mathematicians skilled in modeling chaos theory than a bleary-eyed SAN admin working with an Excel spreadsheet. Is help on the way? Are today's performance management tools up to the task?

What is good performance?
Complicating the task of delivering good performance is the lack of meaningful performance measurement standards that define what good performance is. While vendors and analysts often cite benchmarks such as high rates of I/O and cache hits or low seek times on disk drives as examples of good performance, these statistics offer little insight into real-world situations.

More meaningful benchmarks to administrators would report on how a storage subsystem responds to random reads and writes vs. sequential reads and writes, the impact on a storage subsystem when multiple applications are executing and simultaneously accessing it or which backup product works best with which type of application. Currently, these sorts of standards exist only on the drawing boards at the Storage Performance Council (SPC).

Walter Baker, an administrator and auditor with the SPC, reports that in 2002, the SPC published the first industry standard performance benchmark for storage, called SPC-1. This benchmark measures I/O operations characterized by OLTP, database and e-mail operations. Yet by Baker's own admission, the SPC-1 standard represents just the first step toward reporting on what performance standards need to calculate. He points out that the SPC-1 standard primarily measures activity only in a single address space. The SPC-1 results can't necessarily be applied in multiaddress spaces where multiple applications execute simultaneously.

Until these standards get defined and accepted, AppIQ's CTO Ash Ashutosh believes that the responsibility to define good performance rests on the application owner. He advises these individuals to define the metrics meeting the application's service level agreements (SLAs) and from that definition construct an environment meeting those objectives.

Performance measurement challenges
A number of factors contribute to the difficulties an administrator has in gathering and interpreting meaningful performance data. The biggest issue is the collection of performance statistics. This data may be spread across every component in the SAN: the database, the operating system, the FC HBA, the FC switch and the storage arrays. Storage administrators need to know how to use each component's proprietary tools to collect the data.

Once you know how to use the tools, the next step is to make sure the clocks of the components being measured are all in sync. A discrepancy of a couple of minutes, or possibly even a couple of seconds, in any component's internal clock could skew the interpretation of the results. Storage administrators then need to verify that they have gathered all of the data required for analysis. For instance, on a storage array, an administrator may choose the option to capture all of its port I/O activity and cache hit statistics, yet fail to select the option that records each disk drive's read and write statistics.
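
As a rough illustration of why clock sync matters, the Python sketch below shifts each component's timestamps by a measured clock offset and then groups the samples into shared one-minute intervals. The component names, offsets and sample values are all hypothetical; in practice the offsets might come from comparing each device's clock against a common time reference.

```python
from collections import defaultdict

# Hypothetical clock offsets in seconds (device clock minus reference clock).
CLOCK_OFFSETS = {"host": 0.0, "switch": 95.0, "array": -40.0}


def align_samples(samples, offsets, bucket_seconds=60):
    """Group (component, timestamp, value) samples into shared time buckets.

    Each timestamp is first corrected by the component's clock offset so
    that samples taken at the same real moment land in the same bucket.
    """
    buckets = defaultdict(dict)
    for component, timestamp, value in samples:
        corrected = timestamp - offsets.get(component, 0.0)
        bucket_start = int(corrected // bucket_seconds) * bucket_seconds
        buckets[bucket_start][component] = value
    return dict(buckets)


# Three readings taken at the same real moment (reference time 600 s),
# but stamped by clocks that disagree with one another.
samples = [
    ("host", 600.0, 35.0),    # host I/O wait, percent
    ("switch", 695.0, 97.0),  # switch port utilization, percent
    ("array", 560.0, 62.0),   # array cache hit rate, percent
]
aligned = align_samples(samples, CLOCK_OFFSETS)
print(aligned)
```

With the offsets applied, all three readings fall into the same interval and can be compared side by side. Passing an empty offsets dictionary instead scatters them across three different intervals, which is the kind of skew described above.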

The last few issues that affect the ability to determine the cause of any performance problem are the skill and the amount of time it takes the administrator to diagnose a performance issue. The amount of time this takes will correlate to the amount of time they have spent working on similar issues in the past, the access they have to the needed information and the quality of the information gathered.

Here the situation may get political, especially in large organizations. While in some organizations one individual may control and analyze all pieces of the SAN, in others different departments may own and manage different pieces of the storage network. At this point, as AppIQ's Ashutosh observes, the process often degrades from a brainstorming session to a blamestorming session because the analysis and interpretation of performance data can become highly subjective.

One tool is often all you need
Vendors freely supply or sell performance management software for their products. For example, all flavors of Unix generally ship with a similar set of performance management tools. Utilities commonly used in this environment to capture disk performance statistics include sar, iostat, vmstat, ps and filemon. Utilities such as sar or iostat collect logical disk statistics such as disk busy, transfers per second and kilobyte throughput per second. Vmstat may be used to provide the total number of I/Os done during each interval, while the ps utility provides options to analyze the amount of I/O done on behalf of individual processes. Filemon monitors the performance of file systems and reports on the I/O activity on behalf of logical files as well as logical and physical volumes.
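
To make statistics such as disk busy, transfers per second and kilobytes per second more concrete, here is a minimal Python sketch of the arithmetic behind them: take two snapshots of cumulative counters and divide the differences by the elapsed time. The field names and numbers are hypothetical and do not reflect the output format of sar, iostat or any other utility.

```python
from dataclasses import dataclass


@dataclass
class DiskSnapshot:
    """Cumulative counters for one logical disk (hypothetical field names)."""
    time_s: float     # when the snapshot was taken, in seconds
    busy_s: float     # total seconds the disk has spent servicing I/O
    transfers: int    # total I/O requests completed
    kilobytes: int    # total kilobytes read and written


def interval_stats(before, after):
    """Turn two cumulative snapshots into per-interval rates."""
    elapsed = after.time_s - before.time_s
    if elapsed <= 0:
        raise ValueError("snapshots must be taken in increasing time order")
    return {
        "busy_pct": 100.0 * (after.busy_s - before.busy_s) / elapsed,
        "transfers_per_s": (after.transfers - before.transfers) / elapsed,
        "kb_per_s": (after.kilobytes - before.kilobytes) / elapsed,
    }


# Two snapshots taken 10 seconds apart.
before = DiskSnapshot(time_s=0.0, busy_s=100.0, transfers=50_000, kilobytes=400_000)
after = DiskSnapshot(time_s=10.0, busy_s=108.5, transfers=53_000, kilobytes=448_000)
stats = interval_stats(before, after)
print(stats)
```

In this made-up interval the disk was busy 85% of the time while completing 300 transfers and 4,800 KB per second.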

Aside from the variations in these utilities on each flavor of Unix, the utilities that come with operating systems, databases and hardware appliances generally monitor and report on only their own product. Yet in a number of instances, these tools used alone or in conjunction with other vendors' tools are sufficient to solve the immediate performance problem.

For instance, running iostat from the Unix command prompt and seeing long iowait times associated with a particular disk may indicate that frequently accessed data sits on a disk drive that is too slow for the workload. Or starting McData's SANavigator and choosing the Performance Graph option to monitor an Intrepid 6140 director may help the administrator detect that the storage port from which the server is trying to retrieve data is running at 90% to 100% utilization. This may indicate a configuration where too many applications or servers are trying to access data down the same path at the same time.
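
The utilization figure in that second example is simply observed throughput divided by what the link can carry. The Python sketch below flags ports at or above the 90% mark; it assumes round nominal capacities of about 100 MB/s and 200 MB/s per direction for 1Gb and 2Gb Fibre Channel links, and the port names and throughput readings are hypothetical.

```python
# Nominal usable bandwidth in MB/s per direction (assumed round figures).
LINK_CAPACITY_MB_S = {"1Gb FC": 100.0, "2Gb FC": 200.0}


def port_utilization(throughput_mb_s, link_type):
    """Return utilization as a percentage of the link's nominal capacity."""
    return 100.0 * throughput_mb_s / LINK_CAPACITY_MB_S[link_type]


def saturated_ports(ports, threshold_pct=90.0):
    """Return the names of ports running at or above the threshold."""
    return [
        name
        for name, (throughput_mb_s, link_type) in ports.items()
        if port_utilization(throughput_mb_s, link_type) >= threshold_pct
    ]


# Hypothetical readings: port name -> (observed MB/s, link type).
ports = {
    "storage_port_0": (188.0, "2Gb FC"),
    "storage_port_1": (60.0, "2Gb FC"),
    "host_port_7": (45.0, "1Gb FC"),
}
hot = saturated_ports(ports)
print(hot)
```

Here only storage_port_0 is flagged, at 94% of its assumed capacity. A real check would also need to look at how long the port stays at that level, since a brief burst is not necessarily a bottleneck.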

Yet for either of these diagnoses to occur, certain assumptions and conditions must be in place. For instance, a storage administrator needs to know how to operate each performance management tool for the software platform, application or hardware appliance it manages, and must be able to collect the data it produces. The same administrator also needs the appropriate access and permissions to execute the tool, and the ability to understand the data collected. These are not givens in every environment.

The holistic view
Of course, life would be much simpler if a single robust management tool could integrate the performance management data of all the SAN components and provide a synchronized, holistic view into the environment. A number of vendors claim to provide such a tool. However, each vendor takes a different approach. Basically, the tools can be grouped into three general categories:

  • The tools that have grown from managing just their appliance to a more holistic view.
  • The tools that have traditionally taken an application focus, but are now seeking to drill down into the storage network.
  • New players who can situate themselves however they want.
EMC Corp. and Hitachi Data Systems (HDS) are two vendors looking to expand the scope of their products from point solutions to enterprise solutions. EMC is taking the performance management features of its native Control Center product--which has traditionally only reported on EMC's own product line--and is expanding those capabilities to provide a holistic view of the storage environment regardless of the operating system, database, switch or storage array vendor. Similarly, HDS is expanding the functionality of its HiCommand product from only doing storage performance management on its own storage arrays to an enterprise performance management solution.

While both companies have stated that these are their objectives, they're following slightly different paths to get there. One area where they both agree is in the adoption and integration of the new SMI-S storage standards into the implementation of their solution. Similar to what the simple network management protocol (SNMP) did for the IP world in terms of monitoring and reporting on performance, the emerging storage management initiative specifications (SMI-S) seek to do much the same for the storage networking space. EMC and HDS plan to get the necessary information they need from the operating systems and databases by deploying agents on the servers attached to the storage network.

From there, the design of their products differs in a couple of aspects. First, where EMC does not already have API agreements in place to report on the advanced functionality within its competitors' storage arrays, it is looking to reverse-engineer the solution. HDS, on the other hand, is looking to obtain the necessary APIs by purchasing them from its competitors. Second, EMC's recent acquisition of BMC's Storage Patrol product may give it a short-term edge over many of the competitors in this market. It now has something most of the others do not--performance management tools designed independently, plus its own tools.

Companies such as Computer Associates (CA), IBM/Tivoli and Veritas Corp. fall into the second group of companies looking to expand their traditional software base. For example, CA expects BrightStor SAN Manager to eventually link back into Unicenter to provide an enterprisewide console for LAN, WAN and SAN performance reporting and management at the host level. IBM/Tivoli also looks to match CA's initiative by tying its IBM/Tivoli Storage Area Network Manager back into its IBM/Tivoli Enterprise Console at some point in the future. Veritas is likewise looking to capitalize on the deployed base of its server-based Volume Manager and File System software and use it in conjunction with SANPoint Control to offer a similar enterprise console.

Yet for these Category 2 companies to get the level of detail needed to solve really thorny performance management problems, they need what the storage array and switch vendors have--the APIs. The new SMI-S standards grant them greater visibility into these environments by discovering switch bottlenecks and hot spots on storage subsystems. But they will eventually need more than just these standards to provide advanced functionality, such as dynamic performance tuning on storage arrays, already offered by their counterparts with the legacy hardware focus. In this respect, IBM/Tivoli's software may have a short-term advantage over its competitors in this category because the same parent company owns both the hardware and software parts needed to complete the equation.

The final group of vendors bringing a holistic offering to the table consists of independent companies that have no legacy hardware or software they need to build into the equation. Companies such as AppIQ, CreekPath Systems and InterSAN can focus more on building performance management software that meets customer requirements than on trying to integrate with legacy hardware and software.

However, Category 3 companies will struggle to get a foothold in organizations. There are only two times administrators care about performance: when they initially set up the system and when a problem exists. Other than that, administrators have better things to do than watch performance monitors oscillate between 20% and 80% utilization rates.

Two ways to manage performance
END-TO-END SOLUTIONS

AppIQ Manager: An application-integrated solution with optional add-ons that support Oracle and Exchange.

CreekPath Systems AIM Suite: Supports and integrates with a range of operating systems, Fibre Channel (FC) switches, HBAs and storage arrays.

EMC Control Center: Reports on and reactively responds to failed paths or performance hot spots. EMC is expanding this product to work with other storage arrays.

Precise Software i3 APM technology: Veritas recently acquired Precise Software's i3 APM technology that analyzes total system performance from the server and the application all the way down to the storage array level.

Other products: IBM/Tivoli SAN Manager, InterSAN Pathline, Storability Global Storage Manager and Veritas SANPoint Control provide varying levels of end-to-end performance monitoring and reporting.

POINT SOLUTIONS

HBAs: Emulex HBAnyware and QLogic SANblade Manager offer the ability to monitor and report on the performance of their HBAs, capturing such information as port status, throughput and problems the cards may experience.

Switches: Brocade Fabric Manager, Cisco Fabric Manager, CNT/Inrange Enterprise Manager, McData SANavigator and QLogic SANblade Manager offer varying levels of abilities to display and capture performance statistics that measure port activity, utilization and throughput on their respective switches. For advanced FC performance analysis that works independently of the switch vendors, look to the Finisar GTX Analyzer.

Storage arrays: Most enterprise storage array vendors such as 3PAR, IBM, Network Appliance, StorageTek and Sun offer tools that, at a minimum, enable administrators to capture and report on various performance statistics. Some larger, monolithic storage arrays offer self-tuning performance tools such as EMC Symmetrix Optimizer and HDS HiCommand Tuning Manager that respond to hot spots on disk drives and can dynamically move data within the array to alleviate this contention.

Servers/operating systems: Demand Technology NTSMF, Fujitsu Softek Server Manager and NetScout Systems NetScout Server are just a few in a long list of the performance management tools that complement the utilities included in most operating systems today.

The agent problem
It's ironic, but performance-tuning software can create its own set of performance problems. As performance agents scamper from component to component, they clog the storage environment.

For this reason, Peter Galvin, CTO of Corporate Technologies, says his company stopped selling and supporting one such product. When deployed, this product's agent consumed 10% of the server's CPU and generated additional network traffic.

Chris Gahagan, EMC's senior VP of infrastructure software, says that administrators should expect agents to consume no more than 1% to 2% of CPU and memory. Any more than that, and the agent becomes obnoxious. He believes that to keep performance management agents at that level, they should focus only on gathering and monitoring high-level data. They should consume more resources only when they start to spot a problem that requires more options to be turned on. Once the problem is identified and solved, the agent should automatically throttle back to its default configuration.
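
The throttling behavior Gahagan describes can be sketched in a few lines of Python. This is only a toy model under assumed names and thresholds, not EMC's implementation: the agent stays in a cheap default mode, switches to detailed collection when a reading crosses an alert threshold, and drops back after a run of calm readings.

```python
class AdaptiveAgent:
    """Toy monitoring agent that raises its detail level only when needed."""

    def __init__(self, alert_threshold_pct=80.0, calm_samples_needed=3):
        # Both values are illustrative assumptions, not vendor defaults.
        self.alert_threshold_pct = alert_threshold_pct
        self.calm_samples_needed = calm_samples_needed
        self.detailed = False
        self._calm_streak = 0

    def observe(self, utilization_pct):
        """Feed one high-level reading and return the resulting mode."""
        if utilization_pct >= self.alert_threshold_pct:
            # A problem may be starting: turn on detailed collection.
            self.detailed = True
            self._calm_streak = 0
        elif self.detailed:
            # Count consecutive calm readings before throttling back.
            self._calm_streak += 1
            if self._calm_streak >= self.calm_samples_needed:
                self.detailed = False
                self._calm_streak = 0
        return "detailed" if self.detailed else "default"


agent = AdaptiveAgent()
readings = [40, 55, 92, 85, 60, 50, 45, 30]
modes = [agent.observe(reading) for reading in readings]
print(modes)
```

Requiring several calm readings before dropping back keeps the agent from flapping between modes when utilization hovers near the threshold.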

Another problem is simply getting the agents on the servers, configuring them and maintaining them once they are there. Installing and configuring agents on 10 or 20 servers running Windows is one level of difficulty. Doing the same thing on a couple hundred servers with different operating systems and databases creates a whole new level of complexity. The good news is that there's progress to report on this front. EMC's Gahagan believes software agents should be self-propagating and distribute themselves to the servers running them. EMC is currently building an agent architecture that uses a distribution server to propagate agents to the servers it supports.

AppIQ hopes to minimize or avoid the whole agent issue by programming its central server to open a connection to either Microsoft's Windows Management Instrumentation (WMI) interface or the various Unix vendors' versions of it. Sun Solaris has had this functionality since Solaris 7, while IBM is currently preparing a version of it for its releases of Linux and AIX. Buyer beware: Many of these products are still in early stages of release.

Best bets for now
An administrator pondering which approach to take for performance management should probably use a combination of point solutions in heterogeneous environments, or tools from the storage array vendor in homogeneous storage networks. Too many of today's tools designed for heterogeneous environments are still in their infancy, are dying on the vine (such as BMC's Storage Patrol), or only work in qualified heterogeneous environments with a limited number of storage arrays and operating systems (such as CreekPath Management Suite).

As for the point solutions that only get information from one operating system, database, switch or storage array, your time would be better spent tuning these tools, understanding them and getting them synced up in your environment than looking to any third-party tool. Performance management software will continue to be more of an exercise in brute force than an art form for the foreseeable future. Emerging standards, questions about functionality, tight budgets and the fact that measuring SAN performance is not a priority in most environments until something goes wrong will all contribute to continued procrastination in deploying this technology.

This was first published in October 2003