This article can also be found in the Premium Editorial Download "Storage magazine: Low-cost storage pieces fall into place."
Fine-tuning your SAN
The SAN admin does a quick review of the SAN topology and finds that three other servers running Oracle are each accessing the same Fibre Channel (FC) controllers on the same storage array. A deeper look into the performance and event logs for the host bus adapters (HBAs), switches and array turns up no glaring performance problems or equipment failures. It's still early in the morning and the SAN admin's headache is getting worse.
The storage admin needs to gather performance statistics from the HBAs, switches and the storage array in the data path and correlate them with the stats gathered from each of the three servers' operating systems and databases to pinpoint the problem. Complicating matters, each of these components has its own performance management tool that stores its data in a different format. His next step? Collect, synchronize, interpret and report on the data from every component in the data path as quickly as possible.
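As a rough illustration of the "collect and synchronize" step, the sketch below converts samples from three tools into one common record shape and merges them into a single timeline. The export formats, field names and metric names are invented for the example; real HBA, switch and array tools will each differ.

```python
from datetime import datetime, timezone

# All input field names below are hypothetical; each real tool has its own export format.

def from_hba(row):
    # Assumed HBA export: "YYYY-MM-DD HH:MM:SS" in UTC, throughput stored as text.
    ts = datetime.strptime(row["time"], "%Y-%m-%d %H:%M:%S").replace(tzinfo=timezone.utc)
    return {"ts": ts, "component": "hba", "metric": "mb_per_s",
            "value": float(row["mb_per_s"])}

def from_switch(row):
    # Assumed switch export: Unix epoch seconds, port utilization as a number.
    ts = datetime.fromtimestamp(row["epoch"], tz=timezone.utc)
    return {"ts": ts, "component": "switch", "metric": "port_util_pct",
            "value": float(row["port_util_pct"])}

def from_array(row):
    # Assumed array export: ISO 8601 timestamp with a trailing "Z" for UTC.
    ts = datetime.strptime(row["ts"], "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)
    return {"ts": ts, "component": "array", "metric": "cache_hit_ratio",
            "value": float(row["cache_hit_ratio"])}

def merge_timeline(*streams):
    # Flatten every stream of normalized records and order them by time,
    # breaking ties by component name so the result is deterministic.
    records = [rec for stream in streams for rec in stream]
    return sorted(records, key=lambda r: (r["ts"], r["component"]))
```

Once every sample shares the same timestamp type and record layout, the streams can be compared side by side in any reporting tool instead of being reconciled by hand in a spreadsheet.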
Successfully diagnosing a complex SAN performance problem is more of a task for a team of mathematicians skilled in modeling chaos theory than a bleary-eyed SAN admin working with an Excel spreadsheet. Is help on the way? Are today's performance management tools up to the task?
What is good performance?
Delivering good performance is complicated by a lack of meaningful measurement standards that define what good performance is. While vendors and analysts often cite benchmarks such as high I/O rates and cache-hit rates or low seek times on disk drives as examples of good performance, these statistics offer little insight into real-world situations.
Benchmarks more meaningful to administrators would report on how a storage subsystem responds to random vs. sequential reads and writes, how it behaves when multiple applications are executing and accessing it simultaneously, or which backup product works best with which type of application. Currently, these sorts of standards exist only on the drawing boards at the Storage Performance Council (SPC).
Walter Baker, an administrator and auditor with the SPC, reports that in 2002, the SPC published the first industry-standard performance benchmark for storage, called SPC-1. This benchmark measures I/O operations characteristic of OLTP, database and e-mail workloads. Yet by Baker's own admission, the SPC-1 standard represents just the first step toward reporting on what performance standards need to measure. He points out that SPC-1 primarily measures activity in a single address space only. Its results can't necessarily be applied to environments with multiple address spaces, where multiple applications execute simultaneously.
Until these standards are defined and accepted, AppIQ's CTO Ash Ashutosh believes that the responsibility for defining good performance rests with the application owner. He advises these individuals to define the metrics that satisfy the application's service level agreements (SLAs) and, from that definition, construct an environment that meets those objectives.
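One way to make such a definition concrete is to write the SLA metrics down as explicit targets and test measurements against them. The sketch below is only an illustration: the metric names and limits are hypothetical, and a real SLA would supply its own.

```python
# Hypothetical SLA targets for one application. Each entry is (kind, limit):
# "max" means the measured value must not exceed the limit,
# "min" means it must not fall below it.
SLA_TARGETS = {
    "response_time_ms": ("max", 20.0),
    "throughput_mb_s": ("min", 50.0),
}

def sla_violations(measured, targets=SLA_TARGETS):
    """Return (metric, measured_value, limit) for every target that is missed."""
    violations = []
    for metric, (kind, limit) in targets.items():
        value = measured.get(metric)
        if value is None:
            # A metric that was never measured cannot be shown to meet the SLA.
            violations.append((metric, None, limit))
        elif kind == "max" and value > limit:
            violations.append((metric, value, limit))
        elif kind == "min" and value < limit:
            violations.append((metric, value, limit))
    return violations
```

For example, `sla_violations({"response_time_ms": 35.0, "throughput_mb_s": 60.0})` flags only the response time, which gives the application owner a specific objective to discuss rather than a general complaint that the SAN is slow.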
Performance measurement challenges
A number of factors contribute to the difficulties an administrator has in gathering and interpreting meaningful performance data. The biggest issue is the collection of performance statistics. This data may be spread across every component in the SAN: the database, the operating system, the FC HBA, the FC switch and the storage arrays. Storage administrators need to know how to use each component's proprietary tools to collect the data.
Once you know how to use the tools, the next step is to make sure the clocks of the components being measured are all in sync. A skew of a couple of minutes, or possibly even a couple of seconds, in any component's internal clock could distort the interpretation of the results. Storage administrators also need to verify that they have gathered all of the data required for analysis. For instance, on a storage array, administrators may choose the option to capture all of its port I/O activity and cache hit statistics, yet fail to select the option that records a disk drive's read and write statistics.
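Both checks can be sketched in a few lines, assuming the administrator has already measured each component's clock offset against a reference clock and has a list of the statistics the analysis requires. The offsets, component names and statistic names below are hypothetical.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical offsets in seconds: component clock minus reference clock.
# Here the switch clock is assumed to run 95 seconds fast.
CLOCK_OFFSET_S = {"hba": 0.0, "switch": 95.0, "array": -3.0}

# Hypothetical set of statistics the analysis needs from each component.
REQUIRED = {"array": {"port_io", "cache_hit", "disk_read", "disk_write"}}

def to_reference_time(component, ts, offsets=CLOCK_OFFSET_S):
    # Subtract the component's known offset to express its timestamp
    # on the shared reference clock.
    return ts - timedelta(seconds=offsets[component])

def missing_metrics(component, collected, required=REQUIRED):
    # Report any required statistic that was not captured for this component.
    return sorted(required.get(component, set()) - set(collected))
```

With these assumptions, `missing_metrics("array", ["port_io", "cache_hit"])` reports that the disk read and write statistics were never captured, which is the omission described above.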
The last issues that affect the ability to determine the cause of a performance problem are the administrator's skill and the amount of time it takes to diagnose the issue. That time depends on how much time the administrator has spent working on similar issues in the past, the access he or she has to the needed information and the quality of the information gathered.
Here the situation may get political, especially in large organizations. While in some organizations one individual may control and analyze all pieces of the SAN, in others different departments may own and manage different pieces of the storage network. At this point, as AppIQ's Ashutosh observes, the process often degrades from a brainstorming session to a blamestorming session because the analysis and interpretation of performance data can become highly subjective.
This was first published in October 2003