Published: 13 Oct 2003
|Fine-tuning your SAN|
The SAN admin does a quick review of the SAN topology and finds that three other servers running Oracle are each accessing the same Fibre Channel (FC) controllers on the same storage array. A deeper look into the performance and event logs for the host bus adapters (HBAs), switches and array turns up no glaring performance problems or equipment failures. It's still early in the morning and the SAN admin's headache is getting worse.
The storage admin needs to gather performance statistics from the HBAs, switches and the storage array in the data path and correlate them with the performance stats gathered from each of the three servers' operating systems and databases to pinpoint the problem. Complicating the matter, each of these components has its own performance management tool that stores the data in a different format. His next step? Collect, synchronize, interpret and report on the data from all of these components in the data path as quickly as possible.
Successfully diagnosing a complex SAN performance problem is more of a task for a team of mathematicians skilled in modeling chaos theory than a bleary-eyed SAN admin working with an Excel spreadsheet. Is help on the way? Are today's performance management tools up to the task?
What is good performance?
Delivering good performance is complicated by the lack of meaningful measurement standards that define what good performance is. While vendors and analysts often cite benchmarks such as high rates of I/O and cache hits or low seek times on disk drives as examples of good performance, these statistics offer little insight into real-world situations.
More meaningful benchmarks to administrators would report on how a storage subsystem responds to random reads and writes vs. sequential reads and writes, the impact on a storage subsystem when multiple applications are executing and simultaneously accessing it or which backup product works best with which type of application. Currently, these sorts of standards exist only on the drawing boards at the Storage Performance Council (SPC).
Walter Baker, an administrator and auditor with the SPC, reports that in 2002, the SPC published the first industry standard performance benchmark for storage, called SPC-1. This benchmark measures I/O operations characterized by OLTP, database and e-mail operations. Yet by Baker's own admission, the SPC-1 standard represents just the first step toward reporting on what performance standards need to calculate. He points out that the SPC-1 standard primarily measures activity only in a single address space. The SPC-1 results can't necessarily be applied in multiaddress spaces where multiple applications execute simultaneously.
Until these standards get defined and accepted, AppIQ's CTO Ash Ashutosh believes that the responsibility to define good performance rests on the application owner. He advises these individuals to define the metrics meeting the application's service level agreements (SLAs) and from that definition construct an environment meeting those objectives.
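Ashutosh's advice can be made concrete with a small script. The following is a minimal sketch, assuming an application owner expresses the SLA as a 95th-percentile latency ceiling and a throughput floor; the `SlaTarget` class, its field names and the sample numbers are hypothetical illustrations, not part of any vendor's tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SlaTarget:
    """Hypothetical SLA for one application (names and units are illustrative)."""
    max_p95_latency_ms: float  # 95th-percentile I/O latency must stay at or below this
    min_iops: float            # sustained I/O rate must stay at or above this


def percentile(values, pct):
    """Nearest-rank percentile; pct is an integer from 1 to 100."""
    ordered = sorted(values)
    # Integer ceiling of pct/100 * n, which avoids floating-point rounding surprises.
    rank = (pct * len(ordered) + 99) // 100
    return ordered[max(rank, 1) - 1]


def check_sla(latencies_ms, iops_samples, target):
    """Return a list of SLA violations as readable strings (empty if none)."""
    violations = []
    observed_p95 = percentile(latencies_ms, 95)
    if observed_p95 > target.max_p95_latency_ms:
        violations.append(
            f"p95 latency {observed_p95} ms exceeds {target.max_p95_latency_ms} ms"
        )
    lowest_iops = min(iops_samples)
    if lowest_iops < target.min_iops:
        violations.append(
            f"lowest I/O rate {lowest_iops} IOPS is below {target.min_iops} IOPS"
        )
    return violations


if __name__ == "__main__":
    # Made-up measurements for one database server.
    target = SlaTarget(max_p95_latency_ms=20.0, min_iops=1000.0)
    for problem in check_sla([5, 6, 7, 8, 40], [1200, 900, 1500], target):
        print(problem)
```

Starting from a definition like this gives everyone the same yardstick: the environment either meets the stated numbers or it does not.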
Performance measurement challenges
A number of factors contribute to the difficulties an administrator has in gathering and interpreting meaningful performance data. The biggest issue is the collection of performance statistics. This data may be spread across every component in the SAN: the database, the operating system, the FC HBA, the FC switch and the storage arrays. Storage administrators need to know how to use each component's proprietary tools to collect the data.
Once you know how to use the tools, the next step is to make sure the clocks of the components being measured are all in sync. A discrepancy of a couple of minutes, or possibly even a couple of seconds, in any component's internal clock could skew the interpretation of the results. Next, storage administrators need to verify that they have gathered all of the data required for analysis. For instance, on a storage array, administrators may choose the option to capture all of its port I/O activity and cache hit statistics, yet fail to select the option that records a disk drive's read and write statistics.
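To illustrate why clock skew and missing data matter, here is a minimal sketch of the correlation step. It assumes each component's samples have already been exported as (timestamp, value) pairs and that each component's clock offset from a reference clock is known; the function names, component names and numbers are all hypothetical.

```python
from collections import defaultdict


def align_samples(sources, clock_offsets_s, bucket_s=60):
    """Merge per-component samples onto common time buckets.

    sources:         {component: [(local_timestamp_s, value), ...]}
    clock_offsets_s: {component: seconds its clock runs ahead of the reference}
    Returns {bucket_start_s: {component: mean value in that bucket}}.
    """
    buckets = defaultdict(lambda: defaultdict(list))
    for component, samples in sources.items():
        offset = clock_offsets_s.get(component, 0)
        for local_ts, value in samples:
            reference_ts = local_ts - offset  # undo the component's clock skew
            bucket_start = int(reference_ts // bucket_s) * bucket_s
            buckets[bucket_start][component].append(value)
    return {
        start: {name: sum(vals) / len(vals) for name, vals in per_component.items()}
        for start, per_component in sorted(buckets.items())
    }


def missing_components(aligned, required):
    """Report, per bucket, which required components contributed no data."""
    gaps = {}
    for start, per_component in aligned.items():
        absent = sorted(set(required) - set(per_component))
        if absent:
            gaps[start] = absent
    return gaps


if __name__ == "__main__":
    # Made-up samples: the array's clock runs 90 seconds ahead of the others.
    sources = {
        "hba": [(1020, 10.0), (1050, 20.0)],
        "array": [(1125, 5.0)],
    }
    aligned = align_samples(sources, {"array": 90})
    print(aligned)
    print(missing_components(aligned, ["hba", "switch", "array"]))
```

Without the offset correction, the array sample above would land in a later one-minute bucket than the HBA samples it actually coincides with, which is exactly the kind of skew that leads to a wrong diagnosis.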
The final factors in determining the cause of a performance problem are the administrator's skill and the time the diagnosis takes. That time depends on how much experience the administrator has with similar issues, the access they have to the needed information and the quality of the information gathered.
Here the situation may get political, especially in large organizations. While in some organizations one individual may control and analyze all pieces of the SAN, in other organizations different departments may own and manage different pieces of the storage network. At this point, as AppIQ's Ashutosh observes, the process often degrades from a brainstorming session to a blamestorming session because the analysis and interpretation of performance data can become highly subjective.
The holistic view
Of course, life would be much simpler if a single robust management tool could integrate the performance management data of all the SAN components and provide a synchronized, holistic view into the environment. A number of vendors claim to provide such a tool. However, each vendor takes a different approach. Basically, the tools can be grouped into three general categories:
- The tools that have grown from managing just their appliance to a more holistic view.
- The tools that have traditionally taken an application focus, but are now seeking to drill down into the storage network.
- New players who can situate themselves however they want.
EMC and HDS, storage array vendors in the first category, have both stated that a holistic view is their objective, but they're following slightly different paths to get there. One area where they agree is in the adoption and integration of the new SMI-S storage standards into their solutions. Similar to what the Simple Network Management Protocol (SNMP) did for the IP world in terms of monitoring and reporting on performance, the emerging Storage Management Initiative Specification (SMI-S) seeks to do much the same for the storage networking space. EMC and HDS plan to get the information they need from the operating systems and databases by deploying agents on the servers attached to the storage network.
From there, the design of their products differs in a couple of aspects. First, where EMC does not already have API agreements in place to report on the advanced functionality within its competitors' storage arrays, it is looking to reverse-engineer the solution. HDS, on the other hand, is looking to obtain the necessary APIs by purchasing them from its competitors. Second, EMC's recent acquisition of BMC's Storage Patrol product may give it a short-term edge over many of the competitors in this market. It now has something most of the others do not: performance management tools designed independently, plus its own tools.
Companies such as Computer Associates (CA), IBM/Tivoli and Veritas Corp. fall into the second group of companies looking to expand their traditional software base. For example, CA expects BrightStor SAN Manager to eventually link back into Unicenter to provide an enterprise-wide console for LAN, WAN and SAN performance reporting and management at the host level. IBM/Tivoli also looks to match CA's initiative by tying its IBM/Tivoli Storage Area Network Manager back into its IBM/Tivoli Enterprise Console at some point in the future. Veritas is also looking to capitalize on the deployment of its existing server-based Volume Manager and File System software and use it in conjunction with SANPoint Control to offer a similar enterprise console.
Yet for these Category 2 companies to get the level of detail needed to solve really thorny performance management problems, they need what the storage array and switch vendors have: the APIs. The new SMI-S standards grant them greater visibility into these environments by discovering switch bottlenecks and hot spots on storage subsystems. But they will eventually need more than just these standards to provide advanced functions, such as dynamic performance tuning on storage arrays, that are already offered by their counterparts with the legacy hardware focus. In this respect, IBM/Tivoli's software may have a short-term advantage over its competitors in this category because the same parent company owns both the hardware and software parts needed to complete the equation.
The final group of vendors bringing the holistic offering to the table consists of independent companies that have no legacy hardware or software they need to build into the equation. Companies such as AppIQ, CreekPath Systems and InterSAN can focus more on building performance management software that meets customer requirements than on trying to integrate with legacy hardware and software.
However, Category 3 companies will struggle to get a foothold in organizations. There are only two times administrators care about performance: when they initially set up the system and when a problem exists. Other than that, administrators have better things to do than watch performance monitors oscillate between 20% and 80% utilization rates.
The agent problem
It's ironic, but performance-tuning software can create its own set of performance problems. As performance agents scamper from component to component, they clog the storage environment.
For this reason, Peter Galvin, CTO of Corporate Technologies, says his company stopped selling and supporting one such product. When deployed, this product's agent consumed 10% of the server's CPU and generated additional network traffic.
Chris Gahagan, EMC's senior VP of infrastructure software, says that administrators should expect agents to consume no more than 1% to 2% of CPU and memory overhead. Any more than that, and the agent becomes obnoxious. He believes that to keep performance management agents at that level, they should focus only on gathering and monitoring high-level data. They should consume more resources only when they start to spot a problem, thereby requiring more options to be turned on. However, once the problem is identified and solved, the agent should automatically throttle back to its default configuration.
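The throttling behavior Gahagan describes can be sketched as a small state machine. This is an illustration only, not EMC's design: the `AdaptiveAgent` class, its thresholds and its sampling intervals are hypothetical values chosen for the example.

```python
from dataclasses import dataclass


@dataclass
class AdaptiveAgent:
    """Hypothetical agent that samples lightly until it spots a problem."""
    baseline_interval_s: int = 300      # cheap, high-level sampling by default
    detailed_interval_s: int = 15       # costlier sampling while a problem persists
    latency_threshold_ms: float = 20.0  # what counts as "spotting a problem"
    calm_samples_to_reset: int = 3      # healthy samples needed before throttling back
    detailed: bool = False
    _calm_streak: int = 0

    def observe(self, latency_ms):
        """Feed one high-level sample; return seconds to wait before the next one."""
        if latency_ms > self.latency_threshold_ms:
            # Problem spotted: turn on detailed collection and restart the calm count.
            self.detailed = True
            self._calm_streak = 0
        elif self.detailed:
            # Healthy sample while in detailed mode: count toward throttling back.
            self._calm_streak += 1
            if self._calm_streak >= self.calm_samples_to_reset:
                self.detailed = False
                self._calm_streak = 0
        return self.detailed_interval_s if self.detailed else self.baseline_interval_s


if __name__ == "__main__":
    agent = AdaptiveAgent()
    for sample in [5, 50, 5, 5, 5]:
        print(sample, agent.observe(sample))
```

Requiring several consecutive healthy samples before dropping back keeps the agent from flapping between modes when a metric hovers near the threshold.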
Another problem is simply getting the agents on the servers, configuring them and maintaining them once they are there. Installing and configuring agents on 10 or 20 servers running Windows is one level of difficulty. Doing the same thing on a couple hundred servers with different operating systems and databases creates a whole new level of complexity. The good news is that there's progress to report on this front. EMC's Gahagan believes software agents should be self-propagating and distribute themselves to the servers running them. EMC is currently building an agent architecture that uses a distribution server to propagate agents to the servers it supports.
AppIQ hopes to minimize or avoid the whole agent issue by programming its central server to open a connection to either Microsoft's Windows Management Instrumentation (WMI) interface or the various Unix vendors' versions of it. Sun Solaris has had this functionality since Solaris 7, while IBM is currently preparing a version of it for its releases of Linux and AIX. Buyer beware: Many of these products are still in early stages of release.
Best bets for now
An administrator pondering which approach to take for their performance-management needs should probably use a combination of point solutions in heterogeneous environments or tools from storage array vendors in homogeneous storage networks. Too many of today's tools designed for heterogeneous environments are still in their infancy, are dying on the vine, such as BMC's Storage Patrol, or only work in qualified heterogeneous environments with a limited number of storage arrays and operating systems, such as CreekPath Management Suite.
As for the point solutions that only get information from one operating system, database, switch or storage array, your time would be better spent tuning these applications, understanding them and getting them synced up in your environment than looking to any third-party tool. Performance management software will continue to be more of an exercise in brute force than an art form for the foreseeable future. Emerging standards, questions about functionality, tight budgets and the fact that measuring SAN performance is not a priority in most environments until something goes wrong will all contribute to procrastination in deploying this technology.