Hidden away in IBM's Redbook on "Enterprise Storage Server performance monitoring and tuning" is a brief chapter that serves as a quick refresher course on the business of storage system performance analysis.
The first point, IBM says, is that performance analysis is only possible in reference to something. You need standards set in a service level agreement, or input from the users, before you can decide whether performance is acceptable. In the absence of a service level agreement, the best place to start is to ask the users, the people who 'own' the data. Ideally you should have service levels established for the various kinds of data or applications, including highlighting and documenting which parts of the workload must meet those levels.
Once you have a performance baseline you can begin to analyze your performance. The basic tools are your daily performance monitoring activities. You should get reports giving at least a broad overview of your storage system's performance daily, either by using tools built into the operating systems, such as Unix iostat or Windows' Performance Monitor, or by using the reports generated by your storage management software.
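As a sketch of what such a daily overview can look like, the Python fragment below parses device lines from `iostat -dx`-style output and flags devices running above a utilization threshold. The column layout and the sample figures are assumptions for illustration; the exact fields vary by platform, so the parsing would need adjusting to match your system's actual report format.

```python
# Minimal sketch: summarize device utilization from iostat-style output.
# The column layout is an assumption (Linux `iostat -dx` prints %util in
# the last column); adapt the field positions to your platform's format.

SAMPLE = """\
Device   r/s   w/s   rkB/s   wkB/s   await   %util
sda      12.0  30.5  480.0   1220.0  4.2     35.0
sdb      0.5   90.1  16.0    7208.0  21.7    92.3
"""

def busy_devices(report: str, threshold: float = 80.0):
    """Return (device, %util) pairs above the utilization threshold."""
    lines = report.strip().splitlines()[1:]  # skip the header row
    hot = []
    for line in lines:
        fields = line.split()
        device, util = fields[0], float(fields[-1])
        if util > threshold:
            hot.append((device, util))
    return hot

print(busy_devices(SAMPLE))  # → [('sdb', 92.3)]
```

Run against each day's report, even a crude filter like this gives you the broad overview the Redbook recommends: most days it prints nothing, and the day it doesn't is the day to dig deeper.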
If your storage system's performance is marginal or unacceptable, the next step is to figure out where you can make the most difference and start there. Fundamentally, every system is bottlenecked somewhere, although if performance is acceptable we don't care what the constraint might be. If we need to improve performance we start by identifying the bottlenecks, determining which is the worst bottleneck (or more specifically which bottleneck can be opened most cost-effectively) and breaking that. Once you've addressed the largest cause of delay, analyze your performance again. If it still isn't acceptable, go on to the next bottleneck and repeat until the performance is acceptable or until you reach the point of diminishing returns.
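The "break the worst bottleneck first" loop can be sketched as picking the fix with the best delay-reduction-per-cost ratio. The component names, delay figures, and costs below are made-up assumptions purely to illustrate the selection step:

```python
# Illustrative sketch of choosing which bottleneck to break first.
# Delay-removed (ms) and cost ($) figures are invented for illustration.

bottlenecks = {
    # candidate fix: (estimated response-time reduction in ms, cost in $)
    "add cache":      (15.0, 5000),
    "rebalance LUNs": (8.0,  500),
    "faster HBA":     (5.0,  2000),
}

def most_cost_effective(candidates):
    """Pick the fix with the highest delay reduction per dollar spent."""
    return max(candidates,
               key=lambda name: candidates[name][0] / candidates[name][1])

print(most_cost_effective(bottlenecks))  # → rebalance LUNs (8 ms per $500)
```

Note that the biggest bottleneck (the cache, at 15 ms) is not the one chosen: as the text says, the target is the bottleneck that can be opened most cost-effectively. After applying the fix, you would re-measure and repeat.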
Generally, IBM says, you analyze performance in one of two ways. Either the users are complaining, in which case your analysis will be oriented toward improving response time, or you're seeing something in the reports and other system indicators you don't like, in which case you're likely to check the overall health of the system.
Response-time problems generally divide into two classes. First there are problems in the storage system itself. The second class is external problems, which can involve everything from the LAN to the users' computers. The first step in solving a response-time problem is to understand the entire response chain, from the disk to the user's screen, and identify the areas that are dragging down response time.
From a storage administrator's viewpoint, the internal problems are easier to characterize because they only involve data from the storage system. External problems typically require pulling together and analyzing data from multiple sources, often controlled by other departments. As you analyze the data from either source the key question is how much of the total latency in the operation is caused by which part of the operation, such as CPU time versus disk-access time versus I/O operations.
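One way to frame that key question is as a simple latency budget: total up the time spent in each stage of an operation and compute each stage's share. The stage names and timings below are hypothetical numbers chosen for illustration, not measurements:

```python
# Hypothetical latency budget for one end-to-end operation, in milliseconds.
# The stages and figures are assumptions for illustration only.

stages_ms = {
    "CPU time":     2.0,
    "disk access":  9.0,
    "I/O transfer": 3.0,
    "network":      6.0,
}

total = sum(stages_ms.values())
shares = {stage: ms / total for stage, ms in stages_ms.items()}

# Print the stages from largest contributor to smallest.
for stage, share in sorted(shares.items(), key=lambda kv: -kv[1]):
    print(f"{stage}: {share:.0%}")
# In this made-up budget, disk access dominates at 45% of the 20 ms total.
```

Laying the data out this way makes it obvious where effort pays off: here, shaving the 2 ms of CPU time could improve response by at most 10%, while the disk path offers 45%.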
In analyzing a response-time problem, the usual place to start is to look for resource contention problems. Things like process contention increase response time in a big hurry.
Of course, one of the biggest difficulties with any kind of response-time problem is recreating it. Often the problem occurs only under limited conditions and may not be at all obvious at other times. This usually turns into a data collection problem. Try to identify the time period or conditions when the problem happens and collect data for those intervals. Capture several such intervals; you may have to set a range and step through the data to locate the spikes in usage. You also want multiple occurrences so you can be sure you've identified the problem.
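Stepping through collected data to find those spikes can be as simple as a scan over interval samples for runs that exceed a threshold. The sample series and threshold below are made-up assumptions; the point is that each distinct run is one occurrence of the problem, and you want to see several before drawing conclusions:

```python
# Sketch: locate intervals where a monitored metric (say, average response
# time in ms per sample interval) spikes above a threshold. The series is
# invented data for illustration.

samples = [4, 5, 4, 22, 25, 5, 4, 30, 6, 5, 28, 27, 4]
THRESHOLD = 20

def spike_intervals(series, threshold):
    """Return (start, end) index ranges, inclusive, of consecutive spikes."""
    spikes, start = [], None
    for i, value in enumerate(series):
        if value > threshold and start is None:
            start = i                       # a spike run begins
        elif value <= threshold and start is not None:
            spikes.append((start, i - 1))   # the run just ended
            start = None
    if start is not None:                   # series ended mid-spike
        spikes.append((start, len(series) - 1))
    return spikes

print(spike_intervals(samples, THRESHOLD))  # → [(3, 4), (7, 7), (10, 11)]
```

Three separate occurrences in this toy data would be the kind of confirmation the text calls for; one isolated spike could just as easily be noise.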
Don't assume that because you've found one item out of range you've located your problem. Storage system analysis requires an overall understanding of what's going on in the system, and that usually means looking at a lot of related indicators to make sure you really do understand what's happening.
Once you've located your problem, proceed to tackle it in a top-down manner by breaking it into successively smaller pieces. Try to identify precisely what processes or operations have the problem and exactly when and how the problem occurs.
Finally, keep in mind that a storage system is exactly that -- a system. Try to keep all the parts of the system in balance as you tune and adjust the various parameters. Don't concentrate on one aspect of the system to the exclusion of all others.
Rick Cook has been writing about mass storage since the days when the term meant an 80K floppy disk. The computers he learned on used ferrite cores and magnetic drums. For the last twenty years he has been a freelance writer specializing in storage and other computer issues.