SAN problems are generally divided into two areas: connectivity and performance. Connectivity problems happen when something in the SAN, from a port to the whole SAN, simply doesn't work. With a performance problem, everything works, just not very well.
Unfortunately, performance problems are the most common. That's unfortunate because they are also the hardest to troubleshoot. When an entire storage array suddenly drops off the SAN, you have a pretty clear and fairly shallow troubleshooting tree. When things just aren't working very well, identifying what's happening can be a lot more complex.
This is exacerbated by the information problem. Usually the problem with troubleshooting a SAN isn't that you don't have enough information, it's that you've got too much.
From HBAs to switches to storage arrays, most devices on a SAN keep event and alarm logs, often copious logs. When a SAN starts to go flaky all these logs tend to fill up rapidly as the problem causes errors in other components. One of the first problems is to sort out the information you need from what you don't.
The best strategy is to concentrate and correlate. That is, concentrate on the recent errors that seem related to the problem, and try to correlate the information in the error and performance logs to get an overall picture of what's happening. In doing so, it's important to get data from all these sources for the same point in time so you can meaningfully compare what's happening. Ideally you can trace each device's log from when things were performing normally to when they went wrong. Since older entries tend to drop off the logs in the storm of alarms, it's important to capture those logs quickly.
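As a rough illustration of "concentrate and correlate," the sketch below merges log entries from several devices into one timeline and keeps only those close to the moment things went wrong. The `LogEvent` type, the `correlate` function, the five-minute window and the sample messages are all hypothetical, and the sketch assumes each device's log has already been parsed into (timestamp, message) pairs and that the device clocks are synchronized.

```python
from datetime import datetime, timedelta
from typing import NamedTuple


class LogEvent(NamedTuple):
    timestamp: datetime
    device: str
    message: str


def correlate(logs, incident, window=timedelta(minutes=5)):
    """Merge per-device logs into one timeline around the incident.

    `logs` maps a device name to a list of (timestamp, message) pairs.
    Only entries within `window` of `incident` are kept (concentrate),
    and the result is sorted by time across devices (correlate).
    """
    start, end = incident - window, incident + window
    merged = [
        LogEvent(ts, device, msg)
        for device, entries in logs.items()
        for ts, msg in entries
        if start <= ts <= end
    ]
    return sorted(merged, key=lambda event: event.timestamp)


if __name__ == "__main__":
    # Hypothetical sample data, for illustration only.
    incident = datetime(2004, 2, 1, 10, 0, 0)
    sample_logs = {
        "switch": [(incident, "port 4 offline")],
        "hba": [(incident + timedelta(seconds=2), "link down")],
        "array": [(incident - timedelta(hours=3), "routine scrub complete")],
    }
    for event in correlate(sample_logs, incident):
        print(event.timestamp.isoformat(), event.device, event.message)
```

Sorting everything onto one timeline makes it easier to see which device complained first, which is often a better clue than the volume of errors that follow.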
While you're checking the status of the SAN components, make sure you've got current information on the operating system version, maintenance level, driver version and firmware versions. This information is invaluable when you're dealing with other support personnel. (On a related issue, you should take regular snapshots of SAN performance at the same time of day and under the same load conditions. Not only will this give you a baseline to work from, trends in this data can also help you detect problems before they become serious.)
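One simple way to put such a baseline to work is to check whether a new reading falls outside the normal spread of the earlier snapshots. The sketch below is only an illustration: the function name, the sample throughput figures and the threshold of three standard deviations are assumptions, not values taken from any particular SAN tool.

```python
from statistics import mean, stdev


def deviates_from_baseline(history, current, threshold=3.0):
    """Return True if `current` is far from the baseline readings.

    `history` holds earlier readings of one metric, taken at the same
    time of day under the same load. `threshold` is the number of
    standard deviations treated as abnormal (an assumed default).
    """
    if len(history) < 2:
        raise ValueError("need at least two baseline samples")
    spread = stdev(history)
    if spread == 0:
        # A perfectly flat baseline: any change at all is a deviation.
        return current != history[0]
    return abs(current - mean(history)) > threshold * spread


if __name__ == "__main__":
    # Hypothetical throughput readings in MB/s from earlier snapshots.
    baseline = [100, 102, 98, 101, 99]
    print(deviates_from_baseline(baseline, 101))
    print(deviates_from_baseline(baseline, 60))
```

A check like this only flags sudden departures. Spotting slow trends still means looking at how the baseline itself moves over weeks or months.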
Usually the best place to start looking is at the switch. Often you can divide the problem tree in half and quickly determine whether the problem is on the server side or the storage side of the switch. In the case of multiple switches or a director-and-switch topology, start with the central point and work outwards.
Check the switch's event and alarm logs and monitor the performance on the ports. You can compare the values with the previously stored snapshot to see how performance has changed. If you haven't been taking regular snapshots, you should take two snapshots a few minutes apart and look for changes in the parameters. Check the status of the switch and the integrity of the fabric and check the integrity of the zoning as well.
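The two-snapshot comparison can be sketched as a simple difference of port counters. The snapshot layout, port names and counter names (`crc_errors`, `link_failures`) below are hypothetical; real switches expose their own counters through their own management interfaces, and collecting them is outside this sketch.

```python
def counter_deltas(before, after):
    """Return {port: {counter: change}} for counters that changed.

    Each snapshot maps a port name to a dict of counter values.
    A port or counter missing from `before` is treated as starting at 0.
    """
    changes = {}
    for port, counters in after.items():
        earlier = before.get(port, {})
        diff = {
            name: value - earlier.get(name, 0)
            for name, value in counters.items()
            if value != earlier.get(name, 0)
        }
        if diff:
            changes[port] = diff
    return changes


if __name__ == "__main__":
    # Hypothetical snapshots taken a few minutes apart.
    first = {
        "port1": {"crc_errors": 2, "link_failures": 0},
        "port2": {"crc_errors": 0, "link_failures": 0},
    }
    second = {
        "port1": {"crc_errors": 9, "link_failures": 0},
        "port2": {"crc_errors": 0, "link_failures": 0},
    }
    print(counter_deltas(first, second))
```

Reporting only the counters that moved keeps attention on the ports where something is actively going wrong, rather than on errors accumulated long before the problem started.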
On the server side, check the event and alarm log and, if multi-path software is installed, check the virtual paths and adapters.
On the storage side, check the event and alarm log and the status of the Fibre Channel ports.
Given this information, and some thought, you should get some idea whether the problem is in the components (both hardware and software) or in the design of the SAN. Remember that it is possible to run up against a design limitation quite suddenly if, for example, an application needs more SAN bandwidth than was allowed for.
For more information:
Tip: Troubleshoot SANs from the center out
Tip: Troubleshooting Fibre Channel HBAs in PCs
Tip: Fast guide to SAN management
About the author: Rick Cook has been writing about mass storage since the days when the term meant an 80K floppy disk. The computers he learned on used ferrite cores and magnetic drums. For the last twenty years he has been a freelance writer specializing in storage and other computer issues.
This was first published in February 2004