Troubleshooting storage area networks (SANs) is daunting. Never before have we had so many network devices between our application host and storage. Until a few years ago, all we had was an HBA and a disk controller with a thick, cumbersome cable in the middle. Back then, troubleshooting your storage connection was a less formidable task.
[Figure: The Fibre Channel protocol model]
Now, that thick cable has been replaced by a myriad of interconnects enabling multiple data paths between hosts and storage. And just as when we decoupled user terminals from their mainframe controllers, there's quite a bit that can go wrong with a host's ability to gain access to its storage in a SAN. That's why you should take the same methodical approach when troubleshooting your SAN as you do when troubleshooting your LAN.
Master a basic approach that can be adapted to uncover the particular problem at hand. For example, when troubleshooting a SAN, I always check the LED status of the switch. This simple effort might identify a marginal link, a configuration conflict between neighboring switches, a failing HBA and more.
Remember: Whatever initial steps you take to resolve conflicts in your SAN, be consistent in your starting point. These steps will turn into documented operational procedures for correcting errors in your fabric. Your support and operational staff will benefit from more effective communication between the groups because everyone is familiar with the methodology behind the approach to resolution.
Unlike direct-attached storage, where you would likely start exploring the problem at the host or the storage array, SAN problem resolution usually starts at the switch attached to the application host that's experiencing problems. Using this as your starting point, you'll more than likely see which Fibre Channel (FC) errors are being propagated to the application host's HBA before any SCSI translation.
Unless you're sure about the error, or time permits you to follow a quick hunch, address the problem from the same angle every time. Over time, you'll become familiar with the various decision points in your troubleshooting exercises and you'll know which path to take in resolving SAN problems. Until that time comes, approaching your errors methodically will also help you become more familiar with the inner workings of your SAN and the Fibre Channel protocol (FCP).
Similar to the Open Systems Interconnection (OSI) reference model used to describe LAN protocols such as TCP/IP, the FCP has a five-layer reference model of its own (see "The Fibre Channel protocol model"). The five layers are referred to as FC0 through FC4. During troubleshooting exercises, I often refer to the different functions of these five layers. Working from FC0 up to FC4, you can start to rule out the possible problem areas in your SAN. This is why I start most of my troubleshooting exercises by inspecting the LED indicators on the front of the switch.
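The bottom-up walk through the layers can be sketched in a few lines of Python. This is only an illustration of the ordering, not a diagnostic tool: the layer summaries paraphrase this article, and each check is a hypothetical placeholder that you would replace with whatever inspection your switch vendor's tools allow.

```python
from typing import Callable, Dict, Optional

# Layer names follow the article; the summaries paraphrase it.
FC_LAYERS = [
    ("FC0", "physical link: media, distance, signal quality"),
    ("FC1", "encoding and decoding of data on the link"),
    ("FC2", "framing and flow control"),
    ("FC3", "supporting services such as the name server"),
    ("FC4", "protocol mapping: SCSI drivers and firmware"),
]


def first_suspect_layer(checks: Dict[str, Callable[[], bool]]) -> Optional[str]:
    """Run the supplied checks from FC0 upward; return the first layer that fails."""
    for name, _summary in FC_LAYERS:
        check = checks.get(name)
        if check is not None and not check():
            return name
    return None


# Hypothetical outcome: LEDs and error counters look fine,
# but the name server lookup for the host fails.
checks = {
    "FC0": lambda: True,
    "FC1": lambda: True,
    "FC2": lambda: True,
    "FC3": lambda: False,
    "FC4": lambda: True,
}
suspect = first_suspect_layer(checks)
print(suspect)
```

Starting at the same layer every time is the point here; the order of the list, not the content of any one check, carries the method.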
Traversing the layers
FC0. This layer defines the data rates of the FCP with regard to the media types used in the solution, as well as the distance and optimal levels of signal integrity. In short, it represents the quality of the physical connection between your application host and its storage. A good LED link status along the entire data path, from the HBA through the fiber-optic cable to the ports of the interconnected switches leading to your storage devices, can rule out much of this hardware.
FC1. The next step upward takes us to the FC1 layer. In transmit phase, the FC1 layer takes data characters and encodes them onto the link as 0s and 1s. In receive phase, it decodes the 0s and 1s off the link into data characters. This transformation takes place at both initiators and targets because they both send and receive data on the link. Knowing this, you can conclude that the problem may lie higher up the physical chain, such as in an HBA or connecting port, if the LED status is satisfactory but the protocol error counters are climbing rapidly with user activity.
Additionally, your switch vendor should provide tools for you to query the various protocol error counters that are stored in the ASICs supporting your switch ports. If you suspect that a host's HBA is failing, use these tools to poll the switch port to which the problem application host is connected. Keep in mind that even without any traffic on the link, the bit error rate that the FCP tolerates is such that you can expect to see an error on the link about every 16 minutes, so a small, slowly growing count is not by itself a sign of trouble.
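The 16-minute figure is consistent with a quick back-of-the-envelope calculation. The sketch below rests on two assumptions that this article doesn't spell out: a bit error rate ceiling of one error in 10^12 bits, which is the figure the Fibre Channel standard sets, and the 1.0625 Gbaud line rate of a 1Gb/s FC link. Because the link carries idle fill words when no frames are flowing, bits are on the wire whether or not there is traffic.

```python
# Assumed figures, not stated in the article:
#   - bit error rate ceiling of 1 error per 10**12 bits
#   - 1.0625 Gbaud line rate of a 1Gb/s Fibre Channel link
BITS_PER_ERROR = 10**12
LINE_RATE_BITS_PER_SECOND = 1.0625e9

seconds_per_error = BITS_PER_ERROR / LINE_RATE_BITS_PER_SECOND
minutes_per_error = seconds_per_error / 60

print(f"about one bit error every {minutes_per_error:.1f} minutes")
```

On a faster link the same error rate would produce errors proportionally more often, so treat the 16 minutes as a 1Gb/s rule of thumb rather than a constant.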
FC2. Continuing upward, the FC2 layer is responsible for framing and flow control between connecting endpoints. Its functionality lies in the chips of the connecting ports on the SAN. Even this far up in the FCP reference model, we're still looking at the problem from a hardware perspective. While inspecting the error counters of a suspect port, take note of any steady increase in frame header errors, including out-of-order primitives, which would indicate that the device connected to that port is suspect.
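What counts as a "steady increase" is easier to judge from several polls than from a single reading. Here is a minimal sketch, assuming you have already collected a handful of counter samples for one port with your vendor's tools; the counter names and values are invented for illustration and don't correspond to any particular switch.

```python
from typing import Dict, List


def steadily_increasing(samples: List[int]) -> bool:
    """True if the counter rose between every pair of consecutive polls."""
    return len(samples) >= 2 and all(
        later > earlier for earlier, later in zip(samples, samples[1:])
    )


def suspect_counters(history: Dict[str, List[int]]) -> List[str]:
    """Names of the counters that rose on every poll, in alphabetical order."""
    return sorted(
        name for name, samples in history.items() if steadily_increasing(samples)
    )


# Hypothetical polls of one switch port, oldest sample first.
port_history = {
    "frame_header_errors": [12, 19, 27, 40],
    "out_of_order_primitives": [3, 5, 9, 14],
    "crc_errors": [2, 2, 2, 2],
}
flagged = suspect_counters(port_history)
print(flagged)
```

A counter that sits at a fixed nonzero value, like the last one here, usually points to an old event rather than an ongoing fault.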
Like TCP/IP, the FCP provides flow control, via an acknowledgement (ACK) or receiver ready (R_RDY), depending on the class of service being used by your application. With flow control in play, timers that expire because of congestion in the fabric may also cause errors at this layer. For example, in a class 2 application, the receiver returns ACKs to the sending node to confirm that it did in fact receive the corresponding frames.
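The R_RDY side of this flow control works on credit: a port may send only as many frames as the receiver has free buffers, and each R_RDY hands one credit back. The toy model below is a sketch of that idea only; the class name and the credit count of two are arbitrary choices for illustration, and real ports negotiate their credit at login.

```python
class CreditedSender:
    """Toy model of R_RDY-style credit flow control on a single link."""

    def __init__(self, credits: int) -> None:
        self.credits = credits  # frames we may send before we have to wait

    def try_send(self) -> bool:
        """Send one frame if a credit is available; otherwise hold it back."""
        if self.credits == 0:
            return False
        self.credits -= 1
        return True

    def on_r_rdy(self) -> None:
        """The receiver freed a buffer, so one credit comes back."""
        self.credits += 1


sender = CreditedSender(credits=2)  # arbitrary example value
results = [sender.try_send() for _ in range(3)]  # third frame has to wait
sender.on_r_rdy()
after_r_rdy = sender.try_send()  # the returned credit lets it go
print(results, after_r_rdy)
```

When a congested fabric is slow to return credits, frames queue at the sender, which is how timers come to expire.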
However, if congestion occurs and the ACK frame is dropped because a timer expired, the entire exchange and sequence may be re-sent, depending on the error recovery mechanisms chosen by your application and/or your hardware vendor's implementation. For the most part, you should be able to determine whether congestion is the culprit by evaluating the performance of other application hosts connected to the same switch as the problem application.
Finding human error
You've now exhausted the layers that are limited to machine/machine interactions. From here on, human error enters into the picture.
FC3. The FC3 layer contains the supporting services in the FCP. Services such as the Simple Name Server (SNS), Alias Server and Time Server live at this layer.
Because there are user interfaces in this layer, whether from the command line or a third-party application, you'll find the solution to most of your problems at this and higher layers. Depending on the error being reported, and the FCP functions being used in your SAN, you'll start to interrogate the failing service with the tools provided by your switch vendor.
Most problems in this layer have to do with the SNS. Just as in the IP world, where a DNS corruption may cause an end device to go off the air, a corruption in the SNS will have a similar effect in the FC world.
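One quick way to interrogate the SNS is to confirm that every port you expect to see has actually registered. The sketch below assumes you have exported the name server entries for the fabric with your switch vendor's tools; the helper names and the world wide names are made up for illustration.

```python
from typing import Iterable, List


def normalize_wwn(wwn: str) -> str:
    """Lowercase and strip separators so WWNs compare reliably."""
    return wwn.replace(":", "").replace("-", "").lower()


def missing_from_name_server(
    expected: Iterable[str], registered: Iterable[str]
) -> List[str]:
    """Return the expected WWNs that have no matching name server entry."""
    known = {normalize_wwn(wwn) for wwn in registered}
    return [wwn for wwn in expected if normalize_wwn(wwn) not in known]


# Hypothetical values: the host HBA port and the storage port it should reach.
expected_ports = ["10:00:00:00:c9:aa:bb:01", "50:06:01:60:aa:bb:cc:02"]
# Hypothetical name server export: only the HBA has registered.
name_server_entries = ["10:00:00:00:C9:AA:BB:01"]

missing = missing_from_name_server(expected_ports, name_server_entries)
print(missing)
```

An expected port that is missing from the list tells you where to look next: at that device's login to the fabric, before you spend time on zoning or drivers.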
In all likelihood, you're not using all of the functionality of the FC3 layer. For instance, the Alias Server supports multicasting on an FC network. Therefore, if you're not utilizing multicasting, there's no need to take that path during the exercise.
FC4. Protocol mapping occurs at the FC4 layer. According to the standard, FC4 defines the program structures that third-party vendors must adhere to when sending and receiving data over the FC transport.
In practice, this is the layer at which SCSI device drivers interface with the FCP. This area is undergoing constant change. Pressure to get products to market, along with the occasional oversight, often leaves hardware engineers and application developers unable to thoroughly test the scenarios we encounter in the field.
That's why you must verify that the proper device drivers and firmware revisions are loaded on the devices that make up your SAN. Most hardware vendors publish matrices indicating which device drivers and firmware revisions work best with other certified solutions. Still, it's a good idea to standardize on device driver and firmware revisions for like equipment. For example, if your group is managing 75 QLogic HBAs, and there's no compelling reason to do otherwise, all 75 HBAs should be kept at the same device driver and firmware revision level.
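Checking a fleet of HBAs for stragglers is easy to automate once you have an inventory. The following is a minimal sketch, assuming you can gather each host's driver and firmware versions into a dictionary; the host names and version strings are invented, and the most common pair is used as the baseline only for convenience.

```python
from collections import Counter
from typing import Dict, List, Tuple

Revision = Tuple[str, str]  # (driver version, firmware version)


def find_outliers(inventory: Dict[str, Revision]) -> Tuple[Revision, List[str]]:
    """Return the most common revision pair and the hosts that differ from it."""
    baseline, _count = Counter(inventory.values()).most_common(1)[0]
    outliers = sorted(host for host, rev in inventory.items() if rev != baseline)
    return baseline, outliers


# Hypothetical inventory of like HBAs.
inventory = {
    "host-a": ("8.01", "3.02"),
    "host-b": ("8.01", "3.02"),
    "host-c": ("7.05", "3.02"),  # older driver
    "host-d": ("8.01", "2.99"),  # older firmware
}
baseline, outliers = find_outliers(inventory)
print(baseline, outliers)
```

The most common level isn't necessarily the right one, so compare the baseline against your vendor's support matrix before bringing the outliers into line.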
If you follow these guidelines, you'll have a good start on resolving problems inside the SAN infrastructure. By breaking apart the functionality of the different layers of the FCP reference model, you should be able to reduce the time needed to solve a problem by limiting your exploration to the one or two layers responsible for the functionality you've lost in your SAN.