This article can also be found in the Premium Editorial Download "Storage magazine: Optimizing your enterprise database storage."
Troubleshooting storage area networks (SANs) is daunting. Never before have we had so many network devices between our application host and storage. Until a few years ago, all we had was an HBA and a disk controller with a thick, cumbersome cable in the middle. Back then, troubleshooting your storage connection was a less formidable task.
[Sidebar: The Fibre Channel protocol model]
Now, that thick cable has been replaced by a myriad of interconnects enabling multiple data paths between hosts and storage. And just as when we decoupled user terminals from their mainframe controllers, there's quite a bit that can go wrong with a host's ability to gain access to its storage in a SAN. That's why you should take the same methodical approach when troubleshooting your SAN as you do when troubleshooting your LAN.
Master a basic approach that can be adapted to uncover the particular problem at hand. For example, when troubleshooting a SAN, I always check the LED status of the switch first. This simple step can reveal a marginal link, configuration conflicts between neighboring switches, a failing HBA and more.
Remember: Whatever initial steps you take to resolve conflicts in your SAN, be consistent in your starting point. These steps will turn into documented operational procedures for correcting errors in your fabric. Your support and operations staff will communicate more effectively because everyone is familiar with the methodology behind the approach to resolution.
Unlike direct-attached storage, where you would likely start exploring the problem at the host or the storage array, SAN problem resolution usually starts at the switch attached to the application host that's experiencing problems. Using this as your starting point, you'll more than likely see what Fibre Channel (FC) errors are being propagated to the application host's HBA before any SCSI translation.
Unless you're sure about the error, or time permits you to follow a quick hunch, address the problem from the same angle every time. Eventually, you'll become familiar with the various decision points in your troubleshooting exercises, and you'll know which path to take in resolving SAN problems. Until that time comes, approaching errors in a methodical format will make you more familiar with the inner workings of your SAN and the Fibre Channel protocol (FCP) as well.
Similar to the Open Systems Interconnection (OSI) reference model used to describe TCP/IP, the FCP has a five-layer reference model of its own (see "The Fibre Channel protocol model"). The five layers are referred to as FC0 through FC4. During troubleshooting exercises, I often refer to the different functionalities of these five layers. Working upward from FC0 to FC4, you can rule out possible problem areas in your SAN one at a time. This is why I start most of my troubleshooting exercises by inspecting the LED indicators on the front of the switch.
Traversing the layers
FC0. This layer defines the data rates of the FCP with regard to the media types used in the solution, as well as distance limits and optimal levels of signal integrity. In short, it represents the quality of the physical connection between your application host and its storage. From the HBA to the fiber-optic cable to the ports of the interconnected switches leading to your storage devices, a good LED link status throughout the data path can rule out much of this hardware.
FC1. The next step upward takes us to the FC1 layer. In the transmit phase, the FC1 layer encodes data characters onto the link as 0s and 1s; in the receive phase, it decodes the 0s and 1s off the link back into data characters. This transformation takes place at both initiators and targets, since both send and receive data on the link. Knowing this, you can conclude that the problem may lie higher up the physical chain, such as in an HBA or connecting port, if the LED status is satisfactory but the protocol error counters are climbing rapidly with user activity.
Additionally, your switch vendor should provide tools to query the various protocol error counters stored in the ASICs supporting your switch ports. If you suspect that a host's HBA is failing, use these tools to poll the switch port to which the problem application host is connected. Note that even without any traffic on the link, the bit error rate allowed by the FCP is such that you'll see an error on the link about every 16 minutes.
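That 16-minute figure can be sanity-checked with simple arithmetic. The numbers below are assumptions, not stated in the article: the Fibre Channel specification's maximum allowed bit error rate of 10^-12, and a 1Gbps link signaling at 1.0625 Gbaud.

```python
# Sanity check of the "error about every 16 minutes" figure.
# Assumptions (not from the article): max allowed BER of 1e-12
# and 1Gbps FC signaling at 1.0625 Gbaud. An idle FC link still
# transmits fill words continuously, so bits flow across the
# link even with no user traffic on it.
ber = 1e-12           # maximum allowed bit error rate
line_rate = 1.0625e9  # signaling rate in bits per second

seconds_between_errors = 1 / (ber * line_rate)  # about 941 seconds
minutes = seconds_between_errors / 60
print(f"worst case: one error every ~{minutes:.1f} minutes")
```

At the specification limit, that works out to roughly one error every 15.7 minutes, which matches the rule of thumb above: occasional errors on an otherwise idle link are expected, and only a faster rate of growth indicates trouble.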
FC2. Continuing upward, the FC2 layer is responsible for framing and flow control between connecting endpoints. Its functionality lies in the chips of the connecting ports on the SAN, so even this far up the FCP reference model, we're still looking at the problem from a hardware perspective. While inspecting the error counters of a suspect port, watch for any steady increase in frame header errors, including out-of-order primitives, which would indicate that the device connected to that port is at fault.
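One way to spot such a steady increase is to poll a port's counters periodically and compare deltas between polls. A minimal sketch, assuming you supply a `poll` callable wrapping your switch vendor's query tool (the function name, counter names and thresholds here are invented for illustration; no standard API exists across vendors):

```python
import time

def detect_rising_counters(poll, interval=60, samples=5, threshold=10):
    """Return (counter_name, delta) pairs whenever a cumulative
    error counter grows by more than `threshold` between polls.
    `poll` is any callable returning a dict of counters -- e.g. a
    wrapper around the vendor's CLI or SNMP query for one port."""
    alerts = []
    previous = poll()
    for _ in range(samples):
        time.sleep(interval)
        current = poll()
        for name, value in current.items():
            delta = value - previous.get(name, 0)
            if delta > threshold:
                alerts.append((name, delta))
        previous = current
    return alerts
```

Comparing deltas rather than absolute values matters: the counters are cumulative since the last reset, so a large absolute number may be historical noise, while a fast-rising delta points at an active fault.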
As with TCP/IP, the FCP also provides flow control, via an acknowledgement (ACK) or receive ready (R_RDY), depending on the class of service your application uses. With flow control, timers that elapse due to congestion in the fabric may also cause errors at this layer. For example, in a Class 2 application, ACKs are returned from the receiver to the sending node to confirm that the receiver did in fact receive the corresponding frame.
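This acknowledgement scheme, including recovery when an ACK is lost to congestion, can be sketched as a toy model. This is a simplification for illustration, not the actual FC-2 state machine; the function names and parameters are invented, and real implementations use recovery timers and vendor-specific policies.

```python
def send_exchange(frames, transmit, max_retries=3):
    """Toy model of Class 2 delivery. `transmit(frame)` stands in
    for the fabric: it returns True only if the frame's ACK comes
    back before the recovery timer elapses. If any ACK is lost,
    the whole exchange is re-sent. Returns the number of attempts
    delivery took, or raises if the fabric stays congested."""
    for attempt in range(1, max_retries + 1):
        if all(transmit(frame) for frame in frames):
            return attempt
    raise TimeoutError("exchange undeliverable -- persistent congestion?")
```

Note how a single lost ACK forces every frame of the exchange to be replayed, which is one reason fabric congestion tends to compound.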
However, if congestion occurs and the ACK frame is dropped after a timer elapses, the entire exchange and sequence may be re-sent, depending on the error-recovery mechanisms chosen by your application and/or your hardware vendor's implementation. For the most part, you can determine whether congestion is the culprit by evaluating the performance of other application hosts connected to the same switch as the problem application.

Finding human error
You've now exhausted the layers that are limited to machine/machine interactions. From here on, human error enters into the picture.
This was first published in November 2002