This article can also be found in the Premium Editorial Download "Storage magazine: Are your data storage costs too high?."
Let's delve into some of the common errors you'll experience in operating your SAN, and how you can gain familiarity with the inner workings of your interconnects by taking a methodical approach to resolution. We'll assume that you're running certified versions of your vendor's firmware and device drivers, so your problems lie elsewhere.
After first checking the LED status of the switches in the data path to verify a good link, log in to the switch to which the host experiencing the problem is attached and examine the error counters of the relevant port. Of course, if you don't have a good link state--a green light--you'll need to verify the functionality of the switch port by rebooting, running switch diagnostics or using some of your vendor-supplied commands to confirm the viability of the connecting switch ports.
If you have a good link status throughout the data path, then you've pretty much taken FC0 out of the picture as a potential problem point. Now, we're applying the method described last month of starting at the lowest level of the FC reference model and proceeding systematically up the stack to FC4.
As we proceed to FC1, the layer in the reference model responsible for encoding and decoding data and bits, marginal hardware is likely to be the culprit behind intermittent data errors in applications. By their very nature, these are often the hardest errors to track down: They are intermittent and can reside in the components of any one of a number of switches and/or bridges between the application host and its storage. These hardware components consist of host bus adapters (HBAs), gigabit interface converters (GBICs), fiber optic cabling, application specific integrated circuits (ASICs) and possibly SCSI controllers on an FC/SCSI bridge.
The ASICs in your FC switches are full of information about which component in your data path is failing. This information takes the form of error counters. When you view these counters, you'll find them climbing quite naturally because of the native bit error rate (BER) of the FC protocol. However, most--if not all--vendors reset these counters during a reboot, which gives you a baseline for calculating the rate at which the counters increment over a given time. If rebooting the switch isn't possible, then depending on the vendor, you can reset the counters on the port without rebooting the switch. Keep in mind that because of FC's BER, transport errors are likely to occur about every 16 minutes even without any data on the link.
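The 16-minute figure follows from simple arithmetic. Here's a rough sketch of that calculation, assuming a BER target of one error in 10^12 bits and the 1.0625Gbaud line rate of 1Gb/s FC (neither number is stated above). The function names and the tolerance factor are purely illustrative; they aren't vendor commands or vendor thresholds.

```python
FC_BER = 1e-12            # assumed: one bit error per 10^12 bits
LINE_RATE_1G = 1.0625e9   # assumed: 1Gb/s FC line rate in bits per second


def seconds_between_errors(line_rate_bps=LINE_RATE_1G, ber=FC_BER):
    """Expected time between bit errors on a link that is always carrying bits."""
    return 1.0 / (line_rate_bps * ber)


def expected_errors(elapsed_seconds, line_rate_bps=LINE_RATE_1G, ber=FC_BER):
    """Number of errors the BER alone would account for over the elapsed time."""
    return elapsed_seconds * line_rate_bps * ber


def exceeds_baseline(counter_delta, elapsed_seconds, tolerance=10.0):
    """True if a counter grew much faster than the BER predicts.

    tolerance is an arbitrary illustrative margin, not a vendor threshold.
    """
    return counter_delta > tolerance * expected_errors(elapsed_seconds)


if __name__ == "__main__":
    print(round(seconds_between_errors() / 60, 1), "minutes between errors")
    # A counter that rose by 500 in the hour since it was reset:
    print(exceeds_baseline(counter_delta=500, elapsed_seconds=3600))
```

Under these assumptions the interval works out to a little under 16 minutes, and a counter that climbs by hundreds in an hour is far outside what the BER alone would explain.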
Large numbers of encoding and CRC errors usually indicate that marginal hardware between the examined switch port and the initiator is to blame for the intermittent data errors you're experiencing. A large number of encoding errors most likely indicates a problem in the cabling, perhaps a substandard or pinched cable. Large enterprises that splice their own cable are likely to see more marginal errors because of the human element involved. And large numbers of CRC errors usually indicate a marginal GBIC or a problem with the ASIC itself.
If the error counters reported for the switch port are in line with FC's BER, then you've successfully cleared FC1.
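The FC1 rules of thumb above can be written down as a small triage function. This is only a sketch: the function name, its parameters and the margin are hypothetical, and what counts as a "large" number of errors is something you would tune to your own fabric.

```python
def triage_fc1(encoding_errors, crc_errors, expected_from_ber, margin=10.0):
    """Map port counter growth since the last reset to the likely suspects.

    expected_from_ber is the count the link's bit error rate alone would
    explain over the same interval; margin is an arbitrary illustrative factor.
    """
    limit = margin * expected_from_ber
    suspects = []
    if encoding_errors > limit:
        suspects.append("cabling (substandard, pinched or badly spliced)")
    if crc_errors > limit:
        suspects.append("GBIC or switch ASIC")
    return suspects or ["none: counters are in line with the BER, move on to FC2"]


if __name__ == "__main__":
    # Counter deltas read from the port an hour after a reset (made-up values).
    for suspect in triage_fc1(encoding_errors=1200, crc_errors=3, expected_from_ber=4):
        print(suspect)
```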
At this point, you might think that you can rule out a hardware problem. But actually the FCP incorporates much of its functionality in hardware, including FC2, which is responsible for frame and sequence development and termination, as well as flow control. That's one difference between IP and FC. Ethernet cards utilize the host's CPU (processing and interrupts) to facilitate the same tasks that FC devices do onboard. To a large extent, this is why FC is more processor-efficient in moving large blocks of data across the transport.
On the lookout for FC2
Sequences and ultimately frames are formed in the FC2 layer as the result of an exchange that's created at the behest of an Upper Layer Protocol (ULP), which is usually SCSI, but could be IP or something else as well. To reiterate, exchanges are composed of one or more sequences, which are composed of one or more frames. Whether they carry commands or data, these exchanges are created to provide the glue between the ULP and the FCP by transferring information units between the originator (initiator) and the responder (target). These information units are protocol (e.g., SCSI) commands and data that are sent during an FC conversation. It's on this FC2 layer conversation that the troubleshooter must eavesdrop to determine where problems are occurring in the SAN.
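To make the exchange, sequence and frame hierarchy concrete, here's a minimal model of it. It's a simplification under stated assumptions: each information unit is carried as one sequence, the maximum frame payload is taken to be 2,112 bytes (a figure not given above), and the class and field names are illustrative rather than a faithful copy of the FC frame header.

```python
from dataclasses import dataclass, field
from typing import List

MAX_PAYLOAD = 2112  # assumed: maximum FC frame data field size in bytes


@dataclass
class Frame:
    seq_cnt: int      # position of the frame within its sequence
    payload: bytes


@dataclass
class Sequence:
    seq_id: int
    frames: List[Frame] = field(default_factory=list)


@dataclass
class Exchange:
    originator: str   # initiator
    responder: str    # target
    sequences: List[Sequence] = field(default_factory=list)

    def add_information_unit(self, data: bytes) -> Sequence:
        """Carry one ULP information unit as one sequence of one or more frames."""
        seq = Sequence(seq_id=len(self.sequences))
        # max(..., 1) ensures that even an empty unit yields a single frame.
        for offset in range(0, max(len(data), 1), MAX_PAYLOAD):
            chunk = data[offset:offset + MAX_PAYLOAD]
            seq.frames.append(Frame(seq_cnt=len(seq.frames), payload=chunk))
        self.sequences.append(seq)
        return seq


if __name__ == "__main__":
    exchange = Exchange(originator="host HBA", responder="storage port")
    exchange.add_information_unit(b"\x00" * 32)    # a small command unit
    exchange.add_information_unit(b"\x00" * 5000)  # a larger data unit
    print([len(s.frames) for s in exchange.sequences])
```

A small command fits in a single frame, while a larger block of data is split across several frames within the same sequence.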
We briefly went over how FC conversations are originated. However, if you're involved in a SAN troubleshooting exercise at this layer, you're more likely to be concerned with why the conversations are ending abnormally. Conversations between FC nodes at the FC2 layer use link control frames to open and maintain connections between them. If you're experiencing errors in the origination, transport or reception of these link control frames, the exchange, sequence(s) and corresponding frames representing the ULP's information units will be discarded or re-sent, depending on the class of service and error recovery mechanisms being used. For example, a frame arriving at a port could be busied or rejected by that port, in which case the corresponding link control frame is sent back to the originator. A frame can be busied or rejected by the receiving port for any number of reasons. For instance, the receiving port will reject a frame whose header requests a service that isn't supported by the fabric, and it will busy a frame if the port isn't fully initialized. It's important to note that not all classes of service require the receiving port to return this error condition to the originating port.
Basically, if acknowledgements are part of the class of service in use, then error conditions will be reported to the originator.
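The busy and reject behavior just described boils down to two small decisions, sketched here. The names are hypothetical and the logic covers only the two example conditions mentioned above; a real port checks many more.

```python
from enum import Enum
from typing import Optional


class Response(Enum):
    BUSY = "busy"
    REJECT = "reject"


def port_response(port_initialized: bool, service_supported: bool) -> Optional[Response]:
    """Return the error condition a receiving port raises, or None if the frame is accepted."""
    if not port_initialized:
        return Response.BUSY      # port isn't fully initialized yet
    if not service_supported:
        return Response.REJECT    # header requests an unsupported service
    return None


def originator_sees(response: Optional[Response], class_uses_acks: bool) -> Optional[Response]:
    """The originator learns of the condition only if the class of service acknowledges frames."""
    return response if class_uses_acks else None


if __name__ == "__main__":
    condition = port_response(port_initialized=False, service_supported=True)
    print(originator_sees(condition, class_uses_acks=True))
    print(originator_sees(condition, class_uses_acks=False))
```

With an acknowledged class of service such as Class 2, the originator is told that the frame was busied. With an unacknowledged class such as Class 3, the same condition goes unreported, and recovery is left to the upper layers.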
This was first published in December 2002