Last month, we discussed some of the best practices involved in chasing down and resolving problems in your storage area network (SAN). (See SAN troubleshooting techniques) These practices relied on the administrator understanding the functionalities available at the five different layers of the Fibre Channel (FC) reference model. This month, we'll cover how to use these practices and tools to uncover and understand common errors occurring in corporate SANs today. In addition, we'll look at how FC analyzers are helping enterprises reduce recovery time by finding out what's going on beneath the fabric.
Let's start to delve into some of the common errors you'll experience in operating your SAN, and how you can gain familiarity with the inner workings of your interconnects by taking methodical approaches to resolution. We'll assume that you're running certified versions of your vendor's firmware and device drivers, so your problems lie elsewhere.
After first checking the LED status of the switches in the data path to verify a good link, log in to the switch to which the affected host is attached and examine the error counters of the relevant port. Of course, if you don't have a good link state (a green light), you'll need to verify the functionality of the switch port by rebooting, running switch diagnostics or using some of your vendor-supplied commands to confirm the viability of the connecting switch ports.
If you have a good link status throughout the data path, then you've pretty much taken FC0 out of the picture as a potential problem point. Now, we're applying the method described last month of starting at the lowest level of the FC reference model and proceeding systematically up the stack to FC4.
As we proceed to FC1, the layer in the reference model responsible for encoding and decoding data and bits, marginal hardware is likely to be the culprit behind intermittent data errors in applications. By their very nature, these are often the hardest errors to track down: They are intermittent and can reside in the components of any one of a number of switches and/or bridges between the application host and its storage. These hardware components consist of host bus adapters (HBAs), gigabit interface converters (GBICs), fiber optic cabling, application-specific integrated circuits (ASICs) and possibly SCSI controllers on an FC/SCSI bridge.
The ASICs in your FC switches are full of information about which particular component in your data path is failing. This information is in the form of error counters. As you view these error counters, you'll see them climb quite naturally because of the bit error rate (BER) inherent in the FC protocol. However, most vendors, if not all, reset these counters during a reboot to give you a baseline from which to calculate the rate at which the counters increment over a given time. If rebooting the switch isn't possible, then depending on the vendor, you can reset the counters on the port without rebooting the switch. Keep in mind that because of FC's BER, transport errors are likely to occur roughly every 16 minutes even without any data on the link.
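The 16-minute figure can be reproduced with a little arithmetic. The sketch below is only an illustration under two assumptions that the article doesn't state: a BER of 1 in 10^12 bits and a 1Gb/s FC line rate of 1.0625 gigabaud. Because an FC link carries fill words even when no data is moving, bits are always on the wire and the error clock keeps ticking. The function names are made up for this example.

```python
# Expected time between bit errors on an FC link.
# Assumptions (not stated in the article): BER of 1e-12 and a
# 1Gb/s FC line rate of 1.0625 gigabaud.

ASSUMED_BER = 1e-12
ASSUMED_LINE_RATE_BAUD = 1.0625e9


def seconds_between_errors(ber: float = ASSUMED_BER,
                           line_rate: float = ASSUMED_LINE_RATE_BAUD) -> float:
    """Average seconds per bit error: one error every 1/ber bits."""
    return 1.0 / (ber * line_rate)


def expected_errors(elapsed_seconds: float,
                    ber: float = ASSUMED_BER,
                    line_rate: float = ASSUMED_LINE_RATE_BAUD) -> float:
    """Bit errors you'd expect from the BER alone over a time window."""
    return elapsed_seconds * ber * line_rate


if __name__ == "__main__":
    interval = seconds_between_errors()
    print(f"{interval:.0f} s, about {interval / 60:.1f} minutes")
```

Under those assumptions the script prints "941 s, about 15.7 minutes", which is in line with the 16-minute rule of thumb. A faster link would shorten the interval proportionally.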
Large numbers of encoding and CRC errors are usually an indicator that marginal hardware between the examined switch port and the initiator is to blame for the intermittent data errors you're experiencing. A large number of encoding errors most likely indicates a problem in the cabling, perhaps a substandard or pinched cable. Large enterprises that splice their own cable are likely to have more marginal errors because of the human element involved. And large numbers of CRC errors usually indicate a marginal GBIC or the ASIC itself.
If the reported error counters in the switch port are in line with FC's BER, then you've successfully cleared FC1.
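The FC1 check described above amounts to taking two readings of a port's counters, comparing the growth with what the BER alone would produce, and mapping any excess to a likely component. Here's a rough sketch of that reasoning. The BER, the line rate and especially the tolerance factor are assumptions chosen for illustration, and `CounterSample` and `triage_fc1` are hypothetical names, not part of any vendor's tooling.

```python
from dataclasses import dataclass
from typing import List

ASSUMED_BER = 1e-12               # assumption: not stated in the article
ASSUMED_LINE_RATE_BAUD = 1.0625e9  # assumption: 1Gb/s FC
TOLERANCE = 10.0                  # hypothetical margin over the BER baseline


@dataclass
class CounterSample:
    seconds: float    # when the counters were read, in seconds
    enc_errors: int   # encoding error counter for the port
    crc_errors: int   # CRC error counter for the port


def triage_fc1(first: CounterSample, second: CounterSample,
               tolerance: float = TOLERANCE) -> List[str]:
    """Compare counter growth between two readings with the BER baseline."""
    elapsed = second.seconds - first.seconds
    if elapsed <= 0:
        raise ValueError("second sample must be taken after the first")
    baseline = elapsed * ASSUMED_BER * ASSUMED_LINE_RATE_BAUD
    limit = max(1.0, baseline * tolerance)
    findings = []
    if second.enc_errors - first.enc_errors > limit:
        findings.append("encoding errors high: suspect cabling")
    if second.crc_errors - first.crc_errors > limit:
        findings.append("CRC errors high: suspect GBIC or ASIC")
    return findings or ["counters in line with BER: move on to FC2"]


if __name__ == "__main__":
    before = CounterSample(seconds=0, enc_errors=0, crc_errors=0)
    after = CounterSample(seconds=3600, enc_errors=500, crc_errors=2)
    print(triage_fc1(before, after))
```

In this made-up example, 500 encoding errors in an hour is far above the handful the BER would explain, so the sketch points at the cabling, while two CRC errors fall within the margin.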
At this point, you might think that you can rule out a hardware problem. But actually the FCP incorporates much of its functionality in hardware, including FC2, which is responsible for frame and sequence development and termination, as well as flow control. That's one difference between IP and FC. Ethernet cards utilize the host's CPU (processing and interrupts) to facilitate the same tasks that FC devices do onboard. To a large extent, this is why FC is more processor-efficient in moving large blocks of data across the transport.
On the lookout for FC2
Sequences and ultimately frames are formed in the FC2 layer as the result of an exchange that's created at the behest of an Upper Layer Protocol (ULP), which is usually SCSI, but it could be IP or something else as well. To reiterate, exchanges are composed of one or more sequences, which are composed of one or more frames. Whether command or data, these exchanges are created to provide the glue between the ULP and the FCP by transferring information units between the originator (initiator) and the responder (target). These information units are protocol (e.g., SCSI) commands and data that are sent during an FC conversation. It's during this FC2 layer conversation that the troubleshooter must eavesdrop to determine where problems are occurring in the SAN.
We briefly went over how FC conversations are originated; however, if you're involved in a SAN troubleshooting exercise at this layer, you're more likely concerned with why the conversations are ending abnormally. Conversations between FC nodes at the FC2 layer use link control frames to open and maintain connections between them. If you're experiencing errors in the origination, transport or reception of these link control frames, the exchange, sequence(s) and corresponding frames representing the ULP's information units will be discarded or re-sent depending on the class of service and error recovery mechanisms being used. For example, when arriving at a port, a frame could be busied or rejected by the port, and the corresponding link control frame will be sent back to the originator. A receiving port can busy or reject a frame for any number of reasons. For instance, the receiving port will reject a frame with a header requesting a service that isn't supported by the fabric, or if the receiving port isn't fully initialized, the frame will be busied. It's important to note that not all classes of service will require the receiving port to return this error condition to the originating port.
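The exchange, sequence and frame hierarchy is easier to keep straight when it's written down as a data structure. The following is a simplified model, not a wire-accurate one: it assumes the standard maximum FC data field of 2,112 bytes per frame, borrows the header field names OX_ID, SEQ_ID and SEQ_CNT for its attributes, and restarts the frame count in each sequence for simplicity. The helper `build_sequence` is a hypothetical name.

```python
from dataclasses import dataclass, field
from typing import List

MAX_PAYLOAD = 2112  # maximum FC frame data field, in bytes


@dataclass
class Frame:
    seq_cnt: int      # position of the frame within its sequence
    payload: bytes


@dataclass
class Sequence:
    seq_id: int
    frames: List[Frame] = field(default_factory=list)


@dataclass
class Exchange:
    ox_id: int        # originator's identifier for the exchange
    sequences: List[Sequence] = field(default_factory=list)


def build_sequence(seq_id: int, information_unit: bytes,
                   max_payload: int = MAX_PAYLOAD) -> Sequence:
    """Split one ULP information unit into the frames of one sequence."""
    chunks = [information_unit[i:i + max_payload]
              for i in range(0, len(information_unit), max_payload)] or [b""]
    return Sequence(seq_id, [Frame(n, c) for n, c in enumerate(chunks)])


if __name__ == "__main__":
    # A made-up exchange: a small command followed by 5,000 bytes of data.
    exchange = Exchange(ox_id=0x1234)
    exchange.sequences.append(build_sequence(0, b"\x00" * 32))
    exchange.sequences.append(build_sequence(1, b"\xff" * 5000))
    for seq in exchange.sequences:
        print(seq.seq_id, [len(f.payload) for f in seq.frames])
```

The 32-byte command fits in a single frame, while the 5,000-byte data unit is carried in three frames of 2,112, 2,112 and 776 bytes. Losing any one of those frames puts the whole sequence at risk, which is why the analyzer traces discussed later are read at this level.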
Basically, if acknowledgements are part of the class of service in use, then error conditions will be reported to the originator.
Congestion control is yet another responsibility of the FC2 layer. While engaged in a Class 2 conversation, the communicating end points will have acknowledgments (ACKs) flowing between them for every frame sent by the initiator. If any of these ACK frames is dropped or delayed beyond the error detect timeout value (E_D_TOV), the associated sequence and/or exchange will be terminated and re-sent. In Class 3 conversations, ACKs aren't sent. Instead, only receiver ready (R_RDY) is sent back from the target to the initiator to indicate that the target's receive buffer has been cleared and is ready for another frame. If it is lost or corrupted, the initiator won't be allowed to send another frame until either a link credit reset (LCR) frame has been sent to the target, or the ULP aborts and resends the sequence.
There are quite a few management frames on the link besides commands and data. And although you can probe the connecting ports to view the error counters associated with these management frames, you won't be able to determine what protocol errors are occurring without an FC analyzer. By placing an FC analyzer on the data path between the originator and responder of an exchange, you can capture a finite number of frames (limited by the amount of memory in the analyzer) destined for the target FC node. As data flows into the analyzer's memory, real-time performance data can be displayed graphically, with counters for protocol errors such as malformed frames and expired timers, as well as for the link and congestion control frames associated with aborted sequences.
Although the analyzer is placed between the two endpoints to make a copy of the frames into its buffer, at no time should the analyzer retime or modify the captured frames in any way. However, the optical signal will be amplified as it's retransmitted out of the analyzer's transmit port, thereby altering any test data related to distance. Data captured at this layer can be viewed directly from memory or exported to a file to be viewed by your vendor's support team.
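The Class 3 behavior described above can be pictured as a simple credit counter on the sending side: each frame sent spends a credit, each R_RDY received gives one back, and a sender that runs out of credit has to wait. The toy model below follows the article's description only. It isn't a faithful implementation of the FC flow control rules, the starting credit of 2 is an arbitrary value, and `CreditedSender` is a hypothetical class name.

```python
class CreditedSender:
    """Toy model of a sender governed by receiver-ready credits."""

    def __init__(self, credit: int):
        self.max_credit = credit  # credit granted at login (assumed value)
        self.credit = credit
        self.sent = 0

    def send_frame(self) -> bool:
        """Send one frame if credit remains; return False when stalled."""
        if self.credit == 0:
            return False
        self.credit -= 1
        self.sent += 1
        return True

    def receive_r_rdy(self) -> None:
        """The receiver freed a buffer, so one credit comes back."""
        if self.credit < self.max_credit:
            self.credit += 1

    def link_credit_reset(self) -> None:
        """Recovery step: restore the credit to its starting value."""
        self.credit = self.max_credit


if __name__ == "__main__":
    sender = CreditedSender(credit=2)
    sender.send_frame()
    sender.receive_r_rdy()        # normal case: the credit is returned
    sender.send_frame()
    sender.send_frame()           # both R_RDYs for these frames are "lost"
    print("stalled:", not sender.send_frame())
    sender.link_credit_reset()
    print("recovered:", sender.send_frame())
```

With every outstanding R_RDY lost, the sender in this model stalls even though the receiver has free buffers, which is the kind of silent hang that port counters alone won't explain and that an analyzer trace makes visible.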
The common services at the FC3 layer can best be understood by looking at them as you would a set of daemon processes in Unix. Whether telnetd, named or routed, these daemon processes have a specific set of tasks to perform in your IP network. Thus, depending on the problem you're experiencing, you'll focus your efforts on the process responsible for facilitating that service.
The same is true in FC networks. For example, the fabric login server (FFFFFE) is responsible for facilitating the login of a port entering the fabric. During this conversation, the port will attempt to log in to the fabric, indicating the classes of service it would like to use on the fabric, its line speed and its hardware revision numbers. This process can be likened to the establishment of a line of communication between an IP host and a telnetd process on another host in the network.
The same analogy applies to named and the name server (FFFFFC), as well as routed and the fabric controller (FFFFFD).
So your method should be: Formulate a hypothesis, associate the failure with a particular service in the fabric and then follow that lead by uncovering the configuration related to that service. As an example, suppose you're trying to connect a new host to your fabric and it doesn't show up as connected in your status display. After checking your LED status and ensuring that you have a good physical connection, mentally trace through the steps and services that a connecting node must go through to be visible on an FC network. In this example, it could be that the capabilities of the host's HBA aren't in compliance with the capabilities of the fabric when compared by the fabric login server. If so, a successful login won't be possible until the capabilities are matched, probably through a firmware or even hardware upgrade.
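The three services and their Unix counterparts mentioned above fit naturally into a small lookup table, which can be handy when you're reading addresses out of an analyzer trace. The table holds only the addresses and analogies given in this article; the `describe` helper is a hypothetical name.

```python
# Well-known fabric addresses named in this article, paired with the
# Unix daemon each one is compared to.
FABRIC_SERVICES = {
    0xFFFFFE: ("fabric login server", "telnetd"),
    0xFFFFFC: ("name server", "named"),
    0xFFFFFD: ("fabric controller", "routed"),
}


def describe(address: int) -> str:
    """Return a readable label for a destination address seen in a trace."""
    entry = FABRIC_SERVICES.get(address)
    if entry is None:
        return f"{address:06X}: not one of the services listed here"
    service, daemon = entry
    return f"{address:06X}: {service} (compare with {daemon})"


if __name__ == "__main__":
    for addr in (0xFFFFFE, 0xFFFFFC, 0xFFFFFD, 0x010200):
        print(describe(addr))
```

A frame addressed to FFFFFE that never gets an accept back, for instance, points you at the fabric login step rather than at the name server or the fabric controller.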
Upper layer protocols
The FC4 layer is responsible for mapping ULP information units onto the FC transport. Each protocol specification is responsible for defining how its command, data and status blocks will be mapped onto the FC network using information categories with defined formats. These protocol mappings usually appear in the guise of device drivers and firmware on the originating host or communicating target. Therefore, should you find yourself chasing down an FC4 layer problem, your attention will be focused on combining the right mix of device drivers and firmware revisions in your SAN.
However, to arrive at the conclusion that it's indeed the FC4 layer that needs your attention, you must rule out the lower layers as possible problem areas. Because vendors aren't typically willing to open code or post a revision for every FC4 layer problem being experienced in the field by a user, it's important to ensure that the lower layers aren't suspect, and that when you call your vendors for support, you have sound technical reasoning behind your conclusion that the problems you're experiencing are related to the FC4 layer.
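The bottom-up elimination used throughout this article can be summed up in a few lines: walk the layers from FC0 to FC4 and stop at the first one you can't rule out. The sketch below is just a way of organizing your notes. The check functions stand in for whatever you actually do at each layer (LED status, counter rates, analyzer traces, service logins), and `first_suspect_layer` is a hypothetical name.

```python
from typing import Callable, Dict, Optional

LAYERS = ["FC0", "FC1", "FC2", "FC3", "FC4"]


def first_suspect_layer(checks: Dict[str, Callable[[], bool]]) -> Optional[str]:
    """Walk the layers bottom-up and return the first one not ruled out.

    A layer with no check counts as suspect, because nothing has
    ruled it out yet. Returns None when every layer passes.
    """
    for layer in LAYERS:
        check = checks.get(layer)
        if check is None or not check():
            return layer
    return None


if __name__ == "__main__":
    # Placeholder results standing in for real checks at each layer.
    results = {
        "FC0": lambda: True,   # link LEDs are green along the data path
        "FC1": lambda: True,   # error counters are in line with the BER
        "FC2": lambda: True,   # analyzer shows no aborted sequences
        "FC3": lambda: True,   # fabric login and name server look healthy
        "FC4": lambda: False,  # driver and firmware mix is unverified
    }
    print(first_suspect_layer(results))
```

Only when the walk reaches FC4 with everything below it passing do you have the technical reasoning your vendor will expect to hear.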
Documenting the common errors in your SAN is also a good idea. Not only will this serve as a valuable knowledge base for your support staff and, one day, as input for event monitoring and self-healing software, but it should also serve you well when negotiating your support contracts with your vendors.