SAN scalability, part 1: Account for fabric reconfigurations in the SAN

Bits & Bytes: Fabric reconfigurations happen, and it's important to understand why they sometimes occur and sometimes do not, says contributor Simon Gordon.

Having started my career as an electrical engineer and machine code programmer, I have long believed that understanding the general, low-level details of how something works ultimately helps you use it at a higher level. While you don't need to be an expert on the "Fibre Channel Bench Reference," there are a number of technical details that you should understand in order to work with larger, more complex storage area networks (SANs).

In my zoning tips, I spoke not just about zoning but also about the simple name server (usually now referred to as the name server). In part one of this tip, I will talk about fabric reconfigurations. Part two will then discuss the thorny issue of RSCNs (registered state change notifications). For the more technically minded, I should warn you that I will simplify a bit.

As a reminder, when a device joins the SAN, it goes through four steps. It performs a fabric login to register itself and obtain a 24-bit Fibre Channel address. It registers with the name server so that the device and its capabilities are recognized. It asks the name server for a list of the devices it is allowed to see (controlled by zoning). Finally, it performs a port login to each of those devices to find out what it can access.
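The join sequence can be sketched as a toy model. This is only an illustration under stated assumptions: `ToyFabric`, `join_fabric` and the WWPN strings are hypothetical names, the address is built as domain, area and port bytes with the area byte simply counting logins, and no real Fibre Channel frames are exchanged.

```python
from dataclasses import dataclass, field
from itertools import count


@dataclass
class ToyFabric:
    """Hypothetical stand-in for the fabric services a device talks to."""
    domain_id: int = 1
    zones: list = field(default_factory=list)        # each zone is a set of WWPNs
    name_server: dict = field(default_factory=dict)  # WWPN -> 24-bit address
    _next_port: count = field(default_factory=lambda: count(0))

    def fabric_login(self, wwpn):
        # Step 1: fabric login hands back a 24-bit address.
        # Assumed layout: domain (8 bits) | area (8 bits) | port (8 bits).
        port = next(self._next_port)
        return (self.domain_id << 16) | (port << 8)

    def register(self, wwpn, address):
        # Step 2: register the device with the name server.
        self.name_server[wwpn] = address

    def query(self, wwpn):
        # Step 3: the name server only reveals registered devices
        # that share at least one zone with the caller.
        visible = set()
        for zone in self.zones:
            if wwpn in zone:
                visible |= {w for w in zone if w in self.name_server}
        visible.discard(wwpn)
        return sorted(visible)


def join_fabric(fabric, wwpn):
    """Run the four steps and return (address, port-login targets)."""
    address = fabric.fabric_login(wwpn)
    fabric.register(wwpn, address)
    peers = fabric.query(wwpn)
    # Step 4: a real device would now do a port login to each peer;
    # here the peer list itself stands in for those logins.
    return address, peers


fabric = ToyFabric(zones=[{"host_a", "array_1"}, {"host_b", "array_1"}])
join_fabric(fabric, "array_1")
print(join_fabric(fabric, "host_a"))
```

In this sketch the two hosts never see each other, because they share no zone, while the array sees both once they have registered.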

Principal switches

For most functions in a Fibre Channel fabric, services are distributed and you do not need to worry about how and where they are implemented. For a very limited number of functions -- most notably allocating domain IDs to new switches in the fabric -- it is the principal switch that does the work.

Now, I cannot emphasize enough that for almost all aspects of SAN design you really should not worry about this, or about which switch is the principal switch. When a fabric is formed, or two fabrics are joined, an election process occurs and one switch becomes the principal switch. The Fibre Channel standards do describe a mechanism for setting a preference for some switches to become the principal. While all switches following the standards have to honor this priority setting, only some switches actually allow you to set it. The only time I have seen people use configuration parameters to control this is in heterogeneous fabrics (fabrics built from different brands of switches). In particular, some external management software assumes that in a heterogeneous fabric the principal switch is on one side or the other, and problems can occur if this is not the case.
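To make the election concrete, here is a minimal sketch. It assumes the rule is "lowest priority value wins, lowest switch WWN breaks ties", which simplifies what the standards describe; the function name `elect_principal`, the switch names and the WWN values are all hypothetical.

```python
def elect_principal(switches):
    """Return the name of the switch that wins the election.

    `switches` maps a switch name to a (priority, wwn) tuple.
    Assumption: the lowest priority value wins, and the lowest
    WWN breaks a tie between equal priorities.
    """
    # Tuples compare element by element, so priority is checked first
    # and the WWN only matters when priorities are equal.
    return min(switches, key=lambda name: switches[name])


fabric = {
    "edge_1": (254, 0x100000051E000001),
    "core_1": (2, 0x100000051E0000AA),
    "core_2": (2, 0x100000051E000010),
}
print(elect_principal(fabric))
```

Giving one switch a lower priority value than all the others is how you would steer the result in this model, which is the kind of preference setting described above.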

Fabric reconfigurations

Once the principal switch has been elected, the fabric establishes a defined set of routes between the principal switch and every other switch. When looking at your switches you will see references to upstream and downstream ISLs (inter-switch links). These are the ISLs that form the routes from each switch to the principal switch and from the principal switch to every other switch -- otherwise referred to as principal ISLs.
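One way to picture the principal ISLs is as a tree rooted at the principal switch. The sketch below builds such a tree with a breadth-first search. Picking the fewest-hops path is an assumption made for illustration, not how any particular vendor selects its upstream links, and `principal_isls` and the switch names are hypothetical.

```python
from collections import deque


def principal_isls(isls, principal):
    """Map each non-principal switch to its upstream neighbor.

    `isls` is a list of (switch_a, switch_b) pairs. Each entry in the
    result, child -> parent, represents one principal ISL.
    """
    neighbors = {}
    for a, b in isls:
        neighbors.setdefault(a, []).append(b)
        neighbors.setdefault(b, []).append(a)

    upstream = {}
    seen = {principal}
    queue = deque([principal])
    while queue:
        current = queue.popleft()
        for nxt in neighbors.get(current, []):
            if nxt not in seen:
                seen.add(nxt)
                upstream[nxt] = current  # this ISL joins the tree
                queue.append(nxt)
    return upstream


isls = [("core", "edge1"), ("core", "edge2"),
        ("edge1", "edge2"), ("edge2", "edge3")]
print(principal_isls(isls, "core"))
```

In this example the edge1 to edge2 link carries traffic but is not part of the tree, so it is not a principal ISL.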

Again, in most respects, you really do not need to worry about this at all, except that it helps to understand fabric reconfigurations and why they sometimes happen and sometimes do not. Quite simply, if an ISL that is not a principal ISL breaks, you will not get a fabric reconfiguration, because the management routes needed to reach the principal switch are all still in place. Of course, some other traffic going through the fabric may be rerouted. In this case, there will be some very slight disruption to normal traffic flow, but at a level that will have little or no significant effect.

If, however, one of the principal ISLs is broken, then the fabric needs to re-evaluate how all the switches can communicate with each other -- hence you get a fabric reconfiguration. The Fibre Channel standards dictate a 10-second delay between losing a principal ISL and actually doing the reconfiguration, which is why you may see a countdown (10, 9, 8...) before things reform.
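The rule in the last two paragraphs reduces to a membership test. The following sketch assumes you already know which links are principal ISLs; `failure_effect` and the switch names are hypothetical, and the 10-second figure is simply the delay quoted in the text.

```python
RECONFIG_DELAY_SECONDS = 10  # delay quoted in the text, not measured here


def failure_effect(principal_links, failed_link):
    """Classify an ISL failure as (effect, delay in seconds).

    `principal_links` is a set of frozensets, one per principal ISL,
    so that ("a", "b") and ("b", "a") name the same link.
    """
    if frozenset(failed_link) in principal_links:
        return "fabric reconfiguration", RECONFIG_DELAY_SECONDS
    # A non-principal ISL failing only causes traffic to be rerouted.
    return "reroute only", 0


principal_links = {
    frozenset(("core", "edge1")),
    frozenset(("core", "edge2")),
}
print(failure_effect(principal_links, ("edge1", "core")))
print(failure_effect(principal_links, ("edge1", "edge2")))
```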

You will get the same effect if you add a new switch or merge two fabrics. This is simply how Fibre Channel works: all the switch vendors follow the Fibre Channel standards, and traditionally the only way to avoid or limit these reconfigurations has been to build SAN islands. No amount of multiple ISLs, trunking or even hot code activation can eliminate all the causes of fabric reconfigurations.

Interestingly enough, marketing to the contrary, I have yet to see an implementation of trunking that reduces the chance of fabric reconfiguration. To be clear, trunking does improve load balancing between ISLs, and may eliminate or reduce the normal small delays of rerouting if a non-principal ISL fails. However, in the implementations I have seen, one specific physical connection is still treated as the principal ISL, and if it fails you still get a fabric reconfiguration.

One simple rule to remember is that the larger the fabric, particularly in terms of number of switches, the longer it takes for a fabric to settle down through one or more reconfigurations when there is a problem.

Long distance fabrics

It is increasingly popular to extend a SAN over long distances. You can do this with optical solutions, such as extended wavelength gigabit interface converters (GBICs) or small form-factor pluggable (SFP) transceivers, with dense wavelength division multiplexing (DWDM), or with FC over IP (FCIP). In every case, the result is a single fabric extended over some other infrastructure. Apart from the performance implications of long distance and latency in the underlying network, this has implications for fabric reconfigurations.

The first problem is that a fabric reconfiguration, with its election process and so on, is happening in a physically stretched SAN. This means that all physical locations (for example, primary and secondary data centres) are affected by fabric reconfigurations, and it will usually take longer for such an extended fabric to stabilize after a problem.

In addition, problems in the technology used to extend the fabric can themselves trigger fabric reconfigurations. This could be a signal strength issue, in the case of a simple stretched link using extended wavelength optics, or a network problem (possibly just a temporary glitch), in the case of DWDM or FCIP.

The solution?

Fabric reconfigurations are a fact of life. When designing your fabric, you should think about how large to make your SAN islands and how best to handle multi-site implementations. Options that are becoming more popular are SAN routing, subnetting and internetworking -- whereby you have some form of routed connection between separate SAN islands while keeping them as separate fabrics. Nishan has been selling such a solution for some time, leveraging our work with iFCP.

About the author:

Simon Gordon is a senior solution architect for McDATA based in the UK. Simon has been working as a European expert in storage networking technology for more than 5 years. He specializes in distance solutions and business continuity. Simon has been working in the IT industry for more than 20 years in a variety of technologies and business sectors including software development, systems integration, Unix and open systems, Microsoft infrastructure design as well as storage networking. He is also a contributor to and presenter for the SNIA IP-Storage Forum in Europe.
