
SAN scalability, part 2: Limit RSCNs in the SAN

Bits & Bytes: This tip discusses problems with RSCNs and name server queries and offers SAN design solutions.

Continuing from part one of this tip, which talked about fabric reconfigurations, a related issue is RSCNs (registered state change notifications), an area close to my heart and one I've heard scary stories about.


The purpose of having a network is to make it easier to change things; to change the network and add or remove devices, without having to shut everything down. This is just as true of Fibre Channel SANs as any other network. It's all about plug and play. However, when things change, some of the devices in that fabric might like to know that things have changed -- that there is some more storage they can access, or that storage they were accessing has gone away.

So, when a device joins the SAN, it has the option of registering to receive these state change messages if it is interested. Then, when something changes in the fabric, every device that registered is told of the change that just happened. So, RSCNs are a fact of life in a SAN. There are actually different levels of state changes, some fabric-wide and some device-specific. However, while the Fibre Channel standards in theory allow some additional levels of granularity, today many devices ask to receive all messages whether they are interested in them or not, and many switches will send all messages to a device if it registered for any, rather than filtering.
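To make the register-then-notify pattern concrete, here is a minimal toy model in Python. It is a sketch only: the Device and Fabric classes and the device names are hypothetical, and nothing here is a real switch or HBA API. It simply assumes a fabric that delivers every state change to every registered device with no filtering.

```python
from dataclasses import dataclass, field


@dataclass
class Device:
    """Hypothetical end device (HBA or storage port) in a toy fabric."""
    name: str
    received: list = field(default_factory=list)

    def on_rscn(self, event: str) -> None:
        # Record the notification; a real device would typically go on
        # to query the name server at this point.
        self.received.append(event)


class Fabric:
    """Toy fabric that tracks which devices registered for state changes."""

    def __init__(self) -> None:
        self.registered = []

    def login(self, device: Device, wants_rscn: bool) -> None:
        # Registration is optional: only devices that ask are notified later.
        if wants_rscn:
            self.registered.append(device)

    def state_change(self, event: str) -> int:
        # Assumption: no filtering, so every registered device gets every event.
        for device in self.registered:
            device.on_rscn(event)
        return len(self.registered)


fabric = Fabric()
host_a, host_b, tape = Device("host_a"), Device("host_b"), Device("tape")
fabric.login(host_a, wants_rscn=True)
fabric.login(host_b, wants_rscn=True)
fabric.login(tape, wants_rscn=False)

notified = fabric.state_change("array_1 port offline")
print(notified, host_a.received, tape.received)
```

In this model the two hosts that registered are notified and the tape device that did not register hears nothing, whether or not the event matters to any of them.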

All of this generates a few challenges, though the scary stories suggest that things are much better now than a few years ago. First, some devices register to receive these messages and then ignore them. Even worse, some devices register to receive these messages and, when they get them, they glitch, causing operating system I/O errors or aborted backup jobs.

The worst problem of all brings me back to the name server. When something changes, most if not all devices are told, and of course all these devices then need to talk to the name server to find out what has changed. The challenge is that the larger the fabric, the bigger this sudden barrage of name server queries. Plus, the bigger the fabric, the more likely there is to be a change, and the more complex and slower the fabric reconfiguration process. Finally, having talked to the name server to see what has changed, each of the many devices may then decide to do port logins to each and every device it can now see, as it did when it first connected to the fabric.
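A rough back-of-the-envelope calculation shows why this matters more as the fabric grows. The sketch below assumes the worst case described above: every device registered for RSCNs, each sends one name server query, and each then does a port login to every other device. The function name and the device counts are illustrative, not measurements from a real fabric.

```python
def cascade_messages(devices: int, visible_per_device: int) -> dict:
    """Rough count of messages triggered by a single state change.

    Assumptions: one RSCN per device, one name server query per device,
    and one port login from each device to every device it can see.
    """
    rscns = devices
    queries = devices
    port_logins = devices * visible_per_device
    return {
        "rscns": rscns,
        "queries": queries,
        "port_logins": port_logins,
        "total": rscns + queries + port_logins,
    }


# Unzoned worst case: every device can see every other device.
small = cascade_messages(devices=20, visible_per_device=19)
large = cascade_messages(devices=400, visible_per_device=399)
print(small["total"], large["total"])  # 420 160400
```

Under these assumptions, a fabric 20 times larger produces roughly 380 times the traffic for one event, because the port login stage grows with the square of the device count.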

In addition, the exact behavior of the process varies between different HBAs and different operating systems -- hence the first rule of zoning: zone by operating system and by HBA vendor. Today this typically only limits the final stage, the massive cascade of devices doing PLOGIs (port logins) to each other, as zoning mostly does not isolate the state change notifications themselves.
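The effect of zoning on that final port login stage can be sketched the same way. The example below assumes, as a simplification, that each device logs in to every other device sharing at least one zone with it. The zone and device names are hypothetical.

```python
from itertools import permutations


def port_login_pairs(zones: dict) -> set:
    """Return the distinct (source, target) port logins the zoning allows.

    Simplifying assumption: after a state change, each device logs in to
    every other device that shares at least one zone with it.
    """
    pairs = set()
    for members in zones.values():
        # permutations(..., 2) yields every ordered pair within the zone.
        pairs.update(permutations(members, 2))
    return pairs


# Hypothetical device names, for illustration only.
one_big_zone = {"everything": ["win1", "win2", "sol1", "sol2", "array"]}
by_os_and_hba = {
    "windows_hba_x": ["win1", "win2", "array"],
    "solaris_hba_y": ["sol1", "sol2", "array"],
}

flat = len(port_login_pairs(one_big_zone))
zoned = len(port_login_pairs(by_os_and_hba))
print(flat, zoned)  # 20 12
```

Note that only the port login count drops here. In this model, as in the text above, the number of state change notifications and name server queries is unchanged by zoning.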

All this sounds a little scary, and in all honesty is usually only a problem when your fabrics reach hundreds of devices rather than tens of devices. In defence of the various vendors, there is a catch-22 situation in that they often have a choice of following the standard or doing something non-standard to improve the situation. In most cases, the vendor could implement the standard in a number of different ways, and sometimes the solution chosen may not be the most helpful.

Design to isolate fabric reconfigurations and limit RSCNs

How does a SAN designer build a large fabric and make it viable for the end user, particularly when the issues I have discussed above are not the only ones limiting scalability? Before any of my friends shout out, I would like to point out that just having a smaller number of big switches and having hot code activation is not a complete solution. Hot code activation eliminates, or more accurately limits, one cause of fabric reconfiguration. Similarly, a smaller number of switches may reduce the number of ISLs but does not eliminate them completely, particularly given the number of end users spanning their SANs across sites. Most importantly, when a problem does occur, the size of the problem stems from the number of devices more than anything else.

So, hot code activation is useful, including hot code activation on edge switches, as switch reboots at the edge will cause fabric reconfigurations just as surely as those at the core. Good design requires thinking about where and how to implement ISLs. The next step is to look closely at your zoning schema in order to limit some of the activities that happen. After that, you do need to look at some of the horrible low-level configuration options on the HBAs and switches as in some cases tuning the behavior of these devices helps isolate and control the worst excesses of the problems.

In the final analysis, there is a reason for SAN islands and indeed dual-fabrics. There are not many problems I have talked about here that can span across multiple fabrics and between SAN islands. Before those people implementing dual fabrics get too confident -- please remember that the servers and storage are connected to both fabrics and so there are cases where problems hit both at once.

There are still a couple of problems with SAN islands: management and connectivity. Management can be fixed in a number of ways, including implementing multiple SAN islands within a single chassis. However, this again increases the possibility of an event hitting multiple SAN islands at the same time, and requires using one of the newer SAN management applications that can manage multiple SAN islands.

Connectivity is something of a show stopper. The idea of a SAN is connectivity with flexibility. It is vital for customers with larger SANs or smaller SANs that span between buildings or data centers to have a mechanism to provide connectivity without ending up with a single fabric -- ultimately, a single fabric leads to the problems I have been talking about.

This is why these days you will see a lot more talk about iFCP, which provides connectivity and routing without leading to a single SAN fabric, and about devices like the Nishan Multiprotocol Router, which provides a form of SAN subnetting, routing and internetworking capability. Such a device provides connectivity between SAN islands (whether they are within a single chassis or not) without generating a single large and potentially unsupportable fabric. And you get iSCSI support in the same box as well.


About the author: Simon Gordon is a senior solution architect for McDATA based in the UK. Simon has been working as a European expert in storage networking technology for more than 5 years. He specializes in distance solutions and business continuity. Simon has been working in the IT industry for more than 20 years in a variety of technologies and business sectors, including software development, systems integration, Unix and open systems, Microsoft infrastructure design and storage networking. He is also a contributor to and presenter for the SNIA IP-Storage Forum in Europe.
