Published: 02 Feb 2003
When you expand your storage area network (SAN), you'll discover just how well your architecture accommodates growth while preserving performance, availability and manageability. There are at least three critical growth-related abilities your switching architecture should provide:
- The ability to add ports in a hurry without sacrificing performance and without requiring a fundamental redesign;
- The ability to scale with little or no downtime; and
- The ability to abstract and control performance and security parameters.
Availability through redundancy
The SilkWorm 12000 chassis contains two logical 64-port switches, each accommodating up to four 16-port blades. Both switches are supported by a pair of clustered control processors (CPs), each running a Linux kernel with Brocade's Fabric OS layered on top as an application. However, Brocade requires that both CPs run the same version of Fabric OS to ensure that you regain the same functionality after failover. The CPs are configured active/standby with watchdog timers. The High-Availability Manager on each CP listens for a heartbeat message from the other, sent via the User Datagram Protocol (UDP) over the backplane.
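The heartbeat-and-watchdog pattern described here is generic, so it can be sketched in a few lines. The class below is an illustrative model only (not Brocade code, and the class and method names are mine): a standby processor records each heartbeat it hears from its peer and declares the peer failed when no heartbeat arrives within the timeout.

```python
class HeartbeatWatchdog:
    """Illustrative model of an active/standby heartbeat watchdog.

    The standby CP records each heartbeat heard from its peer; if no
    heartbeat arrives within `timeout_s`, the peer is presumed dead
    and a failover would be triggered.
    """

    def __init__(self, timeout_s=5.0):
        self.timeout_s = timeout_s
        self.last_heartbeat = None  # time of most recent peer heartbeat

    def record_heartbeat(self, now):
        self.last_heartbeat = now

    def peer_failed(self, now):
        # No heartbeat yet, or the last one is older than the timeout.
        if self.last_heartbeat is None:
            return True
        return (now - self.last_heartbeat) > self.timeout_s


# Heartbeats arriving every second keep the peer "alive"; a gap
# longer than the timeout indicates failure.
wd = HeartbeatWatchdog(timeout_s=5.0)
for t in (0.0, 1.0, 2.0):
    wd.record_heartbeat(t)
print(wd.peer_failed(3.0))   # last heartbeat only 1s ago
print(wd.peer_failed(9.0))   # 7s of silence exceeds the 5s timeout
```

In the real system the heartbeat travels as a UDP datagram over the backplane; the timeout value here is arbitrary and chosen only for illustration.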
The first CP to boot during power up becomes the active CP and is responsible for managing the power-on self test (POST) of all field-replaceable units, running the Fabric OS for both of the logical switches, hardware zoning and routing. The active CP is also responsible for presenting the IP addresses of the logical switches to the LAN. The standby CP runs a scaled-down version of the Fabric OS, so only a limited number of administrative functions are available on it.
When the High-Availability Manager detects a timeout of one of its managed timers, it reboots the standby CP with the full version of Fabric OS. After rebooting, the now-active control processor takes over the failed CP's responsibilities, including presenting the logical switches' IP addresses to the LAN. Once the reboot has completed, the attached Nx_Ports need to log back into the fabric.
The two logical switches share the active control processor, and failing over control processors incurs a 30-second delay. The same version of the Fabric OS also runs on both CPs. For these reasons, when designing high-availability SANs using the 12000, dual fabrics should be implemented using two separate chassis. The requirement for implementing a highly available core using two chassis doesn't adversely affect infrastructures requiring dual sites, and could even lower the cost of establishing a mirrored site. For single data centers without a mirror site, however, the 30-second failover delay needs to be addressed.
Blade technology lets you perform maintenance without bringing down the entire logical switch or chassis. Other high-availability features of the SilkWorm 12000's design include four power supplies and three blowers, all of which are field replaceable. However, the SilkWorm 12000 requires two dedicated circuits protected by circuit breakers. Each power source supplies power to two of the four power supplies.
Designing for scalability
Many users have no immediate need for all of the SilkWorm 12000's capacity. Initially, look to deploy a single SilkWorm 12000 with the minimum of 32 ports in a single logical switch. As your port and availability requirements increase, you can add blades or add a second SilkWorm 12000 chassis to your SAN configuration, giving you the dual fabric configuration that Brocade recommends.
As part of a core/edge network design, you can add ports easily and predict the effects on performance and on the reconfiguration of routing tables in the SAN. If you're managing a large SAN consolidation effort without application-specific information about performance or long-term storage requirements, the core/edge design is most suitable because those unknowns will have less impact.
The core/edge design--combined with the interswitch link (ISL) trunking feature found in the 12000--lets you expand the aggregate performance of your SAN while maintaining the desired high-availability features. ISL trunking aggregates the speed and functionality of two to four ports on the same ASIC when they're connected to a matching group of two to four ports on a common ASIC in another switch that also supports ISL trunking. The overall ISL speed is the auto-sensed port speed multiplied by the number of ports in the trunk group (two to four).
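The trunk-speed arithmetic is simple enough to state as a one-line function. The snippet below is an illustrative sketch (not a Brocade tool; the function name is mine) that applies the multiplication described above and enforces the two-to-four-port trunk-group limit.

```python
def trunk_bandwidth_gbps(port_speed_gbps, ports_in_trunk):
    """Aggregate bandwidth of an ISL trunk group.

    A trunk group contains two to four ports on a common ASIC, and
    the group's speed is the auto-sensed per-port speed times the
    number of ports in the group.
    """
    if not 2 <= ports_in_trunk <= 4:
        raise ValueError("an ISL trunk group contains two to four ports")
    return port_speed_gbps * ports_in_trunk


# Four 2Gb/s ports trunked together present one 8Gb/s logical ISL.
print(trunk_bandwidth_gbps(2.0, 4))
```

For example, a full quad of 2Gb/s ports yields an 8Gb/s logical link, while a two-port group at the same speed yields 4Gb/s.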
In terms of functionality, should you lose an individual ISL (from a bad cable, for example), reconvergence of routes in the routing table isn't necessary because the entire ISL group is provisioned over a single logical cable. Therefore, when provisioning ports on the core SilkWorm 12000, ISL trunks are formed in quads. As a result, don't provision ports from the same quad to different switches unless you're running out of ports.
Large installations generally require high-availability designs, so a standard should be in place specifying that no switch will be added to the core with fewer than two ISLs unless justified and agreed upon by the application owners. Also, consider connecting only edge switches to the core SilkWorm 12000, not storage devices or servers. Using core ports for attaching hosts and storage limits your ability to scale and could reduce your ISL trunking options.
However, if after connecting your enterprise storage array and tape library you have sufficient ports to scale your SAN into the foreseeable future, placing your storage devices on the core may yield predictable performance results, easing locality of reference design decisions. But remember to connect individual devices with multiple ports to different blades to prevent a single blade failure or removal from taking the device off the air.
Intelligence eases management
I've found the core/edge design to be the easiest to perform a knowledge transfer on and the easiest to manage of all the possible design strategies for large SANs.
As for ongoing management, intelligence in the core is key for all enterprise SAN implementations tracking SLAs. I found the SilkWorm 12000's ability to provide end-to-end performance metrics an essential beginning to implementing quality of service policies in the SAN.
Security management benefits from Brocade's third-generation ASIC technology. The administrator gains software zoning capabilities alongside hardware zoning security via access lists that are controlled by the ASIC hardware.
Besides performance and zoning enhancements, security in the SAN can be increased by management applications interacting with Brocade's API to prevent or allow only certain source IDs access to the resources on the core. At the root of this functionality is frame filtering. Frame filtering is the interrogation of the Fibre Channel frame header, using its values as input to counters or other management functions. Because no frame can traverse the fabric without a header, having visibility into the header opens the door for all types of management applications.
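The source-ID check at the heart of this kind of filtering can be sketched generically. The snippet below is a software illustration of the idea only (the ASIC does this in hardware, and the function names here are mine): it pulls the 24-bit S_ID out of a Fibre Channel frame header, where word 1 of the standard header carries CS_CTL in its top byte and S_ID in the low three bytes, and admits the frame only if that S_ID appears on an access list.

```python
def extract_s_id(header):
    """Extract the 24-bit source ID (S_ID) from a Fibre Channel
    frame header. Word 0 holds R_CTL and D_ID; word 1 holds CS_CTL
    in byte 4 and S_ID in bytes 5-7."""
    return int.from_bytes(header[5:8], "big")


def frame_permitted(header, allowed_s_ids):
    """Illustrative access-list check in the style of hardware
    zoning: admit the frame only when its S_ID is on the list."""
    return extract_s_id(header) in allowed_s_ids


# Hypothetical 24-byte header whose S_ID is 0x010203
# (domain 1, area 2, port 3).
hdr = bytes([0x22, 0x65, 0x04, 0x03, 0x00, 0x01, 0x02, 0x03]) + bytes(16)
print(frame_permitted(hdr, {0x010203}))
```

Counters keyed on other header fields (D_ID, exchange IDs and so on) follow the same pattern, which is why header visibility enables such a broad range of management functions.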
Does it add up?
The SilkWorm 12000 met my expectations in terms of scalability and manageability. Frame filtering, trunking and solid architectural design are the underlying strengths that provide those characteristics. As for availability, the 30-second delay in the failover of the control processors needs to be improved for highly available, single-SAN solutions, an improvement Brocade is reportedly working on. As a result, my overall rating for this product is 94 out of 100.