Fibre Channel switches need to communicate and cooperate with each other to manage the overall fabric. The best way to ensure that this happens reliably is to select a switch from one of the top three switch vendors: Brocade Communications Systems, Cisco Systems and QLogic.
"There's a standard for this communication [between the switches], but the standard is kind of a weak, least common denominator of the functions required to build a SAN," said Robert Passmore, an analyst at Gartner. "All of the switch vendors have a much more robust overall set of management functions that are proprietary to [each of them]."
Some of the best practices that are common to all of the Fibre Channel switching environments are listed below, grouped in categories.
Plan your SAN for what you expect to need over the next three years.
Project your future needs based on the number of applications, physical servers and storage in use during the past two years. Take into account new technologies that may be deployed, such as virtual servers. Think about the impact different components will have on the overall environment.
"Whatever you think you're going to need over the next three years, double it and build it for that," says Marc Staimer, president
Not planning your SAN upfront is "a nightmare of immense proportions," according to Staimer. "The more you plan, the less rework you will have."
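The projection described above can be sketched in a few lines of Python. This is only an illustration: the function name `projected_ports`, the compound-growth assumption and the sample port counts are hypothetical, and the default safety factor of 2 simply mirrors Staimer's "double it" advice.

```python
import math

def projected_ports(ports_two_years_ago: int, ports_today: int,
                    years_ahead: int = 3, safety_factor: float = 2.0) -> int:
    """Project port demand from two years of history, then pad the result."""
    # Assumption: demand keeps compounding at the annual rate
    # observed over the past two years.
    annual_growth = (ports_today / ports_two_years_ago) ** 0.5
    forecast = ports_today * annual_growth ** years_ahead
    # safety_factor=2.0 follows the "double it" rule of thumb.
    return math.ceil(forecast * safety_factor)

# Hypothetical site: 100 ports in use two years ago, 144 today.
print(projected_ports(100, 144))
```

The same calculation can be repeated for storage capacity or server counts; a site that expects a step change, such as a virtualization rollout, would need to adjust the growth rate by hand.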
Determine application throughput and I/O to size and design the SAN environment most cost effectively.
Understanding the applications and knowing their throughput will determine what type of ports (oversubscribed or full throughput) will work best and how to build out the SAN design to use bandwidth most cost-effectively. Many users opt for a core/edge design, often with 16-port or 32-port switches at the edge going into a bigger director switch, connected via interswitch links (ISLs).
"You need to know your throughput on all your edge switches to connect the appropriate amount of ISL to your director," Iacono says. Minimizing the ISL count frees up switch ports for servers and storage, which puts more of the SAN budget toward usable ports.
Companies with mature SANs may discover they need to shift an especially high-throughput application from an edge switch directly into the director to reduce hops and move it closer to the storage.
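As a rough sketch of the ISL sizing Iacono describes, the number of links an edge switch needs can be estimated from its measured peak throughput. The function name `isl_count`, the 80% target utilization and the sample figures are assumptions made for illustration, not vendor guidance.

```python
import math

def isl_count(edge_peak_gbps: float, isl_speed_gbps: float,
              target_utilization: float = 0.8) -> int:
    """Estimate how many ISLs an edge switch needs to reach the director."""
    # Assumption: plan each ISL to run at no more than target_utilization
    # of its line rate at peak.
    usable_per_isl = isl_speed_gbps * target_utilization
    return max(1, math.ceil(edge_peak_gbps / usable_per_isl))

# Hypothetical edge switch peaking at 14 Gbps, with 8 Gbps ISLs.
print(isl_count(14, 8))
```

An edge switch that needs far more ISLs than its neighbors is a hint that one of its applications may belong on the director itself.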
Don't be afraid of oversubscription.
Most servers don't require the full bandwidth of a Fibre Channel switch, so it's common practice to oversubscribe, that is, to attach more potential demand than the switch can carry at once, because statistically the servers are unlikely to need it all at the same time.
Still, Howard Goldstein, president of Howard Goldstein Associates, finds that administrators "tend to be conservative when they don't need to be." He notes that, in most SAN environments, "you're using one-tenth of the capacity of the switch port."
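The arithmetic behind oversubscription is simple enough to sketch. In the example below, the function names and the port counts are hypothetical; the 10% average utilization is taken from Goldstein's one-tenth observation and will not hold for every workload.

```python
def oversubscription_ratio(host_ports: int, host_port_gbps: float,
                           uplink_ports: int, uplink_gbps: float) -> float:
    """Ratio of total host-facing bandwidth to total uplink bandwidth."""
    return (host_ports * host_port_gbps) / (uplink_ports * uplink_gbps)

def expected_demand_gbps(host_ports: int, host_port_gbps: float,
                         avg_utilization: float = 0.10) -> float:
    """Expected aggregate demand if ports run at avg_utilization of line rate."""
    return host_ports * host_port_gbps * avg_utilization

# Hypothetical edge switch: 24 host ports at 4 Gbps, 4 uplinks at 8 Gbps.
ratio = oversubscription_ratio(24, 4, 4, 8)
demand = expected_demand_gbps(24, 4)
uplink_capacity = 4 * 8
print(f"{ratio:.1f}:1 oversubscribed, fits in uplinks: {demand <= uplink_capacity}")
```

Even at a 3:1 ratio, the expected demand in this example stays well under the uplink capacity, which is the statistical argument for not fearing oversubscription.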
Assess power consumption and cooling requirements in advance.
Technology vendors often consolidate their offerings into the smallest possible packages, but customer sites can't always handle them. Most requests for proposals come with questions about power consumption, according to Mario Blandini, director of product marketing in Brocade's data center infrastructure division.
"You'd be surprised at how many IT environments literally have no more additional electrical capacity," Blandini says. "Most [hospital or university] buildings were built 75 years if not 100-200 years ago. And when they put the electricity in, no one ever fathomed you would be consuming in a 19-inch square space 10,000 W of electricity."
Build two independent Fibre Channel fabrics for redundancy.
A SAN needs to be up 24/7. The more servers the SAN supports, the higher the consequences of failure. To make sure the SAN never goes down, there need to be two paths from the servers to the storage.
If there's a failure along one of the paths -- with an HBA, switch, cable, port or anything -- the other path allows the application and its storage to continue to communicate. Another benefit is that upgrades can be done while the SAN is operating.
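One way to check that two paths really are independent is to list every component on each path and look for overlap. The sketch below assumes a hand-maintained inventory; the component names are hypothetical, and the shared array itself is deliberately left out of the lists because both paths end there by design.

```python
def shared_components(path_a: list[str], path_b: list[str]) -> set[str]:
    """Return components on both paths; each one is a single point of failure."""
    return set(path_a) & set(path_b)

# Hypothetical server with one HBA per fabric.
fabric_a = ["hba0", "edge-switch-a1", "director-a", "array-port-0"]
fabric_b = ["hba1", "edge-switch-b1", "director-b", "array-port-1"]
print(shared_components(fabric_a, fabric_b))  # empty set: paths are independent

# A miscabled second path that still runs through director-a.
miscabled = ["hba1", "edge-switch-b1", "director-a", "array-port-1"]
print(shared_components(fabric_a, miscabled))
```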
"Fibre Channel is a disruptive technology," Staimer says. "Anything you change, anything you add, whatever you do to your system, will disrupt the application using it at that time. So what you do is you force them onto one fabric while you make your change on the other one. You're the least disruptive when you have dual fabrics."
Management: The technical side
Deploy path management software to automatically switch the I/O request from one path to another in the event one path fails.
Some operating system environments provide basic capability. Some storage vendors have their own path management software that may cost more, but it offers additional features that may make it worthwhile, Passmore says.
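The core behavior of path management software can be modeled in a few lines. This is a toy model, not any vendor's or operating system's implementation; the class name `MultipathDevice` and the path labels are hypothetical, and real products add load balancing, health probing and automatic failback.

```python
class MultipathDevice:
    """Toy model of a LUN reachable over several paths, in preference order."""

    def __init__(self, paths: list[str]):
        # dict preserves insertion order, which serves as path preference.
        self.healthy = {path: True for path in paths}

    def mark_failed(self, path: str) -> None:
        self.healthy[path] = False

    def mark_restored(self, path: str) -> None:
        self.healthy[path] = True

    def select_path(self) -> str:
        """Return the first healthy path, or fail if none is left."""
        for path, ok in self.healthy.items():
            if ok:
                return path
        raise RuntimeError("no healthy path to storage")

lun = MultipathDevice(["fabric-a", "fabric-b"])
lun.mark_failed("fabric-a")   # e.g. an HBA or switch failure on fabric A
print(lun.select_path())      # I/O continues over the surviving path
```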
Set up, tune and monitor hardware and performance alerts.
HP's Iacono remembers a large consulting company that got 6,000 alerts per day and didn't do anything with them. One switch vendor used to have a default alert set to go off whenever the SAN hit 0 MBps. That could trigger a thousand emails per day.
"You simply had to turn that off," he says.
But even just a few hardware bit-level errors are cause for concern, since that could signal an impending failure. "About 95% of failure rate in SANs, we're seeing [alerts] beforehand, but the alerting was not addressed," Iacono says. "If you're getting too many alerts, maybe you need to tune your alerting environment to get rid of the erroneous errors, or maybe there's a real issue that you need to address."
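Tuning the alerting environment amounts to suppressing the kinds of alerts known to be noise so that the remainder gets read. The sketch below assumes alerts arrive as simple records with `switch` and `kind` fields; those field names, the alert kinds and the function name `triage` are all hypothetical.

```python
from collections import Counter

def triage(alerts: list[dict], suppressed_kinds: set[str]) -> Counter:
    """Drop known-noise alert kinds and count the rest per (switch, kind)."""
    kept = [a for a in alerts if a["kind"] not in suppressed_kinds]
    return Counter((a["switch"], a["kind"]) for a in kept)

# Hypothetical day of alerts: idle-link noise plus a few bit errors.
alerts = (
    [{"switch": "edge-1", "kind": "throughput_zero"}] * 1000
    + [{"switch": "edge-2", "kind": "bit_error"}] * 3
)
for (switch, kind), count in triage(alerts, {"throughput_zero"}).items():
    print(switch, kind, count)
```

With the idle-link alerts suppressed, the three bit errors on one switch are no longer buried under a thousand emails.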
Back up the SAN configuration information to a hard drive not on the SAN.
SANs don't go down much, but when they do, they go down hard. If the SAN documentation is backed up on a server connected to a network drive, and that drive is over the SAN, the storage team will lose the information it needs to restore the systems.
"I could tell you Fortune 50 companies that do this," Iacono says. "It's amazing."
Many companies don't even have updated documentation. They often start with an Excel spreadsheet and the best intentions, and then rarely update it because they have more pressing responsibilities.
"If they have to troubleshoot something, they have no idea what's connected to what port," Iacono says. "I'd say everyone has some sort of documentation. Probably 50% to 70% [of it] isn't up to date."
Management: The personnel side
Employ a dedicated storage team and rigid change management procedures.
When a SAN goes down, it's usually because of human error. Strict change management policies reduce the chances that will happen. So does a dedicated storage team that manages the systems proactively.
Server administrators need to communicate and coordinate their needs with the storage group, which handles the storage design. The storage pros write down the process steps and setup instructions, including the actions on the storage array and the switches. Ideally, another storage specialist reviews the change design and quality assurance is done.
"Organizations that follow these kinds of processes are the ones that, in essence, go year in and year out without ever having a failure in the SAN," Gartner's Passmore says.
Set separate user accounts and passwords for each administrator and third-party consultant with access to the SAN.
It's not uncommon for an administrator with a new SAN switch to tweak parameters and not tell colleagues, Iacono says. When he finds a switch configured differently and asks what happened, he usually hears that "Joe was doing this and Steve was doing that."
According to Iacono, "Once you create accountability, all that disappears. We want to be able to audit who's doing what."
Not only will the IT group be able to determine the source of any problems, it won't need to reset the universal password when a SAN administrator leaves the company.
Create zones at the same time LUN masking and binding are done.
When storage is created for a new server, tools are used to carve out a storage volume and give it an address, or a SCSI LUN. LUN masking hides the LUN from entities that don't own it; LUN binding attaches the LUN only to the worldwide ID of the HBA in the server.
At the same time that LUN masking and binding are done, a storage specialist should go into the switch and create a zone that will allow only specified adapters to talk to certain storage ports.
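A simple cross-check can catch masking entries that were created without a matching zone. The sketch below assumes both configurations have been exported into plain Python dictionaries; the data layout, the function name `unzoned_initiators` and the shortened worldwide names are hypothetical.

```python
def unzoned_initiators(masking: dict[str, set[str]],
                       zones: dict[str, set[str]]) -> list[tuple[str, str]]:
    """Find (initiator, storage_port) pairs that are masked but share no zone.

    masking: storage port WWPN -> initiator WWPNs allowed to see LUNs on it
    zones:   zone name -> member WWPNs
    """
    missing = []
    for storage_port, initiators in masking.items():
        for initiator in sorted(initiators):
            pair = {storage_port, initiator}
            if not any(pair <= members for members in zones.values()):
                missing.append((initiator, storage_port))
    return missing

# Hypothetical configuration: the second server was masked but never zoned.
masking = {"array-p0": {"hba-srv1", "hba-srv2"}}
zones = {"z_srv1": {"hba-srv1", "array-p0"}}
print(unzoned_initiators(masking, zones))
```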
"In essence, the switches, through zoning, reinforce the LUN masking and binding," Passmore says. "And in one more step, switches at the port that talk to the server can be programmed to check the worldwide ID and therefore reinforce the LUN masking and binding that's been done in the storage arrays."
Use Secure Shell (SSH) protocol to access the SAN.
If an administrator logs into a SAN switch using the Telnet protocol, the password isn't encrypted, leaving it at risk of interception. SSH provides a secure channel.
"With SSH, everything is encrypted," Iacono says. "This is a standard if you're managing your Windows or Unix environment, but for some reason, no one does this for SAN environments."
Make sure the bandwidth in and out of the servers into the switches and the targets is adequate to accommodate the environment.
When once-underutilized servers run multiple application workloads on virtual machines, the bandwidth requirements escalate. Users need to design their SANs with that in mind.
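A quick headroom check shows how consolidation changes the picture. The function name `hba_headroom_gbps` and every figure below are illustrative assumptions, and the calculation takes the worst case in which all virtual machines peak at the same moment.

```python
def hba_headroom_gbps(vm_peak_gbps: list[float],
                      hba_ports: int, hba_port_gbps: float) -> float:
    """Remaining HBA bandwidth if every virtual machine peaks at once.

    A negative result means the server's HBAs are undersized for the load.
    """
    return hba_ports * hba_port_gbps - sum(vm_peak_gbps)

# Hypothetical host: 20 workloads peaking at 0.5 Gbps each.
workloads = [0.5] * 20
print(hba_headroom_gbps(workloads, hba_ports=2, hba_port_gbps=4))
print(hba_headroom_gbps(workloads, hba_ports=2, hba_port_gbps=8))
```

The same comparison applies at the switch ports and at the storage targets, since the bottleneck can sit at any of the three.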
"A typical x86 server last year would be hard-pressed to do more than a gigabit per second of throughput," Staimer says. "The current generation of x86, [which is] typically dual-quad core, can easily push 10 GB, if the applications can. If you're running 20 applications concurrently, you're going to push that 10 GB. It's pushing the I/O that in the past the server really didn't push, because one application was rarely going to do it."
Make sure every physical server with virtual machines is in the same zone.
Using virtual server technology, an administrator can move an application from one physical server to another without any downtime, but those physical servers need to be in the same Fibre Channel zone to be able to access the storage.
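Before enabling live migration across a group of hosts, it is worth confirming that every host is zoned to the storage port the virtual machines depend on. The sketch below assumes the zoning has been exported into a dictionary; the function name `hosts_missing_access` and all names in the sample data are hypothetical.

```python
def hosts_missing_access(cluster_hosts: dict[str, str], storage_port: str,
                         zones: dict[str, set[str]]) -> list[str]:
    """Return hosts whose HBA shares no zone with the given storage port.

    cluster_hosts: host name -> HBA WWPN
    zones:         zone name -> member WWPNs
    """
    return sorted(
        host for host, wwpn in cluster_hosts.items()
        if not any({wwpn, storage_port} <= members for members in zones.values())
    )

# Hypothetical three-host cluster; host3 was left out of the zone.
hosts = {"host1": "hba-1", "host2": "hba-2", "host3": "hba-3"}
zones = {"z_cluster": {"hba-1", "hba-2", "array-p0"}}
print(hosts_missing_access(hosts, "array-p0", zones))
```

A virtual machine moved to any host this check reports would lose sight of its storage.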
"What happens when an application can't see its storage? It crashes!" Staimer says.
Use switches and HBAs that support N_Port ID Virtualization (NPIV).
If one physical server has five virtual machines running on it, NPIV will permit each of those virtual machines to get a unique identifier on a single HBA, and an NPIV-capable switch will recognize each distinct ID. That, in turn, means each virtual machine can have access to a different LUN.
Without NPIV-capable devices, the physical server would get one port ID.
NPIV is supported in new switches and HBAs, but anyone using legacy hardware might need to check with the vendor about a firmware update. NPIV works with blade servers similarly to the way it does with virtual machines.
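The difference NPIV makes at the switch port can be illustrated with a toy model. This is not the Fibre Channel login protocol itself; the class name `SwitchPort`, the sequential IDs and the worldwide names are hypothetical stand-ins meant only to show one physical port handing out several distinct identifiers.

```python
class SwitchPort:
    """Toy model of a switch port that assigns IDs to logged-in WWPNs."""

    def __init__(self, npiv_enabled: bool):
        self.npiv_enabled = npiv_enabled
        self.logins: dict[str, int] = {}  # WWPN -> assigned port ID

    def login(self, wwpn: str) -> int:
        if wwpn in self.logins:
            return self.logins[wwpn]
        # Without NPIV, only one identity may log in on the physical port.
        if self.logins and not self.npiv_enabled:
            raise RuntimeError("port already has a login and NPIV is disabled")
        self.logins[wwpn] = len(self.logins) + 1  # illustrative ID, not a real FC ID
        return self.logins[wwpn]

# Five virtual machines sharing one HBA on an NPIV-capable port.
port = SwitchPort(npiv_enabled=True)
ids = [port.login(f"vm{i}-wwpn") for i in range(5)]
print(ids)
```

Because each virtual machine holds its own ID, zoning and LUN masking can be applied per virtual machine rather than per physical server.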
This was first published in December 2008