Last month, I wrote about the rationale behind implementing a storage area network (SAN) for an e-mail system (see "The networked storage project: getting started"). In the not-so-distant past it might have been unusual to see a SAN used for an e-mail system, but today things are different. Users routinely include multimegabyte file attachments with their e-mails. In addition, users expect their old e-mails to be available - attachments and all - for something close to forever. All of this means that the modest storage requirements of yesterday's e-mail systems are history. Modern e-mail systems need robust storage capacity, and that typically means a SAN.
In part one, I also discussed how our team selected the SAN vendor (Dell/EMC), determined the operating system (NetWare 6.0) and worked through the myriad of considerations that went into the SAN design. In this installment, you'll find out about the various implementation issues we encountered as WVU's Network Services team took the GroupWise SAN system from a design concept to a fully functioning production system.
After completing our vendor selection, determining the design and ordering the SAN, it was time to send our staff members to Dell's training facility in Austin, TX. Four of our staff flew to Austin for a week of SAN training. The reviews were generally good. Every person we sent felt Dell had provided first-rate SAN training. In fact, if there was any gripe at all - except for the sweltering August weather outside of Dell's training facility - it was that the Dell training lab was exclusively Microsoft Windows server-based. While we were there to learn about the SAN, our servers were entirely Novell-based and there wasn't a NetWare server in sight.
After training, our staff returned to begin the installation of the 21 servers and the Dell/EMC FC4700 SAN system. Rather than roll our own, we contracted Dell to provide installation services. While this added to the cost and our staff could have handled the installation, we felt that having professionals handling racking, cabling and staging offered several advantages. It freed up our staff to prepare for the system implementation and it provided the opportunity to look over the installer's shoulders and learn the new system's hardware configuration.
As Dell's installation crew was wrapping up their work, we started the installation sign-off procedure. This is a comprehensive white glove inspection of all aspects of the installation including power on tests, cabling, server mounting and all hardware.
White gloves on, we started going over the system. The installation was spread over two freestanding 6' x 19" racks. One rack held the 21 Dell servers, while the second rack housed the Dell/EMC SAN and Fibre Channel (FC) switches. Fiber optic cables ran from the server host bus adapters (HBAs) to the SAN through a pair of cutouts located on the top of the racks. Upon examination, we noticed that both holes had frighteningly sharp edges. Our concern was that over time, the sharp metal edges would wear through the protective jacket on the fiber optic cables, affecting the connections to the SAN.
We showed this to Dell's installation team, but they were reluctant to correct this. It was left to our team to come up with a solution. We inserted the fiber optic cables inside of a ribbed rubber hose - the kind you find at any auto parts store. It wasn't elegant, but it was functional.
Cable retention arms were used for running the cabling - copper and fiber optic - to the servers. The arms moved in and out to allow access to the rear panel of the servers. However, several of the cable retention arms in our installation seemed to have a mind of their own. Instead of moving strictly horizontally, they moved both horizontally and vertically - and this wasn't a feature they should have. Unfortunately, no one on our team noticed this until the installation had been completed, and Dell's installation team wasn't Johnny-on-the-spot to point it out. To be fair, though, Dell took full responsibility for the problem once it was pointed out to them and sent an engineer on site the following week, who determined that the parts had a manufacturing defect.
The company immediately shipped out new cable management arms and had installers return to replace them. While we felt that this was a no-hassle method for correcting the problem, we wished the installers had called this out to us while they were racking the servers.
Our design used two different Dell PowerEdge server models, the 2650 and the 1650. The 2650s handle the more rigorous demands of the system such as the post office agents (POAs). The 1650s do their work in the lighter parts of the system such as the Web agents. By some design quirk, the hard drives from the 2650 PowerEdge servers aren't plug-compatible with the same size drives in the PowerEdge 1650s. We had expected otherwise. Having completely swappable drives would have made life a whole lot easier for us. As it is, we'll now have to keep spare sets of both drives on hand and remember what goes where in the rack.
Testing one, two, three
Modern SAN management technology is cool. Java-based Web management interfaces that provide access to the SAN from any desktop are terrific management features. Still, when all else fails in the management tree, there's nothing like being able to access a good old-fashioned command line interface (CLI) with a serial communications application like Hyper Access. Fortunately, Dell and EMC's designers had wisely included a serial management port in the FC4700 SAN. Determined to test this feature, we grabbed a Windows 2000 laptop and headed for the data center. To our surprise, we found that we couldn't communicate with the CLI. No matter what we tried, the SAN just wouldn't talk to us. Finally, we called Dell support. They talked us through the serial port setup. Eventually, we were able to set parameters that the SAN liked. Several members of our team commented, "Serial communications used to be so easy."
Other problems surfaced: During the test period, we encountered difficulty with cluster failover. When we downed one of the clustered servers, the other one was unable to access the SAN. At first we were completely stumped. Fortunately our eagle-eyed GroupWise system administrator, Gene Hendrickson, discovered that the management port on one of the FC switches hadn't been connected during installation. He just plugged it in and restarted the switch. The clustering immediately started working. It helps to plug things in.
We weren't sure how we missed this during the installation sign-off procedure. However, this actually turned out to be a good thing. The installation team gained valuable troubleshooting experience with the SAN and the detective work reinforced what was learned in Dell's SAN classes.
Statistically, most faults in computer systems and hard drives appear within the first 30 days of operation. Therefore, it was important to get the SAN up and running a few weeks before the go-live date. This allowed us sufficient time to burn in the system. Obviously, if the system was going to smoke, it was better for that to happen in preproduction testing.
Getting the system up early in test mode provided another advantage. It gave the team time to play with the system before it went live. Deliberately creating outages to see how the system reacts would be totally forbidden on a production system, but it's fair game when the system is in preproduction testing. While this allowed the team to test the system and learn, it also produced some confusion regarding the SAN state, as reflected in a message sent by Ed Norman, one of the project team leaders: "A large number of faults are being generated on the SAN. This began on Friday, Aug. 16, [and continued ten days later]. If this is being done on purpose, may I suggest we stop sending these alerts to Dell, because they are assuming these are real failures. If they are real failures, we clearly have problems with our hardware."
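One way to avoid that kind of confusion is to record planned test windows and hold back vendor alerts that fall inside them. The following is only a minimal sketch of the idea, not something our team ran: the `PlannedWindow` structure, the `should_forward_to_vendor` function and the dates are all hypothetical and chosen purely for illustration.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class PlannedWindow:
    """A scheduled period of deliberate fault injection (hypothetical structure)."""
    start: datetime
    end: datetime
    note: str


def should_forward_to_vendor(alert_time: datetime, windows: list[PlannedWindow]) -> bool:
    """Return False when the alert falls inside any planned test window."""
    return not any(w.start <= alert_time <= w.end for w in windows)


# Illustrative window only; the date and times are made up for this example.
windows = [
    PlannedWindow(datetime(2002, 9, 3, 8, 0), datetime(2002, 9, 3, 17, 0), "failover drills"),
]

during_drill = datetime(2002, 9, 3, 10, 30)
next_morning = datetime(2002, 9, 4, 9, 0)

print(should_forward_to_vendor(during_drill, windows))   # False: hold the alert
print(should_forward_to_vendor(next_morning, windows))   # True: treat as a real fault
```

Alerts that are held back would still need to be logged locally so that a genuine hardware fault during a drill is not lost.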
The eleventh hour
Every project has its eleventh hour glitch. Ours came just 30 hours before system cut-over. Just prior to cut-over, we installed NetWare Support Pack 2 on the servers. There were several good reasons for upgrading to SP2 before cut-over. Installing SP2 meant not having to bring the servers down to install SP2 after the cut-over. The support pack also contained fixes for SAN clustering - something we obviously wanted to have in our system. Additionally, we wanted to be prepared for any possible support issues that might arise during the cut-over. We knew that if we ran into any glitches, Novell support's first question would be, "Did you install SP2?"
Vendors such as Microsoft and Novell create support packs to add features and fix problems. But nearly every support pack seems to find a way to break something that was previously working. SP2 for NetWare was no exception. After installing it, we found it took over four minutes to load a post office. Additionally, when a client connected to a post office agent, its IP address was displayed in the management system as 0.0.0.0.
The latter issue prevented us from seeing what clients were connected to the system - mostly an annoyance. The former issue was a greater concern, creating the potential for an outage of up to five minutes in the event that we had to fail over clustered servers.
As the notes of our project leader, Milton Christ, pointed out, at first Novell support was slow to respond to our requests for support: "We have been playing a waiting game with Novell [support] ... we lost an afternoon's work due to the issues with SP2. [At this late date] management doesn't want to hear 'I don't know what Novell is doing.'"
Fortunately, our Novell advocate stepped in. She raised the incident to high severity and contacted Novell's support manager to get coordination in place between Novell's OS and GroupWise support groups. In short order, Novell's OS group gave us a field test NetWare Loadable Module (NLM) that fixed the slow load issue for the POAs. However, it didn't correct the 0.0.0.0 showing in the client IP address field. That fix, we were told, would come later. Since we had a few POAs with no users on them, we could use these POAs to test Novell's final fix without causing interruptions to the production POAs.
Up and running
We went live on Oct. 5, 2002 - a full day ahead of schedule. The migration from the old system to the new SAN system went amazingly smoothly.
After every project, we hold a debriefing meeting to reexamine our design, installation and deployment procedures (see "Project summary"). In looking at the past five months, there wasn't much that we wished we had done differently. In retrospect, we could have orchestrated the hardware installation a little better and communicated our design criteria to all team members more rapidly than we did. But on the whole, it was a job well done by all members of the team. The design and vendor choices we made were soundly based and have begun to prove themselves in real-life operation.
Were we to do it over again, we would without hesitation incorporate a SAN as a core element in our e-mail system. Maybe in the future we'll include NAS and iSCSI as well, but for now we're happy with our system - and so are our users. Out of nearly 6,000 GroupWise users on campus, there were fewer than 250 support calls to our Help Desk related to system or client issues in the first week of operation - clearly a success.