The networked storage project: moving ahead

Our SAN beginners have done the research and picked the products. Now it's time to implement - here's how they did it.

This article can also be found in the Premium Editorial Download: Storage magazine: Best storage products of the year 2002:

Last month, I wrote about the rationale behind implementing a storage area network (SAN) for an e-mail system (see "The networked storage project: getting started"). In the not-so-distant past it might have been unusual to see a SAN used for an e-mail system, but today things are different. Users routinely include multimegabyte file attachments with their e-mails. In addition, users expect their old e-mails to be available - attachments and all - for something close to forever. All of this means that the modest storage requirements of yesterday's e-mail systems are history. Modern e-mail systems need robust storage capacity, and typically that means a SAN.

In part one, I also discussed how our team selected the SAN vendor (Dell/EMC), determined the operating system (NetWare 6.0) and worked through the myriad of considerations that went into the SAN design. In this installment, you'll find out about the various implementation issues we encountered as WVU's Network Services team took the GroupWise SAN system from a design concept to a fully functioning production system.

SAN training
After completing our vendor selection, determining the design and ordering the SAN, it was time to send our staff members to Dell's training facility in Austin, TX. Four of our staff flew to Austin for a week of SAN training. The reviews were generally good. Every person we sent felt Dell had provided first-rate SAN training. In fact, if there was any gripe at all - except for the sweltering August weather outside of Dell's training facility - it was that the Dell training lab was exclusively Microsoft Windows server-based. While we were there to learn about the SAN, our servers were entirely Novell-based and there wasn't a NetWare server in sight.

Project Summary
After all was said and done and the system was up and humming, we held a debriefing where we detailed what went right with the GroupWise 6 system design and its implementation and what went wrong. Happily, there were many more things that went right than things that went wrong. Here's a summary of our debriefing:
What we did right
  1. Assigned project manager (responsible for all phases of design and implementation) earlier in the design
  2. Created a PERT chart with milestones and adhered to it closely
  3. Developed budget, regularly monitored it and kept expenses within budget
  4. Encouraged staff to participate in design and implementation decisions
  5. Carefully thought out vendor selection
  6. Carefully thought out OS decision
  7. Well thought out design
  8. Had design reviewed by two independent outside consultants
  9. Exhibited a high degree of collaboration and communication
    • Within GroupWise services unit
    • Between units in network services
    • To other IT departments (customer support and IS)
    • To vendors (Dell, Novell)
  10. Implemented installation sign-off procedure
  11. Designed 30-day burn-in and test period
  12. Developed in advance contingency planning and determined ways to implement contingencies
What we could have done better
  1. Improved communications during initial phase of the project (corrected in progress)

After training, our staff returned to begin the installation of the 21 servers and the Dell/EMC FC4700 SAN system. Rather than roll our own, we contracted Dell to provide installation services. While this added to the cost and our staff could have handled the installation, we felt that having professionals handle racking, cabling and staging offered several advantages. It freed up our staff to prepare for the system implementation and it provided the opportunity to look over the installers' shoulders and learn the new system's hardware configuration.

As Dell's installation crew was wrapping up their work, we started the installation sign-off procedure. This is a comprehensive white glove inspection of all aspects of the installation including power on tests, cabling, server mounting and all hardware.

White gloves on, we started going over the system. The installation was spread over two free-standing 6' x 19" racks. One rack held the 21 Dell servers, while the second rack housed the Dell/EMC SAN and Fibre Channel (FC) switches. Fiber optic cables ran from the server host bus adapters (HBAs) to the SAN through a pair of cutouts located on the top of the racks. Upon examination, we noticed that both cutouts had frighteningly sharp edges. Our concern was that over time, the sharp metal edges would rub off the protective cladding on the fiber optic cables, affecting the connections to the SAN.

We showed this to Dell's installation team, but they were reluctant to correct this. It was left to our team to come up with a solution. We inserted the fiber optic cables inside of a ribbed rubber hose - the kind you find at any auto parts store. It wasn't elegant, but it was functional.

Cable retention arms were used for running the cabling - copper and fiber optic - to the servers. The arms moved in and out to allow access to the rear panel of the servers. However, several of the cable retention arms in our installation seemed to have a mind of their own. Instead of moving strictly horizontally, they moved both horizontally and vertically - and this wasn't a feature they should have. Unfortunately, no one on our team noticed this until the installation had been completed, and Dell's installation team wasn't Johnny-on-the-spot to point it out. To be fair, though, Dell took full responsibility for the problem once it was pointed out to them and sent an engineer on site the following week, who determined that the parts had a manufacturing defect.

The company immediately shipped out new cable management arms and had installers return to replace them. While we felt that this was a no-hassle method for correcting the problem, we wished the installers had called this out to us while they were racking the servers.

Our design used two different Dell PowerEdge server models, the 2650 and the 1650. The 2650s handle the more rigorous demands of the system such as the post office agents (POAs). The 1650s do their work in the lighter parts of the system such as the Web agents. By some design quirk, the hard drives from the 2650 PowerEdge servers aren't plug-compatible with the same size drives in the PowerEdge 1650s. We had expected otherwise. Having completely swappable drives would have made life a whole lot easier for us. As it is, we'll now have to keep spare sets of both drives on hand and remember what goes where in the rack.

Testing one, two, three
Modern SAN management technology is cool. Java-based Web management interfaces that provide access to the SAN from any desktop are terrific management features. Still, when all else fails in the management tree, there's nothing like being able to access a good old-fashioned command line interface (CLI) with a serial communications application like Hyper Access. Fortunately, Dell and EMC's designers had wisely included a serial management port in the FC4700 SAN. Determined to test this feature, we grabbed a Windows 2000 laptop and headed for the data center. To our surprise, we found that we couldn't communicate with the CLI. No matter what we tried, the SAN just wouldn't talk to us. Finally, we called Dell support. They talked us through the serial port setup. Eventually, we were able to set parameters that the SAN liked. Several members of our team commented, "Serial communications used to be so easy."

Other problems surfaced: During the test period, we encountered difficulty with cluster failover. When we downed one of the clustered servers, the other one was unable to access the SAN. At first we were completely stumped. Fortunately our eagle-eyed GroupWise system administrator, Gene Hendrickson, discovered that the management port on one of the FC switches hadn't been connected during installation. He just plugged it in and restarted the switch. The clustering immediately started working. It helps to plug things in.

We weren't sure how we missed this during the installation sign-off procedure. However, this actually turned out to be a good thing. The installation team gained valuable troubleshooting experience with the SAN and the detective work reinforced what was learned in Dell's SAN classes.

Disciplines learned from mainframe storage
An e-mail-based SAN has different performance metrics than, say, a file and print server. While both are I/O-centric, an e-mail SAN typically has many more small files to deal with. To determine performance, we used several metrics. First, we blasted e-mails at the system from a test program we developed and measured how quickly it processed them. Here are the test results we recorded:
100,000 messages were sent between two accounts on two separate post offices. Fifty-seven of the 100,000 messages had one million lines of text. The system handled them all without any lost messages.
The system routed 36,000 e-mails in 10 minutes.
Even with 10,000+ messages in the mailbox, the client started very quickly.
We didn't receive the alert that there were "too many messages to be viewed." This is an error that came up often with the previous system whenever we had more than 4,096 messages in a folder.
Deleting 10,000 messages took less than five minutes. Deleting a large number of messages in the previous system would lock up the client for over an hour in some cases.
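A minimal sketch of the kind of timing harness described above, and of how the reported totals translate into per-second rates. This is not the test program the team wrote: the names `blast` and `rate_per_second` are hypothetical, and `send` is a stand-in for whatever actually hands a message to the mail system.

```python
import time
from typing import Callable


def blast(send: Callable[[int], None], count: int) -> float:
    """Call send(i) for i in 0..count-1 and return messages per second.

    `send` is a hypothetical callable; in a real test it would submit
    one message to the mail system.
    """
    start = time.perf_counter()
    for i in range(count):
        send(i)
    elapsed = time.perf_counter() - start
    # Guard against a zero reading on very fast stubs.
    return count / elapsed if elapsed > 0 else float("inf")


def rate_per_second(messages: int, minutes: float) -> float:
    """Convert a message count over a number of minutes to messages/s."""
    return messages / (minutes * 60)


# Figures reported in the results above.
routing_rate = rate_per_second(36_000, 10)     # 60.0 messages/s
# Deletion took *less than* five minutes, so this is a lower bound.
delete_rate_floor = rate_per_second(10_000, 5)

print(f"routing: {routing_rate:.1f} messages/s")
print(f"deleting: more than {delete_rate_floor:.1f} messages/s")
```

Passing the delivery step in as a callable keeps the timing loop independent of the mail transport, so the same loop can be dry-run against a stub before being pointed at a test post office.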
Next, we ran benchmarks on the SAN itself with the following results:
     Read I/Os: 21,225/s
     Write I/Os: 21,223/s
     Read Performance: 41.6MB/s
     Write Performance: 41.5MB/s
     Average read response time: 0.04 sec
     Average write response time: 0.06 sec
We were curious to see how the measured read/write performance of the SAN compared to Dell's specifications. The documentation we received at Dell's SAN class indicated we should expect to see 30MB/s to 35MB/s. Since our measured 41.5MB/s was higher, we felt that Dell was perhaps being a bit conservative in the documentation. Still, this was an impressive performance figure that exceeded our expectations.
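The benchmark figures can be cross-checked with a little arithmetic. The sketch below derives the average transfer size per I/O from the reported throughput and I/O rate, and the margin over the documented 30MB/s to 35MB/s range. It assumes 1MB = 1,024KB (the source doesn't say which convention the benchmark tool used), and the function names are hypothetical.

```python
# Figures as reported in the benchmark results above.
READ_IOPS, WRITE_IOPS = 21_225, 21_223
READ_MB_S, WRITE_MB_S = 41.6, 41.5
SPEC_LOW_MB_S, SPEC_HIGH_MB_S = 30.0, 35.0


def avg_io_kb(mb_per_s: float, iops: float, kb_per_mb: int = 1024) -> float:
    """Average size of one I/O in KB (assumes 1MB = 1,024KB by default)."""
    return mb_per_s * kb_per_mb / iops


def pct_above(measured: float, reference: float) -> float:
    """How far `measured` exceeds `reference`, as a percentage."""
    return (measured - reference) / reference * 100


print(f"avg read I/O:  {avg_io_kb(READ_MB_S, READ_IOPS):.2f}KB")
print(f"avg write I/O: {avg_io_kb(WRITE_MB_S, WRITE_IOPS):.2f}KB")
print(f"over spec ceiling: {pct_above(WRITE_MB_S, SPEC_HIGH_MB_S):.1f}%")
print(f"over spec floor:   {pct_above(WRITE_MB_S, SPEC_LOW_MB_S):.1f}%")
```

Under that assumption both directions work out to roughly 2KB per I/O, which fits a small-block workload like e-mail, and the measured 41.5MB/s is about 19% above the top of the documented range.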

Going live
Statistically, most faults in computer systems and hard drives appear within the first 30 days of operation. Therefore, it was important to get the SAN up and running a few weeks before the go-live date. This allowed us sufficient time to burn in the system. Obviously, if the system is going to smoke, it's better if it happens in preproduction testing.

Getting the system up early in test mode provided another advantage. It gave the team time to play with the system before it went live. Things such as deliberately creating outages to see how the system reacted would be totally forbidden on a production system, but it's fair game when the system is in preproduction testing. While this allowed the team to test the system and learn, it also produced some confusion regarding the SAN state as reflected in a message sent by Ed Norman, one of the project team leaders: "A large number of faults are being generated on the SAN. This began on Friday, Aug. 16, [and continued ten days later]. If this is being done on purpose, may I suggest we stop sending these alerts to Dell, because they are assuming these are real failures. If they are real failures, we clearly have problems with our hardware."

The eleventh hour
Every project has its eleventh hour glitch. Ours came just 30 hours before system cut-over. Just prior to cut-over, we installed NetWare Support Pack 2 (SP2) on the servers. There were several good reasons for upgrading to SP2 before cut-over. Installing it beforehand meant not having to bring the servers down to install it after the cut-over. The support pack also contained fixes for SAN clustering - something we obviously wanted to have in our system. Additionally, we wanted to be prepared for any possible support issues that might arise during the cut-over. We knew that if we ran into any glitches, Novell support's first question would be, "Did you install SP2?"

Vendors such as Microsoft and Novell create support packs to add features and fix problems. But nearly every support pack seems to find a way to break something that was previously working. SP2 for NetWare was no exception. After installing it, we found it took over four minutes to load a post office. Additionally, when a client connected to a post office agent, its IP address was displayed in the management system as 0.0.0.0.

The latter issue prevented us from seeing which clients were connected to the system - mostly an annoyance. The former issue was a greater concern, creating the potential for an outage of up to five minutes in the event that we had to fail over clustered servers.

As the notes of our project leader, Milton Christ, pointed out, at first Novell support was slow to respond to our requests for support: "We have been playing a waiting game with Novell [support] ... we lost an afternoon's work due to the issues with SP2. [At this late date] management doesn't want to hear 'I don't know what Novell is doing.'"

Fortunately, our Novell advocate stepped in. She raised the incident to high severity and contacted Novell's support manager to get coordination in place between Novell's OS and GroupWise support groups. In short order, Novell's OS group gave us a field test NetWare Loadable Module (NLM) that fixed the slow load issue for the POAs. However, it didn't correct the 0.0.0.0 showing in the client IP address field. That fix, we were told, would come later. Since we had a few POAs with no users on them, we could use these POAs to test Novell's final fix without causing interruptions to the production POAs.

Up and running
We went live on Oct. 5, 2002 - a full day ahead of schedule. The migration from the old system to the new SAN system went amazingly smoothly.

As we do after every project, we held a debriefing meeting to reexamine our design, installation and deployment procedures (see "Project summary"). In looking at the past five months, there wasn't much that we wished we had done differently. In retrospect, we could have orchestrated the hardware installation a little better and communicated our design criteria to all team members more rapidly than we did. But on the whole, it was a job well done by all members of the team. The design and vendor choices we made were soundly based and have begun to prove themselves in real-life operation.

Were we to do it over again, we would without hesitation incorporate a SAN as a core element in our e-mail system. Maybe in the future we'll include NAS and iSCSI as well, but for now we're happy with our system - and so are our users. Out of nearly 6,000 GroupWise users on campus, there were fewer than 250 support calls to our Help Desk related to system or client issues in the first week of operation - clearly a success.

This was first published in January 2003
