This article can also be found in the Premium Editorial Download "Storage magazine: Best storage products of the year 2002."
Statistically, most faults in computer systems and hard drives appear within the first 30 days of operation. Therefore, it was important to get the SAN up and running a few weeks before the go-live date. This allowed us sufficient time to burn in the system. Obviously, if the system was going to smoke, it was better for that to happen during preproduction testing.
Getting the system up early in test mode provided another advantage. It gave the team time to play with the system before it went live. Things such as deliberately creating outages to see how the system reacted would be totally forbidden on a production system, but they're fair game when the system is in preproduction testing. While this allowed the team to test the system and learn, it also produced some confusion regarding the SAN state, as reflected in a message sent by Ed Norman, one of the project team leaders: "A large number of faults are being generated on the SAN. This began on Friday, Aug. 16, [and continued ten days later]. If this is being done on purpose, may I suggest we stop sending these alerts to Dell, because they are assuming these are real failures. If they are real failures, we clearly have problems with our hardware."
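The suggestion in that message, to stop forwarding deliberately induced faults to the vendor, can be pictured as a simple test-window filter. The sketch below is illustrative only: the names `TestWindow` and `should_forward_to_vendor` are hypothetical, the window dates are examples, and nothing here reflects how our SAN management software or Dell's alerting actually worked.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Iterable


@dataclass(frozen=True)
class TestWindow:
    """A declared period of deliberate fault testing (hypothetical type)."""
    start: datetime
    end: datetime

    def covers(self, moment: datetime) -> bool:
        # Inclusive on both ends: an alert at the exact boundary counts as a test alert.
        return self.start <= moment <= self.end


def should_forward_to_vendor(alert_time: datetime,
                             test_windows: Iterable[TestWindow]) -> bool:
    """Forward an alert only if it falls outside every declared test window."""
    return not any(window.covers(alert_time) for window in test_windows)


# Example window (assumed dates and times, for illustration only).
windows = [TestWindow(datetime(2002, 8, 16, 0, 0), datetime(2002, 8, 26, 23, 59))]

during_test = should_forward_to_vendor(datetime(2002, 8, 20, 14, 30), windows)
after_test = should_forward_to_vendor(datetime(2002, 9, 1, 9, 0), windows)
```

Under these assumptions, an alert raised inside a declared window stays with the project team, while one raised outside it goes to the vendor as a presumed real failure. The approach only works if every deliberate test is declared in advance, which is precisely the communication gap the message points to.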
The eleventh hour
Every project has its eleventh-hour glitch. Ours came just 30 hours before system cut-over. Just prior to cut-over, we installed NetWare Support Pack 2 on the servers. There were several good reasons for upgrading to SP2 before cut-over. Installing SP2 meant not having to bring the servers down to install it after the cut-over. The support pack also contained fixes for SAN clustering - something we obviously wanted to have in our system. Additionally, we wanted to be prepared for any possible support issues that might arise during the cut-over. We knew that if we ran into any glitches, Novell support's first question would be, "Did you install SP2?"
Vendors such as Microsoft and Novell create support packs to add features and fix problems. But nearly every support pack seems to find a way to break something that was previously working. SP2 for NetWare was no exception. After installing it, we found it took over four minutes to load a post office. Additionally, when a client connected to a post office agent, its IP address was displayed in the management system as 0.0.0.0.
The latter issue prevented us from seeing which clients were connected to the system - mostly an annoyance. The former issue was a greater concern, creating the potential for an outage of up to five minutes in the event that we had to fail over clustered servers.
As the notes of our project leader, Milton Christ, pointed out, at first Novell support was slow to respond to our requests for support: "We have been playing a waiting game with Novell [support] ... we lost an afternoon's work due to the issues with SP2. [At this late date] management doesn't want to hear 'I don't know what Novell is doing.'"
Fortunately, our Novell advocate stepped in. She raised the incident to high severity and contacted Novell's support manager to get coordination in place between Novell's OS and GroupWise support groups. In short order, Novell's OS group gave us a field test NetWare Loadable Module (NLM) that fixed the slow load issue for the post office agents (POAs). However, it didn't correct the 0.0.0.0 showing in the client IP address field. That fix, we were told, would come later. Since we had a few POAs with no users on them, we could use these POAs to test Novell's final fix without causing interruptions to the production POAs.
Up and running
We went live on Oct. 5, 2002 - a full day ahead of schedule. The migration from the old system to the new SAN system went amazingly smoothly.
After every project, we hold a debriefing meeting to reexamine our design, installation and deployment procedures (see "Project summary"). In looking at the past five months, there wasn't much that we wished we had done differently. In retrospect, we could have orchestrated the hardware installation a little better and communicated our design criteria to all team members more rapidly than we did. But on the whole, it was a job well done by all members of the team. The design and vendor choices we made were soundly based and have begun to prove themselves in real-life operation.
Were we to do it over again, we would without hesitation incorporate a SAN as a core element in our e-mail system. Maybe in the future we'll include NAS and iSCSI as well, but for now we're happy with our system - and so are our users. Out of nearly 6,000 GroupWise users on campus, our Help Desk received fewer than 250 support calls related to system or client issues during the first week of operation - clearly a success.
This was first published in January 2003