This article can also be found in the Premium Editorial Download "Storage magazine: Overview of top tape backup options for midrange systems and networking segments."
This isn't to say SANs aren't reliable, but there needs to be a carefully engineered solution, deployed as designed and then not fooled with. It reminds me of another law to live by: To keep your nose from getting broken in three different places, keep your nose out of those three different places.
The killer process
With modern backup software, a small number of servers can control backup schedules and metadata collection for a large number of clients and servers with attached tape drives. This example shows Legato's NetWorker package, but other major packages use similar architectures.
The most exciting part about CE is how different groups have shared solutions. One group may have a great solution for Oracle database servers, and after review, it's adopted as the standard. No reinventing the wheel each time a solution is needed. CE's simplicity is its strength.
Years ago, Intel found that as it built new chip factories, each would go through the same learning curve. It would take as long to get the second or third process running as it did the first. So a process, now dubbed CE, was implemented to share the engineering of previous sites. Each new site starts at the same point as the current operating sites, which saves a huge amount of time. When a new process or site goes online, there's a high degree of certainty that its quality and yields will match existing sites. Since CE works well for making chips, IT implemented its own version for deploying IT projects. Everything IT deploys has to adhere to a strict process to become certified and released. Remember that Intel has thousands of employees, most of whom have at least one PC, so anything that's attached to the network can have a huge impact.
The goal was to get a product to P100 status. Here's how:
You get a CRD and create a PRD and match it to your roadmap and the ASB. Next, you get buy-in from the JOM and JEM and then take it to ITSP for an RFP. If you aren't ZZB'd, you make a team from the TIC, TAC, OSC and SE. They match the RFP with your PRD. If you're still on track, you start a white paper for the JET's preliminary OK. The JET gets feedback from the OSC, TAC, JOM/JOT and JEM. Then a pilot begins. Results are again reviewed by the JET, et al., and if you haven't gotten lost, you may get P100. Get the picture?
We almost sank ourselves in the process, making it nearly impossible to get any work done. If the acronyms didn't kill you, the paperwork would. I have a 45-page training class syllabus for anyone who wants to know what all that really meant. The key is a punch list called a white paper. It tracks each step and ensures that the proper processes are followed, so by the time a project hits the computer room, we know it works and won't bring down the rest of the environment (see "The plan, or 'killer process'" sidebar). Again, I have personal - and painful - experience of what can go wrong.
Another digression: Once, before the CE process existed, I deployed a multi-NIC solution for NetWare servers. We had 50 or so NetWare machines acting as file servers for about 3,000 people, and bandwidth became a problem - old thick-wire Ethernet. As lead engineer at our site, I found a product that promised to allow servers to have more than one network connection. Wow - great idea. We spent a month testing it in our lab and it seemed to work great. One night, we scheduled downtime for all the servers and my team deployed the NICs to most of them. We called it "The NIC at Night" club.
Everything looked fine. The next morning, one of our manufacturing lines kept going down and Unix controllers kept going offline. You can guess the outcome. My multi-NIC servers were broadcasting something that only affected this one type of Unix server, but it was enough to bring production to a stop for three hours. I heard the manufacturing manager was asking the IT manager for $10 million in lost production. My bonus that year could have been better.
What went wrong? My testing wasn't complete; the joint engineering manager (JEM) would have seen that. The change was made without proper notification to others who might be affected; the joint operation manager (JOM) and operations support center (OSC) would have saved hours of downtime had they known something like this was changing. The P100 process now tries to stop this kind of half-baked deployment. It may not catch every problem, but there are enough checks and balances to lessen the chance of something going wrong, or to minimize the impact if it does.
The bottom line: Don't let the process kill you. Keep it simple and easy to follow, but demand strict adherence. Once something is certified P100, it can be deployed wherever it fits. Remember that one size doesn't fit all: each product was certified within certain boundaries, and using it outside those boundaries will cause problems. Most of the control starts at the purchasing department. If someone tries to order a tape library not on the P100 list, the order is kicked back. This isn't to say that special cases don't arise, but they are carefully reviewed and approved through a different process.
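For readers who think better in code, the purchasing gate can be sketched in a few lines of Python. This is only an illustration of the idea, not how Intel's purchasing system worked: the product name, the `max_clients` boundary, the `P100_LIST` table and the `review_order` function are all hypothetical, made up for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Certification:
    """One entry on a hypothetical P100 list."""
    product: str
    max_clients: int  # assumed boundary the product was certified for


# Hypothetical certified list; real entries and limits would come from
# the certification process itself.
P100_LIST = {
    "tape-library-a": Certification("tape-library-a", max_clients=200),
}


def review_order(product: str, planned_clients: int) -> str:
    """Return what happens to a purchase order under the sketch's rules."""
    cert = P100_LIST.get(product)
    if cert is None:
        # Not certified at all: purchasing kicks the order back.
        return "kicked back"
    if planned_clients > cert.max_clients:
        # Certified, but the planned use falls outside the certified
        # boundaries, so it goes through the separate special-case review.
        return "special-case review"
    return "approved"


if __name__ == "__main__":
    print(review_order("tape-library-a", 50))
    print(review_order("tape-library-a", 500))
    print(review_order("tape-library-z", 50))
```

The point of the sketch is that the check happens before the hardware ever reaches the computer room, and that being on the list is not enough by itself; the planned use also has to fall inside what was certified.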
To be successful in a large-scale project such as EBaR, you must have the commitment of upper management. Intel was spending millions of dollars over a number of years. EBaR isn't something that can be rolled out in a single budget cycle. One of the major problems we had was that we predicted a significant head count reduction, so management cut the heads before our full deployment. This left us with fewer people to manage the existing environment and roll out a new one at the same time. It was a classic case of putting the cart before the horse.
Another tip: Involve all the stakeholders early on in the planning stages. Our team discovered related groups that were ready to do their own projects and could pitch in funding and manpower. The phrase "plagiarize with pride" became our rallying cry. If someone found a workable solution in some other department, our team made it P100 and would use it around the world. And remember: when you steal something, give credit to those you stole it from. Then make any adjustments needed to fit your certification process and be happy you didn't have to do it yourself.
A not-so-old proverb states that those who conduct backups are only two failed restores away from a pink slip. My final advice: Install a process that doesn't leave room for mistakes.
This was first published in July 2002