Published: 03 Jul 2002
I recently retired from Intel as product manager for storage products. It was a good, yet hectic, ride. During my 10-year tenure in that position, I deployed storage solutions for thousands of servers, adding up to nearly a petabyte of data. This article explains how Intel came to use an elaborate standardization process to improve its backup and restore procedures.
|Intel's P100 checklist|
After the introduction of its 386 processor, Intel began moving in earnest from big-iron mainframes to PC servers. Soon we had an assorted jumble of every server size and shape made, running every OS under the sun.
Today, there are over 10,000 servers, ranging in size from 50GB to multiple terabytes. They're spread around the world in a dozen major sites and countless small sites and offices. A typical large site will have 200 to 300 servers, and a small site could have anywhere from two to 50. To back up all the servers, there are nearly 500 tape libraries and thousands of single tape drives. The libraries range from a single-drive, eight-tape changer to 16-drive, 350-tape libraries. Almost all tape drives are DLT7000s.
Granted, not all this storage is found on Intel-based servers - yet - but the majority of servers are attached to tape drives. Some large engineering sites, served by non-Intel-based systems, need to store 100TB or more of data. Put simply, it's a huge environment consisting of an eclectic mix of different size servers employed worldwide, all of which need to be protected.
In the early days - before CE (Intel's Copy Exactly methodology) - anyone who wanted to deploy an application would build up a server, install their application, write some procedures - maybe - and put it on the network. Soon, we had dozens of individual solutions with little standardization, and system administration quickly became a nightmare. You were likely to find five or six different backup solutions in the same computer room. Most servers had their own tape drive attached and ran who knows what backup application, each on its own schedule and tape rotation scheme.
By 1995, things were getting out of hand, and IT began to develop worldwide standardization programs. Core services like e-mail, office applications and online storage were moved to a standard solution and deployed around the world. This gave us an opportunity to develop an automated backup solution for close to 1,000 servers running Microsoft Windows NT. Every server would have four network connections, three for users and one for maintenance. Backups were done over the maintenance network to a tape library. Each library would support 10 to 15 servers.
Special application servers with large data stores got their own tape changers directly attached (via SCSI). Soon we had a reliable and manageable backup and restore solution. Man-hours to administer backups at sites with 100 servers went from 20 hours/week to three hours/week, reliability skyrocketed and we could manage the backup and restore operations over the network. Often, support people in California would be configuring systems in Ireland and Hong Kong at the same time.
Over the next few years, this basic model was replicated until we had more than 6,000 servers covered. I know the number because once we had to deploy a patch to ARCserve just before daylight-saving time. We did it in one night from a central location. And hardware and support costs dropped sharply - an additional benefit to adopting a company-wide standard.
Even though IT had made significant strides in automating its backup and recovery processes, it covered less than half of the servers worldwide. Everyone was using the same tape hardware, and most NT servers were using ARCserve under the corporate license agreements. Still, many groups operated their own non-NT servers. There was a large Unix environment not being addressed. Though the number of servers was lower, the total storage on Unix was much higher. Each group had good solutions, but there was little sharing of resources between groups or applications.
Most had direct SCSI-attached tape systems with backup software running on top of the application server. This is always a problem. Backup systems take a lot of maintenance and you don't want to bring down your database application because your tape system needs maintenance. On the other hand, you need the SCSI-type speeds if you plan to back up, or more importantly, restore 500GB in a reasonable time.
The four-hour restore rule
I used to live by the rule that we shouldn't build any server we couldn't restore in four hours. This came from a disaster I once lived through. In the days of 50GB servers and 2GB tapes, we were seven hours into a restore when tapes began to fail. The customer was already upset about being offline for most of the day, and now we had to tell them not all their data was coming back. Not a pleasant situation. The lesson: Get a clear agreement with your customer on how long they can stand to be down; then don't let the system grow any bigger than can be restored in that time. And remember to build in time to gather the tapes and get the right resources in place.
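The rule reduces to simple arithmetic: usable restore time multiplied by sustained restore throughput gives the largest data set a server should hold. Here is a minimal sketch in Python. The function names are hypothetical, and the throughput and overhead figures are illustrative assumptions, not measurements from Intel's environment.

```python
def max_restorable_gb(window_hours, throughput_mb_per_s, drives=1, overhead_hours=0.0):
    """Largest data set (in decimal GB) that fits the agreed restore window.

    overhead_hours is the time set aside to gather tapes and line up
    people and hardware before any data starts moving.
    """
    usable_hours = window_hours - overhead_hours
    if usable_hours <= 0:
        return 0.0
    total_mb = usable_hours * 3600 * throughput_mb_per_s * drives
    return total_mb / 1000


def restore_hours(size_gb, throughput_mb_per_s, drives=1, overhead_hours=0.0):
    """Estimated wall-clock hours to restore size_gb, including overhead."""
    transfer_seconds = (size_gb * 1000) / (throughput_mb_per_s * drives)
    return overhead_hours + transfer_seconds / 3600


if __name__ == "__main__":
    # Assumed figures: four-hour window, one hour of overhead,
    # a single drive sustaining 5MB/sec on restore.
    print(max_restorable_gb(4, 5, drives=1, overhead_hours=1))
    # How long would a 500GB server take with four such drives in parallel?
    print(round(restore_hours(500, 5, drives=4, overhead_hours=1), 1))
```

With those assumed figures, a single drive puts the ceiling at 54GB. A 500GB server does not fit a four-hour window even with four drives running in parallel, which is why the restore path has to be sized before the server is allowed to grow.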
But ... back to the Unix problem. Most owners of Unix systems were now also running some NT and we wanted to merge the environments. For the first time, we were able to get NT administrators and Unix administrators to sit down at the table without blood-letting. A project was born - enterprise backup and restore, EBaR. For this project, we were able to pull together people from all over the world to design, test, deploy and manage a single strategy covering almost everyone's needs. This didn't mean that we had a "one size fits all" solution. We went through an extensive process in defining our needs and picking a product that best met them. We chose a Veritas suite of backup products since they covered most of our flavors of Unix and other operating systems and provided a way to centrally manage the environment. The tiered architecture of Veritas' NetBackup made it possible to support large data stores and small servers under one system.
The timing for this project coincided with the deployment of a standardized storage area network (SAN) solution. As we consolidated servers and storage onto large (1TB to 3TB) SANs with 50 to 100 servers, we deployed fiber-based backup systems to service them. This deployment, of course, had its unique set of issues. When it was configured correctly and everything was running right, it was great. SAN bandwidth was adequate - plenty of redundancy and dedicated connectivity are wonderful for moving large amounts of data. But, and this is a big but, there are many variables that can make things go wrong. Servers or libraries were being taken online and offline for maintenance, tape drives failed and fiber bridges went out - all of which kept many of us from sleeping well at night. This isn't to say SANs aren't reliable, but there needs to be a carefully engineered solution, deployed as designed and then not fooled with. It reminds me of another law to live by: To keep your nose from getting broken in three different places, keep your nose out of those three different places.
|The killer process|
With modern backup software, a small number of servers can control backup schedules and metadata collection for a large number of clients and servers with attached tape drives. This example shows Legato's NetWorker package, but other major packages use similar architectures.
The most exciting part about CE is how different groups have shared solutions. One group may have a great solution for Oracle database servers, and after review, it's adopted as the standard. No reinventing the wheel each time a solution is needed. CE's simplicity is its strength.
Years ago, Intel found that as it built new chip factories, each would go through the same learning curve. It would take as long to get the second or third process running as it did the first. So a process, now dubbed CE, was implemented to share the engineering of previous sites. Each new site started at the same point as the current operating sites. This saves a huge amount of time. When a new process or site goes online, there's a high degree of certainty that its quality and yields will match existing sites. Since CE works well for making chips, IT implemented its own version for deploying IT projects. Everything IT deploys has to adhere to a strict process to become certified and released. Remember that Intel has thousands of employees, most of whom have at least one PC, so anything that's attached to the network can have a huge impact.
The goal was to get a product to P100 status. Here's how:
You get a CRD and create a PRD and match it to your roadmap and the ASB. Next, you get buy-in from the JOM and JEM and then take it to ITSP for an RFP. If you aren't ZZB'd, you make a team from the TIC, TAC, OSC and SE. They match the RFP with your PRD. If you're still on track, you start a white paper for the JET's preliminary OK. The JET gets feedback from the OSC, TAC, JOM/JOT and JEM. Then a pilot begins. Results are again reviewed by the JET, et al., and if you haven't got lost, you may get P100. Get the picture?
We almost sank ourselves in the process, making it nearly impossible to get any work done. If the acronyms didn't kill you, the paperwork would. I have a 45-page training class syllabus for anyone who wants to know what all that really meant. The key is a punch list called a white paper. It tracks each step and ensures that the proper processes are followed, so by the time a project hits the computer room, we know it works and it won't bring down the rest of the environment (see "The plan, or 'killer process'" sidebar). Again, I have a personal - and painful - experience of what can go wrong.
Another digression: Once, prior to the CE process, I deployed a multi-NIC solution for NetWare servers. We had 50 or so NetWare servers as file servers for about 3,000 people and bandwidth became a problem - old thick-wire Ethernet. As lead engineer at our site, I found a product that promised to allow servers to have more than one network connection. Wow - great idea. We spent a month testing it in our lab and it seemed to work great. One night, we scheduled downtime for all the servers and my team deployed the NICs to most of the servers. We called it "The NIC at Night" club.
Everything looked fine. The next morning, one of our manufacturing lines kept going down and Unix controllers kept going offline. You can guess the outcome. My multi-NIC servers were broadcasting something that only affected this one type of Unix server, but it was enough to bring production to a stop for three hours. I heard the manufacturing manager was asking the IT manager for $10 million in lost production. My bonus that year could have been better.
What went wrong? My testing wasn't complete; the joint engineering manager (JEM) would have seen that. The change was made without proper notification to others that might be affected; the joint operation manager (JOM) and operations support center (OSC) would have saved hours of downtime had they known something like this was changing. The P100 process now tries to stop this kind of half-baked deployment - it may not catch every problem, but there are enough checks and balances to lessen the chance or minimize the impact if something goes wrong.
The bottom line: Don't let the process kill you. Keep it simple and easy to follow, but demand strict adherence. Once something is certified P100, it can be deployed wherever it fits. Remember, one size doesn't fit all - each product is certified within certain boundaries, and using it outside those boundaries will cause problems. Most of the control starts at the purchasing department. If someone tries to order a tape library not on the P100 list, the order is kicked back. This isn't to say that special cases don't arise, but they are carefully reviewed and approved through a different process.
To be successful in a large-scale project such as EBaR, you must have the commitment of upper management. Intel was spending millions of dollars over a number of years. EBaR isn't something that can be rolled out in a single budget cycle. One of the major problems we had was that we predicted a significant head count reduction, so management cut the heads prior to our full deployment. This left us with fewer people to manage the existing environment and roll out a new one at the same time. It was a classic case of putting the cart before the horse.
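The purchasing gate described above amounts to a lookup against the certified list plus a boundary check. The Python sketch below is purely illustrative: the product names, the `max_servers` boundary and the `review_order` function are hypothetical stand-ins, not Intel's actual P100 list or tooling.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Certification:
    product: str
    max_servers: int  # hypothetical boundary the product was certified for


# Hypothetical certified list; real entries would come from the P100 process.
P100_LIST = {
    "small-changer": Certification("small-changer", max_servers=5),
    "large-library": Certification("large-library", max_servers=15),
}


def review_order(product, servers_served):
    """Approve an order only if the product is certified and used within bounds."""
    cert = P100_LIST.get(product)
    if cert is None:
        return "kicked back: not on the P100 list"
    if servers_served > cert.max_servers:
        return "kicked back: outside certified boundary"
    return "approved"


if __name__ == "__main__":
    print(review_order("large-library", 12))
    print(review_order("large-library", 40))
    print(review_order("unknown-library", 3))
```

A kicked-back order isn't necessarily dead. In this sketch it would simply be routed to the separate review process for special cases.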
Another tip: Involve all the stakeholders early on in the planning stages. Our team discovered related groups who were ready to do their own projects that could pitch in funding and manpower. The phrase "plagiarize with pride" became our rallying cry. If someone found a workable solution in some other department, our team made it P100 and would use it around the world. And remember when you steal something, give credit to those you stole it from. Then make any adjustments needed to fit your certification process and be happy you didn't have to do it yourself.
A not-so-old proverb states that those who conduct backups are only two failed restores away from a pink slip. My final advice: Install a process that doesn't leave room for mistakes.