This article can also be found in the Premium Editorial Download "Storage magazine: Overview of top tape backup options for midrange systems and networking segments."
Download it now to read this article plus other related content.
I recently retired from Intel as product manager for storage products. It was a good, yet hectic ride. During my 10-year tenure in that position, I deployed storage solutions for thousands of servers, adding up to nearly a petabyte of data. This article explains how
Intel came to use an elaborate standardization process to improve its backup and restore procedures.
|Intel's P100 checklist|
After the introduction of its x386 processor, Intel began earnestly moving away from using big iron mainframes to PC servers. Soon we had an assorted jumble of every server size and shape made, running every OS under the sun.
Today, there are over 10,000 servers, ranging in size from 50GB to multiterabytes. They're spread around the world in a dozen major sites and countless small sites and offices. A typical large site will have 200 to 300 servers and a small site could have two on up to 50. To back up all the servers, there are nearly 500 tape libraries and thousands of single tape drives. The libraries range from a single drive eight-tape changer to 16 drive 350 tape libraries. Almost all tape drives are DLT7000s.
Granted, not all this storage is found on Intel-based servers - yet - but the majority of servers are attached to tape drives. Some large engineering sites, served by non-Intel-based systems, need to store 100TB or more of data. Put simply, it's a huge environment consisting of an eclectic mix of different size servers employed worldwide, all of which need to be protected.
In the early days - before CE - anyone who wanted to deploy an application would build up a server, install their application, write some procedures - maybe - and put it on the Net. Soon, we had dozens of individual solutions with little standardization, and system administration quickly became a nightmare. You were likely to find five or six different backup solutions in the same computer room. Most servers would have their own tape drive attached and who knows what backup application, which was running on its own schedule and tape rotation scheme.
In 1995, things were getting out of hand and IT began to develop worldwide standardization programs. Core services like e-mail, office applications and online storage were moved to a standard solution and deployed around the world. This gave us an opportunity to develop an automated backup solution for close to 1,000 servers running Microsoft NT. Every server would have four network connections, three for users and one for maintenance. Backups were done over a maintenance net to a tape library. Each library would support 10 to 15 servers.
Special application servers with large data stores got their own tape changers directly attached (via SCSI). Soon we had a reliable and manageable backup and restore solution. Man-hours to administer backups at sites with 100 servers went from 20 hours/week to three hours/week, reliability skyrocketed and we could manage the backup and restore operations over the network. Often, support people in California would be configuring systems in Ireland and Hong Kong at the same time.
Over the next few years, this basic model was replicated until we had more than 6,000 servers covered. I know the number because once we had to deploy a patch to ARCserve just before daylight-saving time. We did it in one night from a central location. And hardware and support costs dropped sharply - an additional benefit to adopting a company-wide standard.
Even though IT had made significant strides in automating its backup and recovery processes, it still only covered less than half of the available servers worldwide. Everyone was using the same tape hardware, and most NT servers were using ARCserver under the corporate license agreements. Still, many groups still operated their own non-NT servers. There was a large Unix environment not being addressed. Though the number of servers was lower, the total storage on Unix was much higher. Each group had good solutions, but there was little sharing of resources between groups or applications.
Most had direct SCSI-attached tape systems with backup software running on top of the application server. This is always a problem. Backup systems take a lot of maintenance and you don't want to bring down your database application because your tape system needs maintenance. On the other hand, you need the SCSI-type speeds if you plan to back up, or more importantly, restore 500GB in a reasonable time.
The four-hour restore rule
I used to live by the rule that we shouldn't build any server we couldn't restore in four hours. This came from a disaster I once lived through. In the days of 50GB servers and 2GB tapes, we were seven hours into a restore when tapes began to fail. The customer was already upset about being offline for most of the day, and now we had to tell them not all their data was coming back. Not a pleasant situation. That taught me to get a clear agreement with your customer of how long they can stand to be down; then don't let the system grow any bigger than can be restored in that time. And remember to build in time to gather the tapes and get the right resources in place.
But ... back to the Unix problem. Most owners of Unix systems were now also running some NT and we wanted to merge the environments. For the first time, we were able to get NT administrators and Unix administrators to sit down at the table without blood-letting. A project was born - enterprise backup and restore, EBaR. For this project, we were able to pull together people from all over the world to design, test, deploy and manage a single strategy covering almost everyone's needs. This didn't mean that we had a "one size fits all" solution. We went through an extensive process in defining our needs and picking a product that best met them. We chose a Veritas suite of backup products since they covered most of our flavors of Unix and other operating systems and provided a way to centrally manage the environment. The tiered architecture of Veritas' NetBackup made it possible to support large data stores and small servers under one system.
The timing for this project coincided with the deployment of a standardized storage area network (SAN) solution. As we consolidated servers and storage onto large (1TB to 3TB) SANs with 50 to 100 servers, we deployed fiber-based backup systems to service them. This deployment, of course, had its unique set of issues. When it was configured correctly and everything was running right, it was great. SAN bandwidth was adequate - plenty of redundancy and dedicated connectivity is wonderful for moving large amounts of data. But, and this is a big but, there are many variables that can make things go wrong. Servers or libraries were being taken online and offline for maintenance, tape drives failed and fiber bridges went out - which kept many of us from sleeping well at night.
This was first published in July 2002