Availability, a layered approach
Evan L. Marcus
In Part 1 of this series, we introduced the idea that implementing availability requires taking a layered approach. Figure 1 reprises that layering.
In many ways, the first layer, Good System Administrative Practices and Infrastructure, is the most important, and could surely take up two or three columns all by itself. These practices encompass many different disciplines and techniques, all of which combine to build the foundation of our layered approach to availability. Since all of these practices are equally important, the list that follows is not in any particular order. Neither the list nor the discussion that follows is exhaustive.
Assume Nothing: When you buy a new computer system, it does not come with high availability features built in. It takes planning, effort, and expense to achieve production-quality availability on which you can base your business. Unless you pay extra, systems won't fail over and disks won't be mirrored. Unless you do extra work beyond just unboxing your new systems and plugging them in, they will not get backed up, and they won't scale. Don't assume that your OS vendor has taken care of all this for you; he hasn't. He will, but it'll cost you more money and time.
Don't Be Cheap: Increasing availability is an investment in the systems that run your business. When they are operational, they are able to run your business and make you money. When they are not, they are costing money. Implementing high availability costs money, but if you do it right, you will make the money back many times over. (In a future column, we will take a look at the Return on Investment implications of increasing your availability; you'll see that it's a no-brainer!) Quality costs money, whether you're buying ice cream (compare Haagen-Dazs to the store brand with the little ice crystals in it), cars (compare a Ferrari with a Geo), or skilled personnel and extra systems to increase the availability of your systems.
Maintain Tight Security: Don't let unnecessary users on critical systems; if they don't need to be there, then they shouldn't be there. System access should be on a need-to-have basis. Make sure that your system administrators log in using their unprivileged accounts before they promote themselves to root or administrator; this way you can always tell who is doing what, and when. Use firewalls, and don't open up any more ports than are absolutely necessary. Enforce good password selection, although I am opposed to the combination of requiring specific character classes in passwords AND aging them so they must be reset every few weeks. Password selection becomes weaker and weaker, and/or users are forced to write them down. And keep those virus checkers up to date; it used to be that once a month was sufficient to update your virus scanners. Now the conventional wisdom is that you need to update them every day or two. (Unless you're a Unix administrator, in which case you can just smile smugly to yourself...)
Consolidate your servers: Mark Twain said it best: "The fool says, 'Don't put all your eggs in one basket,' which means that you should spread your resources all over the place. What you should do is put all your eggs in one basket, and then watch that basket!" I would only add one thing: if you build that basket out of titanium-reinforced concrete, your eggs will be safe. By consolidating, you reduce the number of things that can go wrong. Admittedly, you increase the impact of any single failure, but by implementing appropriate levels of protection, you can reduce the frequency of serious outages.
Watch your speed: It's no longer sufficient for servers to be up; they must also be fast. In some cases, slow servers are indistinguishable from down servers. There are software tools that can simulate the user experience, and show you what your users are seeing. Monitor resource utilization, and increase your resources BEFORE things become critical. Zona Research estimates that $4 billion a year is lost in abandoned Web transactions because of slow Web sites.
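To make "BEFORE things become critical" a little more concrete, here is a minimal sketch of a utilization check in Python. The function names and the 70 and 90 percent thresholds are assumptions chosen for illustration, not recommendations; pick numbers that give you enough lead time to buy and install more capacity.

```python
import shutil


def utilization_percent(used: int, total: int) -> float:
    """Return used capacity as a percentage of total capacity."""
    if total <= 0:
        raise ValueError("total must be positive")
    return 100.0 * used / total


def classify(percent: float, warn_at: float = 70.0, critical_at: float = 90.0) -> str:
    """Map a utilization percentage to 'ok', 'warn', or 'critical'.

    The default thresholds are illustrative assumptions only.
    """
    if percent >= critical_at:
        return "critical"
    if percent >= warn_at:
        return "warn"
    return "ok"


def check_disk(path: str = "/") -> tuple[str, float]:
    """Check the file system that holds `path` and return (status, percent used)."""
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (in bytes)
    percent = utilization_percent(usage.used, usage.total)
    return classify(percent), percent


if __name__ == "__main__":
    status, percent = check_disk("/")
    print(f"/ is {percent:.1f}% full: {status}")
```

The point of the "warn" level is that it fires while there is still time to act. Run something like this on a schedule, and treat a warning as the signal to start the purchase, not a critical alert.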
Document everything: Create and use run books that tell users and administrators exactly what to do when things go wrong. And remember to keep them off-line, too, because if the instructions to boot the system up are on the system, they won't do anyone any good! Write the documentation for unskilled people to follow; don't assume that your knowledgeable people will be around when these procedures must be run. And review it regularly. The only thing worse than no documentation is bad documentation. Suggestion: when new people join the IT staff, have them run through the procedure book; this tests the procedures, and it makes the new people productive from day one. That's a win-win situation.
Test everything, too: Before new applications or systems or procedures make it into production, they must be thoroughly tested in an environment that is as close to production as possible, without actually touching production. The more the test environment differs from production, the less valid the test is, and it doesn't take long before the test is totally useless. Ideally, you want to include real users in your testing. Fact is, that almost never happens, but it's a wonderful goal.
Separate Your Environments: It is totally inappropriate to do anything but production work with production-quality applications on production hardware. You should separate your environments so completely that there is no interaction of any kind between production, development, or quality assurance. In fact, you may need as many as six separate environments, five of which are: production, quality assurance, development, disaster recovery, and a sandbox or play area for new hardware and software. The sixth environment we call the production mirror. That is kind of a misnomer; it's really a time warp for production. This environment should look like production did, say, two weeks ago. This way, if you attempt to implement something in production, and it fails, you have a clean version of a production environment to go back to, or to restore from quickly and cleanly. If you have a clean back-out procedure for absolutely anything that you ever change in production, then a production mirror/time warp isn't really required.
Build for growth: The physicists who read 8wire will all know Boyle's Law. It says that a gas will expand to fill all available space. (OK, that's not EXACTLY what it says, but it's close enough for our purposes.) Boyle's Law can also be applied to computer systems; resource consumption always expands to fill capacity. No matter how much disk space you put in your system, no matter how much memory, or how many high-speed CPUs, you know that they will be consumed before long, and you'll be adding more. Plan for this eventuality by purchasing systems with extra capacity: space for extra disks, extra memory, extra CPU, and extra slots on your backplane. Eventually you'll fill them.
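If you want a rough idea of how soon "eventually" is, a back-of-the-envelope projection helps. The sketch below assumes that consumption grows by a constant amount every day, which is a simplifying assumption; real growth is usually lumpier. The function name and the figures in the example are hypothetical.

```python
from typing import Optional


def days_until_full(capacity_gb: float, used_gb: float,
                    daily_growth_gb: float) -> Optional[float]:
    """Estimate the days left before a resource is full, assuming linear growth.

    Returns None when usage is flat or shrinking, and 0.0 when the
    resource is already full.
    """
    if daily_growth_gb <= 0:
        return None  # no growth, so no projected fill date
    remaining_gb = capacity_gb - used_gb
    if remaining_gb <= 0:
        return 0.0  # already at or over capacity
    return remaining_gb / daily_growth_gb


if __name__ == "__main__":
    # Hypothetical figures: a 1000 GB array, 400 GB used, growing 5 GB a day.
    print(days_until_full(1000, 400, 5))
```

With those made-up numbers, 600 GB of headroom divided by 5 GB a day gives 120 days. If it takes you 90 days to get a purchase approved and installed, that is not much slack, which is the argument for buying the empty slots up front.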
Build for repair: If a system fails and must be physically replaced, in most cases you'll want to take the disks from the old system and put them onto the new one. The easiest way to do that is to keep all disks on all systems external to the cabinet with the CPU and system boards in it. Consider a desktop client; if it must be replaced, then the system administrator must open the case and remove the disks, which subjects them to a risk of static shock, being dropped, or being mishandled in some other way. If the disks are kept outside the cabinet, then they need only be unplugged from the old system, and plugged into the new one. The risk of damaging disks is reduced, and the repair/replacement time is reduced significantly. Users can get back on line quickly, and are happy.
Choose Mature Software and Hardware: If you go to Sun (or any other system vendor; we're not picking on Sun here) and take the very first E3800 off the assembly line, and grab the pre-release version of Solaris 9, and put that system into production, you will almost certainly run into more trouble than someone who takes Solaris 8 (with all the required patches) and installs it onto an E3500, and puts that into production. If you insist on getting bleeding edge technology and rushing it into production without proper testing, you are setting yourself up for failure. Mature software (avoid 1.0 releases or any x.0 release) will surely run better and with fewer problems than brand new stuff that may not have undergone adequate testing.
Reuse configurations: If a configuration works well in one implementation, it will work well in others. Fewer configurations are easier to support, you'll have a higher degree of confidence for new rollouts, and you may be able to bulk purchase equipment. What's more, it means that there is less for new personnel to learn; they can come up to speed faster. Choose 3 or 4 server configurations, based on size requirements, and 1 or 2 desktop configurations, and try to stick to them. They will change over time, but shouldn't change more than necessary.
Exploit external resources: Twenty years ago, it was possible for someone in the IT industry to know "everything". Life was much simpler then, with fewer vendors, fewer products, and fewer classes of products. Today, nobody can possibly know everything there is to know in IT. So that means you need to call on experts. Attend conferences, surf the web, use vendor consulting and training, contact reference sites before you buy new products, read magazines and journals, and hire knowledgeable people. If you guess on major purchases or major implementations, you will likely fail; by using the proper resources, you can significantly increase your likelihood of success.
And finally, the KISS principle: Keep it simple, sister. Don't put stuff on systems that doesn't need to be there. Get unnecessary applications and hardware off your production servers. (No playing DOOM in production, even at night!) Limit network connections. Avoid monitors and screen savers on servers. Why? Because people make more mistakes on complex systems than they do on simple ones. And mistakes lead inexorably to downtime. If you keep things simple, you give your users and administrators less chance to make mistakes that will bring systems down. Plus, new system administrators have a serious learning curve to climb when they join your team; by keeping things simple, you can flatten out that curve, and make your new people productive much sooner.
In future columns, we will move much more quickly up the availability index shown in Figure 1. We plan to demonstrate that achieving high availability requires a layered approach, and cannot be achieved by simply installing failover software and walking away.
In the meantime, this is The Maven; if you have any comments or questions, send them to me at firstname.lastname@example.org.
Copyright 2002, Evan Marcus