Previously we discussed The Myth of The Nines, and how the traditional model of measuring availability by straight percentages is flawed. If you agree with what I wrote then, the next problem we must address is how you handle Service Level Agreements (SLAs).
When they draw up SLAs with their management and customers, system administrators are asked to bet their jobs (or salaries, or raises, or bonuses…), effectively, on their ability to guess how many system outages will occur over the next 12 months, and what the total downtime from those outages will be. Once the bet-upon amount of downtime has been reached or exceeded, no more downtime can be permitted.
For the system administrator, this type of SLA is a sucker bet. He almost certainly can't win it over a long period of time. There is no way for a system administrator, or anyone else, to know how many times his systems will go down over any period of time. He is wagering on the quality of the hardware, the operating system, the applications, and the networks that he has implemented. He's betting that black hat hackers and script kiddies won't find a way around his firewall or his anti-virus software. Since the SA has no control over these external factors, how can he bet something as significant as his job on them?
With that in mind, how can an IT department build responsible SLAs?
I propose a model different from simply using the nines. Base SLAs on how much downtime each incident will cause. You might choose to break incidents into different classes, where you allow different amounts of downtime depending on the class into which each outage falls.
For example, if the system crashes, and you have Failover Management Software (FMS) in place, it is reasonable to expect a takeover to complete in five minutes (your mileage may vary), and reasonable to write that (plus a small additional threshold) into an SLA.
If a major disaster occurs, and your DR testing indicates that it will take four hours to switch operations over to your backup site, then it is reasonable to agree to a six-hour window for downtime in a disaster.
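As a rough sketch, the per-incident model might look like this in Python. Only the five-minute failover and the six-hour disaster window come from the examples above; the class names, the two-minute margin standing in for the "small additional threshold", the function name, and the sample incidents are all assumptions made for illustration.

```python
from datetime import timedelta

# Maximum downtime allowed per incident, by class.
# "failover": five-minute takeover plus an assumed two-minute margin.
# "disaster": the six-hour window agreed for a switch to the backup site.
SLA_LIMITS = {
    "failover": timedelta(minutes=5) + timedelta(minutes=2),
    "disaster": timedelta(hours=6),
}


def within_sla(incident_class: str, downtime: timedelta) -> bool:
    """Return True if a single incident stayed within its class limit."""
    return downtime <= SLA_LIMITS[incident_class]


# Hypothetical incidents. Each one is judged on its own duration;
# the number of incidents is never counted against the SA.
incidents = [
    ("failover", timedelta(minutes=4)),
    ("failover", timedelta(minutes=9)),
    ("disaster", timedelta(hours=4, minutes=30)),
]
results = [within_sla(cls, downtime) for cls, downtime in incidents]
print(results)  # [True, False, True]
```

Note that nothing in the check depends on how many outages have already happened, which is the point of the model.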
It is not realistic to sign an agreement that says that a system will only be down two hours over the course of a calendar year, with no other constraints on what counts, or what exemptions there might be, or how many independent outages occur.
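To see why, it helps to run the numbers. The short Python calculation below reuses the figures from the examples above (a five-minute failover and a four-hour switch to the backup site) and assumes a 365-day year; the variable names are my own.

```python
HOURS_PER_YEAR = 365 * 24  # 8760, ignoring leap years

# The "two hours per calendar year" agreement, expressed as a percentage.
annual_budget_hours = 2.0
availability_pct = 100 * (1 - annual_budget_hours / HOURS_PER_YEAR)

# How many five-minute failovers use up the whole annual budget?
failover_minutes = 5
failovers_until_exhausted = int(annual_budget_hours * 60 // failover_minutes)

# Does a single four-hour switch to the backup site fit in the budget?
disaster_switch_hours = 4
disaster_fits_budget = disaster_switch_hours <= annual_budget_hours

print(f"{availability_pct:.3f}% availability")  # 99.977% availability
print(failovers_until_exhausted)                # 24
print(disaster_fits_budget)                     # False
```

Two failovers a month would consume the entire allowance, and one disaster would break the agreement twice over, no matter how well the SA responded.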
As each outage occurs, the system administrator should be able to look back at the timeline of the outage, and determine if there is anything that he can do to improve response time for future outages, thus making it easier to stay within SLA guidelines.
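One simple way to do that review is to measure how long each stage of the outage took and look for the slowest one. The Python sketch below uses an invented timeline; the event names and timestamps are hypothetical, and a real review would take them from logs or incident notes.

```python
from datetime import datetime

# Hypothetical timeline of one outage: (event, timestamp), in order.
timeline = [
    ("outage began", datetime(2024, 1, 1, 2, 0)),
    ("outage detected", datetime(2024, 1, 1, 2, 12)),
    ("recovery started", datetime(2024, 1, 1, 2, 20)),
    ("service restored", datetime(2024, 1, 1, 2, 27)),
]

# Minutes spent between each pair of consecutive events.
phases = {
    f"{start} -> {end}": (t_end - t_start).total_seconds() / 60
    for (start, t_start), (end, t_end) in zip(timeline, timeline[1:])
}

# The longest phase is the first place to look for improvement.
slowest_phase = max(phases, key=phases.get)
print(slowest_phase, phases[slowest_phase])
```

In this made-up case the longest stretch is the time before anyone noticed the outage, which would point toward better monitoring rather than faster recovery.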
Pointy-haired bosses will not be happy with this model because it requires more thinking in advance of the agreement, and it's more complicated than simply saying "99.99% uptime". But for SAs it is the only fair model; it ensures that they are taking responsibility for things that they can control (duration of downtime), as opposed to things they cannot control at all (frequency of downtime).
Evan L. Marcus is Data Availability Maven at VERITAS Software.