Nearly every discussion of risk has to start with a definition of the term--and I won't buck that trend here. Risk is the level of exposure to unfavorable consequences from unplanned or unforeseen circumstances. In the world of enterprise storage, unfavorable consequences range from the inability to meet some aspect of a committed service level, such as performance, to the catastrophic, where most or all storage resources are unavailable to the business units.
It's important to note that unplanned circumstances that expose companies to risk are different from unforeseen circumstances. An unplanned circumstance is one we thought about but chose not to plan for. For example, we may have considered the possibility of our East and West coast campuses experiencing simultaneous disasters, but judged it so remote an occurrence as to be unworthy of any investment in mitigation or planning. An unforeseen circumstance, on the other hand, means we didn't even think about the possibility of a particular event occurring--like the night shift operator deciding to spray WD-40 on all the disk drives to accelerate the backup process. Because the risk was unidentified or inconceivable, no mitigation or planning was in place for such an event.
To mitigate and plan for risk, we have to understand three dimensions:
- Risk source, for the purposes of this column, means whether a risk is internal or external. Our staffer with the WD-40 is an internal risk, while an earthquake is an external risk.
- Risk impact is the degree to which any risk--unplanned or unforeseen--can bring about business interruption.
- Risk probability estimates the likelihood of a particular type of risk affecting operations. Depending on your location, an earthquake might be a likely risk but a hurricane might not be.
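The three dimensions above can be captured in a simple risk register. What follows is only a minimal sketch: the `Risk` record, the 1-to-5 impact scale and the probability figures are assumptions made up for illustration, not values taken from FEMA, DRII or any other source.

```python
from dataclasses import dataclass
from enum import Enum


class Source(Enum):
    INTERNAL = "internal"
    EXTERNAL = "external"


@dataclass(frozen=True)
class Risk:
    name: str
    source: Source       # internal vs. external
    impact: int          # assumed scale: 1 (minor) to 5 (total outage)
    probability: float   # assumed rough estimate between 0 and 1

    def __post_init__(self):
        # Reject values outside the assumed scales.
        if not 1 <= self.impact <= 5:
            raise ValueError("impact must be between 1 and 5")
        if not 0.0 <= self.probability <= 1.0:
            raise ValueError("probability must be between 0 and 1")


def by_source(risks, source):
    """Return only the risks that come from the given source."""
    return [r for r in risks if r.source is source]


# Hypothetical entries; the numbers are placeholders, not measurements.
register = [
    Risk("operator sprays WD-40 on disk drives", Source.INTERNAL, 5, 0.01),
    Risk("earthquake", Source.EXTERNAL, 5, 0.10),
    Risk("hurricane", Source.EXTERNAL, 4, 0.02),
]

for risk in by_source(register, Source.EXTERNAL):
    print(risk.name, risk.impact, risk.probability)
```

Keeping source, impact and probability as separate fields makes it easy to slice the register by any one dimension without committing to a single combined score.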
The Web is a great resource for finding articles by experts on the types of risks you might be exposed to, both external and internal. The Federal Emergency Management Agency (FEMA), which is part of the Department of Homeland Security, has an excellent site on external risks. This site provides maps showing the likelihood of various natural disasters in specific geographic areas. The Disaster Recovery Institute International (DRII) identifies additional risk areas, including co-location with hazardous sites or location between transportation arteries that could be carrying hazardous cargo. And, of course, the media keeps us constantly aware of additional risk from terrorist organizations and the havoc that hackers and viruses can cause. Paranoia can escalate exponentially as risk area after risk area is encountered, each seemingly as likely as the next and all just waiting to happen.
Risk mitigation best practices
No matter how clever your team, or how diligent they are in considering risks, it's unlikely they'll identify every potential risk. In fact, Murphy's Law says the risk event that occurs won't be one that was pre-identified.
In addition to external risks, there are scores of internal risks that can occur because of inadequate planning of the environment or architecture, or even inadequate policy and procedure. We're learning about the need for secured physical and logical access, and we've grasped the concept of dual power supplies and emergency power. The value of an environment with no single point of failure is also well understood. Today, these fundamental risk-mitigation components are accepted as baseline best practices.
More recently, we've begun to understand the need for risk mitigation to deal with hackers, virus invasions and--perhaps the most feared event of all--internal attacks by disgruntled employees. Virus detection is currently a standard best practice, yet because of the nature of the threat, it's destined to always be one step behind the innovations of the malicious minded.
Managing risk in your environment
How does the prudent storage manager demonstrate professional risk management? We've considered the source of risk and found it to be almost infinite in its extent and prevalence, both internally and externally. Still, prudence dictates that we formally consider the risk areas identified by industry best practices, FEMA and DRII. This should be considered the minimum effort because there simply is no excuse for overlooking mitigation in those areas.
One might assume that a logical first step is to make a list of generally accepted risks and rank their probability. This may sound simple and straightforward, but it's actually a complex process. Insurance companies have scores of actuaries who spend sleepless nights and mega-mips of processing power to calculate the risk of a particular event such as a flood or earthquake.
If your company has a corporate risk officer, that individual may be able to help you rank and assess the probability of potential risks occurring. Taking a harshly pragmatic view, the reality of probability is that it's inevitably binary: It will or won't happen.
The probability of a particular risk occurring is almost irrelevant. The real issue is what will happen to the organization if a risk event jeopardizes the storage environment. It doesn't matter what the risk event is because:
- We can't conceive all of the possible risk events.
- We can't afford to mitigate most risk events.
- We have no idea of the probability of a particular risk occurring, except to know that one such event is all it takes.
The outcome of a risk event is also binary: Your business can either continue its normal operations or it can't. If your business can't continue in a normal manner, how much money will you lose this month, this quarter or this year?
The impact on the bottom line dictates the level of investment in risk-outcome mitigation. By focusing on mitigating the outcome of a risk and not on the risk itself, you invest in protection in direct proportion to the data's value to the business. You can get a handle on risk mitigation by having a production strategy that permits operations to continue in the event your production site is impacted, disabled, inaccessible or totally destroyed.
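One way to read "in direct proportion to the data's value" is as a simple allocation: split a fixed mitigation budget across data sets according to how much the business would lose if each were unavailable. The sketch below assumes exactly that. The function name `allocate_budget`, the data set names and every dollar figure are hypothetical, chosen only to show the arithmetic.

```python
def allocate_budget(total_budget, loss_by_dataset):
    """Split total_budget across data sets in proportion to estimated loss."""
    total_loss = sum(loss_by_dataset.values())
    if total_loss <= 0:
        raise ValueError("estimated losses must sum to a positive amount")
    return {
        name: total_budget * loss / total_loss
        for name, loss in loss_by_dataset.items()
    }


# Hypothetical estimates: loss per quarter if each data set is unavailable.
quarterly_loss = {
    "order entry": 600_000,
    "email": 300_000,
    "archive": 100_000,
}

budget = allocate_budget(200_000, quarterly_loss)
for name, amount in budget.items():
    print(f"{name}: {amount:,.0f}")
```

With these made-up figures, order entry accounts for 60 percent of the estimated loss and so receives 60 percent of the budget. Notice that no probability appears anywhere in the calculation; only the outcome is priced.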
Business-continuance planning is the ultimate risk-mitigation strategy because it doesn't matter what unexpected risk occurs, how probable that risk may have been or how easy it was to predict. The answer remains the same: High-value business data must reside on an infrastructure that delivers the level of availability the business needs.
Conventional disaster recovery plans that depend on data recovery and execution at a chosen alternate site can be out of date before they're published. Despite significant investment, conventional disaster recovery plans may be incapable of executing to target expectations.
The bottom-line requirement for effective risk mitigation is a comprehensive disaster recovery capability that's integrated into the production environment and supported by an equally well-developed business-continuance plan. The ability to truly control risk in today's uncertain world can only be achieved by building continuance into every component of the infrastructure.