An analysis of multiple large and mid-size organizations indicated that, in 2004, many of them experienced multiple incidents of SAN-related application downtime.
These incidents led to a total of tens, and in some cases hundreds, of hours of unplanned application downtime per year in each of these organizations due to SAN issues. These SAN-related disruptions were a nasty surprise to the CIOs and top management, who often were not aware of this high-risk exposure of SANs.
For example, within the last six months, the complete ATM service of one of the largest banks in Europe was interrupted for about a day and a half following a human configuration error in a SAN. In one of the largest investment banks in the world, a high-value trading application went down due to a SAN problem. In a leading health-care organization, a SAN change error caused data corruption and downtime for a number of internal applications.
So what is the main reason for the high number of SAN-related business disruptions?
Existing SAN fault-tolerance mechanisms, such as dual-fabric redundancy, are simply insufficient for dealing with the dynamic nature of today's SANs.
To understand why, let us first consider in general the principle of a redundancy-based fault-tolerant mechanism, and then map it to the realities of current SAN environments.
A redundancy mechanism is meant to ensure that it takes at least two independent local faults, occurring within a short interval, to cause an end-to-end global failure. If the likelihood of the local faults is below a certain threshold, then the likelihood of the global failure is considered to be sufficiently low.
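The arithmetic behind this principle can be sketched in a few lines. This is only an illustration: the function name and the 1% local-fault probability are assumptions chosen for the example, not measured figures, and the calculation holds only if the local faults really are independent.

```python
def global_failure_probability(p_local: float, paths: int = 2) -> float:
    """Chance that every redundant path is faulty in the same interval.

    Assumes the local faults are independent and equally likely,
    which is exactly the assumption that redundancy relies on.
    """
    return p_local ** paths


# Hypothetical figure: a 1% chance of a local fault per path per interval.
single = global_failure_probability(0.01, paths=1)  # no redundancy
dual = global_failure_probability(0.01, paths=2)    # two independent paths

print(f"single path: {single:.4%}, dual path: {dual:.4%}")
```

With two truly independent paths, the global failure probability is the square of the local one, which is why redundancy is attractive as long as its assumptions hold.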
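A redundancy-based fault-tolerance mechanism can be rendered ineffective by any of the following three deficiencies: undetected faults (a local fault goes unnoticed, so redundancy is already lost before the next fault occurs), dependent faults (faults in the redundant paths are not independent of each other), and out-of-scope faults (a single fault that the redundancy mechanism does not cover is enough to cause a global failure).
In SANs, traditional fault-tolerance mechanisms, such as dual-fabric redundancy, exhibit all three of these deficiencies. It is these deficiencies that result in the large number of business disruptions.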
Most of the local faults in a SAN are not easy to detect, because they require understanding of end-to-end dependencies between multiple components within SAN access paths. Furthermore, these dependencies may rapidly change as SANs constantly evolve and grow (reflecting evolution of technology infrastructure and growth of business demands).
For example, in mid-sized and large organizations, there are typically thousands of SAN access paths and hundreds to thousands of SAN changes per year -- each of which can create a fault in any access path due to some relationship between components being violated.
In many organizations, it is typical to find that 5% to 15% of their SAN access paths are not dual-fabric redundant as required, due to some undetected fault.
This increases the likelihood of an application disruption. In a leading insurance company, a planned switch upgrade brought down a business-critical application, because the redundant access path that was supposed to take over had a fault that had never been detected. In another organization, a simple component failure brought down a number of applications, because the dual-fabric redundant path was not set up correctly.
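A check for this kind of undetected fault can be sketched as follows. It is a minimal illustration that assumes an inventory of currently healthy access paths is available as (host, LUN, fabric) records; the record layout, host names and the `non_redundant` helper are hypothetical and do not correspond to any particular product.

```python
from collections import defaultdict

# Hypothetical inventory: one record per healthy access path.
paths = [
    ("app-server-1", "lun-10", "fabric-A"),
    ("app-server-1", "lun-10", "fabric-B"),
    ("app-server-2", "lun-20", "fabric-A"),
    ("app-server-2", "lun-20", "fabric-A"),  # two paths, but same fabric
]


def non_redundant(paths, required_fabrics=2):
    """Return (host, lun) pairs reachable through too few distinct fabrics."""
    fabrics = defaultdict(set)
    for host, lun, fabric in paths:
        fabrics[(host, lun)].add(fabric)
    return sorted(
        pair for pair, seen in fabrics.items() if len(seen) < required_fabrics
    )


violations = non_redundant(paths)
print(violations)
```

Here the second server has two paths, yet both run through the same fabric, so a single fabric fault would disconnect it. Running such a check after every change is one way to catch a lost redundant path before the remaining path fails.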
Most local faults in SANs are caused by change and operational errors. Operations are typically performed twice -- once for each fabric in a dual-fabric redundancy set-up. Consequently, a large percentage of these actions result in related faults in both redundant access paths. These actions are typically performed manually, and 5% or more of them may contain some kind of error.
Last year, in one of the largest banks in New York, there were a few incidents in which production application downtime was caused by corresponding zoning configuration errors repeated across both fabrics. In an insurance organization, multiple application disruptions were caused by the same interoperability violation in the corresponding dual-access paths.
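The effect of dependent faults on the redundancy arithmetic can be illustrated with a simple model. This is a sketch under stated assumptions: the 5% error rate is the figure quoted above, while `repeat_fraction` (the share of errors that are repeated identically on the second fabric) is a hypothetical parameter introduced only for this example.

```python
def both_paths_faulty(error_rate: float, repeat_fraction: float) -> float:
    """Chance that one operation leaves BOTH redundant paths faulty.

    The operation on the first fabric goes wrong with probability
    error_rate. Given that error, it is repeated on the second fabric
    with probability repeat_fraction; otherwise the second fabric fails
    only through a fresh, independent error.
    """
    repeated = repeat_fraction
    independent = (1 - repeat_fraction) * error_rate
    return error_rate * (repeated + independent)


for r in (0.0, 0.5, 1.0):
    print(f"repeat_fraction={r:.1f}: {both_paths_faulty(0.05, r):.4%}")
```

When no errors are repeated, the result is the familiar square of the error rate. When every error is repeated on both fabrics, the result equals the error rate itself, and the second fabric adds no protection against that class of fault.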
There are some classes of SAN faults for which a single fault is sufficient for causing an application disruption, despite the redundancy fault-tolerance mechanism.
For example, if a storage LUN is erroneously allocated to two different applications (two separate access paths), data corruption and application downtime often result. Such conflicting allocations are fairly likely, because old configurations are often not fully cleaned up following changes such as migration and decommissioning.
In 30% to 40% or more of device-decommissioning and migration changes, some old configurations are not cleaned up. In each case, a single additional event, such as allocation of a previously used component, may result in data corruption and application downtime.
At one of the world's largest investment banks, reuse of an old HBA in a new server resulted in data corruption and disruptions to business-critical applications, because old zone configurations were not cleaned up in fabric switches.
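Both kinds of out-of-scope fault described here, a LUN allocated to more than one application and a zone that still references a retired HBA, can be detected with simple consistency checks. The sketch below assumes the relevant configuration has been exported into plain Python structures; all names, WWN strings and the two helper functions are hypothetical.

```python
from collections import defaultdict

# Hypothetical configuration snapshots.
lun_masking = [
    ("lun-10", "payments"),
    ("lun-11", "reporting"),
    ("lun-10", "archive"),  # same LUN handed to a second application
]
zones = {
    "zone_pay": {"wwn-01", "wwn-a1"},
    "zone_old": {"wwn-09", "wwn-a2"},  # left over from a retired server
}
decommissioned_hbas = {"wwn-09"}


def conflicting_luns(masking):
    """Map each LUN allocated to more than one application to its owners."""
    owners = defaultdict(set)
    for lun, app in masking:
        owners[lun].add(app)
    return {lun: sorted(apps) for lun, apps in owners.items() if len(apps) > 1}


def stale_zones(zones, retired):
    """Return names of zones that still contain a retired HBA."""
    return sorted(name for name, members in zones.items() if members & retired)


print(conflicting_luns(lun_masking))
print(stale_zones(zones, decommissioned_hbas))
```

A stale zone is harmless until the retired HBA is reused, which is why such leftovers tend to go unnoticed; flagging them at decommissioning time removes the latent fault before the single triggering event can occur.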
All of the fault-tolerance deficiencies above can be addressed by an approach that augments traditional redundancy mechanisms. This approach, which can be considered a "process-redundancy" enhancement, incorporates continuous feedback and feed-forward end-to-end validation steps into SAN management processes. Process-redundancy enhancement eliminates all the above deficiencies of traditional redundancy mechanisms (undetected faults, dependent faults and out-of-scope faults).
Software solutions such as Onaro's SANscreen provide such process-redundancy capabilities by:
After implementing such an enhanced approach with Onaro's SANscreen solution, State Street Global Advisors (SSGA), a leading investment bank, reduced SAN-related application disruptions to zero in 2004.
As demonstrated at SSGA and numerous other organizations, such an enhanced SAN validation approach not only significantly reduces the disruption risk, but can also reduce operational and capital costs and be extended to other parts of the IT infrastructure. Ultimately, this approach leads to better quality of service and better quality of business.