
Got insanity? Meet cross-domain correlation engines

Finding the root of a device failure can lead to many headaches in an enterprise IT infrastructure. Storage analyst Arun Taneja says that cross-domain correlation engines could lead you to that light at the end of your nightmare tunnel.

The enterprise IT infrastructure is getting so complex these days that when a device fails, and devices do fail, it is often a guessing game to figure out what failed, let alone why it failed. Of course, all smart IT managers are building their infrastructures with copious amounts of redundancy, starting with power supplies, fans and RAID controllers, and then adding dual HBAs and storage area network (SAN) switches to eliminate any single points of failure.

But when a failure does happen -- say an HBA goes -- the failed device starts spewing all kinds of alerts that can cause havoc downstream due to something called sympathetic alerts: alerts raised by healthy devices that merely depend on the failed one. It is not unusual for a failed or a failing device to create thousands of alerts that all register on the administrator's console. Identifying the root cause is a process that leaves many an administrator breathless. Not only does this issue affect the storage administrator but, depending on the problem, a barrage of alerts is sent to the application, network and database administrators as well. Each one of these guys has a stake in the delivery of the SLA. Instead of one person scrambling to resolve the issue, often there are several who try to look at the alerts to determine the root cause.

You can see why we have such an issue today. There are hundreds of moving parts that make up the IT infrastructure. This is the price we pay for having networked practically everything. Today, everything is connected to everything else, either directly or indirectly. It is little surprise, then, that we have surpassed the point where human beings can correlate these hundreds of inter-related items and extract the root cause of the problem. I see even the best storage administrators struggle with scripted rules for how each device in the network behaves under failure. But, given the increasing complexity, this is a losing battle. Even the smartest administrator cannot solve today's problems with the tools that exist. And mind you, I totally accept the fact that all hardware devices and all software today come with decent diagnostics capability. But, as you see from above, the issue is not solvable by these diagnostics alone.

Fortunately, a few large vendors and a few startups have figured out that the problem exists and needs to be solved. Given that the problem is complicated, it is little surprise that the action is coming only from a few vendors. Before we talk about who they are, let me describe how they address this problem. Simply put, the products perform automated fault management and use modeling techniques rather than the rules-based methodologies used to solve simpler versions of this problem. At the heart of these products is a cross-domain correlation engine, which programmatically understands the relationship between all connected devices and is able to quickly (almost in real time) identify the root cause of the problem. Of course, all these products come with auto-discovery functionality to discover and map the topology of the infrastructure, both physically and logically. How they gather the information, in what format they store the information and the accuracy of their analytics engine determine their effectiveness. This is a non-trivial exercise, requiring some serious Ph.D.-type knowledge.
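To make the idea concrete, here is a minimal sketch of topology-based root-cause isolation. It is not how any of the products discussed here actually work; the component names, the `DEPENDS_ON` map and both functions are hypothetical, and a real engine would discover the topology automatically and reason over far richer models. The sketch only shows the core intuition: an alerting component whose upstream dependencies are all quiet is a likely root cause, and everything downstream of it is probably a sympathetic alert.

```python
from typing import Dict, List, Set

# Hypothetical topology: each component maps to the components it depends on.
DEPENDS_ON: Dict[str, List[str]] = {
    "app-server": ["db-server"],
    "db-server": ["hba-1"],
    "hba-1": ["san-switch-a"],
    "san-switch-a": ["array-1"],
    "array-1": [],
}


def upstream(component: str, depends_on: Dict[str, List[str]]) -> Set[str]:
    """Return every component that `component` depends on, directly or indirectly."""
    seen: Set[str] = set()
    stack = list(depends_on.get(component, []))
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(depends_on.get(node, []))
    return seen


def root_causes(alerting: Set[str], depends_on: Dict[str, List[str]]) -> List[str]:
    """Keep only alerting components that have no alerting component upstream."""
    return sorted(
        c for c in alerting if not (upstream(c, depends_on) & alerting)
    )


# An HBA fails; the database and application servers raise sympathetic alerts.
alerts = {"app-server", "db-server", "hba-1"}
print(root_causes(alerts, DEPENDS_ON))
```

In this toy case the three alerts collapse to a single candidate, the HBA, because both servers sit downstream of it.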

First of all, there are devices that have failed. Relatively speaking, these are easier to find. Devices that have started to perform poorly for one reason or another are extremely difficult to find, as they don't generate a trap, and therefore, cannot be detected via software that operates on the basis of a trap. And of course, the device is still functioning, so all other software in the environment still interacts with it and finds no flaw. And yet, to answer some of the questions below, we need to not only detect this poorly-performing device but determine its relative capability.
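Since a degraded device raises no trap, one way to catch it is to compare its recent behavior with its own history. The sketch below is an illustration under stated assumptions, not a description of any vendor's method: the `is_degraded` function, the latency samples and the threshold of three standard deviations are all hypothetical.

```python
from statistics import mean, stdev
from typing import Sequence


def is_degraded(
    baseline_ms: Sequence[float],
    recent_ms: Sequence[float],
    threshold: float = 3.0,
) -> bool:
    """Flag a device whose recent mean latency sits far above its baseline.

    The distance is measured in baseline standard deviations (a z-score).
    `baseline_ms` needs at least two samples; `recent_ms` needs at least one.
    """
    mu = mean(baseline_ms)
    sigma = stdev(baseline_ms)
    recent_mu = mean(recent_ms)
    if sigma == 0:
        # A perfectly flat baseline: any increase counts as a deviation.
        return recent_mu > mu
    return (recent_mu - mu) / sigma > threshold


# Hypothetical response times (ms) for one SAN port.
baseline = [5.0, 5.2, 4.9, 5.1, 5.0]
print(is_degraded(baseline, [7.5, 8.0, 7.8]))    # True: well above baseline
print(is_degraded(baseline, [5.0, 5.1, 5.05]))   # False: within normal range
```

A check like this says only that a device has drifted from its own norm; judging its relative capability, as the questions below require, takes the broader model of the environment.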

These questions include:

  • What is the system's ability to tell me proactively if an incremental change occurring in the environment (as simple as growth in data or the number of users) will put the application outside the SLA in 10 days?
  • How about telling the administrator that a specific backup job will fall outside the backup window assigned to it, in a few days?
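The backup-window question can be illustrated with a simple trend projection. This is a minimal sketch that assumes job duration grows roughly linearly from night to night; the `days_until_breach` function and the sample figures are hypothetical, and a real predictive engine would account for far more than a straight-line fit.

```python
import math
from typing import Optional, Sequence


def days_until_breach(
    durations_min: Sequence[float], window_min: float
) -> Optional[int]:
    """Estimate how many days remain before a nightly job outgrows its window.

    Fits a least-squares line through one duration sample per day and projects
    it forward. Returns 0 if the latest run already exceeds the window, and
    None if the trend is flat or shrinking.
    """
    n = len(durations_min)
    if n < 2:
        raise ValueError("need at least two daily samples")
    if durations_min[-1] > window_min:
        return 0

    x_mean = (n - 1) / 2
    y_mean = sum(durations_min) / n
    sxx = sum((x - x_mean) ** 2 for x in range(n))
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in enumerate(durations_min))
    slope = sxy / sxx
    intercept = y_mean - slope * x_mean
    if slope <= 0:
        return None

    # Day index at which the fitted line reaches the window, relative to today.
    breach_day = (window_min - intercept) / slope
    return max(0, math.ceil(breach_day - (n - 1)))


# Hypothetical nightly backup durations (minutes) against a 270-minute window.
print(days_until_breach([200, 210, 220, 230, 240], 270))  # 3
```

With the job growing ten minutes a night, the fitted line reaches the 270-minute window three days out, which is the kind of early warning the question asks for.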

The number of predictive items one can conjure up is limitless. Take an extra step forward. How about the system's ability to tell me what needs to be done to optimize data protection for a backup stream? How about telling the compliance officer that certain aspects of the system are about to fall outside the requirements and then offering up a solution to the problem? Do you see how we go from a pathetic state of "pulling our hair out" to fine-tuning an environment proactively and instantly knowing the business impact of an IT issue? This is the only way I see us progressing. The good news is that we have seen serious progress in this arena in the last three years.

The products I am most impressed with include EMC SMARTS, Onaro and WysDM. IBM just introduced IBM Process Manager 1.1, which looks very promising as well, but it is too early for me to have an opinion. Onaro has attacked the problem of change management in an FC SAN, whereas WysDM has solved the problem on the data protection side. If you recall, EMC acquired a company called SMARTS a while back, and we have just seen the first set of results from that acquisition. The most recent incarnation of SMARTS came out as Storage Insight for Availability and is targeted at automated fault management of FC SANs. All these products boast a cross-domain correlation engine of some sort, where vast amounts of information are modeled and the root cause established. Some products, such as SMARTS, are "root cause analysis" oriented and excel in fault isolation; others, such as WysDM and Onaro, use a predictive analysis engine enabling them to go further into the world of "proactivity" ("what if" scenarios and predictions). Note that EMC resells WysDM as Backup Advisor. All of them are headed towards connecting IT issues with business impact; some already have initial offerings for compliance, for instance.

So, if your environment is complicated enough, and you are increasingly feeling out of control in managing your SAN or the data protection environment (I would truly like to meet you if you do not fall in this category), I suggest you look at these new offerings and see if they will help you Rogaine (I mean regain) some of your hair. Keep in mind that all these products are in their infancy, relative to their potential, but they have all reached a level of maturity that makes them deployable in a production environment today. Just to be sure, you should also check out a few other data protection management (DPM) players, such as Bocada, Tek-Tools, Aptare and Illuminator; but I think you will quickly discover that, while all of them do a great job of reporting, not all are created equal when it comes to answering the predictive questions that require a solid cross-domain correlation engine at the center.

About the author: Arun Taneja is the founder and consulting analyst for the Taneja Group. Taneja writes columns and answers questions about data management and related topics.
