What you will learn from this tip: How to identify storage events and resolve them programmatically.
Storage managers who are looking to automate problem determination and resolution have two choices: increase the budget and purchase a suite of software and services to deliver these capabilities, or take advantage of the methodology behind the tools to deliver services that can automatically address the issues before they become problematic. Here are eight steps to get you there:
Fight these storage fires -- automatically
Step 1: Identify the problem -- if there is one.
You may choose to undertake a comprehensive evaluation of your data center, or you may start on one site with one database, or anywhere in between. Once you've determined the scope of your audit, you can begin. You need to know what hardware is out there, including hardware across the SAN and hardware that's directly attached to hosts. You will likely be pleasantly surprised at how much storage you actually have.
Step 2: Research the event.
When the switch reports a high port utilization event, a help desk ticket is created. It's usually 2:00 a.m., and there's no one at the data center to help you diagnose the issue. So, you hop into your car, or dial into your VPN, and research the issue. Then you realize that the time of the event seems familiar. It seems that the backup job kicks off at this time each night. When you look at the backup reporting tool, the tapes are spinning and everything looks OK.
If the backup reporting tool had instead shown a failed backup job at this time, you would take the traces from the event and create an alert that calls for action, so the failure can be resolved more quickly in the future.
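The reasoning in Step 2 (does the event line up with the nightly backup, and did the backup succeed?) can be sketched in a few lines of Python. This is only an illustration: the window times, the function names and the three result labels are assumptions, not values from any particular switch or backup product.

```python
from datetime import datetime, time

# Hypothetical nightly backup window; real values would come from your backup scheduler.
BACKUP_WINDOW_START = time(1, 30)
BACKUP_WINDOW_END = time(4, 0)


def in_backup_window(event_time, start=BACKUP_WINDOW_START, end=BACKUP_WINDOW_END):
    """Return True if the event's time of day falls inside the backup window."""
    t = event_time.time()
    if start <= end:
        return start <= t <= end
    # The window wraps past midnight (for example 23:00 to 02:00).
    return t >= start or t <= end


def classify_port_event(event_time, backup_job_ok):
    """Classify a high port utilization event using the Step 2 reasoning."""
    if not in_backup_window(event_time):
        return "investigate"            # no known explanation, research it
    if backup_job_ok:
        return "expected-backup-load"   # tapes are spinning, everything looks OK
    return "backup-failure-alert"       # same time slot, but the backup failed


if __name__ == "__main__":
    print(classify_port_event(datetime(2024, 1, 1, 2, 0), backup_job_ok=True))
```

The point of the sketch is that the time of the event and the state of the backup job together decide whether anyone needs to get out of bed.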
Step 3: Define a corrective action.
The next morning, after a few hours of sleep and a well-deserved cup of coffee, you are back in the office explaining the situation to your team. If there was an actual alarm, this is where the team would get together and determine a corrective action. This could take the form of adding more throughput capability on the switches, breaking up the backup job, changing the schedule, or simply raising the threshold for throughput on the switch port. Both time and events can be part of an automated response to an issue.
Step 4: Document the solution and the traces of the problem into a knowledgebase.
The best way to define a storage management policy is to look at the events that occur in the day-to-day operations of a storage environment from the disk, tape, switch, host and application perspectives. Then you can look at how the events and alarms are created and how the team processes each event -- from problem determination to resolution. I look at these events and alarms as the traces of a problem, and the actions that are taken to resolve them as a policy in its infancy. I have a saying that has served me well: "From practice, comes process. From process comes policy."
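A knowledgebase entry does not need special tooling. As a minimal sketch, assuming you are happy to keep entries in a JSON file, the traces and the resolution can be captured together like this. The field names (`title`, `traces`, `diagnosis`, `resolution_steps`) are hypothetical and should be adapted to whatever your team actually records.

```python
import json
from dataclasses import dataclass, asdict, field


@dataclass
class KnowledgeEntry:
    """One documented problem: its traces and how the team resolved it."""
    title: str
    traces: dict        # e.g. {"source": "switch", "event": "high_port_utilization"}
    diagnosis: str
    resolution_steps: list = field(default_factory=list)


def to_json(entries):
    """Serialize a list of entries to a JSON string."""
    return json.dumps([asdict(e) for e in entries], indent=2, sort_keys=True)


def from_json(text):
    """Rebuild the list of entries from a JSON string."""
    return [KnowledgeEntry(**item) for item in json.loads(text)]


if __name__ == "__main__":
    entry = KnowledgeEntry(
        title="High port utilization during nightly backup",
        traces={"source": "switch", "event": "high_port_utilization", "hour": 2},
        diagnosis="Backup job saturates the port; backup completes normally.",
        resolution_steps=["Confirm backup succeeded", "Close ticket as expected load"],
    )
    print(to_json([entry]))
```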
Step 5: The problem repeats. Now what?
Look at the knowledgebase where you documented the solution the first time and follow the steps to resolve the issue. If the traces from the event are different, then research the issue and provide a resolution.
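The lookup in Step 5 amounts to comparing the traces of the new event against the traces you documented earlier. A rough sketch, assuming events and knowledgebase entries are plain dictionaries with illustrative keys such as `source` and `event`:

```python
def find_resolution(event, knowledge_base):
    """Return the first entry whose documented traces all match the event, else None."""
    for entry in knowledge_base:
        if all(event.get(key) == value for key, value in entry["traces"].items()):
            return entry
    return None


if __name__ == "__main__":
    # Hypothetical knowledgebase with a single documented problem.
    kb = [
        {
            "title": "High port utilization during nightly backup",
            "traces": {"source": "switch", "event": "high_port_utilization"},
            "resolution_steps": ["Confirm backup succeeded", "Close ticket"],
        }
    ]
    event = {"source": "switch", "event": "high_port_utilization", "port": 12}
    match = find_resolution(event, kb)
    if match:
        print("Known issue:", match["title"])
    else:
        print("New traces: research the issue and document a resolution")
```

When `find_resolution` returns `None`, the traces are different and you are back to researching the issue, as described above.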
Step 6: Go back into problem resolution mode.
The next morning, you should check the problem traces (events) to make sure that it is the same situation. This may be a recurring event, which calls for filtering technologies to remove the spurious event from the notification system, or changing the thresholds on the switch or infrastructure that generates the event(s).
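One way to decide whether an event is truly recurring is to count how many different days it has shown up at the same hour. The sketch below assumes each event is a dictionary with `source`, `event` and a `timestamp` (a `datetime`); those keys and the three-day threshold are assumptions for illustration only.

```python
from collections import defaultdict
from datetime import datetime


def recurring_signatures(events, min_days=3):
    """Return the (source, event, hour) signatures seen on at least min_days distinct days."""
    days_seen = defaultdict(set)
    for e in events:
        ts = e["timestamp"]
        signature = (e["source"], e["event"], ts.hour)
        days_seen[signature].add(ts.date())
    return {sig for sig, days in days_seen.items() if len(days) >= min_days}


if __name__ == "__main__":
    history = [
        {"source": "switch1", "event": "high_port_utilization", "timestamp": datetime(2024, 1, d, 2, 5)}
        for d in (1, 2, 3)
    ]
    history.append(
        {"source": "switch1", "event": "link_down", "timestamp": datetime(2024, 1, 2, 15, 0)}
    )
    for sig in recurring_signatures(history):
        print("Candidate for a filter or threshold change:", sig)
```

Signatures that come back from this function are the candidates for filtering or for a threshold change; anything else still deserves a human look.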
Step 7: If the problem occurs repeatedly, automate a response with a simple script.
Now that you are a master of this issue, you can script a response or filter out this type of event in the monitoring and alerting solution that you use.
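What such a script looks like depends entirely on your monitoring tool, so treat the following as a shape rather than a recipe. It assumes events arrive as dictionaries with hypothetical keys (`event`, `source`, `hour`, `backup_ok`) and shows one rule that suppresses the known backup-window event while everything else is escalated to a person.

```python
import logging

log = logging.getLogger("storage-auto")


def suppress(event):
    """Record the event but do not page anyone."""
    log.info("suppressed %s from %s", event["event"], event["source"])
    return "suppressed"


def escalate(event):
    """Hand the event to a human; a real script would open a ticket or send a page."""
    log.warning("escalating %s from %s", event["event"], event["source"])
    return "escalated"


# Each rule pairs a match test with the action to take. The hours are an assumed backup window.
RULES = [
    {
        "name": "backup-window port utilization",
        "matches": lambda e: (
            e["event"] == "high_port_utilization"
            and 1 <= e["hour"] < 4
            and e.get("backup_ok", False)
        ),
        "action": suppress,
    },
]


def respond(event, rules=RULES):
    """Apply the first matching rule; escalate anything that no rule recognizes."""
    for rule in rules:
        if rule["matches"](event):
            return rule["action"](event)
    return escalate(event)


if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO)
    print(respond({"event": "high_port_utilization", "source": "switch1", "hour": 2, "backup_ok": True}))
    print(respond({"event": "high_port_utilization", "source": "switch1", "hour": 2, "backup_ok": False}))
```

Defaulting to escalation is deliberate: the script only stays quiet for situations the team has already seen and documented.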
Step 8: Continual monitoring and training.
Train all of your staff, including new hires, on this process.
In the end, if this allows you to sleep an additional few hours and focus your precious time on what is truly important, then you are doing the right thing and can move on to the next big thing.
The goal: To keep your team from working on problems that aren't real problems so they can focus on the truly large issues at hand.
About the author: Brett P. Cooper is a frequent speaker at storage industry events. During his recent tenure at Veritas Software, he was responsible for developing and delivering the first release of Veritas SANPoint Control, one of the industry's first storage management solutions. Brett was also one of the founders of the Veritas Press, where he acted as technical advisor for the well-known storage reference book, "Storage area network essentials: A complete guide to understanding & implementing SANs," by Paul Massiglia and Richard Barker. In Brett's current role for Network Appliance, he is responsible for delivering Fibre Channel Protocol (FCP) and Internet SCSI (iSCSI) solutions to the market.