It doesn't matter how good your metrics are if they're inaccessible to the people who need them. Don't cut corners when planning the presentation of your data. Spend some time determining which indicators are essential to understanding your operational efficiency. Metrics like utilization and cost per gigabyte can be powerful, but must be presented carefully to avoid sending the wrong message.
When designing a graphical display of your metrics (sometimes called a dashboard), pay particular attention to what the graphic itself is showing. Good designs include reference marks that show if you're hitting targets for the metric in question and allow direct visual comparison between sets. Don't let the graphics overwhelm your message. Avoid flashy 3D effects, textures, bright fill colors and other elements that may distract from the information being presented. Sometimes the simplest chart types are the most effective: columns (stacked columns are especially useful for utilization), pie charts for showing relative amounts and line graphs to illustrate trends.
If you're keen on making the most of your key performance indicators, I recommend the work of Edward Tufte (www.edwardtufte.com). His books and seminars on presenting statistical information are well worth the price, and should help hone your data presentation skills.
I quickly went about implementing this and it worked. First, I found out that I had a network bottleneck because I had based my LAN design on the shape of the building rather than on the number and type of workers in those areas. But the important finding came next. Once I began proactively recording and sharing performance and availability metrics, my users became much happier with the state of our systems. They now understood that a single outage was an anomaly rather than the rule.
So let's talk about how you can use metrics to improve service and satisfaction. We'll start with a discussion of base metrics and parametrics, the inputs into the system. Then we'll look at performance indicators and how to decide on the few key performance indicators that will concisely tell the story.
Base metrics and parametrics
Most people could rattle off a dozen base metrics. These are the objective measurable values you get from management packages and performance meters. Pretty much anything with a unit of measure is a base metric, including capacity (megabytes, gigabytes, terabytes), throughput (megabytes per second) and other such familiar readings. Other useful metrics revolve around the backup system--how many tape cartridges do you buy in a month? How many are offsite or in the scratch pool?
There's another class of metrics to consider: parametrics, which are the constants you use to calculate your key performance indicators (KPIs). For example, the fully burdened cost of a full-time equivalent worker is a parametric, as is the average cost of a host bus adapter or server. Parametrics vary widely from region to region and company to company, so be wary of using industry averages.
So which base metrics and parametrics do you need? That depends on the performance indicators required. Because these are just the inputs, it's more important to consider what outputs are important to you.
Performance indicators
Performance indicators distill subjective meaning from your objective metrics. They still have units, but their concepts are much more complex. And performance indicators usually apply to an entire system rather than to a single technical component.
These indicators help storage managers meet customer expectations. For example, you'd calculate your backup success rate by looking at the number of successful, partial and failed backup jobs. In itself, the success rate isn't very useful, but if you watch this performance indicator, you'll know when a backup system problem is impacting your ability to protect your data.
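As a minimal sketch, the success rate described above could be computed from job counts. The function name and the choice to count a partial job as half a success are illustrative assumptions, not a standard formula:

```python
# Sketch: backup success rate from job counts.
# Counting a partial backup as half a success is an illustrative
# policy choice; adjust the weighting to match your own environment.

def backup_success_rate(successful: int, partial: int, failed: int) -> float:
    """Return the success rate as a percentage of all backup jobs."""
    total = successful + partial + failed
    if total == 0:
        return 100.0  # no jobs ran, so nothing failed
    return 100.0 * (successful + 0.5 * partial) / total

rate = backup_success_rate(successful=940, partial=40, failed=20)
print(f"Backup success rate: {rate:.1f}%")  # 96.0%
```

Tracked over time, a dip in this number flags a backup system problem before users notice a failed restore.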
Mean time to restore is another common performance indicator. It combines metrics related to user requests, staff responsiveness, tape recall and onsite storage, backup system utilization and absolute restore speed. Major variations in this indicator could point to problems anywhere in this chain. By monitoring this one metric, you're protecting the credibility of the entire storage and backup group.
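One way to track this indicator is to average the elapsed time between a restore request and its completion. The timestamps and log format below are made up for illustration; in practice they would come from your ticketing or backup system's logs:

```python
from datetime import datetime

# Sketch: mean time to restore (MTTR) averaged over logged restore jobs.
# Each pair is (time requested, time completed); values are illustrative.
restores = [
    ("2004-06-01 09:00", "2004-06-01 10:30"),
    ("2004-06-03 14:00", "2004-06-03 14:45"),
    ("2004-06-07 08:15", "2004-06-07 11:15"),
]

fmt = "%Y-%m-%d %H:%M"
durations_hours = [
    (datetime.strptime(done, fmt) - datetime.strptime(req, fmt)).total_seconds() / 3600
    for req, done in restores
]
mttr_hours = sum(durations_hours) / len(durations_hours)
print(f"Mean time to restore: {mttr_hours:.2f} hours")  # 1.75 hours
```

A sudden rise in the average points at a problem somewhere in the request-to-restore chain, even before you know which link is at fault.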
Storage utilization is a particularly tricky metric. At the file system level, it's critical but not necessarily useful. Is a file system 99% full because a runaway application is spewing data or because it contains a single, fixed-size element? The same problem occurs when utilization is used across systems as a performance indicator. Fifty percent utilization may be appropriate for a highly flexible development system, or it might hide serious resource-sharing problems.
This illustrates the problem with metrics in general--what's valid for one environment may be useless in another. The indicators you choose to promote depend on your individual business requirements and management style.
Key performance indicators
Mixing up some base metrics and parametrics into performance indicators is one thing, but creating a quality set of key performance indicators is quite another. A KPI is a special class of performance indicator that relates to your business interests. Specifically, KPIs show how well your systems are serving your business.
Although it's difficult to generalize about KPIs, they tend to fit into one of the following four categories:
Cost is of special concern for most businesses. One example of a key cost indicator is the total cost per unit for storage. Although this is sometimes abbreviated as "cost per gigabyte," it should include other total cost of ownership (TCO) variables such as maintenance, operations and environmentals in addition to basic hardware costs. It's likely this cost will decline over time--a rising TCO is usually a sign of trouble.
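A back-of-the-envelope sketch of that calculation follows. Every dollar figure and cost category name here is an illustrative assumption; the point is simply that the numerator holds all TCO components, not just hardware:

```python
# Sketch: total cost per gigabyte including TCO components beyond
# raw hardware. All figures and category names are made-up assumptions.

annual_costs = {
    "hardware (amortized)": 120_000.0,
    "maintenance contracts": 30_000.0,
    "operations staff (allocated)": 80_000.0,
    "environmentals (power, cooling, floor space)": 20_000.0,
}

usable_capacity_gb = 50_000  # 50 TB of usable capacity

tco = sum(annual_costs.values())
cost_per_gb = tco / usable_capacity_gb
print(f"Annual TCO: ${tco:,.0f}")                      # $250,000
print(f"Cost per gigabyte: ${cost_per_gb:.2f}/GB/yr")  # $5.00/GB/yr
```

Run annually, the hardware line should shrink as drive prices fall; if cost per gigabyte rises anyway, one of the other categories is growing faster than capacity.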
Risk can be tricky to calculate. Risk factors include technicalities like single points of failure and RAID levels, and can change based on the value of the data in question. Many people sidestep the risk equation altogether and rely instead on more objective measures like recovery time and recovery point objectives (RTO and RPO, respectively) but these are parametrics, not performance indicators. Compliance with these parametrics is a better indicator, but can also be tricky to measure.
Delivery indicators are inherently outward facing. How long does it take to get new storage online once a request is made? Does someone always answer the phone when a user calls? The answers to these questions can be the key to gauging user confidence in your storage department. Another objective delivery metric is availability--how much out-of-window downtime do users endure?
Efficiency is a meta indicator; it reflects your ability to meet the other indicators listed above. An effective management organization can strike some semblance of balance between cost, risk and delivery, while still achieving some level of efficiency. At the very least, an understanding of the tradeoffs chosen in your environment is valuable information when meeting with users.
How valuable are metrics? Consider this: The Boston Red Sox went 86 years without a World Series title. In 2002, the team became one of the first to embrace sabermetrics, the objective study of baseball statistics. They even hired Bill James, the pioneer of sabermetrics, as a consultant. Though James' input is shrouded in secrecy and hotly debated, there seems little doubt that the team's approach to the game changed in the following years. Of course, the culmination of the Red Sox's incorporation of metrics was their historic title win in October 2004. Can metrics help you win the Series, too? You never know...