Published: 01 Sep 2010
Storage performance issues are often not related to the storage system at all, but rather to the storage network that links servers to disk arrays. These 10 tips will help you find and fix the bottlenecks in your storage network infrastructure.
Every so often there's a moment of calm in a data storage manager's life where nothing is broken and there aren't any fires to put out. As rarely as these times might occur, the momentary calm should be taken advantage of rather than savored. This is your opportunity to get some of the kinks out of your storage network so you can eliminate the next emergency before it happens or just be better prepared when it does. We spoke with experts from storage networking vendors -- Brocade, Cisco, Emulex and Virtual Instruments -- to discuss what storage managers should do to prepare their storage networks for the future and to maximize their investments.
The first few tips that follow have more to do with being prepared than actually tinkering with your storage-area network (SAN), but all of our experts agreed that trying to fine-tune a SAN without adequate preparation is like driving down a freeway without headlights. Before you can roll up your sleeves and get under the hood, you have to do some preparation. The rest of our tips go into more detail, describing specific steps (often at no cost) that you can take to improve SAN performance, efficiency and resiliency.
Tip 1. Know what you have
The No. 1 recommendation in fine-tuning your storage network is to first know what you have in the environment. If you have a problem and need to bring in your vendor's tech experts, the first thing they're going to want is an inventory of your networking environment. If you do the inventory ahead of time, you'll likely pay less for any necessary professional services and it may even help you avoid having to engage them in the first place.
It's important to document each host bus adapter (HBA), cable and switch in the environment while noting how they're interconnected. You should also record the speeds they're actually set at, and the versions of the software or drivers they're running. While all of this may seem painfully obvious, an inventory of what the storage network consists of and how it's configured is the type of document that can quickly fall off the priority list during the urgencies of a typical IT workweek. Taking time to level set and understand what's in the environment, and how it has changed, is critical.
Documenting this information may even pinpoint some areas that are ripe for fine-tuning. We've seen cases where over the course of time users have upgraded to 4 Gb Fibre Channel (FC) and, for some reason, their inter-switch links (ISLs) were still set at 1 Gb. A simple change to the switch configurations effectively doubled their performance. If they hadn't taken the time to do an inventory, this obvious mistake may never have come to light.
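As a rough illustration of how even a simple inventory can surface this kind of mismatch, the sketch below compares each port's configured speed with the fastest speed its hardware supports. The record layout, field names and sample switch names are hypothetical, not taken from any vendor tool.

```python
from dataclasses import dataclass


@dataclass
class PortRecord:
    """One row of a hand-kept SAN inventory (all fields are hypothetical)."""
    switch: str
    port: str
    role: str             # "isl" or "host"
    supported_gbps: int   # fastest speed the hardware supports
    configured_gbps: int  # speed the port is actually set to


def find_underclocked(ports):
    """Return the ports configured below the speed their hardware supports."""
    return [p for p in ports if p.configured_gbps < p.supported_gbps]


# Made-up sample data: two ISL ports left at 1 Gb after a 4 Gb upgrade.
inventory = [
    PortRecord("sw-a", "0/15", "isl", 4, 1),
    PortRecord("sw-a", "0/3", "host", 4, 4),
    PortRecord("sw-b", "0/15", "isl", 4, 1),
]

for p in find_underclocked(inventory):
    print(f"{p.switch} port {p.port} ({p.role}): "
          f"set to {p.configured_gbps} Gb, supports {p.supported_gbps} Gb")
```

The same check works on data exported from a spreadsheet, as long as both the supported and the configured speeds are recorded for every port.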
This could be a zero-cost tip because the information can be captured and stored in spreadsheets. While manually keeping track of this information is possible, in today's rapidly changing, dynamic data center it's becoming a less practical approach. Storage environments change fast and IT staffs are typically stretched thin, so manually maintaining an infrastructure inventory isn't realistic. The vendors we spoke to, and many others, have software and hardware tools that can capture this information automatically.
Of course, those tools aren't free or as cheap as a spreadsheet. But if you weigh their cost against the cost of manually capturing the data, or the cost of missing an important change to the network environment, it can be a good investment. Automated storage resource management (SRM) tools also vary in the data they capture and the level at which they capture it. Many simply poll devices and record status data, while others tap the physical layer and analyze network frames.
Tip 2. Know what's going on
After you've developed a good picture of the components in your storage network infrastructure, the next step is to fully understand what those devices are doing at a particular moment in time. Many switch and HBA vendors build some of these capabilities into their products. But instead of going to each device to see its view of traffic conditions, it may be better to find a tool that can provide consolidated real-time feedback on how data is traversing your network. There are software solutions and physical layer access tools that can report on the infrastructure traffic. The tools that can monitor network devices specifically are important because, as all of our experts pointed out, there are situations where operating systems or applications report inaccurate information when compared to what the device is reporting.
These tools can be used for trend analysis and, in some cases, they can predict an upcoming data storage infrastructure problem. For example, if an ISL is seeing a steady increase in traffic (see Tip 6), the ability to trend that traffic growth will help identify how soon an application rebalance or an increase in ISL bandwidth will be required. Other tools will report on CRC or packet errors on ports, which can indicate an upcoming SFP failure.
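The trending idea can be sketched in a few lines: fit a straight line to periodic utilization samples and estimate when it reaches a chosen limit. This is a minimal sketch that assumes roughly linear growth; the function names, the weekly sampling interval, the sample values and the 80% limit are all illustrative assumptions, not figures from any monitoring product.

```python
def fit_line(ys):
    """Least-squares fit of y = slope * x + intercept for x = 0, 1, 2, ..."""
    n = len(ys)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_x = (n - 1) / 2
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in range(n))
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def periods_until(ys, threshold):
    """Sampling periods after the last sample until the trend hits threshold.

    Returns None when the trend is flat or falling, and 0.0 when the
    fitted line has already reached the threshold.
    """
    slope, intercept = fit_line(ys)
    if slope <= 0:
        return None
    last_x = len(ys) - 1
    cross_x = (threshold - intercept) / slope
    return max(0.0, cross_x - last_x)


# Made-up weekly peak utilization of one ISL, in percent.
weekly_peak_util = [40, 45, 50, 55, 60]
print(periods_until(weekly_peak_util, 80))
```

Real traffic is rarely this regular, so an estimate like this is best treated as an early warning to investigate rather than a precise deadline.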
Tip 3. Know what you want to do
With your inventory complete and good visibility into your SAN established, the next step is to figure out what network changes will provide the most benefit to the organization. You may have discovered SAN features that need to be enabled, or perhaps you have new applications or an accelerated rollout of current initiatives that need to be planned. Knowing how activities such as those will impact the rest of the environment and what role the storage infrastructure has to play in those tasks is critical. Generally, the goals come down to increasing reliability or performance, but they may also be to reduce costs.
Tip 4. Limit the impact
When you feel you're at the stage where you're ready to make changes to the environment, the next step is to limit the sphere of impact as much as possible by subdividing the SAN into virtual SANs (VSANs).
With the SAN subdivided, changes to the environment that yield unexpected results, like preventing a server from accessing storage or even causing an outage, will have limited repercussions across the infrastructure, even in a worst-case scenario. Limiting the sphere of impact is by itself an important fine-tuning step that will help create an environment that's more resilient to changes in the future, and can help contain problems. For example, an application may suddenly need an excessive amount of storage resources; subdividing the SAN will help contain it and keep the rest of the infrastructure from being starved. This aspect of fine-tuning shouldn't require any new purchases as it's a setup and configuration process.
Tip 5. Test to learn, learn to test
Although it may seem to be something of a luxury, one key to fine-tuning is to have a permanent testing lab that can be used to try out proposed changes to the environment or to simulate failed conditions. Lab testing lets you explore the alternatives and develop remedies without impacting the production network. In speaking with our experts, and in our own experience, most SAN emergencies result from implementing a new feature in the storage array or on the SAN. If you lack the resources to create a lab environment, an alternative may be to work with your infrastructure vendors, as many have facilities that can be used to recreate problems or to test the implementation of new features.
Storage I/O performance is typically high on a fine-tuning top 10 list, and although it didn't make it into our top five tips, it rounds out the rest of the list. Before performance issues are tackled, it's important that the environment be documented, understood and made as resilient as possible. While slow response time due to lack of performance tuning is a concern, zero response time because of poor planning is a lot worse.
Tip 6. Understand how you're using ISLs
ISLs (interconnects between switches) are critical areas for tuning, and as a storage-area network grows, they become increasingly important to performance. Fine-tuning ISLs is an area where vendors often have conflicting opinions on good rules of thumb for switch fan-in configurations and the number of hops between switches. In reality, the latency between switch connections is dramatically lower than the latency of mechanical hard drives, even negligible by comparison; however, in high fan-in situations or where there are a lot of hops (servers crossing multiple switches to access data), ISLs play an important role.
The top concern is to ensure that ISLs are configured at the correct bandwidth between the switches, which seems to be a surprisingly common mistake as mentioned earlier. Beyond that, it's important to measure the traffic flow between hosts and switches, and the ISL traffic between the switches themselves. Switch reporting tools will provide much of this information but, as indicated earlier, a visual tool that measures switch intercommunication may be preferable.
Based on the traffic measurements, a determination can be made to rebalance traffic flow by adjusting which primary switch the server connects with, which will involve physical rewiring and potential server downtime. Another option is to add ISLs, which increases bandwidth but consumes ports and, to some extent, further adds to the complexity of the storage architecture.
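To see how adding an ISL changes the picture, it can help to put a number on fan-in. The sketch below computes the ratio of total host-facing port bandwidth to total ISL bandwidth on one switch. The function name and the port counts are hypothetical, and because vendors disagree on what ratio is acceptable, no threshold is built in.

```python
def fan_in_ratio(host_port_gbps, isl_gbps):
    """Ratio of total host-facing bandwidth to total ISL bandwidth."""
    total_isl = sum(isl_gbps)
    if total_isl <= 0:
        raise ValueError("switch has no ISL bandwidth")
    return sum(host_port_gbps) / total_isl


# Made-up edge switch: twelve 4 Gb host ports sharing two 4 Gb ISLs.
edge_hosts = [4] * 12
edge_isls = [4, 4]

print(fan_in_ratio(edge_hosts, edge_isls))        # before adding an ISL
print(fan_in_ratio(edge_hosts, edge_isls + [4]))  # after adding a third ISL
```

The ratio reflects only what the ports could carry, so it should be read alongside the measured traffic rather than in place of it.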
Tip 7. Use NPIV for virtual machines
Server virtualization has changed just about everything when configuring SANs and one of the biggest challenges is to identify which virtual machines are demanding the most from the infrastructure. Before server virtualization, a single server had a single application and communicated to the SAN through a single HBA; now virtual hosts may have many servers trying to communicate with the storage infrastructure all through the same HBA. It's critical to be able to identify the virtual machines that need storage I/O performance the most so that they can be balanced across the hosts, instead of consuming all the resources of a single host. N_Port ID Virtualization (NPIV) is a feature supported by some HBAs that lets you assign each individual virtual machine a virtual World Wide Name (WWN) that will stay associated with it, even through virtual machine migrations from host to host. With NPIV, you can use your switches' statistics to identify the most active virtual machines from the point of view of storage and allocate them appropriately across the hosts in the environment.
Tip 8. Know thy HBA queue depth
HBA queue depth is the maximum number of storage I/Os an HBA will keep pending against the data storage infrastructure at one time. When installing an HBA, most storage administrators simply use the default settings for the card, but the default HBA queue depth setting is typically too high. This can cause storage ports to become congested, leading to application performance issues. If queue depth is set too low, the ports and the SAN infrastructure itself aren't used efficiently. When a storage system isn't loaded with enough pending I/Os, it doesn't get the opportunity to use its cache; if essentially everything expires out of cache before it can be accessed, the majority of accesses will then be coming from disk. Most HBAs set the default queue depth between 32 and 256, but the optimal range is actually closer to 2 to 8. Most initiators can report on the number of pending requests in their queues at any given time, which allows you to strike a balance between too much and not enough queue depth.
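One way to use the pending-request counts an initiator reports is to sample them over time and compare them with the configured queue depth. The sketch below is only an illustration: the function name, the sample values and the choice of summary statistics are assumptions, and how the counts are collected depends on the HBA and operating system.

```python
def summarize_queue(samples, configured_depth):
    """Summarize sampled pending-I/O counts against the configured depth."""
    if not samples:
        raise ValueError("no samples")
    full = sum(1 for s in samples if s >= configured_depth)
    return {
        "peak": max(samples),
        "mean": sum(samples) / len(samples),
        # Share of samples in which the queue was at (or above) its limit.
        "fraction_full": full / len(samples),
    }


# Made-up samples of pending requests on one initiator, queue depth set to 8.
samples = [2, 4, 8, 8, 3, 5]
print(summarize_queue(samples, configured_depth=8))
```

A queue that is full in most samples suggests the setting is a limiting factor for that host, while a peak far below the configured depth suggests the setting isn't constraining anything.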
Tip 9. Multipath verification
Multipath verification involves ensuring that I/O traffic has been distributed across redundant paths. In many environments, our experts said they found multipathing isn't working at all or that the load isn't balanced across the available paths. For example, if you have one path carrying 80% of its capacity and the other path only 3%, it can affect availability if an HBA or its connection fails, or it can impact application performance. The goal should be to ensure that traffic is balanced fairly evenly across all available HBA ports and ISLs.
You can use switch reports for multipath verification. To do this, run a report with the port WWNs, the port name and the MBps sorted by the port name combined with a filter for an attached device type equal to "server." This is a quick way to identify which links have balanced multipaths, which ones are currently acting as active/passive and which ones don't have an active redundant HBA.
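The classification described above can be sketched against an exported report. The example assumes the report has already been filtered to server-attached ports and reduced to (server name, port WWN, MBps) rows. The row layout, the sample WWNs, and the `idle_mbps` and `balance_ratio` cutoffs are hypothetical choices for illustration.

```python
from collections import defaultdict


def classify_multipath(rows, idle_mbps=1.0, balance_ratio=0.5):
    """Classify each server's paths from (server, port_wwn, mbps) rows.

    idle_mbps and balance_ratio are illustrative cutoffs, not standards.
    """
    by_server = defaultdict(list)
    for server, _wwn, mbps in rows:
        by_server[server].append(mbps)

    result = {}
    for server, rates in sorted(by_server.items()):
        if len(rates) < 2:
            result[server] = "no redundant path"
        elif max(rates) <= idle_mbps:
            result[server] = "idle"
        elif min(rates) <= idle_mbps:
            result[server] = "active/passive"
        elif min(rates) / max(rates) >= balance_ratio:
            result[server] = "balanced"
        else:
            result[server] = "unbalanced"
    return result


# Made-up report rows: (server name, port WWN, MBps).
report = [
    ("db01", "10:00:00:00:c9:00:00:01", 120.0),
    ("db01", "10:00:00:00:c9:00:00:02", 110.0),
    ("app02", "10:00:00:00:c9:00:00:03", 200.0),
    ("app02", "10:00:00:00:c9:00:00:04", 0.2),
    ("web03", "10:00:00:00:c9:00:00:05", 90.0),
    ("mail04", "10:00:00:00:c9:00:00:06", 160.0),
    ("mail04", "10:00:00:00:c9:00:00:07", 20.0),
]

for server, status in classify_multipath(report).items():
    print(f"{server}: {status}")
```

A single snapshot can mislead, since a deliberately active/passive configuration looks the same as a broken load balancer, so the result is a list of servers to check rather than a verdict.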
Tip 10. Improve replication and backup performance
While some environments have critical concerns over the performance of a database application, almost all of them need to decrease the amount of time it takes to perform backups or replication functions. Both of these processes are challenged by rapidly growing data sets that need to be replicated across relatively narrow bandwidth connections and ever-shrinking backup windows. They're also the most likely processes to put a continuous load across multiple segments within the SAN infrastructure. The backup server is the most likely candidate to receive data that has to hop across switches or zones to get to it.
All of the above tips apply doubly to backup performance. Also consider adding extra HBAs to the backup server and routing their ports to specific switches within the environment to minimize ISL traffic.
BIO: George Crump is president and founder of Storage Switzerland, an IT analyst firm focused on the storage and virtualization segments.