Published: 05 Jun 2014
Storage system performance bottlenecks can occur in different places. Here are six areas that can help you pinpoint the sources of your performance bottlenecks and eliminate them.
Finding and fixing storage performance bottlenecks has never been easy. Storage architects constantly probe storage arrays, the network, hosts and hypervisors in an attempt to put their finger on what's bogging down a storage system so they can come up with a remedy for the performance bottleneck. With the advent of high-density server virtualization that enables a single server to support dozens of virtual servers, coupled with flash-based near-zero latency storage, it becomes clear that the storage network is often the bottleneck. Furthermore, storage architects often have to choose between performance and capacity, finding that optimizing one typically negatively impacts the other.
To explore real-world storage performance bottleneck culprits, we talked with three storage architects to understand what specific storage performance issues they faced in their environments. We asked them about the configurations, tools, best practices and techniques they used to help pinpoint the sources of the bottleneck symptoms and remedy them. These are the storage pros who shared their experiences with us:
- Matthew Chesterton, president of Offsite Data Sync, a managed services provider (MSP) in Rochester, N.Y., that offers cloud, disaster recovery (DR) and data retention solutions.
- The IT director of a New York City-based hedge fund who, while sharing his experiences managing more than 800 terabytes in storage capacity, asked not to be identified.
- A storage architect at an MSP in the Midwest that manages just under 400 TB of storage capacity and provides fully managed DR and business continuity (BC) solutions. Citing his company's policies, he also chose to remain anonymous for this article.
It must be noted that higher performance comes at a cost and can lead to higher capital and operating expenses. Achieving even modest performance improvements may require many expensive hard drives, which increases data center space requirements as well as power and cooling costs.
1. It's all about I/O
Storage I/O is by far the biggest storage performance bottleneck. Most storage architects spend the majority of their time chasing an ever-elusive storage I/O moving target. Offsite Data Sync's Chesterton noted that amid a tsunami of data growth, pegged at more than 60% per year, his customers are tasked with backing up more and more data within the same backup window -- a phenomenon referred to as shrinking backup windows. Backing up an ever-increasing amount of data during the same backup window requires faster storage I/O.
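To make the backup-window arithmetic concrete, here is a minimal Python sketch. The 60% annual growth rate is the figure quoted above; the 50 TB starting data set and the 8-hour window are hypothetical values chosen for illustration, and decimal units (1 TB = 1,000,000 MB) are assumed.

```python
def required_throughput_mb_s(data_tb: float, window_hours: float) -> float:
    """Sustained throughput (MB/s) needed to move data_tb within window_hours."""
    megabytes = data_tb * 1_000_000  # decimal units: 1 TB = 1,000,000 MB
    seconds = window_hours * 3600
    return megabytes / seconds


def projected_data_tb(start_tb: float, annual_growth: float, years: int) -> float:
    """Data set size after compounding annual_growth for the given number of years."""
    return start_tb * (1 + annual_growth) ** years


START_TB = 50.0      # hypothetical starting data set
WINDOW_HOURS = 8.0   # hypothetical fixed backup window
GROWTH = 0.60        # growth rate quoted in the article

for year in range(4):
    size = projected_data_tb(START_TB, GROWTH, year)
    rate = required_throughput_mb_s(size, WINDOW_HOURS)
    print(f"year {year}: {size:7.1f} TB needs {rate:8.1f} MB/s")
```

Because the window stays fixed, the required sustained throughput grows at the same rate as the data itself.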
MSPs that provide backup, DR and BC solutions are concerned about backup and restore performance because they have to abide by the service-level agreements (SLAs) they have in place with their enterprise customers. Failing to meet an SLA could result in penalties. In the event of data loss, MSPs quickly create a restore package and, if the data is larger than a few terabytes, the cloud storage service will typically load the restore package onto removable storage media that is sent by overnight courier to the customer.
To circumvent storage I/O issues and process backups from all customers, Offsite Data Sync's storage architects use automated storage tiering to offload older data or general-purpose (non-critical) applications to secondary and tertiary storage tiers. Those lower tiers usually consist of cheaper disk spindles, such as 7,200 rpm SAS drives instead of 15,000 rpm Fibre Channel hard disk drives (HDDs) that are reserved for mission-critical applications and data. This frees the storage I/O in the Tier-1 storage arrays to process critical data or applications that require faster storage I/O. Determining which applications are storage I/O-intensive and require faster Tier-1 storage and those that are less storage I/O-intensive translates into effective storage tiering. Dynamically tiering real-time data to the edge -- automatically moving active data to the edge near users and applications -- results in lower latency. Storage tiering can deliver desired capacity, performance and availability outcomes.
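The tiering decision described above can be sketched as a simple rule. This is an illustration only, not Offsite Data Sync's actual policy: the `Volume` fields, the `HOT_IOPS` and `COLD_DAYS` thresholds and the sample volumes are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Volume:
    name: str
    avg_iops: float          # observed average IOPS over the sampling period
    days_since_access: int   # age of the most recent access
    mission_critical: bool


# Hypothetical thresholds; real values depend on the array and the workload.
HOT_IOPS = 1000
COLD_DAYS = 90


def assign_tier(vol: Volume) -> int:
    """Return 1 (fastest), 2 or 3 (cheapest) for a volume."""
    if vol.mission_critical or vol.avg_iops >= HOT_IOPS:
        return 1  # keep I/O-intensive or critical data on Tier 1
    if vol.days_since_access >= COLD_DAYS:
        return 3  # offload older data to the cheapest tier
    return 2      # general-purpose data


volumes = [
    Volume("oltp-db", 4500, 0, True),
    Volume("file-share", 120, 3, False),
    Volume("2012-archive", 2, 400, False),
]
for vol in volumes:
    print(vol.name, "-> tier", assign_tier(vol))
```

Automated tiering in a real array works on observed access patterns at a much finer granularity; the sketch only shows the shape of the decision.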
Due to relatively low cloud storage costs (currently averaging approximately 2.5 cents per gigabyte per month), public cloud storage providers such as Amazon Web Services are being used by some progressive enterprises and service providers as secondary and tertiary storage tiers.
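As a rough illustration of that price point, the following Python sketch estimates a monthly bill from capacity alone. The 2.5 cents per gigabyte figure is the approximate average quoted above; the 100 TB capacity is hypothetical, decimal units (1 TB = 1,000 GB) are assumed, and any data transfer or request charges a provider may add are left out.

```python
from decimal import Decimal

# Approximate average price quoted in the article, in USD per GB per month.
PRICE_PER_GB_MONTH = Decimal("0.025")


def monthly_cost_usd(capacity_tb: int) -> Decimal:
    """Capacity-only monthly cost; excludes transfer and request charges."""
    gigabytes = capacity_tb * 1000  # decimal units assumed
    return gigabytes * PRICE_PER_GB_MONTH


# Hypothetical secondary tier of 100 TB.
print(f"${monthly_cost_usd(100):,.2f} per month")
```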
According to Chesterton, "an MSP's storage purchase decision criteria mostly rests on which storage array can provide the fastest input/output operations per second."
Examples of I/O-intensive applications include transactional databases, Microsoft Exchange and virtual desktop infrastructure (VDI) among others. At the other end of the performance scale, examples of applications or workloads that are less I/O-intensive include archives/cold storage, Web applications (as long as caching and setup are taken care of), backup and so forth.
Bigger isn't necessarily better
More and faster everything -- nodes, controllers, interfaces, drives, cache and so on -- may seem like the best option, but the real test of a storage system's performance is how well the parts work together. Often, a less complex storage system will end up delivering faster storage I/O and lower latency.
2. Disk latency can stifle performance
Solid-state drives (SSDs) offer faster data access -- as much as 300 times the random I/O performance of HDDs -- and greater data center energy efficiency because of the drive's smaller size and lower energy use. SSDs deliver more input/output operations per second (IOPS) than HDDs because they can perform many more operations per second at much lower latency. SSDs can deliver very fast storage system performance, but pairing them with high-latency or slow array controllers can create serious bottlenecks that plague the entire storage system.
If cost weren't a concern, most storage architects would probably choose an all-SSD storage array for its ultra-high performance that scales linearly and its smaller form factor that fits within physical space constraints. But in the real world, cost is a significant factor and many companies choose "cheap and deep" storage arrays with all HDDs because of the lower capital outlay.
When the New York City-based hedge fund implemented VDI, it found that storage was the biggest bottleneck, particularly during the 8:30 a.m. boot storm when all 140 employees in the test environment tried to log in at around the same time. The company planned for 240 IOPS per desktop for its 140 desktops, but soon realized that even 10,000 IOPS weren't enough to handle the spikes. VDI workloads are characterized by 80% writes and 20% reads, bursty and unpredictable access, and small, highly random I/O streams (real-time mouse and keyboard input). All of these require very high IOPS. The company's IT director immediately began evaluating SSD solutions to eliminate the disk latency issue.
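A back-of-the-envelope sizing sketch in Python shows why a write-heavy VDI workload is so demanding. The desktop count, per-desktop IOPS and 80/20 write/read mix are the figures quoted above. The RAID write penalty is an assumption that does not come from the article: it is the common rule of thumb that each front-end write costs several back-end disk operations (roughly 2 for RAID 10, 4 for RAID 5), and the hedge fund's actual RAID layout is not stated.

```python
def vdi_backend_iops(desktops: int, iops_per_desktop: int,
                     write_percent: int, raid_write_penalty: int) -> dict:
    """Estimate front-end and back-end IOPS for a VDI deployment."""
    frontend = desktops * iops_per_desktop
    writes = frontend * write_percent // 100
    reads = frontend - writes
    # Each front-end write is multiplied by the RAID write penalty on disk.
    backend = reads + writes * raid_write_penalty
    return {"frontend": frontend, "reads": reads,
            "writes": writes, "backend": backend}


# Article figures: 140 desktops, 240 IOPS each, 80% writes.
# raid_write_penalty=2 is an assumed RAID 10 layout, not from the article.
estimate = vdi_backend_iops(desktops=140, iops_per_desktop=240,
                            write_percent=80, raid_write_penalty=2)
for key, value in estimate.items():
    print(f"{key:>8}: {value:,} IOPS")
```

Steady-state averages like these still understate a boot storm, when most desktops issue I/O at the same moment.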
Fast forward to today and the current production VDI environment is built on all-SSD primary storage. In addition to successfully implementing a production environment of nearly 1,400 virtual desktops, the organization gained greater data center energy efficiency because SSDs reduced its overall space and energy requirements. The director noted that "in the long run, SSDs proved to be cheaper than HDDs on a per-IOPS basis."
Interestingly, the Midwestern MSP still uses HDDs to support its desktop as a service (DaaS) offering that provides VDI to hundreds of customers. The company sidesteps performance problems by trading off some advanced VDI features: it doesn't use non-persistent desktops, it uses regular cloning technology instead of linked clones to update images, and it has turned off thin provisioning. However, the company is currently evaluating hybrid SSD/HDD arrays from Nimble Storage to move mission-critical workloads and apps like VDI to SSD-enabled storage to reduce latency.
Many companies use solid-state storage primarily in hybrid implementations where the flash is used for warm or hot data workloads. In addition, flash is used in virtual environments to handle transactional workloads and for other niche applications such as big data analytics, PostgreSQL databases and high-performance computing. However, even the most experienced storage architects will tell you that one never knows which apps will be hot from day to day.
3. Ensure the storage network can handle the traffic
Without a good storage network, it's unlikely that you'll achieve decent storage performance. The port type, port speed and number of network ports can impact storage performance, especially if those ports are sharing paths or the backplane of network switches. Those situations can result in dropped frames and packets, and the ensuing retries or port renegotiations all contribute to degraded storage system performance. Upgrading to a faster network link, interface or port, or changing the path can often improve storage performance.
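One quick sanity check is to compare the bandwidth a workload needs with the nominal capacity of the links that carry it. The Python sketch below does that under simplifying assumptions: it uses nominal line rates, ignores protocol and encoding overhead, and every number in the example (50,000 IOPS, 8 KiB blocks, two 8 Gbit/s links) is hypothetical.

```python
def required_gbit_s(iops: int, block_size_kib: int) -> float:
    """Bandwidth in Gbit/s needed to sustain iops at the given block size."""
    bytes_per_second = iops * block_size_kib * 1024
    return bytes_per_second * 8 / 1e9


def link_utilization(iops: int, block_size_kib: int,
                     link_gbit_s: float, links: int = 1) -> float:
    """Fraction of nominal link capacity consumed (1.0 means saturated)."""
    return required_gbit_s(iops, block_size_kib) / (link_gbit_s * links)


# Hypothetical workload and fabric.
need = required_gbit_s(50_000, 8)
used = link_utilization(50_000, 8, link_gbit_s=8.0, links=2)
print(f"needs {need:.2f} Gbit/s, {used:.0%} of two 8 Gbit/s links")
```

A utilization figure that approaches 1.0 during peak periods suggests the links, rather than the array, are the likely constraint.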
4. Balance workloads among storage tiers
Data can be moved among a variety of storage tiers, such as among different storage systems that may be local, remote or in the cloud. Storage can also be tiered within the same system, using the appropriate storage media for high capacity or high performance. The automated (or manual) data movement balances the workload distribution, resulting in improved performance of the entire storage system.
NAS array vendors are happy to sell you additional disk capacity, but the array controllers often wind up being the storage system performance bottleneck. Traditional NAS arrays don't allow scaling of controller CPU for a given piece of data or workload, so the data can get stuck behind the controller.
Inadequate disk capacity can negatively impact storage performance, especially in virtualized environments. Storage architects constantly struggle to find the right balance between over- and under-subscribing the storage system. Caching can improve storage performance while increasing storage efficiency, which also helps users avoid purchasing additional server and storage hardware.
Six more tips
1. Keep an eye on I/O
2. Reduce the effects of disk latency
3. Make sure the network is up to speed
4. Load balance across tiers
5. Rev up servers
6. Keep a finger on the pulse of the storage environment
5. Server performance can affect storage
Insufficient server compute horsepower and memory can degrade storage system performance. Background processes such as rebuilding disks, disk parity, partitioning of databases and data scrubbing can also hinder storage performance.
Conversely, very fast servers need fast I/O paths, networks and storage systems. And ineffective utilization of storage and server cache can hamper storage system performance. Hence, managing memory and caching mechanisms is an important part of the process of achieving faster storage performance.
6. Diagnose and detect
Meticulously tracking storage performance metrics will give you a leg up on detecting and resolving storage performance bottlenecks. Metrics gathered through rigorous testing will indicate whether you're achieving your storage I/O goals; they'll also help you monitor disk and network latency and variance, and detect component failures. As a best practice, always establish baseline performance indicators for normal and peak workload situations.
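A baseline can be as simple as the mean and standard deviation of latency samples collected during a known-good period. The Python sketch below flags later samples that exceed the baseline by more than a chosen number of standard deviations. The sample values and the three-sigma threshold are hypothetical; in practice you would build one baseline from normal-workload samples and another from peak-workload samples.

```python
from statistics import mean, stdev


def build_baseline(samples: list[float]) -> tuple[float, float]:
    """Return (mean, sample standard deviation) of known-good latency samples."""
    return mean(samples), stdev(samples)


def find_outliers(samples: list[float], baseline_mean: float,
                  baseline_std: float, k: float = 3.0) -> list[tuple[int, float]]:
    """Return (index, value) pairs that exceed mean + k * std."""
    threshold = baseline_mean + k * baseline_std
    return [(i, s) for i, s in enumerate(samples) if s > threshold]


# Hypothetical latency samples in milliseconds.
normal_period = [5.0, 5.2, 4.8, 5.1, 4.9, 5.0, 5.3, 4.7]
today = [5.1, 5.4, 9.8, 5.0, 12.3]

avg, std = build_baseline(normal_period)
for index, value in find_outliers(today, avg, std):
    print(f"sample {index}: {value} ms is above the baseline threshold")
```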
The most consistent message these storage architects conveyed was that one of the best ways to improve storage system performance is to ensure that your storage platform conforms to the best practices prescribed by the vendor. Toward that end, you should set up your storage array, network node and data store according to your vendor's recommendations. It's also a good idea to check your vendor's hardware compatibility list to ensure that any peripheral hardware you use is approved for that storage system.
About the author:
Ashar Baig is president, principal analyst and consultant at Analyst Connection, an analyst firm focused on storage, storage and server virtualization, and data protection, among other IT disciplines.