Published: 12 Oct 2007
Despite the attention they get, storage benchmarks can be manipulated to unfairly compare products with vastly different configurations.
Although storage performance is one of many considerations when selecting a storage system, performance benchmarking results get the most headlines. IBM Corp.'s July news release that touted the record-breaking Storage Performance Council (SPC) result for its System Storage SAN Volume Controller (SVC) 4.2 is a prime example of how companies play up their benchmarking news.
It's no secret that storage vendors are eager to cite performance improvements of their latest arrays, often without any reference to the configuration, under what conditions the performance boost can be expected or how the testing was conducted. For example, EMC claimed earlier this year that "The new EMC Symmetrix DMX-4 series will improve performance by up to 30%," but failed to say under what conditions and in what configuration it tested the DMX-4. If performance benchmarking is mostly a marketing tool for storage vendors to pump up their products, are benchmarking numbers of any value to users?
The benchmarking challenge
At first glance, measuring the performance of a storage system doesn't appear to be too difficult a task. But the benchmarking process can be easily manipulated because of the large number of variables that influence performance results. With everything else unchanged, performance greatly depends on the nature of IO requests storage systems have to cope with. Data is either written to or read from a storage system, can be accessed randomly or sequentially, and the size of the blocks and files transacted can vary from a few bytes to megabytes.
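To make those variables concrete, here is a minimal sketch of how a synthetic workload could be described and turned into a stream of IO requests. It is illustrative only: the `IOProfile` class, the `generate_requests` function and the two sample profiles are hypothetical names and values, not part of any benchmark discussed in this article.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class IOProfile:
    """Hypothetical description of a synthetic workload."""
    read_fraction: float    # 0.0 = all writes, 1.0 = all reads
    random_fraction: float  # 0.0 = purely sequential, 1.0 = purely random
    block_size: int         # bytes per request


def generate_requests(profile, count, capacity_bytes, seed=0):
    """Return a list of (operation, offset, size) tuples for the profile."""
    rng = random.Random(seed)  # seeded so a run can be repeated
    blocks = capacity_bytes // profile.block_size
    next_block = 0
    requests = []
    for _ in range(count):
        op = "read" if rng.random() < profile.read_fraction else "write"
        if rng.random() < profile.random_fraction:
            block = rng.randrange(blocks)  # jump to a random block
        else:
            block = next_block             # continue where the last IO ended
        next_block = (block + 1) % blocks
        requests.append((op, block * profile.block_size, profile.block_size))
    return requests


# Two assumed profiles: small random mixed IO vs. large sequential reads.
oltp_like = IOProfile(read_fraction=0.7, random_fraction=1.0, block_size=8 * 1024)
streaming_like = IOProfile(read_fraction=1.0, random_fraction=0.0, block_size=1024 * 1024)

stream = generate_requests(streaming_like, 4, 16 * 1024 * 1024)
mixed = generate_requests(oltp_like, 100, 16 * 1024 * 1024)
```

Changing just one of the three fields produces a very different request stream, which is why a single published number says little without the workload behind it.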
To make matters worse, the IO profiles of real-world applications vary widely. Benchmarks will always have a limited number of workloads and IO requests, which might not be completely representative of a specific application. Testing the raw performance of storage systems irrespective of apps is a valid way of looking at storage performance as long as it's understood that the hardware performance may not be representative of an application's performance.
In addition, storage systems can be configured in numerous ways, including the number of spindles, cache size, fault tolerance configurations, data protection schemes, compression and deduplication (to name a few), all of which affect performance. Obviously, the configuration used to benchmark an array is unlikely to match your specific configuration.
The ultimate benchmarking challenge is the comparison of disparate storage systems from different vendors, which multiplies all of the problems mentioned earlier. Not only is there an astronomical number of permutations when diverse technologies, components and configurations are taken into account, but storage vendors must also be willing to use standardized benchmarks that allow for an apples-to-apples comparison.
Types of benchmarks
Contemporary storage benchmarks can be divided into two groups: industry-standard and do-it-yourself.
Industry-standard benchmarks: The primary storage benchmarks in this group are the SPC's SPC-1 and SPC-2 and Standard Performance Evaluation Corp.'s (SPEC) SPEC SFS. SPC and SPEC are nonprofit organizations with an agenda to standardize performance benchmarking and provide a vendor-agnostic way to compare storage systems. A committee or council consisting of organization members oversees, regulates and audits all benchmarking activities.
Among the key characteristics of SPC and SPEC benchmarks are mandatory peer reviews of all benchmark results, meticulous documentation of the tested configurations, and making test results and test details available for public consumption. To be representative of real-world applications, the workloads they use are derived from common real-world applications, and the SPC and SPEC pride themselves on gauging performance in a way relevant to enterprise computing. Unlike SPEC, SPC publishes the cost of tested configurations with the test results, providing a cost/performance metric. "You really have to look at both performance and cost," explains Greg Schulz, founder and senior analyst at StorageIO Group, Stillwater, MN. "It required a $3.2 million configuration for the IBM SAN Volume Controller 4.2 to get over 270,000 SPC-1 IOPS, or over $12 per SPC-1 IOPS, a relatively unfavorable cost per IO ratio if compared to other SPC-1 benchmark results," he notes.
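The cost/performance metric is simple division, sketched below with the rounded figures from Schulz's quote. The function name `price_performance` is hypothetical. Because the inputs are rounded ("$3.2 million" and "over 270,000"), the result only approximates the quoted ratio; the exact price and IOPS are in the published SPC-1 report.

```python
def price_performance(total_price_usd, spc1_iops):
    """Dollars per SPC-1 IOPS: total system price divided by reported IOPS."""
    if spc1_iops <= 0:
        raise ValueError("IOPS must be positive")
    return total_price_usd / spc1_iops


# Rounded inputs taken from the quote above, not from the full disclosure report.
ratio = price_performance(3_200_000, 270_000)
print(f"Roughly ${ratio:.2f} per SPC-1 IOPS")
```

The same calculation applied to two published results gives a quick way to see whether a higher IOPS number was simply bought with a larger configuration.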
Without question, industry-standard benchmarks are a great way to get objective, authoritative benchmark results of storage systems from different vendors, but they're not without challenges. Most importantly, their success and effectiveness hinge on storage vendor participation. SPEC has more than 60 members, including the majority of NAS vendors. SPC has more than 25 members, including all major storage vendors except EMC. Unfortunately, being a member doesn't necessarily mean participation. For instance, Hitachi Data Systems (HDS) Corp. and Network Appliance (NetApp) Inc. are SPC members, but have never published an SPC benchmark. The SPEC SFS is more established than the SPC benchmarks (almost 10 years older), and there's a higher level of expectation for vendors to participate in it than in SPC benchmarks. "We don't see customers ask for SPC or SPEC numbers," says Steve Daniel, NetApp's director of database platform and performance technology. "One of the primary reasons we continue to publish SPEC SFS benchmarks is for historical reasons; we can't just stop without raising questions."
Moreover, storage vendors are very selective regarding what products they benchmark using SPC. Participation in the SPEC SFS benchmark is significantly higher and includes most NAS vendors, in addition to EMC, HDS and NetApp. One of the reasons for the limited participation is cost. "Both SPC and SPEC benchmarks are quite expensive, and [it's] only if we see a clear marketing benefit or value for our end users [that] we participate," says NetApp's Daniel.
Do-it-yourself storage benchmarks: The primary benchmarking tools in this category are Iometer, IOzone and NetBench. Unlike industry-standard benchmarks, these tools aren't governed by a standards body, and there are no rules attached to how tests are conducted and published. "With do-it-yourself tools, you can compare two configurations without having to trust the published results from vendors' tests," says Brian Garrett, technical director of the ESG Lab at Enterprise Strategy Group (ESG) in Milford, MA.
Do-it-yourself benchmarking tools are some of the primary tools storage vendors and end users use to gauge performance. Contrary to industry-standard benchmarks, their workloads are usually highly configurable, which enables measuring very specific (and a wide range of) IO patterns. "Tests can be conducted at the engineering level to specifically characterize and compare two storage subsystems or configurations, or to approximate applications at a rudimentary level," explains Garrett.
How cache affects benchmarking
Storage system performance depends on several key components: the number and type of storage controllers, the number of disk drives, the RAID level and how drives are striped, the number of front-end and back-end ports, available bandwidth, and the size of the available cache and cache options.
SPC-1: Determines the number of IOPS a storage system supports from an enterprise app perspective. Its workload is representative of transaction-processing systems like databases, enterprise resource planning systems and even mail servers. "We took traces provided by Sun [Microsystems Inc.], HP [Hewlett-Packard] and IBM, which were extracted from actual customer apps, and used them to develop SPC-1," says SPC administrator Walter E. Baker.
SPC-1 benchmarks report on the maximum number of IOPS (SPC-1 IOPS), a price-performance ratio expressed in $/SPC-1 IOPS, the total storage capacity utilized during testing, the data protection level of the tested system, as well as the total price of the system. Additionally, each report includes a diagram that depicts the response time in relation to the number of IO requests per second. Like all SPC benchmarks, SPC-1 comes with a full disclosure report about configuration and test conditions, and sufficient detail to enable a third party to reproduce the configuration and benchmark results.
SPC-2: Encouraged by the success of SPC-1, the SPC released SPC-2 in early 2006. Unlike SPC-1, SPC-2 measures storage performance from a throughput perspective, relevant to apps like video streaming, scientific computing and data mining. More specifically, SPC-2 consists of three distinct workloads--one for large file processing, one for large database queries and another for video on demand--designed to demonstrate the performance of storage subsystems that require the large-scale, sequential movement of data. Those applications are characterized predominantly by large IOs organized into one or more concurrent sequential patterns.
SPC-2 reports an overall benchmark result that aggregates the test results of the three workloads, and gives performance numbers for each of them. The reported data includes throughput expressed in megabytes per second (SPC-2 MB/sec), a price-performance ratio in $/MB/sec, the total storage capacity utilized during testing, the data protection level of the tested system and the total price of the system. According to Baker, SPC-2 will be available for purchase to non-SPC members later in the year. Unlike SPC-1, SPC-2 will be available in locked and nonlocked versions. The nonlocked version will enable customers to change workload parameters.
SPC-1 and SPC-2 are tailored toward objective and verifiable performance measurement of large and complex storage configurations. Their relatively high benchmarking cost, combined with configuration requirements, makes them ill-suited for testing storage components and smaller storage subsystems.
For those reasons, approximately three years ago SPC began modifying SPC-1 and SPC-2 to create derivatives, namely SPC-1C ("C" stands for components) and SPC-2C, which are a better fit for storage component benchmarking. They use the same workloads as SPC-1 and SPC-2, and their specifications and reporting requirements will be similar to those of their large storage benchmark equivalents. They'll be used to benchmark disk drives, host bus adapters/controllers, single-enclosure storage subsystems and storage software such as logical volume managers.
To add credibility and authenticity, all SPC-1C and SPC-2C benchmarks will be performed by an SPC-certified testing lab to which storage vendors will have to submit components for testing. SPC is also working on a file-system benchmark named SPC-3, which is similar to SPEC SFS. Unlike the current version of SPEC SFS, SPC-3 will be file-system agnostic. SPC-1C, SPC-2C and SPC-3 are all works in progress with no release dates announced at this time.
SPEC SFS is an industry-standard benchmark used to measure NFS file-system performance of network file servers and NAS. All tests are performed by storage vendors and test results are submitted to SPEC for peer review. After passing a SPEC audit, test results are published on the SPEC Web site.
Unlike SPC benchmarks, SPEC doesn't include cost/performance metrics and, as a result, benchmark results need to be viewed with caution. "When comparing SPEC SFS results, you have to be careful to compare like-sized and like-priced systems," cautions ESG's Garrett. "SPEC.org lists results that range from 10,000 SPEC ops/sec to 300,000 SPEC ops/sec without information about the system price." Therefore, it's crucial to look at the benchmark results in the context of the system configuration. File-system performance is highly dependent on the number of disks, disk controllers, memory (cache) and network controllers; as you add more of these, performance will rise, as will the system's price tag.
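One rough way to put results of very different sizes on a comparable footing is to normalize throughput by a configuration figure such as disk count, as sketched below. The systems, ops/sec values and disk counts are invented for illustration and do not correspond to any published SPEC SFS result; `SfsResult` and `ops_per_disk` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class SfsResult:
    """Hypothetical record of one published result and its configuration."""
    system: str
    ops_per_sec: float
    disks: int


def ops_per_disk(result):
    """Throughput normalized by the number of disks in the tested system."""
    return result.ops_per_sec / result.disks


# Invented example data, not taken from SPEC.org.
results = [
    SfsResult("System A (hypothetical)", 10_000, 28),
    SfsResult("System B (hypothetical)", 300_000, 2_000),
]

ranked = sorted(results, key=ops_per_disk, reverse=True)
for r in ranked:
    print(f"{r.system}: {r.ops_per_sec:,.0f} ops/sec, {ops_per_disk(r):.0f} ops/sec per disk")
```

In this made-up case the smaller system delivers more throughput per disk even though its headline number is 30 times lower. Disk count is only one possible denominator; controllers, cache or price could be used the same way when they are known.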
SPEC SFS benchmarks report two key metrics: throughput in ops/sec and response time in milliseconds. Each report includes a graph that illustrates the response time of the tested system in relation to throughput. The most relevant numbers are maximum throughput and average response time. Both are listed at the beginning of the published reports and they're typically the numbers cited and referred to by vendors and the press. Each test report also describes the tested configuration in great detail.
A strong point of SPEC SFS is that it doesn't depend on the NFS clients used during the test, so it measures the real performance of the server. On the downside, SPEC SFS is currently limited to NFS, a shortcoming that will be addressed in a future version of SPEC SFS. "It is very likely that the next version of SPEC SFS will support CIFS," says Don Capps, chair of the SPEC SFS subcommittee.
Do-it-yourself storage benchmarks
IOMETER: Iometer is an IO workload generator and measurement tool that can be used to measure the performance of block-based and file-based storage systems. It was originally developed by Intel Corp. and has since become an open-source project. It can simulate and benchmark both real-world and synthetic workloads, which are set up through a user-friendly GUI that lets the user select one or more IO patterns from a predefined list; the patterns vary in data size, IO operations and type of access.
Iometer test results report on total IOs per second, total megabytes per second, average IO response time in milliseconds, maximum IO response time in milliseconds and total CPU utilization. The fact that Iometer is an active open-source project, combined with the tool's versatility, makes it one of the "most popular do-it-yourself benchmarking tools," says ESG's Garrett.
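The relationship between those reported metrics can be sketched from a list of per-IO measurements. This is not Iometer's code or output format; the `summarize` function, the `(bytes, latency)` sample layout and the sample values are assumptions made for illustration, and CPU utilization is left out.

```python
def summarize(samples, duration_s):
    """Summarize (bytes_transferred, response_ms) samples collected over duration_s seconds."""
    if not samples or duration_s <= 0:
        raise ValueError("need at least one sample and a positive duration")
    total_bytes = sum(size for size, _ in samples)
    latencies = [latency for _, latency in samples]
    return {
        "iops": len(samples) / duration_s,
        "mb_per_sec": total_bytes / duration_s / 1_000_000,  # decimal megabytes
        "avg_response_ms": sum(latencies) / len(latencies),
        "max_response_ms": max(latencies),
    }


# Invented samples: three 4 KB IOs and one 8 KB IO completed within 2 seconds.
samples = [(4096, 2.0), (4096, 4.0), (8192, 6.0), (4096, 8.0)]
summary = summarize(samples, duration_s=2.0)
print(summary)
```

Note that IOPS and MB/sec move independently: the same IOPS figure means a very different bandwidth at 4 KB than at 1 MB per request, so both need to be read together with the block size.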
IOZONE: IOzone is a free file-system benchmarking tool that measures and analyzes IO performance of various storage IO operations at variable block sizes. Workloads can be generated synthetically by explicitly defining test parameters like file sizes, but real-world application workloads are also supported by importing IO traces captured with OS-specific tools like strace (Linux) or FileMon (Windows).
IOzone reports on latency and throughput in ops/sec in relation to the number of client processes for the tested IO operations. By default, test results are presented in tabular form, but can be rendered as three-dimensional graphs that aggregate performance characteristics for measured data points. IOzone has a "techie" feel, from the way workloads are generated to how test results are presented.
NETBENCH: A freely downloadable CIFS file-system performance benchmarking tool developed by Ziff Davis. It uses a large number of physical test clients to generate a file IO-based workload using the CIFS protocol against a server under test. While testing, the clients record the amount of data moved to a file server or NAS, as well as the average response time for the file server as it responds to the various file IO requests made by the clients. NetBench test results report on the overall IO throughput expressed in megabytes per second and the average response time in milliseconds.
There are two things you need to consider when using NetBench. First, it requires a relatively large test bed. "To get valid performance results, you need at least 60 CIFS clients," says ESG's Garrett. Second, test results are dependent on client performance. Nevertheless, NetBench has evolved into the de facto standard for CIFS performance benchmarking.
Industry-standard benchmarks have two main purposes today. To a limited degree, they give users one more perspective in making a purchasing decision. By the same token, they give vendors an additional card to play when trying to close a deal.
Who's responsible for the sorry state of storage benchmarks? The fault lies with the SPC and SPEC for allowing vendors to submit any conceivable configuration and for publishing all results side by side irrespective of the class of storage and product type. To remedy the situation, test results need to be categorized by product type, configuration standards need to be defined for each category and vendors must strictly adhere to the configurations. Instead of vendors dictating the configuration they wish to show off, the SPC and SPEC need to set rules on permissible configurations. If that happens, we'll have a foundation for an apples-to-apples comparison and it'll be more difficult for vendors to harness standard benchmarks to their advantage by showing only what's favorable to them. Categorization and standards will at least lay a solid foundation, but success will still hinge on vendor participation.