There’s been a lot of talk lately about how data deduplication is moving from backup to primary storage. Dedupe’s great for trimming primary data stores, but there are other technologies that can do the job.
Standard in many backup and archival products, data reduction is now becoming more prevalent for primary storage. The main drivers are measurable cost savings, which range from buying fewer disks and paying lower annual support fees to lowering operational expenses related to storage management. Data reduction may also have a pleasant impact on data storage performance: when inactive data no longer occupies valuable high-performance storage, overall storage and application performance may get a welcome boost.
In a typical enterprise, according to Storage Networking Industry Association (SNIA) research, 80% of files stored on primary storage haven’t been accessed in the last 30 days; the same report asserts that inactive data grows at more than four times the rate of active data. With these facts in mind, it’s no surprise that data reduction techniques have been making their way into primary storage.
But in contrast to data reduction methods for backup and archiving, primary storage systems can’t tolerate even a little impact on performance and reliability, the two most relevant attributes of primary storage systems. As a result, data reduction techniques vary and have different relevance on primary storage than they do in storage used for backup and archival. On backup and archival systems, deduplication and compression are the primary data reduction methods, but for primary storage those techniques are clearly second to more subtle and proven approaches that don’t hinder performance as dedupe and compression can. These are the main data reduction techniques that are being applied on primary storage systems:
- Choosing the right RAID level
- Thin provisioning
- Efficient clones
- Automated storage tiering
Choosing the right RAID level
Putting “choosing the appropriate RAID level” at the top of a list of data reduction techniques may seem strange at first, but unlike other data reduction approaches, it’s the only option available on all storage systems, and it greatly affects disk requirements, performance and reliability. Were it not for its lack of redundancy, RAID 0 (block-level striping across all disks without parity or mirroring) would be the most cost-efficient and best-performing option, but losing the whole RAID group with the loss of a single drive makes it a no-go in the data center. RAID 1 (mirroring without parity or striping) and RAID 10 (mirrored drives in a striped set), on the other hand, combine good performance and high reliability but require twice the disk capacity and are therefore the antithesis of data reduction. RAID 5 (block-level striping with distributed parity), which needs only a single additional drive, has been the best compromise in recent years, but as disks have increased in size and rebuild times have grown longer, the risk of losing a second drive while the RAID group is rebuilt after a drive failure has increased to an uncomfortable if not unacceptable level. As a result, storage vendors have been implementing RAID 6, which extends RAID 5 by adding a second parity block and an additional drive, enabling it to withstand two concurrent drive failures without data loss -- but it comes with a varying performance penalty, depending on implementation. RAID 6 and a RAID 6 performance benchmark should be on anyone’s evaluation list when shopping for a new storage system.
“Unlike most of our competitors, we can do RAID-DP [NetApp’s implementation of RAID 6] with only 5% overhead,” claimed Larry Freeman, senior storage technologist at NetApp.
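The capacity trade-off between the RAID levels described above can be made concrete with a little arithmetic. The following is a minimal sketch, not any vendor's sizing tool: the function name `usable_fraction` is hypothetical, and it assumes equal-size drives, two-way mirroring and a single RAID group.

```python
def usable_fraction(level: str, drives: int) -> float:
    """Fraction of raw capacity left for data in one RAID group.

    Assumptions (not from the article): equal-size drives, two-way
    mirrors for RAID 1/10, and a single RAID group.
    """
    minimum = {"0": 2, "1": 2, "10": 4, "5": 3, "6": 4}
    if level not in minimum:
        raise ValueError(f"unsupported RAID level: {level}")
    if drives < minimum[level]:
        raise ValueError(f"RAID {level} needs at least {minimum[level]} drives")
    if level == "0":
        return 1.0  # striping only, nothing spent on redundancy
    if level in ("1", "10"):
        if drives % 2:
            raise ValueError("mirrored layouts need an even number of drives")
        return 0.5  # every block is stored twice
    parity_drives = 1 if level == "5" else 2  # RAID 5: one, RAID 6: two
    return (drives - parity_drives) / drives


if __name__ == "__main__":
    for level in ("0", "10", "5", "6"):
        share = usable_fraction(level, 8)
        print(f"RAID {level}: {share:.1%} of 8 drives usable")
```

In this model, moving an eight-drive group from RAID 5 to RAID 6 costs one more drive's worth of capacity, while RAID 10 gives up half of the raw capacity regardless of group size.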
Thin provisioning
Until recently, there wasn’t a real alternative to overprovisioning allocated storage and, as a result, storage utilization has been dismal. It’s not unusual for companies to have hundreds of gigabytes of overprovisioned and unused storage in their data centers. “Before we had Compellent arrays and thin provisioning, we relied on users helping us estimate storage requirements and we added 20% to 100% to user estimates, depending on what application it was for,” said Brandon Jackson, CIO of Gaston County, NC, describing the unscientific and wasteful process used by many organizations to ensure sufficient storage capacity.
Thin provisioning technologies can help put an end to this profligate management of storage resources by allowing storage to be assigned to users and servers beyond actual available physical capacity. Storage is allocated to thin-provisioned volumes on an as-needed basis. For instance, thin provisioning enables allocation of a 100 GB volume even though it may only have 10 GB of physical storage assigned. Thin provisioning is transparent to users, who will see a full 100 GB volume. Thin provisioning can yield tremendous cost savings and enable storage utilization beyond 90%.
The number of vendors that support thin provisioning is growing quickly, and it should be one of the key criteria when selecting a storage system. Keep in mind, though, that not all thin provisioning implementations are equal. While some systems require setting aside areas that can be thin provisioned, in others all capacity is available for thin provisioning without the need for special reservation. The ability to convert regular “thick” volumes into “thin” volumes, how unused storage is recovered and the way thin provisioning is licensed are other areas of differentiation. With more storage provisioned than physically present, running out of physical storage is an ever-present risk in thinly provisioned environments. Therefore, alerts, notifications and storage analytics are essential features that play an even greater role in thinly provisioned environments than they do in traditionally provisioned storage.
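The allocate-on-write behavior and the need for capacity alerts can be illustrated with a small model. This is a sketch under stated assumptions rather than any array's actual logic: the classes `ThinPool` and `ThinVolume`, the gigabyte granularity and the 80% alert threshold are all hypothetical.

```python
class ThinPool:
    """Shared physical capacity behind thin-provisioned volumes (hypothetical model)."""

    def __init__(self, physical_gb, alert_threshold=0.8):
        self.physical_gb = physical_gb
        self.alert_threshold = alert_threshold  # assumed value, not from the article
        self.provisioned_gb = 0  # capacity promised to volumes
        self.used_gb = 0         # capacity actually written

    def create_volume(self, size_gb):
        # Creating a volume reserves nothing physical; it only records the promise.
        self.provisioned_gb += size_gb
        return ThinVolume(self, size_gb)

    def allocate(self, gb):
        if self.used_gb + gb > self.physical_gb:
            raise RuntimeError("pool is out of physical capacity")
        self.used_gb += gb

    @property
    def utilization(self):
        return self.used_gb / self.physical_gb

    @property
    def overcommit_ratio(self):
        return self.provisioned_gb / self.physical_gb

    def needs_alert(self):
        return self.utilization >= self.alert_threshold


class ThinVolume:
    def __init__(self, pool, size_gb):
        self.pool = pool
        self.size_gb = size_gb   # what the user sees
        self.allocated_gb = 0    # what is physically backed

    def write(self, gb):
        """Write gb of new data, drawing physical space from the pool on demand."""
        if self.allocated_gb + gb > self.size_gb:
            raise ValueError("write exceeds the volume's provisioned size")
        self.pool.allocate(gb)
        self.allocated_gb += gb
```

With a 100 GB pool and two 100 GB volumes, the pool is overcommitted 2:1; the users each see a full 100 GB, but writes start to fail once the combined data exceeds 100 GB, which is why the alert check matters more here than with thick volumes.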
Efficient clones
Cloning is used to create an identical copy of an existing volume, and it has become more relevant with server virtualization where it’s frequently used to clone virtualized OS volumes. The most basic and still predominant implementation of a clone is creating a full copy of the source volume, with the cloned volume allocating the same amount of physical storage as the source volume.
Figure: Side by side: primary storage reduction technologies
The next level up is the ability to clone thinly provisioned volumes. While some storage systems turn thinly provisioned volumes into thick volumes during cloning, others can create a copy of a thinly provisioned volume where the thinly provisioned source volume and cloned volume allocate the same amount of physical storage. “In our Virtual Storage Platform [VSP], we’re able to create a thin-provisioned clone from another thin-provisioned volume,” said Mike Nalls, senior product marketing manager at Hitachi Data Systems’ enterprise platform division.
The most efficient clones are thin clones, where a cloned volume holds no data at all, but instead references blocks on the source image. Thin clones only have to store differences between the original image and the cloned image, resulting in huge disk space savings. In other words, a fresh clone requires minimal physical disk space and only as clones change do differences from the original image need to be stored. NetApp’s FlexClone and the cloning feature in the Oracle ZFS Storage Appliance (Sun ZFS Storage 7000 series) are examples of storage systems that support thin clones today.
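The idea of a clone that references its source and stores only differences can be sketched in a few lines. This is an illustrative model, not how FlexClone or ZFS is implemented: the `Volume` class is hypothetical, blocks are tracked in a dictionary, and the source is assumed to become read-only once a clone depends on it (comparable to cloning from a snapshot).

```python
class Volume:
    """A volume that physically stores only the blocks written to it directly."""

    def __init__(self, blocks=None, parent=None):
        self._blocks = dict(blocks or {})  # block number -> data held by this volume
        self._parent = parent              # source image of a thin clone, if any
        self._frozen = False               # set once a clone references this volume

    def clone(self):
        """Create a thin clone: nothing is copied, only a reference is kept."""
        self._frozen = True
        return Volume(parent=self)

    def read(self, block_no):
        if block_no in self._blocks:
            return self._blocks[block_no]       # block changed in this volume
        if self._parent is not None:
            return self._parent.read(block_no)  # fall through to the source
        return None                             # never written

    def write(self, block_no, data):
        if self._frozen:
            raise RuntimeError("source image is read-only while clones reference it")
        self._blocks[block_no] = data           # only the difference is stored

    def blocks_stored(self):
        return len(self._blocks)
```

A fresh clone in this model stores zero blocks and reads everything from the source; after one block is rewritten, the clone stores exactly one block while the source remains unchanged.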
Automated storage tiering
Automated storage tiering is another mechanism for reducing data on primary storage. An array’s ability to keep active data on fast, expensive storage and to move inactive data to less-expensive, slower tiers allows you to limit the amount of expensive tier-1 storage. The importance of automated storage tiering has increased with the adoption of solid-state storage in contemporary arrays and with the advent of cloud storage to supplement on-premises storage. Automated storage tiering enables users to keep data on appropriate storage tiers, thereby reducing the amount of premium storage needed and enabling substantial cost savings and performance improvements.
There are a couple of key features to look for in automated storage tiering:
- The more granular the data that can be moved from one tier to another, the more efficiently expensive premium storage can be used. Sub-volume-level tiering where blocks of data can be relocated rather than complete volumes, and byte-level rather than file-level tiering, are preferable.
- The inner workings of the rules that govern data movement between tiers will determine the effort required to put automated tiering in place. Some systems, like EMC’s Fully Automated Storage Tiering (FAST), depend on policies that define when to move data and what tiers to move it to. Conversely, NetApp and Oracle (in the Sun ZFS Storage 7000 series) advocate that the storage system should be smart enough to automatically keep data on the appropriate tier without requiring user-defined policies.
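A policy-driven approach to sub-volume tiering can be sketched as follows. This is a hypothetical illustration, not EMC FAST's policy language or any vendor's algorithm: the `Extent` type, the tier names and the 30-day and 1-day thresholds are assumptions chosen for the example.

```python
from dataclasses import dataclass


@dataclass
class Extent:
    """A sub-volume chunk of data that can be relocated on its own."""
    extent_id: int
    tier: str               # "ssd" (premium) or "sata" (capacity), assumed names
    days_since_access: int


def apply_policy(extents, demote_after_days=30, promote_within_days=1):
    """Move idle extents down and recently used extents up.

    Returns the list of moves as (extent_id, from_tier, to_tier).
    """
    moves = []
    for extent in extents:
        if extent.tier == "ssd" and extent.days_since_access >= demote_after_days:
            moves.append((extent.extent_id, "ssd", "sata"))
            extent.tier = "sata"
        elif extent.tier == "sata" and extent.days_since_access <= promote_within_days:
            moves.append((extent.extent_id, "sata", "ssd"))
            extent.tier = "ssd"
    return moves
```

Because the policy works on extents rather than whole volumes, a mostly idle volume with a small hot region keeps only that region on premium storage. Systems that avoid user-defined policies would derive such thresholds themselves instead of taking them as parameters.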
Data deduplication
Well established in the backup and archival space, data deduplication is gradually finding its way into primary storage. The main challenge that has slowed adoption of deduplication in primary storage is performance. “Dedupe and performance simply don’t get along,” said Greg Schulz, founder and senior analyst at StorageIO Group, Stillwater, Minn. Nonetheless, deduplication has found its way into a few storage systems and it’s simply a matter of time before others will follow.
NetApp offers a deduplication option for all its systems, and it can be activated on a per-volume basis. NetApp’s deduplication isn’t executed in real-time though. Instead, it’s performed using a scheduled process, generally during off hours, that scans for duplicate 4 KB blocks and replaces them with a reference to the unique block. Instead of generating a unique hash for each 4 KB block, NetApp uses the block’s existing checksum to identify candidate duplicates. To guard against hash collisions, which happen when non-identical blocks share the same checksum (hash), NetApp does a block-level comparison of the data in the blocks and only deduplicates those that match. As far as performance is concerned, “we can deduplicate an average 1 TB of data per hour,” NetApp’s Freeman said. NetApp’s deduplication is currently performed within individual volumes or LUNs and doesn’t span across them.
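The checksum-then-compare approach described above can be sketched in general terms. This is not NetApp's code: the function `deduplicate` is hypothetical, and `zlib.adler32` is only a stand-in for whatever block checksum an array already maintains.

```python
import zlib

BLOCK_SIZE = 4096  # 4 KB blocks, as in the article


def deduplicate(blocks, checksum=zlib.adler32):
    """Post-process deduplication over a list of equal-size byte blocks.

    Returns (store, refs): store holds each unique block once, and
    refs[i] is the index in store that logical block i points to.
    """
    store = []
    candidates = {}  # checksum -> indexes of stored blocks with that checksum
    refs = []
    for block in blocks:
        key = checksum(block)
        for index in candidates.get(key, []):
            # A matching checksum is only a hint; compare the actual bytes
            # so a collision never merges two different blocks.
            if store[index] == block:
                refs.append(index)
                break
        else:
            store.append(block)
            candidates.setdefault(key, []).append(len(store) - 1)
            refs.append(len(store) - 1)
    return store, refs
```

Passing a deliberately weak checksum, such as one that returns the same value for every block, forces collisions on every lookup and shows that the byte-level comparison alone keeps the result correct; the checksum only reduces how many comparisons are needed.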
Archiving: Quick data reduction on primary storage
The simplest method of regaining valuable space on primary storage is through archiving. Companies, like individuals, have a tendency to keep too much stuff. Businesses keep reams of data on primary storage for the unlikely event it might be needed one day. Archiving can be as simple as relocating data to archival storage and restoring it to primary storage when needed -- at zero cost. Those who want to automate the process of moving data into archival storage and restoring it to primary storage can use products like Symantec Corp.’s Enterprise Vault or Waterford Technologies’ archival products, which can leave “stubs” (references) to archived data on primary storage that conceal the location of files from users. When a stubbed file is accessed, the archival product automatically pulls the referenced data back into primary storage, fully transparent to users.
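The stub mechanism can be modeled with two dictionaries. This sketch is purely illustrative and does not reflect how Enterprise Vault or any other product works: the names `Stub` and `StubbedStore` are hypothetical, and removing the archived copy on recall is an assumption made to keep the example short.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stub:
    """Small placeholder left on primary storage in place of the file data."""
    key: str


class StubbedStore:
    def __init__(self):
        self.primary = {}  # path -> file bytes, or a Stub once archived
        self.archive = {}  # archive key -> file bytes

    def archive_file(self, path):
        entry = self.primary[path]
        if isinstance(entry, Stub):
            return  # already archived
        key = f"archive:{path}"
        self.archive[key] = entry
        self.primary[path] = Stub(key)  # users still see the same path

    def read(self, path):
        entry = self.primary[path]
        if isinstance(entry, Stub):
            # Recall on access: pull the data back to primary storage.
            entry = self.archive.pop(entry.key)
            self.primary[path] = entry
        return entry
```

The caller of `read` never learns whether the data came from primary or archival storage, which is the transparency that stubs provide.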
Similar to NetApp, Oracle features block-level deduplication in its Sun ZFS Storage 7000 series systems. But unlike NetApp, dedupe is performed in real-time while data is written to disk. “The overhead of deduplication is less than 7%, depending on the environment and amount of changes in the environment,” said Jason Schaffer, Oracle’s senior director of product management for storage. Among smaller players, BridgeSTOR LLC, with its application-optimized storage (AOS), supports deduplication.
Another vendor apparently committed to data reduction is Dell Inc. With the acquisition of Ocarina Networks in 2010, Dell picked up content-aware deduplication and compression technology, which it intends to incorporate into all its storage systems. “Starting the second half of this year, we’ll launch storage products with the Ocarina deduplication and compression built-in,” said Bob Fine, director of product marketing at Dell Compellent.
While the aforementioned companies developed or acquired data deduplication technology, Permabit Technology Corp. has developed Albireo, a dedupe software library it intends to license to storage vendors, enabling them to add deduplication to storage systems with the advantage of time to market and without the risk inherent in developing it themselves. “With Xiotech, BlueArc and LSI, we have three announced customers, and we expect first product shipments with Permabit deduplication later in 2011,” said Tom Cook, Permabit’s CEO.
Compression
Compression shares many of the challenges of deduplication in primary storage. Like deduplication, compression has a performance overhead. It’s also limited to a volume: whenever data is moved out of that volume, it has to be decompressed, just as deduplicated data has to be rehydrated when moved from one volume to another. In an ideal world, different tiers, including backup and archival tiers, should be able to accept and deal with compressed and deduplicated data, but because of a lack of standards, they usually don’t.
Compression and deduplication are complementary technologies, and vendors that implement deduplication usually also offer compression -- BridgeSTOR, Dell, NetApp and Oracle all do. While deduplication is usually more efficient for virtual server volumes, email attachments, files and backup environments, compression yields better results with data that rarely repeats block for block, such as databases. In other words, deduplication outperforms compression where the likelihood of repetitive data is high.
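Why the two techniques complement each other can be seen with two synthetic data sets. This is a toy comparison under stated assumptions, not a benchmark: the helper names are hypothetical, compression is applied per 4 KB block with `zlib`, the "cloned image" is one incompressible block (pseudo-random bytes standing in for already-compressed content) repeated many times, and the "database-like" blocks are unique but internally repetitive.

```python
import random
import zlib

BLOCK_SIZE = 4096


def dedupe_ratio(blocks):
    """Logical blocks divided by unique blocks."""
    return len(blocks) / len(set(blocks))


def compression_ratio(blocks):
    """Raw bytes divided by bytes after compressing each block on its own."""
    raw = sum(len(block) for block in blocks)
    packed = sum(len(zlib.compress(block)) for block in blocks)
    return raw / packed


def cloned_image_blocks(copies=100, seed=0):
    """Many identical copies of one incompressible block."""
    rng = random.Random(seed)
    block = bytes(rng.getrandbits(8) for _ in range(BLOCK_SIZE))
    return [block] * copies


def database_like_blocks(count=100):
    """Blocks that never repeat as a whole but are redundant inside."""
    blocks = []
    for i in range(count):
        record = f"row={i:08d};status=active;".encode()
        repeats = BLOCK_SIZE // len(record) + 1
        blocks.append((record * repeats)[:BLOCK_SIZE])
    return blocks


if __name__ == "__main__":
    for name, blocks in (("cloned image", cloned_image_blocks()),
                         ("database-like", database_like_blocks())):
        print(f"{name}: dedupe {dedupe_ratio(blocks):.1f}:1, "
              f"compression {compression_ratio(blocks):.1f}:1")
```

In this model, deduplication removes nearly all of the cloned-image data while per-block compression gains nothing on it, and the reverse holds for the database-like blocks.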
In addition to the above vendors, EMC Corp. offers compression in its VNX Unified storage products. It also offers some level of deduplication through the single-instance storage feature for file-based content, which stores a single copy of identical files. IBM offers its Real-time Compression Appliances (STN6500 and STN6800) to front-end NAS storage; the appliances and the compression technology came to IBM via its 2010 Storwize acquisition.
“The Storwize real-time compression software will be a software feature on some IBM arrays later this year, and it will be available across all lines within 18 months,” said Ed Walsh, director of IBM’s storage efficiency strategy.
A blend of new and old techs
Data reduction on primary storage is a reality today and with the unchecked growth of data, it will undoubtedly become a key part of storage efficiency. Data reduction features like RAID 6, thin provisioning, efficient clones and automated storage tiering are becoming must-haves and should be on anyone’s feature list when evaluating a primary storage system. Data deduplication and compression, on the other hand, are emerging technologies that will become more pervasive over time, but right now these relative newcomers are just beginning to have an effect on primary storage.
BIO: Jacob Gsoedl is a freelance writer and a corporate director for business systems. He can be reached at firstname.lastname@example.org.