What you'll learn: This tip discusses data deduplication for primary storage. It examines the data selection criteria you should use when deciding to apply deduplication technology to your primary storage, whether to use inline or post-process deduplication, and the impact dedupe can have on your data storage environment.
Data deduplication has been a hot topic and a fairly common practice in disk-based backups and archives. Users' initial wariness seems to have given way to adoption, and a deeper focus on the technology has opened up more ways to leverage the benefits of deduplication. The next frontier for deduplication is in the realm of primary storage.
What is primary storage?
Primary storage consists of disk drives (or flash drives) on a centralized storage-area network (SAN) or network-attached storage (NAS) array, where the data used to conduct business on a daily basis is stored. This includes structured data such as databases, as well as unstructured data such as email data, file server data and most file-type application data. It's important to understand the difference between structured and unstructured data because not all data is suitable for primary storage deduplication.
Types of data deduplication
There are two main types of data deduplication: inline and post-process. Inline deduplication identifies duplicate blocks as they're written to disk. Post-process deduplication deduplicates data after it has been written to disk. Inline deduplication is considered more efficient in terms of overall storage requirements because duplicate blocks are eliminated before they're written to disk, so you don't need to allocate enough storage to hold the entire data set for later deduplication. However, inline deduplication requires more processing power in the write path because it happens "on the fly"; this can affect storage performance, which is a very important consideration when implementing deduplication on primary storage. Post-process deduplication, on the other hand, doesn't have an immediate impact on storage performance because it can be scheduled to run after the data is written. But unlike inline dedupe, it requires enough data storage to hold the entire data set before deduplication reduces it.
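The storage trade-off described above can be illustrated with a small sketch. This is not how any particular array implements deduplication; the fixed 4 KB block size, the SHA-256 fingerprint and the function names are all assumptions chosen for illustration. Each function returns the peak number of blocks held on disk and the number left after deduplication.

```python
import hashlib
from typing import Dict, List, Tuple

BLOCK_SIZE = 4096  # assumed fixed block size, for illustration only


def split_blocks(data: bytes, size: int = BLOCK_SIZE) -> List[bytes]:
    """Cut a byte stream into fixed-size blocks (the last one may be shorter)."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def fingerprint(block: bytes) -> str:
    """Identify a block by its SHA-256 digest (an assumed choice of hash)."""
    return hashlib.sha256(block).hexdigest()


def inline_dedupe(data: bytes) -> Tuple[int, int]:
    """Check each block before it is stored; duplicates never reach disk."""
    store: Dict[str, bytes] = {}
    for block in split_blocks(data):
        store.setdefault(fingerprint(block), block)  # hashing cost sits in the write path
    # The store only ever grows, so the peak equals the final size.
    return len(store), len(store)


def post_process_dedupe(data: bytes) -> Tuple[int, int]:
    """Write everything first, then remove duplicates in a later pass."""
    staged = split_blocks(data)   # the full data set lands on disk
    peak = len(staged)
    store: Dict[str, bytes] = {}
    for block in staged:          # deferred pass, can run off-hours
        store.setdefault(fingerprint(block), block)
    return peak, len(store)


sample = b"A" * (BLOCK_SIZE * 3) + b"B" * BLOCK_SIZE  # 4 blocks, 2 of them unique
print("inline       (peak, final):", inline_dedupe(sample))
print("post-process (peak, final):", post_process_dedupe(sample))
```

Both approaches end with the same two unique blocks, but the post-process version needs room for all four blocks first, while the inline version pays the hashing cost on every write.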
Criteria for selecting data for deduplication on primary storage
How do you determine which primary data is a good fit for deduplication? This is where the difference between structured and unstructured data comes into play. A database is often a very large file that's subject to frequent, random reads and writes. For that reason, most of its data can be considered active, and any processing overhead associated with deduplication could significantly impact I/O performance. In comparison, if we examine data on a file server, we quickly see that only a small portion of files are written to more than once, and usually only for a short period after they're created. That means a very large portion of unstructured data is rarely accessed, making it a prime candidate for deduplication, and it allows rules to be set that deduplicate data based on a "last access" time stamp. Shared storage for virtual server or virtual desktop environments also presents good opportunities for deduplication because many operating system files aren't unique.
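A "last access" rule of the kind described above might look like the following sketch. The 90-day threshold and the function names are hypothetical, and the sketch assumes the file system actually records access times (some are mounted with access-time updates disabled, which would make this rule unreliable).

```python
import os
import time
from typing import List, Optional

SECONDS_PER_DAY = 86400


def is_cold(last_access: float, now: float, min_idle_days: int) -> bool:
    """True if the file has not been accessed for at least min_idle_days."""
    return (now - last_access) >= min_idle_days * SECONDS_PER_DAY


def find_dedupe_candidates(root: str, min_idle_days: int = 90,
                           now: Optional[float] = None) -> List[str]:
    """Walk a directory tree and list files idle long enough to deduplicate.

    The 90-day default is an assumed policy value, not a recommendation.
    """
    now = time.time() if now is None else now
    candidates = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                last_access = os.stat(path).st_atime
            except OSError:
                continue  # file vanished or is unreadable; skip it
            if is_cold(last_access, now, min_idle_days):
                candidates.append(path)
    return sorted(candidates)
```

Calling `find_dedupe_candidates("/srv/files", min_idle_days=90)` (the path is a placeholder) would return the files a post-process job could pick up on its next scheduled run.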
Other data selection criteria include format and data retention. Encrypted data, and some imaging or streaming video files, tend to yield poor deduplication results because their content looks random and contains few repeated blocks. In addition, data must reside in storage for some time to accumulate enough duplicate blocks to make deduplication worth the effort. Transient data that's only staged to primary storage for a short period -- such as message queue data or temporary log files -- should be excluded. And while archived data yields the best deduplication ratios, that type of data falls outside our primary storage discussion.
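The effect of format on results can be seen by estimating a deduplication ratio (total blocks divided by unique blocks) for two synthetic samples. This is a rough sketch under stated assumptions: fixed 4 KB blocks, random bytes from `os.urandom` standing in for encrypted data, and a 16-block "image" copied eight times standing in for cloned virtual machine system files. Real products use their own chunking and will report different numbers.

```python
import hashlib
import os

BLOCK_SIZE = 4096  # assumed fixed block size, for illustration only


def dedupe_ratio(data: bytes, block_size: int = BLOCK_SIZE) -> float:
    """Estimate the ratio of total blocks to unique blocks in a byte stream."""
    blocks = [data[i:i + block_size] for i in range(0, len(data), block_size)]
    if not blocks:
        return 1.0  # nothing to deduplicate
    unique = {hashlib.sha256(block).digest() for block in blocks}
    return len(blocks) / len(unique)


# Stand-in for a guest OS image: 16 distinct blocks.
os_image = b"".join(bytes([i]) * BLOCK_SIZE for i in range(16))
# Eight virtual machines cloned from that image share every block.
cloned_vms = os_image * 8
# Stand-in for encrypted data: random bytes with no repeated blocks expected.
encrypted_like = os.urandom(BLOCK_SIZE * 128)

print("cloned VM images:", dedupe_ratio(cloned_vms))
print("encrypted-like data:", dedupe_ratio(encrypted_like))
```

The cloned images reduce by a factor of eight, while the random sample is expected to stay at or very near a ratio of 1, which is why such formats are usually left out of a deduplication policy.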
Inline vs. post-process deduplication
Let's say you've excluded encrypted data, streaming video and transient data, and you've established rules to determine "last access" and retention. You've identified primary data storage that's a good fit for deduplication. This is when you'll have to choose between inline and post-process deduplication. If you want to deduplicate files only after they've been inactive, or not accessed for some time, post-process deduplication is the better fit because it can select data based on specific criteria and process it later, after it has been written to disk. Inline deduplication, in contrast, processes all data as it's written and may affect performance for certain types of data. That doesn't necessarily make inline deduplication a poor choice for primary storage. It just means that storage tiering -- determining where you need the best performance -- is a crucial first step before deciding to apply deduplication technology to primary storage.
Not all data is right for primary storage deduplication
Data that requires frequent access with optimum write performance won't be a good fit for data deduplication. Data that's difficult to deduplicate because of its format can be stored on a lower-performance disk array without deduplication to keep costs down. The remaining unstructured data that doesn't require frequent or high-performance access (such as application or user file data) can be stored on a deduplication-enabled primary storage array.
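The placement logic in the paragraph above can be written down as a simple decision function. The tier names, the `DataSet` fields and the order in which the checks are applied are all assumptions made for this sketch; an actual tiering policy would be driven by measured I/O profiles.

```python
from dataclasses import dataclass
from enum import Enum


class Tier(Enum):
    HIGH_PERFORMANCE = "high-performance array, no deduplication"
    LOW_COST = "lower-performance array, no deduplication"
    DEDUPE = "deduplication-enabled primary array"


@dataclass
class DataSet:
    name: str
    needs_fast_writes: bool   # frequent access with optimum write performance
    hard_to_dedupe: bool      # e.g. encrypted data, imaging, streaming video


def place(data_set: DataSet) -> Tier:
    """Pick a storage tier; performance needs are checked first (an assumed priority)."""
    if data_set.needs_fast_writes:
        return Tier.HIGH_PERFORMANCE
    if data_set.hard_to_dedupe:
        return Tier.LOW_COST
    return Tier.DEDUPE


for ds in (
    DataSet("order database", needs_fast_writes=True, hard_to_dedupe=False),
    DataSet("encrypted video archive", needs_fast_writes=False, hard_to_dedupe=True),
    DataSet("user home directories", needs_fast_writes=False, hard_to_dedupe=False),
):
    print(f"{ds.name}: {place(ds).value}")
```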
BIO: Pierre Dorion is the data center practice director and a senior consultant with Long View Systems Inc. in Phoenix, AZ, specializing in the areas of business continuity and disaster recovery planning services and corporate data protection. Over the past 10 years, he has focused primarily on the development of recovery strategies, IT resilience and recoverability, as well as data protection and availability engagements at the data center level.