When implementing a data deduplication system, it's important to consider scalability. Performance should remain acceptable as the storage capacity and deduplication granularity increase.
Scaling and hash collisions
It is critical that data deduplication products detect duplicate data elements reliably, determining that one file, block or byte is identical to another. They do this by processing every data element through a mathematical "hashing" algorithm to create a compact identifier called a hash number, which is intended to be unique to that data. Each hash number is then added to a list, often dubbed the hash index.
When the system processes new data elements, their resulting hash numbers are compared against the hash numbers already in the index. If a new data element produces a hash number identical to an entry already in the index, the new data is considered a duplicate and is not saved to disk; only a small reference "stub" that points back to the identical data already stored is written. If the new hash number is not already in the index, the data element is considered new and is stored to disk normally.
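The lookup described above can be sketched in a few lines of Python. This is a minimal in-memory illustration, not how any particular product works: the `DedupStore` class, its method names, and the choice of SHA-256 as the hashing algorithm are all assumptions made for the example.

```python
import hashlib


class DedupStore:
    """Toy in-memory deduplicating store (hypothetical, for illustration)."""

    def __init__(self):
        self.index = {}   # hash number (hex digest) -> stored data element
        self.stubs = []   # one reference "stub" per element written, in order

    def write(self, element: bytes) -> bool:
        """Record an element; return True if it was new, False if a duplicate."""
        digest = hashlib.sha256(element).hexdigest()  # assumed hash algorithm
        is_new = digest not in self.index
        if is_new:
            # Hash number not in the index: store the data itself.
            self.index[digest] = element
        # New or duplicate, only a small reference is recorded per write.
        self.stubs.append(digest)
        return is_new

    def read(self, position: int) -> bytes:
        """Follow the stub at the given write position back to the stored data."""
        return self.index[self.stubs[position]]
```

Writing the same block twice stores its data once and records two stubs, so `read` returns the original bytes for either position.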
A data element can, however, produce a hash result identical to that of a saved element even though the data itself is not identical. Such a false positive, also called a hash collision, can lead to data loss, because the new data is discarded and replaced with a reference to different data. There are two ways to mitigate false positives.
- The data deduplication vendor may opt to use more than one hashing algorithm on each data element. For example, the Single Instance Repository (SIR) on FalconStor Software Corp.'s virtual tape libraries (VTL) uses out-of-band indexing with SHA-1 and MD5 algorithms. This dramatically reduces the potential for false positives.
- Another option is to use a single hashing algorithm but perform a bit-level comparison of data elements that register as identical.
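Both mitigations can be illustrated with a short Python sketch. The names `combined_key`, `weak_hash` and `VerifyingStore` are hypothetical, and `weak_hash` is deliberately poor so that a collision is easy to trigger; it is not meant to represent any vendor's implementation.

```python
import hashlib


def combined_key(element: bytes) -> tuple:
    """Option 1: index on two independent digests (SHA-1 and MD5).

    A false positive now requires both algorithms to collide on the same data.
    """
    return (hashlib.sha1(element).hexdigest(), hashlib.md5(element).hexdigest())


def weak_hash(element: bytes) -> int:
    """Deliberately weak hash (byte sum mod 256) so collisions are easy to show."""
    return sum(element) % 256


class VerifyingStore:
    """Option 2: a single hash plus a byte-for-byte check on every match."""

    def __init__(self, hash_fn=weak_hash):
        self.hash_fn = hash_fn
        self.buckets = {}  # hash number -> list of distinct elements with that hash

    def write(self, element: bytes) -> bool:
        """Return True if the element was stored, False if it is a true duplicate."""
        bucket = self.buckets.setdefault(self.hash_fn(element), [])
        for stored in bucket:
            if stored == element:   # full comparison confirms the duplicate
                return False
        bucket.append(element)      # new hash, or a collision: keep the data
        return True
```

With `weak_hash`, the elements `b"ab"` and `b"ba"` share a hash number, yet `VerifyingStore` keeps both because the full comparison shows they differ. The trade-off is the extra read and compare on every hash match.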
Scaling and encryption
Another issue is the relationship between deduplication, more traditional compression and encryption in a company's storage infrastructure. Ordinary compression removes redundancy from files, and encryption "scrambles" data so that it appears random and is unreadable without the key. Both compression and encryption play an important role in data storage, but each removes or hides the redundancy that deduplication relies on, which can impair the deduplication process. If encryption or traditional compression is required along with deduplication, the indexing and deduplication should be performed first.
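A minimal Python sketch of that ordering follows, assuming fixed-size blocks, SHA-256 for the index and zlib for compression. The function names and the 4,096-byte block size are illustrative choices, not details from any product. The hash is computed on the raw block, and only blocks that turn out to be new are compressed; an encryption step, if needed, would sit at the same point as the compression call.

```python
import hashlib
import zlib


def store_stream(data: bytes, block_size: int = 4096):
    """Deduplicate raw fixed-size blocks first, then compress only the unique ones."""
    index = {}    # digest of the RAW block -> compressed block
    recipe = []   # ordered digests needed to rebuild the original stream
    for offset in range(0, len(data), block_size):
        block = data[offset:offset + block_size]
        digest = hashlib.sha256(block).hexdigest()  # step 1: index the raw data
        if digest not in index:
            index[digest] = zlib.compress(block)    # step 2: compress new data only
        recipe.append(digest)
    return index, recipe


def load_stream(index, recipe) -> bytes:
    """Rebuild the original stream by decompressing each referenced block."""
    return b"".join(zlib.decompress(index[digest]) for digest in recipe)
```

Because the index is keyed on the raw data, repeated blocks are still recognized as duplicates no matter how the stored copy is later transformed.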
This was first published in November 2007