How to ensure scaling and reliability in data deduplication systems

A data deduplication system needs to be able to scale so that your files are safe and available at all times.

When implementing a data deduplication system, it's important to consider scalability. Performance should remain acceptable as storage capacity and deduplication granularity increase. Equally important is reliability: the system must not lose data because of errors in the deduplication algorithm itself.

Scaling and hash collisions

It is critical that data deduplication products detect duplicate data elements, determining that one file, block or byte is identical to another. Data deduplication products do this by processing every data element through a mathematical "hashing" algorithm to create a unique identifier called a hash number. These numbers are compiled into a list, often dubbed the hash index.

When the system processes new data elements, their resulting hash numbers are compared against the hash numbers already in the index. If a new data element produces a hash number identical to an entry already in the index, the new data is considered a duplicate and is not saved to disk -- only a small reference "stub" pointing back to the already-stored identical data is written. If the new hash number is not already in the index, the data element is considered new and is stored to disk normally.
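The lookup described above can be sketched in a few lines. This is a minimal in-memory model, not any vendor's implementation; the store names (`hash_index`, `chunk_store`, `stub_log`) are hypothetical, and SHA-256 stands in for whichever hashing algorithm a product uses.

```python
import hashlib

# Hypothetical in-memory dedup store: hash index plus chunk storage.
hash_index = {}   # hash digest -> storage key of the saved chunk
chunk_store = {}  # storage key -> chunk bytes
stub_log = []     # reference stubs written instead of duplicate data

def write_chunk(key, chunk):
    """Store a chunk, saving only a reference stub for duplicates."""
    digest = hashlib.sha256(chunk).hexdigest()
    if digest in hash_index:
        # Duplicate: record a stub pointing at the already-stored chunk.
        stub_log.append((key, hash_index[digest]))
        return False  # nothing new written to disk
    hash_index[digest] = key
    chunk_store[key] = chunk
    return True

write_chunk("a", b"hello world")   # new -> stored
write_chunk("b", b"hello world")   # duplicate -> stub only
write_chunk("c", b"other data")    # new -> stored
print(len(chunk_store), len(stub_log))  # 2 unique chunks, 1 stub
```

The space savings come from the second write: a multi-kilobyte chunk is replaced by a small (key, reference) pair.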

Two data elements can produce an identical hash result even though their contents differ. Such a false positive, also called a hash collision, can lead to data loss: the new data is discarded as a "duplicate" when it is actually unique. There are two ways to mitigate false positives.

  • The data deduplication vendor may opt to use more than one hashing algorithm on each data element. For example, the Single Instance Repository (SIR) on FalconStor Software Corp.'s virtual tape libraries (VTL) uses out-of-band indexing with SHA-1 and MD5 algorithms. This dramatically reduces the potential for false positives.
  • Another option is to use a single hashing algorithm but perform a bit-level comparison of data elements that register as identical.
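The second approach above can be sketched as follows. This is an illustrative toy, assuming an in-memory store keyed by digest; a hash match is only trusted after the stored bytes are compared bit for bit against the incoming chunk.

```python
import hashlib

hash_index = {}  # digest -> stored chunk bytes (toy in-memory store)

def is_duplicate(chunk):
    """Treat a hash match as a duplicate only after a byte-level compare."""
    digest = hashlib.sha256(chunk).digest()
    stored = hash_index.get(digest)
    if stored is not None:
        # Hash matched; verify byte for byte to rule out a collision.
        # (A real system would store a colliding chunk under a new key.)
        return stored == chunk
    hash_index[digest] = chunk
    return False

print(is_duplicate(b"block-1"))  # False: first occurrence, stored
print(is_duplicate(b"block-1"))  # True: hash match confirmed byte-for-byte
```

The byte-level compare removes the false-positive risk entirely, at the cost of an extra read of the stored chunk for every hash match, which is exactly the performance trade-off discussed next.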

The problem with both approaches is that they demand more processing power from the host system, reducing index performance and slowing the deduplication process. As deduplication becomes more granular and examines smaller chunks of data, the index grows much larger; the probability of collisions rises with it, which can exacerbate any performance hit.

Scaling and encryption

Another issue is the relationship between deduplication, more traditional compression and encryption in a company's storage infrastructure. Ordinary compression removes redundancy from files, and encryption "scrambles" data so that it is effectively random and unreadable. Both compression and encryption play an important role in data storage, but eliminating redundancy in the data can impair the deduplication process. If encryption or traditional compression is required along with deduplication, the indexing and deduplication should be performed first.
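The recommended ordering can be sketched as a simple pipeline. This is a hedged illustration, not a product architecture: `dedupe_then_compress` is a hypothetical name, `zlib` stands in for whatever compression is in use, and an encryption step would follow at the point marked in the comment.

```python
import hashlib
import zlib

def dedupe_then_compress(chunks):
    """Deduplicate first, then compress only the unique chunks."""
    index, unique = {}, []
    for chunk in chunks:
        digest = hashlib.sha256(chunk).hexdigest()
        if digest not in index:
            index[digest] = len(unique)
            # Compression (and then encryption) happens after dedup,
            # so redundancy is still visible when duplicates are found.
            unique.append(zlib.compress(chunk))
    return unique

chunks = [b"A" * 4096, b"B" * 4096, b"A" * 4096]
stored = dedupe_then_compress(chunks)
print(len(stored))  # 2: the repeated chunk was removed before compression
```

Run in the reverse order, each chunk would be compressed or encrypted independently first; encrypted copies of identical plaintext are generally not byte-identical, so the deduplication stage would find nothing to remove.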

Check out the entire Data Deduplication Handbook.

This was first published in November 2007
