Data deduplication handbook:

How to ensure scaling and reliability in data deduplication systems

19 Nov 2007 | Stephen J. Bigelow


When implementing a data deduplication system, it's important to consider scalability. Performance should remain acceptable as storage capacity and deduplication granularity increase. The system should also protect against data loss caused by errors in the deduplication algorithm.

Scaling and hash collisions

It is critical that data deduplication products detect duplicate data elements, determining that one file, block or byte is identical to another. Data deduplication products do this by processing every data element through a mathematical "hashing" algorithm to create an identifier, ideally unique, called a hash number. Each hash number is then added to a list, often dubbed the hash index.

When the system processes new data elements, their resulting hash numbers are compared against the hash numbers already in the index. If a new data element produces a hash number identical to an entry already in the index, the new data is considered a duplicate and is not saved to disk. Only a small reference "stub" that points back to the identical stored data is saved. If the new hash number is not already in the index, the data element is considered new and is stored to disk normally.
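The lookup described above can be sketched in a few lines of Python. This is a minimal illustration, not any vendor's implementation: the `DedupStore` class and its method names are hypothetical, SHA-256 is chosen only as an example hashing algorithm, and the "disk" is an in-memory dictionary. The sketch also records a stub for every element written, new or duplicate, so that the original stream can be rebuilt.

```python
import hashlib


class DedupStore:
    """Toy in-memory deduplicating store (hypothetical, for illustration)."""

    def __init__(self):
        self.index = {}   # hash index: hash number -> stored data element
        self.stubs = []   # one reference stub per element written, in order

    def write(self, element: bytes) -> bool:
        """Store an element; return True if it was new, False if a duplicate."""
        hash_number = hashlib.sha256(element).hexdigest()
        is_new = hash_number not in self.index
        if is_new:
            self.index[hash_number] = element   # new data is stored normally
        self.stubs.append(hash_number)          # stub points at the stored copy
        return is_new

    def read_all(self) -> bytes:
        """Rebuild the original stream by following the stubs."""
        return b"".join(self.index[h] for h in self.stubs)


store = DedupStore()
results = [store.write(e) for e in (b"block A", b"block B", b"block A")]
```

In this example the third write is treated as a duplicate, so only two data elements are held in the index while three stubs describe the full stream.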

A data element can produce an identical hash result even though the data is not completely identical to the saved version. Such a false positive, also called a hash collision, can lead to data loss because the new data is discarded and replaced with a reference to different data. There are two ways to mitigate false positives.

  • The data deduplication vendor may opt to use more than one hashing algorithm on each data element. For example, the Single Instance Repository (SIR) on FalconStor Software Corp.'s virtual tape libraries (VTL) uses out-of-band indexing with SHA-1 and MD5 algorithms. This dramatically reduces the potential for false positives.
  • Another option is to use a single hashing algorithm but perform a bit-level comparison of data elements that register as identical.
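Both mitigations can be sketched as follows. The names `weak_hash`, `dual_hash_key` and `VerifyingStore` are hypothetical, and `weak_hash` is deliberately poor so that a collision is easy to trigger; production systems use far stronger algorithms. Python compares `bytes` objects byte by byte, which stands in here for the bit-level comparison. This is an illustration under those assumptions, not a description of how FalconStor or any other vendor implements the feature.

```python
import hashlib


def weak_hash(data: bytes) -> int:
    """Deliberately weak hash (sum of bytes mod 256) so collisions are easy."""
    return sum(data) % 256


def dual_hash_key(data: bytes) -> str:
    """First mitigation: combine two algorithms (SHA-1 and MD5) into one key."""
    return hashlib.sha1(data).hexdigest() + hashlib.md5(data).hexdigest()


class VerifyingStore:
    """Second mitigation: confirm every index hit with a full comparison."""

    def __init__(self, hash_fn):
        self.hash_fn = hash_fn
        self.index = {}   # hash number -> list of distinct elements sharing it

    def write(self, element: bytes) -> str:
        bucket = self.index.setdefault(self.hash_fn(element), [])
        for stored in bucket:
            if stored == element:        # full comparison of the contents
                return "duplicate"       # true duplicate, safe to discard
        status = "collision" if bucket else "new"
        bucket.append(element)           # keep the data instead of losing it
        return status


store = VerifyingStore(weak_hash)
outcomes = [store.write(e) for e in (b"ab", b"ba", b"ab")]
```

Here `b"ab"` and `b"ba"` share a weak hash number, so the second write is reported as a collision and stored rather than discarded, while the third write is confirmed as a real duplicate. The same two inputs produce different keys under `dual_hash_key`.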

The problem with both approaches is that they require more processing power from the host system, reducing index performance and slowing the deduplication process. As the deduplication process becomes more granular and examines smaller chunks of data, the index becomes much larger and the probability of collisions increases, both of which can exacerbate any performance hit.
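A rough calculation shows how granularity drives both index size and collision probability. The sketch below uses the standard birthday approximation, which assumes hash numbers are uniformly distributed. The 10 TiB capacity, the two chunk sizes and the 160-bit hash length are illustrative assumptions, not figures from any product.

```python
def index_entries(capacity_bytes: int, chunk_bytes: int) -> int:
    """Worst-case number of index entries if every chunk is unique."""
    return (capacity_bytes + chunk_bytes - 1) // chunk_bytes


def collision_probability(n_entries: int, hash_bits: int) -> float:
    """Birthday approximation: p ~ 1 - exp(-n(n-1) / 2^(bits+1))."""
    import math
    return -math.expm1(-n_entries * (n_entries - 1) / 2 ** (hash_bits + 1))


CAPACITY = 10 * 2**40                      # assumed 10 TiB of stored data
coarse = index_entries(CAPACITY, 128 * 2**10)   # 128 KiB chunks
fine = index_entries(CAPACITY, 4 * 2**10)       # 4 KiB chunks

p_coarse = collision_probability(coarse, 160)   # 160-bit hash, e.g. SHA-1
p_fine = collision_probability(fine, 160)
```

Moving from 128 KiB to 4 KiB chunks multiplies the number of index entries by 32, and because the approximation grows roughly with the square of the entry count, the collision probability rises by about a factor of 1,000, although it remains very small for a 160-bit hash.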

Scaling and encryption

Another issue is the relationship between deduplication, more traditional compression, and encryption in a company's storage infrastructure. Ordinary compression removes redundancy from files, and encryption "scrambles" data so that it appears random and is unreadable. Both compression and encryption play an important role in data storage, but because they eliminate the redundancy that deduplication relies on, they can impair the deduplication process. If encryption or traditional compression is required along with deduplication, the indexing and deduplication should be performed first.
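The effect of ordering can be illustrated with a small experiment. The sketch below counts unique fixed-size chunks in a highly redundant stream before and after encryption. The `toy_stream_encrypt` function is a hypothetical stand-in for real encryption: it XORs the data with a keystream derived from SHA-256 and a running counter. It is not secure and exists only to mimic the way a typical cipher makes identical plaintext blocks look different. The 4 KiB chunk size and the key are also assumptions.

```python
import hashlib

CHUNK = 4096  # assumed fixed chunk size in bytes


def unique_chunks(data: bytes, size: int = CHUNK) -> int:
    """Number of distinct chunks a deduplication index would have to store."""
    pieces = (data[i:i + size] for i in range(0, len(data), size))
    return len({hashlib.sha256(p).digest() for p in pieces})


def toy_stream_encrypt(data: bytes, key: bytes) -> bytes:
    """XOR data with a SHA-256 counter keystream. Illustration only, NOT secure."""
    out = bytearray()
    for counter, start in enumerate(range(0, len(data), 32)):
        pad = hashlib.sha256(key + counter.to_bytes(8, "big")).digest()
        out.extend(b ^ p for b, p in zip(data[start:start + 32], pad))
    return bytes(out)


block = bytes(range(256)) * 16          # one 4096-byte block
backup = block * 50                     # the same block written 50 times

plain_unique = unique_chunks(backup)
cipher_unique = unique_chunks(toy_stream_encrypt(backup, b"hypothetical-key"))
```

Deduplicating the plaintext finds a single unique chunk among the 50, while deduplicating after encryption finds 50 unique chunks, so nothing can be eliminated. This is why the text recommends indexing and deduplicating first.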

Check out the entire Data Deduplication Handbook.



All Rights Reserved, Copyright 2000 - 2009, TechTarget | Read our Privacy Policy