Catching up with deduplication


This article can also be found in the Premium Editorial Download "Storage magazine: Surprise winner: BlueArc earns top NAS quality award honors."


Jim Rose, manager of systems administration with the State of Indiana's Office of Technology, recently installed PureDisk at branch state offices as part of a mandate by Indiana Governor Mitch Daniels to centralize certain IT functions. At each of the 80 offices he manages, Rose installed a Microsoft Windows server with PureDisk software, as well as PureDisk agents on each of the servers targeted for backup. Rose found that the initial backup of a server took between 24 and 36 hours, while the second backup took about half that time; by the second or third day, backup windows across all of these servers were almost back to normal, he says.

"Symantec PureDisk backed up new servers without any discernable performance hit," says Rose. "[But] servers older than three years took longer to complete the initial scan."

Michael Fair, network administrator in the information technology division at St. Peter's Health Care Services in Albany, NY, finds that the performance and management overhead associated with EMC's Avamar is almost nothing compared with what he encountered when backing up his servers with CA BrightStor and Symantec Backup Exec. "I eliminated domain controllers in eight sites and can now run backups during the day if the need arises with no discernible impact to server applications," says Fair.

Introducing PureDisk allowed the State of Indiana's Rose to back up 300 servers across 80 sites in six hours, and he now has a demonstrable, working recovery plan for those sites. However, as the individual responsible for both remote offices and enterprise data centers, he recognizes the limitations of backup software deduplication. An initial backup that takes 24 to 36 hours, coupled with high change rates on central databases, precludes Rose from deploying PureDisk in his core data center environment. For these more mission-critical servers, he looks to disk libraries to keep processing off the hosts.

Inline disk libraries
Disk libraries perform deduplication in two general ways: inline and postprocessing. With inline processing, the disk library processes backup streams and deduplicates the data as it enters the disk library. Inline disk libraries use three general deduplication methods to minimize the performance impact: hash-based, inline compare and grid architecture.

Data Domain's DDX disk library uses a hash-based technique. DDX takes an 8KB slice of the incoming backup data and computes a hash, or fingerprint, value. If the fingerprint is unique, DDX stores the slice as new data; if the fingerprint already exists in its index, the slice is a duplicate and only a reference to the stored copy is kept. The main issues with this approach are the processing required to compute the hash and the need to keep the hash index in memory; as the hash index grows, it spills over from memory onto disk. To mitigate the performance overhead associated with retrieving the index from disk, Data Domain developed a technique called stream-informed segment layout (SISL) that minimizes disk seeks so that performance is CPU-centric; the faster the disk library CPU, the better the performance.
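The hash-based approach described above can be illustrated with a minimal Python sketch. It is not Data Domain's implementation: the class name DedupStore, the fixed 8KB slicing, the choice of SHA-256 as the fingerprint and the purely in-memory index are all assumptions made for illustration, and it does not attempt to model SISL or an index that spills to disk.

```python
import hashlib

SEGMENT_SIZE = 8 * 1024  # 8KB slices, matching the slice size mentioned in the article


class DedupStore:
    """Toy inline deduplicating store (hypothetical; for illustration only)."""

    def __init__(self):
        # Fingerprint index held entirely in memory. A real disk library
        # has to cope with this index outgrowing memory.
        self.segments = {}
        self.bytes_in = 0      # bytes received from backup streams
        self.bytes_stored = 0  # bytes actually written as unique segments

    def write_stream(self, data: bytes) -> list:
        """Deduplicate a backup stream as it arrives.

        Returns a "recipe": the ordered list of fingerprints needed to
        rebuild the stream later.
        """
        recipe = []
        for offset in range(0, len(data), SEGMENT_SIZE):
            segment = data[offset:offset + SEGMENT_SIZE]
            fingerprint = hashlib.sha256(segment).hexdigest()
            if fingerprint not in self.segments:
                # Unique fingerprint: store the segment itself.
                self.segments[fingerprint] = segment
                self.bytes_stored += len(segment)
            # Duplicate or not, the recipe only records the fingerprint.
            recipe.append(fingerprint)
            self.bytes_in += len(segment)
        return recipe

    def read_stream(self, recipe: list) -> bytes:
        """Rebuild a stream from its recipe."""
        return b"".join(self.segments[fp] for fp in recipe)


if __name__ == "__main__":
    store = DedupStore()
    monday = b"A" * SEGMENT_SIZE + b"B" * SEGMENT_SIZE
    tuesday = b"A" * SEGMENT_SIZE + b"C" * SEGMENT_SIZE  # one slice changed
    store.write_stream(monday)
    store.write_stream(tuesday)
    print(store.bytes_in, store.bytes_stored)
```

In this sketch the second backup adds only the one changed slice to storage. Every incoming slice still costs a hash computation and an index lookup, which is the CPU and memory pressure the article describes.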

This was first published in June 2007
