
Compression, dedupe and the law

Data deduplication is the poster child of 2008. Everyone is rushing to add this capability to just about everything that could possibly ever sit on a network; I thought I saw an ad for a cable tester with de-dupe built in! On the face of it, de-dupe looks like the savior it's made out to be (except in a few isolated instances where it actually inflates the size of stored data, but that's another subject for another time).

But take a look a little deeper with my paranoid, curmudgeon-y, semi-lawyer-esque hat on.

De-dupe technology has been likened to "zip" on the fly (no pun intended), which is where I have a couple of problems while wearing my pseudo-legal hat. The first is the act of compression itself. Way back in the olden days of computing there was a product appropriately named Stacker; its purpose in life was to let you fit more on the ridiculously expensive devices in our computers called "hard drives." Microsoft, not content after its licensing deal with Stac fell through, created DoubleSpace (and got sued by Stac, and lost), then DriveSpace (MS-DOS 6.22).

Via the use of a TSR (even the acronym is dusty), these products would intercept all calls destined for your hard drive and compress the data before it got there. Sound familiar? Those disk compression tools had their run, and I used them, but they presented problems with memory management (this was the era when Bill Gates supposedly decided no one would ever need more than 640KB), among other things. That was a phenomenally large problem when I would load up one of my favorite games of the time from Spectrum Holobyte: Falcon 3.0. Falcon fans know what sorts of contortions one had to endure to get enough conventional memory to run it, but I digress.

So I would try to get around having Stacker or DoubleSpace turned on all the time. That didn’t work out well for me, and I spent quite a bit of time compressing and re-compressing my hard drive, enabling and disabling Stacker and DoubleSpace and setting up various non-compressed partitions.

While I don't see that specific instance as an issue now per se, I do have that (bad) experience, and because of it I have a problem with something sitting inline with my data, compressing it with a proprietary algorithm that I can't undo if/when the device decides it doesn't like me anymore. Jumping back 16 years, it wasn't that hard to format and reinstall DOS, which occupied only a small part of my (then gigantic) 160MB ESDI hard drive, to get around the problems I had. But today, when we are talking about multiple terabytes and more, I want to be sure that I can get to my data unfettered when I need it.

The reason I am paranoid about getting access to my data when I need it: compliance and legal situations. Which brings me to my second point. How will de-dupe stand up in court? Is it even an issue? Is compression so well understood and accepted that it wouldn't even be a problem? Even as paranoid as I am, I would have to say ... maybe.

Compression has been around for a very long time; we are used to it, we accept it, and we accept some of its shortcomings (ever try to recover a corrupted zip file?) and its limitations. But will that stand up in court? In today's digital world there are quite a few things being decided in our court systems that may not necessarily make sense. Are we sure our legislators understand the difference between zip (lossless) and JPEG (lossy) compression? How does the act of compressing affect the validity of the data? Does it affect the metadata or envelope information? The answers to these questions, while second nature for us technology folks, may not be so second nature for the people deciding court cases. Because compressing and decompressing data is a physical change to the bits as stored, I can imagine a lawyer trying to invalidate data on that basis.

I hope that doesn't turn out to be the case. The de-dupe products currently on the market have some astounding technology and performance. They also return quite a bit to the bottom line when used as prescribed, and for most organizations the solid, quantifiable return on investment they represent outweighs the risks.

Join the conversation



Block-based de-duplication is just one of a set of de-duplication technologies. Object-based de-duplication (Single Instance Storage) is another, and it doesn't modify the existing data; it just places pointers where other copies of that data might reside. While in some cases it has become the archive, your backup shouldn't *be* your archive. If someone goes down the route of invalidating compression because it modifies the data, that makes everything written to a tape drive with onboard compression (or encryption) enabled suspect. The same goes if you're using backup software to encrypt or compress data, or using backup software at all, since that will inject its own metadata into the mix as the backup is written out in the backup app's format. Ultimately all of these de-dupe technologies (except encryption, but including thin provisioning) are *capacity optimisation* technologies. There is something different to choose depending on your needs. I have an MS launch T-shirt somewhere which says "We came, we saw, we DoubleSpaced" ;)
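The single-instance idea the comment describes can be sketched in a few lines of Python. This is a toy illustration, not any vendor's implementation: each unique blob is stored once, keyed by its content hash, and a second "copy" of the same bytes is recorded only as a pointer to the existing blob.

```python
import hashlib


class SingleInstanceStore:
    """Toy single-instance store: unique content is kept once;
    every logical object is just a pointer (its content hash)."""

    def __init__(self):
        self.blobs = {}    # content hash -> bytes (stored once)
        self.objects = {}  # object name  -> content hash (pointer)

    def put(self, name, data):
        digest = hashlib.sha256(data).hexdigest()
        # Store the bytes only if this content has never been seen.
        if digest not in self.blobs:
            self.blobs[digest] = data
        self.objects[name] = digest
        return digest

    def get(self, name):
        return self.blobs[self.objects[name]]


store = SingleInstanceStore()
store.put("report-v1.doc", b"quarterly numbers")
store.put("copy-of-report.doc", b"quarterly numbers")  # duplicate content
print(len(store.blobs))    # 1 -- one physical copy on "disk"
print(len(store.objects))  # 2 -- two logical objects, via pointers
```

Note that the original data is never rewritten or altered; de-duplication here only affects where the second reference points, which is the distinction the comment draws against inline compression.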
Tony, good post! Let me add a couple of comments. I totally agree with you that in-band compression is scary: it affects the performance of writes and reads, and does not give you enough choice about what to compress or when. Fortunately, in-band compression is not the only option these days. Files can be compressed in the background, after they are saved. There's no reason you have to compress a whole disk, share, or directory either. Modern tools can use policies to decide what to compress (e.g., all files with the extension .mp3, or all files that have not been modified for 60 days) and how aggressively to compress it (compress for maximum space saving, or compress for fastest decompress time?).

Dedupe is just one form of compression, and by no means is it the most effective for online storage (the files on your hard disk or NAS share). Dedupe is good for backups, because repetitive backups create duplicates. You won't find as many dupes in your online set of files as you will in 30 or 365 days' worth of backups. Different compression techniques are called for if you are trying to reduce the size of an online data set.

Further, almost every file type driving today's storage growth is already compressed. Microsoft Office 2007 compresses every file on save, and PDF, JPEG (as you point out), all video formats, and most other common file types already include some form of compression done by the application itself when it saves its files. Typically, that compression is some variant of a common generic algorithm, Lempel-Ziv for example. So if the native format of a file is compressed, it's hard to say that additional lossless compression would alter its legal validity.

For compliance purposes, compression has to be bit-for-bit lossless, and that can be verified by taking and storing cryptographic checksums before compression. That way, you can always compare your decompressed file with that original checksum to see if it's bit-for-bit the same.
That makes sense if you're talking about corporate memorandums; it might not be as important if you're talking about your JPEGs of the family vacation last summer.
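The checksum workflow described in the comment above is easy to demonstrate. A minimal sketch using Python's standard zlib and hashlib modules (the memo text is made up for illustration): record a cryptographic checksum before compressing, then later prove the decompressed copy is bit-for-bit identical to the original.

```python
import hashlib
import zlib

original = b"Executive memo: the Q3 numbers are final." * 100

# 1. Record a cryptographic checksum of the data before compression.
checksum_before = hashlib.sha256(original).hexdigest()

# 2. Compress losslessly, store, and later decompress.
compressed = zlib.compress(original)
restored = zlib.decompress(compressed)

# 3. Verify: matching checksums prove the copy is bit-for-bit identical.
assert hashlib.sha256(restored).hexdigest() == checksum_before
print(f"{len(original)} bytes -> {len(compressed)} bytes, checksum verified")
```

Because the stored checksum was taken before compression ever touched the data, it serves as independent evidence that the round trip changed nothing, which is exactly the kind of verification a compliance or legal challenge would demand.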
Hi all, I have written a simple, very basic tutorial on dedupe. Please let me know how I can improve it and whether I should add more references. Thanks