Where does it make the most sense to start dedupe?
In my last column, I explained the biology around how and why we end up with so many data replicas, as well as why deduping in the backup process is such a great idea. So when do we apply this concept up the food chain? If it's good in backup, it should be great in the primary infrastructure. But different types of data are created in primary: records, files, objects, blogs and so on. Data lives in the primary infrastructure, but it goes through different lifecycle stages. So where and when does it make the most sense to kill data copies? Read on, my friends.
Stage 1. All data -- Word docs, PowerPoint presentations, trading data, arbitrage, video, MP3s -- is born dynamic or transactional. Everything is dynamic for some time. At this early stage, data tends to matter the most and has the highest degree of protection. If we lose data -- whether it's a document being written on a laptop or big-money transactions being processed on a massive system -- the biggest impact is here. This is where we normally make our first replica; at a minimum, we probably mirror data at this point.
Stage 2. According to the universal data lifecycle, which I'm perpetuating because it's correct, simple and obvious, all data becomes fixed or persistent after some time. Not at the same time, but eventually. It's subjective, not objective (not truly, but it makes people feel better if I say that). At some point, data stops changing and simply "is."
The second stage is what we term "persistent active data," which is data that no longer changes, but is still very active. That doesn't mean access to that data is automatically less important; usually, it's more important at this stage. But this is where we tend to make the most primary copies of data. We replicate for disaster recovery by making backup copies and snapshots. We replicate to test/development systems and data warehouses. We email copies to our suppliers, partners and cousin Chuck. Then we back up the copies of the copies and make more copies. We need to provide these copies for as long as disparate systems and applications require them. We probably don't need to keep backing up all 87 copies but, to repeat the only phrase my 16-year-old daughter ever utters to me, "Whatever."
Stage 3. The third stage of life is when data enters the "persistent inactive" state: non-changing data that's rarely accessed. This is where 90% of all commercial data sits in its lifecycle and, thus, is where 90% of the capital and operational gains can be made from both process and technology. Why would anyone back up this data? It never changes, and you've already backed up copies of copies of it. It's the same with disaster recovery. At this stage, you want to be thinking about treating this data much differently than in previous stages. It needs to be on cheap, write-once, read-seldom-if-at-all, power- and cooling-efficient gear that preferably a monkey can manage. This is the stage where we want to massively reduce the copies of data we have. It's still primary storage, but by applying dedupe here we can probably chop 50% or more of our overall capacity off at the knees. If you couple that with some common-sense backup/disaster recovery policy changes, you might get a free weekend or two.
Stage 4. The fourth stage is the "Who cares? I'm quitting if we ever actually need to go to this stage to recover" stage. It's the offsite deep archive or doomsday play. You have to do it, but you don't have to do it with 9,756 copies of the same non-changing data, do you? Three or four copies seems OK to me.
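For the hands-on crowd, here is a minimal sketch of how the first three stages could be told apart from nothing more than timestamps. The 30- and 90-day windows, the Item record and the lifecycle_stage function are all illustrative assumptions of this sketch; the lifecycle itself doesn't prescribe numbers. Stage 4 is left out because it describes where a copy lives (offsite deep archive), not how the data behaves.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

# Hypothetical thresholds; pick values that fit your own environment.
CHANGE_WINDOW = timedelta(days=30)   # modified this recently -> still dynamic
ACCESS_WINDOW = timedelta(days=90)   # read this recently -> still active


@dataclass
class Item:
    """Illustrative record for one piece of data (file, object, record set)."""
    name: str
    last_modified: datetime
    last_accessed: datetime


def lifecycle_stage(item: Item, now: datetime) -> int:
    """Return 1 (dynamic), 2 (persistent active) or 3 (persistent inactive)."""
    if now - item.last_modified <= CHANGE_WINDOW:
        return 1  # still changing: protect it hardest
    if now - item.last_accessed <= ACCESS_WINDOW:
        return 2  # fixed but busy: this is where the copies pile up
    return 3      # fixed and quiet: the prime dedupe candidate
```

Run something like this over a file server inventory and the Stage 3 pile is the first place to look for duplicate copies.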
The inevitable next step is to figure out how to slide the dedupe lever closer to the point of creation, and the biggest value point will be at Stage 3. Eventually, it will go right up to the actual creation point itself, but for that to happen we're going to need data virtualization, which is a different topic. We also have to recognize that crushing backup data (which is brilliant, by the way) means deduping files, but in primary capacity we don't just have files. We need to dedupe blocks, records, objects and so on. Doing it all at backup is cool because we can take all of the data types and amalgamate them into files and deal with them, but we'll have to get smarter when we move upstream. There are only a handful of people talking about squashing the database, for example. Talk about big-money play potential. The ROI of squishing data on the most expensive, most complex, most visible transaction systems will be huge. Backup is a pain in the backside for sure, but if dedupe in the backup process has created a few billion dollars of value, imagine what it could do in the transactional world. Video and multimedia will also be huge because of the sheer capacity they will consume. Object-based stuff was born to hash, but it's still not a mainstream play outside of compliance. If you think you're done hearing about dedupe, it's about to replicate.
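To make "born to hash" a little more concrete, here is a rough sketch of block-level dedupe that doesn't care whether the bytes came from a file, a record or an object. It assumes fixed 4 KB blocks and SHA-256 fingerprints, both choices of this sketch; real products may well use variable-size chunking or other fingerprints. The function names dedupe_blocks and rebuild are made up for illustration.

```python
import hashlib


def dedupe_blocks(streams, block_size=4096):
    """Store each distinct block once and keep a per-stream recipe of digests.

    Returns (store, recipes, logical_bytes, physical_bytes).
    """
    store = {}      # digest -> block bytes, kept exactly once
    recipes = []    # one list of digests per input stream
    logical = 0     # bytes the callers think they stored
    for data in streams:
        recipe = []
        for offset in range(0, len(data), block_size):
            block = data[offset:offset + block_size]
            digest = hashlib.sha256(block).hexdigest()
            store.setdefault(digest, block)  # only the first copy is kept
            recipe.append(digest)
            logical += len(block)
        recipes.append(recipe)
    physical = sum(len(block) for block in store.values())
    return store, recipes, logical, physical


def rebuild(store, recipe):
    """Reassemble one original stream from its recipe."""
    return b"".join(store[digest] for digest in recipe)
```

Feed it two identical copies of the same data plus a near-copy and the physical total stays close to the size of one copy, while rebuild still hands every consumer back exactly what it stored.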
BIO: Steve Duplessie is founder and senior analyst at Enterprise Strategy Group. See his blog at http://esgblogs.typepad.com/steves_it_rants/.