Tools like automated tiering and thin provisioning help users cope with capacity demands, but more drastic measures, like primary storage data reduction, are needed.
Ten years ago, 10 TB was considered a large storage environment. Now it's common to have hundreds of terabytes, and there are even environments with double-digit petabytes of storage. It's safe to assume that data storage capacity growth will continue over the next 10 years as storage environments measured in exabytes begin to emerge and, over time, become mainstream. I actually talked to one customer who claimed they would have an exabyte of data in the next three years.
Having that much physical storage in the data center is ultimately untenable. So how do we solve the problem? A big part of the answer will come from a number of technologies. Hard disk drives will continue to become denser, and higher-capacity drives store more data in the same physical space. However, fatter disk drives can hurt application performance, because each drive has to serve more data. Therefore, intelligent tiering that promotes active data to fast storage tiers and demotes inactive data to dense ones will balance performance and capacity.
Storage optimization technologies such as thin provisioning provide a better way to use the capacity you already have within your storage systems. Storage systems that use traditional provisioning methods typically have 50% to 70% of their capacity allocated but unused. Users who implement thin provisioning have a much higher utilization rate. If you can reduce allocated but unused capacity to 20%, it will yield significant savings in a petabyte world. For an environment with 1 PB of storage, implementing thin provisioning could result in 300 TB to 500 TB of capacity being saved. If you have 10 PB, then we're talking a savings of 3 PB to 5 PB.
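The arithmetic behind those figures is easy to check. The following is a minimal sketch, assuming 1 PB = 1,000 TB and treating the savings as the drop in the allocated-but-unused share of total capacity; the function and parameter names are illustrative, not taken from any product.

```python
def thin_provisioning_savings_tb(total_tb, unused_before_pct, unused_after_pct=20):
    """Capacity (in TB) freed when the allocated-but-unused share of
    total capacity falls from unused_before_pct to unused_after_pct.

    Percentages are whole numbers to keep the arithmetic exact.
    """
    if unused_after_pct > unused_before_pct:
        raise ValueError("the unused share should not grow after thin provisioning")
    return total_tb * (unused_before_pct - unused_after_pct) / 100


# Assumption: 1 PB = 1,000 TB, which is how the figures above line up.
for total_tb in (1_000, 10_000):
    low = thin_provisioning_savings_tb(total_tb, 50)
    high = thin_provisioning_savings_tb(total_tb, 70)
    print(f"{total_tb:,} TB environment: {low:,.0f} TB to {high:,.0f} TB saved")
```

Going from 50% or 70% unused down to 20% frees 30% to 50% of total capacity, which is where the 300 TB to 500 TB range for 1 PB comes from.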
These are great leaps, and I submit that another major leap will be data reduction (data deduplication) for primary storage. The math is simple and the value proposition is a no-brainer. Even moderate dedupe is economically attractive. If your data is consuming 100 TB of disk space and you're able to cut that in half, you would reclaim 50 TB of capacity. That's a fairly modest 2:1 ratio, which should be easily achievable. If you were able to get a 5:1 ratio, you're talking approximately 80 TB of reclaimed capacity. If we consider a petabyte data center, you can save 500 TB on the conservative side (a 2:1 reduction ratio), and 800 TB if you're more optimistic (5:1 ratio). For 10 PB of data, the result could be a capacity savings of up to 8 PB.
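To see how the ratio maps to reclaimed capacity, note that an N:1 reduction leaves 1/N of the data on disk, so the reclaimed share is 1 - 1/N. A minimal sketch, again assuming 1 PB = 1,000 TB and using an illustrative function name:

```python
def reclaimed_tb(stored_tb, ratio):
    """Capacity (in TB) freed when stored_tb of data is reduced ratio:1.

    For example, ratio=5 means a 5:1 reduction, which leaves one fifth
    of the original data on disk.
    """
    if ratio < 1:
        raise ValueError("reduction ratio must be at least 1")
    return stored_tb - stored_tb / ratio


# Assumption: 1 PB = 1,000 TB.
for stored_tb in (100, 1_000, 10_000):
    for ratio in (2, 5):
        saved = reclaimed_tb(stored_tb, ratio)
        print(f"{stored_tb:,} TB at {ratio}:1 reclaims {saved:,.0f} TB")
```

The returns diminish as the ratio climbs: moving from 2:1 to 5:1 lifts the reclaimed share from 50% to 80%, while going all the way to 10:1 would only add another 10 points.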
The savings are staggering when you consider just the capital costs, but they also drive down your maintenance costs. When you factor in the impact on operations and people resources, the value proposition becomes even more compelling. And if you add all of that to power, cooling and floor space savings, primary dedupe can completely change the IT landscape.
You would expect every storage system vendor to have deployed primary deduplication by now, but there are some significant issues to overcome, such as:
- There's a potential performance impact, which is a no-no in storage.
- Primary dedupe may require more internal resources (e.g., memory and CPU). In some cases, it isn't simply a question of adding more because of design limitations.
- Even if there are no physical resource issues, some storage systems may require architectural changes to support deduplication. This could take years or may not even be possible in some cases.
- Regardless of what anyone tells you, primary dedupe is complex technology that's typically not a core competency for most vendors.
- If something goes wrong, the risk -- losing data forever -- is high, so vendors are cautious.
There are two storage system vendors that provide primary dedupe today. While both vendors have modest adoption, it certainly isn't extensive. The reason for this is that their deduplication products have distinct limitations in terms of scalability and performance. However, we're on the threshold of more and better products coming to market. You'll see announcements later this year and in 2011, and it will grow from there.
Data dedupe is a form of virtualization and I believe it will become as ubiquitous as server virtualization within all tiers of data storage. The amount of storage we have today and its growth over the next decade add up to a pervasive problem that has to be solved. The reality is that we can't just keep throwing money at the problem.
BIO: Tony Asaro is senior analyst and founder of Voices of IT.