Data storage capacity pitfalls and how to avoid them

Organizations waste a staggering amount of data storage capacity, much of which can be attributed to improper capacity management.

Like death and taxes, increasing data storage capacity demand is now one of life's certainties. But another truth is that we waste a lot of capacity through design and mismanagement.

Some drivers of capacity demand, other than data growth, can be gleaned from numerous studies showing that up to 70% of the capacity of every flash device or disk drive we deploy is wasted on a mind-numbing number of copies of the same file, on data that is never accessed, or on data whose owner no longer works at the company.
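One way to see how many copies of the same file are sitting on a volume is to group files by a content hash. The following is only a minimal sketch: the function name `find_duplicate_files` is hypothetical, and it reads each file fully into memory, which would need to be replaced with chunked reads for large files.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def find_duplicate_files(root):
    """Return groups of files under `root` whose contents are byte-identical."""
    by_digest = defaultdict(list)
    for path in Path(root).rglob("*"):
        if path.is_file():
            # Files with the same SHA-256 digest are treated as copies.
            digest = hashlib.sha256(path.read_bytes()).hexdigest()
            by_digest[digest].append(path)
    # Keep only digests that occur more than once, with paths sorted for stable output.
    return [sorted(paths) for paths in by_digest.values() if len(paths) > 1]
```

Each returned group is a candidate for consolidation; whether a copy can actually be deleted is still a decision for the data's owner.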

From a design standpoint, we waste a lot of capacity by adopting software-defined storage (SDS) strategies that require a minimum of three storage nodes with cross-replication of all data across all nodes, or by using antiquated file systems that claim space even when it is not used to store the bits of an object. Or we waste capacity simply by allocating storage in a very flawed manner. Witness the time-honored practice of allocating data storage capacity to a server administrator who requests it and then letting him or her decide whether to lay on a file system to put the space to work or just "disappear the capacity" (set it aside for an emergency, or rather, forget about it).
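The replication overhead mentioned above is easy to quantify. This is a rough sketch that assumes every byte is stored once per replica and ignores metadata, spare space and file system overhead; the function name and the 10 TB node size are made up for illustration.

```python
def usable_capacity_tb(node_capacity_tb, node_count, replicas):
    """Estimate usable capacity when every byte is stored `replicas` times."""
    if replicas < 1 or replicas > node_count:
        raise ValueError("replicas must be between 1 and node_count")
    raw_tb = node_capacity_tb * node_count
    return raw_tb / replicas


# Three hypothetical 10 TB nodes with all data replicated to all three nodes.
usable = usable_capacity_tb(node_capacity_tb=10, node_count=3, replicas=3)
print(usable)  # 10.0 TB usable out of 30 TB purchased
```

Under these assumptions, two-thirds of the purchased capacity goes to copies before any user data is written.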

Data storage capacity isn't just a user problem

The data storage industry has delivered two things for years: miracles of technology and monuments to greed in the form of storage infrastructure products. From the 1980s until very recently, disk capacity has doubled approximately every 18 months while the cost of drives has fallen by half about every 12 months -- a technological miracle with economic benefits. But the price of an array -- a collection of commodity storage components in a commodity rack using something like a server motherboard as a controller -- has actually increased by as much as 120% per year.

A lot of this cost has to do with value-added software, which vendors have been affixing to their proprietary storage arrays to stand out in the marketplace and lock customers into their brand. SDS is predicated in part on the idea of returning arrays to their commodity state by abstracting all the value-added software into a free-standing software services layer that lives on a server. Vendors call this revolutionary, but I see it as a return to something akin to mainframe system-managed storage (DFSMS) controlling a bunch of direct-attached storage devices.

It remains to be seen whether this strategy will make storage any less costly to acquire, especially with respect to the SDS approaches advanced by leading server hypervisor vendors that are just as constricting and closed from an architectural perspective as the SAN and NAS systems they purport to replace. Because of the way storage is packaged and sold, the capital expense (Capex) is huge. Storage accounts for 33 cents to 70 cents of every dollar spent on IT hardware today, and that is a lot of money even if we divide the cost of hardware, warranty and maintenance contracts, and software licenses by the number of years of service we hope to get from the investment.

Figuring out the true cost of ownership

But Capex (acquisition) costs are just part of the storage equation. To determine the cost of ownership, you need to look at operating expense (Opex), a number that is not nearly as available in most organizations. According to Gartner, annual Opex for storage is somewhere between four and five times annualized Capex.
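To make the arithmetic concrete, here is a small sketch of the calculation described above. It assumes annualized Capex is simply the acquisition cost divided by the years of service, and it takes the four-to-five-times Opex multiplier as an input; the function name and the dollar figures are hypothetical.

```python
def annual_storage_cost(acquisition_cost, service_years, opex_multiplier):
    """Return (annualized_capex, annual_opex, annual_total) for a storage purchase."""
    if service_years <= 0:
        raise ValueError("service_years must be positive")
    annualized_capex = acquisition_cost / service_years
    # Opex is estimated as a multiple of annualized Capex (4 to 5 per the figure cited above).
    annual_opex = annualized_capex * opex_multiplier
    return annualized_capex, annual_opex, annualized_capex + annual_opex


# Hypothetical $500,000 purchase kept in service for five years.
low = annual_storage_cost(500_000, 5, opex_multiplier=4)
high = annual_storage_cost(500_000, 5, opex_multiplier=5)
print(low)   # (100000.0, 400000.0, 500000.0)
print(high)  # (100000.0, 500000.0, 600000.0)
```

In this example, the yearly cost of owning the gear equals or exceeds its entire purchase price.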

Opex includes backup and recovery, planned downtime, management and administration, and facility expenses (space, power and cooling). These numbers are probably buried in the books rather than noted as a clearly defined line item. However, we know that because we sometimes fail to manage storage infrastructure with great efficiency, real Opex can be pretty high.

Why don't we manage storage well? We tend to buy mostly Tier-1 storage, which is low-capacity, high-performance gear intended to make freshly minted applications shine on the day they are deployed -- whether they are critical applications or not. But I say that's wrong at the outset.

Traditionally, storage comes in different flavors for a reason. Some gear is optimized to store fast-accumulating data at sufficient speed to match the performance of demanding transaction systems. Other gear is designed to store large volumes of data that are routinely or occasionally updated or modified. Still other systems are designed for long-term mass storage of data with infrequent access and extremely low change rates. In well-managed environments, data moves from tier to tier according to a policy that has been automated using hierarchical storage management or archiving software. That is about the only way we can contain storage costs.
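A tiering policy of the kind described above can be as simple as a rule based on how long it has been since data was last accessed. The sketch below is illustrative only: the tier names and the 30-day and 180-day thresholds are assumptions, not values from any hierarchical storage management product.

```python
from datetime import datetime, timedelta

# Hypothetical policy: (maximum age since last access, tier name), checked in order.
TIER_RULES = [
    (timedelta(days=30), "tier1_performance"),
    (timedelta(days=180), "tier2_capacity"),
]
ARCHIVE_TIER = "tier3_archive"


def choose_tier(last_accessed, now):
    """Pick a storage tier from the time elapsed since the data was last accessed."""
    age = now - last_accessed
    for max_age, tier in TIER_RULES:
        if age <= max_age:
            return tier
    # Anything older than every threshold is treated as archival data.
    return ARCHIVE_TIER


now = datetime(2024, 1, 1)
print(choose_tier(datetime(2023, 12, 20), now))  # tier1_performance
print(choose_tier(datetime(2023, 9, 1), now))    # tier2_capacity
print(choose_tier(datetime(2022, 1, 1), now))    # tier3_archive
```

A real policy would likely weigh more than access age, such as change rate or business value, but the principle of an automated rule driving placement is the same.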

Unfortunately, to make tiering work, you need integrated infrastructure or at least a common management scheme. In some cases, vendors make it difficult to make their gear work with that of a competitor. Value-added software can be used to obfuscate efforts at common management, and with some arrays, proprietary file layout systems are deliberately implemented to preclude the sharing of data between heterogeneous platforms.

And the latter problem is not just associated with proprietary legacy arrays. Hypervisor vendors are implementing their own SDS stacks in ways that prohibit the sharing of their storage resource with data from workloads that have been virtualized using a competitor's hypervisor SDS model.

If you're wondering how "your" storage became "their" storage, you aren't alone. These proprietary barriers undermine our ability to automate data movement across storage tiers. They are a byproduct of the way in which SDS borrows from high-performance supercomputing hyper-clusters, which is where SDS vendors seem to get many of their topology concepts. The effect of many SDS architectures is to build a flat storage infrastructure that is identically configured and deployed behind every virtual server. The result is a set of building blocks -- hyper-converged infrastructure nodes of server, storage and SDS middleware -- that can be rolled out of inventory and deployed rapidly to respond to the demands of business managers and their changing business processes. Need another 50 seats of ERP? Roll out three more building blocks for adequate compute, network and storage capacities.

This may sound like the road to real agility until you realize that the infrastructure is inherently flat. There is no expensive-but-low-capacity storage tier to use with hot data. And there's no less-expensive, high-capacity storage tier for active but less frequently accessed data, and no cheap high-capacity storage tier for inactive and archival data. Failure to tier storage and manage data across tiers is a huge cost accelerator all on its own.
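A back-of-the-envelope comparison shows why a flat infrastructure accelerates cost. The sketch below assumes made-up prices per terabyte and a made-up split of hot, warm and cold data; only the shape of the calculation is the point.

```python
def blended_cost(capacity_tb, mix_percent, price_per_tb):
    """Total cost when `mix_percent` of the capacity is placed on each tier."""
    if sum(mix_percent.values()) != 100:
        raise ValueError("tier shares must add up to 100 percent")
    total = sum(capacity_tb * pct * price_per_tb[tier]
                for tier, pct in mix_percent.items())
    return total / 100


# Hypothetical prices per TB; real figures vary widely by vendor and media.
PRICE_PER_TB = {"hot": 1000, "warm": 300, "cold": 50}

flat = blended_cost(100, {"hot": 100}, PRICE_PER_TB)
tiered = blended_cost(100, {"hot": 10, "warm": 30, "cold": 60}, PRICE_PER_TB)
print(flat, tiered)  # 100000.0 22000.0
```

With these assumed numbers, placing all 100 TB on the most expensive tier costs several times what a tiered layout would.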

Considerations around SDS and data storage capacity

Software-defined storage allows you to consolidate storage services into a software layer on a server. This is a good thing, because it pulls them off array controllers where their functionality is isolated and benefits are confined to only one rig. It makes little sense to confine deduplication or thin provisioning to one stand of disk rather than extend it as a service across all storage platforms. To this extent, the SDS folks aren't pulling our collective leg.

What is misrepresented is that the consolidation of these value-added software functions in a server-side storage software stack is the panacea for storage management. It isn't.

In addition to storage services, storage management entails the management of the storage resource: the physical infrastructure, its operational status, and its allocation and de-allocation to workload. Just managing data replication services does little if mirroring does not work because a disk drive or flash device failed on a storage node and no one knows about it. Moreover, it contributes nothing to agility if storage cannot be mixed and matched behind those applications that need it to support changing workload demands.

When you think about it, the only reason storage resource management isn't part of a hypervisor vendor's preferred SDS stack seems to be the relationships some vendors have with storage hardware vendors. If we virtualize the storage infrastructure, abstracting away not only the value-added software services but resource capacity management, the storage resource can become just as software-defined as storage services and be much more readily allocated, de-allocated and tiered.

Storage virtualization is nothing new. DataCore Software has been doing it for more than a decade, and IBM's SAN Volume Controller is making a push into this territory. But today's chief hypervisor players don't seem interested in creating a sharable storage resource that can be used with SDS to match data to the right kind of storage and services to optimize data use and storage costs.

If you are hoping to tame the storage cost beast, know that software-defined storage, hyper-converged infrastructure, and old-fashioned proprietary storage arrays and fabrics are not a panacea. You need to consider virtualizing your data storage capacity at the same time you centralize your storage services. Only then can you preserve your tiering strategy and share capacity and services with all your data in an efficient way.
