Dragon Slayer Consulting
Published: 02 Oct 2015
Capacity. It's usually the first word that comes to mind when thinking about storage. Not performance, reliability, availability or even serviceability. Data capacity is almost always the top concern: how much of it is left, and how fast it is being consumed.
That is the odd quirk about storage technology. Unlike all other computing and networking technologies, only storage is consumed. All others are utilized over and over again. The IT predisposition to never dispose of any data -- ever -- demonstrates how storage is constantly consumed.
For many shops, a new storage system must have enough capacity to contain all of the data stored on the one it will replace plus all of the projected data that will be created and stored during its lifetime. Data storage never shrinks. It just relentlessly gets bigger. Regardless of industry, organization size, level of virtualization or "software-defined" ecosystem, it is a constant stress-inducing challenge to stay ahead of the storage consumption rate. That challenge is not getting any easier.
More devices than ever are creating additional data to be stored at an accelerating rate. The Internet of Things (IoT) promises to boost data storage growth into warp speed, and the vast majority of that data is unstructured. Unstructured data historically had nominal value as it aged, but clever technologists have changed that paradigm. That unstructured data today has huge potential business intelligence (BI) value that analytical tools such as NoSQL databases or Hadoop can extract for competitive advantage. With more devices than ever creating data and more uses for that data, organizations are even more reluctant to deep-six anything.
Storage has become the biggest technology line item in many data centers. Data growth velocity has not slowed and is, in fact, accelerating. Budgets, on the other hand, are not accelerating.
To make matters worse, gains in media capacity density have slowed across all media because of quantum-level technology limitations. The rate of deceleration varies by technology (lower for NAND flash, higher for hard disk drives, tape and optical).
So, IT is forced to find cost-effective ways to cope with expanding data capacity requirements without breaking the bank.
High capacity media
Even though media capacity gains are slowing, they are still occurring. For example (all capacities are "raw"): the largest 3.5-inch HDDs available today are 8 TB (Seagate) and 10 TB (WD-HGST); the largest 2.5-inch SSDs are 4 TB (SanDisk, Samsung, Seagate and Toshiba); the largest PCI Express flash drives range up to 15.36 TB (Samsung); the largest custom form modules (CFMs) range up to 8 TB (SanDisk); the largest LTO tape is currently 2.5 TB (all vendors); and the largest tape drive is 8.5 TB (Oracle).
Use high-density shelves
One way to reduce rack space and floor space is to utilize high-density shelves. The high-density shelves for 3.5-inch hard disk drives (HDDs) come in a range of drive counts and rack unit (U) sizes. The most popular are 4U shelves that hold 48, 60, 72, 84 or 98 drives. Populating a 98-drive shelf with the highest-capacity 10 TB 3.5-inch HDDs puts nearly a petabyte (980 TB raw) of storage in 4U. That's a lot of density.
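The density arithmetic is easy to check. The sketch below multiplies the drive counts cited above by a 10 TB raw drive; the 42U rack height and the function names are assumptions made for illustration, and real racks also lose space to power, switches and controllers.

```python
# Raw capacity per 4U shelf, assuming every bay holds a 10 TB (raw) 3.5-inch HDD.
DRIVE_TB = 10
SHELF_HEIGHT_U = 4
DRIVE_COUNTS = [48, 60, 72, 84, 98]


def shelf_raw_tb(drive_count, drive_tb=DRIVE_TB):
    """Raw terabytes in one fully populated shelf."""
    return drive_count * drive_tb


def shelves_per_rack(rack_u=42, shelf_u=SHELF_HEIGHT_U):
    """Whole shelves that fit in a rack (42U is an assumed rack height)."""
    return rack_u // shelf_u


if __name__ == "__main__":
    for count in DRIVE_COUNTS:
        per_shelf = shelf_raw_tb(count)
        per_rack = per_shelf * shelves_per_rack()
        print(f"{count} drives: {per_shelf} TB per shelf, {per_rack} TB per rack")
```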
There is a downside to these high-density shelves. Drives can only be accessed from above, typically requiring a ladder. And the weight of these shelves can easily exceed a few hundred pounds. Sliding shelf brackets are not set up to handle that amount of weight. Many vendors specify a "server lift" (powered or hydraulic) to support the shelf when it is pulled out. That adds cost and time.
There are also flash solid-state drive (SSD) high-density shelves. SanDisk puts up to 512 TB (raw) in 3U; Toshiba puts 192 TB (raw) in 2U; and HGST puts 136 TB (raw) in 1U. The drives in the SanDisk and Toshiba shelves also have to be accessed from above, so a ladder will be required. But weight is not a problem.
Utilizing high-capacity media and high-density drive shelves does not reduce data capacity consumption, but it does reduce total systems, management and supporting infrastructure required to meet that capacity consumption.
Stop storing everything forever
It seems like a simple concept, but in reality very few IT organizations implement it. Not all data needs to be stored forever. IT teams need to set policies defining retention times for different types of data and enforce them. There is a lot of data that has limited or no value over time.
Take the example of video surveillance. Video consumes a lot of storage. How long does surveillance video need to be saved? One week, two weeks, a month, a year? There are smart IT organizations that have a policy of keeping their video data no more than a couple of weeks or at most a month. Obviously, if there is something of interest on a particular video, it's kept longer.
Systematically enforced data retention policies will significantly slow the consumption of storage capacity. Making it happen requires time, cooperation (buy-in), effort and discipline to enforce. But keeping valueless data forever is simply not smart or financially sustainable.
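A retention policy ultimately reduces to a rule that software can evaluate. The following is a minimal sketch, not a product feature: the data class names, the retention periods and the hold flag are all hypothetical and would come from the organization's own policy.

```python
from datetime import datetime, timedelta

# Hypothetical retention periods in days, keyed by data class.
RETENTION_DAYS = {
    "surveillance_video": 30,
    "application_logs": 90,
}


def is_expired(data_class, created, now, hold=False):
    """Return True when an item has outlived its retention period.

    Items flagged with hold (for example, video showing something of
    interest) are never expired, and data classes with no defined
    policy are kept rather than deleted by default.
    """
    if hold:
        return False
    days = RETENTION_DAYS.get(data_class)
    if days is None:
        return False
    return now - created > timedelta(days=days)
```

Defaulting to "keep" when no policy exists is a deliberately cautious choice for the sketch; it means nothing is deleted until someone has actually agreed on a retention period for that type of data.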
Take out the garbage
How much valueless data currently consumes the organization's storage?
Emails, Word documents, spreadsheets, presentations, proposals and more from employees long departed. Is there much or any value in this data? Why is it still there? How much of that consumed storage is taken up by personal MP3s, photos, videos and so on? There is a lot more than most IT managers realize.
How about multiple iterations, versions, or drafts of files, documents, spreadsheets, presentations, price lists and so on that consume storage and are outdated or obsolete? Data tends to be sticky. These crimes of inefficient storage consumption are amplified when that storage is replaced because that "garbage" data continues to consume capacity on the new system as well as every tech refresh after that. In other words, you may be buying more storage than you need.
But how do you find that garbage data? It's not as if the garbage data sends out an alert claiming garbage status. The good news is that there are several software applications and services that analyze unstructured (file) data and report the results (e.g., Caringo FileFly, Data Dynamics, NTP Software and Varonis).
These applications identify orphaned data, personal data such as MP3s, photos and videos, and old unaccessed data. They can enable the data to be deleted or migrated to low-cost storage options such as LTFS tape, local or cloud object storage, optical storage and cloud cold storage. The amount of prime real estate storage capacity reclaimed can be enormous and typically pays for the software application or service many times over.
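To give a feel for what such an analysis involves, here is a rough sketch that walks a directory tree and sorts files into personal, cold and active buckets. It is not how any of the named products work; the extension list, the 90-day idle threshold and the function names are assumptions for illustration. It also relies on file access times, which many file systems update lazily or not at all.

```python
import os
import time

# Illustrative list of extensions treated as personal media.
PERSONAL_EXTENSIONS = {".mp3", ".jpg", ".jpeg", ".png", ".mp4", ".mov"}
SECONDS_PER_DAY = 86400


def classify(name, last_access_ts, now_ts, idle_days=90):
    """Label a file as 'personal', 'cold' or 'active'."""
    ext = os.path.splitext(name)[1].lower()
    if ext in PERSONAL_EXTENSIONS:
        return "personal"
    if now_ts - last_access_ts > idle_days * SECONDS_PER_DAY:
        return "cold"
    return "active"


def scan(root, now_ts=None, idle_days=90):
    """Walk a directory tree and group (path, size) pairs by label."""
    now_ts = time.time() if now_ts is None else now_ts
    report = {"personal": [], "cold": [], "active": []}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            info = os.stat(path)
            label = classify(name, info.st_atime, now_ts, idle_days)
            report[label].append((path, info.st_size))
    return report
```

Summing the sizes in the personal and cold buckets gives a first estimate of how much prime capacity could be reclaimed or migrated.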
Migrate data as it ages
Data tends to be very storage-sticky. The first storage location where new data lands is where it likely stays until that storage is tech refreshed or upgraded. Even then, it will stay in the same relative location. That's a tremendous waste of the most expensive storage capacity.
Data's value decreases as data ages. Data is generally most valuable and most frequently accessed within the first 72 hours after it is stored. Access declines precipitously from that point forward. The data is rarely accessed after 30 days and almost never after 90 days. And yet, it frequently stays on high-priced storage months or years after its value has plummeted.
The main reason this occurs is that migrating data among different types of storage systems can be difficult and manually labor-intensive. In addition, moving the data often breaks the chain of ownership, making it difficult to retrieve the data if or when it's required.
Hybrid storage systems have storage tiering within the array that enables movement of data from high-cost, high-performance storage tiers to lower-cost, lower-performing storage tiers and back again based on user-defined policies. Many only provide data movement between tiers within the array. Some can move data within the array and to external lower-cost, lower-performing storage as well. They may utilize cloud storage such as Amazon Simple Storage Service (S3), cloud storage with S3 interfaces or LTFS tape as a much lower-cost tier. A stub is left after the data is moved so that the chain of ownership and metadata remain intact. When a user or application seeks data that has been moved to a lower-cost, lower-performing tier, the stub triggers retrieval of that data and transparently places it back on its original tier. It just takes a little longer to access it.
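The stub mechanism can be illustrated with a toy model. In the sketch below, two dictionaries stand in for a fast tier and a slow tier; the class name, the 30-day demotion policy and the use of whole-day counters are all assumptions, and a real array tracks far more metadata than this.

```python
class TieredStore:
    """Toy two-tier store that demotes idle data and recalls it on read."""

    def __init__(self, demote_after_days=30):
        self.demote_after_days = demote_after_days
        self.fast = {}         # name -> data on the expensive tier
        self.slow = {}         # name -> data on the cheap tier
        self.stubs = set()     # names whose data now lives on the slow tier
        self.last_access = {}  # name -> day number of the last read or write

    def write(self, name, data, today):
        self.fast[name] = data
        self.last_access[name] = today

    def run_policy(self, today):
        """Demote anything idle for at least demote_after_days, leaving a stub."""
        for name in list(self.fast):
            if today - self.last_access[name] >= self.demote_after_days:
                self.slow[name] = self.fast.pop(name)
                self.stubs.add(name)

    def read(self, name, today):
        """Return the data, recalling it to the fast tier if only a stub remains."""
        if name in self.stubs:
            self.fast[name] = self.slow.pop(name)
            self.stubs.discard(name)
        self.last_access[name] = today
        return self.fast[name]
```

The caller uses the same name before and after demotion, which is the point of the stub: the data's location changes but the way it is addressed does not.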
Although this technology does not reduce the amount of capacity consumed, it does align the data value better with the storage costs. One more thing about hybrid storage systems: they are not just between flash storage tiers and HDDs, object storage, cloud storage or LTFS tape tiers. There are hybrids that use fast flash and slower capacity flash as their storage tiers. There are others that just utilize high-performance small form factor (2.5-inch) HDDs and large form factor (3.5-inch) nearline HDDs as their storage tiers. All of them reduce the cost of storage capacity but not the total storage capacity consumed.
There are also third-party software and services (e.g., Caringo FileFly, Data Dynamics, NTP Software and others) that will move data from a costlier storage tier to a lower-cost tier by policy, between systems, to object storage or to LTFS tape storage using stubs to ensure that data can be accessed if necessary.
There are two key differences between third-party software and hybrid storage systems. The first is that the hybrid storage systems mostly operate only within the system and, in a few cases, with S3 API object storage. Third-party software works between different vendors' storage systems, S3 API object storage systems, LTFS tape and so on. The second is that the third-party software allows the storage administrator to choose to delete or eliminate garbage data based on policy. Therefore, unlike hybrid storage systems, the third-party software can actually reduce the amount of data stored.
Make the most of data reduction technologies
Data reduction technologies have gained significant adoption in most storage systems, software-defined storage and even hyper-converged systems over the past few years. These technologies include thin provisioning, data deduplication and compression.
Thin provisioning does not actually reduce data storage consumption. It instead significantly reduces storage wasted by overprovisioning. Applications do not like running out of storage capacity. When it happens, they crash. It is not a pretty situation, and one that causes urgent and serious IT problems. IT attempts to avoid this by overprovisioning storage capacity to applications -- especially mission-critical applications. That overprovisioned capacity per application can't be utilized by other applications. This creates a lot of unused and unusable storage capacity (often called orphaned storage).
Thin provisioning essentially virtualizes that overprovisioning so each application "thinks" it has its own unique overprovisioned storage capacity, but in reality is sharing a single storage pool with every other application. Thin provisioning eliminates orphaned storage and significantly reduces the amount of storage capacity that has to be purchased. That reduction has the same net effect as reducing the amount of data stored.
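A quick calculation shows where the savings come from. The application names, the capacities and the 25% safety headroom in this sketch are invented for illustration; the point is only the difference between buying what is provisioned and buying what is actually used.

```python
def thick_purchase_tb(provisioned_tb):
    """Capacity to buy when every application gets a dedicated allocation."""
    return sum(provisioned_tb.values())


def thin_pool_tb(used_tb, headroom=0.25):
    """Capacity to buy for a shared pool: actual usage plus safety headroom."""
    return sum(used_tb.values()) * (1 + headroom)


def orphaned_tb(provisioned_tb, used_tb):
    """Allocated but unused capacity that no other application can touch."""
    return sum(provisioned_tb.values()) - sum(used_tb.values())


# Hypothetical applications, in TB.
provisioned = {"erp": 10, "mail": 10, "web": 10}
used = {"erp": 2, "mail": 3, "web": 1}

if __name__ == "__main__":
    print("Thick provisioning purchase:", thick_purchase_tb(provisioned), "TB")
    print("Thin pool purchase:", thin_pool_tb(used), "TB")
    print("Orphaned under thick provisioning:", orphaned_tb(provisioned, used), "TB")
```

A thin pool still has to be monitored and grown before actual usage catches up with it, which is why the sketch includes headroom rather than sizing the pool to current usage exactly.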
Data deduplication first gained traction on unique target storage systems for backup data (e.g., EMC Data Domain, ExaGrid, HP StoreOnce, NEC HYDRAstor, Quantum DXi and others). Today, most backup software has data deduplication built into the software.
Data deduplication has also made its way into both hybrid storage and all-flash arrays as well as traditional legacy storage arrays. The rationale for moving data deduplication into the array is to decrease the cost of "effective" usable capacity. Effective usable capacity is the amount of capacity that would be required if no data deduplication took place. So, if the amount of capacity required without data deduplication is approximately 100 TB but only 20 TB with data deduplication, then the effective usable capacity of that 20 TB system is 100 TB. There is generally not as much duplicate data in primary data as there was in older backup data. This means data reduction ratios tend to be lower. Some workloads, such as VDI, create a lot of duplicate data. Others, such as video data, have very little or none. In addition, compressed or encrypted data cannot be deduplicated.
Data deduplication significantly reduces the amount of data stored. For primary workloads on flash storage (hybrid or all-flash arrays), the reduction generally averages from 4:1 to 6:1 (data reduction will greatly vary by data type). For primary workloads on HDD storage, the data reduction generally averages 2:1 to 3:1. Either way, that is a lot less data to store.
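The relationship between physical capacity, reduction ratio and effective capacity is simple arithmetic. The sketch below reuses the 100 TB and 20 TB figures from the example above; the function names are illustrative, and any system price passed to the cost helper is an assumption.

```python
def reduction_ratio(logical_tb, physical_tb):
    """Ratio of data written by applications to data actually stored."""
    return logical_tb / physical_tb


def effective_capacity_tb(physical_tb, ratio):
    """Capacity the system appears to offer at a given reduction ratio."""
    return physical_tb * ratio


def cost_per_effective_tb(system_cost, physical_tb, ratio):
    """System cost spread across effective rather than physical capacity."""
    return system_cost / effective_capacity_tb(physical_tb, ratio)


if __name__ == "__main__":
    ratio = reduction_ratio(logical_tb=100, physical_tb=20)
    print(f"Reduction ratio: {ratio}:1")
    print(f"Effective capacity of 20 TB: {effective_capacity_tb(20, ratio)} TB")
```

Because the ratio depends on the data, an effective capacity figure is only as reliable as the ratio assumed for the workload in question.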
It's important to remember that there are performance tradeoffs with data deduplication. Inline data deduplication is the most prevalent form of deduplication: It requires that every write be compared against stored data to identify unique data.
Unique data is stored and the system creates a pointer for the duplicated data. That comparison creates additional latency for every write. As the amount of data stored on the system increases, so does the metadata and latency. And every read requires that the data is "rehydrated" or made whole. That adds latency to reads. As with writes, that read latency also increases with consumed data capacity. Primary application workloads have response time limitations. Too much latency and an application can time out.
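The write and read paths just described can be sketched in a few lines. This is a simplified illustration, not any vendor's implementation: it assumes fixed-size chunks and SHA-256 fingerprints held in an in-memory index, whereas real systems vary in chunking method, fingerprinting and where the index lives.

```python
import hashlib


class DedupStore:
    """Toy inline deduplication store using fixed-size chunks."""

    def __init__(self, chunk_size=4096):
        self.chunk_size = chunk_size
        self.chunks = {}  # fingerprint -> unique chunk bytes
        self.files = {}   # file name -> ordered list of fingerprints (pointers)

    def write(self, name, data):
        """Fingerprint every chunk on the way in; store only unseen chunks."""
        pointers = []
        for offset in range(0, len(data), self.chunk_size):
            chunk = data[offset:offset + self.chunk_size]
            fingerprint = hashlib.sha256(chunk).hexdigest()
            if fingerprint not in self.chunks:  # the lookup adds write latency
                self.chunks[fingerprint] = chunk
            pointers.append(fingerprint)
        self.files[name] = pointers

    def read(self, name):
        """Rehydrate a file by following its pointers back to the chunks."""
        return b"".join(self.chunks[fp] for fp in self.files[name])

    def physical_bytes(self):
        """Bytes actually stored after deduplication."""
        return sum(len(chunk) for chunk in self.chunks.values())
```

The index lookup in write and the pointer chase in read are where the extra latency comes from, and both structures grow as more unique data lands on the system.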
This has led to two different variations of inline deduplication: one for flash-based storage and one for HDD storage. The three orders of magnitude (1,000x) lower latency of flash storage allows for more in-depth data deduplication, producing better deduplication results.
The other type of deduplication is post-processing. Post-processing data deduplication does not add latency on writes because it happens after the data has been written. That processing is pretty intensive and must be scheduled in an idle time window. It also requires more capacity to land the data and does nothing to reduce read latency.
Compression technologies operate similarly to data deduplication but are limited to working within a block, file or object. Reduction ratios are usually equal to or lower than those of deduplication, and the latency concerns are similar.
The key thing to remember about these data reduction technologies is they are not mutually exclusive. They can and should be used together. Just remember: Deduplication must occur before compression. Compressed data cannot be deduplicated.
There is one caveat: Moving the data from one storage system to another generally, but not always, requires the data to be rehydrated before it is moved.
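The dedupe-then-compress ordering noted above can be shown with a short sketch. It assumes whole blocks are fingerprinted with SHA-256 and uses Python's zlib as a stand-in for whatever compressor a storage system actually uses; the function names are illustrative.

```python
import hashlib
import zlib


def reduce_blocks(blocks):
    """Deduplicate first, then compress only the unique blocks."""
    store = {}     # fingerprint -> compressed unique block
    pointers = []  # one fingerprint per original block, in order
    for block in blocks:
        fingerprint = hashlib.sha256(block).hexdigest()
        if fingerprint not in store:
            store[fingerprint] = zlib.compress(block)
        pointers.append(fingerprint)
    return store, pointers


def restore_blocks(store, pointers):
    """Rehydrate: follow each pointer and decompress the block it names."""
    return [zlib.decompress(store[fp]) for fp in pointers]
```

Fingerprinting happens on the uncompressed block, so duplicates are found before compression changes the bytes, and each unique block is compressed only once.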
Use efficient data protection apps
Data protection products historically created a lot of duplicate or copy data. But, most modern data protection products feature native deduplication. Many IT pros perceive that it is an onerous process to change out from legacy data protection to modern data protection. There are a couple of false assumptions that underpin that perception.
The first is that they have to migrate data protected under the old system to the new. That's not true. Old backups or other types of older data protection data are not archives and should never be used as archives. This is because they have to be recovered in order to search them. The only reason to keep older backups is for compliance reasons. This doesn't mean those backups have to be migrated to newer data protection systems. The old software can simply be stopped from creating any new backups. The old backup data just stays static until it ages out past the compliance requirements and then it can be destroyed. The original software can still be utilized to recover older data for things such as eDiscovery.
The second false premise is that implementation of modern data protection is just as painful as legacy data protection. Things have changed considerably. Many modern data protection systems are relatively easy to implement.
To reduce secondary data capacity consumption appreciably, be sure that your data protection software is up to date.
Manage data copies
Dragon Slayer Consulting surveyed 376 IT organizations over two years and found a median of eight copies of the same data. Copies often resided on both the same and different systems. Copies are created and used for DevOps, test/dev, data warehouses, business intelligence, backups, business continuity, disaster recovery, active archives and more. This can have a huge amplification effect on storage consumption.
The key to reining in runaway copies is to utilize variations of redirect-on-write or thin-provisioned, copy-on-write snapshot technologies. That can take place within a storage system (most storage systems, software-defined storage, even hyper-converged systems) or separately, using a dedicated appliance or software (such as Actifio, Catalogic, Cohesity, IBM SVC, Rubrik and others) on lower-cost storage. These snapshots are fundamentally a virtual copy. They look and act like a real data copy. They can be written to and modified like a real copy. But they consume a very tiny fraction of the storage capacity.
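Why a virtual copy consumes so little capacity is easiest to see in a toy model. The sketch below follows the redirect-on-write idea in its simplest form: a snapshot duplicates only the map of logical blocks to physical blocks, and new writes land in new physical blocks. The class and attribute names are assumptions, and real implementations add reference counting, space reclamation and much more.

```python
class Volume:
    """Toy volume whose snapshots share physical blocks until modified."""

    def __init__(self, store, block_map=None):
        self.store = store                      # shared list of physical blocks
        self.block_map = dict(block_map or {})  # logical block -> index in store

    def write(self, lba, data):
        """Redirect-on-write: new data always lands in a new physical block."""
        self.store.append(data)
        self.block_map[lba] = len(self.store) - 1

    def read(self, lba):
        return self.store[self.block_map[lba]]

    def snapshot(self):
        """Writable virtual copy: duplicates the block map, not the data."""
        return Volume(self.store, self.block_map)
```

A snapshot of a volume with millions of blocks costs only a copy of the map, and capacity is consumed afterward only for the blocks that the copy or the original actually changes.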
Managing data copies is an essential capacity coping strategy with huge potential savings in data capacity requirements.