From WORM to WORST
Active data stays online and old stuff goes offline, but what about data you need only occasionally?
Everyone thinks about online data in the same way: You write it, read it, rewrite it and keep it forever. But this type of "active" data is actually a minority in most environments. Many organizations have far more data that's written once, read a few times and kept alive forever. You might say this bulk data is "write once, read several times" (WORST), and it can bloat your storage environment.
Online, offline and nearline
Online, nearline and WORM
Companies have traditionally lumped all of their active data together. For open systems, the content of a disk was online and the content of a tape was offline (see "Online, offline and nearline"). This was more a symptom of the limitations of open systems than any real requirement; Unix and Windows hosts couldn't access tape without relying on a third-party backup application. Storage was online on an active disk or offline on tape simply because there was no other choice.
But mainframe systems have long been more creative, even using tape to temporarily store data for later recall by apps. This was called hierarchical storage management (HSM) for many years, but the concept has emerged in open systems as so-called nearline storage. When we implement this type of storage, we're saying data might be needed, but we can afford to wait a bit for it.
Most HSM applications use tape, but a few use optical disks. Similar to CDs or DVDs, these optical "platters" use light rather than (or in addition to) magnetic charges to store data. Historically, these media have had the advantage of being less expensive than magnetic disk while retaining the ability to access any piece of data at random. This stands in contrast to tape, where a drive must wind through much of the tape's length to locate a piece of data and often reads much more data than is needed to get at a single, valuable piece of information.
Although rewritable optical disks have long been available, the majority are of the write-once type commonly called write once, read many (WORM). So, in a classic case of putting the cart before the horse, businesses have attempted to identify applications requiring immutable media with long load times. Predictably, these applications are rare, limiting the usefulness of WORM in the real world.
One would think that archiving would be an ideal use of WORM media because data integrity is an important part of compliance. But, ironically, another aspect of WORM media actually interferes with its usefulness. Because individual records can't be deleted, and a single "platter" can store many gigabytes, the only way to remove one piece of data is to destroy the entire platter along with everything else on it. That puts immutable media at odds with compliance policies that call for deleting a single piece of data.
With optical WORM effectively marginalized, many vendors began offering a new type of nearline storage with both random access and control over individual pieces of data. This nearline storage was developed primarily by disk vendors, and the apple didn't fall far from the tree; most offerings are simply repurposed disk storage arrays.
That's not to say there isn't innovation in nearline storage. To the contrary, developing certifiable WORM storage based on conventional disk has proved challenging. Some vendors even went in an entirely different direction, creating so-called content-addressable storage (CAS) that deals with objects rather than files or blocks.
The WORST data
Although vendors have been clever at finding a place, and even a replacement, for WORM media in the market, it remained a technology in search of a purpose for many years. Truth be told, most organizations have a desperate need for another type of storage, one that can inexpensively store unchanging data forever.
For example, consider the digitization of paperwork. A company scans images of filled-out forms, writing them as files to a file server or CAS device. These images never change and may never be accessed; this is what we in the industry jokingly call write-only data. But the data will remain online and accessible for years, and someone will occasionally view the file.
As an analogy to WORM, we can call this data type WORST. Key aspects include online availability (no one wants to wait on the phone while a tape is loaded) and lengthy endurance without modification. Despite its "onlineness," performance requirements are likely low even though the volume of data is large.
The chances are good that you have a lot of WORST data in your data center. Typical applications include scanned images and other media files, engineering reference documents such as schematics and parts lists, and captured scientific and technical data. All of these consume large amounts of disk space, but aren't edited like office documents or source code. And all are likely to have a very long shelf life.
So what storage products best suit WORST data? The low performance requirements and large volumes point to the largest and cheapest disks available. Some would call this Tier-3 storage, but I like to call it "bulk." Think 500GB SATA disks in large RAID 5 or RAID 6 sets. The type of storage depends on the application. Many organizations use Fibre Channel disk arrays for WORST, but lots of applications would be better served by large NAS filers.
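To put rough numbers on those RAID sets, here is a minimal sketch of the capacity arithmetic. It assumes a hypothetical 14-disk set and ignores hot spares and formatting overhead; the only facts it relies on are that RAID 5 gives up one disk's worth of capacity to parity and RAID 6 gives up two.

```python
def usable_capacity_gb(disk_count: int, disk_size_gb: int, parity_disks: int) -> int:
    """Usable capacity of one RAID set, ignoring hot spares and formatting overhead."""
    if disk_count <= parity_disks:
        raise ValueError("a RAID set needs more disks than parity disks")
    return (disk_count - parity_disks) * disk_size_gb


DISK_SIZE_GB = 500  # the 500GB SATA disks mentioned above
SET_SIZE = 14       # hypothetical number of disks per RAID set

raid5_gb = usable_capacity_gb(SET_SIZE, DISK_SIZE_GB, parity_disks=1)
raid6_gb = usable_capacity_gb(SET_SIZE, DISK_SIZE_GB, parity_disks=2)

print(f"RAID 5: {raid5_gb} GB usable")
print(f"RAID 6: {raid6_gb} GB usable")
```

With sets this large, the second parity disk in RAID 6 costs only one more disk's worth of capacity, which is a small share of the set.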
If you use NAS, make sure you create an extensible file-system structure because you're likely to grow it to millions of files and hundreds of terabytes of data. These applications can often be extremely structured, much more so than regular user files. Rather than relying on people to comply with the directory structure, you can simply program the data acquisition application to conform to a structure. Pick something with a few high-level directories that could later be split across multiple devices if needed, and make sure that the files will be balanced evenly across all of them.
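One way the acquisition application might enforce such a structure is to derive each file's location from a hash of its identifier. The sketch below is only an illustration: the `vol00` to `vol15` directory names, the bucket count, the `form-0000001` style identifiers and the `tif` extension are all assumptions, not features of any particular NAS product.

```python
import hashlib
from collections import Counter
from pathlib import PurePosixPath

TOP_LEVEL_BUCKETS = 16  # hypothetical count of high-level directories


def archive_path(document_id: str, extension: str = "tif") -> PurePosixPath:
    """Map a document ID to a stable path that spreads files evenly."""
    digest = hashlib.sha256(document_id.encode("utf-8")).hexdigest()
    # The top-level directory could later be moved to its own device.
    top = int(digest[:8], 16) % TOP_LEVEL_BUCKETS
    # A second level keeps any single directory from growing too large.
    sub = digest[8:10]
    return PurePosixPath(f"vol{top:02d}") / sub / f"{document_id}.{extension}"


# Check the balance across top-level directories for 16,000 sample IDs.
counts = Counter(archive_path(f"form-{n:07d}").parts[0] for n in range(16000))
for directory in sorted(counts):
    print(directory, counts[directory])
```

Because the path depends only on the identifier, any reader can recompute where a file lives without a lookup table. The hash also keeps the top-level directories close to evenly filled.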
CAS is an interesting alternative. It stores files based on their intrinsic content, eliminating the question of directory structure, and most CAS devices are highly extensible and low cost. But both your writing and reading apps must support the CAS device's API. This isn't a problem for a brand-new application, but it could be a significant challenge if you're bringing in new storage to support an existing data set.
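The idea of addressing by content can be shown with a toy in-memory store. This is a sketch of the general technique, not any vendor's CAS API: the `ContentStore` class and its `put` and `get` methods are hypothetical, and SHA-256 is an assumed choice of hash.

```python
import hashlib


class ContentStore:
    """Toy content-addressed store: an object's address is a hash of its bytes."""

    def __init__(self) -> None:
        self._objects: dict[str, bytes] = {}

    def put(self, data: bytes) -> str:
        """Store the data and return the address derived from its content."""
        address = hashlib.sha256(data).hexdigest()
        # Writing identical content again maps to the same address.
        self._objects.setdefault(address, data)
        return address

    def get(self, address: str) -> bytes:
        """Fetch data by address; raises KeyError if the address is unknown."""
        return self._objects[address]


store = ContentStore()
scan = b"scanned form image bytes"
address = store.put(scan)
duplicate_address = store.put(scan)

print(address == duplicate_address)
print(store.get(address) == scan)
```

Note that the writer gets back an address rather than choosing a file name, so the reading application has to keep that address somewhere, typically in its own database. That is why both sides must be built around the device's interface.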
Of course, WORST is just one type of data to consider. There are many other data types out there that could be served with some fresh thinking. What's the best way to store remote data, collaborative office files or temporary data? Perhaps if we think critically about the unique requirements of these data types rather than simply what storage we have today, we can develop a truly effective infrastructure for storing them.