Reality can't be ignored. In most data centers, 80% or more of stored data hasn't been accessed in more than a year. Tighten that time frame up, and we find 95% of data has not been accessed in the last 90 days. That means the vast majority of data just sits on that expensive and speedy flash array you bought to serve active data.
The problem is most IT professionals hesitate to take an aggressive step such as moving 95% of their data to a secondary storage tier. But the truth is, with proper design, IT can reach this goal with few complaints. Here are four basic rules that will get you on your way:
Rule No. 1: Archive response can be almost as fast as primary
Your data archiving strategy should rely on storage using high-capacity HDDs, assisted by deduplication and compression, to drive as much cost out of the archive storage tier as possible. While all those technologies could affect data recall performance, in most cases, a recall from a properly designed active archive is almost as fast as primary storage.
That's because primary storage is responding to hundreds, if not hundreds of thousands, of I/O requests per second, while an archive typically responds to one or two recall requests per hour. Archives are usually busier dealing with inbound write traffic than with requests for old data. With fewer I/Os to respond to, disk-based archive storage can respond to individual requests almost as fast as primary storage. Note, though, that archives don't have to respond as fast as primary storage; they just have to respond fast enough that users won't notice the difference.
Rule No. 2: Don't archive everything on day one
IT has, with good reason, developed a distrust of everything. Archive software vendors and, especially, hardware vendors brag about ROIs showing data archiving strategy investments paying for themselves 30 seconds after installation. The problem is to get this rapid ROI, customers must buy 100 TB of archive or secondary storage and move 80% to 95% of their data as soon as the archive platform is stood up. Any IT professional worth their certifications isn't going to do that. There's no need. The primary storage that holds all this old data is bought and paid for, and most vendors aren't going to let you send back half of a storage array for a refund.
A more logical data archiving strategy is to archive data on an as-needed basis -- typically, as primary systems come off maintenance, reach end of life or fill to the point that more capacity or another primary storage system must be purchased. At that point, find out how much of the data on the array can be archived. With that information, buy just that amount of storage from your archive vendor, enabling you to put off the purchase of a primary storage system or to run a much smaller high-performance storage system. With an archive strategy in place, the only reason to buy more primary storage is to gain performance, not capacity.
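One way to estimate how much of an array can be archived is to total file sizes by last-access time. What follows is a minimal sketch, not a vendor tool: it assumes the file system records access times reliably (volumes mounted with noatime or relatime may not), and the function name `cold_data_report` and the 90-day default are illustrative choices.

```python
import os
import stat
import time


def cold_data_report(root, cold_after_days=90, now=None):
    """Return (total_bytes, cold_bytes) for regular files under root.

    A file counts as cold when its last-access time is older than
    cold_after_days. Assumes access times are trustworthy on this volume.
    """
    now = time.time() if now is None else now
    cutoff = now - cold_after_days * 86400  # seconds in a day
    total_bytes = 0
    cold_bytes = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                info = os.stat(path, follow_symlinks=False)
            except OSError:
                continue  # file vanished or is unreadable; skip it
            if not stat.S_ISREG(info.st_mode):
                continue  # ignore symlinks, sockets and other non-files
            total_bytes += info.st_size
            if info.st_atime < cutoff:
                cold_bytes += info.st_size
    return total_bytes, cold_bytes
```

Calling `cold_data_report("/mnt/array1")`, where the mount point is hypothetical, returns the two byte counts; dividing the second by the first gives the share of capacity that is a candidate for the archive tier.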
Rule No. 3: Transparent recall may or may not be critical
If an aggressive data archiving strategy -- such as archiving 80% of primary storage -- is followed, then prepare for more frequent data recalls from users. Considering the gradual move to archive storage described in rule No. 2, however, recalls may not be as frequent as you'd expect.
First, make sure most of those recalls can occur without IT intervention. That means you need to select software that can set transparent links between where the file used to be and where it is on the archive. It's also important to remember the archive might be multistep, such as on-premises disk to tape or on-premises disk to the cloud, which means that these links must be updated with the file's new location each time it moves to another storage device.
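As a rough illustration of a transparent link, the sketch below uses a symbolic link on a POSIX file system to stand in for the original file and repoints it when the file moves to another tier. Real archive products use their own stub formats or a metadata layer; the function names `archive_file` and `relocate_archived` are hypothetical, and name collisions in the archive directory are not handled.

```python
import os
import shutil


def archive_file(src, archive_dir):
    """Move src into archive_dir and leave a symlink at the old path."""
    os.makedirs(archive_dir, exist_ok=True)
    dest = os.path.join(archive_dir, os.path.basename(src))
    shutil.move(src, dest)
    # Users keep opening the old path; the link sends them to the archive.
    os.symlink(os.path.abspath(dest), src)
    return dest


def relocate_archived(link_path, new_dir):
    """Move an archived file to another tier and repoint its link."""
    current = os.path.realpath(link_path)
    os.makedirs(new_dir, exist_ok=True)
    new_dest = os.path.join(new_dir, os.path.basename(current))
    shutil.move(current, new_dest)
    # Build the new link beside the old one, then swap it in with a rename.
    tmp_link = link_path + ".tmp"
    os.symlink(os.path.abspath(new_dest), tmp_link)
    os.replace(tmp_link, link_path)
    return new_dest
```

The second function is the step that is easy to forget: every move between archive tiers has to update the link, or the old path stops resolving.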
The other side of the coin is that transparent recall requires an apparatus in the architecture: either stub files or a centralized metadata control layer. Like any apparatus, this control layer brings a certain amount of rigidity, including a potential management issue with stub files and a certain amount of lock-in to the data management vendor. You must decide if the upside of transparent recall is worth the downsides.
Rule No. 4: Expect more frequent recalls
If your organization goes all-in with a 95% data archiving strategy or evolves to that point, be prepared for more recalls. Whether recalls are done transparently or manually because of the lack of the transparent recall component, you can now measure them in dozens per hour. The higher the recall rate, the more you'll want to lean toward a disk-based archive, either exclusively or as a front end to tape.
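To see where your own recall rate sits, you can count recall events per hour from whatever log your archive software keeps. This is a minimal sketch that assumes the log has already been parsed into Python `datetime` objects, one per recall; the function names are illustrative, and no threshold for choosing disk over tape is implied.

```python
from collections import Counter
from datetime import datetime


def recalls_per_hour(timestamps):
    """Count recall events in each clock hour.

    timestamps is an iterable of datetime objects, one per recall.
    Returns a dict mapping the start of each hour to its recall count.
    """
    buckets = Counter(
        ts.replace(minute=0, second=0, microsecond=0) for ts in timestamps
    )
    return dict(buckets)


def peak_recall_rate(timestamps):
    """Return the busiest hour's recall count, or 0 with no events."""
    counts = recalls_per_hour(timestamps)
    return max(counts.values(), default=0)
```

Tracking the peak rather than the average is a deliberate choice here, since it is the busy hours that expose a slow archive front end.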
If most of the archive is disk-based, a high recall rate shouldn't affect performance. At the very least, the front end of the archive should be disk- or cloud-based. Tape, if used at all, should either serve as the deep archive or solely as a backup to the archive. While tape is a robust and reliable technology, its role needs more careful planning as the archive becomes more active.
Don't go on a data archiving strategy diet
No question, 95% of your data is likely eligible for archiving. Archiving shouldn't be looked at as a storage diet that's done every so often. Instead, it's an organizational change that occurs gradually and, once fully applied, never stops. Data should constantly flow through your enterprise from primary storage to archive storage, and occasionally back to primary.