Users stop modifying or accessing their files within a few weeks after they are created. The same goes for most data -- shortly after it's created, it stops being accessed or modified.
Nearly all this data resides on expensive primary storage. And nearly all of it should be moved to a less expensive class of storage that is optimized for both capacity and power efficiency.
Yet even though as much as 70% of this data is inactive, many storage managers are reluctant to implement an archive, believing that it's easier to add primary storage.
Some storage systems, especially virtualized storage, have made it easier for storage administrators to expand storage. Still, avoiding downtime when expanding traditional storage systems requires planning, which includes ordering the right drives in terms of size and speed.
For example, drives that do not match the 148 GB drives already in an array group would either:
- need to become their own new array and be concatenated with the prior volume at the host, or
- need to be added to the existing array group as 148 GB drives.
Once the storage is in place, expansion of the volume on the array is relatively straightforward. But appearances can be deceiving. A storage administrator still needs to spend time making sure that:
- The expansion uses the correct-sized drives to extend the array.
- The speed of those drives matches the speed of the current drives.
- The appropriate number of drives is added to maintain RAID integrity and performance.
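The three checks above can be sketched as a small pre-order validation. This is a minimal illustration, not a vendor tool: the `Drive` type, the `check_expansion` function, the 15,000 rpm figure and the idea of a fixed `group_width` (the number of drives per RAID group) are all assumptions introduced for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Drive:
    size_gb: int  # raw capacity of the drive
    rpm: int      # rotational speed (values below are assumed)


def check_expansion(existing, new, group_width):
    """Return a list of problems with a proposed set of expansion drives.

    existing: drives already in the array (assumed to be uniform)
    new: drives being considered for the expansion
    group_width: hypothetical number of drives per RAID group
    """
    problems = []
    reference = existing[0]
    if any(d.size_gb != reference.size_gb for d in new):
        problems.append("size mismatch")
    if any(d.rpm != reference.rpm for d in new):
        problems.append("speed mismatch")
    if len(new) % group_width != 0:
        problems.append("drive count does not fill whole RAID groups")
    return problems


current = [Drive(148, 15000)] * 8
matching = [Drive(148, 15000)] * 4
mismatched = [Drive(300, 10000)] * 3

print(check_expansion(current, matching, group_width=4))
print(check_expansion(current, mismatched, group_width=4))
```

The first call reports no problems; the second flags all three, which is the situation in which the new drives must either form their own array or be treated as smaller drives.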
So far, I have assumed that a SAN or NAS is in place. If this expansion is being done to direct-attached storage (DAS), there will almost certainly be downtime while the new drive is installed and logically added to the system.
Once the drive is prepped and ready for presentation to the host server, the new capacity needs to be either extended onto a current volume or else concatenated to that volume. Depending on the storage array, the operating system and the applications involved, this event will most likely cause a stoppage of the application or service. For example, a file server may need to have access stopped. In some cases, the server may require rebooting.
More challenges in adding primary storage
There's another challenge to adding more storage: all of this additional data needs to be protected by the backup process. To accommodate the additional data set, more backup capacity will likely be needed. Ironically, in a disk-to-disk backup environment, this leads to a need for even more disk capacity (with this capacity added to the backup target). Adding primary storage can also lead to expansion of the tape library in a disk-to-disk-to-tape (D2D2T) strategy in the form of additional tape media, tape slots or tape drives.
Expanding primary storage can have a significant impact on backup capacity and time-frame requirements. This is because most environments perform full backups on a weekly or monthly basis. Not only must this new data be backed up, it must be stored over and over again with each successive full backup.
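A rough calculation shows how repeated full backups multiply the cost of inactive data. Apart from the 70% inactive figure quoted earlier, the numbers below (10 TB of primary data, 12 retained weekly fulls) are assumptions chosen for illustration, and the sketch ignores incremental backups, compression and deduplication.

```python
def retained_full_backup_tb(data_tb, retained_fulls):
    """Capacity consumed on the backup target by the retained full backups."""
    return data_tb * retained_fulls


primary_tb = 10.0         # assumed size of primary storage
inactive_fraction = 0.70  # share of data that is inactive, per the article
retained_fulls = 12       # assumed number of weekly fulls kept

total = retained_full_backup_tb(primary_tb, retained_fulls)
inactive = retained_full_backup_tb(primary_tb * inactive_fraction, retained_fulls)

print(f"backup capacity for all data: {total:.0f} TB")
print(f"of which unchanged inactive data: {inactive:.0f} TB")
```

Under these assumptions, 84 TB of the 120 TB held on the backup target is repeated copies of data that nobody is touching.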
The irony of keeping all the data on primary storage is that users still can't find the data. Even though most data stops being accessed once it's created and modified, the data is kept around just in case it needs to be found. As the primary storage area grows, users will have trouble finding the files they're looking for.
Indexing should solve this, but indexing of primary storage is not widely implemented. The creation and updating of the index is a constant process, one that may interfere with production activities. The index that is created also consumes storage capacity; a content index may be 15% of what was scanned. Storing this index on primary storage adds more cost to primary storage and more complexity to the backup process.
Downsides to expanding Tier 1 storage
The result is that simply expanding Tier 1 storage will cost money. Primary storage is more expensive than archive storage. It takes time to plan, configure and implement the addition of that storage. Backing up an ever-growing pool of inactive data will cause additional expenditures in disk backup capacity and tape capacity. Finally, expanding primary storage does not actually make the data any more accessible. As the data set grows it becomes increasingly difficult for the user to find that data.
So simply adding more storage is not so simple, nor is it less expensive than developing an archive storage area. Archive solutions from vendors such as Copan Systems, Permabit Technologies and Prostor Systems offer simple alternatives to adding primary storage. An archive layer could be designed to be the final resting area for data, and data should be moved to it soon after that data stops being accessed. With an archive in place, primary storage will need to be expanded less frequently because inactive data no longer occupies it, yet that data can still be quickly accessed from the archive storage area.
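Finding the data that has stopped being accessed is the first practical step. The following is a minimal sketch, not how any of the products named above work: `find_archive_candidates` and its `idle_days` threshold are hypothetical, and the function only reports candidates without moving anything. It also assumes the file system records access times reliably, which is not the case on volumes mounted with access-time updates disabled.

```python
import os
import time


def find_archive_candidates(root, idle_days, now=None):
    """Yield paths under root with no access or modification for idle_days."""
    now = time.time() if now is None else now
    cutoff = now - idle_days * 86400
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                st = os.stat(path)
            except OSError:
                continue  # file vanished or is unreadable; skip it
            # Treat the later of access and modification time as last activity.
            if max(st.st_atime, st.st_mtime) < cutoff:
                yield path
```

For example, `list(find_archive_candidates(share_root, idle_days=30))`, where `share_root` is whatever directory holds the file share, would return the files untouched for a month, which a data mover could then copy to the archive tier.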
Simplicity of archive storage
Archive storage solutions are designed from the ground up for expansion and can be expanded seamlessly on the fly. Volumes can auto-grow or the new storage can be allocated, depending on need. Drive sizes can be mixed and the overall capacity of all those drives can be made available.
These systems are designed to pack capacity very densely into a space-efficient footprint. Some archive systems can even use power management to turn off shelves of drives that are not actively being used.
An archive actually makes backups faster and more efficient by removing inactive data from the backup process. Instead of copying the same data over and over again to the disk and tape backup targets, the data is moved one time to the primary archive. As a result, the amount of data to be backed up can be reduced by as much as 80%. This will lower the cost of the backup infrastructure. There will be limited need to add disk capacity or tape capacity, and the bandwidth required for backup will be greatly reduced.
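To see what a reduction of that size means for the backup window, consider a rough estimate. The 10 TB data set and the 400 GB per hour sustained throughput are assumed figures for illustration; the 80% reduction is the best case quoted above.

```python
def backup_hours(data_tb, throughput_gb_per_hour):
    """Hours needed to copy data_tb at a sustained throughput."""
    return data_tb * 1000 / throughput_gb_per_hour


data_tb = 10.0      # assumed full backup size before archiving
reduction = 0.80    # best-case reduction quoted in the article
throughput = 400.0  # assumed sustained backup throughput, GB per hour

before = backup_hours(data_tb, throughput)
after = backup_hours(data_tb * (1 - reduction), throughput)

print(f"full backup before archiving: {before:.1f} h")
print(f"full backup after archiving:  {after:.1f} h")
```

With these numbers a full backup shrinks from about 25 hours to about 5, which is the difference between a job that overruns a weekend window and one that fits in a single night.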
An archive storage area is actually simpler to manage. Disk-based archives are easier to move data to because they are just a disk area that data can be copied to. Data movement applications no longer need to manage the complexity of tape libraries, and global file systems can even eliminate the need for stub files.
Retrieval from a disk-based archive overcomes users' perception that recovering their old data will take too long, and with it their resistance to archiving. Pulling old data from even a slow or powered-off disk drive will be almost unnoticeable to the user.
Disk-based archives have a greater set of options for search and index, a critical component as the archive grows. Unlike primary storage, archive storage is not being frequently accessed, so running a detailed content index process on this storage has little impact on performance. More importantly, it allows users to find data as easily as if they were searching for a video on YouTube. And because the index is stored on archive storage, the space it takes has a negligible cost impact.
An archive can simplify primary storage by significantly reducing how often it needs to be upgraded or changed. An archive also delivers a sizable cost savings over primary storage and reduces the cost of the backup infrastructure. Finally, because an archive provides a suitable home for content-based indexing, finding and retrieving data can be easier than in primary storage.
About the author: George Crump is founder of Storage Switzerland, an analyst firm focused on the virtualization and storage marketplaces. Storage Switzerland provides strategic consulting and analysis to storage users, suppliers and integrators. An industry veteran of more than 25 years, Crump has held engineering and executive management positions at various IT industry manufacturers and integrators. Prior to Storage Switzerland, he was chief technology officer at one of the nation's largest integrators.
This was first published in November 2008