Faced with the challenges of significant storage and application growth, shortened backup windows and limited IT resources, many organizations are embracing hierarchical storage management (HSM) to archive infrequently accessed data on less expensive storage.
|Storing 80 million images a year|
GE Medical Systems, Waukesha, WI, is a provider of health care productivity solutions and services and is a leader in medical diagnostic imaging technology. The company has more than 300 offices worldwide and $8 billion in annual revenues. Its application service provider (ASP) group offers remote archiving services for digital diagnostic images, allowing its customers to focus on delivering healthcare, rather than managing large-scale storage systems.
GE Medical Systems provides an archiving solution for Digital Imaging and Communications in Medicine (DICOM)-based medical exams and images. "The sheer volume of medical image data is unmanageable for most of our customers," says Sander Kloet, manager of data center operations for GE Medical Systems' ASP services. "Our customers' storage requirements are growing at over a terabyte per year, and managing that growth is one of the biggest challenges faced by these organizations. The demands on a hospital's IT organization are extreme, and often times they don't have the resident storage expertise."
GE Medical Systems supplies a WAN solution to its customers and provides primary RAID-based storage, long-term archival storage or a combination of both, depending on customer requirements. At the heart of the storage solution is an HSM infrastructure consisting of StorageTek's Application Storage Manager (ASM) software running on a pair of Sun Fire 4800 servers that are clustered with Veritas Cluster Server (VCS). The HSM storage environment consists of a 10TB EMC Symmetrix storage array for the primary storage and a 6,000-slot StorageTek PowderHorn library with 9840B tape drives for secondary storage. The DICOM application servers access the shared ASM file system. ASM manages its file system via user-defined policies so that capacity appears unlimited to the application. The DICOM application servers maintain the file system relationship of the hospital's patient records and their associated medical exams in a separate database.
For many years, HSM software solutions such as IBM's DFSMShsm or Innovation's FDR/ABR have been used in the mainframe environment to offset the high cost of enterprise class disk and improve the utilization of tape capacity. HSM--while popular in the mainframe space--has only recently been partially successful in the distributed computing environment.
Factors limiting the use of HSM include the continuing decline in the price of disk, dramatic increases in disk capacity, limited network or storage bandwidth for data migration and recall, and a lack of fast access to secondary or tertiary storage devices (optical or tape). But things may be changing. Storage area networks (SANs), network-attached storage (NAS), Fibre Channel (FC), Gigabit Ethernet and fast-access tape solutions now provide the technological foundation to build a robust HSM solution.
HSM is the automated migration of files and data across a hierarchy of storage devices. Data management policies govern data migration of inactive data from primary disk or NAS to lower cost storage devices such as nearline tape. The HSM software performs this data migration transparently to the user, and provides fast data retrieval from either online or nearline storage.
Typically, a two-tier HSM strategy is deployed consisting of a high-performance RAID disk or NAS as the primary storage and automated tape as the secondary storage. Optionally, a three-tier strategy would include lower cost, high-capacity disk as the secondary storage and automated tape as the tertiary storage. Each media type in the storage hierarchy represents a trade-off between cost and data access time. HSM hardware and software solutions are available from a variety of vendors including ADIC, Hewlett-Packard, IBM, Legato, StorageTek, Sun, and Veritas.
How does HSM work?
Most commercially available HSM software manages data movement between the tiers of the storage hierarchy. The HSM software virtualizes storage capacity to users and host servers by presenting the physical disk or tape storage capacity as a file system image that appears infinite in size. The software also manages the storage media and its own catalog, which consists of pointers mapping the logical file data to its actual physical location. The policy-driven HSM engine periodically scans the file system directories and identifies files that meet predefined criteria for migration. Once the files are identified, the HSM engine will:
- Migrate (copy) the data from primary to secondary or tertiary storage
- Mark the online storage space available for reuse
- Update the file system directory entries to indicate the files have been moved
- Reclaim the online disk space
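The steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not how any particular HSM product works: two ordinary directories stand in for primary and secondary storage, a plain dictionary stands in for the HSM catalog, and the only migration criterion is time since last access. The names `find_candidates`, `migrate` and `demo` are hypothetical.

```python
import os
import shutil
import tempfile
import time


def find_candidates(root, min_idle_seconds, now=None):
    """Scan root and return files not accessed for at least min_idle_seconds."""
    now = time.time() if now is None else now
    candidates = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if now - os.stat(path).st_atime >= min_idle_seconds:
                candidates.append(path)
    return sorted(candidates)


def migrate(path, primary_root, secondary_root, catalog):
    """Copy one file to secondary storage, record it, then free the primary copy."""
    rel = os.path.relpath(path, primary_root)
    target = os.path.join(secondary_root, rel)
    os.makedirs(os.path.dirname(target), exist_ok=True)
    shutil.copy2(path, target)  # migrate (copy) the data
    catalog[rel] = target       # record where the data now lives (assumed catalog)
    os.remove(path)             # release and reclaim the primary disk space
    return target


def demo():
    """Migrate files idle for three days or more; return (catalog keys, files left)."""
    with tempfile.TemporaryDirectory() as primary, \
            tempfile.TemporaryDirectory() as secondary:
        old = os.path.join(primary, "old_exam.dcm")
        new = os.path.join(primary, "new_exam.dcm")
        for path in (old, new):
            with open(path, "wb") as handle:
                handle.write(b"image-bytes")
        week_ago = time.time() - 7 * 86400
        os.utime(old, (week_ago, week_ago))  # pretend the file was last used a week ago
        catalog = {}
        for path in find_candidates(primary, min_idle_seconds=3 * 86400):
            migrate(path, primary, secondary, catalog)
        return sorted(catalog), sorted(os.listdir(primary))
```

A real HSM engine would leave a directory entry or stub behind rather than deleting the file outright, and would verify the secondary copy before releasing primary space; the sketch omits both for brevity.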
As data ages, it's typically accessed less frequently and may be migrated to cheaper storage. It's estimated that as much as 80% of disk space may be occupied by older, inactive data. Target applications for an HSM solution include medical records management, document imaging, seismic exploration, data warehousing and e-mail, as well as other applications that generate large amounts of data that's accessed infrequently after creation. According to Fred Moore, president of Horison Information Strategies in Boulder, CO, the probability of reusing data typically falls by 50% after the data is three days old.
|Storing 80 million images a year (continued)|
Medical images that make up an exam are written to primary disk in a logical directory structure within ASM. Medical images may typically be added, modified or deleted over a period of four to 12 hours. Based on frequency-of-access policies that are set within ASM, the associated records for each exam will be migrated to nearline tape. Policies can also be set to ensure patient records are grouped together on the same tape, which speeds retrieval. Optionally, the images may reside on primary disk until a predefined capacity threshold is reached. Upon migration, the images may be written to one or more different tapes for data availability or off-site vaulting purposes. When a doctor searches for an exam, the DICOM application queries the ASM file system. If the exam--and its associated medical images--resides on the primary disk, it is immediately available for viewing. If the exam isn't on disk, the client experiences a minimal delay while ASM recalls the exam from tape back to disk. This recall is transparent to the user.
GE Medical Systems chose an HSM solution because of its lower cost of ownership and reduced storage management burden. Most hospitals generate between 5,000 and 300,000 medical exams per year and are experiencing storage growth of 5TB to 6TB per year. With its HSM solution, GE Medical Systems plans to store more than 80 million medical images and 25TB per year. Based on this storage growth and the typical frequency-of-access characteristics for medical exams, it simply wasn't cost effective to purchase disk arrays to store all of this data. In addition, GE Medical Systems wanted to reduce its own storage management burden with HSM.
Kloet says, "Once the HSM system is set up with the right policies defined, there is little ongoing management involved." The ASM-based HSM solution also facilitates automated data replication on multiple tapes, one of GE Medical Systems' customer requirements. "Customers typically want two copies of data for disaster recovery purposes," he says. "The second copy is sent to an off-site vault."
The design and implementation of an HSM solution can be complex, given the different vendors and storage components involved. Kloet's advice is to get all of the vendors communicating during this phase of the project. "You cannot set up the HSM system right out of the manual. You need different types of expertise, including Fibre Channel, SANs, tape systems and clustering. You need all of the parties involved," says Kloet. He also recommends implementing the HSM solution in a test environment, having a good test plan and verifying the system functionality before rolling it out in production.
HSM and databases
Active archiving, also called live archiving, is a new approach to HSM for large databases and data warehouses. While HSM software is well-suited for images and inactive files, databases require a more robust data management solution to facilitate data movement without impacting database integrity or performance. Active archiving software improves the effectiveness of HSM solutions by removing inactive data from a database and creating a file that may be managed by the HSM solution.
Active archiving software will transparently remove inactive historical data from production databases and save it into an archive. The active archiving process will also save the metadata that describes the tables, columns and relationships used to create the archive.
As with all HSM solutions, active archiving allows administrators to define data management policies based on frequency of data access, data type and data relationships that specify when database information will be archived. For example, an insurance company may want to archive all policies that were created more than three years ago for a selected client. This information will be removed from the production database and saved in an active archive file. The user may also restore a subset or the entire archive file with full referential integrity. Optionally, the user may let an HSM solution migrate the active archive file to tape for long-term storage.
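The insurance example can be sketched with Python's built-in `sqlite3` module. This is a minimal illustration under stated assumptions, not a description of any commercial active archiving product: the `policies` table, its columns, the JSON archive format and the function names are all hypothetical, and the sketch handles a single table, so it does not show the cross-table referential integrity a real product must preserve.

```python
import json
import sqlite3


def archive_policies(conn, client, cutoff_date):
    """Remove a client's policies created before cutoff_date; return them as JSON."""
    where = "client = ? AND created < ?"
    cur = conn.execute(
        f"SELECT id, client, created FROM policies WHERE {where}",
        (client, cutoff_date),
    )
    # Save metadata (table and column names) along with the archived rows.
    archive = {
        "table": "policies",
        "columns": [col[0] for col in cur.description],
        "rows": cur.fetchall(),
    }
    with conn:  # commit the delete only if it succeeds
        conn.execute(f"DELETE FROM policies WHERE {where}", (client, cutoff_date))
    return json.dumps(archive)


def restore(conn, archive_json):
    """Re-insert every archived row into the table named in the archive metadata."""
    archive = json.loads(archive_json)
    columns = ", ".join(archive["columns"])
    marks = ", ".join("?" for _ in archive["columns"])
    with conn:
        conn.executemany(
            f"INSERT INTO {archive['table']} ({columns}) VALUES ({marks})",
            archive["rows"],
        )


def demo():
    """Archive one client's old policies, then restore them; return ids at each stage."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE policies (id INTEGER PRIMARY KEY, client TEXT, created TEXT)"
    )
    conn.executemany(
        "INSERT INTO policies VALUES (?, ?, ?)",
        [(1, "acme", "1998-05-01"), (2, "acme", "2002-01-15"), (3, "other", "1997-03-09")],
    )
    archived = archive_policies(conn, "acme", "1999-06-01")
    remaining = [row[0] for row in conn.execute("SELECT id FROM policies ORDER BY id")]
    restore(conn, archived)
    restored = [row[0] for row in conn.execute("SELECT id FROM policies ORDER BY id")]
    return remaining, restored
```

The JSON string returned by `archive_policies` plays the role of the active archive file; written to disk, it is an ordinary file that an HSM policy could later migrate to tape.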
Examples of active or live archive solutions include Archive for Servers from Princeton Softech, Princeton, NJ, and LiveArchive from OuterBay Technologies, Campbell, CA. "Most database applications were not designed with data archiving," says Jim Lee, VP of product marketing at Princeton Softech. Once a portion of a database is archived, Lee says, Princeton Softech's customers typically experience a 20% to 25% improvement in database performance.
HSM software allows the storage administrator to set management policies for the automated migration of data from one storage device to another. These policies include such things as file size, frequency of file access, retention period, type of media used for migration and disk capacity. Through the setting of high- and low-capacity thresholds--or watermarks--the storage administrator may control online capacity utilization. Once a high watermark is reached, the HSM engine will search for files meeting the policy-based conditions and automatically migrate them until the low threshold is reached. In addition, the administrator may exclude certain files from the migration process, such as system files, to avoid performance problems.
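The watermark behavior described above can be illustrated with a short Python sketch. It is a simplified model under stated assumptions: the `FileInfo` record and `select_for_migration` function are hypothetical, the only policy condition is "not excluded," and candidates are taken oldest-access-first, which is one plausible ordering rather than a rule taken from any specific product.

```python
from dataclasses import dataclass


@dataclass
class FileInfo:
    name: str
    size: int              # bytes occupied on primary storage
    last_access: float     # seconds since the epoch
    excluded: bool = False  # e.g. system files that must stay online


def select_for_migration(files, capacity, high_pct, low_pct):
    """Return names of files to migrate once usage reaches the high watermark.

    Files are chosen oldest-access-first until usage falls to the low watermark.
    """
    used = sum(f.size for f in files)
    if used < capacity * high_pct / 100:
        return []  # high watermark not reached; nothing to do
    target = capacity * low_pct / 100
    selected = []
    eligible = sorted((f for f in files if not f.excluded), key=lambda f: f.last_access)
    for f in eligible:
        if used <= target:
            break
        selected.append(f.name)
        used -= f.size
    return selected
```

For instance, with a capacity of 100 units, watermarks of 80% and 50%, and files of size 30, 30 and 20 plus an excluded 10-unit system file, the two least recently accessed 30-unit files are selected, bringing usage from 90 down to 30.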
For example, magnetic resonance imaging (MRI) files from a radiology center may be written to a high-performance disk array, and if they aren't accessed after a defined period of time, the HSM engine automatically migrates the files to a cheaper storage medium such as tape and leaves a stub file on the primary disk. The stub file is a pointer to the actual location of the data on the secondary or tertiary media, and allows the file to appear to be immediately available. If the image is subsequently needed, the HSM software will intercept the request, automatically recall the image from secondary storage and stage it back to primary disk. If the recalled file isn't changed, it is simply released from online storage, based on the policy settings. If the file is changed, the previous copy on secondary storage is marked invalid and the new file is migrated to a new location. Over time, as data is migrated and recalled, the HSM software invokes a process called reclamation to free up secondary or tertiary storage by copying the remaining active files from a largely inactive piece of media onto a new piece of media.
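The stub, recall and invalidation behavior can be modeled in memory with a small Python class. This is a sketch under stated assumptions, not a real HSM implementation: the class `TieredStore`, its method names and the integer "slot" identifiers for secondary storage are all hypothetical, and dictionaries stand in for disk and tape. Reclamation is left out.

```python
class TieredStore:
    """Toy two-tier store: a file is a stub when only its secondary copy exists."""

    def __init__(self):
        self.resident = {}    # file name -> bytes held on primary disk
        self.copies = {}      # file name -> slot of its valid secondary copy
        self.secondary = {}   # slot -> bytes held on secondary storage
        self.invalid = set()  # slots whose contents have been superseded
        self._next_slot = 0

    def is_stub(self, name):
        return name in self.copies and name not in self.resident

    def write(self, name, data):
        """Create or change a file; any existing secondary copy becomes invalid."""
        if name in self.copies:
            self.invalid.add(self.copies.pop(name))
        self.resident[name] = data

    def migrate(self, name):
        """Ensure a valid secondary copy exists, then release the primary copy."""
        if name not in self.resident:
            return  # already a stub
        if name not in self.copies:  # new or changed file: write to a new location
            slot = self._next_slot
            self._next_slot += 1
            self.secondary[slot] = self.resident[name]
            self.copies[name] = slot
        del self.resident[name]  # unchanged files are simply released

    def read(self, name):
        """Return file contents, recalling from secondary storage if needed."""
        if name not in self.resident:
            self.resident[name] = self.secondary[self.copies[name]]  # stage to disk
        return self.resident[name]
```

In this model, reading a migrated file and migrating it again without changes reuses the existing secondary copy, while writing to it first marks the old slot invalid and sends the new contents to a fresh slot.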