Optimize your storage for fixed content

Much of your data gets written once, read often and never changes. Here's what's available to handle so-called fixed-content storage--and how storage managers are making use of it.

Fixed-content information has grabbed many storage headlines recently, and for good reason: Fixed-content storage has grown explosively, far outpacing other forms of enterprise data such as transactional or database information. Storage managers need to know what solutions are available today for fixed-content data, and they need guidelines for choosing the right one for their business requirements. More importantly, they need to understand whether implementing a fixed-content storage solution is right for their business at all.

Fixed content is information that never changes after creation. It's actively referenced, typically shared among users and must be retained for a long period of time. Examples include electronic documents, presentations and e-books; rich media such as movies, videos, digital photographs and audio files; check images and financial statements; bioinformatics, X-rays, MRIs and CAT scans; CAD/CAM diagrams and blueprints; and e-mail messages. Unlike transactional information, fixed-content data requires subsecond rather than microsecond response times. While transactional data is constantly updated and typically has short-term retention periods, fixed-content information is static, has long retention periods and is constantly accumulating.

"The digitization of everything in our lives is going to cause a fundamental shift in how we think about information and how we value that information," says Peter Gerr, a senior research analyst at the Enterprise Storage Group (ESG), in Milford, MA.

Pros and cons of content-addressed storage
Pros
Provides faster access to information and improves operational efficiencies
Lower cost per megabyte than traditional primary disk
Self-managing design simplifies storage provisioning, management and administration
Facilitates broader information sharing and repurposing
Unique object identifier enhances data security and provides controlled access to information
Cons
Applications must support vendor's API for object-based storage and retrieval
Potential network bottlenecks could impact performance
Limited number of competitive solutions--beware of potential vendor lock-in
Limited backup application support for creation of disaster recovery tape copies

However it's measured, fixed-content data requires a lot of storage: A 3,000-person organization generates approximately 1TB of e-mails per year. A picture archive and communications system (PACS) in a large hospital may generate more than 5TB per year in digital X-rays or MRIs. Most major banks are scanning millions of check images per year, requiring multiple terabytes of storage. Additionally, the new focus on U.S. Homeland Security is driving an increase in storage requirements for audio and video surveillance activity. ESG estimates that by the end of 2005, reference information will represent 54% of all new corporate and government information, up from 37% at the end of 2001.

Increasingly, organizations are leveraging the value of their fixed-content information to improve customer service, reduce access time to information and gain competitive advantage (see "Understanding your information requirements").

Richard Banta and Andy Porter, senior engineers at St. Vincent Hospital in Indianapolis, IN, are experiencing this growth in fixed-content information first hand. St. Vincent Hospital provides centralized storage of PACS radiological images for more than 80 satellite locations. The hospital employs a McKesson ALI UltraPACS system to capture medical images and store them on a StorageTek archive solution consisting of Application Storage Manager (ASM) software, a small disk array and a StorageTek tape library. After converting to a filmless operation last April, the hospital experienced a surge in radiological imaging and storage activity. With the improved efficiencies of automation, doctors are able to schedule and perform more radiology studies, with faster access to patient records and medical images.

"In healthcare, we're faced with data growth and retention periods that we've never had before." Our PACS radiological study volume is up to 270,000 studies and it continues to grow at about 20% to 25% per quarter," says Porter. St. Vincent Hospital archives 3TB of PACS information per year, and has approximately 5TB of other types of fixed-content information currently stored on disparate optical platforms. Banta adds, "HIPAA regulations are driving information growth, with retention requirements of five, 10 or 21 years depending on the type of medical information."

St. Vincent Hospital plans to consolidate all of its fixed-content information onto a single storage solution to simplify storage management and reduce cost. In addition, Porter and Banta are planning to reduce the time it takes to retrieve a patient's MRI or X-ray from a few minutes to less than a second by implementing a much larger disk array to store the information online for up to 24 months. Reducing information access time enables doctors to make informed decisions more quickly, improving the overall quality of healthcare for the patient. "We're constantly looking for ways to improve the patient experience," Porter says.
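
Porter's figure of 20% to 25% growth per quarter compounds quickly. The short calculation below is only a back-of-the-envelope sketch: it assumes the quoted quarterly rate holds steady for four quarters, which the hospital doesn't claim, and the function name is made up for illustration.

```python
def compound_growth(quarterly_rate: float, quarters: int = 4) -> float:
    """Growth factor after `quarters` periods at a steady `quarterly_rate`."""
    return (1.0 + quarterly_rate) ** quarters


STUDIES_NOW = 270_000  # study volume quoted by Porter

for rate in (0.20, 0.25):
    factor = compound_growth(rate)
    projected = round(STUDIES_NOW * factor)
    print(f"{rate:.0%} per quarter -> x{factor:.2f} per year, "
          f"about {projected:,} studies after four quarters")
```

Under that assumption, the study volume would more than double within a year, which helps explain the push toward a single, larger repository.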

File-based approach to storage
A file-based fixed-content solution consists of a shared, network-attached file server with a common repository of ATA storage arrays. The fixed-content application interfaces with the file server and stores information in a standard file system directory structure located on the ATA array. The ATA storage array could be deployed as the primary layer in an overall storage hierarchy, consisting of HSM software and secondary storage for long-term retention as the data access frequency drops off to near zero.
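
The storage hierarchy described above comes down to a placement policy driven by how recently a file was accessed. The sketch below shows one such rule; it's an illustration, not any HSM product's logic. The 730-day window is an assumption borrowed from St. Vincent's plan to keep images online for up to 24 months, and the tier names and function are hypothetical.

```python
from datetime import date, timedelta

ONLINE_WINDOW_DAYS = 730  # assumption: roughly 24 months on the primary ATA tier


def choose_tier(last_access: date, today: date,
                online_window_days: int = ONLINE_WINDOW_DAYS) -> str:
    """Return 'primary' for recently accessed files, 'secondary' otherwise."""
    idle = today - last_access
    if idle <= timedelta(days=online_window_days):
        return "primary"
    return "secondary"


today = date(2003, 10, 1)
print(choose_tier(date(2003, 6, 15), today))  # accessed a few months ago
print(choose_tier(date(2000, 1, 1), today))   # idle for years
```

A real policy would likely weigh file size, retention class and regulatory rules as well as access age.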

Understanding your information requirements
Choosing the right fixed-content solution is the key to realizing the benefits of faster access to information, lower storage costs and simplified storage management. Consider the following questions before making a storage decision:
What applications are generating fixed-content or reference information?
Does this information need to be shared among different users?
What's an acceptable access time for this information?
Is this information currently on disk, optical or tape?
What's the rate of information growth across the enterprise?
How frequently is this information accessed over its lifespan?
How long must the data be retained and accessible?
Can the information be leveraged or repurposed for driving new business or revenue?
Are there any special regulatory requirements that apply to this information?
How many copies of the information are required?
Is this reference information mission critical?
Does the information require remote replication or backup to tape?

An example of this type of solution is the StorageTek BladeStore and ASM system. BladeStore is a storage area network (SAN)-attached storage array powered by an LSI Logic RAID controller with a StorageTek-developed ATA disk array. Each BladeStore array contains up to ten 800GB storage blades, for a capacity of up to 8TB per array. A single BladeStore system can scale from a minimum of 4TB up to 160TB of capacity.

ASM is storage management software with a high-performance file system that runs on a Solaris or Windows server, and may be shared across an IP network via NFS, common Internet file system (CIFS) or SAMBA. The ASM software integrates with fixed-content applications such as e-mail archive, document management, video surveillance and medical imaging. ASM is certified with applications from AGFA, Siemens, Kodak, Philips, and others. ASM can also replicate the information across multiple arrays, perform an automated backup to high capacity tape or optionally migrate the data to secondary storage. Don Baune, vertical markets manager for StorageTek, says, "ASM is not tied to any specific storage technology. You can choose the technology based on the service levels required for your data."

BladeStore disk is priced between approximately 1.5 cents/MB and 2 cents/MB, while a complete storage solution consisting of a server, ASM, BladeStore and an optional tape library for disaster recovery costs less than 3 cents/MB, depending on configuration and capacity.

Network Appliance's NearStore system is another example of a lower-cost ATA disk-based solution. NearStore is based upon Network Appliance's filer technology, Data ONTAP operating system and WAFL (Write Anywhere File Layout) file system. The NearStore R150 is available in two system module capacities, 12TB or 24TB. Multiple NearStore modules may be configured and managed via Network Appliance's storage management software. Fixed-content applications read and write information directly to the NearStore, which appears as a very large file system shared via NFS or CIFS. No application changes are required. "File-based access is very flexible. Over 95% of applications know how to do it," says John Kim, marketing manager, rich content storage, for Network Appliance. For fixed-content data subject to regulations that specify that information cannot be changed or deleted (such as SEC 17a-4), Network Appliance offers an optional WORM (write once, read many) function for NearStore called SnapLock. With SnapLock, either a portion or all of the NearStore capacity may be configured as WORM storage. Files written to a SnapLock volume can be copied, but not altered, moved or deleted. Stored information may also be replicated to another NearStore at a remote location for disaster recovery purposes. NearStore interfaces with fixed-content software applications from third-party vendors including AGFA, Documentum, FileNet, IXOS and KVS.

"File-based access is very flexible. Over 95% of applications know how to do it," says John Kim, marketing manager, rich content storage, for Network Appliance. For fixed-content data that can't be changed or deleted, NearStore provides an optional snapshot point-in-time copy capability. Stored information may also be replicated to another NearStore at a remote location for disaster recovery purposes. NearStore interfaces with fixed-content software applications from third-party vendors including AGFA, Documentum, FileNet, IXOS, and KVS.

Priced at approximately 1.2 cents/MB to 1.6 cents/MB, NearStore is attractive to budget-constrained users interested in consolidating their fixed-content information onto a low-cost storage repository. At 12TB, the system scalability isn't as granular as other solutions, so Network Appliance gives users the option to partition the back-end ATA-based disk and share the storage through Fibre Channel (FC) SAN connectivity. Because most of the leading backup software vendors, including Computer Associates, IBM, Legato and Veritas, support NearStore, backup to tape is also an option.

Object-oriented approach
Object-oriented (OO) storage technologies offer a new and different approach to meeting the growing storage demands of fixed-content applications. With OO storage, applications interface with the storage system's API over an IP network to store information as objects, rather than as files or blocks as in traditional network-attached storage (NAS) or SAN storage architectures. While traditional file systems provide the host application with a location-based directory of where a file is stored, they typically don't provide a mechanism for capturing additional attributes, or metadata, about the file. An OO storage system assigns a unique identifier or fingerprint to the stored object for application access and retrieval. Like a fingerprint, the identifier is permanently associated with the object, even if the underlying storage technology changes. Additional metadata about the object, such as retention period or expiration date, may be stored with the object as well.

OO storage systems essentially present the image of a large storage pool to the application. Because the application only needs to know the storage system's IP address and the object identifier to access the information, it doesn't need to be aware of a file system layout or the physical storage configuration. The OO storage system stores and manages object information transparently to the fixed-content application, simplifying storage management and administration. If additional storage capacity is required, it may be added to the storage pool in a non-disruptive manner.
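
The fingerprint idea can be sketched in a few lines. The toy store below assumes the identifier is a SHA-256 hash of the object's bytes, which is one common way to build such a fingerprint and not necessarily what any vendor does. The class, its methods and the retention_years attribute are hypothetical and don't represent a real storage API.

```python
import hashlib


class ObjectStore:
    """Toy in-memory object store; not any vendor's API."""

    def __init__(self) -> None:
        self._objects = {}   # identifier -> object bytes
        self._metadata = {}  # identifier -> attributes such as retention period

    def put(self, data: bytes, **metadata) -> str:
        # Assumption: the identifier is a SHA-256 digest of the content.
        object_id = hashlib.sha256(data).hexdigest()
        # Identical content maps to the same identifier, so it is kept once.
        self._objects.setdefault(object_id, data)
        self._metadata.setdefault(object_id, dict(metadata))
        return object_id

    def get(self, object_id: str) -> bytes:
        return self._objects[object_id]

    def metadata(self, object_id: str) -> dict:
        return dict(self._metadata[object_id])

    def __len__(self) -> int:
        return len(self._objects)


store = ObjectStore()
first = store.put(b"quarterly report", retention_years=7)
second = store.put(b"quarterly report")
changed = store.put(b"quarterly report, revised")
print(first == second, first != changed, len(store))
```

The application only ever holds the identifier. Storing the same bytes twice yields the same identifier and one stored copy, while any change to the bytes yields a different identifier.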

EMC's Centera is an example of an OO storage system specifically designed for fixed-content applications. Centera is a network-attached device, but it isn't NAS. EMC refers to Centera as content-addressed storage or CAS (see "Pros and cons of content-addressed storage"). Centera's hardware is based upon a redundant array of independent nodes (RAIN) architecture consisting of storage and access nodes. Each node consists of a 1GHz Pentium III processor, four 250GB ATA disks and three 10/100 BaseT network connections. The access nodes provide an interface to the client applications, while the storage nodes store application information in object form. Additionally, the nodes may be deployed in a clustered configuration for availability and performance. Centera's entry point is an eight-node system configured for either 2.9TB of usable capacity (mirrored protection) or 4.3TB of parity-protected capacity. Each Centera cabinet can contain up to 32 nodes, and 16 cabinets can be configured as a single cluster. Centera can also be managed as a domain, which scales up to seven clusters, holding more than a petabyte of storage.
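
The scaling figures above can be checked with simple arithmetic. The sketch below computes raw disk capacity only. It assumes every node contributes its four disks and uses decimal units, so it likely overstates what is usable once access nodes and mirroring or parity overhead are accounted for. The constant and function names are made up for illustration.

```python
DISKS_PER_NODE = 4
DISK_GB = 250
NODES_PER_CABINET = 32
CABINETS_PER_CLUSTER = 16
CLUSTERS_PER_DOMAIN = 7


def raw_tb(nodes: int) -> float:
    """Raw disk capacity in decimal TB, assuming every node holds data."""
    return nodes * DISKS_PER_NODE * DISK_GB / 1000


entry_raw = raw_tb(8)
domain_nodes = NODES_PER_CABINET * CABINETS_PER_CLUSTER * CLUSTERS_PER_DOMAIN

print(f"Entry system: {entry_raw:.1f}TB raw, "
      f"{2.9 / entry_raw:.0%} usable mirrored, {4.3 / entry_raw:.0%} usable with parity")
print(f"Full domain: {domain_nodes:,} nodes, {raw_tb(domain_nodes):,.0f}TB raw")
```

Even with protection overhead taken off, a fully populated domain comfortably clears the petabyte mark that EMC quotes.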

Unlike file-based storage solutions, fixed-content applications must support the Centera API for object storage and retrieval. An example of an application that's fully integrated with the Centera API is Xact Enterprise Content Integration Software, from Systemware in Dallas, TX. According to Systemware, Xact provides users with the ability to automatically set and enforce data retention policies, enable Web-based access to information, and repurpose existing content to drive new business opportunities. With Xact, users with fixed-content information on Unix, Windows, Linux and even mainframe platforms can centralize it all on Centera. According to EMC, there are over 50 Centera-integrated applications now available.

"The value of the solution is in the software," says Roy Sanford, Vice President Marketing & Alliance Development for EMC's Centera Division. Sanford is referring to CentraStar, the CAS software that powers Centera, which is responsible for storing, retrieving, verifying, and replicating objects. According to Sanford, Centera was built for fixed-content applications with regulatory requirements such as data immutability (proof that the data hasn't changed), and long-term retention periods. Objects are stored in a WORM format, meaning they cannot be updated or changed in place. If an object is read and changed in any way, a new Content Address will be generated. Centera will automatically prevent identical objects from being stored twice to minimize wasted space. For example, Centera will store only one copy of identical email attachments and return multiple Content Addresses or pointers to the stored object back to the email archive application.

With the Compliance Edition option, Centera prevents deletion of the object until after the retention period has expired. EMC is targeting this capability to stockbrokers who must conform to Securities and Exchange Commission (SEC) regulatory requirements for records that must be unaltered and kept in a non-erasable and non-rewritable format. Centera with Compliance Edition Plus stores these records as objects that can't be deleted or erased. "In this case, the only way to dispose of the record is to take the disk drive out and destroy it," says Sanford.
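
Retention enforcement of the kind described for Compliance Edition can be pictured as a guard on the delete path. The sketch below is an assumption-laden illustration, not EMC's implementation: the class names, the write-once check and the date handling are all hypothetical.

```python
from datetime import date


class RetentionError(Exception):
    """Raised when a delete is attempted before retention has expired."""


class RetentionStore:
    """Toy write-once store that refuses early deletes."""

    def __init__(self) -> None:
        self._records = {}  # record_id -> (data, retain_until)

    def put(self, record_id: str, data: bytes, retain_until: date) -> None:
        if record_id in self._records:
            raise ValueError("records are write-once and cannot be replaced")
        self._records[record_id] = (data, retain_until)

    def delete(self, record_id: str, today: date) -> None:
        _, retain_until = self._records[record_id]
        # Deletion is allowed only after the retention date has passed.
        if today <= retain_until:
            raise RetentionError(f"{record_id} is retained until {retain_until}")
        del self._records[record_id]

    def __contains__(self, record_id: str) -> bool:
        return record_id in self._records


archive = RetentionStore()
archive.put("trade-0001", b"order ticket", retain_until=date(2009, 12, 31))
try:
    archive.delete("trade-0001", today=date(2005, 1, 1))
except RetentionError as err:
    print("refused:", err)
archive.delete("trade-0001", today=date(2010, 1, 1))
print("trade-0001" in archive)
```

A stricter mode like the one Sanford describes would simply have no delete path at all.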

Centera is a good fit for companies interested in implementing a scalable disk-based solution to reduce data access time, while ensuring regulatory compliance for fixed-content information. Centera's content addressing scheme and its ability to generate a unique ID for each stored object provide a safeguard to ensure the information can't be modified or erased. However, be sure that your fixed-content application vendor has certified its software with the Centera API, and be prepared to pay extra for Centera's enhanced functionality. Centera is priced at approximately 3 cents/MB to 4 cents/MB, depending on capacity and data protection method.
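
Prices per megabyte are easier to compare when converted to dollars per terabyte. The sketch below does that for the list prices quoted in this article, assuming decimal units (1TB = 1,000,000MB). The figures are approximate and configuration-dependent, so treat the output as a rough comparison only; the names are made up for illustration.

```python
MB_PER_TB = 1_000_000  # assumption: decimal units

# Approximate price ranges quoted in this article, in cents per MB.
PRICE_CENTS_PER_MB = {
    "NearStore": (1.2, 1.6),
    "BladeStore disk": (1.5, 2.0),
    "Centera": (3.0, 4.0),
}


def dollars_per_tb(cents_per_mb: float) -> float:
    """Convert a price in cents per MB to dollars per decimal TB."""
    return cents_per_mb * MB_PER_TB / 100


for name, (low, high) in PRICE_CENTS_PER_MB.items():
    print(f"{name}: ${dollars_per_tb(low):,.0f} to ${dollars_per_tb(high):,.0f} per TB")
```

At these rates, each cent per megabyte works out to $10,000 per terabyte, so Centera's premium is easy to weigh against the compliance features it buys.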

Store objects on commodity hardware
Cluster File Systems, Inc., Mountain View, CA, is developing another new OO storage technology called Lustre (the name is a combination of Linux and Cluster) due out later this year. Lustre is an open-source high-performance cluster file system designed to store objects on commodity hardware.

Lustre handles files as objects, and separates its cluster file system operations from the actual storage of the object. The Lustre architecture is modular and consists of clients that interface over a network with both metadata servers (MDS) and object storage targets (OST). Metadata servers are responsible for managing all Lustre file system operations, such as file creation and lookup, as well as file metadata. When a new file is created, the MDS contacts an OST to create an object. The OSTs communicate with all clients and interface with the underlying physical storage devices to store information in object form.
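
The division of labor between metadata servers and object storage targets can be modeled with two small classes. This is a toy model for illustration only, not Lustre code or its API: the class names, the round-robin placement and the one-object-per-file layout are assumptions, and locking, recovery and networking are left out entirely.

```python
class ObjectStorageTarget:
    """Holds object data; knows nothing about file names."""

    def __init__(self, name: str) -> None:
        self.name = name
        self._objects = {}
        self._next_id = 0

    def create_object(self) -> int:
        object_id = self._next_id
        self._next_id += 1
        self._objects[object_id] = b""
        return object_id

    def write(self, object_id: int, data: bytes) -> None:
        self._objects[object_id] = data

    def read(self, object_id: int) -> bytes:
        return self._objects[object_id]


class MetadataServer:
    """Maps file names to (target, object) pairs; never touches file data."""

    def __init__(self, targets: list) -> None:
        self._targets = targets
        self._files = {}

    def create(self, path: str):
        # Assumption: simple round-robin placement across targets.
        target = self._targets[len(self._files) % len(self._targets)]
        self._files[path] = (target, target.create_object())
        return self._files[path]

    def lookup(self, path: str):
        return self._files[path]


osts = [ObjectStorageTarget("ost0"), ObjectStorageTarget("ost1")]
mds = MetadataServer(osts)

# A client asks the MDS to create the file, then talks to the OST directly.
target, object_id = mds.create("/scans/xray-001")
target.write(object_id, b"image bytes")
found_target, found_id = mds.lookup("/scans/xray-001")
print(found_target.name, found_target.read(found_id))
```

Because data moves between clients and storage targets without passing through the metadata server, adding targets adds bandwidth, which is the scalability argument Cluster File Systems makes.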

According to Cluster File Systems, this design improves performance, data availability and system scalability. Additional storage devices may be added to the pool of available storage that an OST can access, and Lustre can accommodate new storage technologies as they become available. Lustre concurrently supports 10,000 clients and 1,000s of storage nodes. The initial implementations will scale to 100TB of storage capacity with data transfer rates of hundreds of gigabytes per second.



Choosing a solution
With various fixed-content storage systems now available--and more to follow in the near future--storage managers face the challenge of choosing the right solution for their business requirements. Understanding how your information is used, accessed, shared and retained is the key to making the right decision. As with most IT decisions, users must consider not only how well the features and functions of a given solution address their individual needs, but also system scalability, performance, ease of management and, of course, price. Although the commercially available fixed-content storage solutions from EMC, Network Appliance, StorageTek and others are based on ATA disk arrays, they differ in approach and implementation.

Here's the CliffsNotes version: IT managers willing to pay more for a content-specific disk solution that offers flexible scalability and assurance of data immutability (non-erasable, non-rewritable WORM) should consider EMC's Centera. Those interested in consolidating their fixed-content information on a low-cost storage repository should consider a Network Appliance NearStore system or StorageTek's BladeStore disk array. For policy-based management of information throughout its lifecycle--including integrated backup and long-term archive to secondary storage--consider StorageTek's ASM/BladeStore solution. Alternatively, IT professionals should consider new storage technologies such as Lustre for building a flexible and scalable fixed-content storage repository, using low-cost commodity hardware, customized to meet their business requirements.

Robert Terdeman, CTO and senior VP of Rogers Medical Intelligence Solutions, New York, NY, implemented a fixed-content storage repository earlier this year. Rogers Medical publishes hundreds of custom reports per year that provide doctors with independent data and analysis on the latest clinical findings. "The savings in terms of business process time and energy by having a single consolidated content repository is huge," explains Terdeman. Terdeman's advice to others looking to implement a fixed-content solution is twofold: First, understand your data access characteristics and performance requirements. Second, ensure that all of the vendors involved completely understand your business requirements. Their solutions must be flexible, and applications need to support a customized, integrated solution.
