Creating a reference data storage policy

Reference data growth is exploding, but many shops still use high-priced transaction-class storage for it. A smart reference data policy can help you squeeze more out of your storage dollars.

In the current economic climate, CIOs live with constant budget cuts and staff reductions. The last thing they want is to plan for radically new storage requirements. There's a change coming, however, that demands IT executives' attention today, so companies can avoid being buried in a sea of data. I'm talking about the imminent deluge of reference data.

What's reference data? The Milford, MA-based Enterprise Storage Group (ESG) defines reference data as "digital assets retained for active reference and value. It includes, but is not limited to: electronic documents such as contracts, e-mail and e-mail attachments, presentations, CAD/CAM designs, source code and Web content; certain digitized information such as check images, blueprints, historical documents, medical images, geophysical, satellite, and surveillance information, computer-generated images (CGI), genomics, bioinformatics, video, photographs and voice data." (For more information, see "Reference Information: The Next Wave.")

Reference data differs from traditional data in several ways. For one, it's growing faster. According to ESG, reference information is growing at a 92% compound annual growth rate (CAGR) through 2005, compared to 61% for traditional data. Part of this explosive growth stems from the fact that reference data is composed of large files that can be several megabytes apiece. Reference data also has different usage patterns than other corporate information.

Transactional data, databases and PowerPoint files are often accessed on a daily basis and maintained only as long as necessary. Reference data is usually accessed infrequently, but maintained for years or decades. Finally, reference data can be industry-specific and as such may have a regulatory element to it. Think financial services (customer information and the Gramm-Leach-Bliley Act) or healthcare (patient information and HIPAA).

Companies trying to manage their reference data within a traditional data infrastructure could run into numerous issues. Storing reference data on high-priced enterprise storage equipment creates an unnecessary expense, but opting for tape may not satisfy business or performance needs. In addition, reference data growth will place burgeoning demands on storage operations teams tasked with configuring systems, responding to outages and backing up and restoring data.

To properly address business, financial and IT requirements, companies must develop a reference data strategy. Here are three steps for developing a successful reference data strategy.

Step 1: Assess the business need
Any reference data project should start with a full understanding of current and future business requirements. CIOs should assign this task to a business-savvy project manager who can match business and IT strategy and work with the storage team.

A good place to start is by sorting existing data into two buckets, traditional and reference. You'll probably find that a lot of the storage capacity contains information that can be classified as reference data, but is managed like traditional data. You can probably improve this situation with more cost-effective storage, archival tools and appropriate backup management. That exercise alone is worthwhile, as it will expose inefficiencies and lead to operational improvements.
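The two-bucket sort described above can be roughed out with a short script. This is only a sketch under stated assumptions: it treats "not accessed in 90 days" as a stand-in for reference data, which is a hypothetical threshold and not an ESG definition, and it relies on file access times, which some file systems do not update reliably. The names `REFERENCE_AGE_DAYS`, `classify_by_age` and `summarize` are illustrative.

```python
import time
from pathlib import Path

# Hypothetical cutoff: files untouched this long are counted as reference candidates.
REFERENCE_AGE_DAYS = 90
SECONDS_PER_DAY = 86400


def classify_by_age(days_since_access, threshold_days=REFERENCE_AGE_DAYS):
    """Place a file in the 'reference' or 'traditional' bucket by access age."""
    return "reference" if days_since_access >= threshold_days else "traditional"


def summarize(root, now=None, threshold_days=REFERENCE_AGE_DAYS):
    """Walk a directory tree and total the bytes that fall into each bucket."""
    now = time.time() if now is None else now
    totals = {"traditional": 0, "reference": 0}
    for path in Path(root).rglob("*"):
        if path.is_file():
            info = path.stat()
            age_days = (now - info.st_atime) / SECONDS_PER_DAY
            totals[classify_by_age(age_days, threshold_days)] += info.st_size
    return totals
```

Running `summarize` against a file share gives a first estimate of how much capacity holds rarely touched data. A real classification would also weigh file type, retention rules and business owner input.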

You should also explore future business initiatives. Are there plans to digitize data or are there impending regulatory changes that will mandate these types of activities? Are there new business opportunities that will require this type of information? You'll need to know who will access the data, how often, where they will be located and what type of performance is expected. These are difficult questions, but a thorough exploratory process will yield an appropriate, affordable business solution.

Step 2: Assess current and future needs
Once business requirements are defined, examine how things are done within IT and what changes are necessary. The data classification exercise that took place during Step 1 should uncover some inefficiencies. Measure how large these problems are and start defining a reference data solution. Moving reference data off enterprise storage may have unforeseen benefits. The capacity it frees up may allow you to defer new equipment purchases and may open the door to storage consolidation.

The definition of current and future reference data needs should also drive new IT analysis activities. Project managers should forecast reference data capacity needs for the next three to five years. Many companies will discover that reference data could scale into multiple terabytes, or even into the petabyte range. These estimates may be simple back-of-the-envelope calculations right now, but they will help IT assess budget requirements, staff planning and physical space needs. Don't forget to include the time and cost necessary for training or new operational procedures.
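A back-of-the-envelope forecast of this kind is just compound growth. The sketch below uses the 92% and 61% growth rates ESG cites earlier in this article. The 10TB starting point is a hypothetical figure, and applying the same rate for five straight years is a simplifying assumption, not an ESG projection.

```python
def forecast_capacity(current_tb, annual_growth, years):
    """Return projected capacity in TB at the end of each year, compounding annually."""
    return [current_tb * (1 + annual_growth) ** year for year in range(1, years + 1)]


start_tb = 10.0  # hypothetical starting capacity for each data class
reference = forecast_capacity(start_tb, 0.92, 5)    # 92% growth rate cited by ESG
traditional = forecast_capacity(start_tb, 0.61, 5)  # 61% growth rate cited by ESG

for year, (ref, trad) in enumerate(zip(reference, traditional), start=1):
    print(f"Year {year}: reference {ref:7.1f} TB, traditional {trad:7.1f} TB")
```

Under these assumptions, 10TB of reference data grows to roughly 261TB in five years, while the same amount of traditional data reaches roughly 108TB. Substitute your own starting capacity and growth rates from the Step 1 classification.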

With these estimates in hand, CIOs should have storage specialists explore reference data solutions and application managers discuss reference data strategies with industry application vendors. Reference data storage solutions are fairly immature right now, with EMC's Centera platform grabbing most of the headlines. But there are products available from established companies such as Network Appliance and StorageTek and smaller players such as Avamar, Isilon, Storigen, and Xiotech.

While storage solutions are important, recognize that reference data technology will really be driven by applications such as CAD/CAM for manufacturing, picture archival and communications systems (PACS) for healthcare and document imaging in insurance. Check with application vendors to see what types of solutions they'll offer. Be sure to understand how they'll handle issues such as scale, availability, systems management and storage vendor partnerships. Given the growth of reference data, ignoring these details could turn into IT headaches sooner rather than later.

Step 3: Design a solution
Most companies won't have an immediate need for a complete reference data infrastructure, but beginning the process now will help IT with project timing and budget containment. To keep costs under control from the start, firms should design a small, cheap infrastructure that can grow quickly and incrementally. Rack-based ATA storage is probably the best fit. As reference data grows, you'll need solutions that automate administrative tasks and provide robust management information. Designing these things into a reference data infrastructure upfront will pay long-term dividends.

Companies that plan on sharing reference data will need a more sophisticated infrastructure. That should include server and network load balancing, adequate bandwidth for peak consumption and network security. Obviously, this creates a bigger project that demands participation from networking, telecommunications and security groups within IT. These teams should be included in the reference data infrastructure project.

Final thoughts
It's easy to continue to maintain a procedure that treats all information the same. Savvy CIOs will anticipate the reference data need by beginning a reference data project now. A reference data policy helps streamline operations and lower hardware cost. A smart reference data strategy that includes people, processes and technology could help the company meet its business objectives without breaking the IT bank in the process.
