To get some perspective on data growth, let's be a bit contrarian. We're all storing, replicating and backing up more information than ever before, and the volume of data we must deal with increases every year. But is it really so unmanageable?
The need for more capacity isn't a new problem. We've been through this with CPU performance and networking bandwidth, and technology has always saved the day. Remember Moore's Law (performance doubles every 18 months)? It continues to hold true with regard to CPU performance. In the area of networking, performance has increased from 10Mb/sec to 100Mb/sec to 1,000Mb/sec and now to 10,000Mb/sec. If someone told us in 1990 how much data would be transmitted over LANs on a daily basis in 2005, we'd probably have questioned their sanity. Is the problem in storage so different?
While data is growing enormously, our ability to store data is also expanding. Disk capacity continues to increase dramatically. Today, we're storing terabytes of data and anticipate--with some degree of trepidation--the prospect of storing petabytes. What's the big deal? It would have been impossible in 1995 when we would have needed 250,000 4GB disk drives to hold a petabyte of data (without considering mirrors, RAID or replication). But today, with 300GB disks soon to be available, we're talking about a mere 3,300 or so drives.
It's startling to consider that an EMC Symmetrix 3100 circa 1997 had a capacity of 139GB. It would have required more than 7,500 Symms of that era to store a petabyte--unmirrored--and managing it would have been a daunting challenge. A single frame of a new, high-end Hitachi TagmaStore array can hold one-third of a petabyte (raw), so a petabyte of storage today might mean just a few frames. That's certainly manageable.
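The drive and frame counts above are simple division, and a short sketch makes them easy to re-run for other capacities. The function name and constants below are illustrative and don't come from any vendor tool. The column doesn't say whether it counts a petabyte in decimal or binary units, so both conventions are shown as an assumption: the 4GB and 300GB figures line up with the decimal convention, while "more than 7,500" Symmetrix frames lines up with the binary one.

```python
import math

# Assumption: the column does not state which unit convention it uses.
GB_PER_PB_DECIMAL = 10**6   # 1 PB = 1,000,000 GB
GB_PER_PB_BINARY = 2**20    # 1 PB = 1,048,576 GB (binary convention)


def units_needed(unit_capacity_gb, gb_per_pb=GB_PER_PB_DECIMAL):
    """Whole drives or frames needed to hold one raw petabyte.

    Ignores mirroring, RAID and replication overhead, as the text does.
    """
    return math.ceil(gb_per_pb / unit_capacity_gb)


print(units_needed(4))                       # 4GB drives, circa 1995
print(units_needed(300))                     # 300GB drives
print(units_needed(139))                     # 139GB arrays, decimal petabyte
print(units_needed(139, GB_PER_PB_BINARY))   # 139GB arrays, binary petabyte
```

Either way, the count drops by roughly two orders of magnitude in a decade, which is the point of the comparison.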
An intriguing new book due out this spring may hold the answers to these questions. Pervasive Data: Harnessing the Power of Convergence, by Chris Stakutis and John Webster, offers a glimpse into the near future and suggests that the widespread adoption and convergence of some fundamental technologies, including radio frequency identification (RFID), video, electronic messaging, various wireless communications technologies, XML and a few others, will produce an exponential data explosion like never before. The authors say the availability of all this data in a form that can be easily interpreted and transmitted will feed upon itself: As more data exists, people will devise new ways to exploit it and increase its value.
And don't think this won't apply to your industry. Stakutis and Webster present a very compelling case across a broad range of industries, including retail, manufacturing, health care, government, agriculture, entertainment and others. Here are a few examples:
The retail supply chain spends enormous amounts of money on manual processes, such as tracking goods as they move from supplier to warehouse to truck to store shelf to your shopping cart. RFID devices, in conjunction with wireless and GPS technology, can streamline and automate this process. Eliminating manual processes, controlling shrinkage and improving just-in-time supply could save companies billions of dollars. Not surprisingly, Wal-Mart Stores Inc. is mandating that its suppliers adopt RFID. What will be the impact on the infrastructure in terms of data? The estimate for Wal-Mart's in-store usage alone is 7TB per day.
The medical field is moving to electronic patient records, but this endeavor is still in its infancy. In the future, a person's entire medical history, including X-rays and even DNA sequences, could be stored in a single, easily accessible place and format. Assuming network access to this information, doctors anywhere in the world will be able to more effectively treat patients. The data impact today is significant. For example, Massachusetts General Hospital in Boston saves every radiological image indefinitely. The hospital has passed its 100-millionth image, and data is currently growing at the rate of one terabyte every two weeks. This information is required for regulatory mandates such as HIPAA, but it also has a high collective value for research using computer-based analysis of petabytes of existing images.
Fighting the war on terror also creates new sources of data. The Department of Homeland Security is installing sensors with wireless networking capabilities in thousands of locations to aid in threat detection, and to prevent biological or chemical attacks. These sensors generate continuous streams of data that must be coordinated and analyzed to identify complex events.
In a related area, the Bioterrorism Act of 2002 requires traceability of food and other similar products through the manufacturing supply stream. By 2005, companies with 500 or more employees must provide detailed information about product flow within four hours of a request. Real-time data collection and storage is the key to accomplishing this.
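To put the two growth rates quoted in these examples on a common footing, here is a minimal sketch that annualizes them. It assumes the quoted rates stay constant over a 365-day year, which real environments rarely do, and the helper name is purely illustrative.

```python
DAYS_PER_YEAR = 365
TB_PER_PB = 1000  # decimal convention, an assumption


def annual_tb(tb_per_period, period_days):
    """Terabytes accumulated in a year at a constant rate."""
    return tb_per_period * DAYS_PER_YEAR / period_days


retail_tb = annual_tb(7, 1)      # 7TB per day (in-store RFID estimate)
imaging_tb = annual_tb(1, 14)    # 1TB every two weeks (radiology images)

print(f"Retail RFID: {retail_tb:,.0f} TB/year ({retail_tb / TB_PER_PB:.2f} PB)")
print(f"Radiology:   {imaging_tb:,.1f} TB/year")
```

At those rates the retail estimate passes a petabyte in under five months, while the hospital archive adds roughly 26TB a year.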
Data types add to growth
Data formats will also contribute to data growth. Among the key enablers for the pervasive data revolution are self-descriptive data formats, XML in particular. XML uses data descriptors so data isn't dependent upon a specific application being available to interpret it. This allows data to be leveraged for a variety of purposes and across a spectrum of applications. But data expressed in a descriptive format creates larger files than traditional binary data files: for example, storing the value "5" in binary requires less space than storing the same value wrapped in descriptive XML tags.
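The size penalty of a self-descriptive format is easy to see with a tiny sketch. The element name `quantity` is hypothetical, chosen only for illustration, and the binary layouts are just two common choices, not a format the column prescribes.

```python
import struct

value = 5

# Binary encodings: a single unsigned byte, or a 4-byte little-endian integer.
as_byte = struct.pack("B", value)
as_int32 = struct.pack("<i", value)

# Self-descriptive encoding: the same value inside a hypothetical XML element.
as_xml = f"<quantity>{value}</quantity>".encode("utf-8")

print(len(as_byte), "byte as a single binary byte")
print(len(as_int32), "bytes as a 32-bit integer")
print(len(as_xml), "bytes as an XML element")
```

The tags cost several times more than the value itself, but they let any application interpret the data without knowing the binary layout in advance. That is the trade-off described above.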
New uses for video data will also impact storage. Computer-based video analysis in manufacturing, retailing and security will generate enormous quantities of video files that will need to be stored and retained. With wireless cameras selling for less than $250 and the availability of smart software that can identify "unusual" events in video frames, usage will explode.
Stakutis and Webster also describe the impact of pervasive data on storage and data management, but I won't steal their thunder. Suffice it to say that there will be changes: Storage densities will increase dramatically, backup as we know it will cease to exist and the network will play a greater role in data management than ever before. And be prepared to really begin thinking about data classification--it will be needed more than ever.
Storage professionals are taking the first steps toward managing this onslaught. By offering tiered levels of storage, and leveraging technologies like virtualization and interconnecting data across the enterprise, we're positioning ourselves to address these future data needs. Data migration capabilities that are automated and transparent will also be critical. Maybe information lifecycle management (ILM) has a future after all (see Inside ILM).
Data never dies
At issue is the question of whether all data can live on disk and be managed efficiently. One fact of life that can be inferred from the pervasive data concept is that, under most circumstances, you should assume data will never be destroyed. The challenge is figuring out what to do with all this "old" data--determining who owns it and devising ways of accessing it, while maintaining security and privacy in accordance with regulatory statutes.
There are some technologies that can help. Management capabilities provided by appliances, software that performs volume management, and global namespace and other forms of aggregation will play a major role.
Much has been written to debunk the storage management organizational metric of terabytes per administrator (see Building your storage management group). A few years ago, the metric was stated in gigabytes; someday it will likely be stated in petabytes and, eventually, exabytes or zettabytes. (OK, I may be going a bit too far.) Some of this will occur simply by riding the capacity curve, as discussed earlier. Much will come from other technological advances, but a great deal will depend on putting into practice the organizational and procedural best practices that have evolved over the past few years. A good New Year's resolution would be to stop worrying about managing data growth and (to quote Nike) just do it!