In an effort to expand the capacity of a file server supporting hundreds of users, healthcare IT manager Neil Smith plugged an external 14-drive enclosure into the external port of his RAID controller. The controller had a SCSI channel that was already being used internally as part of a drive array spanning multiple channels. Unfortunately, when he added the enclosure, the original RAID configuration was lost, and with it more than 400,000 files (about 250 GB of data). The IT team attempted a rebuild, but when it had not completed after running through the night, they discovered it was overwriting the original array with a new one that included all the drives in the server as well as those in the enclosure.
Luckily, a data recovery company was able to connect remotely and restore more than 99% of the 400,000+ files from the reconfigured and overwritten RAID set. The incident nonetheless illustrates a central paradox of data storage: as the complexity and sophistication of storage increase, so too does the rate of hardware, software and operator failures. In fact, according to Enterprise Strategy Group, even with all the advancements in storage technology, only about 20% of backup jobs are successful.
Each year, hundreds of new data storage products and technologies meant to make the job faster and easier are introduced. But with so many categories and options to consider, the complexity of storage instead causes confusion, which ultimately leads to lost time and the loss of the very data these enhancements were designed to protect. Hence the question for most IT professionals who have invested hundreds of thousands of dollars in state-of-the-art storage technology remains: "How can data loss still happen, and what am I supposed to do about it?"
Why backups still fail
In a perfect world, a company would build its storage infrastructure from scratch using new storage solutions and standardize on vendors and options. If everything remained unchanged, some incredibly powerful, rock-solid results could be achieved.
However, in the real world, storage is messy. Nothing remains constant: newly created data is added at an unrelenting pace, while new regulations, such as Sarbanes-Oxley, mandate changes in data retention procedures. Since companies can rarely justify starting over from scratch, most add storage in incremental stages, introducing new elements from different vendors along the way. Hence the complexity of storage.
All this complexity can lead to a variety of backup failures, and companies unprepared to deal with the ramifications of data loss feel them most. One reason backups fail is bad media. If a company leaves its backup tapes sitting on a shelf for years, the tapes can become damaged and unreadable, which often happens when they are not stored properly. Another reason is that companies lose track of the software with which the backups were created. For a restore to be successful, most software packages require that the exact original environment be available. Finally, backups fail because of corruption in the backup process. Many times, companies change their data footprint but do not update their backup procedure to keep up, so they are not backing up what they think they are. Without regular testing, companies are susceptible to all of these sources of failure.
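One way to picture the "regular testing" mentioned above is a trial restore followed by a file-by-file comparison. The following is only a minimal sketch under stated assumptions: the names verify_restore, source_dir and restored_dir are hypothetical, and it assumes the live data and a test restore are both reachable as ordinary directories. It does not represent any particular backup product.

```python
import hashlib
from pathlib import Path


def file_digest(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_restore(source_dir: Path, restored_dir: Path) -> dict:
    """Compare every file under source_dir with its copy under restored_dir.

    Both directory names are hypothetical: source_dir is the live data and
    restored_dir is where a trial restore was written.
    """
    report = {"ok": [], "missing": [], "mismatched": []}
    source_files = sorted(p for p in source_dir.rglob("*") if p.is_file())
    for source_file in source_files:
        relative = source_file.relative_to(source_dir)
        restored_file = restored_dir / relative
        if not restored_file.is_file():
            # The file exists in production but never made it into the backup.
            report["missing"].append(relative.as_posix())
        elif file_digest(source_file) != file_digest(restored_file):
            # The file was backed up, but its contents differ.
            report["mismatched"].append(relative.as_posix())
        else:
            report["ok"].append(relative.as_posix())
    return report
```

A non-empty "missing" list is the symptom described above: the data footprint has grown beyond what the backup procedure covers. Files that changed legitimately after the backup ran will also show up as mismatched, so a check like this is most meaningful against a snapshot or a quiet data set.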
What to do when your backup fails
No matter how much a company tries to speed operations and guard against problems with new products and technology, the threat of data loss remains, and backup and storage techniques do not always provide the necessary recovery. When an hour of downtime can result in millions of dollars lost, including data recovery in your overall disaster plan is critical, and may be the only way to restore business continuity quickly and efficiently.
When a data loss situation occurs, time is the most critical component. Decisions about the most prudent course of action must be made quickly, which is why administrators must understand when to repair, when to restore and when to recover data.
When to repair
Repair starts with running a file system repair tool, such as FSCK or CHKDSK, in read-only mode. These tools attempt to repair broken links in the file system through very specific knowledge of how that file system is supposed to look, so running the actual repair on a system with many errors could overwrite data and make the problem worse. Depending on the results of the read-only diagnosis, the administrator can make an informed decision to repair or recover. If the tool finds a limited number of errors, it is likely that running the repair will yield good results.
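As a rough illustration of the read-only-first approach, the sketch below builds the diagnostic command without executing it. It relies on two documented behaviors: fsck with the -n option answers "no" to every repair prompt, and chkdsk run without /F or /R only reports problems. The function names and the error threshold are hypothetical; what counts as a "limited" number of errors is a judgment call that depends on the system.

```python
import platform


def read_only_check_command(target: str, system: str = "") -> list:
    """Build a diagnostic-only file system check command for the target.

    target is a device or volume such as "/dev/sdb1" or "D:".
    system defaults to the platform this script is running on.
    """
    system = system or platform.system()
    if system == "Windows":
        # chkdsk without /F or /R reports problems but does not fix them.
        return ["chkdsk", target]
    # fsck -n answers "no" to every repair prompt, so nothing is written.
    return ["fsck", "-n", target]


def suggest_next_step(error_count: int, max_repairable_errors: int = 10) -> str:
    """Turn a read-only error count into a suggested course of action.

    max_repairable_errors is an illustrative threshold, not a vendor figure.
    """
    if error_count == 0:
        return "no repair needed"
    if error_count <= max_repairable_errors:
        return "repair"
    return "recover"
```

The command list can be handed to subprocess.run once the administrator has confirmed the target, ideally with the volume unmounted. The command is deliberately not executed here, so the sketch itself cannot touch a disk.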
Note: If your hard drive makes strange noises at any point, immediately skip to the recovery option.
When to restore
The first question an administrator should ask is how fresh the last backup is and whether a restore will get the company to the point where it can effectively continue normal operations. There is a significant difference between data from the last backup and data from the point of failure, so it is important to make that distinction right away. Only a recovery can help if critical data has never been backed up. Another important question is how long the restore will take to complete; if it will take too long, other options may need to be considered. A final consideration is how much data needs to be restored. Restoring several terabytes of data from tape, for example, may not be practical because of the time involved.
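The "how long will it take" question comes down to simple arithmetic: data volume divided by sustained restore throughput. The sketch below is a back-of-the-envelope estimate only. The function names are hypothetical, and the 30 MB/s figure in the usage note is an assumed throughput chosen for illustration, not a specification of any tape drive.

```python
def estimate_restore_hours(data_gb: float, throughput_mb_per_s: float) -> float:
    """Estimate restore time in hours, using decimal units (1 GB = 1000 MB)."""
    if throughput_mb_per_s <= 0:
        raise ValueError("throughput must be positive")
    seconds = data_gb * 1000 / throughput_mb_per_s
    return seconds / 3600


def restore_fits_window(data_gb: float, throughput_mb_per_s: float,
                        max_downtime_hours: float) -> bool:
    """Report whether the estimated restore fits the tolerable downtime."""
    return estimate_restore_hours(data_gb, throughput_mb_per_s) <= max_downtime_hours
```

Under that assumed 30 MB/s, the 250 GB from the opening example works out to roughly 2.3 hours, while 3 TB works out to nearly 28 hours. Real restores also spend time locating and loading media and rebuilding the environment, so an estimate like this is a lower bound.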
When to recover
The decision to recover comes down to whether or not a company's data loss situation is critical and how much downtime it can afford. If there is not enough time to schedule the restore process, it is probably best to move forward with recovery. Recovery is also the best method if backups are too old or there is some type of corruption. The bottom line is, if other options are exhausted, it is best to contact a recovery company immediately. Some administrators try multiple restores or repairs before resorting to recovery and, in doing so, actually cause more damage to the data.
Through a series of interrelated system maintenance activities, Wolters Kluwer Corporate Legal Services (formerly CCH Legal Information Services) lost access to data stored on the company's NAS storage array. After a normal service call was opened with the manufacturer, it became clear that the loss was much more significant than originally thought. Due to network and other constraints, the data had not been backed up. After preliminary discussions, the manufacturer shipped the storage to a data recovery company, which recovered 100% of the data in only a couple of days.
Whatever best practices this company, or yours, may follow, one thing is clear: no matter how much time and money a company spends planning, creating and maintaining its storage environment, the complexity of storage means the threat of data loss remains.
In the end, the only answer to the question "How can data loss still happen, and what am I supposed to do about it?" is to ensure data recovery is included in your plan.
Jim Reinert is senior director of software and services at Ontrack Data Recovery. Reinert handles technology and business development, as well as product-line management of the recovery services and software business lines. He joined Ontrack Data Recovery, a subsidiary of Kroll, in 1987.