"Once the decision has been made to migrate from direct-attached storage to Fibre Channel [FC] fabrics, it's easy...
for data protection concerns to get lost in the details."
That's the advice of a senior SAN engineer and a LAN backup specialist at a federal agency, who asked (for reasons of agency policy) that their names be withheld.
Attending a recent Network Storage University event at the University of Maryland, the two men detailed their agency's migration from a Novell network to a Fibre Channel fabric interconnecting Windows 2000 servers with approximately 15 TB of HP/Compaq storage.
Why did they move to a new platform? "We were encountering scaling limits with our direct-attached storage. Storage was out of control and management was becoming ridiculous. Whenever new capacity was needed, we would just deploy another server, controller and shelf of drives. They were spread out all over several buildings on different floors, and there was no rhyme or reason guiding the use of LUNs for volume scaling."
Designing a SAN
The engineer reports that the original design of the SAN (FC fabric) took into account data protection requirements. For example, the decision was made early on to leverage the SAN for volume replication to a mirror-image data center approximately 100 miles away from the agency's headquarters. And, he added, they paid a lot of attention to creating secure domains that separate data by server and give authorized users access to select data.
But, he said, "It is easy to lose sight of your original strategy when you get overwhelmed by all the work going on underneath the fabric. Rolling out the SAN can reveal a lot of issues that have to do with interoperability and interdependency that don't necessarily show up in the plan. As you deal with them, it's possible to completely forget about backup, disaster recovery and encryption."
During the SAN implementation, device limitations created setbacks for the agency. "We found out at the last minute that we did not have enough controller connections to support the Disaster Recovery Manager [DRM] that HP had convinced us to buy," the engineer said. "To provide resiliency for our Windows environment, we were deploying our platform in a cluster with multi-bus failover mode. Each of our controllers provided 12 communications ports ('coms'), of which eight were used to cluster the storage, one was used for the SAN Appliance used to manage the configuration, and three were unused.
"DRM required four coms to do long-distance data replication between our HQ SAN and our backup facility SAN, but we only had three left. It was a design oversight that we should have caught. What's more, we submitted it to HP engineering for validation and they didn't catch it either. The result was that we had to put DRM implementation into limbo for now, even though it was the centerpiece of our data protection strategy when the SAN was first conceived."
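The shortfall the engineer describes is simple arithmetic, and it is the kind of check that can be run against a design before anything is purchased. The sketch below is illustrative only: the `PortBudget` class and its method names are hypothetical, and the only figures used are the ones the engineer cites (12 ports per controller, eight for clustering, one for the SAN Appliance, four required by DRM).

```python
from dataclasses import dataclass


@dataclass
class PortBudget:
    """Hypothetical helper: tracks committed ports on one controller."""
    total_ports: int
    allocations: dict  # purpose -> number of ports already committed

    def free_ports(self) -> int:
        # Ports left after subtracting every existing commitment.
        return self.total_ports - sum(self.allocations.values())

    def shortfall(self, required: int) -> int:
        # Ports missing for a new requirement; 0 means it fits.
        return max(0, required - self.free_ports())


# Figures taken from the engineer's account above.
controller = PortBudget(
    total_ports=12,
    allocations={"storage cluster": 8, "SAN appliance": 1},
)
drm_required = 4

print("free ports:", controller.free_ports())
print("ports short for DRM:", controller.shortfall(drm_required))
```

With these numbers the controller has three free ports and falls one short of what DRM needs, which matches the gap the agency discovered at the last minute.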
Vulnerabilities in SAN design
Based on this experience and several similar incidents, the engineer said he had arrived at his first rule of disaster-free SAN migration: "Do not rely on your vendors. They are salespeople in most cases and don't have a clue about technology." At the same time, he urges managers to pay attention to another vulnerability of the design process: deficient testing.
He continued, "Many problems we have encountered have stemmed from a lack of understanding, testing and validation by those responsible for acquiring technology for the SAN. In our shop, I have a good understanding of the interdependencies and 'interreactions' of SAN software and hardware components, mainly because I am working with them every day. Someone from the design group will call me and tell me that they have selected and tested product X to deploy in the SAN, but when I ask them whether they have considered the impact on other storage hardware or software, it is quickly apparent that they haven't."
SAN deployment and the domino effect
His second rule derives from this experience: "A lot of deployment activity tends to happen on the fly. You need to have someone who knows the SAN from the ground level advise you on product selection to avoid problems."
He also recommends that due diligence of "two to three hours" be done before deploying anything new or performing even seemingly minor tasks. He notes that Microsoft "crushed our SAN" recently, owing to a conflict between Service Pack 3 and the vendor's innocuous "checkdisk" utility. A well-intentioned storage administrator started checkdisk to search for disk errors and inadvertently changed security permissions for all servers and users of the SAN.
The engineer said, "It turned out to be a 'known issue' when I checked Microsoft's board. You needed for all Windows 2000 servers to be updated with the Service Pack 4 patch and you needed to quiesce all data in controller caches before running the checkdisk program. If you didn't do this, Windows 2000 saw a gaping hole in security and created a whole new profile for everyone. That is exactly what happened, and we are still digging ourselves out from underneath it."
As a third general rule, he said, "You need to know the impact of virtually anything you plan to do in a SAN. That requires a search on your operating system vendor's home page, your equipment vendor's home page, storage software vendor's support page, and numerous forum pages in advance of deploying anything or launching any application or utility. It takes me about two hours of on-line research -- my due diligence -- before I can be reasonably comfortable that whatever step I am about to take won't create a major disaster."
The backup process
His fourth rule is to "do a full volume copy of all data before you add anything to your SAN. I am not a big proponent of tape because of problems I have had with media in the past, but before I install anything new or re-host any data in our SAN, I always try to take a full tape backup, then make an additional disk-based copy or snapshot as well. It's just too easy for data to just disappear in a SAN due to a software or hardware conflict."
The fifth rule for avoiding a data disaster during a SAN migration, said the LAN backup manager, is to be cognizant of how your backup software "sees" data in the SAN, then architect the storage accordingly. "If you don't, we are finding that you end up with gigantic backup volumes that are of such enormous size, they are difficult and inefficient to restore."
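One rough way to see why oversized backup volumes are hard to restore is to estimate restore time from volume size and sustained throughput. The sketch below is a back-of-the-envelope illustration, not the agency's method: the function names are hypothetical, and the 30 MB/s throughput and eight-hour restore window are assumed values chosen purely for the example. Only the 15 TB total comes from the article.

```python
import math


def restore_hours(volume_gb: float, throughput_mb_per_s: float) -> float:
    """Estimated hours to restore one volume at a sustained throughput."""
    if throughput_mb_per_s <= 0:
        raise ValueError("throughput must be positive")
    seconds = volume_gb * 1024 / throughput_mb_per_s
    return seconds / 3600


def volumes_needed(total_gb: float, max_hours: float,
                   throughput_mb_per_s: float) -> int:
    """Smallest number of equal volumes so each restores within max_hours."""
    max_gb_per_volume = max_hours * 3600 * throughput_mb_per_s / 1024
    return math.ceil(total_gb / max_gb_per_volume)


total_gb = 15 * 1024        # roughly 15 TB, the capacity cited in the article
assumed_throughput = 30.0   # MB/s, an assumption for illustration only
assumed_window = 8.0        # hours, an assumed acceptable restore window

print(f"single volume: {restore_hours(total_gb, assumed_throughput):.1f} hours")
print("volumes needed:", volumes_needed(total_gb, assumed_window, assumed_throughput))
```

Real restore rates depend on the tape hardware, the backup software and the fabric, so the figures should be replaced with measured values before drawing any conclusions about how to carve up the storage.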
The LAN backup manager adds a cautionary note about security issues and data recovery in a SAN: "With Windows 2000, users can see every folder available by its name, regardless of their access permissions." Not only does this potentially disclose privileged information (you may be able to see, for example, that the legal department has a folder named "Bank XYZ" – indicating an investigation of that bank), it may also create havoc for data recovery.
"We had an incident recently where an end user apparently moved his files to a protected volume. I don't know how the user dropped the files in the wrong folder. He probably had rights to both folders. But, other users of the files were in different groups and not all have rights to all folders. The user went to retrieve the files from the folder where he thought he had saved them, and called us when he couldn't find them anywhere."
"The files were found by searching the backup histories for file names the owner provided. Once these were located on tape, I was able to enlist the aid of a user with access permissions to locate them on the server. We moved them back to their original location and restored their permissions. Had the files been restored from tape there would have been duplicate copies on disk, with unauthorized users having rights to many files and folders.
"The fact that we violated security to find the files created quite an uproar. In this case, there was over 700 MB of disk space at stake, there was potential confusion from having extra copies of files on disk, and there were security risks due to having the extra files in the wrong folders. So, we broke with normal help desk procedures to get the files back." Said the backup manager, this was just another example of how still-maturing SAN security technology might impact data backup and restore.
Bottom line: Migrating to networked data storage technologies is fraught with problems. Smart managers recognize the risk to data and take measures to insulate it from loss or unauthorized disclosure while the bugs are being worked out. Once the infrastructure has stabilized, a new chapter in data insecurity begins. As a former NSA official, now professor emeritus at a prominent New England university, has said, "The only way to keep data safe in a network is not to put it there."
About the Author: Jon William Toigo has authored hundreds of articles on storage and technology along with his monthly SearchStorage.com Toigo's Take on Storage column and backup/recovery feature. He is a frequent site contributor on the subjects of storage management, disaster recovery and enterprise storage.