Many vendors claim that their microcode updating process is "non-disruptive," but, all too often, these upgrades are far from seamless.
All vendors tout "non-disruptive upgrades," adding new code to hardware or software without interrupting the applications running on the platform or the applications' access to data. But when your vendor says the upgrade to your storage system's microcode is non-disruptive, beware. All too often, installing upgrades is an arduous process. That's why storage managers routinely perform microcode upgrades on weekends or late at night--so they have time to deal with glitches that could cripple production systems.
"As long as I don't expect to apply upgrades in a non-disruptive fashion, I have yet to be disappointed," says Karl Lewis, storage administrator at the University of Michigan's College of Engineering in Ann Arbor, which is an EMC Corp. Clariion CX700 user. "Our standing plan is to no longer use the non-disruptive upgrade path, and to simply schedule a few hours of downtime for it."
Operating system upgrades
Microcode is the firmware--or embedded software--that drives the hardware. It resides in special memory on arrays and switches, and functions as an operating system for the managed hardware. Upgrading this software is like upgrading a server operating system such as Windows. Simple patches to the operating system are easy to deploy and usually non-disruptive, but major revisions are complex and require careful consideration. "With all the appropriate preparation," says Arun Taneja, founder and consulting analyst at Hopkinton, MA-based Taneja Group, "there's still a moment when you close your eyes and pray."
|Microcode upgrade dos and don'ts|
A product's microcode is critical to performance because it contains all the rules and definitions of how the hardware should work. The most common reasons for a code upgrade are to make a piece of hardware work with a new standard and to deliver a "next-generation" feature.
A classic example of an OS on a never-ending upgrade cycle is EMC's Clariion midrange array, which is well over a decade old. It still runs Flare, its original operating system. Steve Duplessie, founder and senior analyst at the Milford, MA-based Enterprise Strategy Group (ESG), estimates that there have been "between 30 to 50 revs" of Flare, not including patches and minor releases.
Upgrading versions of Flare released after 2001 but prior to Flare 14, which shipped in July 2004, is a non-disruptive process as long as the user applies each release without skipping any. Since Flare 14, however, users have been able to skip from 14 to 16, or 14 to 19 without loading interim releases. (There's no version 15 or 18, and version 17 specifically enables iSCSI support.)
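The skip rule described above can be sketched in a few lines of Python. This is only an illustration, not EMC tooling: the release list is an assumption built from the versions the article names (14, 16, 17 and 19, with no 15 or 18), the pre-14 entries are placeholders, and the function names are hypothetical.

```python
# Hypothetical release list. The article names 14, 16, 17 and 19 and says
# there is no 15 or 18; the entries before 14 are placeholders for illustration.
RELEASES = [11, 12, 13, 14, 16, 17, 19]
SKIP_SAFE_FROM = 14  # first release from which skipping is described as non-disruptive


def next_release(current):
    """Return the release that directly follows `current`, or None."""
    idx = RELEASES.index(current)
    return RELEASES[idx + 1] if idx + 1 < len(RELEASES) else None


def is_non_disruptive(current, target):
    """Apply the rule described in the article to a single upgrade step."""
    if current not in RELEASES or target not in RELEASES:
        raise ValueError("unknown release")
    if target <= current:
        return False  # not an upgrade
    if current >= SKIP_SAFE_FROM:
        return True  # interim releases may be skipped
    return target == next_release(current)  # older code: one release at a time


def plan_upgrade(current, target):
    """List the steps needed to reach `target` without a disruptive jump."""
    if current not in RELEASES or target not in RELEASES or target < current:
        raise ValueError("invalid upgrade request")
    steps = []
    while current != target:
        step = target if is_non_disruptive(current, target) else next_release(current)
        steps.append((current, step))
        current = step
    return steps
```

Under these assumptions, a site on release 12 that wants release 19 would step through 13 and 14 in order and could then jump straight to 19, which is the kind of path EMC says it now has to test.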
"We realized that not all customers can run off and do an upgrade, [that] there are times when they fall behind [or] they want to skip releases and still have a non-disruptive upgrade," says Barry Ader, senior director of Clariion marketing at EMC.
EMC added code to the software that would help change how the data was laid out, a process that previously took several ordered steps and resulted in the enforced sequential upgrade. The ability to let users skip releases means more work for EMC in the testing process because the company has to support a larger number of potential upgrade paths. "The simple way was to make users go through the logical steps A to B to C," says Ader. "The more options, the larger the investment on our part."
However, EMC realizes that many of its customers--and there are tens of thousands of sites running multiple Clariions--don't keep up to date with releases. "Not all those customers have upgraded to the current code ... if [an older version is] stable, they may not want to go through an upgrade," says Tom Joyce, vice president and general manager of IP storage at EMC. He believes that because many users have been burned in the past with disruptive upgrades, some EMC customers still have upgrade policies that don't reflect the improved state of today's upgrade technology.
In January 2005, Steve Schaub discovered a nasty bug in some RAID 5 software on his previous employer's EMC Clariion CX500. To reduce the amount of tape in its Tivoli Storage Manager backup environment, the company installed a CX500 for disk-based backup. When they kicked off the migration from tape to disk, the CX500 problems began. "EMC [told] us a disk in our RAID 5 set had gone bad; then, during the rebuild, a second disk failed," says Schaub. The firm had to back up Oracle database logs to the backup server and the server was down all weekend. "It was really close to a crisis situation," adds Schaub, who's now a systems engineer for backup and recovery at BlueCross BlueShield of Tennessee.
IBM's DS8000 bug
One notable microcode upgrade story involved problems IBM Corp. users encountered with early models of the DS8000 storage array in 2005. SearchStorage.com reported last August that IBM's newly shipping flagship array, which can cost more than $1 million, contained a bug that automatically shut down the system without warning every 49 days. IBM quickly released microcode Revision 22.214.171.1248 to fix the problem. However, users in Europe and North America reported major outages due to the fix being a disruptive upgrade.
A large North American clothing retailer, which prefers to remain anonymous to avoid any conflict, received the bug fix from IBM before it was affected by the problem, but for several months was unable to block off a large enough window of time to apply the patch. The company relied on outsmarting the bug by rebooting the system every six weeks before the counter timed out. With only a few hosts connected to the DS8000, rebooting the array took less than an hour. "It's been manageable, but certainly not seamless," says the user. "IBM's testing was crazy. They know that every customer on the planet expects to keep these things up forever" without requiring a reboot.
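The arithmetic behind the retailer's workaround is simple enough to sketch. The 49-day shutdown interval and the six-week reboot cycle come from the accounts above. The 32-bit millisecond counter is purely an assumption added for comparison: counters of that kind are known to wrap after roughly 49.7 days, but nothing here confirms that the DS8000 bug worked that way.

```python
from datetime import timedelta

# From the article: the array shut down every 49 days, and the retailer
# rebooted every six weeks to stay ahead of it.
SHUTDOWN_INTERVAL = timedelta(days=49)
REBOOT_INTERVAL = timedelta(weeks=6)

# Assumption (not confirmed by IBM): a 32-bit counter that ticks once per
# millisecond. Such a counter wraps after 2**32 milliseconds.
ASSUMED_WRAP = timedelta(milliseconds=2**32)


def safety_margin(limit, reboot_every):
    """Time left on the clock when each scheduled reboot happens."""
    return limit - reboot_every


print(safety_margin(SHUTDOWN_INTERVAL, REBOOT_INTERVAL).days)  # 7
print(round(ASSUMED_WRAP / timedelta(days=1), 2))  # 49.71
```

A six-week cycle leaves a one-week cushion, which matches the user's description of the workaround as manageable but not seamless.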
The firm eventually found a long enough window to apply the fix. IBM estimated it would take between seven and 12 hours to patch the code, but it took only approximately three hours. IBM provided the higher estimate as a cushion in case something went wrong, says the user. Since applying the upgrade, "everything is good, knock on wood," he says.
On the other hand, the initial bug was enough to scare a U.S. utility company off the system altogether. "We decided not to go with the DS8000 because of the code problems; we didn't feel safe," says Chris Andriano, storage administrator at the company. It was planning to migrate from two IBM 2105 "Shark" arrays to the DS8000, but went with an EMC Corp. DMX2000 instead. "We would still consider IBM in the future," says Andriano. "In 12 months, it'll be a more stable, more viable product." The company uses third-party storage management software from StoreAge Networking Technologies Inc., so it isn't dependent on a single disk vendor. "We can go from IBM to EMC effortlessly and back again if we choose," says Andriano.
IBM acknowledged that the clock problem occurred in a small number of early systems, but declined to explain why. However, it did e-mail us the following response: "IBM is able to continuously and incrementally improve the field quality of its storage products through regular upgrades to higher levels of microcode. For IBM's external disk systems like the DS8000, DS6000, DS4000 and ESS, this upgrade process is non-disruptive to customers' normal production usage of the product."
In the CX500 incident at Schaub's previous employer, EMC engineers eventually determined that the problem was caused by a microcode bug that tricked the RAID 5 software into seeing a "phantom failure on the second disk during the rebuild," says Schaub. To compound the problem, Schaub and his team didn't know if the bug had caused the first disk to fail as well. The company lost a 900GB logical unit number (LUN). Schaub had also been running a pre-release version of the latest CX500 firmware. "I'll never do that again," he says.
Another Clariion user, Kelly Carpenter, senior technical manager at the Genome Sequencing Center at the Washington University School of Medicine in St. Louis, is sick of hearing about bug fixes for the Clariion microcode. Carpenter says that going back two years, his configuration included 146GB disks that were supposed to be a hot spare for both 73GB disks and 146GB disks. If a 73GB disk failed, the 73GB hot spare was supposed to take over transparently with no loss of service. The 146GB hot spare was supposed to function the same way for the 146GB disks.
When Carpenter had a second 73GB disk fail, the 146GB hot spare attempted to take over, but it failed and the disk array "panicked and crashed with no data loss, but complete loss of service," he says. The ability to use both disk sizes for hot spares was a standard feature in the Flare code, says Carpenter, but it didn't work on several occasions. EMC and partner Dell Inc. later admitted that it was a bug in the Flare code that would require an outage to fix.
The problem reappeared a year ago, requiring another Flare code upgrade that solved that issue but produced "data access issues" and another upgrade, says Carpenter. "I'm told that the new code will fix everything," he says. "Even if it's true, I have a hard time believing it ... two years' worth of experience says different." He says the reliability problems are due to EMC not testing the code in a wide enough variety of scenarios. Jay Krone, EMC's director of Clariion platform marketing, adamantly disagrees. "We test an awful lot of scenarios," he says. He described current interoperability matrices as being as thick as "the Manhattan Yellow Pages."
Large number of fixes
EMC says major upgrades add support for new hardware or add new features and functionality, while patches deliver functional improvements. On average, vendors will issue two major releases, approximately eight minor releases and up to 20 patches a year, says Ashish Nadkarni, senior consultant at Framingham, MA-based GlassHouse Technologies Inc., adding that "it's often more than this."
Patches can be administered non-disruptively, claims EMC, but when it comes to upgrading code on the Clariion, the firm's Krone notes that because the array is a two-board system, you're required to upgrade one processor at a time. "There is a blip in performance as half the resources are offline," he says. EMC recommends that customers upgrade at a time when the load is lower so that overall performance isn't affected.
To help make upgrades less complicated, EMC ships PowerPath/SE with the Clariion. This supports back-end failover for users who don't have dual host bus adapter (HBA)-equipped servers, but still require failover capabilities for their Clariion. The upgrade process works as follows: Processor A is brought off-line and the LUNs are failed over to Processor B (or failed over using PowerPath/SE); the upgrade is done on Processor A, which is then brought back online. The process is repeated for Processor B. PowerPath/SE provides the hosts with continuous access to the LUNs.
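The two-processor sequence described above can be modeled as a small simulation. This is a sketch under stated assumptions, not EMC's or PowerPath's actual logic: the class and function names are hypothetical, the code load is a stand-in assignment, and the step that fails LUNs back to the upgraded processor is an assumption, since the description only says the processor is brought back online.

```python
class StorageProcessor:
    """Hypothetical model of one of the array's two storage processors."""

    def __init__(self, name, version, luns):
        self.name = name
        self.version = version
        self.luns = set(luns)
        self.online = True


def rolling_upgrade(sp_a, sp_b, new_version):
    """Upgrade two processors one at a time, keeping every LUN reachable."""
    log = []
    for target, peer in ((sp_a, sp_b), (sp_b, sp_a)):
        if not peer.online:
            raise RuntimeError(f"{peer.name} is offline; upgrade would cut access")
        moved = set(target.luns)
        # Fail the LUNs over to the peer before taking the target offline.
        peer.luns |= moved
        target.luns.clear()
        target.online = False
        log.append(f"{target.name} offline, {len(moved)} LUNs on {peer.name}")
        target.version = new_version  # stand-in for the real code load
        target.online = True
        # Assumed step: fail the LUNs back once the processor has returned.
        peer.luns -= moved
        target.luns |= moved
        log.append(f"{target.name} online at {new_version}")
    return log
```

While one processor is offline, the peer owns every LUN, which is the "blip in performance" Krone describes: access continues, but on half the resources.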
Tier-1 upgrades more stable
The upgrade process is a little more seamless and stable with Tier-1 storage. For example, each CPU in EMC's Symmetrix DMX (there can be as many as 100) is upgraded one at a time. PowerPath/SE assists with the same failover in the Symmetrix as in the Clariion, except that new versions of the operating environment are loaded CPU by CPU within the timeout limits of the host OS in a matter of seconds.
Joe Meyer is the senior storage architect at Level 3 Communications Inc., a 350TB EMC shop in Broomfield, CO. Two years ago, he notes, the timeout issue wasn't handled quite as smoothly and the EMC Symmetrix 68 code "could be problematic." There were timeouts of 20 to 30 seconds, which were potential risks to his Oracle applications. To avoid the problem, Level 3 "cycled through the arrays, upgrading them one at a time offline," he says.
Since then, says Meyer, EMC has improved the process considerably. The current code is well within the timeout values for a major revision change. "Once a quarter, we patch the Symms and it goes without a hitch," he says. As a rule, Level 3 schedules EMC to perform upgrades in off-business hours. Meyer says his company doesn't implement every patch, preferring to wait for bundled revisions to limit the number of times he has to fiddle with the environment. "We patch only when we're experiencing a serious problem," he says. The latest version of the Symmetrix OS, called Enginuity and known internally as 71 code, has been available for more than a year, but Level 3 won't be installing it until the first quarter of this year. "The value of those new features didn't outweigh the potential risk of bugs in the new code," says Meyer.
3PAR supports non-disruptive code loads
Because of the nature of its operations, Factiva, a Dow Jones & Reuters Company, has little tolerance for downtime. As a result, its Hewlett-Packard (HP) Co. Enterprise Virtual Array (EVA) products had to be augmented with something that would support non-disruptive code loads.
|Understanding upgrades: What to ask vendors|
"Our business is a 24/7 global operation; there's always someone, somewhere, working," says Karin Borchert, chief operations officer at Factiva. The company turned to 3PAR Inc., Fremont, CA, purchasing three of the startup's InServ Storage Servers with high-availability features that include non-disruptive firmware upgrades. "We did move some of our critical data that required high availability from the EVA 5000s to the 3PAR," the company said in a written statement.
3PAR's InServ includes eight clustered controllers that communicate with each other in a mesh-like architecture. As with high-availability server clusters, a portion of the 3PAR cluster can be upgraded while the other portion carries the workload. In addition, the data is striped across several hundred drives and multiple CPUs can be working on a single volume. This differs from the EVA and other traditional arrays that lay out data in splotches that are heavily dependent on a particular portion of the array.
"Simply put, we have different storage requirements for different applications, and the HP EVA 5000s satisfies some, while 3PAR satisfies others," says Diane Thieke, Factiva's director of global public relations.
HP offers a high-end storage array touted to support non-disruptive code upgrades. Dubbed the XP12000, it's a re-badge of Hitachi Data Systems' (HDS) TagmaStore Universal Storage Platform. Jacob Roersma, storage administrator at Priority Health in Grand Rapids, MI, is an XP12000 user and confirms that upgrades are non-disruptive.
There's "a rigorous process to make sure we stay in support with HBA drivers, tape libraries, etc.," says Roersma. "They [HP] run our environment through a matrix and then recommend what revisions we should be on." Despite the smooth upgrades, Priority Health maintains a dual-attached architecture that fails over from one controller to the other when upgrading.
Interoperability and testing hurdles
Despite vendor claims of non-disruptive upgrades, it's painfully clear from these accounts that many users suffer operational disruptions when loading new microcode. ESG's Duplessie lays the blame squarely on vendors not performing enough regression testing of production environments. "When it works in the lab, it doesn't always work in the real world with all the interdependencies that exist," says Duplessie. "Users should never assume that non-disruptive means non-disruptive to them, only to the vendors."
To be fair, vendors would never introduce a new product if they tried to test it against every possible scenario. "The reality is that with thousands of customers, all with unique environments, we test as much as we can but we can't get everything," says Chris Bennett, senior director of product management at Network Appliance Inc.
And some vendors readily acknowledge that early product adopters are guinea pigs. In fact, HDS customers are contractually precluded from running beta code in their production environment. "Most customers hold to that," says Claus Mikkelsen, chief scientist at HDS. "They get special attention, insight and education from the vendor, and they get to deploy it a lot sooner," which is why many take the risk.
Upgrading directors
According to Duplessie, core switches are updated more frequently than anything else these days as more intelligence moves into the fabric. Cisco Systems Inc.'s MDS directors support what in networking is called hot-code activation. A close read of Cisco's configuration guides reveals that its directors have hot-code activation for Fibre Channel ports, but not for IP ports. IP Storage Services modules use a rolling-upgrade install mechanism where each module in a given switch can only be upgraded in sequence. To guarantee a stable state, each IP Storage Services module in a switch requires a five-minute delay before the next IP Storage Services module is upgraded.
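To see what a strictly sequential upgrade with a five-minute settling delay means for a maintenance window, consider the rough sketch below. Only the five-minute delay comes from the description above. The per-module upgrade time, the slot names and the function names are hypothetical, and this is not how Cisco's installer is implemented.

```python
from datetime import timedelta

SETTLE_DELAY = timedelta(minutes=5)  # delay between modules cited in the article


def rolling_schedule(modules, per_module):
    """Return (module, start offset) pairs for a strictly sequential upgrade.

    `per_module` is a hypothetical estimate of how long one module takes.
    """
    schedule = []
    offset = timedelta(0)
    for module in modules:
        schedule.append((module, offset))
        offset += per_module + SETTLE_DELAY
    return schedule


def total_time(modules, per_module):
    """Elapsed time for the whole switch; no delay follows the last module."""
    n = len(modules)
    if n == 0:
        return timedelta(0)
    return n * per_module + (n - 1) * SETTLE_DELAY
```

With three modules and an assumed 10 minutes per module, the settling delays alone add 10 minutes, for a total of 40. The window grows linearly with the number of modules because none of the work can overlap.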
Similarly, Mario Blandini, product marketing manager at Brocade Communications Systems Inc., acknowledges that during a non-disruptive firmware upgrade there's a short pause in management operations (e.g., new devices being authenticated) while new software is activated. "Switching traffic continues to flow throughout the activation and management operation, and processing resumes once the activation is completed," says Blandini. He likens non-disruptive code upgrades to "keeping people awake during brain surgery ... it's possible today," but still risky.
Still, the impact of taking hundreds of ports offline for an upgrade is unacceptable to many organizations. And the consensus among analysts is that as the world continues to move toward 24/7 operations, truly non-disruptive upgrades will become mandatory.