Clustering ERP apps

For mission-critical apps, availability is key. Clustering those applications helps keep them up and running, but clustering often conjures up images of complex technologies and a fragile environment. Still, for most companies, the benefits of clustering are significant enough to outweigh its risks.

Clustering techniques increase the availability of mission-critical ERP apps, but they're sometimes complex to set up and manage.

Some IT shops cringe at the thought of introducing "clustering" into their data centers. The term conjures up images of complex technologies interwoven to create an environment that's too fragile to be touched and too complicated to be understood. Yet some organizations have found that clustering enables their IT staff to spend nights at home and weekends at the beach. What is it about clustering that makes it a risk that some IT organizations are willing to take?

It's all about availability. Some companies need their enterprise resource planning (ERP) applications to be up and running 24/7. Clustering an IT infrastructure can increase availability and help IT make changes with less risk of application downtime. However, the availability of a business function isn't achieved through the implementation of a single product; rather, increased availability is woven into the fabric of the data center. There are common and not-so-common techniques to increase an application's availability through clustering. But while clustering can increase the availability of an application, it doesn't necessarily increase the availability of a business function.

How long until the ERP app is up?
It's a familiar refrain: The business unit wants to know how long the application will be unavailable if it sustains a failure, and IT's standard answer is "It depends." Why? The time required depends on the type of clustering, the operating system, the time needed to return the database to a stable state and other factors. When an active-passive cluster sustains a failure, several steps are required to prepare the passive server to activate the application:
  1. The passive server activates access to the storage through the operating system. This action is similar to the action taken during a normal boot, but the only disks activated are those associated with the database application. Journaling techniques reduce the time to activate.
  2. The passive server then starts the database application on its node. Because the disk has now been activated to the operating system, the database application loads in memory and begins to check the status of the database tables on the recently activated disks.
  3. The database application now running on the passive server performs recovery of the tablespace on what it perceives as a power failure.
  4. Once the tablespace is back to usable form, the passive server associates the floating ERP IP with its internal NIC and begins to service user requests.
When an active-active cluster has a database node fail, several of the above steps aren't required. Because the database application is already running in memory on multiple nodes and the database application has access to the disk, users won't perceive any downtime. There's no requirement to restart the app in memory, or to activate the disk to the operating system and then to the database app. While this certainly increases application uptime during a failover, distance failover isn't possible and there are some restrictions on performing upgrades.
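The four active-passive steps listed above run one after another, so a rough outage estimate is the sum of their durations plus the time the cluster takes to notice the failure. The Python sketch below illustrates that arithmetic. The durations are invented for illustration and should be replaced with measurements from your own environment; the names `FailoverStep` and `estimated_outage` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class FailoverStep:
    name: str
    seconds: float  # hypothetical duration; measure this in your own cluster


# Illustrative values only; they are not taken from any real system.
ACTIVE_PASSIVE_STEPS = [
    FailoverStep("activate database disks on the passive server", 30),
    FailoverStep("start the database application", 60),
    FailoverStep("recover the tablespace", 180),
    FailoverStep("bind the floating ERP IP to the NIC", 5),
]


def estimated_outage(steps, detection_seconds=0.0):
    """Return failure-detection time plus the sum of the sequential steps."""
    return detection_seconds + sum(step.seconds for step in steps)


if __name__ == "__main__":
    total = estimated_outage(ACTIVE_PASSIVE_STEPS, detection_seconds=25)
    print(f"Estimated outage: {total:.0f} seconds")
```

In an active-active cluster, most of these steps drop out of the list, which is why users may not perceive any downtime.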

Business function vs. ERP app
In the 1990s, I was part of a team that implemented a clustered SAP environment using the latest techniques available at the time. We were confident that if we sustained a server failure, we'd be able to keep the SAP application, along with the central instance, up and functional. Users might see a pause of the application, but it would be up and running in minutes.

Several months into our implementation we sustained a server failure, and our SAP instance moved as planned from the production server to the failover server. Before the IT team could celebrate its success, however, the business unit reported that the company wasn't able to accept Electronic Data Interchange (EDI) orders. The EDI application wasn't communicating with the active SAP application--a significant problem because 85% of the company's orders were received electronically.

Though we had carefully protected the SAP application, the SAP central instance and the underlying Oracle database, we failed to protect the business function of taking and processing an order. Most business functions rely on several applications that must also be protected to increase the availability of what's important to the business.
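The lesson can be expressed as a simple dependency check: a business function is available only if every application it relies on is available. The Python sketch below is a hypothetical illustration, not part of any clustering product; the application names and the `business_function_available` helper are assumptions made for the example.

```python
def business_function_available(required_apps, app_status):
    """Return (is_available, missing_apps) for one business function.

    required_apps: names of the applications the function depends on.
    app_status: mapping of application name -> True if it is up and reachable.
    An application that is absent from app_status is treated as down.
    """
    missing = [app for app in required_apps if not app_status.get(app, False)]
    return (not missing, missing)


if __name__ == "__main__":
    # Hypothetical dependency list for taking and processing an order.
    order_processing = ["SAP central instance", "Oracle database", "EDI gateway"]

    # State after the failover described above: SAP and the database moved
    # successfully, but EDI could no longer reach the active SAP application.
    status = {
        "SAP central instance": True,
        "Oracle database": True,
        "EDI gateway": False,
    }
    print(business_function_available(order_processing, status))
```

Listing dependencies per business function, rather than per application, makes gaps like the unprotected EDI link visible before a failure exposes them.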

For an application to provide data to users, three elements must work in harmony: the user request must traverse the network to the correct subnet; an application must be running at the IP address that answers the request; and the application must have access to the underlying data. Clustering software controls these aspects of responding to a user request.

In the case of many ERP applications, "load balancing" is built into the application architecture, which gives the ERP application additional scalability. ERP applications, such as those from Oracle and SAP, are architected to support multiple "application servers." The application servers can respond to a user request, but because there are multiple servers at this layer and they don't have access to the data, these servers aren't single points of failure and therefore aren't clustered.

Each application server must communicate with a database server. The database server and underlying storage are considered single points of failure. Because clustering is all about availability, ERP clustering activities are focused on the database server (see "How long until the ERP app is up?"). The goal is to eliminate the single points of failure of the database server and its underlying storage.

VMware and clustering
A discussion of clustering isn't complete without talking about VMware, which is becoming a standard in both large and small data centers. The latest version, VMware Infrastructure 3 (VI3), has some interesting clustering characteristics. This article discusses the virtualization of an application (via its IP address); VMware's approach is to virtualize the operating system along with the application. The entire virtual machine (VM) is actually packaged into a couple of physical files on the storage device. Today, with the ability to share storage among multiple servers, any VMware Server (ESX Server) that has access to those files can use VMotion technology to start the VM.

When a user sends a request, something (an application, operating system or server) must answer the request at the correct IP address running the correct application with access to the appropriate data. VMotion does this at the operating system level rather than at the application level. In a VMotion scenario, two ESX Servers have physical access to the VM files on a storage array. Either ESX Server could run the VM and answer the user request.

When a system administrator initiates a VMotion migration, VMware begins to log all activity against the machine. It also has to move a memory map of the first machine to the second machine. With the memory map migrated and the logged changes applied to disk, the second machine can adopt the IP address and the application has moved. If this is a planned migration, the application moves without incident. If it's a true failure on the first server, the memory map is missing. In that case, the ability to successfully start the VM and the application inside it depends on how resilient both are to a complete power failure.

Today, many of the larger ERP vendors don't support running their application within a virtual session. In addition, many of the larger ERP vendors recommend a full 64-bit server architecture for the ERP database server. VMware isn't available for those platforms. But as CPU power becomes more plentiful (quad-core is around the corner), we may see a shift of ERP database servers running in a VMware cluster. Combined with a continuous mapping of memory between two ESX Servers, this may become a more accepted method of clustering ERP databases.

Database, system, storage and network admins play a part in increasing the availability of an application. Clustering typically refers to the leveraging of duplicate infrastructures to increase the availability of an application or a group of applications.

When a user requests data about an order that was placed in the ERP system, as long as the right application is operating at the requested IP address with access to the data, the request will be acknowledged and serviced. Clustering software like Hewlett-Packard (HP) Co.'s MC/ServiceGuard, IBM Corp.'s High-Availability Cluster Multiprocessing (HACMP) and Microsoft Corp.'s Windows Compute Cluster Server 2003 allow a server to have multiple IP addresses. In a clustered environment, the mission-critical ERP application will have an IP address that's independent from the hardware IP address. Users send requests to an application's IP address, not to the server's IP address. By separating--or virtualizing--the application's IP address from the hardware, the need to have a specific server running to keep the application available is eliminated (see "VMware and clustering"). This is the first step to ensure that the database layer isn't a single point of failure for the mission-critical application.
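To make the idea of an application IP address concrete, the Python sketch below models a cluster in which each server keeps its own hardware address while a single application address floats to whichever server currently runs the ERP database. It is a toy model under stated assumptions, not the behavior of any particular clustering product; the `Cluster` class, node names and IP addresses are all hypothetical.

```python
class Cluster:
    """Toy model of a floating application IP shared by several servers."""

    def __init__(self, node_ips, app_ip):
        self.node_ips = dict(node_ips)  # server name -> hardware IP address
        self.app_ip = app_ip            # floating IP that users connect to
        self.owner = None               # server currently holding the app IP

    def assign(self, node):
        """Move the application IP to the given server."""
        if node not in self.node_ips:
            raise KeyError(f"unknown node: {node}")
        self.owner = node

    def addresses(self, node):
        """Return every IP address the given server currently answers on."""
        ips = [self.node_ips[node]]
        if node == self.owner:
            ips.append(self.app_ip)
        return ips

    def route(self, ip):
        """Return the server that answers a request sent to this IP, if any."""
        if ip == self.app_ip:
            return self.owner
        for node, hardware_ip in self.node_ips.items():
            if hardware_ip == ip:
                return node
        return None


if __name__ == "__main__":
    cluster = Cluster({"db1": "10.0.0.11", "db2": "10.0.0.12"}, app_ip="10.0.0.50")
    cluster.assign("db1")
    print(cluster.route("10.0.0.50"))  # db1 answers ERP requests
    cluster.assign("db2")              # failover: only the app IP moves
    print(cluster.route("10.0.0.50"))  # db2 answers the same address
```

Users keep sending requests to the same address before and after the move, which is why no specific server has to stay up for the application to remain reachable.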

Controlling server access to storage
Because a user request is going to a virtual IP address associated with an application rather than a server, the actual server that's "actively" running the application needs to have access to and control of the underlying storage. And because there are multiple servers that could be running the application and possibly answering the request, the servers' access to the storage must be controlled.

The storage administrator has to leverage all of the spindles and array controllers to ensure database performance, as discussed in the first article of this series (see "Configuring storage for ERP," Storage, December 2006). Once that's complete, the physical disks that house the tablespace, archive files and log files that make up the ERP application's data must be presented to every node in the cluster. SANs, of course, connect the physical data to multiple servers in the cluster. How the storage is accessed, and which server is actively in control of the storage (and when), is typically a function of the server's operating system and its clustering technique, rather than a function of the storage array software.

Active-passive/active-active clusters
Methods of controlling access to the storage vary greatly. In an active-passive cluster, one server answers all user requests and the second (passive) server waits to take over if required. The second server can do other things, but it must be prepared to be repurposed immediately to service user requests if the need arises. The clustering software enables the passive server to constantly check on the active server. This is typically done through multiple network connections and is referred to as the heartbeat between the two servers. Certain cluster-based configuration parameters can dramatically change the timing of a cluster failover.
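A heartbeat check can be sketched in a few lines of Python. In this minimal illustration, the passive server declares the active server failed once more than a configured number of heartbeat intervals pass without a signal. The parameter names `interval_s` and `missed_limit` are hypothetical stand-ins for the kind of cluster configuration settings that change failover timing; real clustering products use their own parameters and typically monitor several network paths.

```python
class HeartbeatMonitor:
    """Toy heartbeat check run by the passive server."""

    def __init__(self, interval_s=1.0, missed_limit=3):
        self.interval_s = interval_s      # expected seconds between heartbeats
        self.missed_limit = missed_limit  # missed beats tolerated before failover
        self.last_seen = None             # timestamp of the latest heartbeat

    def record(self, now):
        """Note that a heartbeat arrived at time `now` (in seconds)."""
        self.last_seen = now

    def detection_time(self):
        """Silence, in seconds, that must be exceeded before failure is declared."""
        return self.interval_s * self.missed_limit

    def peer_failed(self, now):
        """Return True if the active server has been silent for too long."""
        if self.last_seen is None:
            return False  # no heartbeat seen yet, so there is nothing to judge
        return (now - self.last_seen) > self.detection_time()


if __name__ == "__main__":
    monitor = HeartbeatMonitor(interval_s=1.0, missed_limit=3)
    monitor.record(now=100.0)
    print(monitor.peer_failed(now=102.0))  # still within tolerance
    print(monitor.peer_failed(now=103.5))  # silent for too long
```

Lowering either parameter shortens detection time but raises the chance of a false failover caused by a brief network interruption, which is the trade-off behind tuning these settings.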

In an active-active cluster, more than one server can respond to user requests and more than one server can access the storage to retrieve or write the data. In an active-active cluster, either the database application or the operating system must determine who has control of the storage at that moment, allowing access to storage to move back and forth between servers with every request. Oracle Real Application Clusters (RAC) is a form of active-active clustering in which the Oracle application manages all storage access from multiple servers. RAC enables every database server in the cluster to process user requests and allows every server to access the underlying storage. The servers must be able to communicate quickly; in many cases, high-speed server interconnects like InfiniBand are used to keep the independent servers and their respective operating systems acting in unison.

A clustered database server architecture basically has an IP address that can move between servers, and the ability to control access to the data. This means that an entire layer of the production model can fail and the application will still be available. However, the business unit that depends on the application needs to know how long the application will be down if the database server fails.

A significant benefit of clustering an ERP app is a reduction in planned downtime. While this is never the reason companies initiate a clustering project, it's the most frequently realized benefit IT organizations enjoy. Clustering allows maintenance activities to occur outside the critical path of ERP app downtime.

Upgrading servers and arrays
Due to rapid advances in server and storage technology, IT organizations usually upgrade their critical SAN components every three to five years. Combined with the need to patch operating systems and upgrade storage array microcode, this can result in significant planned downtime for critical ERP applications. In a nonclustered environment, IT organizations agonize over every operating system patch or server hardware upgrade because the ERP application has to be stopped during the maintenance upgrade. If the patch or upgrade doesn't work, the changes have to be backed out before the ERP application can be restarted.

Companies that have operated an ERP environment for 10-plus years have typically had to change their database server three or four times to reduce cost, increase performance and maintain a vendor-supported environment. In a clustered environment, the ERP application can be moved from one database server to another (from the active to passive node) fairly easily. Once moved, the patches to the database server can be installed and tested while users' requests are still being serviced. The application might be down during the migration, but that controlled migration is much less risky because all of the maintenance work can be done outside the critical data path.
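The difference in planned downtime can be shown with simple arithmetic. The Python sketch below compares the two cases described above: without a cluster, the ERP application is down for the whole patch, test and possible backout; with a cluster, it is down only while it moves between nodes. The function name and all durations are hypothetical, chosen only to illustrate the comparison.

```python
def planned_downtime_minutes(patch, test, switchover, clustered, backout=0):
    """Estimate ERP downtime, in minutes, for one maintenance event.

    Nonclustered: the application is stopped for the patch, the test and any
    backout. Clustered: the application is down only during the controlled
    move to the other node; the patch work happens outside the data path.
    """
    if clustered:
        return switchover
    return patch + test + backout


if __name__ == "__main__":
    # Illustrative durations only; substitute your own measurements.
    print(planned_downtime_minutes(patch=90, test=60, switchover=10, clustered=False))
    print(planned_downtime_minutes(patch=90, test=60, switchover=10, clustered=True))
```

If the application is later moved back to its original node, a second switchover should be added to the clustered figure.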

Companies spend time and money to cluster their ERP database server, but most still operate their ERP application on a single storage array. While storage arrays don't have to be patched as often as operating systems, microcode upgrades can cause ERP application downtime. Large storage arrays (EMC Corp.'s Symmetrix DMX-3, HP's StorageWorks XP and Hitachi Data Systems Corp.'s Universal Storage Platform, for example) are better suited to tolerating live microcode upgrades, which can significantly increase your ERP applications' availability. Midrange arrays (such as EMC's Clariion, HP's StorageWorks EVA and IBM's TotalStorage DS family) typically require reduced or no I/O during microcode upgrades. If you're evaluating storage arrays for your ERP environment, ask your vendor or reseller for a microcode upgrade history as part of your selection criteria.

Array-based replication software (e.g., EMC SRDF/A or HP StorageWorks Continuous Access) can be used to replicate every change from one array to another. The replication between storage arrays has no impact on the ERP app. Once the data is fully migrated and synchronized, the second array is made available to the database servers in the cluster and the first array is then extracted from the SAN and data center.

Companies replicate data from one array to another to increase the availability of the ERP application, but the second array is usually in a remote location except during an array replacement. Most corporations that implement ERP today have a single storage array that supports the database cluster in their primary location. The data is always protected within the array through mirroring, RAID, dual controllers and other array-based techniques. Duplicate arrays can be leveraged in ERP environments, but they're typically put in separate, remote locations; they also require the development of remote replication and remote failover capabilities.

Clustering is about increasing the availability of the application. It's important to remember that clustering techniques are used to reduce, but not eliminate, downtime for an ERP application. While there's some complexity associated with the techniques, the ability to work on the database server and the database server's operating system while the application is running outweighs management complexity. Storage managers who have mastered an ERP cluster can sleep easy at night and work on other pressing tasks during the day.
