A few years ago, I helped an ISP redesign its e-mail storage to strengthen the application's availability, scalability and recovery. Most of the steps we took transformed a monolithic storage design into a more flexible and scalable system that provided higher levels of availability and performance. The same approach may be transferable to other large e-mail storage environments.
At the time, the ISP was growing rapidly, adding nearly 10,000 new accounts a day. To keep the system running, the staff was constantly adding servers and storage hardware to keep up with the growth in database traffic and storage demand. It was obvious the current architecture would soon hit a wall as to the number of users it could service. The redesign needed to address these three critical areas:
Availability. This problem was the most significant of the three. E-mail couldn't go down for any reason. Nor could an e-mail message be corrupted without doing harm to the ISP's reputation. The new design needed database failover, dynamic multipathing and multiple mirror sets.
Scalability. Scalability was the greatest challenge, because the system was servicing thousands of users. The servers, storage and software had to be staged, cut over and reclaimed, and this process couldn't affect the users.
Recovery. Management also wanted to add new features that would allow fast recovery of a corrupted database and off-site archival of low-usage accounts. This required multiple mirrors, scripts to trigger synchronization points and a large common repository that could be mirrored remotely for the highest level of recovery.
The e-mail system stores all security and account information for the individual e-mail users in a database, and all messages in a file system. Originally, database and file system data were on the same physical drives, channels and subsystems without regard for the distinct performance characteristics of each data type. The storage subsystem was thrashing between random, small-block data and sequential, large-block data.
The server layout wasn't optimized for performance, either. Large servers handled the massive amount of database traffic while also streaming message objects to users. Each Sun 6000 server ran the e-mail application and an Oracle database. While these machines have decent I/O and expandability, the database was using a lot of system resources. Message streaming used little CPU, but consumed the majority of the I/O bandwidth. At first glance, this doesn't seem like a bad solution: the two applications were using different resources within the same server. Oracle was using mainly CPU and memory, and the e-mail application was using only the I/O subsystem.
A database server is typically much more complex and finely tuned than a file server. Add the fact that both are tuned differently, and you begin to have problems. The combination of these two applications had availability implications. A slight problem that causes a sensitive database to hang wouldn't have affected a simple file server. In addition, the scaling of a large server required either buying more large servers or consolidating on an even larger server. The solution: Decouple the database and file system servers. This was also done at the data level, which I'll discuss next.
As mentioned earlier, the e-mail system was composed of both message files and database objects. The original design treated these two distinct data types as equals. This caused all sorts of I/O problems. The channel utilization for the storage subsystem was low because of the mix of block sizes. The mixture of random and sequential data caused thrashing within the cache with minimal reuse. Physical recovery was difficult because it was necessary to restore the full volume to another disk and then pull off the needed data. The only logical step was to isolate the database and message file data so they could be tuned and matched with the resources needed to provide an optimum solution. This sounds good in theory, but how could we redesign the entire e-mail system while still servicing thousands of users with no downtime?
As we moved from a monolithic to a modular architecture, the e-mail system redesign took several phases to complete. In the first phase, the database and file system servers were decoupled to make better use of the available resources. The 6000-class Suns were replaced with 4000-class systems, and 200-class systems were used for file servers. (The 6000s were redeployed for another project within the company.) In the second phase, we built new storage modules. The next phase migrated users off the old hardware to the new modules. Later modules were built from the decommissioned hardware, so the only new hardware needed was for the initial module. Users were migrated when e-mail traffic was at its lowest. It's important to note the features that were added to the redesigned system to increase the e-mail system's availability: failover and dynamic multipathing.
The failover strategy was an elegant solution consisting of a dual server architecture and implementation of an Oracle standby database. In the event of a problem with the active server, the failover software would cut over to the stable server and use the standby database. The standby database was kept synchronized by updating the standby server's logs on a periodic basis. This made maintenance and upgrades non-disruptive.
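The standby arrangement can be illustrated with a minimal sketch. It is a toy model, not the ISP's failover software or an Oracle interface: the class names, server names and log labels are all hypothetical, and log shipping is reduced to copying list entries.

```python
from dataclasses import dataclass, field


@dataclass
class DatabaseServer:
    """Toy stand-in for a database server (hypothetical, not an Oracle API)."""
    name: str
    applied_logs: list = field(default_factory=list)
    healthy: bool = True


class FailoverPair:
    """An active server plus a standby kept current by periodic log shipping."""

    def __init__(self, active, standby):
        self.active = active
        self.standby = standby
        self.unshipped = []  # logs written on the active since the last ship

    def write(self, log_entry):
        # New transactions land on the active server first.
        self.active.applied_logs.append(log_entry)
        self.unshipped.append(log_entry)

    def ship_logs(self):
        # Periodic job: copy pending logs to the standby and apply them.
        self.standby.applied_logs.extend(self.unshipped)
        self.unshipped.clear()

    def fail_over(self):
        # Swap roles. Entries not yet shipped are returned so the caller
        # can see how far the standby lagged at the moment of cutover.
        pending = list(self.unshipped)
        self.unshipped.clear()
        self.active, self.standby = self.standby, self.active
        return pending


pair = FailoverPair(DatabaseServer("db-a"), DatabaseServer("db-b"))
pair.write("log-1")
pair.write("log-2")
pair.ship_logs()
pair.write("log-3")           # written after the last ship
pair.active.healthy = False   # simulated fault on the active server
pending = pair.fail_over()
print(pair.active.name, pair.active.applied_logs, pending)
```

The sketch makes one trade-off visible: the shorter the shipping period, the fewer log entries remain pending when a cutover happens.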
Dynamic multipathing also improved channel utilization, thus enhancing performance. A total of eight paths were used in this implementation with four dedicated to each server (see "Final e-mail system" on this page). The multipathing allowed for host bus adapter (HBA) failures without affecting the database environment. Dynamic multipathing was also implemented on the file servers, primarily for performance.
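As a rough illustration of how multipathing can tolerate an HBA failure while still spreading I/O, here is a minimal sketch for the four paths of one server. The round-robin policy, the class and the path names (hba0 through hba3) are assumptions for illustration; the actual multipathing software and its balancing policy are not described here.

```python
from itertools import cycle


class MultipathDevice:
    """Round-robin I/O over several paths, skipping any marked failed."""

    def __init__(self, paths):
        self.paths = list(paths)
        self.failed = set()
        self._rotation = cycle(self.paths)

    def mark_failed(self, path):
        self.failed.add(path)

    def next_path(self):
        # Try each path at most once per call before giving up.
        for _ in range(len(self.paths)):
            path = next(self._rotation)
            if path not in self.failed:
                return path
        raise RuntimeError("no healthy paths remain")


# Four paths dedicated to one database server (hypothetical names).
db_server = MultipathDevice(["hba0", "hba1", "hba2", "hba3"])
db_server.mark_failed("hba1")  # simulated HBA failure
used = [db_server.next_path() for _ in range(6)]
print(used)
```

I/O keeps flowing over the three surviving paths, which is why a single adapter failure need not reach the database.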
Final e-mail system
With servers and storage decoupled, and with clustered servers, this e-mail system provides high availability and performance. Maintenance without downtime is easier.
Another goal for this redesign was to increase scalability so that a new module could be added at any time without disrupting the entire e-mail system. This meant the modules had to be self-contained and independent of each other while working together to service the e-mail customers. This building-block approach is found today in several storage utility models, but at the time was relatively new for an e-mail application. Below is a depiction of the near-linear scaling model.
Each module consisted of two disk arrays. The first array was mirrored with relatively small physical drives and a large cache. It was fitted with the maximum number of front-end channel adapters. Large physical disks were also included in the array for the mirrored copies. In addition to the disk arrays, there were the servers, which included two high-end database servers configured for high availability and another smaller file server. The servers were connected to the database array and message file array, respectively.
A complete module consisted of three servers and two disk arrays with ports reserved for recovery channels. The e-mail modules were built in distinct phases as follows:
Phase 1: Server segmentation. To create the level of scalability needed to handle almost 10,000 new accounts per day, the first task was to decouple the database and message file servers. Once the servers were in place with the storage systems, the next two phases of data migration could proceed.
Phase 2: Database migration. The database migration was performed during a window late in the week when usage was lowest. Once a working database was migrated over, the message files could be moved next.
Phase 3: Message migration. The message file migration was performed during the same window as the database migration. Once the messages and databases were verified as functional, the network was pointed to the new servers, and e-mail operations continued on the new e-mail module.
Phase 4: Recovery. The final phase was to connect the open channel ports for data archival and remote mirroring. This phase was done without any risk of downtime.
These phases were repeated four times over a four-week period to create a scalable system of four e-mail modules.
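The module layout and the phase ordering above can be sketched as a small data structure. This is only an illustration: the class, the server and array labels, and the one-module-per-week pacing are assumptions, not details taken from the project.

```python
from dataclasses import dataclass, field

# The four build phases, in the order they had to run.
PHASES = [
    "server segmentation",
    "database migration",
    "message migration",
    "recovery",
]


@dataclass
class EmailModule:
    """One self-contained building block: three servers, two disk arrays."""
    name: str
    servers: tuple = ("db-primary", "db-standby", "file-server")
    arrays: tuple = ("database-array", "message-array")
    completed: list = field(default_factory=list)

    def run_phase(self, phase):
        # Enforce the ordering: each phase depends on the one before it.
        if len(self.completed) >= len(PHASES):
            raise ValueError("all phases already completed")
        expected = PHASES[len(self.completed)]
        if phase != expected:
            raise ValueError(f"{phase!r} attempted before {expected!r}")
        self.completed.append(phase)

    @property
    def in_service(self):
        return self.completed == PHASES


# Assumed pacing: one module built per week for four weeks.
modules = []
for week in range(1, 5):
    module = EmailModule(name=f"module-{week}")
    for phase in PHASES:
        module.run_phase(phase)
    modules.append(module)

print([(m.name, m.in_service) for m in modules])
```

Because each module carries its own servers and arrays, adding a fifth module would not touch the state of the first four, which is the property the building-block design relies on.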
The final enhancement to the e-mail system was the addition of several recovery features. First, for every active volume, two additional volumes were reserved as mirrors.
The first volume would be synchronized with the active data at a two-hour interval. This allowed for an incremental recovery that lost no more than two hours' worth of e-mail activity.
The second volume was synchronized on a 12-hour interval. This volume was for any corruption errors that weren't found in the two-hour window before the first volume would be resynchronized and thus overwritten. This volume could also be used for future asynchronous remote mirroring.
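The reasoning behind the two intervals can be checked with a small sketch. It assumes, purely for illustration, that each mirror resynchronizes at exact multiples of its interval, with time measured in hours from an arbitrary starting point; the function names are hypothetical.

```python
def last_sync(now_hours, interval_hours):
    """Most recent sync time, assuming syncs fire at multiples of the interval."""
    return (now_hours // interval_hours) * interval_hours


def pick_mirror(corrupted_at, detected_at, intervals=(2, 12)):
    """Return (interval, sync_time) for the freshest mirror that was last
    synchronized before the corruption occurred, or None if both mirrors
    have already copied the damaged data."""
    candidates = []
    for interval in intervals:
        synced = last_sync(detected_at, interval)
        if synced < corrupted_at:  # mirror was copied before the damage
            candidates.append((synced, interval))
    if not candidates:
        return None
    # Prefer the most recent clean copy; on a tie, the shorter interval.
    synced, interval = max(candidates, key=lambda c: (c[0], -c[1]))
    return interval, synced


# Corruption at hour 14.5, caught at hour 15: the two-hour mirror is clean.
print(pick_mirror(corrupted_at=14.5, detected_at=15))
# Same corruption caught at hour 17: only the 12-hour mirror is still clean.
print(pick_mirror(corrupted_at=14.5, detected_at=17))
# Caught at hour 25: both mirrors have resynchronized since the damage.
print(pick_mirror(corrupted_at=11, detected_at=25))
```

The second case is the one the 12-hour volume exists for: the corruption went unnoticed past the next two-hour resynchronization, so the first mirror had already been overwritten.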