|Best practices for e-mail storage|
Tuning e-mail storage
E-mail is one of the fastest growing applications in terms of storage capacity. With Sarbanes-Oxley and other regulations, the problems of managing e-mail have increased. (See "Regulations squeeze storage.") Because of the requirements for data retention, e-mail storage will have two components: primary (online) storage and secondary (archival) storage.
E-mail storage should be tuned for simple capacity scaling and long-term data retention. I/O performance and availability are secondary requirements. Most e-mail messages are accessed within the first few days they are received and then they are accessed infrequently--if ever again. Users may search their old messages for specific information, such as another person's e-mail address, but most old messages are never opened and read after several days.
This is partly because e-mail replies and discussion threads often copy the text of previous messages into subsequent ones. E-mail storage can be described as a write-infrequently, read-rarely facility. It's lightly used, although that might not be obvious from the capacity problems it causes. The characteristics of archived e-mails are different--the archive must be designed to last for many years and, in some cases, to prove that e-mails haven't changed.
First and foremost, e-mail primary storage should be cheap but scalable. Disk drives running in server cabinets are not the least bit scalable, although this is probably the most common type of e-mail storage used today. Virtualization in storage area networks (SANs) is a much better option. Virtualization's ability to change storage capacity on demand can dramatically lower the cost of primary e-mail storage.
E-mail iSCSI SAN
Another way to lower the cost of e-mail storage is to use cheaper SAN connection technology for e-mail servers. This sounds like a job for iSCSI, and it is. Of course, this may require creating a new iSCSI SAN or using iSCSI routers to provide iSCSI servers access to Fibre Channel (FC) storage subsystems.
Implement your iSCSI SAN independent of other Ethernet networks--leveraging an existing Ethernet network only invites problems and there's not much synergy to be had uniting legacy LANs and iSCSI SANs. Do not bother with TCP offload engines (TOE) because e-mail servers don't generate enough I/O traffic to justify the additional TOE cost.
If you plan to leverage an existing FC SAN, you will need an iSCSI router. Cisco Systems Inc., Crossroads Systems Inc., FalconStor Software Inc., McData Corp. and Sanrad Ltd. all offer iSCSI storage routers. Some of these products also provide the virtualization function that you need for optimal e-mail storage. You can put the virtualization product on the FC side of the router using any number of SAN virtualization products. I recommend putting the virtualization system on the FC side so it can be leveraged by other servers in the SAN. If implemented successfully, virtualization will increase your storage capacity levels more than any other storage technology on the market today.
Additionally, you should oversubscribe or multiplex server connections for e-mail servers. A single 2Gb storage subsystem port should be able to accommodate between 15 and 20 e-mail servers. Surprisingly, you might find that this number could be even higher. A fast LAN connection is necessary to migrate data transparently between storage subsystems.
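The oversubscription figures above can be checked with simple arithmetic. The following is a minimal sketch, assuming the 2Gb port is treated as a nominal 2,000 Mb/s with encoding overhead ignored; the function name is hypothetical and used only for illustration.

```python
def per_server_share_mbps(port_gbps: float, servers: int) -> float:
    """Worst-case bandwidth per server (Mb/s) if every server transmits at once."""
    if servers <= 0:
        raise ValueError("servers must be a positive integer")
    return port_gbps * 1000.0 / servers


if __name__ == "__main__":
    # Assumption: a 2Gb port is modeled as a nominal 2,000 Mb/s.
    for servers in (15, 20):
        share = per_server_share_mbps(2.0, servers)
        print(f"{servers} servers: about {share:.0f} Mb/s each in the worst case")
```

Even in the worst case, where all 20 servers transmit at once, each still gets roughly 100 Mb/s. Because e-mail servers generate light I/O, they rarely all transmit together, which is why the ratio could go higher.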
Use SATA drives
Plan to use SATA drives with mean time between failures (MTBF) ratings of 1 million hours or more. They cost more than lower-rated SATA drives, but reducing drive failures saves money in the long run. Try not to partition individual drives among too many servers because SATA drives aren't good at overlapped I/O. SATA is terrific for a few e-mail servers, but I wouldn't want more than five servers accessing the same physical disk drives.
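To put an MTBF rating in practical terms, it can be converted to an approximate yearly failure expectation. This is a rough sketch, assuming a constant failure rate (the exponential model) and drives running around the clock; real drives may not behave this way, and the function names are hypothetical.

```python
import math

HOURS_PER_YEAR = 8760  # 24 hours * 365 days


def annualized_failure_rate(mtbf_hours: float) -> float:
    """Probability that one drive fails within a year, under a constant failure rate."""
    if mtbf_hours <= 0:
        raise ValueError("mtbf_hours must be positive")
    return 1.0 - math.exp(-HOURS_PER_YEAR / mtbf_hours)


def expected_failures_per_year(drive_count: int, mtbf_hours: float) -> float:
    """Expected drive replacements per year for a population of always-on drives."""
    if mtbf_hours <= 0:
        raise ValueError("mtbf_hours must be positive")
    return drive_count * HOURS_PER_YEAR / mtbf_hours


if __name__ == "__main__":
    for mtbf in (500_000, 1_000_000):
        afr = annualized_failure_rate(mtbf)
        failures = expected_failures_per_year(100, mtbf)
        print(f"MTBF {mtbf:,} h: AFR {afr:.2%}, ~{failures:.2f} failures/year per 100 drives")
```

Under these assumptions, a 1 million hour rating works out to a bit under one expected failure per year for every 100 drives.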
Keep that in mind when you partition storage in your virtualization system. Use RAID 1, 5 or 10--whatever seems easiest to manage. Don't worry that RAID 1 won't have the scalability you need--scalability can be handled by the virtualization product. Also, don't use SATA drives with write caching turned on; write caching won't deliver noticeable performance advantages and it increases the risk of data loss.
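The capacity trade-off among the RAID levels mentioned above can be sketched as follows. This is an illustrative calculation only, assuming equal-size drives, RAID 1 as a two-drive mirror, and RAID 10 as striped two-way mirrors; the function name is hypothetical.

```python
def usable_capacity_gb(level: int, drives: int, drive_gb: float) -> float:
    """Usable capacity in GB for a RAID group built from equal-size drives."""
    if level == 1:
        # Assumption: RAID 1 is modeled as a simple two-drive mirror.
        if drives != 2:
            raise ValueError("this sketch models RAID 1 as exactly two drives")
        return drive_gb
    if level == 5:
        # One drive's worth of capacity goes to parity.
        if drives < 3:
            raise ValueError("RAID 5 needs at least three drives")
        return (drives - 1) * drive_gb
    if level == 10:
        # Half the drives hold mirror copies.
        if drives < 4 or drives % 2:
            raise ValueError("RAID 10 needs an even number of drives, at least four")
        return (drives // 2) * drive_gb
    raise ValueError("only RAID 1, 5 and 10 are modeled")


if __name__ == "__main__":
    for level, drives in ((1, 2), (5, 4), (10, 4)):
        print(f"RAID {level} with {drives} x 250GB: {usable_capacity_gb(level, drives, 250):.0f}GB usable")
```

The small usable capacity of an individual RAID 1 group is not a problem here, because the virtualization product can combine many such groups.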
Primary and secondary storage
Whereas primary e-mail storage can be fairly generic and replaceable, your e-mail archiving system should be designed to last awhile. There are many software companies targeting products at data retention and regulatory business requirements, including the administrative challenges that come with e-mail. Most e-mail archiving packages move data from the e-mail system and store it in an external, compressed and indexed format for fast searching. Many also have special functions for handling e-mail attachments. Most backup software vendors have products for archiving e-mail data. However, data retention might be better if kept separate from regular backup processes.
Unlike primary e-mail storage, which can use just about any kind of storage, secondary e-mail storage needs built-in safeguards to ensure that archived data isn't deleted or tampered with. Network Appliance Inc. (NetApp) has a software function called SnapLock that gives its filer products write once, read many (WORM) capabilities. Likewise, IBM Corp. recently introduced a new server called the TotalStorage Data Retention 450, which provides WORM storage and works with Tivoli Storage Manager for data retention software to automate data retention policies. EMC Corp.'s Centera, which uses content-addressed storage (CAS) technology, provides similar data retention functionality. Keep in mind that these are data center products with relatively high price tags.
In lieu of buying an expensive special-purpose data retention storage subsystem, it's possible to use network-attached storage (NAS) to store the archived e-mail data. The risk is that an administrator or user will delete or alter the data after it has been safely archived. This archiving-on-the-cheap approach can work in practice as long as it's teamed with backup operations that make additional copies to tape. Be warned, however, that your manual process may come under the scrutiny of corporate auditors who might not be convinced of its effectiveness when compared to a more automated product.
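One way to at least detect the deletion or alteration risk described above is to record a checksum for every archived file and compare the records later. The following is a minimal sketch, not a substitute for WORM storage or a description of any vendor's product; the function names and the manifest layout are assumptions made for illustration.

```python
import hashlib
from pathlib import Path


def sha256_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks to limit memory use."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root: Path) -> dict[str, str]:
    """Map each file's path (relative to root) to its SHA-256 digest."""
    return {
        str(path.relative_to(root)): sha256_of_file(path)
        for path in sorted(root.rglob("*"))
        if path.is_file()
    }


def compare_manifests(old: dict[str, str], new: dict[str, str]) -> dict[str, list[str]]:
    """Report files that were deleted, added or altered between two manifests."""
    return {
        "deleted": sorted(set(old) - set(new)),
        "added": sorted(set(new) - set(old)),
        "altered": sorted(name for name in set(old) & set(new) if old[name] != new[name]),
    }
```

A manifest built at archive time would need to be kept somewhere the NAS administrators cannot change, for example with the tape copies; otherwise it proves little. Running `compare_manifests` on the stored manifest and a freshly built one then lists any files that have gone missing or changed.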
If you use the NAS for e-mail archives, you should use removable WORM media to back up your archives. As WORM prevents data from being overwritten and the NAS system doesn't, the NAS system should probably be viewed as a temporary storage location for archiving purposes. It's fine to keep archived e-mail data on a NAS system, but you must regularly back up to WORM media. Sony Electronics Inc. has recently announced tape drives that provide WORM capabilities that can operate in tape libraries. Optical jukeboxes with WORM drives are another option. A new format called ultra density optical (UDO), which is based on blue-laser technology, may give magneto-optical (MO) drives the capacity necessary to be useful for e-mail archiving.
Storage for plain file servers
Inexpensive file servers that support general office environments--not industrial-strength file servers that support applications such as engineering or multimedia development environments--are perfectly adequate for office application files such as word processing documents, spreadsheets, presentations and HTML files. These types of files are accessed frequently while they are being worked on, but once they are finished, they are rarely accessed again.
Common office application file server storage requirements are less stringent and not all that difficult to meet. Unlike e-mail servers, common file servers don't typically require a large amount of storage capacity, but they do have to support higher performance levels. File servers regularly support a fair amount of I/O work in parallel, particularly when people are beginning their work or preparing to quit. Special business applications that have integrated databases or involve streaming I/O require higher performance levels than average office applications.
NAS appliances work very well as common file servers. The types and degrees of tuning that can be done depend on the product manufacturer and model. The following applies to both NAS appliances and file servers built from commercial software products.
When planning storage for common file servers, you must take into account the strengths and weaknesses of the disk drive technologies being used. It's certainly not a mistake to use FC and SCSI drives in file servers because they are much better at handling parallel I/O operations than ATA and SATA desktop disk drives. Subsystem designs can overcome the throughput limitations of ATA and SATA desktop disk drives, but you want to clearly understand what techniques are being used. For instance, ATA and SATA drives could have write caching enabled, which is probably a bad idea because of the chance of data loss. On the other hand, Ciprico Inc.'s new SATA storage subsystem, FibreStore 2212A, is based on an accelerated drive teaming (XDT) technology, and is designed to increase the throughput of SATA-based subsystems for common file serving environments.
One major difference between storage for e-mail and file servers is the amount of cache memory used. While cache doesn't do much for e-mail performance, file servers often benefit from cache. Read-ahead cache is most likely to generate the best results for common office file servers, but if you're trying to improve performance for a particular application, you should ask the application vendor's opinion.
Another way to increase the performance of file servers is to "short-stroke" the drives. The basic idea is to create a single partition on the drive that's approximately one-half to two-thirds of the drive's capacity and ignore the remaining capacity. This reduces seek times in the drive, allowing a large SATA drive to achieve seek times that are similar to SCSI and FC drives.
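The capacity given up by short-stroking is easy to estimate. This is a small sketch of the arithmetic only, using the one-half to two-thirds range from the text; the drive size in the example and the function name are assumptions chosen for illustration.

```python
def short_stroke_partition_gb(drive_gb: float, fraction: float) -> float:
    """Size of the single partition to create when using only a fraction of the drive."""
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be greater than 0 and at most 1")
    return drive_gb * fraction


if __name__ == "__main__":
    drive_gb = 300  # hypothetical drive size
    low = short_stroke_partition_gb(drive_gb, 1 / 2)
    high = short_stroke_partition_gb(drive_gb, 2 / 3)
    print(f"Partition a {drive_gb}GB drive at roughly {low:.0f}GB to {high:.0f}GB")
```

The unused remainder is the price paid for shorter seeks, so the technique fits best where drive capacity is cheap relative to the performance gained.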
An important goal for file server storage is to reduce the number of drive failures. As with e-mail server storage, look for SATA drives with MTBF values greater than 1 million hours. Beyond that, file server storage should include hot swapping and hot spare features as standard fare.
iSCSI is also a good fit for file server storage as a way to reduce the cost of putting file servers on a SAN. File server performance over iSCSI should be adequate for most applications, but latency-sensitive applications with database functions could suffer with it. Don't bother using TOE unless your servers have processors running at less than 2GHz. iSCSI storage routers that connect iSCSI servers to FC storage are an excellent idea for companies that already have FC SANs. This way you can get the cost reduction of iSCSI on the servers, combined with the high-throughput capabilities of FC and SCSI disk drives.
General-purpose database storage
Although high-throughput transaction processing tends to get the most attention when databases are discussed, there is an enormous number of general-purpose database installations with requirements that are easier to meet than those of transaction processing. These servers support an incredible variety of business applications, from material resource planning and accounting packages to specialized line-of-business applications.
General-purpose database systems tend to have more I/O activity than office application or e-mail servers. Unlike these other servers, database systems tend to access more of their data over an extended period of time. Performance and reliability are much more important because these databases may be used for primary business operations and can impact corporate productivity.
The requirements for these databases vary widely, but if the business depends heavily on them, then it makes sense to invest more heavily and avoid problems. Capacity requirements can be surprisingly small. Some databases don't need much storage capacity and could theoretically operate with 200GB capacity disk drives.
In general, these servers merit FC storage based on the requirement for critical reliability and lower latency. This isn't to say high reliability SATA drives won't do the job, but they haven't been in the market that long, and a more conservative technology approach is prudent.
Another option for database storage is an industrial-strength NAS system. For instance, Oracle databases run perfectly well on NAS systems. As it turns out, most of the concerns voiced about using NAS for databases are related to questions about network reliability. If you're going to install a SAN with fiber optic cabling, you could install the same cabling and use it for NAS to get the same level of network reliability.
New storage technologies such as iSCSI and SATA and new developments in older technologies such as WORM and NAS can be applied and tuned to meet a wide range of changing storage requirements. The model of the high-performance, high-availability FC SAN doesn't necessarily translate to applications such as e-mail, which need scalability and little else.
Of all applications, the one posing the most challenges today is e-mail--particularly the archiving of e-mail data. If you save money deploying storage for other servers, you may have more available to spend on this difficult and critical area.