Although it's not commonly done, we can all imagine the benefits of booting operating systems from a storage area network (SAN). From a more standardized approach to deploying new servers to a near-instantaneous recovery of those servers, booting from the SAN gives an organization more choices in the way it manages server and storage infrastructures.
Considering those benefits, as well as the management difficulties that existed before the advent of storage area networks, it would seem we all should start booting from the SAN. Instead, users are taking the tortoise approach to SAN-based system generation and the reasons are worth noting.
In IT shops with hundreds--and when you consider mirrors, thousands--of OS disks to manage, booting from the SAN must be a technological boon. To understand that, let's look at JumpStart from Sun Microsystems Inc. as an example of an application that can be enhanced to take advantage of having boot disks on the SAN. Keep in mind that any volume management or backup and recovery application could bring this functionality under its umbrella of services.
For large Solaris shops, system generation is likely to be aided by JumpStart even without a SAN. Profile configuration and rule files are created, configured and then applied to boot images before they are copied to the installing client over a local IP network. For all intents and purposes, this is managing storage on the front end, with some performance and architectural (bootp) limits on the number of clients that can be generated at once (see "SAN boot simplifies system management" on this page).
Booting from the SAN doesn't replace products like JumpStart. Instead, it enables you to manage boot images on the SAN from the back end, which is less expensive and faster than using a JumpStart profile/boot server to initialize multiple Solaris systems over IP.
Just as today, where there's no dependency on the JumpStart server after the OS is installed, there wouldn't be any dependency on the JumpStart server in our proposed SAN-based solution either. After the OS image is generated and copied onto the selected target disks in the SAN, all ownership of the disks would be released, put into a default zone and then later reassigned to the incoming server's host bus adapter (HBA) when the system is connected to the SAN and its world wide name is discovered. From that point and until an OS upgrade or hardware replacement is required, the disk would be exclusively assigned to and managed by the incoming application server as if it were locally attached to the server.
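The hand-off described above can be pictured as a small state machine. What follows is only a minimal sketch in Python, not the API of JumpStart or of any switch or array vendor: the class names, the default zone name and the zone naming scheme are all hypothetical, chosen purely for illustration.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Optional

DEFAULT_ZONE = "default_boot_pool"  # hypothetical name for the holding zone


class DiskState(Enum):
    IMAGED = "imaged"      # OS image has been copied onto the disk
    FREE = "free"          # ownership released, parked in the default zone
    ASSIGNED = "assigned"  # zoned exclusively to one server's HBA


@dataclass
class BootDisk:
    disk_id: str
    state: DiskState = DiskState.IMAGED
    zone: Optional[str] = None
    owner_wwn: Optional[str] = None

    def release(self) -> None:
        """Drop the imaging server's ownership and park the disk."""
        if self.state is not DiskState.IMAGED:
            raise ValueError(f"{self.disk_id} is not freshly imaged")
        self.state = DiskState.FREE
        self.zone = DEFAULT_ZONE
        self.owner_wwn = None

    def assign(self, wwn: str) -> None:
        """Zone the disk exclusively to the HBA with this world wide name."""
        if self.state is not DiskState.FREE:
            raise ValueError(f"{self.disk_id} is not free")
        self.state = DiskState.ASSIGNED
        self.zone = "boot_" + wwn.replace(":", "")  # hypothetical naming scheme
        self.owner_wwn = wwn


def assign_free_disk(pool: List[BootDisk], wwn: str) -> BootDisk:
    """Give a newly discovered HBA the first free boot disk in the pool."""
    for disk in pool:
        if disk.state is DiskState.FREE:
            disk.assign(wwn)
            return disk
    raise LookupError("no free boot disks left in the pool")
```

In this model, a storage administrator would call release() on each disk once imaging completes, and assign_free_disk() when a new server's world wide name shows up on the fabric. Nothing in the sketch refers back to the imaging server after assignment, which mirrors the point that the JumpStart server is out of the picture once the OS is in place.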
Today, Solaris systems with internal boot disks may be taken out of their packaging and plugged into a locally broadcasted network where JumpStart is used to install a system image on its disks. Afterwards, the newly generated system is wheeled to a more permanent location in the rack and then connected to its storage devices over a SAN or direct-attached storage (DAS) connection.
However, when boot images are managed from the SAN, newly unpackaged systems outfitted with only CPUs, NICs, memory and some number of HBAs can be taken directly to the rack, connected to the SAN and configured with minimal effort to boot from one of the free OS disks created earlier. That makes the deployment of one or many servers more efficient and cost-effective.
SAN boot simplifies system management
Remember all of those Y2K patches that we had to apply to the OS in 1999? Remember how we had to visit each server to apply those patches each time the vendor came out with a new bundle every quarter?
Remember asking yourself: "How can this process be automated?" Well, managing and booting your OS disks from the SAN gives you the ability to create a newly standardized (patched) OS disk for discovery, testing and then synchronizing with the child node in a mirror, thus simplifying system upgrades. And if testing should prove that the patch bundle was flawed, you can still fall back to the original OS image with a simple zone change, for as long as policy allows that image to be kept around.
Disaster recovery is yet another business discipline in which SAN-based boot image management can prevail. The ability to drive multiple instances of a boot image to many potential OS disks can drastically improve the recovery time of operating systems in disaster recovery exercises. Following the attacks on Sept. 11, 2001, the brokerage firm that I was working with received a number of brand-new servers on the dock of its recovery site within hours of the disaster (see "Disaster Recovery: Reenacting Sept. 11"). They were racked and administrators immediately installed the operating systems from a CD.
Although a moderate number of installs were proceeding in parallel, the sheer number of systems that had to be regenerated hindered the speed of the deployment. In addition, no one really knew what packages needed to be installed on each server or what answers were being supplied to the installation prompts across the group. The edict was simply to "get them installed!" Because the common thought was "it's better to be safe than sorry," the installation of the entire CD distribution was often the preferred choice, further extending the OS installation portion of the exercise.
In that situation, being able to quickly generate like-system images, without any dependency on the number of CD installation media or on the backup server, would have proven useful in resurrecting so many servers in such a short period of time. And although you can do that using JumpStart's native IP functionality without a SAN, the speed and broadcasting characteristics that are typical of a disaster recovery IP solution (i.e., 100Mb/s, routable LAN) would likely be insufficient for the mass rollout of a large number of servers. Unless you want to provision a Gigabit Ethernet connection over a flat network space for every application server slated for recovery, you are bound to run into a bottleneck when installing your boot images onto local disks via IP.
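To get a feel for that bottleneck, here is a rough back-of-the-envelope calculation in Python. It is a sketch under loud assumptions: the server count, the image size and the efficiency factor are made-up illustrative values, and the model simply assumes every install shares one link out of the install server, ignoring protocol details and disk speeds.

```python
def rollout_hours(servers: int, image_gb: float, link_mbps: float,
                  efficiency: float = 0.7) -> float:
    """Estimate hours needed to push one boot image to every server.

    Assumes all transfers share a single link of `link_mbps` megabits
    per second, and that only `efficiency` (0 to 1) of the nominal
    bandwidth is actually usable. Both are simplifying assumptions.
    """
    if not 0 < efficiency <= 1:
        raise ValueError("efficiency must be in (0, 1]")
    total_bits = servers * image_gb * 8 * 1000**3   # decimal gigabytes to bits
    usable_bits_per_sec = link_mbps * 1000**2 * efficiency
    return total_bits / usable_bits_per_sec / 3600


if __name__ == "__main__":
    # Hypothetical recovery: 200 servers, 2 GB image each.
    for label, mbps in (("100Mb/s LAN", 100), ("Gigabit Ethernet", 1000)):
        hours = rollout_hours(servers=200, image_gb=2, link_mbps=mbps)
        print(f"{label}: about {hours:.1f} hours")
```

The absolute numbers matter less than the shape of the result: on a shared link the elapsed time grows in direct proportion to the number of servers, and a tenfold faster link only buys a tenfold reduction.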
In contrast, with a SAN, once you've done the initial OS install on the JumpStart server and recovered the JumpStart application data with native Unix utilities, like-boot images can be created on independent disks and then served up to the recovered application server's HBA.
Streamlining this process in a disaster recovery effort has many benefits. Not only does it now take just one storage administrator to cook up some boot images, it also frees up precious system administrator time to concentrate on restoring application data once the backup and recovery environment has been certified.
Mirroring root disks across distances is yet another benefit of booting application servers from the SAN. In theory--and considering the minimal amount of data being driven to the OS disk(s) when compared to application data disks--any long-distance SAN link capable of sustaining synchronous data flow for a resource-intensive application should also be able to sustain mirror I/O between a local SAN-based boot disk and its remote mirrored partner. As always, extensive testing should be done between long distance points before assuming the link will support boot disk traffic.
Additionally, if you are mirroring swap files and your applications make heavy use of them, consider purchasing more system memory before testing the link. If your testing proves successful, then further testing will show that upon disaster declaration and following some massaging in the remote SAN, boot disks and root disk groups can be discovered and imported into a newly deployed server. At that point, the new server is ready for the recovery of its application data. If you already have this infrastructure in place, isn't this reason alone to test?
The first stumbling block for booting from the SAN is not technical, but political. Larger IT organizations are creating storage administration groups that are responsible for the provisioning, securing and protection of application storage. However, the system administrator assigned to the business unit that owns the application server usually administers the operating system disk locally and independently of the storage group.
From the system administrators' point of view, the boot disks are the brains of the computer, and they don't want some fallible network between the physical server and its brains. This concern was both real and valid in light of the initial complexity of SANs. However, times have changed. Fibre Channel (FC)-based SANs are much more stable than they have been in the past. Interoperability between hardware and software offerings has increased, drivers are stabilizing and there are more high-quality gigabit interface converters (GBICs) and HBAs available on the market today. All of this lends itself to a higher-quality connection between a server and its boot disks on the SAN. And while there have been some isolated successes when booting from an IP SAN, FC's head start is obvious in this solution space.
What I'm hearing from the storage administrators in the field is that from a management point of view, it only makes sense to boot their servers from the SAN.
"Imagine managing a retail Web site with hundreds of Dell Red Hat Linux servers," said one storage administrator who used to be a system administrator. "Then imagine having to deploy two hundred more just to handle the holiday load. We need this kind of change to happen with quality and as quickly and cost effectively as possible. And the best way to do this is by managing boot disks on the SAN." Ultimately, booting from the SAN is system cloning at its best.
Lack of standards
One thing is true: If booting from the SAN proves useful, operational standards must be defined and abided by without exception. And trying to get different system administrators with different preferences and habits to agree on standardization is difficult.
However, I've seen it done with organizations that are serious about doing what is best for the company as a whole and not necessarily driven by individual desires.
One milestone on the path to booting from the SAN is for cooperating groups to agree on the smallest possible number of boot images for their production and development servers. If a particular group of application servers needs additional modifications above the standard install, then that request can be serviced with a post-execution script after the install has completed. Simplicity should be the goal here. In the end, you want to be able to do as I described above--recover a JumpStart-like server with standard Unix utilities and start snapping off boot disks.
In order to provide the same minimum level of service across business units, standardization from the HBA to the storage port must be guaranteed. This helps to address the system administrators' concern about the quality of the connection between the server and its boot disks. That means that the same hardware, driver and firmware revisions supporting the boot disk connections for one group of application servers must be present in all others. If individual system administrators' personal preferences for operating systems, HBAs or other configuration components are allowed, then no real standardization will be possible. On the other hand, standardizing on quality physical connections, stable device drivers and generally agreed-upon configurations usually brings goodness to all corners of the organization, including the serving of boot images from the SAN.
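One way to keep that standard honest is to compare each server's boot path against an agreed baseline. The Python sketch below is only an illustration: the field names, the baseline values and the inventory format are hypothetical, and in practice the data would have to be collected from whatever tools the HBA and array vendors supply.

```python
from typing import Dict, List

# Hypothetical agreed-upon baseline for the boot disk connection.
BASELINE: Dict[str, str] = {
    "hba_model": "example-hba",
    "driver": "1.0",
    "firmware": "1.0",
}


def find_deviations(inventory: Dict[str, Dict[str, str]],
                    baseline: Dict[str, str]) -> Dict[str, List[str]]:
    """Return, per server, the baseline fields that do not match.

    Servers that fully match the baseline are left out of the result.
    A field missing from a server's record counts as a deviation.
    """
    deviations: Dict[str, List[str]] = {}
    for server, config in inventory.items():
        bad = [field for field, expected in baseline.items()
               if config.get(field) != expected]
        if bad:
            deviations[server] = bad
    return deviations


if __name__ == "__main__":
    # Made-up inventory for two servers.
    inventory = {
        "app01": {"hba_model": "example-hba", "driver": "1.0", "firmware": "1.0"},
        "app02": {"hba_model": "example-hba", "driver": "0.9", "firmware": "1.0"},
    }
    for server, fields in find_deviations(inventory, BASELINE).items():
        print(f"{server} deviates from the standard in: {', '.join(fields)}")
```

A report like this gives the cooperating groups something concrete to discuss: either the odd server is brought into line, or the baseline itself is revised for everyone.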