|Four tips for SAN scaling|
That's what we learned at Intuit Inc. when we decided to implement a SAN three years ago. With business requirements and storage doubling - sometimes tripling - every year, the advantages of achieving greater storage resource utilization through centralization, consolidation and availability were incentive enough to go ahead and be one of the early adopters of SAN technology. As our SANs grew from around 20TB, 128 ports and 60 DLT tape drives to approximately 200TB, 900 switch ports and 140 DLT drives, we encountered unforeseen problems that can plague you if you're not prepared.
One of the challenges was sharing SAN resources and achieving 100% utilization while trying to avoid both high costs and a large team to manage the SAN. We also had to figure out how to protect our initial investment while expanding - you don't want to have to throw out the infrastructure you built when you were relatively small in order to expand.
You can avoid these landmines by not boxing yourself in with a SAN design that can't scale effectively. Understanding what that means concretely, however, is far from obvious.
|The right stuff|
Avoid false economies
Our SAN implementation began as a few isolated monolithic and modular storage arrays with redundant fabrics made up of a few switches. We had mostly Unix servers running a mix of operating systems connected to the monolithic arrays, while Intel servers were connected to the modular arrays. While the modular arrays were less expensive, at the time they didn't yet have the availability, caching and support for multiple mirror copies, and therefore were mostly used for smaller applications such as databases running on Intel platforms. Although this changed over time as the software for the modular arrays became competitive with the features of the monolithic arrays, we continued to use the monolithic arrays for the most critical applications.
Ultimately, decisions at a higher level forced us to move to a method for replicating data, which meant moving more apps to monolithic storage and scrapping some of the modular systems. Always try to anticipate your future needs when you choose your primary storage (see "Scaling backup").
Some of our initial SAN implementations were performed by adding Fibre host bus adapters (HBAs) to the servers and then migrating the direct-attached servers from maxed arrays to new ones with Fibre switches placed in between. These isolated SAN islands were designed and laid out in a simple fashion. Management was manual, but relatively easy. A few Excel spreadsheets showed the switch and disk configurations for each of the servers. The infrastructure was nothing more than several strands of fiber laid throughout the data center under the floor in the network trays. Switches were racked and located centrally between the servers and storage. Backups were performed on a daily basis. Soft and hard zones were configured and our SAN implementations were a success.
This initial configuration worked well while things were relatively small and isolated. However, some of the benefits of a SAN weren't being fully utilized in this design. New servers were added to these SAN islands, but once we grew beyond the capacity of the switches or arrays in the initial design, the various components of the SAN started to become obstacles. One by one, each component needed to be addressed.
Design your SAN with a topology that scales regardless of how small you initially start out. Spending money up front will save you both soft and hard costs down the road.
The soft costs saved include the time it would otherwise take to manage, redesign and then implement a core-edge topology later. The hard costs saved come from longer investment protection for the initial hardware purchase.
You can protect your initial investment if you correctly anticipate faster hardware speeds for tape, switches, servers and storage. With speeds increasing and the ability to create trunks between your switches, you'll have more flexibility if you've designed an architecture that lets you move the older, slower technology out to the edge and implement the new, faster hardware at the core. You'll also reduce the amount of downtime you experience in the future. With our SAN islands, we had to bring down the fabric in order to merge islands - and perhaps the servers as well - to bring firmware levels in sync, or upgrade them to the latest version to support more or newer drives.
We found that it was better to schedule downtime when making most major changes to a SAN. Due to the infancy of SAN technology at the time, interoperability problems along with older versions of software, firmware and drivers could - and did - result in unplanned outages. Ensuring data integrity and uptime for our customers was the main objective, and therefore scheduling downtime for maintenance was sometimes necessary.
SAN maturity and issues with interoperability have improved, so you may not need to bring everything down to make changes now.
But without the right architecture, you may be forced into awkward configurations just to utilize all the available resources across multiple islands. In our case, we would sometimes end up with small switches linked in daisy-chain fashion to each other through a single ISL in order to achieve this. As a result, our SAN was vulnerable to single points of failure.
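The single-point-of-failure risk of a daisy chain can be checked on paper before anything is cabled. The following is a minimal sketch, not a description of our actual fabrics: the switch names and both layouts are hypothetical, and it models one fabric as a simple list of ISLs, removing each link in turn to see whether the fabric splits.

```python
from collections import deque

def is_connected(nodes, links):
    """Return True if every switch is reachable from the first one."""
    nodes = list(nodes)
    if not nodes:
        return True
    adjacency = {n: set() for n in nodes}
    for a, b in links:
        adjacency[a].add(b)
        adjacency[b].add(a)
    seen = {nodes[0]}
    queue = deque([nodes[0]])
    while queue:
        current = queue.popleft()
        for neighbor in adjacency[current]:
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(neighbor)
    return len(seen) == len(nodes)

def single_points_of_failure(nodes, links):
    """List the ISLs whose loss alone would split the fabric."""
    critical = []
    for i, link in enumerate(links):
        # Drop exactly one ISL; a second ISL between the same pair survives.
        remaining = links[:i] + links[i + 1:]
        if not is_connected(nodes, remaining):
            critical.append(link)
    return critical

# Hypothetical daisy chain: four small switches, one ISL between neighbors.
daisy_nodes = ["sw1", "sw2", "sw3", "sw4"]
daisy_links = [("sw1", "sw2"), ("sw2", "sw3"), ("sw3", "sw4")]

# Hypothetical core-edge layout: every edge switch has an ISL to both cores.
core_edge_nodes = ["core1", "core2", "edge1", "edge2", "edge3"]
core_edge_links = [(edge, core)
                   for edge in ["edge1", "edge2", "edge3"]
                   for core in ["core1", "core2"]]

print(single_points_of_failure(daisy_nodes, daisy_links))
print(single_points_of_failure(core_edge_nodes, core_edge_links))
```

In the daisy chain every ISL shows up as critical, while the core-edge layout reports none. The check covers link failures only; losing a whole switch would need a similar test that removes nodes instead of links.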
I don't recommend the daisy-chain approach in general. Spend the money, buy more switches and design an architecture that will allow the availability, performance and flexibility necessary when SANs become larger and need to scale. For us that meant trunking ISLs when possible as well as building out core-edge topologies that scaled. This also let us take advantage of storage resources by merging fabrics when necessary, without having to schedule downtime. This approach also allowed us to build on our earlier investment in smaller switches - we introduced newer, faster, larger switches at the core and pushed the older, smaller ones to the edge or, in some cases, to development environments.
While it is possible to build out a core-edge topology using smaller switches, it's more cost-effective to use bladed directors with a large port count if high availability and increased flexibility are the goal. However, whether using smaller switches or large director switches to build your core-edge topology, the key is to pick an architecture that will scale for your environment.
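To compare small switches against directors, it helps to work out how many ports a core-edge design actually leaves for servers and storage once ISLs are accounted for. This is a rough sketch under stated assumptions: the function name and the example port counts are hypothetical, all ports are assumed to run at the same speed, and the oversubscription figure is simply device ports divided by ISLs on each edge switch.

```python
def core_edge_budget(edge_switches, ports_per_edge, isls_per_edge, core_ports):
    """Estimate usable ports for a core-edge fabric (illustrative only)."""
    if not 0 < isls_per_edge < ports_per_edge:
        raise ValueError("each edge switch needs both ISL and device ports")
    device_ports_per_edge = ports_per_edge - isls_per_edge
    core_ports_used = edge_switches * isls_per_edge
    if core_ports_used > core_ports:
        raise ValueError("not enough core ports for the requested ISLs")
    return {
        "device_ports": edge_switches * device_ports_per_edge,
        "core_ports_used": core_ports_used,
        "core_ports_free": core_ports - core_ports_used,
        # Device ports sharing each ISL, assuming equal port speeds.
        "oversubscription": device_ports_per_edge / isls_per_edge,
    }

# Hypothetical example: eight 16-port edge switches, four ISLs each,
# feeding 64 core ports.
budget = core_edge_budget(edge_switches=8, ports_per_edge=16,
                          isls_per_edge=4, core_ports=64)
print(budget)
```

Running the same numbers with a different ISL count shows the trade-off quickly: fewer ISLs free up device ports at the edge but raise the number of devices contending for each link to the core.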
While architecture is crucial, you can also avoid future costs by building out the right physical infrastructure for connecting servers to storage resources.
When you start out small, running fiber under the floor for a few hosts works. But as you grow and have to move, add or retire servers, switches, tape libraries and storage arrays, maintaining the fiber under the floor can be cost prohibitive. Troubleshooting a connection problem can be difficult with so many strands running on top of each other in spaghetti fashion. Labels can become inaccurate or come off altogether after the cables have been handled and moved so many times.
If you don't have an infrastructure in place, you could end up with a lot of wasted fiber under the floor that could involve downtime to pull. Build the storage network like any other network: include patch panels with bundled fiber running between them, distributed throughout the data center(s), so that the anticipated length from any server, storage array, switch or tape library to its nearest panel can be calculated and the cables preordered.
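Once patch panels are in fixed locations, choosing a cable becomes simple arithmetic. The sketch below is only an illustration: the list of stock lengths and the 15% slack allowance are assumptions made for the example, not figures from our data center or from any cabling standard.

```python
# Hypothetical patch cord lengths (meters) kept in stock.
STOCK_LENGTHS_M = [1, 2, 3, 5, 10, 15, 20, 30]

def patch_cord_to_order(run_m, slack_fraction=0.15):
    """Pick the shortest stock cord that covers the measured run plus slack."""
    needed = run_m * (1 + slack_fraction)
    for length in STOCK_LENGTHS_M:
        if length >= needed:
            return length
    raise ValueError("run is longer than any stock cord; add a closer panel")

# Hypothetical runs from two devices to their nearest patch panel.
print(patch_cord_to_order(2.0))
print(patch_cord_to_order(4.0))
```

Because the bundled fiber between panels is installed once, only these short device-to-panel cords change when equipment moves.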
Understand the soft costs
One of the biggest soft costs when implementing and administering a SAN is the people. Probably because companies such as Intuit are telling vendors that such tools are needed, a number of vendors are working on ways to administer the SAN easily from a centralized location and eliminate the manual bookkeeping tasks.
As our SAN grew from 20TB to 50TB to 200TB and over 900 switch ports, the old ways of managing were no longer practical. The spreadsheets we started with, which detailed how the zones and disks were configured to the hosts, wouldn't scale effectively. Again, this worked initially, but keeping the documents updated always seemed to take a back seat to keeping the trains running.
Choose a SAN management tool early in your SAN deployment to cope with future growth. Good tools have widespread benefits in the area of interoperability, planning, scaling, and space reclamation (see "The right stuff").
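The kind of bookkeeping that spreadsheets handle poorly can be illustrated with a small consistency check. This is a sketch of the idea only, not how any vendor's management tool works: the data layout, the function name and the WWPN values are all hypothetical. It compares a port inventory against zone membership and reports zone members that no longer exist and ports that belong to no zone.

```python
def audit_zones(inventory, zones):
    """Cross-check zone members against the port inventory.

    inventory: dict mapping WWPN -> (switch name, port number)
    zones:     dict mapping zone name -> list of member WWPNs
    Returns (unknown, unzoned): stale members per zone, and
    inventoried WWPNs that appear in no zone.
    """
    zoned = set()
    unknown = {}
    for zone_name, members in zones.items():
        for wwpn in members:
            zoned.add(wwpn)
            if wwpn not in inventory:
                unknown.setdefault(zone_name, []).append(wwpn)
    unzoned = sorted(set(inventory) - zoned)
    return unknown, unzoned

# Hypothetical inventory and zoning records (made-up WWPNs).
inventory = {
    "10:00:00:00:00:00:00:01": ("edge1", 3),   # database host
    "10:00:00:00:00:00:00:02": ("edge1", 4),   # newly added host
    "50:00:00:00:00:00:00:0a": ("core1", 0),   # storage array port
}
zones = {
    "db_zone": ["10:00:00:00:00:00:00:01", "50:00:00:00:00:00:00:0a"],
    "old_zone": ["10:00:00:00:00:00:00:99", "50:00:00:00:00:00:00:0a"],
}

unknown, unzoned = audit_zones(inventory, zones)
print("Zone members missing from inventory:", unknown)
print("Ports not in any zone:", unzoned)
```

Here the retired host still listed in "old_zone" and the new host that was never zoned are exactly the kinds of drift that creep in when records are updated by hand.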
If there's one thing I've learned from our whole experience, it's that basic technologies change rapidly. Absorbing them while running a real environment means you have to have good policies, procedures, design and management tools - so don't wait too long to put them in place.
Now that we've arrived at a scalable architecture and infrastructure and we know what we need for management tools, implementation is our biggest challenge. We're currently focusing on three main pain points: implementing an enterprise resource management tool, finding better and more efficient ways to archive data, and setting up remote data facility replication to ensure higher levels of data integrity and usability in the event of a disaster.
Change is still the order of the day. But I believe that with the current state of the art and the lessons that pioneers such as Intuit have learned, many companies can start at a reasonable level and grow into the many terabyte range while preserving most of their initial investment.