BT Exact, the R&D and IT operations business of British Telecom, had until recently over a hundred separate SANs littered across the company. Like many large organizations, BT would purchase a particular array for a particular project, hook it up and forget about it.
Recently, though, BT Exact has embarked on a large-scale SAN consolidation project aimed at better utilization and easier management. While the company has followed tried-and-true approaches, such as sticking with a common vendor, it has also found that realizing the benefits of consolidation requires a new level of methodical processes and procedures. Technology alone will not bring efficient scaling.
"We were constantly buying new arrays," says Peter Hull, BT Exact's SAN infrastructure designer. He wouldn't give out the exact number but says the IT department was jammed with EMC 8830s and 8730s. These arrays hold between 10 and 15 TB of storage each, and BT Exact was using only 2 to 4 TB of each one.
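To put those figures in perspective, the utilization they imply can be worked out directly. The short sketch below is illustrative only: the function name is hypothetical, and the inputs are simply the ranges quoted above (2 to 4 TB used, 10 to 15 TB of capacity per array), not measurements supplied by BT Exact.

```python
def utilization_range(used_tb, capacity_tb):
    """Return (lowest, highest) utilization as fractions.

    used_tb and capacity_tb are (min, max) tuples in terabytes.
    The lowest utilization pairs the least data with the largest array;
    the highest pairs the most data with the smallest array.
    """
    used_lo, used_hi = used_tb
    cap_lo, cap_hi = capacity_tb
    return used_lo / cap_hi, used_hi / cap_lo


# Ranges taken from the figures quoted in the article.
low, high = utilization_range(used_tb=(2, 4), capacity_tb=(10, 15))
print(f"Utilization per array: roughly {low:.0%} to {high:.0%}")
```

On those numbers, each array was somewhere between about 13% and 40% full, which is the wasted capacity the consolidation was meant to recover.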
Not only was the company wasting valuable capacity, but provisioning new storage took months. "Every time we needed more storage it was a separate business case for each array," says Hull. And controlling the way the technology was implemented in each scenario became impossible, he says.
Hull decided about two years ago that all these separate SANs had to be consolidated into one and turned to Brocade to fix the problem. "We were looking for best of breed … Cisco wasn't on the scene then, and from the point of view of handling security, Brocade was the only one that could do hardware-enforced zoning," Hull says. Hardware-enforced zoning segments storage at the port level and is important to companies that share storage among many different departments or among their customers.
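The idea behind port-level zoning can be sketched in a few lines. This is a conceptual model only, not Brocade's implementation or configuration syntax: the zone names, port identifiers and the may_communicate helper are all hypothetical, chosen to show the rule that two ports can talk only if they share a zone.

```python
from typing import Dict, FrozenSet, Tuple

# A port is identified here by (switch ID, port number): hypothetical values.
Port = Tuple[int, int]

# Each zone is a named set of ports that are allowed to see one another.
ZONES: Dict[str, FrozenSet[Port]] = {
    "finance_zone": frozenset({(1, 4), (1, 17)}),   # a host port and an array port
    "research_zone": frozenset({(1, 5), (2, 9)}),
}


def may_communicate(a: Port, b: Port, zones: Dict[str, FrozenSet[Port]] = ZONES) -> bool:
    """Return True only if both ports are members of at least one common zone."""
    return any(a in members and b in members for members in zones.values())


print(may_communicate((1, 4), (1, 17)))  # same zone
print(may_communicate((1, 4), (2, 9)))   # different zones
```

In a real fabric the switch hardware applies this kind of membership check to traffic at each port, so a department's hosts cannot reach storage outside their zone even though everything shares one network.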
BT Exact deployed twelve 128-port Brocade Silkworm 12000 directors and a number of 32-port Silkworm 3900 switches across multiple sites in a duplicated design, with the objective of preventing network downtime and ensuring high levels of system availability. The company declined to give out precise details of its network topology for competitive reasons.
With the old DAS architecture, Hull says, it was impossible to schedule an outage because "we could never take it down," whereas with a fabric infrastructure, parts of it have to come down for maintenance or firmware upgrades. Therefore "the design had to be completely fault tolerant and reliable," he says.
The decision to stick with Brocade as it continued to build out the SAN was a matter of playing it safe, Hull says. "Inter-working between different switch suppliers isn't easy and we were already a Brocade shop, so it made sense."
All the same, the company still experienced some integration challenges. Getting internal standards written and then implemented accordingly was tricky. "Everything comes from the factory with a certain default setting and that isn't always right for your environment," Hull says. For example, he had to extend the default HBA time-out setting from 30 seconds to 60 seconds. The original window was too short to complete changes, and each time BT Exact added another edge switch the whole fabric started reinitializing at random.
Getting his staff to pay attention to these details was a challenge. "People tend to do what works rather than what should be done," Hull says.
The new design required the staff to rethink their approach to the infrastructure. Before the single SAN, each array was managed independently, and consequently any changes to it affected only that system. "Now with everything connected on a network you have to make sure any change only affects what you intend and has no side effects," Hull acknowledges. "The cost of going to this infrastructure is that you have to run a much stricter operation."
Behind the scenes, BT Exact is working on a major data migration project which, again, it declined to discuss. "It's commercially sensitive, but we are thinking about application mobility … Having access to disks without physically having to make changes to them frees us up to do a lot more things," Hull says. He can also provision storage in three to four days now, instead of it taking months for each request. "It's simply a matter of adjusting the infrastructure," he says.