Another service-provider infrastructure gets the hiccups

Amazon’s S3 online storage service suffered an outage for several hours this morning, echoing the outage suffered by email service provider RIM last week. While RIM’s outage affected CrackBerry addicts who have alternatives to email, the Amazon outage may have affected Web-based companies that rely on S3’s storage to deliver core services. Not good.

However, one S3 user I talked to today, SmugMug CEO Don MacAskill, said his site didn’t feel a thing. “None of our customers reported any issues–we haven’t seen any problems that are customer facing,” he said.

There’s an important factor that may have contributed to SmugMug’s resiliency: after another outage last year, SmugMug started keeping about 10% of its data in a hot cache on-site. “It could have been that the hot cache was adequate for the 2 or so hours it was going on, or it could have been that for some people the outage was intermittent,” he added.
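To make the hot-cache idea concrete, here is a minimal sketch in Python. It is not SmugMug's actual design, which the article does not describe. The class names (`FakeRemoteStore`, `HotCacheReader`) and the object keys are hypothetical, and the remote store is an in-memory stand-in rather than a real S3 client. It only illustrates why reads of cached objects can survive a provider outage while reads of everything else cannot.

```python
class RemoteUnavailable(Exception):
    """Raised by the stand-in remote store while it is 'down'."""


class FakeRemoteStore:
    """Hypothetical stand-in for a hosted storage service (not a real S3 client)."""

    def __init__(self, objects):
        self.objects = dict(objects)
        self.available = True

    def get(self, key):
        if not self.available:
            raise RemoteUnavailable(key)
        return self.objects[key]


class HotCacheReader:
    """Serve reads from a small on-site cache, falling back to the remote store."""

    def __init__(self, remote, hot_keys):
        self.remote = remote
        # Pre-load only the "hot" subset on-site while the remote store is healthy.
        self.cache = {key: remote.get(key) for key in hot_keys}

    def get(self, key):
        if key in self.cache:
            return self.cache[key]   # cache hit: no remote call needed
        return self.remote.get(key)  # cache miss: depends on the provider being up


remote = FakeRemoteStore({f"photo-{i}": f"bytes-{i}" for i in range(10)})
reader = HotCacheReader(remote, hot_keys=["photo-0"])  # 1 of 10 objects, i.e. 10%

remote.available = False          # simulate the outage
served = reader.get("photo-0")    # still served from the on-site cache

try:
    reader.get("photo-5")         # not cached, so this read needs the provider
    cold_read_failed = False
except RemoteUnavailable:
    cold_read_failed = True
```

In this sketch a 10% cache only helps if requests concentrate on that 10%, which is consistent with MacAskill's caveat that the cache may or may not have been what kept the site up.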

Meanwhile, some users were still reporting issues as recently as five minutes ago on Amazon’s Web Services Developer Connection message board. According to an official response on the thread about an hour ago, “This morning’s issue has been resolved and the system is continuing to recover. However, we are currently seeing slightly elevated error rates for some customers, and are actively working to resolve this. More information on that to follow as we have it.”

Their businesses aren’t the same, but I think this ties in with what I was saying in my post about RIM’s BlackBerry meltdown: as more and more data “eggs” are put into centralized service-provider “baskets”, more and more of them are going to get broken, especially as the service-provider market ramps up.

Or as TechCrunch put it:

This could just be growing pains for Amazon Web Services, as more startups and other companies come to rely on it for their Web-scale computing infrastructure. But even if the outage only lasted a couple hours, it is unacceptable. Nobody is going to trust their business to cloud computing unless it is more reliable than the data-center computing that is the current norm. So many Websites now rely on Amazon’s S3 storage service and, increasingly, on its EC2 compute cloud as well, that an outage takes down a lot of sites, or at least takes down some of their functionality. Cloud computing needs to be 99.999 percent reliable if Amazon and others want it to become more widely adopted.

Growing pains may have had something to do with it, according to Taneja Group analyst Eric Burgener. “There’s less of this going on than there used to be, but this is one of those things that gives people pause about services,” he said. A focus on secondary storage and storage for small companies has made this crop of service providers more successful than the SSPs of the bubble days, and even where companies are relying on services like this for primary storage, Burgener argued that the services option is still the better bet. “For small internet businesses services are still a perfect play–they allow businesses to start up rapidly without the kind of capital expense or infrastructure they need for an in-house system.”
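For a sense of scale on the 99.999 percent figure TechCrunch cites, the arithmetic can be sketched in a few lines of Python. The function names are illustrative. The 120-minute outage is an assumption taken from MacAskill's "2 or so hours" estimate, not an official Amazon number, and the calculation treats it as the only downtime in a 365-day year.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, ignoring leap years


def allowed_downtime_minutes(availability_percent):
    """Yearly downtime budget implied by an availability target."""
    return MINUTES_PER_YEAR * (1 - availability_percent / 100)


def availability_percent(downtime_minutes):
    """Yearly availability implied by a given amount of downtime."""
    return 100 * (1 - downtime_minutes / MINUTES_PER_YEAR)


# "Five nines" leaves a budget of roughly 5.26 minutes per year.
five_nines_budget = allowed_downtime_minutes(99.999)

# A single 120-minute outage (assumed) works out to roughly 99.977 percent.
two_hour_outage = availability_percent(120)

print(f"{five_nines_budget:.2f} minutes per year at 99.999%")
print(f"{two_hour_outage:.3f}% availability after one 120-minute outage")
```

Under these assumptions, one outage of that length uses up more than twenty years' worth of a five-nines downtime budget.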

Join the conversation



RIM admitted to a change management failure. Amazon hasn't really admitted to anything, and their statement is generic with no pointer to any cause. Achieving 99.999% uptime is not a technical solution but a psychological one. The reason is that the technology provides the uptime while the humans provide the downtime. In this tug of war, who wins?
Beth, how does one service outage "echo" the other? At this point in time *NEITHER* service outage has been explained to have been caused by a storage system failure. Granted, they are both service failures. I guess the point of your post is that any service (storage, transpo., comm., health, newspaper delivery, etc.) that you use could be less than 100% reliable.
The point of my post is that massively multi-tenant services which concentrate huge amounts of processing demand onto relatively few instances of computer technology will have some growing pains and problems as more and more people adopt the service-provider model. And that these hiccups, when they happen, because of the massive multi-tenancy, will be felt far and wide.