When the cloud was good, it was very very good. But when it was bad, it was horrid

Cloud computing took a big hit this week amid two significant service outages.

The bigger one, at least as it affects enterprise computing, was the eight-hour failure of Amazon’s Simple Storage Service. Check out the Amazon Web Services service health dashboard, and then select Amazon S3 in the United States for July 20. You’ll see that problems began at 9:05 am Pacific Time with “elevated error rates,” and that service wasn’t reported as being fully restored until 5:00 pm.

About the error, Amazon said,

We wanted to share a brief note about what we observed during yesterday’s event and where we are at this stage. As a distributed system, the different components of Amazon S3 need to be aware of the state of each other. For example, this awareness makes it possible for the system to decide to which redundant physical storage server to route a request. In order to share this state information across the system, we use a gossip protocol. Yesterday, we experienced a problem related to gossiping our internal state information, leaving the system components unable to interact properly and causing customers’ requests to Amazon S3 to fail. After exploring several alternatives, we determined that we had to temporarily take the service offline so that we could clear all gossipped state and restart gossip to rebuild the state.

These are sophisticated systems and it generally takes a while to get to root cause in such a situation. We’re working very hard to do this and will be providing more information here when we’ve fully investigated the incident. We also wanted to let you know that for this particular event, we’ll be waiving our standard SLA process and applying the appropriate service credit to all affected customers for the July billing period. Customers will not need to send us an e-mail to request their credits, as these will be automatically applied. This transaction will be reflected in our customers’ August billing statements.
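For readers who haven't run into gossip protocols, here is a minimal sketch of the general idea, not Amazon's implementation. Everything in it is an assumption made for illustration: the `Node` class, the `(version, status)` entries, and the push-pull exchange with one random peer per round. Each node starts out knowing only about itself, and repeated random exchanges spread that knowledge until every node holds the same view of the system.

```python
import random


class Node:
    """A hypothetical cluster member that tracks what it knows about its peers."""

    def __init__(self, name):
        self.name = name
        # view maps node name -> (version, status); the higher version wins
        self.view = {name: (1, "up")}

    def merge(self, other_view):
        """Adopt any entry that is new to us or newer than the one we hold."""
        for peer, (version, status) in other_view.items():
            if peer not in self.view or version > self.view[peer][0]:
                self.view[peer] = (version, status)


def gossip_round(nodes, rng):
    """Each node exchanges state with one randomly chosen peer (push-pull)."""
    for node in nodes:
        peer = rng.choice([n for n in nodes if n is not node])
        snapshot = dict(node.view)  # what this node knew before the exchange
        node.merge(peer.view)
        peer.merge(snapshot)


def converged(nodes):
    """True once every node holds an identical view."""
    return all(n.view == nodes[0].view for n in nodes)


def run_until_converged(nodes, rng, max_rounds=200):
    """Gossip until all views agree; return the round count, or None on timeout."""
    for round_no in range(1, max_rounds + 1):
        gossip_round(nodes, rng)
        if converged(nodes):
            return round_no
    return None


def reset(nodes):
    """Clear all gossiped state so each node again knows only itself."""
    for n in nodes:
        n.view = {n.name: (1, "up")}
```

Calling `run_until_converged([Node(f"n{i}") for i in range(8)], random.Random(0))` returns the number of rounds it took for all eight views to agree. The `reset` helper is only a loose analogy to the "clear all gossipped state and restart gossip" step in Amazon's note: wipe what every node believes about the others, then let the exchanges rebuild it.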

Kudos to Amazon for issuing a billing adjustment. However, as we all know, the business cost of a service failure like this vastly exceeds the cost you pay for the service. If your applications were offline for eight hours because Amazon S3 was malfunctioning, that really hurts your bottom line. This wasn’t their first service failure, either: Amazon S3 went down in February as well.

The other outage, less significant to enterprises but just as annoying to those affected, involved e-mail accounts hosted on Apple’s MobileMe service. MobileMe is the new name of the .Mac service, and the service was updated in mid-July along with the launch of the iPhone 3G. Unfortunately, not everything worked right. As you can see from Apple’s dashboard, some subscribers can’t access their e-mail. Currently, this affects about 1% of subscribers, but it’s been like that since last Friday.

According to Apple,

We understand this is a serious issue and apologize for this service interruption. We are working hard to restore your service.

This reminds me of the poem from that great Maine writer, Henry Wadsworth Longfellow:

There was a little girl
Who had a little curl
Right in the middle of her forehead;
And when she was good
She was very, very good,
But when she was bad she was horrid.

When the cloud was good, it was very, very good. But when it was bad, it was horrid.
