Thoughts on the Amazon Outage

If you haven’t read it yet, take the time to read Amazon’s follow-up on the serious outage of their web services. It’s a good note: a decent amount of detail, and not defensive, which is the best you can do when something goes wrong on a grand scale. It’s interesting to note that the entire outage was initiated by a bad configuration change on the network. They’re not the first company hit by that kind of error, and certainly not the last.

Of course, the fix wasn’t as simple as replacing the wrong route with the correct one; the change initiated a cascade of issues that took days to correct. Simple systems rarely cascade at that level, but we’ve all felt the pain when a key piece of hardware stops working: something as basic as a non-functioning file share can cause a whole lot of stuff to break.

Cloud systems depend on automation; there’s no other way to run them. When it works, it’s good. When it breaks…well, that’s frustrating if all your eggs were in that basket and you just have to sit and wait for things to be fixed. Can’t make it go faster, can’t get your hands dirty – a miserable situation.

That doesn’t mean you shouldn’t put stuff in the cloud. It does mean it’s worthwhile to think about a continuity plan in case it fails. The biggest challenge I see is that many move to the cloud for lower cost and greater availability precisely to avoid building and spending on a continuity plan. It’s a bet, and usually a good one – except this time. At the end of the day you have to own your business, and if you lose money because you bet on an external provider, that’s the way it goes.