Amazon "Broke the Internet With a Typo" and Other Lies Your CIO/CTO Told You


Last week Amazon S3 suffered a major outage that, at its core, ran four hours but left a trail of collateral damage behind as well. Since then, in most cases in an effort to shift blame, the story being perpetuated is that this was entirely Amazon’s fault and largely unavoidable.

That’s wrong. Running a business carries inherent risk when it comes to infrastructure - whether that infrastructure is your own or someone else’s. The businesses that suffered the most during the S3 outage had made architectural decisions that turned it into a systemic outage for them.

The facts: The S3 outage was limited to US-EAST-1 (N. Virginia), but because of the nature of S3, it propagated across multiple availability zones in that region. For those not in the know, availability zones are how AWS lets you stay within one region but still have failure redundancy. Most regions have an A, B, and C availability zone. Staying within a single region reduces overall cost but increases risk, and that trade-off has to be weighed when determining what your application architecture will look like.

The spin: It is more difficult, but not impossible, to run applications and AWS infrastructure across regions in addition to spanning availability zones. At ICF Olson, we have quite a few clients who have chosen to keep their primary infrastructure in a single region, but to have a DR environment in a separate region for this very reason. In fact, some of my team members were working with clients to make the call to fail over to DR at the very point S3 began recovering.
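To make the cross-region idea a little more concrete, here is a minimal Python sketch of the read side of a failover: try the primary region first, then fall back to the DR region. It is only an illustration under stated assumptions. The names `fetch_with_failover`, `AllRegionsFailed`, and `fake_fetch` are hypothetical, the storage client is a stand-in rather than a real AWS call, and it assumes the data has already been replicated to the second region.

```python
from typing import Callable, Dict, Sequence, Tuple


class AllRegionsFailed(Exception):
    """Raised when no configured region could serve the request."""


def fetch_with_failover(
    key: str,
    regions: Sequence[str],
    fetch: Callable[[str, str], bytes],
) -> Tuple[str, bytes]:
    """Try each region in order; return (region, data) from the first that answers."""
    errors: Dict[str, Exception] = {}
    for region in regions:
        try:
            return region, fetch(region, key)
        except Exception as exc:  # a real client would catch narrower error types
            errors[region] = exc
    raise AllRegionsFailed(f"no region could serve {key!r}: {errors}")


# Hypothetical stand-in for a storage client: the primary region is down.
def fake_fetch(region: str, key: str) -> bytes:
    if region == "us-east-1":
        raise ConnectionError("primary region unavailable")
    return f"{key} from {region}".encode()


served_by, data = fetch_with_failover(
    "logo.png", ["us-east-1", "us-west-2"], fake_fetch
)
print(served_by, data)
```

The ordering of the `regions` list is the design decision here: the primary comes first, and the DR region is only touched when the primary fails. The harder and more expensive part, keeping the second region populated and current, is exactly the cost-versus-risk choice discussed in this article and is not shown.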

Let’s go back to my hard statement of “that’s wrong.” Infrastructure is inexpensive these days compared to the past. It is a commodity, especially when operating in a public, private, or hybrid cloud environment. The cost is lower because it is shared across a greater number of infrastructure clients. Still, the choice - whether it is knowledge-based or cost-based - to limit assets to a single location is a risk, and must be evaluated as such.

In my opinion, some of the businesses that were impacted by the outage at Amazon made a cost-based decision that increased their risk. Smaller businesses with greater financial exposure may legitimately have tighter budgets, but some of the billion-dollar multinationals really have very little excuse.

In situations like these, it seems the increased risk is much less tolerable in practice than it was in theory, and I end up thinking back to one of my favorite quotes from the movie Contact, where John Hurt’s character S. R. Hadden asks Jodie Foster’s character, “Why build one when you can have two at twice the price?” He says it precisely because the first “contact vehicle” had been destroyed by fanatics.

In the long run, four hours of downtime is minimal. By Friday, most clients I was talking with had recognized that the pain they felt, while perhaps acute on Tuesday, was diffused and had much less impact because the story was shared with a large number of people.

The harder conversation to have with some of those clients on Wednesday and Thursday of last week was, “This didn’t have to happen, but this was the choice that was made. We are going to continue to support it, but we (and you) are constrained by those choices.” Put in that light, several were more interested in how spending a bit more money might significantly reduce the possibility of this happening again.

In the end, I think the S3 outage itself was productive. Clients (and even some of our own infrastructure) have been shown to be exposed in ways that might otherwise have gone unnoticed, and we’ve been able to have conversations that were shut down previously. So if your senior technical people are telling you “it’s all Amazon’s fault,” I suggest you point them to this article and ask them if what I’ve written is true.


Stephen Sadowski

Leader focusing on quality, delivery, technical debt management, and leadership education about DevOps and SRE practices