πŸ‘€tchπŸ•‘13yπŸ”Ό161πŸ—¨οΈ65

(Replying to PARENT post)

I dunno about you, but I could use a TL;DR for this:

1. They fucked up an internal DNS change and didn't notice

2. Internal systems on EBS hosts piled up with messages trying to get to the non-existent domain

3. Eventually the messages used up all the memory on the EBS hosts, and thousands of EBS hosts began to die simultaneously (rough sketch of this pile-up below)

4. Meanwhile, panicked operators trying to slow down this tidal wave hit the Throttle Everything button

5. The throttling was so aggressive that even normal levels of operation became impossible

6. The incident was confined to a single AZ, but the throttling was region-wide, which spread the pain further

[Everybody who got throttled gets a 3-hour refund]

7. Any single-AZ RDS instance on a dead EBS host was fucked

8. Multi-AZ RDS instances ran into two separate bugs, and either became stuck or hit a replication race condition and shut down

[Everybody whose multi-AZ RDS didn't fail over gets 10 days free credit]

9. Single-AZ ELB instances in the broken AZ failed because they use EBS too

10. Because everybody was freaking out and trying to fix their ELBs, the ELB service ran out of IP addresses and locked up

11. Multi-AZ ELB instances took too long to notice EBS was broken and then hit a bug and didn't fail over properly anyway

[ELB users get no refund, which seems harsh]

For those keeping score, that's 1 human error, 2 dependency chains, 3 design flaws, 3 instances of inadequate monitoring, and 5 brand-new internal bugs. From the length and groveling tone of the report, I can only assume that a big chunk of customers are very, VERY angry at them.
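
If steps 1-3 sound abstract, the shape of the bug is roughly this toy Python sketch (my reading of the report, not Amazon's actual agent; the hostname and send() are invented):

    import socket
    from collections import deque

    COLLECTOR = "metrics.internal.example"  # stand-in for the misconfigured DNS record
    backlog = deque()                       # unbounded in-memory queue -- the actual bug

    def send(msg):
        pass                                # placeholder for the real delivery path

    def report(msg):
        backlog.append(msg)
        try:
            socket.getaddrinfo(COLLECTOR, 443)  # fails after the bad DNS change
        except socket.gaierror:
            return                          # unreachable: keep everything queued for "later"
        while backlog:
            send(backlog.popleft())

    # Every report() call now leaks one message; multiply by thousands of EBS
    # hosts doing the same thing and they all run out of memory at roughly the
    # same time (step 3).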

πŸ‘€seldoπŸ•‘13yπŸ”Ό0πŸ—¨οΈ0

(Replying to PARENT post)

> We are already in the process of making a few changes to reduce the interdependency between ELB and EBS to avoid correlated failure in future events and allow ELB recovery even when there are EBS issues within an Availability Zone.

This is music to my ears. We switched away from ELBs because of this dependency. Hopefully this statement means Amazon is working on completely removing any use of EBS from ELBs.

We came to the conclusion a year and a half ago that EBS has had too many cascading failures to be trustworthy for our production systems. We now run everything on ephemeral drives and use Cassandra distributed across multiple AZs and multiple regions for data persistence.

I highly recommend getting as many servers as you can off EBS.
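
Roughly, the data-placement side of that setup looks like the sketch below (the keyspace, datacenter names, and addresses are all made up, and I'm illustrating it with the DataStax Python driver). NetworkTopologyStrategy puts replicas in more than one AZ and region, so losing one AZ's disks loses no data:

    from cassandra.cluster import Cluster

    session = Cluster(["10.0.1.10", "10.0.2.10"]).connect()  # seed nodes in two AZs
    session.execute("""
        CREATE KEYSPACE IF NOT EXISTS appdata WITH replication = {
            'class': 'NetworkTopologyStrategy',
            'us-east': 3,
            'us-west': 2
        }
    """)
    # 3 replicas spread across AZs in one region, 2 more in a second region;
    # reading and writing at LOCAL_QUORUM then rides out a single-AZ EBS event.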

πŸ‘€helperπŸ•‘13yπŸ”Ό0πŸ—¨οΈ0

(Replying to PARENT post)

I really love it when companies take the time to explain to their customers what happened, especially in such detail.

It's clearly a very complicated setup, and this type of post makes me trust them more. Don't get me wrong, an outage is an outage, but knowing that they are in control and take the time to explain shows respect and the right attitude towards a mistake.

Good for them!

πŸ‘€TrufaπŸ•‘13yπŸ”Ό0πŸ—¨οΈ0

(Replying to PARENT post)

I am always astonished by how many layers these bugs actually have. It's easy to start out blaming AWS, but if anyone can realistically say they could have anticipated this type of issue at a system level, they're deluding themselves.
πŸ‘€lukevπŸ•‘13yπŸ”Ό0πŸ—¨οΈ0

(Replying to PARENT post)

> Multi Availability Zone (Multi-AZ), where two database instances are synchronously operated in two different Availability Zones.

> The second group of Multi-AZ instances did not failover automatically because the master database instances were disconnected from their standby for a brief time interval immediately before these master database instances’ volumes became stuck. Normally these events are simultaneous. Between the period of time the masters were disconnected from their standbys and the point where volumes became stuck, the masters continued to process transactions without being able to replicate to their standbys.

Can someone explain this? I thought the entire point of synchronous replication was that the master doesn't acknowledge that a transaction is committed until the data reaches the slave. That's how it's described in the RDS FAQ: http://aws.amazon.com/rds/faqs/#36
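
The only way I can square those two statements is if the replication quietly degrades instead of blocking when the standby link drops, something like this toy sketch (pure speculation on my part, not RDS's actual code):

    class ReplicationLinkDown(Exception):
        pass

    class Standby:
        def __init__(self):
            self.link_up = True
        def apply(self, txn):
            if not self.link_up:
                raise ReplicationLinkDown()

    class Master:
        def __init__(self, standby):
            self.standby = standby
            self.unreplicated = 0
        def commit(self, txn):
            try:
                self.standby.apply(txn)   # normal case: ack only once the standby has it
            except ReplicationLinkDown:
                self.unreplicated += 1    # link briefly gone: keep taking writes anyway
            return "ack"                  # client sees a successful commit either way

    def safe_to_auto_failover(master):
        # Automation then refuses to promote a standby it knows is missing
        # transactions -- which is exactly the "did not failover" case above.
        return master.unreplicated == 0

If that's what happens, commits are only synchronous while the link is up, which isn't how I'd read the FAQ.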

πŸ‘€teraflopπŸ•‘13yπŸ”Ό0πŸ—¨οΈ0

(Replying to PARENT post)

Every time there is a service outage it makes me feel better about using them in the future. Every outage actually makes the service more reliable, since some issues will only manifest in production. I believe they have a great team that's very knowledgeable.
πŸ‘€ndcrandallπŸ•‘13yπŸ”Ό0πŸ—¨οΈ0

(Replying to PARENT post)

AWS sure does put out amazing post mortems. If only they'd make their status page more useful ...
πŸ‘€mrkurtπŸ•‘13yπŸ”Ό0πŸ—¨οΈ0

(Replying to PARENT post)

So how reliable is AWS in comparison to some of its competitors? I think there might be a slight bias whenever AWS has a problem because so many big name sites rely on them, and they're the proverbial 800lb gorilla of the cloud computing space.

How many of these massive outages are affecting its competitors that we never hear about?

πŸ‘€KarunamonπŸ•‘13yπŸ”Ό0πŸ—¨οΈ0

(Replying to PARENT post)

I avoid EBS because I think it is very complex, hard to do right, and has nasty failure modes if you use it within a UNIX environment (your code basically hangs, with no warning).

Now I learned that ELB uses EBS internally. I consider this very bad news, as I inadvertently became dependent on EBS. I intend to stop using ELB.
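
For what it's worth, the closest thing to a mitigation I know of is to put every disk write behind an external deadline, roughly like this sketch (my own illustration; the path handling and the 5-second timeout are arbitrary):

    import os
    from concurrent.futures import ThreadPoolExecutor, TimeoutError

    _io_pool = ThreadPoolExecutor(max_workers=4)

    def write_with_deadline(path, data, timeout=5.0):
        # data is bytes; the write runs in a worker thread we can give up on
        def _write():
            fd = os.open(path, os.O_WRONLY | os.O_CREAT, 0o644)
            try:
                os.write(fd, data)
                os.fsync(fd)          # this is the call that hangs on a stuck volume
            finally:
                os.close(fd)
        future = _io_pool.submit(_write)
        try:
            future.result(timeout=timeout)
        except TimeoutError:
            # The worker may stay wedged in uninterruptible sleep; mark the
            # volume bad and stop routing writes at it instead of piling up.
            raise IOError("volume looks hung: %s" % path)

It doesn't unstick anything, but at least the rest of the code finds out instead of silently hanging.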

πŸ‘€jwrπŸ•‘13yπŸ”Ό0πŸ—¨οΈ0

(Replying to PARENT post)

I know there are lots of smart people working there, but just look at the sheer number of AWS offerings. Amazon certainly gets credit for quickly putting out new features and services, but it makes me wonder if their pace has resulted in way too many moving parts with an intractable number of dependencies.
πŸ‘€papercruncherπŸ•‘13yπŸ”Ό0πŸ—¨οΈ0

(Replying to PARENT post)

Everything is a Freaking DNS problem :)
πŸ‘€filvdgπŸ•‘13yπŸ”Ό0πŸ—¨οΈ0

(Replying to PARENT post)

For anyone interested, here are the comments on the App engine outage: http://news.ycombinator.com/item?id=4704973
πŸ‘€michaelkscottπŸ•‘13yπŸ”Ό0πŸ—¨οΈ0

(Replying to PARENT post)

Bugs happen, and the effects of cascading failures are very hard to anticipate. But it seems like the AWS team hadn't fully tested the effects of an EBS outage, which could have uncovered the RDS multi-AZ failover bug and perhaps the ELB failover bug ahead of time.
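
Even a coarse fault-injection suite would have helped; something like this toy test (all names are hypothetical, obviously not AWS's harness) where you wedge the primary's volume in a test double and assert the standby actually gets promoted:

    import unittest

    class FakeVolume:
        def __init__(self):
            self.stuck = False

    class FakeDatabase:
        def __init__(self, volume):
            self.volume = volume
            self.role = "standby"
        def healthy(self):
            return not self.volume.stuck

    class FailoverMonitor:
        def __init__(self, primary, standby):
            self.primary, self.standby = primary, standby
        def tick(self):
            if not self.primary.healthy():
                self.standby.role = "primary"

    class EbsOutageTest(unittest.TestCase):
        def test_stuck_volume_promotes_standby(self):
            primary = FakeDatabase(FakeVolume())
            primary.role = "primary"
            standby = FakeDatabase(FakeVolume())
            monitor = FailoverMonitor(primary, standby)
            primary.volume.stuck = True   # the injected "EBS outage"
            monitor.tick()
            self.assertEqual(standby.role, "primary")

    if __name__ == "__main__":
        unittest.main()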
πŸ‘€krosaenπŸ•‘13yπŸ”Ό0πŸ—¨οΈ0

(Replying to PARENT post)

This is some degree of complexity.

What does it suggest when they say they "learned about new failure modes"? It suggests that there are more failure modes not yet learned.

One wonders if somewhere internally they have a dynamic model of how all this works. If not, it might be a good time to build one.
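
Even a toy version would help; something like this (service names and edges invented by me, just to show the shape), where you can poke the graph and see what a single-AZ EBS failure drags down with it:

    DEPENDS_ON = {
        "EBS":          [],
        "EC2-volume":   ["EBS"],
        "RDS":          ["EBS"],
        "ELB":          ["EBS"],      # the dependency this post-mortem surfaced
        "customer-app": ["EC2-volume", "RDS", "ELB"],
    }

    def impacted(failed):
        """Return every service transitively reachable from the failed set."""
        down = set(failed)
        changed = True
        while changed:
            changed = False
            for svc, deps in DEPENDS_ON.items():
                if svc not in down and any(d in down for d in deps):
                    down.add(svc)
                    changed = True
        return down

    print(sorted(impacted({"EBS"})))
    # -> ['EBS', 'EC2-volume', 'ELB', 'RDS', 'customer-app']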

πŸ‘€wglbπŸ•‘13yπŸ”Ό0πŸ—¨οΈ0

(Replying to PARENT post)

πŸ‘€RandgaltπŸ•‘13yπŸ”Ό0πŸ—¨οΈ0