(Replying to PARENT post)
This is music to my ears. We switched away from ELBs because of this dependency. Hopefully this statement means Amazon is working on completely removing any use of EBS from ELBs.
We came to the conclusion a year and a half ago that EBS has had too many cascading failures to be trustworthy for our production systems. We now run everything on ephemeral drives and use Cassandra distributed across multiple AZs and multiple regions for data persistence.
I highly recommend getting as many servers as you can off EBS.
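In case it's useful to anyone, the shape of the data layer is roughly: one Cassandra datacenter per region, replicas spread across AZs, and a keyspace replicated to every region. A minimal sketch with the DataStax Python driver (which may or may not match what you'd use; the keyspace, datacenter names, addresses and replication factors below are made up, not our actual config):

    # Multi-region Cassandra keyspace for durability without EBS.
    # Datacenter names, addresses and replication factors are hypothetical.
    from cassandra.cluster import Cluster
    from cassandra.policies import DCAwareRoundRobinPolicy, TokenAwarePolicy

    # Connect to seeds in the local region; the driver discovers the rest of the ring.
    cluster = Cluster(
        contact_points=["10.0.1.10", "10.0.2.10"],   # one seed per local AZ
        load_balancing_policy=TokenAwarePolicy(
            DCAwareRoundRobinPolicy(local_dc="us_east")
        ),
    )
    session = cluster.connect()

    # NetworkTopologyStrategy places replicas per datacenter (one DC per region);
    # the EC2 snitch maps AZs to racks so replicas land in different AZs.
    session.execute("""
        CREATE KEYSPACE IF NOT EXISTS appdata
        WITH replication = {
            'class': 'NetworkTopologyStrategy',
            'us_east': 3,
            'us_west': 3
        }
    """)

With three replicas per region, losing an AZ (or even a whole region) just means losing some replicas, not the data.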
(Replying to PARENT post)
It's clearly a very complicated setup, and this kind of post makes me trust them more. Don't get me wrong, an outage is an outage, but knowing that they are in control and take the time to explain shows respect and the right attitude towards a mistake.
Good for them!
(Replying to PARENT post)
(Replying to PARENT post)
> The second group of Multi-AZ instances did not failover automatically because the master database instances were disconnected from their standby for a brief time interval immediately before these master database instances' volumes became stuck. Normally these events are simultaneous. Between the period of time the masters were disconnected from their standbys and the point where volumes became stuck, the masters continued to process transactions without being able to replicate to their standbys.
Can someone explain this? I thought the entire point of synchronous replication was that the master doesn't acknowledge that a transaction is committed until the data reaches the slave. That's how it's described in the RDS FAQ: http://aws.amazon.com/rds/faqs/#36
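My mental model of synchronous commit is basically this (toy code, obviously not based on actual RDS/MySQL internals):

    # Toy model of what I understood "synchronous replication" to mean.
    # Nothing here is based on how RDS actually implements Multi-AZ.

    class ReplicationError(Exception):
        pass

    class Standby:
        def replicate(self, txn, timeout):
            return True                  # pretend the standby acknowledged in time

    class Master:
        def __init__(self, standby, ack_timeout=5.0):
            self.standby = standby
            self.ack_timeout = ack_timeout
            self.log = []

        def commit(self, txn):
            self.log.append(txn)         # apply locally
            # Only acknowledge the commit after the standby confirms it has the
            # data. If the standby is unreachable, fail (or block) rather than
            # let the master silently run ahead of its standby.
            if not self.standby.replicate(txn, timeout=self.ack_timeout):
                raise ReplicationError("standby did not acknowledge commit")
            return "committed"

If commits really work like that, I don't see how a master could keep processing transactions it wasn't able to replicate.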
(Replying to PARENT post)
(Replying to PARENT post)
(Replying to PARENT post)
How many massive outages like this hit Amazon's competitors without us ever hearing about them?
(Replying to PARENT post)
Now I've learned that ELB uses EBS internally. I consider this very bad news, as it means I inadvertently became dependent on EBS. I intend to stop using ELB.
(Replying to PARENT post)
(Replying to PARENT post)
(Replying to PARENT post)
(Replying to PARENT post)
(Replying to PARENT post)
What does it suggest when they say they "learned about new failure modes"? Presumably that there are still more they haven't learned about yet.
One wonders if somewhere internally they have a dynamic model of how all this works. If not, it might be a good time to build one.
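Even a crude one would help: model the services and their dependencies as a graph, knock out one component, and see what else goes dark. A toy sketch (every component and edge below is invented for illustration, not Amazon's real architecture):

    # Crude dependency-graph model: fail one component and compute what else
    # becomes unavailable. All components and edges are invented for the example.
    DEPENDS_ON = {
        "ebs":           ["internal_dns"],
        "ebs_control":   ["ebs", "internal_dns"],
        "rds_single_az": ["ebs"],
        "rds_multi_az":  ["ebs", "ebs_control"],
        "elb":           ["ebs", "ebs_control"],
        "customer_app":  ["elb", "rds_multi_az"],
    }

    def impacted(failed_component):
        """Return every component that transitively depends on the failed one."""
        down = {failed_component}
        changed = True
        while changed:
            changed = False
            for service, deps in DEPENDS_ON.items():
                if service not in down and any(d in down for d in deps):
                    down.add(service)
                    changed = True
        return down

    print(impacted("internal_dns"))   # a single DNS failure takes down the lot

With real dependency data you could at least see these cascades coming before an operator does.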
(Replying to PARENT post)
1. They fucked up an internal DNS change and didn't notice
2. Internal systems on EBS hosts piled up with messages trying to get to the non-existent domain
3. Eventually the messages used up all the memory on the EBS hosts, and thousands of EBS hosts began to die simultaneously (see the sketch below)
4. Meanwhile, panicked operators trying to slow down this tidal wave hit the Throttle Everything button
5. The throttling was so aggressive that even normal levels of operation became impossible
6. The incident was a single AZ, but the throttling was across the whole region, which spread the pain further
[Everybody who got throttled gets a 3-hour refund]
7. Any single-AZ RDS instance on a dead EBS host was fucked
8. Multi-AZ RDS instances ran into two separate bugs, and either became stuck or hit a replication race condition and shut down
[Everybody whose multi-AZ RDS didn't fail over gets 10 days free credit]
9. Single-AZ ELB instances in the broken AZ failed because they use EBS too
10. Because everybody was freaking out and trying to fix their ELBs, the ELB service ran out of IP addresses and locked up
11. Multi-AZ ELB instances took too long to notice EBS was broken and then hit a bug and didn't fail over properly anyway
[ELB users get no refund, which seems harsh]
For those keeping score, that's 1 human error, 2 dependency chains, 3 design flaws, 3 instances of inadequate monitoring, and 5 brand-new internal bugs. From the length and groveling tone of the report, I can only assume that a big chunk of customers are very, VERY angry at them.
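The step 2-3 failure mode is worth dwelling on, because it's so easy to build by accident: an agent that buffers messages in memory and retries forever will eventually take its host down once the destination stops resolving. A toy version (not based on Amazon's actual agent, just the general shape of the bug):

    # Toy version of the step 2-3 failure mode: unbounded in-memory buffering
    # plus an unresolvable destination. Not based on Amazon's actual agent.
    import collections
    import socket

    class ReportingAgent:
        def __init__(self, collector_host):
            self.collector_host = collector_host
            self.pending = collections.deque()       # unbounded -- the core problem

        def report(self, message):
            self.pending.append(message)              # producers never stop producing

        def deliver(self):
            try:
                socket.getaddrinfo(self.collector_host, 443)   # fails after the bad DNS change
            except socket.gaierror:
                return False                          # keep everything queued, try again later
            while self.pending:
                self._send(self.pending.popleft())
            return True

        def _send(self, message):
            pass                                      # stand-in for the real network call

Capping the queue (e.g. collections.deque(maxlen=N)) or dropping data after a few failed attempts would turn this particular failure into some lost metrics instead of dead hosts.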