Behind the scenes: Redpanda Cloud's response to the GCP outage

This is a summary of the main themes from the Hacker News discussion: reactions to the article's claims, GCP's regional architecture compared with AWS's, and commenters' own outage experiences with cloud providers.

Perceived Overstatement of Redpanda's Resilience

Many commenters read the Redpanda article as an overblown account of a lucky escape. In their view it exaggerated Redpanda's resilience: luck, more than design, kept the service from being affected by the GCP outage.

"When I read 'major outage for a large part of the internet was just another normal day for Redpanda Cloud customers', I expected a brave tale of RedPanda SREs valiantly fixing things, or some cool automatic failover tech. What I got instead was: Google told RedPanda there was an issue, RedPanda had a look and their service was unaffected, nothing needed failing over, then someone at RedPanda wrote an article bragging about their triple-nine uptime & fault tolerance." - RadiozRadioz
"Yeah I thought they were going to show something cool like multi-tenant architecture. Odd to write this article when it was clear they expected to be impacted as they were reaching out to customers." - literallyroy
"“We got lucky as the way we designed it happened not to use the part of the service that was degraded”" - bdavbdav
"And we're oblivious enough about that luck that we're patting ourselves on the back in public." - smoyer
"this is a stupid statement from them, hope they will be prepared next time" - Peterpanzeri

However, some commenters pushed back:

"I think you're missing the point. What I took away was that: "Because we design for zero dependencies for full operation, we didn't go down". Their extra features like tiered storage and monitoring going down didn't affect normal operations, which it seems like it did for similar solutions with similar features." - dangoodmanUT
"Why is that stupid? They did get lucky. They are acknowledging that, had they used that, they would have had problems. And now they will work to be more prepared. Acknowledging that one still has risks and that luck plays a factor is important." - mankyd

GCP's Regionality vs. AWS: Perceptions of Dependence and Isolation

A significant portion of the discussion focused on the architectural differences between GCP and AWS, particularly regarding regional independence and the impact on fault tolerance. Many users expressed concerns about GCP's architecture, suggesting that it is more prone to cascading failures across regions due to shared dependencies. AWS was generally seen as having stronger regional isolation.

"In fairness, their design does not seem to be regional. With problems in one region bringing down another, apparently not unrelated, region. With this kind of architecture, this sort of problems is just bound to happen. During my time in AWS, region independence was a must. And some services were able to operate at least for a while without degrading also when some core dependencies were not available. Think like loosing S3. And after that, the service would keep operating, but with a degraded experience. I am stunned that this level of isolation is not common in GCP." - siscia
"AWS regions are fundamentally different from GCP regions. GCP marketing tries really hard to make it seem otherwise, or that GCP has all the advantages of AWS regions plus the advantages of their approach, which means heavily on "effectively global" services. There are tradeoffs, for example multi region in GCP is often trivial and GCP can enforce fairness across regions, but that comes at the cost of availability. Which would be fine - GCP SLA's reflect the fact that they rarely consider regions to be a reliable fault containers, but GCP marketing, IMO, creates a dangerous situation by pretending to be something they aren't." - flaminHotSpeedo
"Everyone does. The difference is AWS very strongly ensures that regions are independent failure domains. The GCP architecture is global with all the pros and cons that implies. e.g GCP has a truly global load balancer while AWS can not since everything is at core regional." - crop_rotation
"Global dependencies were disallowed back in 2018 with a tiny handful of exceptions that were difficult or impossible to make fully regional." - rybosome
"So it's actually really hard to accidentally make cross-region calls, if you're working inside the AWS infrastructure. The call has to happen over the public Internet, and you need a special approval for that." - cyberax

However, some acknowledged the tradeoffs of each approach.

"Generally GCP wants regionality, but because it offers so many higher-level inter-region features, some kind of a global layer is basically inevitable." - rybosome
"I like GCP's approach, where you manage multiple regions with a single identity, but I'm not sure how they can make it resilient to regional failures." - buremba

Outage Experiences and the Reality of "Nines" of Uptime

Several users shared their own outage experiences and their views on how hard high uptime is to achieve, noting that even seemingly well-architected systems can fail. The discussion also touched on the limits of static stability and on the need to add capacity dynamically when an outage shifts load.

"Haha, we used to joke that's how many nines our customer-facing Ruby on Rails services had compared against our resilient five nines payments systems. Our heavy infra handled billions in daily payment volume and couldn't go down. With the Ruby teams, we often playfully quipped, "which nines are those?" humorously implying the leading digit itself wasn't itself a nine." - echelon
"Static stability is a good start, but isn't enough. In this outage, my service (on GCP) had static stability, which was great. However, some other similar services failed, and we got more load, but we couldn't start additional instances to handle the load because of the outage, and so we had overloaded servers and poor service quality. Mayhaps we could have adjusted load across regions to manage instance load, but that's not something we normally do." - toast0
"Years ago I had the misfortune of helping a company recover from an outage....It turned out that they had services in two data centers for redundancy, but they divided their critical services between them. So if either data center went offline, their whole stack was dead." - macintux
"I learned a lesson : "use less cloud"" - beefnugs
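
To put numbers on the "nines" mentioned above and on the two-data-center story, here is a minimal Python sketch. It is not taken from the article or the thread: the function names are hypothetical, and the multi-site formulas assume sites fail independently, which real outages often violate.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, ignoring leap years


def availability_from_nines(nines: int) -> float:
    """Convert a count of nines to a fraction: 3 -> 0.999, 5 -> 0.99999."""
    return 1.0 - 10.0 ** (-nines)


def downtime_minutes_per_year(availability: float) -> float:
    """Downtime budget per year implied by an availability fraction."""
    return (1.0 - availability) * MINUTES_PER_YEAR


def split_across_sites(site_availability: float, sites: int) -> float:
    """Critical services divided between sites: the stack needs every site up.

    Assumes independent site failures (an illustrative simplification).
    """
    return site_availability ** sites


def replicated_across_sites(site_availability: float, sites: int) -> float:
    """Full replica in each site: the stack is up if at least one site is up.

    Assumes independent site failures (an illustrative simplification).
    """
    return 1.0 - (1.0 - site_availability) ** sites


if __name__ == "__main__":
    for n in (3, 5):
        budget = downtime_minutes_per_year(availability_from_nines(n))
        print(f"{n} nines: about {budget:.1f} minutes of downtime per year")

    site = availability_from_nines(3)
    print(f"split across 2 sites:      {split_across_sites(site, 2):.6f}")
    print(f"replicated across 2 sites: {replicated_across_sites(site, 2):.6f}")
```

Under these assumptions, three nines allows about 525.6 minutes of downtime per year and five nines about 5.3 minutes. Dividing critical services between two three-nines sites yields a stack that is less available than either site alone, while full replication across both would be more available.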

Frustration with Cloud Provider Response to Outages

A more cynical thread concerns how cloud providers respond to outages, in particular the tendency to add more internal process instead of making fundamental changes.

"they were tired of big outages years ago...One could hope that they'd realize whatever red tape they've been putting up so far hasn't helped, and so more of it probably wont either. If what you're doing isn't having an effect you need to do something different, not just more." - delusional
"They’ll do more of the same. The leads are clueless and sensible voices of criticism are deftly squashed." - kubb