Coinbase Postmortem Reveals How a Localized AWS Failure Triggered a Multi-Hour Trading Outage

Coinbase has published a detailed postmortem of its May 7, 2026, outage, revealing how a localized cooling failure inside an AWS data center escalated into a multi-hour disruption that halted nearly all trading activity across the cryptocurrency exchange. While the initial incident originated from an AWS thermal event in a single availability zone, Coinbase's investigation found that architectural dependencies within its own systems, including a matching engine tightly coupled to the affected zone and cascading messaging infrastructure failures, significantly prolonged recovery efforts.

The outage began when multiple cooling units failed simultaneously in an AWS data hall within the us-east-1 region, forcing thermal shutdowns of the affected racks and taking EC2 instances and EBS volumes offline. Coinbase customers were unable to buy, sell, deposit, withdraw, or transfer assets for several hours, while institutional clients experienced widespread disruption to order routing and exchange services. Full recovery took much of the following day, with trading restored incrementally through cancel-only and auction modes before normal operations resumed.

According to Coinbase, the most significant factor delaying recovery was the design of its exchange matching engine. To achieve the ultra-low latency required for high-frequency trading, the system operates as a Raft-based cluster within a single AWS Cluster Placement Group. This architecture intentionally co-locates nodes to minimize network latency between consensus members. However, Raft can only commit new entries while a majority of its members are reachable, which for a five-node cluster means at least three. When the AWS outage took down three of the cluster's five nodes, only two remained, so the system lost quorum and could no longer process trades.
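The quorum arithmetic behind that failure is simple enough to write down. The following is a minimal Python sketch, not Coinbase's code; the function names `majority` and `has_quorum` are illustrative assumptions, and only the five-node cluster size and the three failed nodes come from the postmortem as reported here.

```python
def majority(cluster_size: int) -> int:
    """Smallest number of voting members that forms a strict majority."""
    return cluster_size // 2 + 1


def has_quorum(cluster_size: int, healthy: int) -> bool:
    """True if the healthy members can still elect a leader and commit entries."""
    return healthy >= majority(cluster_size)


if __name__ == "__main__":
    cluster_size = 5   # from the article: a five-node Raft cluster
    failed = 3         # from the article: three nodes lost in the thermal event
    healthy = cluster_size - failed

    print("votes needed:", majority(cluster_size))
    print("healthy nodes:", healthy)
    print("quorum held:", has_quorum(cluster_size, healthy))
```

A five-node cluster tolerates two failures, not three, which is why losing one more node than the design budgeted for stopped the engine entirely rather than degrading it.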

The company acknowledged that while the architecture optimized performance, it lacked an automated mechanism for failover to another availability zone. Recovery required emergency code changes, manual cluster reconstruction, and careful restoration of quorum before trading could safely resume. The incident exposed a classic engineering trade-off: optimizing for latency and performance can sometimes come at the expense of resilience during rare infrastructure failures.

Coinbase's postmortem also identified a separate issue involving its event-streaming infrastructure. Kafka workloads responsible for distributing operational data became stranded in the impaired availability zone, creating significant backlogs and delaying service restoration even as core trading systems were recovering. Engineers ultimately had to manually migrate partitions and rebalance workloads to restore normal data flow across the platform.
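To make the idea of "stranded" partitions concrete, here is a hedged Python sketch of how one might find partitions whose replicas all sit in an impaired zone and compute a target placement on healthy brokers. Everything in it is an assumption for illustration: the broker-to-zone map, the topic names, and both helper functions are hypothetical, and the article does not describe Coinbase's actual tooling. The output dictionary follows the JSON layout accepted by Kafka's partition reassignment tool, but the sketch only computes a plan; it does not talk to a cluster or address recovering data from replicas that are offline.

```python
import json
from typing import Dict, List, Tuple

TopicPartition = Tuple[str, int]

# Hypothetical inventory: broker id -> availability zone.
BROKER_AZ: Dict[int, str] = {1: "az-a", 2: "az-a", 3: "az-b", 4: "az-c"}

# Hypothetical replica assignment: (topic, partition) -> broker ids.
ASSIGNMENT: Dict[TopicPartition, List[int]] = {
    ("orders", 0): [1, 2],   # every replica in az-a
    ("orders", 1): [1, 3],   # one replica in az-a
    ("fills", 0): [3, 4],    # no replica in az-a
}


def stranded_partitions(assignment, broker_az, impaired_az) -> List[TopicPartition]:
    """Partitions whose every replica lives in the impaired zone."""
    return sorted(
        tp for tp, replicas in assignment.items()
        if all(broker_az[b] == impaired_az for b in replicas)
    )


def reassignment_plan(assignment, broker_az, impaired_az) -> dict:
    """Target placement that moves replicas off the impaired zone."""
    healthy = sorted(b for b, az in broker_az.items() if az != impaired_az)
    plan = []
    for (topic, partition), replicas in sorted(assignment.items()):
        keep = [b for b in replicas if broker_az[b] != impaired_az]
        if len(keep) == len(replicas):
            continue  # nothing to move for this partition
        spare = [b for b in healthy if b not in keep]
        # Preserve the replication factor where enough healthy brokers exist.
        new_replicas = keep + spare[: len(replicas) - len(keep)]
        plan.append({"topic": topic, "partition": partition, "replicas": new_replicas})
    return {"version": 1, "partitions": plan}


if __name__ == "__main__":
    print(stranded_partitions(ASSIGNMENT, BROKER_AZ, "az-a"))
    print(json.dumps(reassignment_plan(ASSIGNMENT, BROKER_AZ, "az-a"), indent=2))
```

In this toy data only `orders` partition 0 is fully stranded, while `orders` partition 1 still has a live replica and can keep serving; the distinction matters because the two cases call for different recovery steps.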

The combination of the matching-engine failure and messaging backlog transformed what began as a localized cloud infrastructure issue into a platform-wide outage. Coinbase noted that either issue independently would have been manageable, but together they created a recovery process far more complex than anticipated.

The outage has reignited discussion around cloud concentration risk and the operational realities of building critical financial services on hyperscale infrastructure. Although AWS regions are designed around multiple availability zones, the Coinbase incident demonstrates how applications can still develop hidden dependencies on specific locations, particularly when performance requirements encourage tightly coupled architectures. The same AWS cooling failure also affected other major platforms and services operating in the region.
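One way to surface this kind of hidden location dependency is to audit a consensus group's placement and ask which single-zone losses would cost it quorum. The Python sketch below assumes a simple node-to-zone mapping; the function name, node names, and zone labels are hypothetical, and the spread layout is shown only for contrast, since the article reports that Coinbase deliberately kept its nodes together for latency.

```python
from collections import Counter
from typing import Dict, List


def zones_that_break_quorum(node_zone: Dict[str, str]) -> List[str]:
    """Zones whose complete loss would leave fewer than a majority of nodes."""
    total = len(node_zone)
    needed = total // 2 + 1
    per_zone = Counter(node_zone.values())
    return sorted(
        zone for zone, count in per_zone.items()
        if total - count < needed
    )


if __name__ == "__main__":
    # Hypothetical layout with all five members in one zone.
    single_zone = {f"n{i}": "az-a" for i in range(1, 6)}

    # Hypothetical 2/2/1 layout across three zones.
    spread = {"n1": "az-a", "n2": "az-a", "n3": "az-b", "n4": "az-b", "n5": "az-c"}

    print(zones_that_break_quorum(single_zone))
    print(zones_that_break_quorum(spread))
```

The spread layout survives the loss of any one zone, but every commit then pays cross-zone network latency, which is the trade-off the matching engine's design was trying to avoid.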

Industry observers noted that the incident highlights a growing challenge for cloud-native organizations: simply deploying across a cloud provider's infrastructure does not automatically guarantee resilience. System architecture, workload placement, failover automation, and operational assumptions often play a larger role in determining real-world availability than the underlying cloud platform itself.

Coinbase's experience echoes recent outages and engineering postmortems from other large-scale technology companies. GitHub has emphasized the importance of eliminating hidden infrastructure assumptions after several availability incidents exposed unexpected dependencies between systems. Discord's recent work automating ScyllaDB operations similarly focused on reducing recovery complexity and minimizing the impact of infrastructure failures through orchestration and automation. Meanwhile, Netflix has invested heavily in resilience engineering and workload isolation after discovering that infrastructure failures often emerge from subtle architectural coupling rather than single points of failure.

The common thread across these incidents is that modern distributed systems rarely fail because a single component breaks. Instead, outages occur when multiple individually manageable failures interact in unexpected ways. Coinbase's postmortem reinforces this lesson: the AWS cooling failure was the trigger, but the duration and impact of the outage were ultimately shaped by architectural assumptions that had never previously been tested under real-world failure conditions.

In response, Coinbase has outlined several remediation efforts, including automated cross-zone recovery capabilities for its matching engine, improved quorum restoration procedures, more resilient messaging infrastructure, and expanded disaster recovery testing. The company emphasized that while preventing outages remains important, accelerating recovery from inevitable failures is equally critical.
