The Pragmatic Engineer · 5 min read · 2 days ago

Reliability fail: No automated zone failover for Coinbase’s global trading service

On May 7, 2026, Coinbase suffered a nearly 10-hour global trading outage triggered by a regional AWS disruption. The root cause was that Coinbase's matching engine ran in a single AWS Cluster Placement Group, and therefore in a single availability zone, with no automated cross-zone failover. Recovery required an emergency code change and manual quorum restoration. The author criticizes this as amateurish for a $40B company processing $5.2 trillion annually, and compares it unfavorably with Uber's multi-region failover drills from a decade earlier. The piece also notes that Coinbase suffered a similar AWS-related outage in October 2025 and pledged to review its regional deployment strategy; that review apparently failed to address the single-zone dependency.
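
To make the single-zone dependency concrete, here is a minimal sketch of a majority-quorum check across availability zones. It is not Coinbase's implementation: the `Replica` type, the zone names, and the helper functions are hypothetical and exist only for illustration. It shows why a cluster confined to one zone cannot keep quorum when that zone fails, while the same number of replicas spread over three zones can.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Replica:
    # Hypothetical model of one matching-engine replica.
    name: str
    zone: str
    healthy: bool


def has_quorum(replicas):
    """Return True when a strict majority of all replicas is healthy."""
    healthy = sum(1 for r in replicas if r.healthy)
    return healthy > len(replicas) // 2


def simulate_zone_outage(replicas, failed_zone):
    """Return a copy of the cluster with every replica in failed_zone marked unhealthy."""
    return [
        Replica(r.name, r.zone, r.healthy and r.zone != failed_zone)
        for r in replicas
    ]


def pick_failover_zone(replicas, failed_zone):
    """Return the surviving zone with the most healthy replicas, or None if there is none."""
    counts = {}
    for r in replicas:
        if r.healthy and r.zone != failed_zone:
            counts[r.zone] = counts.get(r.zone, 0) + 1
    if not counts:
        return None
    # Highest count wins; ties are broken by zone name so the choice is deterministic.
    return min(counts, key=lambda z: (-counts[z], z))


# Assumed layouts: three replicas in one zone versus one replica in each of three zones.
single_zone = [Replica(f"n{i}", "zone-a", True) for i in range(3)]
multi_zone = [
    Replica("n0", "zone-a", True),
    Replica("n1", "zone-b", True),
    Replica("n2", "zone-c", True),
]

single_after = simulate_zone_outage(single_zone, "zone-a")
multi_after = simulate_zone_outage(multi_zone, "zone-a")
```

In this sketch, losing `zone-a` leaves the single-zone layout without quorum and without any failover target, which is the situation where an operator has to restore quorum by hand. The three-zone layout still has two of three replicas healthy, so an automated controller could promote a surviving zone without human intervention.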


#aws #distributed-systems

Source: https://blog.pragmaticengineer.com/coinbase-fail. 8sync News only summarizes and links to articles; copyright in the content belongs to the author and the original source.

