Amazon Web Services

In this AWS re:Invent 2023 session, Jim Roskind and Ankit Chadha examine congestion collapse in large distributed systems and how to prevent it. Using real-world examples, including the 2018 Amazon Prime Day incident, they show how a system can reach 100% CPU utilization while doing zero productive work. They then cover strategies for avoiding congestion collapse, such as implementing proper retry mechanisms, throttling upstream traffic, and using AWS services like CloudWatch, WAF, and SQS. The session also covers testing methodologies, including crash testing and chaos engineering principles, to proactively identify and mitigate potential issues. The talk is aimed at developers and system architects who want to build more resilient and efficient distributed systems on AWS.
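The description does not say how the speakers implement retries, so the following is only a minimal sketch of one widely used approach: a bounded number of attempts with capped exponential backoff and full jitter. The function names (`backoff_delays`, `call_with_retries`) and the default values are assumptions made for illustration, not anything taken from the session.

```python
import random
import time


def backoff_delays(max_attempts, base=0.1, cap=5.0, rng=random.random):
    """Yield one sleep duration per retry (max_attempts - 1 in total)."""
    for attempt in range(max_attempts - 1):
        # Full jitter: scale a random factor in [0, 1) by the capped
        # exponential ceiling so that clients do not retry in lockstep.
        yield rng() * min(cap, base * (2 ** attempt))


def call_with_retries(operation, max_attempts=4, sleep=time.sleep, **backoff):
    """Call operation(); on failure, wait a jittered delay and try again."""
    delays = backoff_delays(max_attempts, **backoff)
    while True:
        try:
            return operation()
        except Exception:
            delay = next(delays, None)
            if delay is None:
                # Attempts exhausted: give up rather than keep adding load.
                raise
            sleep(delay)
```

Bounding the attempt count matters as much as the delay: unbounded retries multiply the load on a service that is already struggling, which is the kind of feedback loop that can lead to congestion collapse.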

cloud-trends-and-knowledge
skills-and-how-to
resilience
mgmt-govern
networking