Resilience | AWS Architecture Blog

Build resilient generative AI agents

Generative AI agents in production environments demand resilience strategies that go beyond traditional software patterns. AI agents make autonomous decisions, consume substantial computational resources, and interact with external systems in unpredictable ways. These characteristics create failure modes that conventional resilience approaches might not address. This post presents a framework for AI agent resilience risk analysis […]

AWS Fault Injection Service AZ Availability: Power Interruption scenario

Improving platform resilience at Cash App

Cash App, a leading peer-to-peer payments and digital wallet service from Block, Inc., has implemented resilience improvements across the entire technology stack. In this post, we discuss how Cash App improved the resilience of its compute platform built on Amazon Elastic Kubernetes Service (Amazon EKS) by implementing a dual-cluster topology to reduce single points of failure. We also discuss how Cash App used AWS Fault Injection Service (AWS FIS) to conduct an Availability Zone power interruption scenario in non-production environments, preparing the platform team for real-world failures and ongoing

Implement event-driven invoice processing for resilient financial monitoring at scale

This post demonstrates how to build a Business Event Monitoring System (BEMS) on AWS that handles over 86 million daily events with near real-time visibility, cross-Region controls, and automated alerts for stuck events. You might deploy this system for business-level insights into how events are flowing through your organization or to visualize the flow of transactions in real time. Downstream services also will have the option to process and respond to events originating within the system or not.

spectrum of disaster recovery strategies

Pilot light with reserved capacity: How to optimize DR cost using On-Demand Capacity Reservations

In this post, we explore an intermediate strategy between the pilot light and the warm standby strategies: pilot light with reserved capacity. You can use this strategy to reserve compute capacity in a secondary Region while also limiting cost.

Enhance the resilience of critical workloads by architecting with multiple AWS Regions

In this post, we will share how you can use multi-Region as an architectural approach to achieve higher resilience on Amazon Web Services (AWS). This approach relies on first operating a workload across multiple Availability Zones within an AWS Region, before expanding to achieve even higher resilience by using multiple Regions.

Know before you go – AWS re:Invent 2024 cloud resilience

If you’re attending AWS re:Invent 2024 with the goal of improving your organization’s cloud resilience operations, we will be offering valuable insights, best practices, and fun activities to improve your cloud resilience expertise. This year, we’re offering more than 100 resilience breakout sessions, workshops, chalk talks, builders’ sessions, and code talks. Find the complete list in the re:Invent 2024 session catalog and filter by “Resilience” in the area of interest field. In this post, we highlight must-see sessions for those building resilient applications and architectures on AWS. Reserved seating is now open, so act quickly to claim your seat. Be sure to also check out the vertical-specific re:Invent guides.

When the application experiences an impairment using S3 resources in the primary Region, it fails over to use an S3 bucket in the secondary Region.

Creating an organizational multi-Region failover strategy

AWS Regions provide fault isolation boundaries that prevent correlated failure and contain the impact from AWS service impairments to a single Region when they occur. You can use these fault boundaries to build multi-Region applications that consist of independent, fault-isolated replicas in each Region that limit shared fate scenarios. This allows you to build multi-Region […]

Chaos engineering pattern for hybrid architecture (3-tier application)

London Stock Exchange Group uses chaos engineering on AWS to improve resilience

This post was co-written with Luke Sudgen, Lead DevOps Engineer Post Trade, and Padraig Murphy, Solutions Architect Post Trade, from London Stock Exchange Group. In this post, we’ll discuss some failure scenarios that were tested by London Stock Exchange Group (LSEG) Post Trade Technology teams during a chaos engineering event supported by AWS. Chaos engineering […]

Improving Performance and Reducing Cost Using Availability Zone Affinity

April 2025: This blog post has been updated to include new features in Elastic Load Balancing (ELB). One of the best practices for building resilient systems in Amazon Virtual Private Cloud (VPC) networks is using multiple Availability Zones (AZ). An AZ is one or more discrete data centers with redundant power, networking, and connectivity. Using […]

Figure 1. Current Architecture with improved resiliency and standardized observability

Journey to Adopt Cloud-Native Architecture Series: #3 – Improved Resilience and Standardized Observability

September 8, 2021: Amazon Elasticsearch Service has been renamed to Amazon OpenSearch Service. See details. In the last blog, Maximizing System Throughput, we talked about design patterns you can adopt to address immediate scaling challenges to provide a better customer experience. In this blog, we talk about architecture patterns to improve system resiliency, why observability […]

AWS Architecture Blog

Category: Resilience