AWS Cloud Resilience resources
Build and run resilient, highly available applications in the AWS cloud
Whitepapers
Resilience Lifecycle Framework
Multi-Region Fundamentals
Advanced Multi-AZ Resilience Patterns
Using AWS Fault Isolation Boundaries
Blogs
Resilience Best Practices
Four things everyone should know about resilience
New to resilience? Read this blog to learn about the top four most important concepts to get you started on your journey to building resilient applications in the cloud.
Strengthen application resilience with myApplications and AWS Resilience Hub
Resilience Hub is now integrated into myApplications in AWS Console Home, so you can manage and improve your application’s resilience alongside other essential metrics.
High Availability Patterns
Enhance the resilience of critical workloads by architecting with multiple AWS Regions
A multi-Region approach is a reliable way to achieve a bounded recovery time for critical applications in the rare event that a service failure in a Region impacts your application.
Using zonal shift with Amazon EC2 Auto Scaling
Learn how performing an Auto Scaling group (ASG) zonal shift fits into a multi-AZ resilience strategy, and what to consider when using the feature with different architectures.
Rapidly recover from application failures in a single AZ
Performing a zonal shift with Amazon Route 53 Application Recovery Controller enables you to achieve rapid recovery from application failures in a single Availability Zone (AZ).
Automating safe, hands-off deployments
Learn how Amazon automatically validates and safely deploys any type of source change to production, and how you can apply this strategy to your work.
Reliability, constant work, and a good cup of coffee
Learn about building simple, scalable, resilient systems using a clever coffee analogy and AWS services such as Amazon Route 53 and Amazon S3.
Making retries safe with idempotent APIs
Learn strategies for using idempotent APIs to reduce complexity and manage retries.
Choosing the right health check with Elastic Load Balancing and EC2 Auto Scaling
Customers frequently use Elastic Load Balancing (ELB) load balancers and Amazon EC2 Auto Scaling groups (ASG) to build scalable, resilient workloads.
Disaster Recovery
Designing for multi-account scenarios using AWS Elastic Disaster Recovery
AWS Elastic Disaster Recovery offers multi-account capabilities to meet governance, security, and operational requirements.
Enhance business continuity within an Availability Zone using AWS Elastic Disaster Recovery
In some situations you might need to run your workloads in a single AZ. With AWS Elastic Disaster Recovery, you can continuously replicate data from your primary AZ to a secondary AZ and recover your applications during both planned and unplanned outages.
Series: Disaster recovery (DR) architecture on AWS
This four-part series shares best practices for disaster recovery across four strategies: backup and restore, pilot light, warm standby, and multi-site active/active.
Creating disaster recovery mechanisms using Amazon Route 53
Modern DNS services, like Amazon Route 53, offer health checks and failover records that you can use to simplify and strengthen your DR plan.
Chaos Engineering
Any day can be Prime Day: How Amazon.com search uses chaos engineering to handle over 84K requests per second
Discover how Amazon Search combines technology and culture to empower its builder teams, ensuring platform resilience through Chaos Engineering.
Bootstrap your chaos engineering journey with AWS Fault Injection Service Scenarios Library
Learn how the AWS Fault Injection Service Scenario Library can make your chaos engineering journey easier.
DORA scenario testing with AWS Fault Injection Service
Learn how you can use AWS Fault Injection Service (FIS) to support the DORA requirements around scenario-based testing through a structured, iterative process: identifying failure scenarios, planning and executing chaos engineering experiments, reporting on the results, and using what you learn to improve operational resilience.
Introducing AWS Fault Injection Service Actions to Inject Chaos in Lambda functions
By purposefully injecting failures and stresses into serverless components, you can uncover hidden weaknesses and validate the fault tolerance of your systems.