AWS Cloud Operations Blog
Guide to AWS Cloud Resilience sessions at re:Invent 2025
If you’re attending AWS re:Invent with the goal of learning how to prevent costly downtime for your organization, you can look forward to more than 150 breakout sessions, workshops, chalk talks, builders’ sessions, and code talks that will help you improve the resilience of your critical applications. Find the complete list in the re:Invent 2025 event catalog and filter to “Resilience” in the area of interest field. In this post, we highlight a few of these must-see sessions. Our recommendations are divided into three topics to help you choose the sessions most relevant to your business: (1) AWS innovations and best practices, (2) Build and operate resilient applications, and (3) Cultivate a culture of resilience. Reserved seating is now open, so act quickly to claim yours.
AWS innovations and best practices
Discover the cutting-edge innovations from AWS that help us deliver the most reliable cloud infrastructure for your applications. Learn about AWS fault isolation boundaries—including Availability Zones (AZs) and Regions—and how to use them to improve the resilience of your applications. And get a peek under the hood as we share our time-tested operational practices and key lessons learned from more than twenty years of operating highly available services at scale.
Breakout sessions
From ideas to impact: Architecting with cloud best practices (ARC204)
2025 marks the 10-year anniversary of the AWS Well-Architected Framework, Cloud Adoption Framework, and AWS Cloud Operating Model. Learn how these foundational frameworks evolved through customer feedback and real-world learnings from thousands of organizations. What began as structured guidance has matured into dynamic insights for optimizing cloud environments. Discover how this continuous feedback drives innovation across architecture reviews, operations, and remediation. Learn practical strategies for applying these unified best practices to accelerate cloud transformation.
Building on AWS resilience: Innovations for critical success (ARC207)
Essential services that power global economies and critical infrastructure demand exceptional resilience. Through nearly two decades of focused innovation, AWS has developed core engineering practices and operational approaches that power critical workloads worldwide. Explore how AWS’s architectural innovations and organizational practices help customers build robust services that maintain resilience during severe disruptions. Learn how AWS’s continued investment in resilience provides the foundation for delivering essential services across governments, economies, and critical infrastructure.
Chalk talks & code talks
Building resilient clients: Architecture patterns from Amazon.com (ARC331)
Discover architectural patterns for building resilient frontend applications, drawn from Amazon.com’s production experience at massive scale. Learn how Amazon.com architects frontend systems to maintain reliability during peak events through fault injection testing, caching strategies, and graceful degradation patterns. Explore implementations of circuit breakers, deployment safety, and comprehensive monitoring for operational excellence. This technical session provides practical patterns for architecting robust client applications that scale, aligned with AWS Well-Architected Framework principles.
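To make the circuit breaker pattern mentioned above concrete, here is a minimal sketch in Python. It is not taken from the session or from Amazon.com's implementation; the class name `CircuitBreaker`, its parameters, and the `fallback` hook are assumptions chosen for illustration. It shows the basic idea: after repeated failures the client stops calling the dependency for a cool-down period and serves a degraded response instead.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker is open and no fallback was supplied."""


class CircuitBreaker:
    """Hypothetical minimal circuit breaker (illustrative names and defaults)."""

    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold  # consecutive failures before opening
        self.reset_timeout = reset_timeout          # seconds to wait before a trial call
        self.clock = clock                          # injectable clock, useful in tests
        self.failures = 0
        self.opened_at = None                       # None means the circuit is closed

    def call(self, func, *args, fallback=None, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                # Open: skip the dependency and degrade gracefully.
                if fallback is not None:
                    return fallback()
                raise CircuitOpenError("circuit is open")
            # Cool-down elapsed (half-open): let one trial call through.
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            # A failed trial call, or too many failures, (re)opens the circuit.
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()
            if fallback is not None:
                return fallback()
            raise
        # Success closes the circuit and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

A caller might wrap a remote request as `breaker.call(fetch_recommendations, fallback=lambda: cached_recommendations)`, where both names are placeholders for your own functions. The fallback is what turns a hard failure into graceful degradation.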
Defend against downtime using fault isolation boundaries (COP305)
Building an application to recover from common failure modes aligned to AWS fault isolation boundaries can help you meet your availability goals. In this chalk talk, we’ll share how to use Application Recovery Controller (ARC) to recover your applications from impairments within an AWS Availability Zone and an AWS Region. You’ll leave with an understanding of how ARC works, as well as key takeaways to incorporate into your architecture.
Resilience testing and AWS Lambda actions under the hood (COP414)
As the use of serverless technology grows, resilience testing (also known as chaos engineering) becomes even more crucial for keeping serverless applications reliable and available. Join us as we demo new capabilities for testing the resilience of AWS Lambda-based workloads, and unpack how these faults were built and run under the hood. You’ll also leave with valuable lessons gleaned from our customers’ experiences with modern serverless applications.
Build and operate resilient applications
Explore strategies for maximizing your applications’ resilience across single-AZ, multi-AZ, and multi-Region architectures. Discover effective techniques for using automated recovery mechanisms to minimize downtime and strategies to quickly recover from disruptions. And get valuable practical guidance on meeting regulatory compliance requirements to ensure your applications align with industry standards and regulations.
Breakout sessions
Building resilient multi-Region applications with Capital One (ARC404)
Organizations face significant challenges achieving predictable recovery times and maintaining consistency in multi-Region applications at scale. Learn how to use Application Recovery Controller (ARC), Aurora DSQL, and DynamoDB Multi-Region Strong Consistency to create resilient architectures with bounded recovery targets. Through real-world implementation patterns, discover how ARC Region switch controls and AWS Fault Injection Service transform maintenance and testing approaches. This expert-level session provides practical strategies for architecting multi-Region applications that deliver predictable recovery and consistent operations.
Multi-Region disaster recovery & resilience testing (feat. Fidelity) (COP358)
Explore how AWS innovations are revolutionizing disaster recovery (DR) strategies for enterprise-scale organizations. Managing thousands of applications across AWS Regions demands sophisticated DR capabilities that have historically required complex and resource-intensive custom development. Learn how Fidelity transformed DR for 8,500 mission-critical applications by using the multi-Region recovery orchestration, live dashboards, and reporting capabilities of Amazon Application Recovery Controller’s Region switch feature. Combined with AWS Fault Injection Service, Fidelity also validates their recovery procedures under realistic conditions to bolster confidence in their DR plan. Discover how AWS enables enterprises to modernize their infrastructure operations, improve compliance, and enhance business continuity for mission-critical applications.
Architecting resilient multicloud operations, feat. Monzo Bank (HMC201)
When organizations choose a multicloud strategy to address their resilience needs, they often face challenges in areas such as data consistency, service isolation, and long-term testing and maintenance. Join this session to learn about Monzo Bank’s resilience journey in implementing a strategic multicloud architecture that offers a practical and efficient approach to operational resilience. We’ll dive deep into Monzo’s Stand-in Platform, which runs critical banking services on a different cloud provider alongside their primary AWS infrastructure. You’ll learn practical patterns for maintaining service availability, managing data consistency trade-offs, and implementing resilient multicloud architectures.
Cyber resilience on AWS, designing security and recovery strategies (GBL204)
Cyber resilience is the ability of an organization to continuously deliver the intended outcome despite adverse cyber events. It overlaps with disaster recovery: both involve plans and actions to restore normal operations after an incident, and cyber resilience includes disaster recovery as part of its broader strategy. There are multiple key themes when it comes to designing for cyber resilience, including Protect: proactive measures taken to safeguard systems, networks, and data.
Chalk talks & builders’ sessions
Architecting multi-Region expansion for mission-critical workloads (ARC322)
Expanding mission-critical applications across AWS Regions demands meticulous architectural planning, especially with strict SLA requirements. This chalk talk explores key design considerations for multi-Region expansion: evaluating service availability, implementing secure cross-Region connectivity, and ensuring reliable operations. Through collaborative scenario-based exercises, learn to map out evaluation methods, network patterns, and operational procedures. Gain actionable architectural techniques for regional expansion projects that maintain high availability and performance for mission-critical workloads.
Cell-based architectures: From connected vehicles to enterprise systems (ARC327)
Connected vehicle platforms demonstrate how cell-based architectures solve challenges of massive device access, data surges, and latency-sensitive workloads. Learn how this architectural pattern extends beyond automotive to transform smart cameras, monitoring systems, solar management, and SaaS platforms. Through practical examples using AWS IoT Core, Amazon MSK, Amazon EKS with Graviton, Amazon Aurora, and Amazon MemoryDB for Redis, discover how to implement fault isolation, scalable deployments, and localized edge services. This session provides architectural patterns for building resilient connected systems that maintain performance at scale across diverse industries.
A practical guide for meeting regulatory resilience requirements (COP210)
Organizations worldwide must demonstrate their operational resilience to meet regulatory requirements like DORA, NIS2, and Reg SCI. These regulations are aimed at ensuring organizations have incident detection and disaster recovery plans to prevent disruptions and maintain business continuity. In this chalk talk, you’ll learn how to use AWS services to assess and prove your compliance in regulated industries like financial services and healthcare. We’ll explore practical applications of the D-CAT tool, AWS Fault Injection Service experiment reports, resilience assessments in AWS Resilience Hub, and live dashboards in Amazon Application Recovery Controller to help you evaluate and document your regulatory readiness.
AWS disaster recovery strategies (COP302)
Prepare for the unexpected in this disaster recovery (DR) builders’ session. We’ll work on an application to implement the DR strategy that aligns to the recovery objectives of your business: backup and restore, pilot light, warm standby, or AWS Elastic Disaster Recovery (AWS DRS). We’ll cover services such as Amazon Aurora, Amazon S3, Amazon EC2, Amazon CloudFront, AWS DRS, AWS Fault Injection Service, and AWS Backup. We will also explore ways to test and validate your chosen approach. You’ll leave with practical insights for building the right DR strategy for your business.
Financial services multi-Region design patterns and best practices (IND317)
Explore proven architectural patterns and design principles for building resilient multi-Region deployments for financial services on AWS. Gain practical insights into leveraging specialized AWS services like Amazon Application Recovery Controller, Amazon Aurora DSQL, and Amazon DynamoDB Multi-Region strong consistency to build robust global solutions. Walk away with a comprehensive understanding of the tradeoffs involved in multi-Region deployments and the ability to make informed architectural decisions that balance reliability, performance, and cost for your organization’s specific requirements.
Workshops
Building and testing resilient multi-AZ applications (ARC304)
Gain hands-on experience building and testing resilient multi-AZ applications. Learn to use Amazon CloudWatch dashboards, insights rules, and composite alarms for comprehensive health monitoring. Practice injecting randomized faults with AWS Fault Injection Service to simulate various single-AZ impairments. Master zonal deployments using AWS CodeDeploy and experience realistic failure scenarios. Explore Amazon Application Recovery Controller’s zonal shift capabilities to recover from failures and maintain customer experience. This workshop provides practical skills for architecting and operating highly available systems on AWS.
From downtime to uptime: Mastering application recovery on AWS (ARC307)
Master the latest features of Amazon Application Recovery Controller (ARC) for enhancing application resilience on AWS. Through hands-on exercises, learn to implement automated recovery workflows, test recovery plans, and monitor recovery operations at scale. Build practical skills in architecting and managing recovery solutions that align with enterprise resilience requirements. This workshop provides cloud architects and DevOps engineers with proven patterns for ensuring business continuity through sophisticated recovery architectures.
Building resilient architectures with observability (COP408)
When critical systems fail, every minute of downtime costs money and trust. Transform an application into a resilient, observable system in this hands-on workshop. Enhance resilience through chaos engineering by injecting faults with AWS Fault Injection Service to simulate Availability Zone failures, network issues, and deployment problems. Learn to leverage Amazon CloudWatch and Amazon Application Recovery Controller to detect, diagnose, and automatically recover from failures. Leave with practical experience in building applications that remain observable and resilient under real-world conditions.
Cultivate a culture of resilience
Learn how to integrate resilience earlier in your development cycle through operational readiness reviews, resilience testing, root cause analyses, and game day scenarios. Explore effective safe deployment practices and techniques for building robust observability strategies to help you prevent costly downtime.
Breakout sessions
Mastering Root Cause Analysis: Rebuilding trust after outages (ARC211)
Investigating outages is tough, but effectively explaining them to customers is an even bigger challenge. Root Cause Analysis (RCA) documents often represent the single opportunity to rebuild trust by demonstrating understanding, establishing ownership, and presenting a plan to address shortfalls. Drawing from more than a decade of experience in crafting effective RCAs, learn practical strategies for navigating complexity, bespoke software, and internal jargon while maintaining transparency. Whether you are an ISV or SaaS provider, discover techniques to create insightful RCAs that explain what happened, why it happened, and outline solid remediation plans.
The incident is over: Now what? (COP216)
Optimal operational practice defines how to handle inevitable incidents and recover quickly. What about the aftermath? How do we ensure that the true root cause is tracked down and that effective preventive actions are planned for implementation? How do we turn every incident into an organization-wide learning opportunity? How do the shared responsibility model and third-party software vendors come into play? We’ll share our mental models and decades-long experience around Root Cause Analysis and Correction of Error (COE), so you can drive an effective practice in your own organization.
Chalk talks, code talks, & builders’ sessions
Operational excellence: Building resilient systems (ARC316)
This chalk talk explores the critical relationship between operational practices and system resilience. Examine how fundamental elements like logging, health checks, and deployment strategies impact application reliability on AWS. Through real-world scenarios, discover common operational pitfalls that compromise system availability and learn practical solutions aligned with Well-Architected principles. Learn proven approaches to strengthen architecture resilience and elevate operational excellence. This interactive session provides architects and operators with actionable patterns for building and maintaining robust cloud systems.
Agent down! Building unbreakable AI workflows (COP321)
Explore how chaos engineering (resilience testing) principles apply to autonomous AI agent workflows in this chalk talk. Learn to stress-test agent-based systems that handle complex, multi-step tasks using AWS Fault Injection Service. Discover how to identify and mitigate failure modes unique to agentic AI, including decision-making loops, task handoff failures, and resource coordination breakdowns. Practice designing experiments that validate agent resilience across orchestration layers, memory systems, and tool interactions. Perfect for teams building or maintaining autonomous AI workflows, this chalk talk provides practical techniques for improving resilience in agent-driven architectures.
Downtime prevention with the Resilience Lifecycle Framework (COP357)
With most system failures stemming from human error, code deployment issues, and misconfigured systems, it’s important to have a framework in place to proactively mitigate risks, practice resilience plans, and prevent the recurrence of operational incidents. In this chalk talk, you’ll learn how to apply the AWS Resilience Lifecycle Framework, a holistic approach that captures resilience best practices from years of working with customers and internal teams. You’ll leave with practical strategies for setting objectives, designing for resilience, resilience testing, conducting operational readiness reviews, and incident analysis reporting to strengthen the resilience posture of your critical workloads.
Build resilient SaaS: Multi-account resilience testing patterns (ISV404)
Enterprise SaaS providers face increasing pressure to maintain high availability while preventing failures from spreading across tenant boundaries. Learn how leading ISVs use AWS Fault Injection Service to validate their multi-tenant architecture’s resilience through controlled resilience testing experiments. We’ll explore real-world examples from Security and HR tech providers who test cross-account failure scenarios while maintaining strict tenant isolation. Discover patterns for implementing resilience testing that strengthen your SaaS architecture without risking customer availability.
Workshops
Chaos engineering workshop (COP304)
This workshop introduces AWS Fault Injection Service (FIS) for running resilience experiments, also known as chaos engineering. You’ll learn how to inject faults and apply test scenarios, such as power interruptions and cross-Region connectivity issues, to see how they affect the behavior of services like Amazon EKS, Amazon ECS, AWS Fargate, Amazon EC2, Amazon S3, and Amazon RDS. You’ll also learn how to produce experiment reports required for compliance in regulated industries. Finally, you’ll learn how to use Amazon CloudWatch, AWS X-Ray, and Amazon CloudWatch RUM to gain key insights from your experiments.
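If you want a feel for what a resilience experiment looks like before the workshop, the sketch below illustrates the general shape: inject faults, observe behavior, and check a hypothesis. It runs entirely in-process with the Python standard library and does not use or imitate the AWS Fault Injection Service API. The names `with_faults`, `with_retries`, and `run_experiment` are hypothetical helpers invented for this illustration.

```python
import random


class InjectedFault(Exception):
    """Marks a failure that the experiment injected on purpose."""


def with_faults(func, failure_rate, rng):
    """Wrap func so that roughly failure_rate of its calls raise InjectedFault."""
    def wrapper(*args, **kwargs):
        if rng.random() < failure_rate:
            raise InjectedFault("injected failure")
        return func(*args, **kwargs)
    return wrapper


def with_retries(func, attempts):
    """Wrap func so an injected fault is retried up to `attempts` calls in total."""
    def wrapper(*args, **kwargs):
        for attempt in range(attempts):
            try:
                return func(*args, **kwargs)
            except InjectedFault:
                if attempt == attempts - 1:
                    raise  # out of retries: surface the failure
    return wrapper


def run_experiment(call, attempts, max_error_rate):
    """Call `call` repeatedly and compare the error rate with the hypothesis."""
    errors = 0
    for _ in range(attempts):
        try:
            call()
        except InjectedFault:
            errors += 1
    observed = errors / attempts
    return {
        "observed_error_rate": observed,
        "hypothesis_held": observed <= max_error_rate,
    }


if __name__ == "__main__":
    rng = random.Random(42)  # seeded so repeated runs inject the same faults
    faulty = with_faults(lambda: "ok", failure_rate=0.3, rng=rng)
    print("no mitigation:", run_experiment(faulty, 1000, max_error_rate=0.05))
    print("with retries: ", run_experiment(with_retries(faulty, 3), 1000, max_error_rate=0.05))
```

Comparing the two printed reports shows the kind of evidence an experiment produces: the same injected fault rate, measured with and without a mitigation in place.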