AWS Architecture Blog

A multi-dimensional approach helps you proactively prepare for failures, Part 1: Application layer

Resiliency of applications surpasses everything else in building customer trust. Because of this, it cannot be an afterthought. Instead of simply reacting to a failure, why not be proactive?

As your system expands, you’ll likely encounter issues that can hinder your ability to scale, like security and cost. So, it’s necessary to think about the correct architectural patterns beforehand to minimize your chances of enduring a failure without a recovery plan.

In this multi-part blog series, we’ll show you how to take a multi-dimensional approach to ensure your applications, infrastructure, and operational processes can detect points of failure and can gracefully react if (and inevitably when) a failure occurs.

In each part of the series, we recommend resiliency architecture patterns and managed services to apply across all layers of your distributed system to create a production-ready, resilient application. This post, Part 1, examines how to create application layer resiliency.

Example use case

To illustrate our recommendations for building an effective distributed system, we’ll consider a simple order payment system. Figure 1 shows our base implementation. The real-time part of the distributed system is built with:

This system is transactional so far, but we need to aggregate and analyze data. We do this using Amazon Simple Queue Service (Amazon SQS), AWS Lambda, AWS Step Functions, Amazon S3, and Amazon Athena. Amazon Kinesis Data Firehose and Amazon Redshift populate the data warehouse via their data pipelines. Amazon QuickSight helps us visualize the data.

Architecture components of a distributed system

Figure 1. Architecture components of a distributed system

Next, let’s look into distributed architectural patterns we could apply to the base implementation to bolster resiliency and common tradeoffs you need to consider for each.

Pattern 1: Microservices

Microservices are building blocks for smaller, domain-specific services. As shown in the Microservices on AWS whitepaper, microservices offer many benefits. They reduce the blast radius of any failure, require smaller teams to manage them, and simplify deployment pipelines to get them into production.

Tradeoffs and their workarounds

If your distributed system is comprised of several smaller services, a failure or inability to respond to failures in one service might affect other services in the chain.

To help with this, consider implementing one or more of the following patterns.

Circuit breaker pattern

Like an electrical circuit breaker, the circuit breaker pattern stops cascading failures. You can implement it as an orchestrator, at an individual microservice level, and/or in a service mesh across multiple services to detect timeouts and track failures across services, which prevents a distributed system from getting overwhelmed.

Retries with exponential backoff and jitter

A common way to handle database timeouts is to implement retries. However, if all the transactions retry at the same interval, it might choke the network bandwidth and throttle the database.

We recommend introducing exponential backoff and jitter to the retries and to introduce an element of randomness in the retry interval.

Let’s see this in action. Consider the backend implementation of the order payment system in our example use case, as shown in Figure 2.

Backend processing of order payment system

Figure 2. Backend processing of order payment system

For incoming orders, the payment process must succeed for the order to be processed. To ensure latency in the payments database writes does not affect the transactions that read from the database for the order processing UI:

  1. Isolate reads and writes with a read-replica
  2. Use an Amazon RDS Proxy to handle connection pools and use throttling to help with preventing database congestion and locking
  3. Introduce exponential backoff to increase the time between subsequent retries of failed requests. Combine this approach with “jitter” to prevent large bursts of retries

Figure 3 compares recovery time with exponential backoff to exponential backoff combined with jitter:

Recovery time graph for exponential backoff with and without jitter

Figure 3. Recovery time graph for exponential backoff with and without jitter

Health checks and feature flags

If you’ve deployed new functionality in production and it doesn’t work as expected, use feature flags to turn it off rather than roll back the deployment. This helps you reduce operational complexity and downtime. See the Automating safe, hands-off deployments article for more information.

The load balancer uses periodic automatic health checks to determine if the backend service is healthy and can serve traffic. Figure 4 shows you where to use shallow and deep health checks:

  • Use shallow health checks to probe for host/local failures on a server resource (for example, liveliness checks).
  • Probing for dependency failures needs a deep health check. Deep health checks will give you a better understanding of the health of the system, but they’re expensive, and a dependency check failure can cause cascading failure throughout the system.
Shallow vs deep health checks

Figure 4. Shallow vs deep health checks

Pattern 2: Saga pattern

A saga pattern keeps your data consistent across microservices when a transaction spans multiple services. It updates each service with a message or event to prompt the service to move to the next transaction step.

For example, in our order payment system, a transaction is considered successful if the payment for that order is processed.

Tradeoffs and their workarounds

The saga pattern is great for long transactions. However, if a step in the process cannot complete, you’ll need compensating transactions in place to undo any changes from previous steps. For example, as shown in Figure 5, if the payment fails, the customer’s order must be canceled.

The saga pattern coordinates transactions across a system of microservices

Figure 5. The saga pattern coordinates transactions across a system of microservices

Consider implementing the following pattern to set up compensating transactions.

Serverless saga pattern with Step Functions

We recommend using a serverless orchestrator in Step Functions to roll back to a previous step in an order. Figure 6 shows the serverless orchestrator and how its lightweight, centralized control service coordinates the flow across microservices. It performs compensating transactions during failure scenarios to ensure that if one microservice fails, the others will not.

Step Functions as serverless orchestrators

Figure 6. Step Functions as serverless orchestrators

Pattern 3: Event-driven architecture

An event-driven architecture uses events to communicate between decoupled services. By decoupling your services, they are only aware of the event router, not each other. This means that your services are interoperable, but if one service has a failure, the rest will keep running. The event router acts as an elastic buffer that will accommodate surges in workloads.

Tradeoffs and their workarounds

An event-driven architecture provides multiple implementation options. Carefully consider your system’s distinct attributes and scaling and performance needs to decide on a suitable approach.

The mind map in Figure 7 shows when to use an event-driven pattern, our recommendations for implementation, tradeoffs and their workarounds, and anti-patterns:

Decision criteria for event-driven architecture pattern

Figure 7. Decision criteria for event-driven architecture pattern (click the image to enlarge)

Pattern 4: Cache pattern

Deploying stateless redundant copies of your application components, along with using distributed caches, can improve the availability of your system. This allows your infrastructure to scale in response to user requests.

Tradeoffs and their workarounds

Caches have multiple configuration options. Carefully consider your system’s distinct attributes and scaling and performance needs to decide on a suitable approach.

The mind map in Figure 8 shows when to use a cache pattern, our recommendations for implementation, tradeoffs and their workarounds, and anti-patterns:

Decision criteria for cache pattern

Figure 8. Decision criteria for cache pattern (click the image to enlarge)

Improving application resiliency with bounded contexts

Loosely coupled services improve availability more than only creating redundant copies of your applications. Figure 9 shows the benefits of a loosely coupled system versus a tightly coupled system.

Relationship between loose coupling and availability

Figure 9. Relationship between loose coupling and availability


In this post, we learned about various architecture patterns and their tradeoffs, and provided recommendations on how to gracefully mitigate failures before they happen.

Distributed systems can use all of these patterns, which helps improve resiliency at the application layer. Applying these patterns, along with the Infrastructure and Operations improvements discussed in parts 2 and 3, will provide frameworks for resilient applications.

Piyali Kamra

Piyali Kamra

Piyali Kamra is a seasoned enterprise architect and a hands-on technologist who has over 21 years of experience building and executing large scale enterprise IT projects across geographies.She believes that building large scale enterprise systems is not an exact science but more like an art, where you can’t always choose the best technology that comes to one’s mind but rather tools and technologies must be carefully selected based on the team’s culture , strengths , weaknesses and risks , in tandem with having a futuristic vision as to how you want to shape your product a few years down the road.

Aish Gopalan

Aish Gopalan

Aish Gopalan is a Senior Solutions Architect at AWS based out of the beautiful city of Atlanta, Georgia. Primarily an individual contributor, she has worn multiple hats ranging from Solution Architect, Application Architect, Cloud Transformation Delivery Lead and Developer in her 16yr software journey. Always chipper, she enjoys and follows Crossfit and believes in living for the moment.

Isael Pimentel

Isael Pimentel

Isael Pimentel is a Senior Technical Account Manager at AWS with over 15 years of experience in developing and managing complex infrastructures, and holds several certifications including AWS Solution Architect, AWS Network Specialty, AWS Security Speciality, MSCA, and CCNA.

Aditi Sharma

Aditi Sharma

Aditi Sharma is a Technical Account Manager in AWS and started her journey as a Cloud Support Engineer in the RDS Databases team where she worked on various DB engines. She has since then held several roles in AWS before moving to Technical Account Manager role. Prior to joining AWS, she was working as a Software Development Engineer in a Banking industry. Aditi hold Master’s in Management Information systems and Bachelor’s in Computer Science Engineering and also Certified Scrum Master, Product Owner and AWS Associate Solutions Architect.