By Mike Haken

Introduction

Years before I came to Amazon, I worked in restaurants. I held numerous positions, including server, line cook, bartender, and delivery driver. I noticed that most of the places I worked at dealt with the load of busy events, like the lunch rush, in similar ways. When I started working at Amazon, I was struck by how much the approaches to solving problems of load in restaurants resembled how we build resilient systems in the cloud.

In restaurants, usually all customers could be seated and served quickly. However, under certain circumstances, even when it wasn’t that busy, temporary problems could cause significant delays. This happened when many customers either arrived or ordered at the same time, or when mistakes caused rework (such as dropping a plate of food, and unfortunately, I’ve dropped a few). When it did get busy and the number of customers exceeded our capacity to serve them, we would create a queue of customers waiting for a table. Both of these situations made them unhappy. To manage these problems, we employed a number of strategies to help provide a predictable and consistently good customer experience.

My experience working in restaurants also shaped how I behaved when I went out to eat. I wanted to limit my impact on the restaurant’s load, so if there were a queue at the door, I might choose to come back later or go somewhere else. If I asked a server for something, and they were really busy and forgot, I would politely ask again. If there were a minor problem with my order when it was really busy, I might just not eat the cold broccoli instead of sending my whole order back. I wanted to avoid creating further delays and extra wait times for myself and others. In short, I wanted to be a well-mannered restaurant patron. And without knowing it, I’d experienced what overload is like from both the server-side and client-side perspectives in distributed systems, as well as a few strategies to deal with those situations before I ever entered the technology industry.

After trading my “server-full” experience for serverless technologies in the cloud, I continued to draw upon my experience both as a restaurant worker and a restaurant customer. At Amazon, one of the areas I’ve focused on is how to make systems more resilient, in particular, how to control excessive load. I observed how spikes in load could affect systems in the cloud. It was similar to the way overload affected the quality of service in restaurants. However, I also saw a significant difference. Cloud services are elastic and can rapidly add capacity in response to increased demand, whereas capacity in restaurants, such as square footage, is static and can take years to scale. So, while overload events at restaurants can be quite common, at Amazon, we go to great lengths to prevent overload events from ever occurring.

Automated capacity forecasting and auto scaling help ensure our services stay ahead of demand by significant margins. We favor being over-provisioned to provide a buffer in capacity. This relieves customers from traditional on-premises capacity planning and allows them to build elastic services in the cloud. For example, in our services, we plan for extremely unlikely events, such as the loss of capacity of a single Availability Zone, and we have enough capacity pre-provisioned in other Availability Zones to seamlessly absorb the additional work and forecasted usage, as shown in the following figure.

This also means that under normal circumstances our services have sufficient capacity to handle significant variations in load. In the cloud, true overload scenarios are rare events. But when things don’t go to plan, and one of these exceptional events does occur, we need to be prepared to prevent impact to our customers’ experience.
 
At Amazon, we use a combination of strategies to prevent overload, and we’ve described some of them in detail in Amazon Builders’ Library articles, which I’ll discuss later. Many of these strategies describe how we build well-protected services. Some of them are operational strategies for reacting to overload situations after they have been detected by our observability systems. Others are architectural strategies that are part of the design of our services to prevent overload from occurring. Additionally, our services typically interact with dependencies, and we employ strategies to be well-behaved clients. The specifics of the various techniques we use across these three areas are complex topics that we don’t have the space to go into here. However, we will provide links to articles that provide deep dives on them.
 
This article will give you a high-level overview of some of our most common strategies for controlling load, describe when and why we use each strategy, and identify the pros and cons of each. I hope this will help you choose the right combination of strategies for controlling load in your own systems to build well-protected services, well-behaved clients, or both.

Well-protected services

Restaurants need to manage customer demand (load) as well as service time (latency) to maintain the customer experience their patrons expect. In restaurants, we knew that capacity was limited. There are only so many tables, chairs, servers, and cooks. So, when a restaurant is full, it can only seat people at the rate that people leave. Increases in the time it takes to complete meal service slow the exit rate of customers, which in turn reduces the rate at which new customers can be seated.

Some of the approaches we used to manage demand and service time are operational. They are the ways the restaurant responds when load or latency starts to increase. Other restaurants I worked at were able to use architectural approaches. The restaurant’s systems and workflows were designed to help prevent overload from occurring. These approaches allowed a lot of different parties to enjoy meals at the same time, while giving each party the same service and attention they’d get in a private setting.

At Amazon, we also want to protect ourselves from overload situations. Like restaurants, we use both operational and architectural strategies to be resilient to excessive load, which helps us preserve a good customer experience.

Operational practices

Operational practices include the strategies we use to respond to unanticipated load after the event occurs. Typically, we use a combination of these strategies together, with each providing a different dimension of protection. Let’s examine a few of the operational strategies that help us maintain a consistent customer experience when overload occurs.

Load shedding

One of the most dreaded responsibilities in the restaurant business, at least for me, is called side work. Side work includes activities like refilling the salsa pan, restocking tortilla chips, rolling forks and knives into napkins, and refilling ice. These activities helped the restaurant run, but there weren’t resources specifically dedicated to them. When the restaurant would get really busy, we’d prioritize the most important work, such as taking customer orders, refilling drinks, and delivering food, and we’d deprioritize side work. That enabled us to keep up with the increased demand and maintain the average service time. If we didn’t reprioritize in this way, we might spend a lot of time doing work that wasn’t essential for the customer experience (eating a meal in a reasonable time), resulting in an overall slowdown in the restaurant. But neglecting side work couldn’t last forever; eventually the chips and salsa would run out, and we had to restock them. Deprioritizing side work is analogous to load shedding in the cloud: intentionally and temporarily discarding work in a system.

When the offered load a system receives increases, it can hit a tipping point where that additional load lowers the system’s goodput (its ability to successfully process work within the client timeout). The acts of accepting new work, placing it into a processing queue, and context switching between different processes all contribute to additional processing latency. This is shown in the following figure.

Load shedding helps prevent making a bad situation worse. It can protect a service from becoming overwhelmed during transient spikes in load and give us time to scale or address the root cause of the problem. The result of load shedding, allowing the system to handle increased load while maintaining a consistent level of goodput, is shown in the next figure.

We use load shedding in varying dimensions. It could be per service, per API, or per resource. In implementing load shedding, it’s important to build the system to know when it’s taking on too much work. Therefore, we might look at request rate, thread pool utilization, or queue depth, among other metrics. Overload testing is essential both to identify the tipping point and to ensure that load shedding keeps goodput flat, rather than declining, after that tipping point is reached.
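To make this concrete, the following is a minimal sketch of shedding load based on one of those signals, the count of in-flight requests. The threshold and names are illustrative assumptions, not a specific Amazon implementation, and a real service would derive the limit from overload testing.

```python
import threading

# Illustrative threshold: in practice, determined through overload testing.
MAX_IN_FLIGHT = 200

_in_flight = 0
_lock = threading.Lock()

class Overloaded(Exception):
    """Raised when the service chooses to shed a request."""

def handle_request(process):
    """Process a request, or shed it if too much work is already in flight."""
    global _in_flight
    with _lock:
        if _in_flight >= MAX_IN_FLIGHT:
            # Shed load: fail fast with a retryable "server busy" error instead of
            # queuing work we can't finish within the client timeout.
            raise Overloaded("503 Slow Down")
        _in_flight += 1
    try:
        return process()
    finally:
        with _lock:
            _in_flight -= 1
```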

The effects of load shedding are generally indiscriminate in their impact on customers. It’s a coarse-grained tool to control load. While it helps extend the runway the service has to maintain a high level of goodput, it also has an impact on the customer experience, so it’s not the only solution we want in place to handle the situation. We use this strategy in coordination with others to create a more holistic approach to controlling load in our services. Refer to the Amazon Builders’ Library article, Using load shedding to avoid overload, for more details on this strategy.

Fairness and quota management

Sometimes excessive load events can be driven by a single customer who starts to use an unexpectedly large portion of resources. At one restaurant where I worked, on holidays we’d give customers a time limit on how long they could have a table before they needed to leave. This ensured that one party couldn’t occupy a table for the entire night and allowed us to turn it over and seat other customers. By limiting the amount of time any one party could occupy a table, we were able to accommodate more customers during the meal service. This provided a fairer share of the tables each night.

At Amazon, we follow a similar practice for ensuring customers get an equitable share of the available resources. When a single customer is driving load, we don’t want to use load shedding indiscriminately against all customers. Instead, we use techniques that maintain a single-tenant experience in a multi-tenant system so that customers receive consistent, predictable performance. To do so, we typically measure resource consumption through request rate or number of provisioned resources and enforce these limits through Service Quotas. For example, we provide service quotas on transactions per second (TPS) in our APIs, the number of Amazon Elastic Compute Cloud (Amazon EC2) instances you can launch, or the concurrency of AWS Lambda functions in your account.

When it comes to maintaining fairness of TPS, we might decide to rate limit customers who exceed their quota by using algorithms like token bucket (generally the most popular), leaky bucket, exponentially weighted moving average (EWMA), fixed window, or sliding window. However, our services aren’t under full load all of the time, and so occasionally it’s fair to let a customer exceed their quota temporarily through bursting in conjunction with one of these algorithms. The following figure illustrates the use of a token bucket algorithm for rate limiting.
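As an illustration of the most popular option, here’s a minimal token bucket sketch. The sustained rate and burst capacity are hypothetical values that, in practice, would come from the customer’s quota.

```python
import time

class TokenBucket:
    """Simple token bucket: allows a steady rate with limited bursting."""

    def __init__(self, rate_per_sec, burst_capacity):
        self.rate = rate_per_sec          # sustained quota, e.g. 100 TPS
        self.capacity = burst_capacity    # how far a customer may burst
        self.tokens = burst_capacity
        self.last_refill = time.monotonic()

    def try_acquire(self, tokens=1):
        """Return True if the request is within quota, False if it should be throttled."""
        now = time.monotonic()
        # Refill tokens accrued since the last call, capped at the bucket capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last_refill) * self.rate)
        self.last_refill = now
        if self.tokens >= tokens:
            self.tokens -= tokens
            return True
        return False

# Hypothetical per-customer quota: 100 TPS sustained, bursts up to 200 requests.
bucket = TokenBucket(rate_per_sec=100, burst_capacity=200)
if not bucket.try_acquire():
    pass  # return a throttling error, such as HTTP 429, to the caller
```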

For service quotas, we publish the default and applied values of quota information that can be accessed through the AWS Management Console or the AWS CLI. To help you proactively manage your service quotas, we provide AWS usage metrics in Amazon CloudWatch. These metrics allow you to manage usage by visualizing metrics, creating custom dashboards, and setting up alarms that indicate when you approach a quota threshold so you can request a quota increase when needed. This can make it easier to stay ahead of service quota increases or help you address inadvertent usage of your resource quotas.
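As one way to put this into practice, the sketch below uses boto3 to alarm when usage approaches an applied quota by combining an AWS/Usage metric with the SERVICE_QUOTA() metric math function. The example targets EC2 On-Demand vCPUs; treat the dimension values as assumptions to verify against the usage metrics actually published in your account.

```python
import boto3

cloudwatch = boto3.client("cloudwatch")

# Alarm when On-Demand vCPU usage exceeds 80% of the applied service quota.
# The AWS/Usage dimensions below follow the documented pattern, but verify them
# against the usage metrics in your own account before relying on this alarm.
cloudwatch.put_metric_alarm(
    AlarmName="ec2-vcpu-usage-near-quota",
    EvaluationPeriods=3,
    DatapointsToAlarm=3,
    Threshold=80.0,
    ComparisonOperator="GreaterThanThreshold",
    TreatMissingData="notBreaching",
    Metrics=[
        {
            "Id": "usage",
            "MetricStat": {
                "Metric": {
                    "Namespace": "AWS/Usage",
                    "MetricName": "ResourceCount",
                    "Dimensions": [
                        {"Name": "Service", "Value": "EC2"},
                        {"Name": "Resource", "Value": "vCPU"},
                        {"Name": "Type", "Value": "Resource"},
                        {"Name": "Class", "Value": "Standard/OnDemand"},
                    ],
                },
                "Period": 300,
                "Stat": "Maximum",
            },
            "ReturnData": False,
        },
        {
            # SERVICE_QUOTA() returns the applied quota for the usage metric.
            "Id": "pct_of_quota",
            "Expression": "(usage / SERVICE_QUOTA(usage)) * 100",
            "Label": "Percent of vCPU quota used",
            "ReturnData": True,
        },
    ],
)
```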

Fairness helps deliver a consistent, single-tenant experience in a multi-tenant environment. But it comes at the cost of occasionally providing a degraded customer experience when we have to rate limit or enforce a quota. Like load shedding, fairness is something we use in combination with other strategies to comprehensively manage load. See the Amazon Builders’ Library article, Fairness in multi-tenant systems, for additional details.

Auto scaling

When restaurants see a trend of increased or decreased usage of any item over the course of a day, week, or month, they ensure they are stocked appropriately for consumption of that item in the future. These usage trends help continuously forecast future capacity needs. For example, as the seasons change from spring to summer, we’d typically see a drop off in hot soup orders and an increase in cold soup orders. As these changes occurred, we’d order fewer ingredients for the hot soup and more ingredients for the cold soup to serve our forecast and prevent wasted resources.

But this kind of forecasting doesn’t handle sudden spikes in demand well. In the United States, National Gazpacho Day occurs in the normally cold month of December each year. If we weren’t pre-provisioned in expectation of demand on that day, orders could likely exceed the quantity of gazpacho we’d normally stock in the winter. Typically, when we ran out of a menu item, we’d remove it from the menu (“86 it” in restaurant speak) and stop accepting orders. Then, we scaled up either by preparing more for that meal service (if we had the ingredients) or by waiting for our next food delivery. Although this usually provided a stopgap to give us time to scale production, on National Gazpacho Day the increased demand was transient, and increasing our supply of gazpacho for the next day in response wouldn’t be useful. Auto scaling in the cloud has many of the same considerations.

At Amazon, our primary strategy for ensuring sufficient capacity is our automated forecasting. This ensures that we provision for future capacity needs with significant buffers for large fluctuations in actual usage, including short-term events like Amazon Prime Day or the day after Thanksgiving in the United States. We also proactively monitor service quotas and can automatically request quota increases when a service is nearing a limit to ensure we scale in that dimension as well. Additionally, we pre-provision resources to handle the loss of capacity of one Availability Zone, which acts as an additional capacity buffer. Finally, we use Amazon EC2 Auto Scaling to help tailor our capacity to shorter-term changes in demand. Auto Scaling allows our services to continue to be elastic while helping optimize our capacity provisioning.
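For the Auto Scaling piece, a target tracking policy is one common way to express “keep utilization low enough to preserve a buffer.” The following is a hypothetical boto3 example; the group name and the 40 percent CPU target are chosen purely for illustration.

```python
import boto3

autoscaling = boto3.client("autoscaling")

# Hypothetical Auto Scaling group. A 40% CPU target leaves headroom for spikes
# that arrive faster than new capacity can come online.
autoscaling.put_scaling_policy(
    AutoScalingGroupName="my-service-asg",
    PolicyName="keep-cpu-at-40-percent",
    PolicyType="TargetTrackingScaling",
    TargetTrackingConfiguration={
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "ASGAverageCPUUtilization"
        },
        "TargetValue": 40.0,
    },
)
```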

However, in rare events where traffic spikes suddenly beyond the numbers that we’ve forecasted, auto scaling by itself can be insufficient because it doesn’t react fast enough. This is illustrated in the following figure, which shows a coarse-grained view on the left of how capacity forecasting and auto scaling keep us ahead of the actual demand curve. On the right, the figure zooms in on a small section, showing a different situation. On the left side of the granular view, transient spikes are handled through the capacity buffer we implement. On the right side of the granular view, a sudden load spike exceeds our provisioned capacity over a period of time that is shorter than the time required for auto scaling to detect and react. This situation must be managed with load shedding or rate limiting to protect the service from overload impacts.

Forecasting, auto scaling, and pre-provisioned capacity are essential tools for providing elasticity in the cloud and delivering a good customer experience while minimizing unnecessary waste. At Amazon, we make substantial investments in capacity management to handle significant, transient load spikes. During the rare events where demand exceeds our provisioned capacity more quickly than auto scaling can handle, we typically combine load shedding and rate limiting to help protect the service while we scale to meet demand, or the load event dissipates. The combination of these operational practices helps provide part of a holistic approach to managing excessive load.

Architectural practices

Restaurants not only respond to load issues, but they also proactively build their workflows, practices, and interactions to help prevent certain types of overload scenarios. For example, some restaurants I worked at required customers to make a reservation and didn’t take walk-ins. This prevented a Friday night dinner rush, allowing the restaurant to operate on a predictable rate of diners. In much the same way, we architect services at Amazon to help reduce the number of possible overload situations that can occur. In this section, we’ll look at some of the architectural practices we use to prevent excessive load.

Managing queue depth

When a restaurant would go on a wait, we would start adding customers to the wait list. As the queue grew, it would become harder and harder to accurately predict individual customer wait times. The restaurant needed to be able to manage the queue to avoid customers waiting for longer than they were told. After waiting long enough, customers might just leave and go somewhere else. And the list couldn’t grow indefinitely. Beyond providing accurate wait times, at some point the restaurant would need to close and be cleaned for the next day (and the staff would certainly want to go home at some point).

I’ve also experienced the way that queue depth affects the customer experience when it takes too long to bring food to the table. On occasion, a customer might leave before receiving their meal (their timeout threshold had been reached). If the kitchen were unaware that the customer had left, it would spend time preparing meals based on their ticket queue for people that would never receive them. As customers gave up waiting for their food, the kitchen would continue to do work that didn’t lead to goodput. Then, if more customers walked out, the situation could continue in perpetuity, similar to a metastable failure. To stop the sustained impact, servers needed to tell the kitchen which tables had already left, so the cooks could move on to orders for customers who were still patiently waiting.

The following figure shows how this situation plays out in the cloud. It illustrates an excessively deep First In, First Out (FIFO) queue where workers waste time and resources processing messages that the client has given up on.

We take great care at Amazon to design our systems to prevent excessive latency and metastable failures driven by queue depth. A number of our approaches to managing queues are detailed in the Amazon Builders’ Library article, Avoiding insurmountable queue backlogs. Generally, the approaches Amazon uses are either to bound the queue and reject excess work through rate limiting or load shedding, or to use an approach like a Last In, First Out (LIFO) queue to process the most recently received work, which is most likely to succeed. These strategies help give our systems a better chance of doing useful work, which in turn helps provide a more consistent customer experience. These strategies also allow the queue to be testable at a maximum size that provides predictable performance under load. However, they can also reject some work, or deliberately ignore it, based on the service’s understanding of what is most likely to succeed.
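As a small illustration of the bounded-queue approach, the sketch below rejects new work when the queue is full and hands workers the newest item first (LIFO). The depth limit and names are illustrative, not a specific Amazon implementation.

```python
import collections
import threading

class BoundedLifoQueue:
    """Bounded queue that rejects new work when full and serves the newest work first."""

    def __init__(self, max_depth):
        self.max_depth = max_depth
        self._items = collections.deque()
        self._lock = threading.Lock()

    def offer(self, item):
        """Return False (shed the work) instead of letting the backlog grow without bound."""
        with self._lock:
            if len(self._items) >= self.max_depth:
                return False
            self._items.append(item)
            return True

    def take(self):
        """Workers take the most recently added item, which is most likely to still be
        within its client's timeout."""
        with self._lock:
            return self._items.pop() if self._items else None

# Illustrative bound, chosen so queued work can still finish within client timeouts.
queue = BoundedLifoQueue(max_depth=1000)
```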

Constant work

At one of the restaurants where I worked, we’d sometimes host private events. Guests would ebb and flow in and out of the event without a waitlist (at least up to what the fire code allowed in the building). Guests could come and go, and we didn’t know how long they would stay. This created a guest service problem. If we took a unique order from every person and submitted each order to the kitchen, it would produce a very spiky, inconsistent load pattern. This could result in a huge backlog of orders. Additionally, guests might leave before they received their order, resulting in wasted effort and food (as in the queue situation discussed previously).

At these events we avoided this guest service problem by performing constant work to prevent overload. In restaurant terms, this means that we sent servers out with hors d’oeuvres and small plates at regular intervals, which alleviated pressure on the kitchen. The staff knew that they had to plate food at consistent intervals and only had to prepare a set number of dishes. We planned to serve the maximum number of guests that could fit in the venue. This provided a good customer experience. Everyone could graze, and new platters of food came out quickly. The amount of work the kitchen and wait staff performed didn’t fluctuate based on the number of guests. However, the downside was that food might be wasted and extra work might be done.

When we build services at Amazon that we expect will need to process highly variable rates of change, constant work is a common strategy we select to smooth out the variance in those change rates. Constant work means that a system doesn’t scale up or down with load or stress. It does the exact same amount of work in almost all conditions resulting in predictable load without spikes. For example, instead of pushing new configuration files to a server fleet each time there is an update, servers instead poll an Amazon Simple Storage Service (Amazon S3) bucket every few seconds for the configuration file and apply it, even if nothing has changed, as shown in the following figure.
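Here’s a minimal sketch of that polling loop. The bucket, key, and interval are hypothetical; the point is that the same GetObject call happens every interval whether or not anything has changed.

```python
import time
import boto3

s3 = boto3.client("s3")

# Hypothetical location of the configuration file.
BUCKET, KEY = "my-config-bucket", "service/config.json"
POLL_INTERVAL_SECONDS = 5

def apply_config(raw_config: bytes) -> None:
    """Apply the configuration (details omitted); safe to call even when nothing changed."""
    ...

while True:
    # Constant work: fetch and apply the full config every interval, even if it is
    # unchanged, so the load on S3 and on this host never spikes.
    response = s3.get_object(Bucket=BUCKET, Key=KEY)
    apply_config(response["Body"].read())
    time.sleep(POLL_INTERVAL_SECONDS)
```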

However, this means that the system must be scaled to handle that constant rate of work, and it introduces potential additional cost by performing work when it might not be needed. Just like the possibility of wasted food that was created at the restaurant, this pattern can introduce wasted work in the system, though we typically find this cost is offset by the resilience and stability benefit it provides. But we can’t always use this pattern; sometimes we really do need event-driven systems. See the Amazon Builders’ Library article, Reliability, constant work, and a good cup of coffee, for more details.

Putting the smaller service in control

During my restaurant career, I did a stint tending bar. I remember that when we transitioned from meal service to just drink service on our busiest nights, we could have two or three rows of customers waiting at the bar to place drink orders with us. Typically, on those busy nights there was at least an order of magnitude more customers than bartenders. We needed to control the rate of orders so we could avoid the situation where all of the customers shouted out their orders at the same time and overwhelmed the bar staff. We controlled the rate of the drink orders by calmly walking up and down the bar, deciding who we’d ask, “What would you like?” This put the smaller group, the bartenders, in control of the demands of the larger group, the customers. The bartenders were pulling orders from the customers instead of letting the orders be pushed to them.

I was reminded of this strategy when I thought about the interactions between the control plane and data plane in our services. At AWS, most of our services are divided into these components. The control plane APIs provide the administrative and configuration functions of the service, such as creating, updating, and deleting resources. The data plane provides the core day-to-day functionality of the provisioned resource. We expect that the control plane is used at a much lower volume than the data plane. Thus, the fleets that make up the control plane are also considerably smaller than those making up the data plane. However, the data plane relies on information from the control plane for its operation. We carefully consider whether the data plane should pull that information from the control plane, or whether the control plane should push the information to the data plane. Typically, we decide to put the smaller fleet in control, meaning we choose to push from the control plane to the data plane. This prevents the larger data plane fleet from overwhelming the control plane fleet. The following figure illustrates this scenario.
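The following simplified sketch, with hypothetical host names and endpoint, shows the push direction: the small control plane fleet walks its inventory of data plane hosts and pushes each update at a pace it controls.

```python
import time
import requests  # assumption: data plane hosts expose a simple HTTP config endpoint

# Hypothetical inventory the control plane maintains of every data plane host.
DATA_PLANE_HOSTS = ["10.0.1.10", "10.0.1.11", "10.0.1.12"]

def push_update(update: dict) -> None:
    """The smaller fleet sets the pace, so the large data plane fleet never
    overwhelms the control plane by pulling all at once."""
    for host in DATA_PLANE_HOSTS:
        try:
            requests.post(f"http://{host}:8080/config", json=update, timeout=2)
        except requests.RequestException:
            # Track the miss; a convergence process retries hosts that were unavailable.
            record_missed_host(host, update)
        time.sleep(0.1)  # pace the pushes instead of fanning out all at once

def record_missed_host(host: str, update: dict) -> None:
    ...  # persisted elsewhere so the update eventually converges
```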

This strategy allows us to maintain separate components within a service that are scaled independently based on their own needs, while still allowing them to interact in a way that avoids potential overload. But this strategy has its challenges. The larger fleet needs to maintain all of the data pushed to it from the smaller fleet and implement recovery mechanisms should the data become lost or corrupted. Additionally, the smaller fleet needs to maintain an inventory of every node in the larger fleet and equally share work pushing updates. And finally, the smaller fleet needs to manage the convergence of updates to ensure nodes in the larger fleet that are temporarily unavailable still receive them. Refer to the Amazon Builders’ Library article, Avoiding overload in distributed systems by putting the smaller service in control, for additional information.

Avoiding cold caches

At the restaurants where I worked, we typically prepared some of the most popular items ahead of time. This helped speed up meal preparation when the restaurant opened. For example, we’d make large batches of French fries, have the dry components of salads premade, or have pots of soup prewarmed. Keeping premade things stocked was the focus of a lot of the asynchronous side work I mentioned earlier. The system worked well until one of the pre-prepped items ran out, perhaps because we were load shedding our side work. When, say, the soup ran out, we had to go back to the walk-in fridge, get a new pot of soup, and start warming it up from scratch, which took much longer than incremental top-ups. To get our lunch soup and sandwich combo orders out on time, the servers created a workaround: they heated up the soup in the microwave. When this happened, you might have seen a long line of servers in the kitchen, each waiting for the microwave to warm up cups of soup. This workaround applied a lot of load to the microwave, which could only handle a few cups of soup at a time. This scenario is very similar to the cold cache issues we want to prevent at Amazon.

We typically add caches to reduce tail-end latency, reduce cost, or mask downstream transient errors. But we’ve also seen that services tend to optimize for these improvements over time and the cache transitions from a helpful addition to a critical component. When the cache goes cold, perhaps through a service restart, cache outage, or change in traffic patterns, all of the work the cache was offloading is now applied directly to the downstream data source. If the downstream data source was only scaled for the typical load, like 10 percent of reads, this change in traffic can overwhelm it, causing latency to exceed client timeouts and reducing the perceived availability, as shown in the following figure.

To avoid this situation, we generally limit concurrency through rate limiting or load shedding while the cache is warmed to prevent overload. Alternatively, in some services, the cache can be prewarmed by tracking the most recently used or most popular entries in the underlying data store and loading that data first. In other cases, we always read from the origin in addition to the cache to remove the bimodal behavior caches can introduce and ensure the database is always scaled to handle its maximum load. This trades off the performance benefit while still allowing the cache to provide redundancy during impairments of the origin. We also seriously consider whether a cache is the appropriate solution for the problem at all. See the Amazon Builders’ Library articles, Caching challenges and strategies and Using dependency isolation to contain concurrency overload, for more information.
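One way to express “limit concurrency while the cache is warmed” is a semaphore in front of the origin, as in the illustrative sketch below. The permit count is an assumption standing in for whatever load the downstream data store was actually scaled to handle.

```python
import threading

# Illustrative limit: only allow as many concurrent cache-miss reads as the
# downstream data store was scaled for.
origin_permits = threading.BoundedSemaphore(value=20)
cache = {}

class OriginOverloaded(Exception):
    pass

def get(key, read_from_origin):
    """Serve from cache; on a miss, bound the concurrency applied to the origin."""
    if key in cache:
        return cache[key]
    if not origin_permits.acquire(blocking=False):
        # The cache is cold and the origin is already at its concurrency limit:
        # shed the request rather than overwhelm the data store.
        raise OriginOverloaded("try again shortly")
    try:
        value = read_from_origin(key)
        cache[key] = value
        return value
    finally:
        origin_permits.release()
```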

Well-behaved clients

As I mentioned earlier, my experience in restaurants has made me want to be a well-mannered patron. I understand the stress of being a server and working in a kitchen, especially when a restaurant is busy. So, I don’t want to add to anyone’s load. At Amazon, we are guided by a similar principle: Avoid making the situation worse for a dependency that is under duress. We want to be a well-behaved client. Let’s look at two patterns we use to prevent applying excessive load to dependencies through the lens of my restaurant experiences.

Circuit breakers

Despite our best efforts, on rare occasions the kitchen became overwhelmed and couldn’t keep up with customer demand. In the worst cases, we stopped food service (meaning that we put a pause on accepting new orders) to allow the cooks to finish the meals that were already in their queue. After the food started coming out on time again, we started accepting new orders. While the kitchen was under duress, the wait staff tried to be well-behaved clients by not adding to the kitchen’s load. But this approach sometimes introduced the unintended impact of making a small problem bigger. Let’s say we stopped food service because the grill station was backed up. If the deep fryer station and salad station weren’t experiencing a delay, we’ve unnecessarily stopped taking orders that they could still fulfill. This would have a negative impact on tables that just wanted salads or French fries.

In distributed systems, we call this approach to preventing the sustained overload of a dependency that’s under duress the circuit breaker pattern. This pattern mimics how circuit breakers are used in electrical systems to prevent too much current from being applied to a single electrical circuit. Circuit breakers in software have three states. They start in the closed state, where requests flow normally to the dependency. If the dependency crosses a latency or failure threshold, the circuit breaker transitions to the open state, where requests are blocked. After a period of backoff, it moves to the half-open state, where a small percentage of requests are sent to the dependency to gauge its health. If responses are once again within the healthy threshold, the circuit breaker closes. If the responses are not within the healthy threshold, it transitions back to open. The following figure illustrates these three states of a software circuit breaker.
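The three states map naturally onto a small state machine, sketched below. The failure threshold, backoff, and probe percentage are illustrative assumptions, and the logic is simplified (for example, a single successful probe closes the breaker).

```python
import random
import time

class CircuitBreaker:
    """Minimal three-state circuit breaker: closed -> open -> half-open -> closed."""

    def __init__(self, failure_threshold=5, backoff_seconds=30, probe_ratio=0.05):
        self.failure_threshold = failure_threshold  # consecutive failures before opening
        self.backoff_seconds = backoff_seconds      # how long to stay open
        self.probe_ratio = probe_ratio              # fraction of requests allowed when half-open
        self.state = "closed"
        self.failures = 0
        self.opened_at = 0.0

    def allow_request(self) -> bool:
        if self.state == "open":
            if time.monotonic() - self.opened_at >= self.backoff_seconds:
                self.state = "half-open"
            else:
                return False  # fail fast without calling the dependency
        if self.state == "half-open":
            return random.random() < self.probe_ratio  # let a small sample through
        return True

    def record_success(self) -> None:
        self.failures = 0
        self.state = "closed"

    def record_failure(self) -> None:
        if self.state == "half-open":
            # A failed probe sends the breaker straight back to open.
            self.state = "open"
            self.opened_at = time.monotonic()
            return
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self.state = "open"
            self.opened_at = time.monotonic()
```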

Circuit breakers help prevent overload in two dimensions. First, they stop our services from continuing to overwhelm the dependency. Second, they help prevent our services from exhausting their own resources like threads, CPU cycles, or network ports through additional concurrency while they wait for requests to the dependency to fail or time out.

The potential downside to using circuit breakers occurs in the situation we described at the restaurant—when a small event is made larger. Circuit breakers, when open, choose to fail client requests intentionally without giving the dependency a chance to succeed. In fact, some requests to the dependency might not be affected by the overload at all, but are still rejected by the circuit breaker, which unnecessarily increases the failure rate. For example, in the following figure, the data store for records starting with A-H is overwhelmed (maybe a new product launch is driving everyone to look up the same product starting with A), tripping the circuit breaker, and stopping all requests to the service. However, requests for data starting with I-R and S-Z aren’t impacted by the load, but they are still rejected.

At Amazon, we choose where to use circuit breakers carefully, and we don’t treat dependencies uniformly. A dependency can experience failures in a single partition, fault boundary, host, or customer, among other dimensions. This means our circuit breakers are more granular than the “whole dependency” and are typically aligned to the dependency’s expected fault domains. This granularity comes at the cost of increased complexity for implementing “mini” circuit breakers, and it requires knowledge of how the dependency is architected or partitioned. Setting the appropriate threshold for the circuit breaker can also be a challenge. We want to make sure that a single client who is being rate limited or invoking a latent bug doesn’t inadvertently trip the circuit breaker for everyone. Improperly scoped and tuned circuit breakers can make the impact larger. So, when we choose to use a circuit breaker to control load, we extensively and continuously test and validate those thresholds to prevent unintentional impact.

Retries

When I was busy as a server, I could forget things, even if I wrote them down. There were lots of distractions—greeting a new table of customers, dropping off an order, or needing to finish some side work. Now, when I’m a customer in a restaurant, if my request isn’t quickly fulfilled, I typically let a little bit of time pass to see if the server remembers what I asked for and then ask again. By retrying my request, I can hopefully get the outcome I want (like an extra side of salad dressing) despite the temporary distractions for the server. I also want to bound how frequently and how many times I ask. If I flagged down my server to ask for a side of salad dressing every time they walked by, I would just further delay them from ever getting back to the kitchen. They’d also probably start to get annoyed and might begin ignoring me, effectively rate limiting my retries.

In distributed systems, retries can mask transient errors and rate limiting, which transparently improves the perceived availability of a dependency. There are several options for implementing a retry strategy. One common implementation is letting the client retry every request up to N times using exponential backoff with jitter between requests. The drawback of this approach is that it can add work to the system at a time when the system might be impaired because of overload, causing additional impact. To go back to the restaurant analogy, consider a really impatient customer who wants more salad dressing. Instead of just asking their server once, they ask every server who happens to walk by until their request is fulfilled. Without explicit coordination among the servers, the amount of work being done to bring a single side of salad dressing is multiplied by the number of servers that were asked for it. It also likely results in a lot of wasted salad dressing since every server is now bringing the customer an extra side.

This situation can get worse if the request needs to traverse multiple services in a call chain. The work done grows exponentially with the depth of the chain. The resulting number of calls to the last dependency is k^n, where k is the number of attempts each node makes (the original request plus its retries) and n is the depth of the call chain. The following figure illustrates the effects of a retry storm. Each service makes up to three requests (1 original and 2 retries). With three services in the chain, service 3 ends up receiving 3^2 = 9 total requests, and the system processes 12 requests overall.

To reduce work amplification, we frequently use an approach called adaptive retries instead. This mode of retries still uses exponential backoff with jitter, but it adds two protection mechanisms. The first is a retry quota implemented with a token bucket. Each retry takes a quantity of tokens from the bucket. Retries in response to different errors might cost different quantities of tokens. For example, a retry in response to a timeout might require more tokens than a retry due to an internal server error because the former situation might indicate that the dependency is overloaded to the point of not being able to respond. Successful responses allow the token bucket to be refilled with a fraction of the tokens required for a retry (maybe 0.1 or 0.01 tokens, meaning we earn 1 retry for every 10 or 100 successful responses, up to the bucket’s maximum size), while error responses don’t. Clients can retry when the error rate is low, but are rate limited when the error rate is higher. This limits the maximum increase in traffic while a dependency is impaired and significantly reduces the amplification.
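Here’s a sketch of the retry quota idea with illustrative token costs (the real values in the AWS SDKs differ): retries spend tokens, timeouts spend more, and successes slowly earn tokens back.

```python
class RetryQuota:
    """Retry token bucket: retries spend tokens, successes earn a small refund."""

    MAX_TOKENS = 500          # illustrative bucket size
    RETRY_COST = 5            # cost of a retry after a generic error
    TIMEOUT_RETRY_COST = 10   # timeouts cost more; they may signal overload
    SUCCESS_REFUND = 0.5      # 10 successes earn back one generic retry (5 tokens)

    def __init__(self):
        self.tokens = float(self.MAX_TOKENS)

    def can_retry(self, was_timeout: bool) -> bool:
        cost = self.TIMEOUT_RETRY_COST if was_timeout else self.RETRY_COST
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False  # quota exhausted: surface the error instead of retrying

    def record_success(self) -> None:
        # Successful responses refill the bucket slowly, so sustained errors
        # can't be amplified indefinitely.
        self.tokens = min(self.MAX_TOKENS, self.tokens + self.SUCCESS_REFUND)
```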

The second protection mechanism is client-side rate limiting, which is also implemented with a token bucket. This approach applies to both original requests and retries in order to dynamically adapt to dependency capacity, and it is focused specifically on requests that have been throttled. In this case, the token bucket has a dynamically changing fill rate and capacity. The fill rate and capacity are increased after a successful response, but when a throttling response is received, they are reduced. This helps our clients dynamically respond to changing capacity conditions and reduces work amplification on the dependency in response to throttling. Typically, we’ll apply the client-side rate-limiting logic first and then evaluate the available retry quota when a retry is needed.

It’s also important that wherever we perform rate limiting or use circuit breakers, we place them as close to the source as possible to most effectively reduce the work amplification. We’ve found that it’s not always safe to retry, and adaptive retries help us be a well-behaved client by striking a balance between improving the chances of success during transient errors and minimizing the amount of additional load applied to a dependency under duress. The AWS CLI and the AWS SDK both offer adaptive retries.
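In the AWS SDK for Python, for example, adaptive retries are enabled through the client configuration; the max_attempts value below is just an example.

```python
import boto3
from botocore.config import Config

# Enable the retry quota and client-side rate limiting ("adaptive" mode).
# max_attempts includes the original request; 3 is an example value.
adaptive_config = Config(retries={"mode": "adaptive", "max_attempts": 3})

dynamodb = boto3.client("dynamodb", config=adaptive_config)
```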

Operational visibility

When using any of the strategies summarized in this article, we have to implement good operational visibility in order to operate our resilient systems effectively. In restaurants there is typically a front-of-house (FOH) manager, who manages the servers, hosts, and bartenders, and a back-of-house (BOH) manager, who manages the kitchen staff. These managers continually inspect the performance of each part of the restaurant to understand where the backlogs and bottlenecks are so they can select the right reaction to different situations.

In the distributed systems that we build at Amazon, we need the same kind of insights. First, we want to use diverse perspectives to measure the system, just like using the FOH and BOH managers. If a restaurant were to only measure wait times from the kitchen’s perspective, it could miss the line of people standing outside. At Amazon, measuring everywhere is critical to getting a holistic view of the service and understanding where load is having impact.

Second, we want to measure utilization. Resources can run out, and it’s important to provide early warning indicators when this is about to happen. These can be things like CPU, memory, disk space, thread pools, queues, or service quotas. All of these resources can be exhausted through overutilization. Such overutilization signals excessive load, which we’d prefer to know about before the resources run out so we can use strategies to prevent customer impact.

Third, it’s critical to consider the dimensionality of our metrics so that our measurements are precise. A system-wide error rate could mask errors happening in a single API. Instead, we want per-API dimensions in our metrics so we can alarm on the error rate for each one. We also want to separately measure things that fail separately. For example, an Availability Zone represents a fault domain, so we also want to produce metrics for each Availability Zone in order to respond to impairments of resources in a single Availability Zone. In some cases, these dimensions might have a high level of cardinality, meaning there are a lot of unique keys, like hosts in a fleet or customers. This makes creating an alarm and dashboard per unique key unrealistic, so instead we commonly combine our metrics and logs using the Amazon CloudWatch embedded metric format (EMF) and query those logs using CloudWatch Contributor Insights to find the top-N contributors to some metric. From there, we can create dashboards and alarms on the contributor data. This allows us to answer questions like “which individual customers are driving excessive load to a service?” or “is one host producing a lot more latency than all the others?”
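As an example of the EMF approach, a single structured log line can carry both the metric definitions CloudWatch extracts and the high-cardinality keys Contributor Insights can query. The namespace and field names here are hypothetical.

```python
import json
import time

def emit_request_metric(customer_id: str, api: str, latency_ms: float) -> None:
    """Write one EMF-formatted log line: CloudWatch extracts the Latency metric,
    while Contributor Insights can query the high-cardinality CustomerId field."""
    print(json.dumps({
        "_aws": {
            "Timestamp": int(time.time() * 1000),
            "CloudWatchMetrics": [{
                "Namespace": "MyService",                # hypothetical namespace
                "Dimensions": [["Api"]],                 # low-cardinality dimension only
                "Metrics": [{"Name": "Latency", "Unit": "Milliseconds"}],
            }],
        },
        "Api": api,
        "CustomerId": customer_id,   # high-cardinality field: logged, not a dimension
        "Latency": latency_ms,
    }))

emit_request_metric(customer_id="123456789012", api="GetWidget", latency_ms=42.0)
```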

Finally, it’s important to build a culture around resilience and operational excellence. At Amazon, one of the mechanisms we leverage is weekly reviews of our dashboards and metrics. We want to be sure we have the right dashboards so operators can act quickly when something goes wrong. We also want to inspect the established alarm thresholds and ensure that the metrics are sufficient. Continuous inspection helps us ensure we have the right level of operational visibility to detect and respond to load events when they occur. Refer to Building dashboards for operational visibility for more details.

Many of the other Amazon Builders’ Library articles I’ve linked to go into much more detail about the visibility required to effectively use each strategy. What we’ve found over the years is that more visibility is better. By using a continuous, iterative process to instrument our services we can derive better insights that allow us to react more quickly with better accuracy and precision.

Conclusion

Our primary strategy at Amazon for managing load is proactive forecasting and scaling to stay ahead of customer demand and provide sufficient capacity to handle significant load fluctuations. In the rare circumstances that load exceeds our provisioned capacity, we’ve found that the strategies outlined in this article paired with the right operational visibility have provided us with a lot of control over how to deal with excessive load. Each strategy has targeted use cases and pros and cons, but using a combination of these approaches allows us to mitigate a broad range of overload impacts. Most of the strategies discussed in this article apply to request/response style services. But you might also want to consider whether that’s the right architecture in the first place or whether pub/sub, batch, or push-style WebSocket architectures, which are intrinsically less vulnerable to excessive load to begin with, are better suited to your needs.

At Amazon, having both well-behaved clients and well-protected services through operational and architectural strategies is critical to building resilient systems that provide a consistent, predictable customer experience. We use a handful of well-tested strategies, usually built into common libraries or services, to simplify our approach and make it easier for our engineering teams to adopt and use them. Whether it’s because the service has become an overnight success, a misbehaved client sends too many requests, or any combination of factors, these strategies help ensure our services continue to deliver a three Michelin star experience.


About the author

Mike Haken

Michael Haken is a Senior Principal Solutions Architect on the AWS Strategic Accounts team. He has over 15 years’ experience supporting financial services, public sector, and digital native customers. Michael has his B.A. from UVA and M.S. in Computer Science from Johns Hopkins. Outside of work you’ll find him playing with his family and dogs on his farm.
