AWS Architecture Blog

re:Invent 2019: Introducing the Amazon Builders’ Library (Part I)

This week I’m telling you about a new site we launched at re:Invent, the Amazon Builders’ Library, a collection of living articles covering topics across architecture, software delivery, and operations. You get to peek under the hood of how Amazon architects, releases, and operates the software underpinning Amazon.com and AWS.

Want to know how Amazon.com does what it does? This is for you. In this two-part series, I’ll highlight some of the best architecture articles written by Amazon’s senior technical leaders and engineers.

Avoiding Insurmountable Queue Backlogs

Avoiding insurmountable queue backlogs

In queueing theory, the behavior of queues when they are short is relatively uninteresting. After all, when a queue is short, everyone is happy. It’s only when the queue is backlogged, when the line to an event goes out the door and around the corner, that people start thinking about throughput and prioritization.

In this article, I discuss strategies we use at Amazon to deal with queue backlog scenarios – design approaches we take to drain queues quickly and to prioritize workloads. Most importantly, I describe how to prevent queue backlogs from building up in the first place. In the first half, I describe scenarios that lead to backlogs, and in the second half, I describe many approaches used at Amazon to avoid backlogs or deal with them gracefully.

Read the full article by David Yanacek, Principal Engineer

Timeouts, Retries, and Backoff With Jitter

Timeouts, retries and backoff with jitter

Whenever one service or system calls another, failures can happen. These failures can come from a variety of factors. They include servers, networks, load balancers, software, operating systems, or even mistakes from system operators. We design our systems to reduce the probability of failure, but impossible to build systems that never fail. So in Amazon, we design our systems to tolerate and reduce the probability of failure, and avoid magnifying a small percentage of failures into a complete outage. To build resilient systems, we employ three essential tools: timeouts, retries, and backoff.

Read the full article by Marc Brooker, Senior Principal Engineer

Challenges With Distributed Systems

Challenges with distributed systems

The moment we added our second server, distributed systems became the way of life at Amazon. When I started at Amazon in 1999, we had so few servers that we could give some of them recognizable names like “fishy” or “online-01”. However, even in 1999, distributed computing was not easy. Then as now, challenges with distributed systems involved latency, scaling, understanding networking APIs, marshalling and unmarshalling data, and the complexity of algorithms such as Paxos. As the systems quickly grew larger and more distributed, what had been theoretical edge cases turned into regular occurrences.

Developing distributed utility computing services, such as reliable long-distance telephone networks, or Amazon Web Services (AWS) services, is hard. Distributed computing is also weirder and less intuitive than other forms of computing because of two interrelated problems. Independent failures and nondeterminism cause the most impactful issues in distributed systems. In addition to the typical computing failures most engineers are used to, failures in distributed systems can occur in many other ways. What’s worse, it’s impossible always to know whether something failed.

Read the full article by Jacob Gabrielson, Senior Principal Engineer

Static Stability Using Availability Zones

Static stability using availability zones

At Amazon, the services we build must meet extremely high availability targets. This means that we need to think carefully about the dependencies that our systems take. We design our systems to stay resilient even when those dependencies are impaired. In this article, we’ll define a pattern that we use called static stability to achieve this level of resilience. We’ll show you how we apply this concept to Availability Zones, a key infrastructure building block in AWS and therefore a bedrock dependency on which all of our services are built.

Read the full article by Becky Weiss, Senior Principal Engineer, and Mike Furr, Principal Engineer

Check back in a week to read about some other architecture-based expert articles that let you in on how Amazon does what it does.