re:Invent 2019: Introducing the Amazon Builders’ Library (Part II)
In last week’s post, I told you about a new site we introduced at re:Invent at the beginning of this month, the Amazon Builders’ Library, a site that’s chock-full articles by senior technical leaders that help you understand the underpinnings of both Amazon.com and AWS.
Below are four more architecture-based articles that describe how Amazon develops, architects, releases, and operates technology.
Caching Challenges and Strategies
Over years of building services at Amazon we’ve experienced various versions of the following scenario: We build a new service, and this service needs to make some network calls to fulfill its requests. Perhaps these calls are to a relational database, or an AWS service like Amazon DynamoDB, or to another internal service. In simple tests or at low request rates the service works great, but we notice a problem on the horizon. The problem might be that calls to this other service are slow or that the database is expensive to scale out as call volume increases. We also notice that many requests are using the same downstream resource or the same query results, so we think that caching this data could be the answer to our problems. We add a cache and our service appears much improved. We observe that request latency is down, costs are reduced, and small downstream availability drops are smoothed over. After a while, no one can remember life before the cache. Dependencies reduce their fleet sizes accordingly, and the database is scaled down. Just when everything appears to be going well, the service could be poised for disaster. There could be changes in traffic patterns, failure of the cache fleet, or other unexpected circumstances that could lead to a cold or otherwise unavailable cache. This in turn could cause a surge of traffic to downstream services that can lead to outages both in our dependencies and in our service.
Avoiding Fallback in Distributed Systems
Critical failures prevent a service from producing useful results. For example, in an ecommerce website, if a database query for product information fails, the website cannot display the product page successfully. Amazon services must handle the majority of critical failures in order to be reliable.
This article covers fallback strategies and why we almost never use them at Amazon. You might find this surprising. After all, engineers often use the real world as a starting point for their designs. And in the real world, fallback strategies must be planned in advance and used when necessary. Let’s say an airport’s display boards go out. A contingency plan (such as humans writing flight information on whiteboards) must be in place to handle this situation, because passengers still need to find their gates. But consider how awful the contingency plan is: the difficulty of reading the whiteboards, the difficulty of keeping them up-to-date, and the risk that humans will add incorrect information. The whiteboard fallback strategy is necessary but it’s riddled with problems.
Leader Elections in Distributed Systems
Leader election is the simple idea of giving one thing (a process, host, thread, object, or human) in a distributed system some special powers. Those special powers could include the ability to assign work, the ability to modify a piece of data, or even the responsibility of handling all requests in the system.
Leader election is a powerful tool for improving efficiency, reducing coordination, simplifying architectures, and reducing operations. On the other hand, leader election can introduce new failure modes and scaling bottlenecks. In addition, leader election may make it more difficult for you to evaluate the correctness of a system.
Because of these complications, we carefully consider other options before implementing leader election. For data processing and workflows, workflow services like AWS Step Functions can achieve many of the same benefits as leader election and avoid many of its risks. For other systems, we often implement idempotent APIs, optimistic locking, and other patterns that make a single leader unnecessary.
In this article, I discuss some of the pros and cons of leader election in general and how Amazon approaches leader election in our distributed systems, including insights into leader failure.
Workload Isolation Using Shuffle-Sharding
Not long after AWS began offering services, AWS customers made clear that they wanted to be able to use our Amazon Simple Storage Service (S3), Amazon CloudFront, and Elastic Load Balancing services at the “root” of their domain, that is, for names like “amazon.com” and not just for names like “www.amazon.com”.
That may seem very simple. However, due to a design decision in the DNS protocol, made back in the 1980s, it’s harder than it seems. DNS has a feature called CNAME that allows the owner of a domain to offload a part of their domain to another provider to host, but it doesn’t work at the root or top level of a domain. To serve our customers’ needs, we’d have to actually host our customers’ domains. When we host a customer’s domain, we can return whatever the current set of IP addresses are for Amazon S3, Amazon CloudFront, or Elastic Load Balancing. These services are constantly expanding and adding IP addresses, so it’s not something that customers could easily hard-code in their domain configurations either.
It’s no small task to host DNS. If DNS is having problems, an entire business can be offline. However, after we identified the need, we set out to solve it in the way that’s typical at Amazon—urgently. We carved out a small team of engineers, and we got to work
Next week we’ll wrap up 2019 with a top ten list of the most-visited Architecture blog posts of 2019.