AWS Architecture Blog
re:viewing AWS re:Invent
We previously wrote about how Route 53 organizes software deployments. We thought it’d be worthwhile to highlight an architecture track talk at AWS re:Invent 2014. In this talk, Chris Munns, AWS Solutions Architect discusses various deployment techniques and some of the new AWS services customers can leverage to improve their deployment strategies. A useful reference […]
Exponential Backoff And Jitter
Update (May 2023): After 8 years, this solution continues to serve as a pillar for how Amazon builds remote client libraries for resilient systems. Most AWS SDKs now support exponential backoff and jitter as part of their retry behavior when using standard or adaptive modes. Consequently, this pattern can be leveraged without having to incorporate […]
Internet Routing and Traffic Engineering
Internet Routing Internet routing today is handled through the use of a routing protocol known as BGP (Border Gateway Protocol). Individual networks on the Internet are represented as an autonomous system (AS). An autonomous system has a globally unique autonomous system number (ASN) which is allocated by a Regional Internet Registry (RIR), who also handle […]
Selecting Service Endpoints for Reliability and Performance
Choose Your Route Wisely Much like a roadway, the Internet is subject to congestion and blockage that cause slowdowns and at worst prevent packets from arriving at their destination. Like too many cars jamming themselves onto a highway, too much data over a route on the Internet results in slowdowns. Transatlantic cable breaks have much […]
Running Multiple HTTP Endpoints as a Highly Available Health Proxy
Route 53 Health Checks provide the ability to verify that endpoints are reachable and that HTTP and HTTPS endpoints successfully respond. However, there are many situations where DNS failover would be useful, but TCP, HTTP, and HTTPS health checks alone can’t sufficiently determine the health of the endpoint. In these cases, it’s possible for an […]
Doing Constant Work to Avoid Failures
Amazon Route 53’s DNS Failover feature allows fast, automatic rerouting at the DNS layer based on the health of some endpoints. Endpoints are actively monitored from multiple locations and both application or connectivity issues can trigger failover. Trust No One One of the goals in designing the DNS Failover feature was making it resilient to […]
A Case Study in Global Fault Isolation
In a previous blog post, we talked about using shuffle sharding to get magical fault isolation. Today, we’ll examine a specific use case that Route 53 employs and one of the interesting tradeoffs we decided to make as part of our sharding. Then, we’ll discuss how you can employ some of these concepts in your […]
Organizing Software Deployments to Match Failure Conditions
Deploying new software into production will always carry some amount of risk, and failed deployments (e.g., software bugs, misconfigurations, etc.) will occasionally occur. As a service owner, the goal is to try and reduce the number of these incidents and to limit customer impact when they do occur. One method to reduce potential impact is […]
Shuffle Sharding: Massive and Magical Fault Isolation
A standard deck of cards has 52 different playing cards and 2 jokers. If we shuffle a deck thoroughly and deal out a four card hand, there are over 300,000 different hands. Another way to say the same thing is that if we put the cards back, shuffle and deal again, the odds are worse […]
AWS and Compartmentalization
Practically every experienced driver has suffered a flat tire. It’s a real nuisance, you pull over, empty the trunk to get out your spare wheel, jack up the car and replace the puncture before driving yourself to a nearby repair shop. For a car that’s ok, we can tolerate the occasional nuisance, and as drivers […]








