Three advanced design patterns for highly available applications using Amazon CloudFront

Any web application using Amazon CloudFront benefits from the inherent high availability of this AWS service. It’s a globally distributed network that is designed to withstand local hardware failures and network congestion. Furthermore, it’s built on top of the AWS global network, which provides better isolation from the public internet. Finally, it’s designed with various advanced engineering techniques to honor one of its top tenets: Service Availability. To learn more about these techniques, watch the following talks from past re:Invents: Design Patterns for High Availability: Lessons from Amazon CloudFront and Maintaining security and availability on the unpredictable internet.

As an application owner, you can leverage CloudFront features to protect your application’s availability from impairments impacting your origin. In this previous post, you can learn about these CloudFront features: origin failover, custom error responses, and request collapsing. In this advanced technical post, we’ll explore this topic further with three design patterns that you can use for building highly available web applications. You’ll learn about hybrid origin failover, graceful failure, and static stability.

Hybrid origin failover

Companies that architect their applications for high availability introduce redundancy in their origin infrastructure. For example, they deploy redundant origins that can be hosted in two different AWS Regions. This strategy can be implemented with CloudFront in two different approaches.

The first approach is using Amazon Route 53 Failover routing policy with health checks on the origin domain name that’s configured as the origin in CloudFront. When the primary origin becomes unhealthy, Route 53 detects it, and then starts resolving the origin domain name with the IP address of the secondary origin. CloudFront honors the origin DNS TTL, which means that traffic starts flowing to the secondary origin once the cached DNS record expires. The fastest configuration (Fast Check activated, a failover threshold of 1, and a 60 second DNS TTL) means that the failover takes 70 seconds at minimum to occur. When it does, all of the traffic is switched to the secondary origin, since it’s a stateful failover. Note that this design can be further extended with Route 53 Application Recovery Control for more sophisticated application failover across multiple AWS Regions, Availability Zones, and on-premises.

Figure 1: Using Amazon Route 53 Failover routing policy with Health Checks on the origin domain name
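The 70-second figure can be reproduced with simple arithmetic. The following sketch assumes that Fast Check corresponds to a 10-second health check interval (30 seconds otherwise) and ignores the time that health checkers need to agree on a status change. The `estimateFailoverSeconds` helper is hypothetical and not part of any AWS SDK.

```javascript
// Rough estimate of Route 53 DNS failover time, in seconds.
// Assumption: detection takes (check interval x failure threshold), after which
// resolvers may keep serving the cached record for up to one DNS TTL.
function estimateFailoverSeconds({ fastCheck, failureThreshold, dnsTtlSeconds }) {
    const checkIntervalSeconds = fastCheck ? 10 : 30;
    const detectionSeconds = checkIntervalSeconds * failureThreshold;
    return detectionSeconds + dnsTtlSeconds;
}

// Configuration described above: Fast Check, threshold of 1, 60 second TTL.
console.log(estimateFailoverSeconds({ fastCheck: true, failureThreshold: 1, dnsTtlSeconds: 60 })); // 70

// A more conservative configuration, for comparison.
console.log(estimateFailoverSeconds({ fastCheck: false, failureThreshold: 3, dnsTtlSeconds: 60 })); // 150
```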

The second approach is using origin failover, a native feature of CloudFront. With this capability, CloudFront tries the primary origin for every request, and if it receives one of the configured 4xx or 5xx errors, it retries the request with the secondary origin. This approach is simple to configure and provides immediate failover. However, it’s stateless, which means every request must fail independently, thus introducing latency to failed requests. For transient origin issues, this additional latency is an acceptable tradeoff for the speed of failover, but it’s not ideal when the origin is completely out of service. Finally, this approach only works for the GET/HEAD/OPTIONS HTTP methods, because other HTTP methods are not allowed on a CloudFront cache behavior with Origin Failover enabled.

Figure 2: Using CloudFront Origin Failover
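As a rough illustration, an origin failover group appears in the distribution configuration as an `OriginGroups` entry. The following sketch builds that fragment as a plain object. The origin IDs and the `buildOriginGroups` helper are hypothetical, and the status codes are only an example of what you might choose to fail over on.

```javascript
// Builds the OriginGroups portion of a CloudFront distribution configuration.
// The origin IDs are hypothetical and must match entries in the Origins list.
function buildOriginGroups({ groupId, primaryOriginId, secondaryOriginId, failoverStatusCodes }) {
    return {
        Quantity: 1,
        Items: [{
            Id: groupId,
            // CloudFront retries with the secondary origin on these status codes.
            FailoverCriteria: {
                StatusCodes: { Quantity: failoverStatusCodes.length, Items: failoverStatusCodes },
            },
            // Order matters: the first member is treated as the primary origin.
            Members: {
                Quantity: 2,
                Items: [{ OriginId: primaryOriginId }, { OriginId: secondaryOriginId }],
            },
        }],
    };
}

const originGroups = buildOriginGroups({
    groupId: 'failover-group',
    primaryOriginId: 'primary-origin',
    secondaryOriginId: 'secondary-origin',
    failoverStatusCodes: [500, 502, 503, 504],
});
console.log(JSON.stringify(originGroups, null, 2));
```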

The hybrid origin failover pattern combines both approaches to get the best of both worlds. First, you configure both of your origins with a Failover Policy in Route 53 behind a single origin domain name. Then, you configure an origin failover group with the single origin domain name as primary origin, and the secondary origin domain name as secondary origin. This means that when the primary origin becomes unavailable, requests are immediately retried with the secondary origin until the stateful failover of Route 53 kicks in within tens of seconds, after which requests go directly to the secondary origin without any latency penalty. Note that this pattern only works with the GET/HEAD/OPTIONS HTTP methods.

Figure 3: Hybrid origin failover pattern combines Route 53 failover policy and CloudFront origin failover group
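To make the timeline concrete, here is a toy model of which origin serves a request at a given time after the primary origin fails. It is a simplification under stated assumptions: the 70-second DNS failover time is taken from the configuration discussed earlier, and `serveRequest` is a hypothetical function that only illustrates the reasoning, not CloudFront’s actual behavior.

```javascript
// Toy model of the hybrid pattern: which origin ends up serving a request that
// arrives `secondsSinceOutage` seconds after the primary origin fails.
// Assumption: the Route 53 failover completes after `dnsFailoverSeconds`.
function serveRequest(secondsSinceOutage, dnsFailoverSeconds = 70) {
    if (secondsSinceOutage < dnsFailoverSeconds) {
        // DNS still points at the failed primary: CloudFront receives an error
        // and retries with the second member of the origin group.
        return { attempts: ['primary (failed)', 'secondary'], retried: true };
    }
    // Route 53 now resolves the origin domain name to the secondary origin,
    // so the first attempt already succeeds and there is no retry penalty.
    return { attempts: ['secondary via DNS failover'], retried: false };
}

console.log(serveRequest(5));   // two attempts, retried: true
console.log(serveRequest(120)); // one attempt, retried: false
```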

Graceful failure

There can be use cases where companies don’t want to replicate their origin infrastructure, but they would like to limit the consequences of a failure. This strategy is called Graceful Failure. Natively, CloudFront lets you configure your CloudFront distribution to respond with a static error page whenever the origin responds with 4xx or 5xx error codes. This feature is called custom error responses, and it can be configured as shown in the following screenshot to return the content of the /error path with 200 OK response code, when the origin returns a 503 error.

Figure 4: Configure Custom Error Responses
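For readers who configure distributions programmatically, the same setting can be sketched as the `CustomErrorResponses` fragment of the distribution configuration. The `buildCustomErrorResponses` helper is hypothetical, and the 10-second error caching TTL is an illustrative value rather than a recommendation from this post.

```javascript
// Builds the CustomErrorResponses portion of a CloudFront distribution configuration.
function buildCustomErrorResponses(rules) {
    return {
        Quantity: rules.length,
        Items: rules.map(({ originStatus, pagePath, viewerStatus, errorCachingMinTtl }) => ({
            ErrorCode: originStatus,            // status code returned by the origin
            ResponsePagePath: pagePath,         // path that CloudFront fetches instead
            ResponseCode: String(viewerStatus), // status code returned to the viewer, as a string
            ErrorCachingMinTTL: errorCachingMinTtl,
        })),
    };
}

// Return the content of /error with 200 OK when the origin returns a 503 error.
const customErrorResponses = buildCustomErrorResponses([
    { originStatus: 503, pagePath: '/error', viewerStatus: 200, errorCachingMinTtl: 10 },
]);
console.log(JSON.stringify(customErrorResponses, null, 2));
```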

Typically, you would create a dedicated cache behavior in CloudFront for this custom error response that serves the graceful content from an Amazon Simple Storage Service (Amazon S3) bucket instead of serving it from the same impacted origin. It’s recommended to restrict access to the Amazon S3 bucket by using an origin access identity.

Figure 5: Create a dedicated cache behavior

However, with this approach CloudFront responds with the same error response regardless of which resource the user initially requested. To illustrate, consider a common architecture such as image manipulation: original images are stored in an Amazon S3 bucket, and the frontend calls an image API that fetches the original image from Amazon S3 and optimizes it for the user’s screen. If the processing layer fails, then CloudFront serves the same error image for all of the frontend image requests. You can improve on this by using CloudFront’s Origin Failover to fail over to Amazon S3 directly. This improves the user experience by turning a missing image into a non-optimized one. However, for this to work, the requested image path must exactly match the image key in Amazon S3, which isn’t always the case. This is where you can combine Lambda@Edge with CloudFront’s Custom Error Response to implement an enhanced graceful failure.

Figure 6: Combine Lambda@Edge with Custom Error Response to implement an enhanced graceful failure

To implement this, configure a Lambda@Edge function on the origin request event, and attach it to the dedicated cache behavior for error handling (/error). Then, you attach a Cache Policy to this cache behavior that includes the following HTTP custom headers in the cache key: CloudFront-Error-Uri (path of the initial request URL) and CloudFront-Error-Args (query arguments of the initial request URL).

Figure 7: Setting up cache key
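Including these two headers in the cache key has two effects: CloudFront forwards them to the origin request event, and each initial URL gets its own cached graceful response. A minimal sketch of the corresponding cache policy configuration follows. The policy name, the TTL values, and the `buildErrorCachePolicyConfig` helper are assumptions made for illustration.

```javascript
// Builds a cache policy configuration that adds the two error headers to the cache key.
// The TTL values are illustrative and should be tuned for your application.
function buildErrorCachePolicyConfig(name) {
    const headers = ['CloudFront-Error-Uri', 'CloudFront-Error-Args'];
    return {
        Name: name,
        MinTTL: 0,
        DefaultTTL: 60,
        MaxTTL: 300,
        ParametersInCacheKeyAndForwardedToOrigin: {
            EnableAcceptEncodingGzip: true,
            EnableAcceptEncodingBrotli: true,
            HeadersConfig: {
                HeaderBehavior: 'whitelist',
                Headers: { Quantity: headers.length, Items: headers },
            },
            CookiesConfig: { CookieBehavior: 'none' },
            QueryStringsConfig: { QueryStringBehavior: 'none' },
        },
    };
}

console.log(JSON.stringify(buildErrorCachePolicyConfig('graceful-failure-policy'), null, 2));
```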

This setup means that every time the origin fails, CloudFront fetches the error response, which triggers Lambda@Edge. Then, Lambda@Edge rewrites the request to the Amazon S3 bucket with the required format based on the initial request URL. The following is a Lambda@Edge code example:

exports.handler = async (event) => {
    const request = event.Records[0].cf.request;

    // In this example, we ignore the query strings sent in the 'cloudfront-error-args' header.
    // We only use the 'cloudfront-error-uri' header to get the value of the initial object path,
    // then rewrite it to add a static prefix to match the directory structure on the origin.
    if (request.headers['cloudfront-error-uri']) {
        request.uri = '/some-special-origin-directory' + request.headers['cloudfront-error-uri'][0].value;
    }
    
    return request;
};

Static stability

Most AWS services include a data plane, which provides the service’s core functionality, and a control plane, which enables you to create, update, and delete resources. CloudFront’s control plane is operated in the US East (N. Virginia) Region, and its data plane is globally distributed. In general, control planes are engineered to prioritize consistency and durability, whereas data planes are engineered for high availability. One way of achieving this high availability in data planes is to design them to be statically stable, which means that they keep working even when the control plane becomes impaired.

You can build more resilient applications by exclusively relying on the CloudFront data plane for critical operations such as traffic routing. Let’s suppose that you are using CloudFront, with Lambda@Edge on the Origin Request event to dynamically route traffic to different origins. This is a common pattern used by customers like OutSystems and TrueCar for various use cases, such as Blue-Green deployments, micro-services routing, routing users to the correct shard on a multi-tenant platform, and implementing the strangler pattern when modernizing applications.

One way of implementing the routing in Lambda@Edge is to store the routing logic in the function code itself. For example, you could decide to send traffic of Application A to origin 1 and Application B to origin 2. Every time that you must change this logic, for example during an outage where you must manually reroute traffic, you must update the function code.

Figure 8: Store the routing logic in the function code to implement the routing in Lambda@Edge

However, updating function code is an API operation that depends on the CloudFront control plane. In a statically stable design, you’d let the Lambda@Edge function fetch this information from an external location, such as Amazon DynamoDB, to make the routing decision. In this architecture, the routing update operation stays within the CloudFront and DynamoDB data planes, and it benefits from a higher availability. Note that this design introduces additional costs and latency (DynamoDB, longer execution duration for Lambda@Edge), which can be optimized with techniques explained in this post.

Figure 9: Use Lambda@Edge to fetch information from DynamoDB to make routing decisions
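The following sketch shows one way to structure such a lookup. To keep it self-contained, the data source is injected as a `lookupRoute` function; in a real function it would wrap a DynamoDB read, which isn’t shown here. The short in-memory cache and the fallback to the last known route are assumptions added to limit latency and cost. They aren’t prescribed by this post.

```javascript
// Creates a resolver that looks up the origin for an application key.
// `lookupRoute` is a hypothetical async function standing in for a DynamoDB read.
function createRouter(lookupRoute, cacheTtlMs = 5000, now = Date.now) {
    const cache = new Map(); // appKey -> { domainName, expiresAt }

    return async function resolveOrigin(appKey) {
        const cached = cache.get(appKey);
        if (cached && cached.expiresAt > now()) {
            return cached.domainName; // fresh entry: skip the external lookup
        }
        try {
            const domainName = await lookupRoute(appKey);
            cache.set(appKey, { domainName, expiresAt: now() + cacheTtlMs });
            return domainName;
        } catch (err) {
            // If the lookup fails, keep using the last known route when there is one.
            if (cached) return cached.domainName;
            throw err;
        }
    };
}

// Usage with an in-memory stand-in for the routing table.
const table = { 'app-a': 'origin1.example.com' };
const resolveOrigin = createRouter(async (key) => table[key]);
resolveOrigin('app-a').then((domainName) => console.log(domainName)); // origin1.example.com
```

With this structure, rerouting traffic becomes a write to the routing table rather than a function deployment.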

Conclusion

“Everything fails all of the time” is a famous quote by Werner Vogels, the CTO of Amazon.com. To build reliable web applications, we recommend that you use CloudFront and benefit immediately from a higher availability. In addition, you can implement some of the proposed advanced patterns in this post to further increase the resiliency of your web applications.

To get started with CloudFront, visit creating a CloudFront distribution. To learn more about best practices of building applications with CloudFront, visit AWS content delivery blogs.

Achraf Souk

Achraf Souk is a Specialist Solutions Architect based in Paris. His main focus is helping companies deliver their online content in a secure, reliable, and fast way using AWS Edge Services. He gets very excited about customers innovating with Lambda@Edge and other AWS services. Outside of work, Achraf is a bookworm and a passionate clarinetist.

Ben Lee

Ben Lee is a Senior Product Manager on the Amazon CloudFront team focusing on caching, edge delivery, and security. He is based in Seattle and is passionate about building scalable edge products and helping customers build resilient edge solutions to deliver secure, reliable, and fast content to clients.