Four Steps for Debugging your Content Delivery on AWS

Introduction

Werner Vogels, chief technology officer for AWS, has been quoted as saying: “Everything fails all the time.” That applies to content delivery with Amazon CloudFront and Lambda@Edge as well. In content delivery, issues can occur in different places, for example:

  • On your origin, when it returns HTTP 5xx errors
  • On CloudFront, when it cannot connect to your origin
  • On Lambda@Edge, when your code throws an unhandled exception

The best way to help prevent failures is to plan carefully and design a resilient architecture that fits your specific requirements. However, when content delivery fails despite your efforts, you need to identify the issue quickly and take action to resolve it. In this blog post, I provide four steps that guide you in debugging AWS content delivery issues, so that your application continues to be delivered to your users with minimal disruption. First, enable logging on CloudFront. Second, set up alarms using Amazon SNS and Amazon CloudWatch. Third, when an alarm goes off, use the CloudFront Monitoring dashboard to review data about errors and where they’re occurring. Finally, if the errors come from Lambda@Edge, use CloudWatch metrics and logs to debug your functions.

Step 1: Enable CloudFront logging

To get started, configure CloudFront to create log files that contain detailed information about every user request that CloudFront receives. Sign in to the AWS Management Console and open the CloudFront console, then edit the general configuration of your CloudFront distribution to enable logging.

When you enable logging, CloudFront starts delivering access logs in W3C extended log file format to the configured Amazon S3 bucket (with a delay of a few minutes).
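To get a feel for what these log files contain before querying them at scale, the following is a minimal Python sketch that reads one already-downloaded and decompressed log file. It relies on two properties of the log format: the `#Fields:` header line names the columns, and the data lines are tab-separated. The function name `parse_cloudfront_log` and the shortened sample lines are assumptions made for illustration; real log files carry many more fields per line.

```python
def parse_cloudfront_log(text):
    """Return one dict per request line of a CloudFront standard log file."""
    field_names = []
    records = []
    for line in text.splitlines():
        if line.startswith("#Fields:"):
            # The header is space-separated; the data lines are tab-separated.
            field_names = line[len("#Fields:"):].split()
        elif line and not line.startswith("#"):
            records.append(dict(zip(field_names, line.split("\t"))))
    return records


# Shortened, made-up sample (an assumption for this sketch, not real traffic).
SAMPLE_LOG = "\n".join([
    "#Version: 1.0",
    "#Fields: date time x-edge-location sc-bytes c-ip cs-method cs(Host) cs-uri-stem sc-status",
    "\t".join(["2019-05-01", "10:00:00", "CDG50", "512", "192.0.2.10", "GET",
               "d111111abcdef8.cloudfront.net", "/index.html", "200"]),
    "\t".join(["2019-05-01", "10:00:01", "CDG50", "128", "192.0.2.11", "GET",
               "d111111abcdef8.cloudfront.net", "/api/cart", "502"]),
])

records = parse_cloudfront_log(SAMPLE_LOG)
# Keep only the URIs of requests that ended with a 5xx status code.
errors = [r["cs-uri-stem"] for r in records if r["sc-status"].startswith("5")]
print(errors)  # ['/api/cart']
```

Because the parser takes the column names from the header instead of hard-coding them, the same sketch should keep working if the set of fields differs from the sample.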

Tip: To optimize the cost of storing your log files, you can leverage S3 Object Lifecycle management to delete or move your logs to a cheaper storage tier after a certain period.
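As an illustration of this tip, the following sketch builds a lifecycle configuration that moves log files to the S3 Standard-IA storage class and later deletes them. The helper name, the rule ID, the prefix, and the 30-day and 365-day values are assumptions; choose values that match your own retention requirements. The dictionary follows the shape accepted by the boto3 `put_bucket_lifecycle_configuration` call, which is shown only in a comment.

```python
def log_lifecycle_configuration(prefix, transition_days=30, expiration_days=365):
    """Build an S3 lifecycle configuration for CloudFront log files."""
    if transition_days < 30:
        # S3 requires objects to be at least 30 days old before moving to STANDARD_IA.
        raise ValueError("transition_days must be at least 30 for STANDARD_IA")
    if expiration_days <= transition_days:
        raise ValueError("expiration_days must be greater than transition_days")
    return {
        "Rules": [
            {
                "ID": "cloudfront-logs-retention",  # hypothetical rule name
                "Status": "Enabled",
                "Filter": {"Prefix": prefix},
                "Transitions": [
                    {"Days": transition_days, "StorageClass": "STANDARD_IA"}
                ],
                "Expiration": {"Days": expiration_days},
            }
        ]
    }


config = log_lifecycle_configuration("cloudfront-logs/")

# To apply it (assumes boto3 is installed and the bucket name is yours):
#   import boto3
#   boto3.client("s3").put_bucket_lifecycle_configuration(
#       Bucket="my-log-bucket", LifecycleConfiguration=config)
```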

Next, set up Amazon Athena to query CloudFront access logs. Athena is an interactive query service that makes it easy to analyze data in Amazon S3 by using standard SQL queries. Athena is serverless, so there’s no infrastructure to manage, and you pay only for the queries that you run. You can set up Athena in a few minutes by following these steps.

Tip: To reduce the cost and time it takes to query CloudFront logs, you can partition the log files by following the guidance in this blog post.

Step 2: Set up alarming

Log files give you data to work with when there’s an issue, but you also need a mechanism to notify your operations team whenever there’s a potential problem. To deliver alarm notifications, we’ll use Amazon SNS, a fully managed pub/sub messaging service. Because CloudFront publishes its CloudWatch metrics to the us-east-1 AWS Region, you need to set up the SNS topic my_application_delivery in that Region, and then subscribe your operations team to the SNS topic.

To keep it simple, I subscribed the operations team’s email address to receive notifications. You can also choose to send notifications to other destinations, such as a custom monitoring dashboard, or to commercial solutions like PagerDuty by using the SNS webhook integration.
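If you prefer to script this setup, the following is a minimal sketch that uses the SNS `create_topic` and `subscribe` calls. The helper name `create_alarm_topic` and the email address are assumptions. The client is passed in as a parameter, so the function itself has no dependency on a particular SDK setup; with boto3 you would create the client in us-east-1, as shown in the comment.

```python
def create_alarm_topic(sns, topic_name, email):
    """Create (or reuse) an SNS topic and subscribe an email address to it."""
    # create_topic returns the ARN of the existing topic if the name is already in use.
    topic_arn = sns.create_topic(Name=topic_name)["TopicArn"]
    # The recipient must confirm the subscription from the email that SNS sends.
    sns.subscribe(TopicArn=topic_arn, Protocol="email", Endpoint=email)
    return topic_arn


# Usage with boto3 (assumes boto3 is installed and credentials are configured):
#   import boto3
#   sns = boto3.client("sns", region_name="us-east-1")
#   create_alarm_topic(sns, "my_application_delivery", "ops-team@example.com")
```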

Then, you need to configure alarms using Amazon CloudWatch. Amazon CloudWatch is a monitoring and management service that provides you with data and actionable insights to monitor your applications and to understand and respond to system-wide performance changes.

CloudFront is integrated with CloudWatch and automatically publishes six operational metrics per distribution (Requests, Bytes Downloaded, Bytes Uploaded, 4xx Error Rate, 5xx Error Rate, and Total Error Rate) at a 1-minute granularity at no additional cost. If you have a Lambda@Edge function associated with your distribution, CloudFront also publishes three additional operational metrics per distribution (Lambda Execution Errors, Invalid Function Response Errors, and Lambda Throttles). You can set alarms based on these metrics in the CloudFront console or in the CloudWatch console (at standard CloudWatch rates). For example, the 5xxErrorRate metric represents the percentage of all requests for which the HTTP status code is 5xx. To set an alarm based on this metric in the CloudWatch console, do the following.

  1. In the AWS Console, open the CloudWatch console in the us-east-1 Region.
  2. On the Alarms tab, choose Create Alarm > Select metric > CloudFront > Per-Distribution Metrics > 5xxErrorRate for your CloudFront distribution ID, and then, for Period, enter 1 minute. Note that you must send some requests to your distribution before the metric appears in the list.
  3. Set an alarm to be triggered whenever 5xxErrorRate is more than 0.1% for 1 minute, or adjust the values to be appropriate for your own application and requirements.
  4. Under Actions, for Send notification to:, select the SNS topic you created earlier.
  5. Choose Create Alarm.

At first the State for the alarm will be INSUFFICIENT_DATA, but it will quickly change to OK.
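The same alarm can also be expressed as parameters for the CloudWatch `put_metric_alarm` API. The following is a sketch under stated assumptions: the helper name and the alarm name are made up, and the distribution ID and topic ARN are placeholders. The namespace `AWS/CloudFront`, the `DistributionId` dimension, and the `Region` dimension with the value `Global` are how CloudFront metrics are identified in CloudWatch.

```python
def build_5xx_alarm(distribution_id, topic_arn, threshold_percent=0.1):
    """Build put_metric_alarm parameters for a CloudFront 5xxErrorRate alarm."""
    return {
        "AlarmName": f"cloudfront-5xx-{distribution_id}",  # hypothetical naming scheme
        "Namespace": "AWS/CloudFront",
        "MetricName": "5xxErrorRate",
        "Dimensions": [
            {"Name": "DistributionId", "Value": distribution_id},
            {"Name": "Region", "Value": "Global"},
        ],
        "Statistic": "Average",
        "Period": 60,               # seconds, matching the 1-minute granularity
        "EvaluationPeriods": 1,     # alarm after a single breaching minute
        "Threshold": threshold_percent,  # the metric is already a percentage
        "ComparisonOperator": "GreaterThanThreshold",
        "AlarmActions": [topic_arn],
    }


params = build_5xx_alarm(
    "EDFDVBD6EXAMPLE",
    "arn:aws:sns:us-east-1:123456789012:my_application_delivery",
)

# To create the alarm (assumes boto3 is installed and credentials are configured):
#   import boto3
#   boto3.client("cloudwatch", region_name="us-east-1").put_metric_alarm(**params)
```

Keeping the parameters in a plain dictionary makes it easy to review them, or to reuse them for the other CloudFront metrics by changing `MetricName` and the threshold.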

Step 3: Troubleshoot delivery on CloudFront

When an alarm goes off or another issue occurs, you need to understand whether the problem is caused by CloudFront or a Lambda@Edge function so you can troubleshoot using the right tools. To help determine what is behind the error, go to the recently enhanced Monitoring dashboard on the CloudFront console where you can select and view both distribution metrics and any associated function metrics, right in the console.

For example, say you’ve gotten a notification based on the alarm that you set for 5xxErrorRate. Follow these steps to track down what’s going on.

  1. In the CloudFront console, choose Monitor.
  2. Search for and choose your CloudFront distribution, and then choose View distribution metrics.
  3. First look at the Error rate dashboard. If you see that there are 5xx errors caused by Lambda@Edge, then you can skip to the next section, Step 4: Debug Lambda@Edge functions. Otherwise, the errors are caused by CloudFront, and you can do some analysis to see what’s causing them.
  4. Use the following Athena queries with your CloudFront log files to analyze the errors. The first query counts the requests that resulted in 5xx errors, broken down by status code. Suppose the results show 500 as the dominant status code. The second query then filters on that status code to find the URIs that most often return 500 errors.

-- Query 1: count 5xx responses by status code
SELECT status, count(status) AS count FROM cloudfront_logs
WHERE status >= 500
GROUP BY status
ORDER BY count DESC;

-- Query 2: URIs that most often return 500
SELECT uri, count(uri) AS count FROM cloudfront_logs
WHERE status = 500
GROUP BY uri
ORDER BY count DESC;
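If you want to run these queries from a script instead of the Athena console, the following sketch shows one way to do it with the Athena `start_query_execution`, `get_query_execution`, and `get_query_results` calls. The helper name, the database name, and the S3 output location are assumptions. The sketch reads only the first page of results, which is enough for short summaries like the ones above.

```python
import time

STATUS_BREAKDOWN_QUERY = """
SELECT status, count(status) AS count FROM cloudfront_logs
WHERE status >= 500
GROUP BY status
ORDER BY count DESC
"""


def run_athena_query(athena, query, database, output_location, poll_seconds=1.0):
    """Run a query, wait for it to finish, and return rows as dicts of strings."""
    query_id = athena.start_query_execution(
        QueryString=query,
        QueryExecutionContext={"Database": database},
        ResultConfiguration={"OutputLocation": output_location},
    )["QueryExecutionId"]

    # Poll until Athena reports a terminal state.
    while True:
        execution = athena.get_query_execution(QueryExecutionId=query_id)
        state = execution["QueryExecution"]["Status"]["State"]
        if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
            break
        time.sleep(poll_seconds)
    if state != "SUCCEEDED":
        raise RuntimeError(f"Athena query {query_id} ended in state {state}")

    rows = athena.get_query_results(QueryExecutionId=query_id)["ResultSet"]["Rows"]
    table = [[col.get("VarCharValue") for col in row["Data"]] for row in rows]
    if not table:
        return []
    header, data = table[0], table[1:]  # for SELECT queries the first row is the header
    return [dict(zip(header, values)) for values in data]


# Usage with boto3 (database and bucket names are placeholders):
#   import boto3
#   athena = boto3.client("athena")
#   run_athena_query(athena, STATUS_BREAKDOWN_QUERY, "default",
#                    "s3://my-athena-results/cloudfront/")
```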

Based on the error code, you can identify the problem and take actions to solve it. For example, one of the following might be the problem:

  • CloudFront returns a 504 Gateway Timeout error when an origin is not reachable or is not responding. You can check the timestamps of these errors and compare them with your origin logs. If you need more help, open a ticket with AWS Support and provide the associated request IDs (x-edge-request-id) from your distribution’s log files.
  • Your origin returns 500, 501, or 503 errors, which CloudFront then caches. Filter your log files using Athena to find the URLs that are causing the server failure, and then debug your application code.
  • You have misconfigured TLS on your origin, which is causing a 502 Bad Gateway error. Verify the TLS certificate and configuration on your origin.

For detailed information about different HTTP errors and how to help prevent them, see the CloudFront documentation.

Step 4: Debug Lambda@Edge functions

If you see a spike in HTTP 5xx errors caused by Lambda@Edge in the Error rate graph, the next step is to understand what caused it. For example, the spike in errors might be caused by an exception in your code, your function might have returned an invalid response to CloudFront, or your function might have been throttled. To learn more about the types of Lambda@Edge errors, see Testing and Debugging Lambda@Edge Functions.

To understand what’s causing the error spike in a specific scenario, you can dive deeper into the graphs on the console. Start by choosing the Lambda@Edge Errors tab in the Monitoring dashboard, which shows the function errors by type.

Based on the error type, use the matching section below to debug your Lambda@Edge function. The rest of this blog post provides detailed guidance for debugging each type of error.

I – Execution Errors

Execution errors are the most common error scenario. They happen when CloudFront doesn’t get a response from Lambda@Edge because your function throws an unhandled exception or there’s another error in your code. To fix this, debug your code by analyzing your Lambda@Edge logs in CloudWatch. The logs are sent to a CloudWatch log file in the AWS Region that is closest to the end user’s location. If your function doesn’t execute correctly, Lambda@Edge attempts to convert the error object to a string with the following format:

{
  "errorMessage": "something is wrong",
  "errorType": "Error",
  "stackTrace": [
    "exports.handler (/var/task/index.js:10:17)"
  ]
}

To see the error in the log file, in the Monitoring Dashboard, select the Region and the Lambda@Edge function you’re investigating, and then choose View logs. You’ll be redirected to the CloudWatch console to see the details about the log group associated with your function.
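Once the logs show which exception was thrown, one possible mitigation is to catch it in the handler, log it, and let the request continue unmodified, so that a bug in optional logic does not surface to users as a 5xx error. The following is a hedged sketch for a viewer request or origin request trigger, written for the Python runtime (the stack trace above comes from a Node.js function, where the same idea applies with try/catch). The function `select_spa_version` and its cookie-based logic are hypothetical stand-ins for your own code.

```python
import json
import traceback


def select_spa_version(request):
    """Hypothetical business logic: route to a page version based on a cookie."""
    # Raises KeyError when the viewer sends no Cookie header.
    cookie = request["headers"]["cookie"][0]["value"]
    version = "b" if "spaVersion=b" in cookie else "a"
    request["uri"] = f"/{version}/index.html"
    return request


def handler(event, context):
    request = event["Records"][0]["cf"]["request"]
    try:
        return select_spa_version(request)
    except Exception as exc:
        # Log a structured error so it can be found in CloudWatch Logs later.
        print(json.dumps({
            "errorMessage": str(exc),
            "errorType": type(exc).__name__,
            "stackTrace": traceback.format_exc().splitlines(),
        }))
        # Returning the request unchanged lets CloudFront continue processing it.
        return request
```

Whether falling back silently is acceptable depends on what the function does. For logic such as authorization, failing closed is usually the safer choice.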

To analyze logs more quickly in CloudWatch, you can use CloudWatch Logs Insights, which enables you to interactively search and analyze your log data. You can run queries that help you quickly understand operational issues so that you can solve them faster. For example, a query in Insights can filter error messages so that you see just the messages that are relevant to a specific error.

You can also use Insights to query the log entries that your code generates using console.log(). As an example, consider a Lambda@Edge function that is used for A/B testing of a single-page application. In my code, I can log which page version was selected in each Lambda@Edge execution by doing the following:

console.log('INFO { spaVersion:', spaVersion, "}");

Now I can use the following query in Insights to filter on INFO messages, then extract a selected page version and count each occurrence:

fields @timestamp, @message
| filter @message like /INFO \{ spaVersion:/
| parse @message /spaVersion: (?<spaVersion>\S+)/
| stats count(*) by spaVersion
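Because Lambda@Edge writes logs in each Region where the function runs, you may want to run such a query from a script, once per Region. The following is a sketch that uses the CloudWatch Logs `start_query` and `get_query_results` calls with a query that extracts spaVersion from the logged message and counts each value. The helper name and the `executions` alias are assumptions. The log group name follows the `/aws/lambda/us-east-1.<function name>` pattern that Lambda@Edge uses in every Region.

```python
import time

SPA_VERSION_QUERY = r"""
fields @message
| filter @message like /INFO \{ spaVersion:/
| parse @message /spaVersion: (?<spaVersion>\S+)/
| stats count(*) as executions by spaVersion
"""


def count_spa_versions(logs, function_name, start_time, end_time, poll_seconds=1.0):
    """Return {spaVersion: executions} for one Region's Lambda@Edge log group."""
    query_id = logs.start_query(
        logGroupName=f"/aws/lambda/us-east-1.{function_name}",
        startTime=start_time,  # epoch seconds
        endTime=end_time,      # epoch seconds
        queryString=SPA_VERSION_QUERY,
    )["queryId"]

    # Poll until the query leaves the Scheduled/Running states.
    while True:
        response = logs.get_query_results(queryId=query_id)
        if response["status"] not in ("Scheduled", "Running"):
            break
        time.sleep(poll_seconds)
    if response["status"] != "Complete":
        raise RuntimeError(f"Insights query ended with status {response['status']}")

    counts = {}
    for row in response["results"]:
        # Each row is a list of {"field": ..., "value": ...} pairs.
        values = {cell["field"]: cell["value"] for cell in row}
        counts[values["spaVersion"]] = int(values["executions"])
    return counts


# Usage with boto3, one client per Region of interest (names are placeholders):
#   import boto3
#   logs = boto3.client("logs", region_name="eu-west-1")
#   count_spa_versions(logs, "my-ab-test-function", 1556700000, 1556703600)
```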

II – CloudFront Validation Errors

Sometimes a Lambda function returns an invalid response to CloudFront. For example, a validation error is returned if the object structure of the response doesn’t conform to the Lambda@Edge Event Structure, or the response contains invalid headers or other invalid fields. Note that when you test your function in the Lambda console, your function isn’t validated by CloudFront, which means that your function can run correctly in the console, but still fail when you add it to a distribution and deploy it in CloudFront.
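Since the Lambda console doesn’t perform this validation, a small local check can catch the most common structural mistakes before you deploy. The following sketch covers only a few rules that I’m assuming from the documented response structure: the status is a string, header keys are lowercase, and each header is a list of objects with a value. It is not the full set of checks that CloudFront applies, and the function name is made up for illustration.

```python
def find_response_problems(response):
    """Return a list of structural problems in a Lambda@Edge generated response."""
    if not isinstance(response, dict):
        return ["response must be an object"]

    problems = []
    if not isinstance(response.get("status"), str):
        problems.append("status must be a string such as '200'")

    for name, values in response.get("headers", {}).items():
        if name != name.lower():
            problems.append(f"header key {name!r} must be lowercase")
        is_list_of_values = isinstance(values, list) and all(
            isinstance(v, dict) and "value" in v for v in values
        )
        if not is_list_of_values:
            problems.append(f"header {name!r} must be a list of objects with a 'value'")
    return problems


good = {
    "status": "200",
    "headers": {"content-type": [{"key": "Content-Type", "value": "text/html"}]},
    "body": "<h1>Hello</h1>",
}
bad = {"status": 200, "headers": {"Content-Type": "text/html"}}

print(find_response_problems(good))  # []
for problem in find_response_problems(bad):
    print(problem)
```

A check like this fits well in unit tests for the function, next to the test events that you already use in the Lambda console.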

CloudFront sends validation logs to CloudWatch Logs in an AWS Region near the user. In the Monitoring Dashboard, select the Region where you want to see the logs, then choose View logs. You’re redirected to the CloudWatch console to view the log group associated with your function.

In the logs, you can see the cause of the CloudFront validation error.

III – Throttling

On occasion, the Lambda@Edge service throttles your function invocations on a per-Region basis if you reach the regional concurrency limit. Your concurrency is the number of simultaneous Lambda@Edge executions in a specific Region.

To better understand where and why you reached a concurrency limit, on the Monitoring Dashboard, choose the Lambda@Edge function, and then choose View function metrics.

Now check the Invocations and Duration dashboards, which display how often the function was invoked and how long it executed, both broken down by Region.
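These two dashboards matter because, as a rough guide, concurrency is the invocation rate multiplied by the average execution duration. The following sketch makes that arithmetic explicit. The function name and the traffic figures are made-up examples, and the result is an estimate of steady-state concurrency, not a measurement.

```python
def estimate_concurrency(invocations_per_second, average_duration_ms):
    """Estimate steady-state concurrency as rate multiplied by duration."""
    return invocations_per_second * average_duration_ms / 1000


# 2,000 invocations per second at 50 ms each keep about 100 executions in flight.
print(estimate_concurrency(2000, 50))   # 100.0

# If a slow dependency pushes the duration to 500 ms, the same traffic needs 1,000.
print(estimate_concurrency(2000, 500))  # 1000.0
```

This is why either a higher invocation rate or a longer duration in a Region can move you toward the concurrency limit.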

There are three scenarios in which your function might get throttled:

  • When a function is invoked more often, because there’s high demand. If this happens, you can ask AWS Support for a limit increase on concurrent executions in the Regions where your function is being throttled. You can also consider rearchitecting your application to invoke Lambda@Edge less often, for example, by using a more specific CloudFront cache behavior to configure Lambda@Edge triggers.
  • When your code takes longer to execute. To address this, consider whether your function depends on an external component that is not responding. If so, you can set timeouts in your code and fail over gracefully to defaults.
  • When your application receives a sudden traffic peak in a short period of time (seconds). In this case, Lambda@Edge needs to spin up new execution environments for your function, which adds latency to the execution duration. This added latency consumes your available concurrency more quickly. One way to reduce this overhead is to reduce the size of your function deployment package.
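To illustrate the second scenario, the following is a minimal sketch of setting a timeout on a call to an external component and falling back to defaults when the call fails or times out. The URL, the default values, and the function name are assumptions. The `opener` parameter exists only so that the network call can be replaced when testing. Note that the `timeout` argument of `urllib.request.urlopen` applies to blocking socket operations, so it bounds each wait and is not a strict limit on the total request time.

```python
import json
import urllib.request

# Hypothetical defaults used when the external component does not answer in time.
DEFAULT_CONFIG = {"spaVersion": "a"}


def fetch_config(url, timeout_seconds=0.5, opener=urllib.request.urlopen):
    """Fetch JSON configuration, falling back to defaults on errors or timeouts."""
    try:
        with opener(url, timeout=timeout_seconds) as response:
            return json.load(response)
    except (OSError, ValueError):
        # OSError covers connection failures and timeouts (URLError is a subclass);
        # ValueError covers a response body that is not valid JSON.
        return dict(DEFAULT_CONFIG)


# Usage inside a function (the URL is a placeholder):
#   config = fetch_config("https://config.example.com/ab-test.json")
```

Keeping the timeout well below the function’s own timeout leaves room to return a useful response even when the dependency is down.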

To learn more about optimizing concurrency, consider reading my previous blog post on Lambda@Edge design best practices.

Conclusion

In this blog post, I showed you how to configure alarms for delivery issues, and how to quickly troubleshoot issues using AWS tools and services like the Monitoring dashboard in the CloudFront console, Athena, and CloudWatch Logs Insights. Make sure that you adapt these suggestions for your own environment, and then include what you learned in your operational runbooks. Additionally, you can enrich your runbooks by doing the following:

  • Consuming additional CloudWatch metrics such as Lambda’s concurrency level per Region
  • Producing custom alerts based on CloudFront logs by using Kinesis Analytics
  • Automating some of the troubleshooting. For example, you can trigger a Lambda function when there’s an alarm, run predefined Athena queries, and then send a pre-diagnostic report to your operations team.
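As a starting point for the last suggestion, the following sketch shows the first part of such an automation: a function that receives the CloudWatch alarm notification through SNS, extracts the affected distribution, and selects the predefined Athena queries to run. The function name, the shape of the returned report, and the choice of queries are assumptions. The alarm message fields used here (AlarmName, NewStateValue, Trigger, Dimensions) are the ones I’d expect in a CloudWatch alarm notification, and the code accepts both lowercase and capitalized dimension keys to be safe. Running the queries and sending the report are left out.

```python
import json

# Predefined queries, taken from Step 3 of this post.
PREDEFINED_QUERIES = [
    "SELECT status, count(status) AS count FROM cloudfront_logs "
    "WHERE status >= 500 GROUP BY status ORDER BY count DESC",
    "SELECT uri, count(uri) AS count FROM cloudfront_logs "
    "WHERE status = 500 GROUP BY uri ORDER BY count DESC",
]


def build_prediagnostic(event):
    """Turn an SNS-delivered CloudWatch alarm into a pre-diagnostic work item."""
    # SNS delivers the alarm as a JSON string inside the notification record.
    message = json.loads(event["Records"][0]["Sns"]["Message"])
    trigger = message.get("Trigger", {})

    dimensions = {}
    for dim in trigger.get("Dimensions", []):
        name = dim.get("name", dim.get("Name"))
        dimensions[name] = dim.get("value", dim.get("Value"))

    return {
        "alarm": message.get("AlarmName"),
        "state": message.get("NewStateValue"),
        "metric": trigger.get("MetricName"),
        "distribution_id": dimensions.get("DistributionId"),
        # Only run the queries when the alarm is actually firing.
        "queries": PREDEFINED_QUERIES if message.get("NewStateValue") == "ALARM" else [],
    }
```

From here, the function could run each query with Athena and publish a summary to the same SNS topic that your operations team already follows.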