Enabling granular operational visibility for CloudFront with CloudWatch
Amazon CloudFront is a content delivery network (CDN) that delivers static and dynamic web content using a global network of edge locations. CloudFront integrates natively with Amazon CloudWatch to provide monitoring and observability capabilities. With the introduction of CloudFront real-time logs, it is now possible to create highly granular custom metrics in CloudWatch to view and create alarms from this data. These custom metrics in CloudWatch can provide low-level insight into how well a CloudFront distribution is performing. These metrics can also help you monitor the operational health of specific points of presence by grouping HTTP status codes, request rates, browser used, and more by edge location.
Enabling real-time logs with CloudFront delivers detailed information about viewer requests received by CloudFront to a Kinesis data stream. Using a Kinesis data stream consumer, the real-time logs can be parsed and sorted before being uploaded to CloudWatch as custom metrics. These custom metrics provide a level of visibility that can identify common problems affecting CloudFront distributions, such as localized outages, ISP failures, and DNS routing issues. Such issues, particularly if they only affect a single CloudFront edge location, can be nearly impossible to detect with aggregate metrics for a CloudFront distribution, considering that the CloudFront global edge network consists of more than 400 edge locations. An additional benefit of this data is that it can provide information about geographic usage patterns for a CloudFront distribution, such as the request rate for a particular edge location in a certain region being considerably higher than others.
In this post, we will discuss how to generate CloudWatch custom metrics to monitor HTTP status codes. We will produce custom metrics for 2xx, 4xx, and 5xx HTTP status codes showing the percentage of requests returning one of these status codes for each CloudFront edge location that sends traffic into a CloudFront distribution. These custom metrics will also include a count of the requests by HTTP status code and show the total requests by edge location in the CloudFront global edge network. This architecture will illustrate how CloudFront real-time logs can be combined with a Kinesis data stream and AWS Lambda to create a cohesive system to generate granular CloudWatch custom metrics. We at AWS Professional Services have implemented this architecture to solve an operational problem for a large customer. Depending on how quickly the metrics need to be surfaced, and how much custom processing is required, this architecture can be easily adapted to use other AWS services and deliver more CloudWatch metrics. We will provide Lambda code and an AWS Serverless Application Model (SAM) template as part of the walkthrough.
How it works
To create the CloudWatch custom metrics from CloudFront real-time logs, you must enable CloudFront real-time logs and configure them to use a Kinesis data stream. Each web request to the CloudFront distribution configured for real-time logging is delivered to the specified Kinesis data stream as a record. Records from the Kinesis data stream are processed by a Lambda function configured as a Kinesis consumer, which extracts the relevant information from the real-time logs. The Lambda consumer aggregates this data by minute and then uses the PutMetricData request to add the data to our custom metrics in CloudWatch at one-minute standard resolution.
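The parsing and per-minute aggregation described above can be sketched as follows. This is a minimal illustration, not the code from the sample repository. It assumes that each Kinesis record carries one tab-delimited log line containing the five fields selected later in this walkthrough, in the order shown, and that the timestamp field is in epoch seconds. The constant and function names are hypothetical.

```python
import base64
from collections import Counter

# Assumed field order: the five fields selected in the real-time log configuration.
FIELDS = ("timestamp", "c_ip", "sc_status", "cs_host", "x_edge_location")


def parse_record(kinesis_record):
    """Decode one record of a Lambda Kinesis event into a dict of log fields."""
    raw = base64.b64decode(kinesis_record["kinesis"]["data"]).decode("utf-8")
    values = raw.rstrip("\n").split("\t")
    return dict(zip(FIELDS, values))


def status_class(sc_status):
    """Map '200' to '2xx', '404' to '4xx', '503' to '5xx'; anything else to None."""
    bucket = sc_status[:1] + "xx"
    return bucket if bucket in ("2xx", "4xx", "5xx") else None


def aggregate(event):
    """Count requests per (minute, host, edge location, metric name)."""
    counts = Counter()
    for record in event["Records"]:
        log = parse_record(record)
        # Truncate the epoch timestamp to the start of its minute.
        minute = int(float(log["timestamp"])) // 60 * 60
        key = (minute, log["cs_host"], log["x_edge_location"])
        counts[key + ("RequestCount",)] += 1
        bucket = status_class(log["sc_status"])
        if bucket:
            counts[key + (bucket,)] += 1
    return counts
```

Keying the counter by minute and edge location means that one invocation produces at most a handful of data points per edge location, regardless of how many log records it received.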
To ensure that CloudFront real-time logging data is uploaded to CloudWatch, the Lambda consumer is designed to prevent excessive PutMetricData calls to the CloudWatch API. The PutMetricData request is limited to 150 transactions per second (TPS), as is shown in the CloudWatch quotas table here. This limit is one reason why we upload metrics at one-minute standard resolution: aggregating data from many Kinesis records into fewer requests reduces the number of PutMetricData requests. The number of PutMetricData requests is reduced further by setting a batch window of one minute for the Kinesis consumer. A batch window permits larger batches of records from the Kinesis stream to be processed with a single Lambda consumer invocation, because Lambda gathers records from a Kinesis shard for up to the specified period of time before invoking the consumer. These larger batches allow more metrics to be aggregated into a single PutMetricData request.
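To illustrate how aggregated counts can be packed into as few PutMetricData requests as possible, here is a hedged sketch. It assumes counts keyed by (minute, host, edge location, metric name) and takes the CloudWatch client and namespace as arguments, so it runs without AWS credentials when given a stand-in client. The dimension names and the max_per_call default are assumptions; check the current PutMetricData quota for your account before relying on them.

```python
from datetime import datetime, timezone


def to_metric_data(counts):
    """Convert {(minute, host, edge, metric): n} into PutMetricData entries."""
    return [
        {
            "MetricName": metric,
            "Dimensions": [
                # Hypothetical dimension names, chosen for illustration.
                {"Name": "Distribution", "Value": host},
                {"Name": "EdgeLocation", "Value": edge},
            ],
            "Timestamp": datetime.fromtimestamp(minute, tz=timezone.utc),
            "Value": float(n),
            "Unit": "Count",
        }
        for (minute, host, edge, metric), n in sorted(counts.items())
    ]


def chunked(items, size):
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def publish(cloudwatch, namespace, counts, max_per_call=1000):
    """Send all data points using as few put_metric_data calls as possible.

    Returns the number of API calls made, which is the figure that counts
    against the PutMetricData TPS limit.
    """
    calls = 0
    for batch in chunked(to_metric_data(counts), max_per_call):
        cloudwatch.put_metric_data(Namespace=namespace, MetricData=batch)
        calls += 1
    return calls
```

In a Lambda consumer, the client would typically be created once outside the handler with boto3.client("cloudwatch") and passed to publish on each invocation.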
There are some limits to this solution. Because CloudFront real-time logging creates a single Kinesis record for each request to a distribution, an increase in the CloudFront request rate increases the number of Kinesis records created. This documentation shows that a Kinesis shard is capable of accepting 1,000 records per second. This solution defaults to creating 5 Kinesis shards and is capable of processing records for CloudFront distributions receiving 5,000 requests per second or less. If no other parameters of the solution are altered, the Kinesis shards can be increased to 20 to support 20,000 requests per second or less. Scaling the solution beyond 20 Kinesis shards requires the consideration of other factors to avoid experiencing errors. To determine the appropriate number of Kinesis shards, we suggest reviewing the default CloudWatch metric for requests associated with the CloudFront distribution. While this metric is not provided at single-second resolution, it will enable an informed decision on the number of Kinesis shards required for a distribution’s real-time logging configuration.
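The shard arithmetic above can be written as a small helper. This sketch assumes the input is the peak one-minute sum of the distribution's request metric. Because a one-minute sum hides bursts within the minute, it also accepts a headroom multiplier, which is our own illustrative addition rather than part of the solution.

```python
import math

RECORDS_PER_SHARD_PER_SECOND = 1000  # per-shard write limit cited above


def shards_needed(peak_requests_per_minute, headroom=1.0):
    """Estimate the Kinesis shard count for a given peak request rate."""
    peak_rps = peak_requests_per_minute / 60
    return max(1, math.ceil(peak_rps * headroom / RECORDS_PER_SHARD_PER_SECOND))
```

For example, a peak of 300,000 requests per minute averages 5,000 requests per second, which maps to the default of 5 shards. With a headroom of 1.5, the same peak maps to 8 shards.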
If you decide to scale the solution above 20 Kinesis shards, there are additional considerations. By default, a Lambda consumer polls a Kinesis shard for records every second. A Lambda consumer has a maximum batch size of 10,000 records and a maximum payload of 6 MB. Reaching either of these limits results in the invocation of the Lambda consumer regardless of whether your specified batch window has expired. If your Kinesis data stream receives records rapidly due to high CloudFront distribution request rates, rapid invocation of your Lambda consumers can cause you to surpass the PutMetricData limit of 150 TPS. Occasionally passing this limit is acceptable because AWS SDKs retry with exponential backoff by default, but regularly exceeding this limit is an indication that the batch window is not long enough. Increasing the batch window alleviates this problem. You can identify this issue by looking for throttling exceptions in the CloudWatch logs of the Lambda consumer.
Another issue you can face with a high CloudFront request rate is that the Lambda consumer may take longer to process records, which is indicated by an increase in the IteratorAge metric of the Lambda consumer. You can alleviate this problem by adding Kinesis shards or increasing the parallelization factor for the Kinesis Lambda consumer. By default, the parallelization factor, or concurrent batches per Kinesis shard, is set to 1. The Lambda consumer can also experience throttling errors when it is invoked too quickly, which is usually resolved with reserved concurrency.
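A rough back-of-the-envelope model can help judge how close a configuration is to the PutMetricData limit. The sketch below is a simplification under stated assumptions: it considers only the batch window and the 10,000-record batch size as invocation triggers, ignores the 6 MB payload limit and the parallelization factor, and treats the number of PutMetricData calls per invocation as an input you measure yourself. All function names are hypothetical.

```python
MAX_BATCH_RECORDS = 10_000   # maximum batch size for a Kinesis Lambda consumer
PUT_METRIC_DATA_TPS = 150    # PutMetricData limit cited above


def invocations_per_second(shards, records_per_shard_per_second, batch_window_seconds):
    """Estimate consumer invocations per second across all shards.

    A shard triggers an invocation when the batch window expires or when a
    full batch has accumulated, whichever happens first.
    """
    if batch_window_seconds <= 0:
        raise ValueError("batch_window_seconds must be positive")
    window_rate = 1 / batch_window_seconds
    full_batch_rate = records_per_shard_per_second / MAX_BATCH_RECORDS
    return shards * max(window_rate, full_batch_rate)


def put_calls_per_second(shards, records_per_shard_per_second,
                         batch_window_seconds, calls_per_invocation=1):
    """Estimate PutMetricData calls per second for the whole stream."""
    rate = invocations_per_second(
        shards, records_per_shard_per_second, batch_window_seconds
    )
    return rate * calls_per_invocation
```

Under this model, 20 fully loaded shards with a one-minute batch window fill a batch every 10 seconds each, or about 2 invocations per second in total. The estimate only approaches the limit when the shard count or the number of calls per invocation grows substantially.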
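The tuning knobs mentioned above map onto parameters of the Lambda event source mapping. As a hedged sketch, the helper below validates the values and builds the arguments for the Lambda UpdateEventSourceMapping API. The function names and the mapping UUID are placeholders, and the client is passed in rather than created here.

```python
def tuning_kwargs(mapping_uuid, batch_window_seconds=60,
                  batch_size=10_000, parallelization_factor=1):
    """Build arguments for update_event_source_mapping on a Kinesis consumer."""
    if not 0 <= batch_window_seconds <= 300:
        raise ValueError("batch window must be between 0 and 300 seconds")
    if not 1 <= batch_size <= 10_000:
        raise ValueError("batch size must be between 1 and 10,000 records")
    if not 1 <= parallelization_factor <= 10:
        raise ValueError("parallelization factor must be between 1 and 10")
    return {
        "UUID": mapping_uuid,
        "MaximumBatchingWindowInSeconds": batch_window_seconds,
        "BatchSize": batch_size,
        "ParallelizationFactor": parallelization_factor,
    }


def apply_tuning(lambda_client, mapping_uuid, **settings):
    """Apply the settings using a boto3 Lambda client supplied by the caller."""
    return lambda_client.update_event_source_mapping(
        **tuning_kwargs(mapping_uuid, **settings)
    )
```

Raising the parallelization factor increases the number of concurrent invocations, so it is worth re-checking the PutMetricData call rate after changing it.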
Building the solution
This walkthrough will use an AWS SAM template to automate the deployment of the majority of the infrastructure required to support this solution. The following prerequisites should be met prior to proceeding with this walkthrough:
It is possible to proceed without having a CloudFront distribution already created. The provided AWS SAM template in Step 1 of the walkthrough below will not create a CloudFront distribution. In order to move on to Step 2 of the walkthrough, a CloudFront distribution is required. More information about CloudFront distributions is available here.
Prior to proceeding, please note that the AWS resources created as part of the walkthrough have costs associated with them. It is recommended to delete AWS resources that do not have an immediate use.
Step 1: Deploy Infrastructure
To deploy the infrastructure, we will use an AWS SAM template to automate the setup of all the components. Following these instructions requires two parameters to be supplied—CloudFrontDistributionDomainName and KinesisShardCount. The CloudFrontDistributionDomainName is the CloudFront-assigned domain name associated with the CloudFront distribution for which custom metrics are being enabled. This domain name will end in “.cloudfront.net”. Do not use any CloudFront alternate domain names or CNAMEs that may have been configured as the value for this parameter.
The KinesisShardCount is an integer that will default to a value of 5. This parameter will determine the number of Kinesis shards that are created in the Kinesis data stream used for processing the real-time logs.
1. Clone the source code from the aws-cloudfront-real-time-metrics-sample GitHub repository to a local environment.
2. From a shell prompt in a local environment, change directory to the root of the source code cloned from aws-cloudfront-real-time-metrics-sample.
3. Run the command below to trigger the stack creation dialog.
4. Populate the Stack Name parameter with any valid CloudFormation stack name. The AWS Region parameter can be any valid AWS Region, but it will default to us-east-1. Provide values for the CloudFrontDistributionDomainName and KinesisShardCount parameters as appropriate. Responding with a “y” to the Allow SAM CLI IAM role creation prompt is required, or else the deployment will fail. Responding with a “y” to the Confirm changes before deploy prompt will enable a review of the CloudFormation changeset prior to the deployment. Respond with a “y” to the Save arguments to a configuration file prompt to prevent being asked these questions in the future.
5. Once the stack has been created successfully, navigate to the CloudFormation console in the AWS Console. Select the stack and then Outputs. Note the Value of the output with a Key of KinesisDataStreamArn. This value will be required in the next step.
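If you prefer to read the stack output programmatically instead of through the console, the lookup can be sketched as below. This assumes a boto3 CloudFormation client supplied by the caller. The helper name is hypothetical, and the stack name is whatever you entered in the previous step.

```python
def stack_output(cloudformation, stack_name, output_key="KinesisDataStreamArn"):
    """Return the value of one CloudFormation stack output."""
    stacks = cloudformation.describe_stacks(StackName=stack_name)["Stacks"]
    for output in stacks[0].get("Outputs", []):
        if output["OutputKey"] == output_key:
            return output["OutputValue"]
    raise KeyError(f"{output_key} not found in outputs of stack {stack_name}")
```

A typical call would be stack_output(boto3.client("cloudformation"), "my-stack-name"), where my-stack-name is a placeholder.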
Step 2: Enable real-time logs on CloudFront distribution
Adding a real-time log configuration to the CloudFront distribution enables the CloudFront real-time logging feature, and the subsequent flow of data into CloudWatch. You can add log configurations while creating or updating a distribution or by using the Logs option in the console.
1. In the AWS Console, navigate to the CloudFront console
2. Navigate to the Telemetry dropdown in the console and select the Logs option
3. Select the Real-time log configurations option
4. Select Create configuration to create a new configuration
5. Under Configuration settings provide a value for the Name option
6. Leave the Sampling rate and IAM role options set to the default value.
7. Modify the Fields dropdown selection to include only the following fields: timestamp, c_ip, sc_status, cs_host, and x_edge_location. The default setting for Fields will send all the fields CloudFront collects to the Kinesis data stream as part of each Kinesis record. For the purposes of this exercise, only these five fields are required.
8. Modify the Endpoint option to include the ARN of the Kinesis data stream created in Step 1
9. Under the Distribution settings select the CloudFront distribution and the associated Cache behavior(s) that will generate real-time logs
10. Select Create configuration to complete the real-time log configuration
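The console steps above that create the configuration have an API counterpart in the CloudFront CreateRealtimeLogConfig operation. The sketch below only builds the request arguments and makes several assumptions: the configuration name and ARNs are placeholders, an IAM role that allows CloudFront to write to the stream already exists (the console creates one for you by default), and the API spells the field names with hyphens rather than the underscores used in this post. Attaching the configuration to cache behaviors is a separate update to the distribution and is not shown.

```python
# Field names as we understand the API to expect them (hyphenated); treat as an assumption.
REQUIRED_FIELDS = ["timestamp", "c-ip", "sc-status", "cs-host", "x-edge-location"]


def realtime_log_config_request(name, stream_arn, role_arn, sampling_rate=100):
    """Build keyword arguments for the CloudFront create_realtime_log_config call."""
    if not 1 <= sampling_rate <= 100:
        raise ValueError("sampling rate must be between 1 and 100 percent")
    return {
        "Name": name,
        "SamplingRate": sampling_rate,
        "Fields": list(REQUIRED_FIELDS),
        "EndPoints": [
            {
                "StreamType": "Kinesis",
                "KinesisStreamConfig": {
                    "RoleARN": role_arn,
                    "StreamARN": stream_arn,
                },
            }
        ],
    }
```

The resulting dictionary could be passed as boto3.client("cloudfront").create_realtime_log_config(**request). Keeping the field list short matters because the consumer relies on the position of each field in the record.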
Step 3: View metrics in CloudWatch
Real-time logs should now be flowing through the system in response to web requests to the CloudFront distribution. The metrics will appear in CloudWatch metrics under two custom namespaces titled “CloudFront by Edge Location – Count” and “CloudFront by Edge Location – Percent”. Each metric will list the CloudFront distribution domain name and the edge location as dimensions. As traffic from previously unmonitored edge locations is received, new metrics will be added to the custom namespaces automatically. The RequestCount metric in the “CloudFront by Edge Location – Count” namespace will show the total number of requests received by each edge location for a CloudFront distribution. The 5xx metric in the same namespace will show the total count of HTTP 5xx errors returned at each CloudFront edge location. A metric with the same name in the “CloudFront by Edge Location – Percent” namespace will show the percentage of requests that resulted in an HTTP 5xx error. Metrics for 2xx and 4xx are also included in both “CloudFront by Edge Location” namespaces and will show the same data for HTTP 2xx and 4xx status codes.
It may take a few minutes for the custom metrics to start appearing in CloudWatch. The metrics are uploaded at one-minute resolution, but CloudWatch defaults to displaying metrics at a five-minute period, which may hide some of the first data points that are uploaded. If the distribution is not receiving web traffic, no metrics will be generated.
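To check the first data points at one-minute granularity without relying on the console default, the metrics can also be queried through the CloudWatch GetMetricStatistics API. The sketch below builds the query arguments with a 60-second period. The namespace, metric name, and dimensions are supplied by the caller, and the helper name is hypothetical.

```python
from datetime import datetime, timedelta, timezone


def one_minute_query(namespace, metric_name, dimensions, minutes=15, now=None):
    """Build get_metric_statistics arguments for per-minute sums."""
    end = now or datetime.now(timezone.utc)
    return {
        "Namespace": namespace,
        "MetricName": metric_name,
        "Dimensions": [{"Name": k, "Value": v} for k, v in dimensions.items()],
        "StartTime": end - timedelta(minutes=minutes),
        "EndTime": end,
        "Period": 60,            # one-minute data points
        "Statistics": ["Sum"],
    }
```

The result can be passed as boto3.client("cloudwatch").get_metric_statistics(**query). An empty Datapoints list in the response usually means that no traffic has been processed yet for that edge location.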
Cleaning up
Complete the steps below to clean up resources created by this exercise:
1. Run the command below to delete any resources created by CloudFormation. Use the stack name specified in Step 1 of Building the solution. Verify that the appropriate AWS Region where the stack was originally deployed has been specified.
2. In the AWS Console, navigate to the CloudFront console
3. Navigate to the Telemetry dropdown in the console and select the Logs option
4. Select the Real-time log configurations option
5. Select the real-time log configuration created in Step 2 of Building the solution by clicking its name
6. Select the Distribution cache behaviors attached to the real-time log configuration, and then select Detach
7. Select Delete to remove the real-time logging configuration
Conclusion
In this post, we enabled CloudFront real-time logs and deployed a system to process this data into actionable CloudWatch metrics. These granular CloudWatch metrics enable real-time operational analysis and precision monitoring of every CloudFront edge location used by a distribution. This level of visibility into operations allows for the rapid identification of even the most localized failures and detailed geographic data that can be used to inform analysis. The system can now be augmented to extract additional data from CloudFront real-time logs and upload it as CloudWatch custom metrics. Some examples of edge location specific CloudWatch custom metrics that could be created with this system include HTTP request methods, size of requests, latency of requests, cache behavior used to respond to requests, originating countries for requests, the browser used to make requests, and more. Review the information available about CloudFront real-time logs to learn about the possibilities for adding new granular CloudWatch custom metrics that can be created with this system.
There are a few other topics to consider to improve monitoring and operational visibility capabilities. Consider enabling autoscaling functionality on Kinesis data streams for greater resilience and cost-effectiveness. CloudWatch alarms can also be added to the custom CloudWatch metrics to notify and trigger automatic remediation processes in response to the custom metrics. Another method to improve operational visibility is to create a real-time dashboard using the CloudFront real-time logging data.
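As an example of the alarm idea above, the following sketch builds arguments for the CloudWatch PutMetricAlarm API that would alarm on the 5xx percentage for one edge location. The threshold, evaluation periods, alarm name, and dimension names are illustrative assumptions, not recommendations, and must match how the metrics are actually published in your account.

```python
def five_xx_alarm(namespace, distribution, edge_location, threshold_percent=5.0):
    """Build put_metric_alarm arguments for a per-edge-location 5xx alarm."""
    return {
        "AlarmName": f"cloudfront-5xx-{edge_location}",
        "Namespace": namespace,
        "MetricName": "5xx",
        "Dimensions": [
            # Hypothetical dimension names, chosen for illustration.
            {"Name": "Distribution", "Value": distribution},
            {"Name": "EdgeLocation", "Value": edge_location},
        ],
        "Statistic": "Average",
        "Period": 60,
        "EvaluationPeriods": 3,
        "Threshold": threshold_percent,
        "ComparisonOperator": "GreaterThanThreshold",
        # Edge locations with no traffic publish no data; do not alarm on gaps.
        "TreatMissingData": "notBreaching",
    }
```

The dictionary can be passed as boto3.client("cloudwatch").put_metric_alarm(**alarm). Adding an AlarmActions entry would connect the alarm to a notification or remediation target.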
About the Authors
Thomas Davis is a Cloud Infrastructure Architect with AWS Professional Services. He spends the majority of his time working directly with customers to solve challenging technology problems. Thomas has experience in DevSecOps, application architecture, networking, governance, and application migration. Outside of work he enjoys reading, playing video games, and spending time with family and friends.
Koushik Biswas is a Cloud Application Architect with AWS Professional Services. He specializes in security, DevOps, CI/CD pipelines, networking, infrastructure, and migrations. When he is not advising an AWS customer, he can be found outdoors fishing, hiking, or camping with family, friends, or just by himself.