We wanted to provide you with some additional information about the service disruption that occurred in the Northern Virginia (US-EAST-1) Region on July 30th, 2024.
Issue Summary
Between 2:45 PM PDT and 9:37 PM PDT on July 30th, some AWS services in the Northern Virginia (US-EAST-1) Region experienced increased latencies and elevated error rates. The impacted services included CloudWatch Logs, Amazon Data Firehose, the event framework that publishes Amazon S3 events, Amazon Elastic Container Service (ECS), AWS Lambda, Amazon Redshift, and AWS Glue. The impact to these services was caused, directly or indirectly, by the degradation of one of our internally used cells of the Amazon Kinesis Data Streams service.
Like many AWS customers that want to manage real-time data at high scale, AWS services often use Kinesis Data Streams to efficiently capture, process, and store large volumes of data from multiple sources as part of their architecture. For example, CloudWatch Logs uses Amazon Kinesis Data Streams to buffer log streams before persisting data to CloudWatch Logs’ storage. Because of the criticality of Kinesis Data Streams as both an internal and external building block, we have invested heavily in ensuring it provides high availability and scalability. A key enabler of high availability in Kinesis Data Streams is its cellularized architecture, which provides scalability and fault isolation. Each Kinesis Data Streams cell consists of multiple independent subsystems that create and manage data streams across multiple Availability Zones (AZs). In every AWS Region, Kinesis Data Streams maintains multiple cells that are used exclusively by AWS services and additional cells that are used by AWS customers. The root cause of this issue was an impairment in one of the Kinesis Data Streams cells used exclusively by AWS services in the US-EAST-1 Region. The other Kinesis Data Streams cells in the US-EAST-1 Region and other AWS Regions continued to operate normally and were not affected during this event.
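To illustrate the fault isolation this cellular design provides, the short sketch below pins each stream to a single cell so that an impairment in one cell leaves the others untouched; the cell names, stream names, and hashing scheme are hypothetical and are not AWS's actual routing logic.

```python
import hashlib

# Hypothetical illustration of cell-based fault isolation (not AWS's actual
# routing logic): each stream is pinned to exactly one cell, so an impairment
# in one cell leaves streams assigned to the other cells unaffected.
CELLS = ["internal-cell-1", "internal-cell-2", "customer-cell-1", "customer-cell-2"]

def cell_for_stream(stream_name: str) -> str:
    """Deterministically pin a stream to a single cell."""
    digest = hashlib.sha256(stream_name.encode("utf-8")).hexdigest()
    return CELLS[int(digest, 16) % len(CELLS)]

impaired_cell = "internal-cell-1"  # assume one internal cell is degraded
for stream in ["logs-buffer-42", "s3-events-7", "customer-app-stream"]:
    cell = cell_for_stream(stream)
    print(f"{stream} -> {cell} ({'impacted' if cell == impaired_cell else 'unaffected'})")
```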
Over the last couple of years, we have been gradually migrating Kinesis cells to a new architecture that has several benefits, including increased redundancy, better performance, and greater elasticity of scale. The new architecture includes a cell management system that monitors the health of all hosts in a cell and more efficiently distributes the processing of Kinesis shards across healthy hosts within the cell. After extensive load and functional testing, the cell management system was deployed into production and has been effectively managing load in many cells across multiple Regions for many months. In the months prior to the event, we had been upgrading additional cells dedicated to internal AWS workloads in the US-EAST-1 Region to the new architecture without issue. The root cause of this event was the cell management system’s behavior in managing the novel workload profile of one of these internal cells. Specifically, unlike workload profiles in other Kinesis Data Streams cells, one of the internal workloads on the impacted cell had a very large number of very low-throughput shards, which caused the cell management system to behave incorrectly and triggered this event.
At 9:09 AM PDT on July 30th, 2024, a routine deployment began in a single Availability Zone of the Kinesis cell with this novel workload profile. The deployed change itself did not affect operations, but the act of taking hosts in and out of service during the deployment triggered the issue that impacted the cell and degraded performance. Kinesis service deployment is a gradual process that removes each host from service, performs the deployment, runs health checks, and then brings the host back into service. As part of the deployment process, the Kinesis cell management system is responsible for shifting work from hosts being taken out of service to other hosts in the cell. The cell management system balances work based on throughput and other I/O dimensions that affect a host’s ability to handle additional shards. The impacted Kinesis cell had an unusually high number of low-throughput shards, which the cell management system did not effectively distribute to all the healthy hosts in the cell. As a result, a small number of hosts received a very large number of these low-throughput shards to process. While this did not initially impact the ability of these hosts to perform their primary functions, it did trigger an unintended behavior by the cell management system.
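The following sketch shows, under simplified assumptions, how a placement rule that balances only on throughput can concentrate a very large number of near-idle shards on a small number of hosts; the capacity limit, shard counts, and host names are hypothetical, and this is not the actual cell management algorithm.

```python
# Illustrative sketch only (not the actual cell management algorithm): a
# placement rule that checks throughput headroom, but never shard count, lets
# a drained host's near-idle shards pile onto the first host with headroom.
THROUGHPUT_LIMIT = 1000.0  # hypothetical per-host throughput capacity (arbitrary units)

def reassign(shards, hosts):
    """Assign each (shard_id, throughput) pair to the first host with throughput headroom."""
    load = {h: 0.0 for h in hosts}
    count = {h: 0 for h in hosts}
    for _shard_id, throughput in shards:
        for h in hosts:
            if load[h] + throughput <= THROUGHPUT_LIMIT:  # shard count is never considered
                load[h] += throughput
                count[h] += 1
                break
    return count

# A drained host held 100,000 near-idle shards; all of them fit, by throughput,
# on "host-1", so the other healthy hosts receive none of them.
drained = [(f"shard-{i}", 0.001) for i in range(100_000)]
print(reassign(drained, ["host-1", "host-2", "host-3", "host-4"]))
```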
For health monitoring purposes, each host in the cell periodically sends a status message to the cell management system. The status message includes information about each shard that the host is processing. For the hosts processing the unusually large number of shards, the status messages became very large. Because of this increase in message size, these messages could not be transmitted to and processed by the cell management system in a timely fashion, and the processing of status messages from some other hosts also became delayed. When the cell management system did not receive status messages in a timely fashion, it incorrectly determined that the healthy hosts were unhealthy and began redistributing the shards that those hosts had been processing to other hosts. This created a spike in the rate of shard redistribution across the impacted Kinesis cell and eventually overloaded another component that is used to provision secure connections for communication with Kinesis data plane subsystems, which in turn impaired Kinesis traffic processing. By 2:45 PM PDT, requests made to the impacted Kinesis cell began to experience elevated latency and error rates. Due to Kinesis’ cellular architecture, requests made to other Kinesis Data Streams cells in the US-EAST-1 Region by AWS users and services continued to operate normally.
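The sketch below uses assumed sizes and deadlines to show how per-shard status reporting can interact badly with a deadline-based health check: oversized reports miss the deadline, small reports queued behind them arrive late, and healthy hosts get misclassified as unhealthy. None of the constants reflect the actual Kinesis implementation.

```python
# Assumed constants for illustration only; none reflect the actual Kinesis
# implementation. The point: per-shard status entries make report size grow
# linearly with shard count, and a deadline-based health check then marks
# hosts unhealthy when their reports (or reports queued behind oversized
# ones) simply arrive late.
BYTES_PER_SHARD_ENTRY = 200            # assumed size of one shard's status entry
PROCESSING_RATE_BYTES_PER_SEC = 5e6    # assumed controller ingest rate
HEALTH_DEADLINE_SECONDS = 10.0         # assumed "status received" deadline

def status_message_bytes(shard_count: int) -> int:
    return shard_count * BYTES_PER_SHARD_ENTRY

def marked_unhealthy(shard_count: int, queued_ahead_bytes: int = 0) -> bool:
    """True if this host's report cannot be received and processed before the deadline."""
    seconds = (queued_ahead_bytes + status_message_bytes(shard_count)) / PROCESSING_RATE_BYTES_PER_SEC
    return seconds > HEALTH_DEADLINE_SECONDS

print(marked_unhealthy(2_000))                                  # False: normally sized report
print(marked_unhealthy(500_000))                                # True: oversized report misses the deadline
print(marked_unhealthy(2_000, queued_ahead_bytes=200_000_000))  # True: healthy host delayed behind large reports
```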
Engineering teams were immediately engaged and began investigating the affected cell. While working to identify the root cause of the increased resource contention within the affected cell, engineering began to take steps to mitigate the issue. At 4:25 PM PDT, engineers deployed a change to shed load by reducing incoming stream volume from less time-sensitive internal workloads. This reduced the total volume of data being processed by the cell management system and, at 5:39 PM PDT, error rates and latencies began to show signs of improvement. Engineers continued to restore the health of the cell management system by adding additional capacity to a subsystem involved in provisioning secure connections. By 5:55 PM PDT, the cell management system was able to effectively establish secure connections for processing Kinesis requests to the cell. At 6:04 PM PDT, the majority of incoming requests to the affected cell were being processed successfully. At 6:32 PM PDT, engineers added more capacity to further improve connection provisioning, which led to a further reduction in error rates. By 7:21 PM PDT, the vast majority of requests to the Kinesis Data Streams cell were being processed normally.
During the course of the event, we made a number of changes to restore normal operations to the impacted Kinesis Data Streams cell; these changes mitigated impact and improved the cell management system’s ability to handle workloads that generate a very high number of very low-throughput shards. These changes included increasing the capacity to provision data plane connections, operationalizing additional tooling to shed load, changing connection limits, and other changes that will remain in place in US-EAST-1. We have also already made these same changes in all other AWS Regions that have cells with the new Kinesis Data Streams architecture. These changes will protect the system from a similar event while we work on further updates to the cell management system to better manage the workload that triggered the issue.
AWS Service Impact
Amazon CloudWatch Logs, Amazon Data Firehose, the event framework that publishes Amazon S3 events, Amazon Elastic Container Service (ECS), AWS Lambda, Amazon Redshift, and AWS Glue make use of the impacted Kinesis Data Streams cell and were directly or indirectly impacted by this event. These services experienced elevated error rates and latencies when the Kinesis Data Streams cell was impacted at 2:45 PM PDT, with initial improvement to error rates at 5:39 PM PDT, significant improvement by 7:21 PM PDT, and a return to normal operations by 9:37 PM PDT with the recovery of the cell.
CloudWatch Logs, which customers use to ingest and analyze logs, leverages Kinesis Data Streams to buffer incoming log streams. With the continued mitigations to the Kinesis Data Streams cell, the vast majority of CloudWatch Logs requests were being processed normally by 7:21 PM PDT. When CloudWatch Logs resumed normal operations, the newest log streams were processed first to provide real-time observability, with the backlog of older delayed log streams processing in parallel. As a result, the backlog of older log streams that had built up over several hours took some time to fully process. CloudWatch Logs backlog processing completed by 5:50 AM PDT on July 31st, at which point logs, metrics filtered from log events, and CloudWatch alarms on delayed metrics were operating normally.
The event framework that manages event delivery in Amazon S3 processed event requests using the affected Kinesis Data Streams cell. S3 event delivery was delayed from 2:45 PM PDT until 7:21 PM PDT. Once the Kinesis Data Streams cell recovered, S3 event delivery resumed normal operations. The backlog of delayed S3 events completed processing at 2:38 AM PDT on August 1st.
Amazon Data Firehose experienced increased failures when invoking the PutRecord and PutRecordBatch APIs to send data to delivery streams, increased latencies in delivering stream data, and increased latencies in creating and deleting delivery streams in the Northern Virginia (US-EAST-1) Region. There was no data loss, and all records accepted by the service during the impacted period were delivered.
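For callers handling partial PutRecordBatch failures on the client side, one common pattern is to resubmit only the rejected records with backoff; the minimal boto3 sketch below illustrates it, using a hypothetical delivery stream name, example payloads, and assumed backoff values.

```python
# A common client-side pattern (sketch): resubmit only the records that a
# PutRecordBatch call rejected. The delivery stream name, payloads, and
# backoff values here are examples, not part of the event description.
import time
import boto3

firehose = boto3.client("firehose", region_name="us-east-1")

def put_with_retries(stream_name, payloads, max_attempts=5):
    pending = [{"Data": p} for p in payloads]
    for attempt in range(max_attempts):
        response = firehose.put_record_batch(DeliveryStreamName=stream_name, Records=pending)
        if response["FailedPutCount"] == 0:
            return
        # RequestResponses is positional; keep only the records whose entry carries an ErrorCode.
        pending = [rec for rec, res in zip(pending, response["RequestResponses"]) if "ErrorCode" in res]
        time.sleep(min(2 ** attempt, 30))  # simple exponential backoff between attempts
    raise RuntimeError(f"{len(pending)} records still undelivered after {max_attempts} attempts")

put_with_retries("example-delivery-stream", [b"event-1\n", b"event-2\n"])
```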
Some Amazon Elastic Container Service (ECS) customers experienced impact to running tasks when these tasks were configured to use the awslogs log driver in the default blocking mode. Blocking mode prevents log loss but also results in the task being blocked when logs can't be sent to CloudWatch Logs. This in turn leads to tasks being unable to respond to health checks. Customers who configured the awslogs driver to use non-blocking mode were able to run their applications normally.
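For reference, the snippet below sketches how the awslogs driver's delivery mode is set in an ECS container definition; the log group, Region, stream prefix, and buffer size shown are example values.

```python
# Example container-level log configuration (illustrative values only). The
# default mode is "blocking"; "non-blocking" buffers log output in memory and
# drops it if the buffer fills, instead of blocking the application when
# CloudWatch Logs cannot be reached.
log_configuration = {
    "logDriver": "awslogs",
    "options": {
        "awslogs-group": "/ecs/example-app",    # example log group
        "awslogs-region": "us-east-1",
        "awslogs-stream-prefix": "example",
        "mode": "non-blocking",
        "max-buffer-size": "25m",               # in-memory buffer used by non-blocking mode
    },
}
# This dict is supplied as the "logConfiguration" field of a container
# definition, for example when calling ecs.register_task_definition(...).
```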
Lambda customers experienced missing CloudWatch Logs for their Lambda function executions. Customers who called CloudWatch Logs or Amazon Data Firehose APIs directly from within their Lambda functions may have experienced elevated function errors (if they blocked on logging) or increased latency. Additionally, due to the impact on function invocation logging, customers were unable to determine whether all of their functions were executing successfully during the event. As CloudWatch Logs availability gradually improved during the event, Lambda was able to publish more logs successfully, and log delivery fully recovered when the CloudWatch Logs APIs fully recovered.
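One way to avoid blocking on logging from inside a function is to treat direct telemetry calls as best-effort, as in the hedged sketch below; the delivery stream name and handler body are hypothetical.

```python
# Hedged sketch with a hypothetical delivery stream and handler: direct
# telemetry calls are treated as best-effort so that a logging or Firehose
# outage degrades observability rather than failing the invocation itself.
import json
import boto3
from botocore.exceptions import BotoCoreError, ClientError

firehose = boto3.client("firehose")

def handler(event, context):
    result = {"status": "ok"}  # the function's real work would go here
    try:
        firehose.put_record(
            DeliveryStreamName="example-telemetry-stream",
            Record={"Data": (json.dumps(result) + "\n").encode("utf-8")},
        )
    except (BotoCoreError, ClientError):
        pass  # swallow telemetry failures instead of blocking or erroring the invocation
    return result
```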
API Gateway customers experienced failures in the delivery of logs and an elevated rate of errors for API Gateway logging configuration operations until CloudWatch Logs recovered. Some customers saw errors when invoking Lambda functions through API Gateway. API Gateway continued to process API invocations normally during the event.
Other AWS services were affected by the ECS Fargate task failures as well as the elevated CloudWatch Logs error rates. For example, Amazon Redshift customers experienced intermittent issues connecting to Redshift clusters when using the Amazon Redshift Query Editor v2 or via applications using JDBC, ODBC, or Python drivers in the Northern Virginia (US-EAST-1) Region. There was an elevated error rate for viewing CloudWatch metrics for Redshift clusters until CloudWatch Logs recovered. Additionally, Amazon Managed Workflows for Apache Airflow (MWAA) and AWS Glue customers experienced elevated latencies and API failures when creating or updating resources and logging metrics in the Northern Virginia (US-EAST-1) Region. MWAA and Glue latencies and API errors recovered as Fargate task launches and CloudWatch Logs availability improved.
In closing
We apologize for the impact this event caused for our customers. While we are proud of our track record of availability, we know how critical our services are to our customers, their applications and end users, and their businesses. We know this event impacted many customers in significant ways. We will do everything we can to learn from this event and use it to improve our availability even further.