We wanted to provide you with some additional information about the service disruption that occurred in the Northern Virginia (US-EAST-1) Region on June 13th, 2023.
Issue Summary
Starting at 11:49 AM PDT on June 13th, 2023, customers experienced increased error rates and latencies for Lambda function invocations within the Northern Virginia (US-EAST-1) Region. Some other AWS services – including Amazon STS, AWS Management Console, Amazon EKS, Amazon Connect, and Amazon EventBridge – also experienced increased error rates and latencies as a result of the degraded Lambda function invocations. Lambda function invocations began to return to normal levels at 1:45 PM PDT, and all affected services had fully recovered by 3:37 PM PDT.
To explain this event, we need to share a little about the internals of AWS Lambda. AWS Lambda makes use of a cellular architecture, where each cell consists of multiple subsystems to serve function invocations for customer code. First, the Lambda Frontend is responsible for receiving and routing customer function invocations. Second, the Lambda Invocation Manager is responsible for managing the underlying compute capacity – in the form of Lambda Execution Environments – depending on the scale of function invocation traffic on a per-function, per-account basis.
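As a purely illustrative sketch of this cellular pattern (the class names, hashing scheme, and cell count below are hypothetical and do not reflect Lambda's internal implementation), each customer's function traffic can be pictured as hashing to one cell, whose Frontend receives invocations and whose Invocation Manager allocates Execution Environments on demand:

```python
import hashlib
from dataclasses import dataclass, field


@dataclass
class ExecutionEnvironment:
    """Hypothetical stand-in for an isolated environment that runs customer code."""
    function_arn: str


@dataclass
class InvocationManager:
    """Allocates execution environments based on per-function, per-account demand."""
    environments: dict = field(default_factory=dict)

    def get_environment(self, account_id: str, function_arn: str) -> ExecutionEnvironment:
        key = (account_id, function_arn)
        # Reuse a warm environment if one exists; otherwise allocate a new one.
        if key not in self.environments:
            self.environments[key] = ExecutionEnvironment(function_arn)
        return self.environments[key]


@dataclass
class Frontend:
    """Receives invocations and asks the invocation manager for capacity."""
    invocation_manager: InvocationManager

    def invoke(self, account_id: str, function_arn: str, payload: bytes) -> str:
        env = self.invocation_manager.get_environment(account_id, function_arn)
        return f"ran {env.function_arn} with {len(payload)} bytes"


class CellularService:
    """Routes each (account, function) pair to one of a fixed set of cells."""

    def __init__(self, num_cells: int = 4):
        self.cells = [Frontend(InvocationManager()) for _ in range(num_cells)]

    def route(self, account_id: str, function_arn: str) -> Frontend:
        digest = hashlib.sha256(f"{account_id}:{function_arn}".encode()).hexdigest()
        return self.cells[int(digest, 16) % len(self.cells)]


service = CellularService()
arn = "arn:aws:lambda:us-east-1:111122223333:function:demo"
cell = service.route("111122223333", arn)
print(cell.invoke("111122223333", arn, b"{}"))
```

Because each cell is independent, a problem confined to one cell affects only the functions routed to that cell.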
At 10:01 AM PDT, the Lambda Frontend fleet started scaling in response to an increase in service traffic, within normal daily traffic patterns, in the Northern Virginia (US-EAST-1) Region. At 11:49 AM PDT, while adding additional compute capacity to handle this increase in service traffic, the Lambda Frontend fleet crossed a capacity threshold that had never previously been reached within a single cell. This triggered a latent software defect that caused Lambda Execution Environments to be successfully allocated for incoming requests but never fully utilized by the Lambda Frontend. Because these newly allocated Lambda Execution Environments could not be used to serve incoming requests, function invocations within the affected cell experienced increased error rates and latencies. Customers triggering Lambda functions through asynchronous or streaming event sources also saw an increase in their event backlog since events were not being processed. Lambda function invocations within other Lambda cells were not affected by this event.
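To illustrate this class of defect only (this is not Lambda's actual code, and the structure, constant, and limit below are hypothetical), consider a frontend that tracks routable execution environments in a structure with a hard-coded bound: once the fleet crosses that bound, newly allocated environments are silently dropped and never serve requests, even though allocation itself succeeds.

```python
class FrontendRoutingTable:
    """Purely hypothetical illustration of a capacity-threshold defect:
    environments allocated past a hard-coded limit never become routable."""

    MAX_TRACKED_ENVIRONMENTS = 1024  # latent threshold never reached before this event

    def __init__(self):
        self._routable = []

    def register(self, environment_id: str) -> None:
        # BUG: silently ignores environments once the threshold is crossed,
        # so capacity is allocated but never utilized by the frontend.
        if len(self._routable) < self.MAX_TRACKED_ENVIRONMENTS:
            self._routable.append(environment_id)

    def pick(self) -> str:
        if not self._routable:
            # Surfaces to callers as invocation errors and elevated latency.
            raise RuntimeError("no routable execution environment")
        return self._routable[0]
```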
Engineering teams were immediately engaged and began investigating. By 12:26 PM PDT, engineers had identified the latent software defect and its impact on the provisioning of underlying compute capacity. As an immediate mitigation, engineers confirmed that traffic levels had subsided and initiated a scale-down of the Lambda Frontend fleet to a level that no longer triggered the latent software defect. By 1:30 PM PDT, new Lambda function invocations began to see recovery. By 1:45 PM PDT, Lambda had fully recovered for synchronous function invocations. At this stage, the vast majority of affected AWS services began to fully recover. Between 1:45 PM and 3:37 PM PDT, Lambda completed processing the backlog of asynchronous events from various event sources, consistent with customer-specified, event-source-specific retry policies. By 3:37 PM PDT, the AWS Lambda service had resumed normal operations, and all services dependent on Lambda were operating normally.
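For reference, customers bound how long Lambda retries an asynchronous event for a given function through its event invoke configuration. The sketch below uses the real PutFunctionEventInvokeConfig API via boto3; the function name and limit values are placeholders chosen for illustration.

```python
import boto3

lambda_client = boto3.client("lambda", region_name="us-east-1")

# Placeholder function name and values; retries of asynchronous invocations are
# bounded by the event's maximum age and the maximum number of retry attempts.
lambda_client.put_function_event_invoke_config(
    FunctionName="my-example-function",
    MaximumEventAgeInSeconds=3600,   # keep retrying an event for up to one hour
    MaximumRetryAttempts=2,          # retry a failed asynchronous invocation at most twice
)
```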
We have taken several actions to prevent a recurrence of this event. We immediately disabled the Lambda Frontend fleet scaling activities that triggered the event while we worked to address the latent bug that caused the issue; this bug has since been resolved, and the fix has been deployed to all Regions. This event also uncovered a gap in our Lambda cellular architecture for the scaling of the Lambda Frontend, which allowed a latent bug to cause impact as the affected cell scaled. Lambda has already completed several action items to address the immediate concern with cellular scaling and remains on track to complete a larger effort later this year to ensure that all cells are bounded to a well-tested size, avoiding unexpected scaling issues in the future.
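As a hypothetical sketch of what bounding a cell to a well-tested size could look like (this is not Lambda's internal tooling, and the limit and function below are invented for illustration), a scaling guardrail would cap growth of any single cell at its tested capacity, with further growth handled by adding cells rather than enlarging one:

```python
TESTED_MAX_CELL_CAPACITY = 800  # hypothetical, well-tested upper bound per cell


def approve_scale_up(current_hosts: int, requested_hosts: int) -> int:
    """Cap a cell's fleet at the capacity it has been tested to, so growth
    beyond that point happens by adding cells rather than enlarging one."""
    target = current_hosts + requested_hosts
    if target > TESTED_MAX_CELL_CAPACITY:
        # Grant only the headroom remaining below the tested bound.
        return max(0, TESTED_MAX_CELL_CAPACITY - current_hosts)
    return requested_hosts
```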
AWS Service Impact
Several AWS services experienced impact as a result of increased error rates and latencies for Lambda function invocations. AWS Security Token Service (STS) experienced elevated error rates between 11:49 AM and 2:10 PM PDT, with three distinct periods of impact. AWS Sign-in experienced elevated error rates for SAML federation in the US-EAST-1 Region between 1:13 PM and 2:11 PM PDT. Customers using AWS Sign-in to federate from external identity providers (IdPs) using SAML experienced elevated error rates when throttles were put in place within Amazon STS. Existing IAM sessions were not impacted, but new sign-in federation via SAML was degraded.
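For context, SAML federation ultimately calls the STS AssumeRoleWithSAML API, which is where the throttling surfaced. The sketch below shows one way a federation client might back off and retry on throttling errors; the role ARN, provider ARN, assertion, and error-code list are illustrative assumptions rather than a prescribed pattern.

```python
import time

import boto3
from botocore.exceptions import ClientError

sts = boto3.client("sts", region_name="us-east-1")


def assume_role_with_saml_with_backoff(role_arn, principal_arn, saml_assertion, attempts=5):
    """Retry AssumeRoleWithSAML with exponential backoff when STS throttles."""
    for attempt in range(attempts):
        try:
            return sts.assume_role_with_saml(
                RoleArn=role_arn,
                PrincipalArn=principal_arn,
                SAMLAssertion=saml_assertion,  # base64-encoded SAML assertion from the IdP
            )
        except ClientError as err:
            code = err.response["Error"]["Code"]
            if code not in ("Throttling", "ThrottlingException") or attempt == attempts - 1:
                raise
            time.sleep(2 ** attempt)  # back off before retrying
```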
Amazon EventBridge supports routing events to Lambda and experienced elevated delivery latencies of up to 801 seconds between 11:49 AM and 1:45 PM PDT. Amazon EKS experienced increased error rates and latencies during the provisioning of new EKS clusters; however, existing EKS clusters were not affected. At 1:45 PM PDT, these error rates returned to normal levels when the Lambda function invocation issue was resolved. The AWS Management Console for the Northern Virginia (US-EAST-1) Region experienced elevated error rates from 11:48 AM to 2:02 PM PDT. During this time, customers accessing the AWS Management Console saw either an 'AWS Management Console is currently unavailable' or a '504 Time-out' error page. The AWS Management Console in Regions outside of Northern Virginia (US-EAST-1) was not affected by this event.
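When a target such as a Lambda function is slow or failing, EventBridge retries deliveries according to the retry policy and dead-letter queue configured on that target. The example below uses the real PutTargets API via boto3; the rule name, ARNs, and limits are placeholders for illustration.

```python
import boto3

events = boto3.client("events", region_name="us-east-1")

# Placeholder rule name and ARNs; the retry policy bounds how long EventBridge
# keeps retrying delivery, and undeliverable events land in the dead-letter queue.
events.put_targets(
    Rule="my-example-rule",
    Targets=[
        {
            "Id": "lambda-target",
            "Arn": "arn:aws:lambda:us-east-1:111122223333:function:my-example-function",
            "RetryPolicy": {
                "MaximumEventAgeInSeconds": 3600,
                "MaximumRetryAttempts": 10,
            },
            "DeadLetterConfig": {
                "Arn": "arn:aws:sqs:us-east-1:111122223333:my-example-dlq"
            },
        }
    ],
)
```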
Amazon Connect uses AWS Lambda to process contacts and agent events. As a result of the Lambda event, Amazon Connect experienced degraded contact handling between 11:49 AM and 1:40 PM PDT. During this time, calls would have failed to connect, and chats and tasks would have failed to initiate. Agents would have experienced issues logging in and using Connect. AWS Support Center functionality was degraded between 11:49 AM and 2:38 PM PDT. During the first eleven minutes of impact, requests to create/view/update support cases may have failed. By 12:00 PM PDT, we had restored access to support cases. However, our call and chat functionality remained impaired, making these channels of communication unavailable. The ability to create/view/update cases via the web/email method was not impacted. AWS Support Center functionality was fully restored by 2:38 PM PDT after dependent services recovered alongside AWS Lambda.