We’d like to share more detail about the AWS service disruption that occurred this past weekend in the AWS Sydney Region. The service disruption primarily affected EC2 instances and their associated Elastic Block Store (“EBS”) volumes running in a single Availability Zone.
Loss of Power
At 10:25 PM PDT on June 4th, our utility provider suffered a loss of power at a regional substation as a result of severe weather in the area. This failure resulted in a total loss of utility power to multiple AWS facilities. In one of the facilities, our power redundancy didn't work as designed, and we lost power to a significant number of instances in that Availability Zone.
Normally, when utility power fails, electrical load is maintained by multiple layers of power redundancy. Every instance is served by two independent power delivery line-ups, each providing access to utility power, uninterruptable power supplies (UPSs), and back-up power from generators. If either of these independent power line-ups provides power, the instance will maintain availability. During this weekend’s event, the instances that lost power lost access to both their primary and secondary power as several of our power delivery line-ups failed to transfer load to their generators. These particular power line-ups utilize a technology known as a diesel rotary uninterruptable power supply (DRUPS), which integrates a diesel generator and a mechanical UPS. Under normal operation, the DRUPS uses utility power to spin a flywheel which stores energy. If utility power is interrupted, the DRUPS uses this stored energy to continue to provide power to the datacenter while the integrated generator is turned on to continue to provide power until utility power is restored. The specific signature of this weekend’s utility power failure resulted in an unusually long voltage sag (rather than a complete outage). Because of the unexpected nature of this voltage sag, a set of breakers responsible for isolating the DRUPS from utility power failed to open quickly enough. Normally, these breakers would assure that the DRUPS reserve power is used to support the datacenter load during the transition to generator power. Instead, the DRUPS system’s energy reserve quickly drained into the degraded power grid. The rapid, unexpected loss of power from DRUPS resulted in DRUPS shutting down, meaning the generators which had started up could not be engaged and connected to the datacenter racks. DRUPS shutting down this rapidly and in this fashion is unusual and required some inspection. Once our on-site technicians were able to determine it was safe to manually re-engage the power line-ups, power was restored at 11:46PM PDT.
As power was restored to the affected infrastructure, our automated systems began to bring customers’ EC2 instances and EBS volumes back online. By 1:00 AM PDT, over 80% of the impacted customer instances and volumes were back online and operational. After power recovery, some instances in the Availability Zone experienced DNS resolution failures as the internal DNS hosts for that Availability Zone were brought back online and handled the recovery load. DNS error rates recovered by 2:49 AM PDT.
A latent bug in our instance management software led to a slower than expected recovery of the remaining instances. The team worked over the next several hours to manually recover these remaining instances. Instances were recovered continually during this time, and by 8AM PDT, nearly all instances had been recovered.
There were also a small number of EBS volumes (less than 0.01% of the volumes in the Availability Zone) that were unable to recover after power was restored. EBS volumes are replicated to multiple storage servers in the same Availability Zone, which protects against most hardware failure scenarios and allows EBS to provide a 0.1%-0.2% annualized failure rate. This does mean volumes can be lost when multiple servers fail at the same time. During the power event, a small number of storage servers suffered failed hard drives which led to a loss of the data stored on those servers. In cases where both of the replicas were hosted on failed servers, we were unable to automatically restore the volume. After the initial wave of automated recovery, the EBS team focused on manually recovering as many damaged storage servers as possible. This is a slow process, which is why some volumes took much longer to return to service.
During the initial part of this event, customers experienced errors when trying to launch new instances, or when trying to scale their auto-scaling groups. To remediate this, our team had to manually fail away from degraded services in the affected zone. Starting at 11:42 PM PDT, the manual failover was complete and customers were able to launch instances in the unaffected Availability Zones. When the APIs initially recovered, our systems were delayed in propagating some state changes and making them available via describe API calls. This meant that some customers could not see their newly launched resources, and some existing instances appeared as stuck in pending or shutting down when customers tried to make changes to their infrastructure in the affected Availability Zone. These state delays also increased latency of adding new instances to existing Elastic Load Balancing (ELB) load balancers.
While we have experienced excellent operational performance from the power configuration used in this facility, it is apparent that we need to enhance this particular design to prevent similar power sags from affecting our power delivery infrastructure. In order to prevent a recurrence of this correlated power delivery line-up failure, we are adding additional breakers to assure that we more quickly break connections to degraded utility power to allow our generators to activate before the UPS systems are depleted.
Additionally, we will be taking actions to improve our recovery systems. The first is to fix the latent issue that led to our recovery systems not being able to automatically recover a subset of customer instances. That fix is already in testing, and will be deployed over the coming days. We will also be starting a program to regularly test our recovery processes on unoccupied, long-running hosts in our fleet. By continually testing our recovery workflows on long-running hosts, we can assure that no latent issues or configuration setting exists that would impact our ability to quickly remediate customer impact when instances need to be recovered.
For this event, customers that were running their applications across multiple Availability Zones in the Region were able to maintain availability throughout the event. For customers that need the highest availability for their applications, we continue to recommend running applications with this architecture. We know that it was problematic that for a period of time there were errors and delays for the APIs that launch instances. We are working on changes that will assure our APIs are even more resilient to failure and believe these changes will be rolled out to the Sydney Region in July.
We apologize for any inconvenience this event caused. We know how critical our services are to our customers’ businesses. We are never satisfied with operational performance that is anything less than perfect, and we will do everything we can to learn from this event and use it to drive improvement across our services.
-The AWS Team