AWS – Ready to Weather the Storm
As people across the Northeastern United States are stocking up their pantries and preparing their disaster supplies kits, AWS is also preparing for winter snow storms and the subsequent hurricane season. After fielding several customer requests for information about our preparation regime, my colleagues Brian Beach and Ilya Epshteyn wrote the following guest post in order to share some additional information.
AWS takes extensive precautions to help ensure that we will remain fully operational, with no loss of service for our hosted applications, even during a major weather event or natural disaster. How reliable is an application hosted by AWS? In 2014, Nucleus Research surveyed 198 AWS customers who reported moving existing workloads from on-premises to AWS and found that they were able to reduce unplanned downtime by 32% (see Availability and Reliability in the Cloud: Amazon Web Services for more info).
AWS replicates critical system components across multiple Availability Zones to ensure high availability both under normal circumstances and during disasters such as fires, tornadoes, or floods. Our services are available to customers from 12 regions in the United States, Brazil, Europe, Japan, Singapore, Australia, Korea, and China with 32 Availability Zones. Each Availability Zone runs on its own independent infrastructure, engineered to be highly reliable so that even extreme disasters or weather events should only affect a single Availability Zone. The datacenters’ electrical power systems are designed to be fully redundant and maintainable without impact to operations. Common points of failure, such as generators, UPS units, and air conditioning, are not shared across Availability Zones.
At AWS, we plan for failure by maintaining contingency plans and regularly rehearsing our responses. In the words of Werner Vogels, Amazon’s CTO: “Everything fails, all the time.” We regularly perform preventative maintenance on our generators and UPS units to ensure that the equipment is ready when needed. We also maintain a series of incident response plans covering both common and uncommon events and update them regularly to incorporate lessons learned and prepare for emerging threats. In the days leading up to a known event such as a hurricane, we make preparations such as increasing fuel supplies, updating staffing plans, and adding provisions like food and water to ensure the safety of the support teams. Once it is clear that a storm will impact a specific region, the response plan is executed and we post updates to the Service Health Dashboard throughout the event.
During Hurricane Sandy, the most destructive hurricane of the 2012 Atlantic hurricane season and the second-costliest hurricane in United States history, AWS remained online throughout the entire storm. An extensive Hurricane Sandy Response Plan, including 24/7 staffing by all service teams, escalation plans, and continuous status updates, assured normal operations and service quality for our customers.
In fact, AWS’s highly reliable platform also played a key role in enabling a more effective storm response. A&T Systems (ATS.com), an AWS Advanced Consulting partner, used AWS in support of a statewide emergency management agency as Hurricane Sandy struck. Another AWS customer, MapBox, provided maps for several storm-related services to help predict and track Sandy’s progression, communicate evacuation plans, and track surges.
In the aftermath of the storm, some companies established operations in the AWS Cloud to replace datacenters lost to flooding and power outages. One such example is NYU’s Langone Medical Center. As noted in the article Still Recovering from Sandy, “…NYU researchers [were] able to push forward with their sequencing experiments. They were able to salvage 200 terabytes of backup sequencing data, and have set up temporary data storage in a New Jersey facility, using computing power from the NYU Center for Genomics and Systems Biology and the Amazon cloud.”
What’s even more interesting is that AWS gave our customers a way to prepare for worst-case scenarios by proactively copying and replicating their data to other AWS regions. Although this ultimately proved unnecessary, since US East (Northern Virginia) stayed up without any issues, our customers had peace of mind that they would be able to continue business as usual even if the region had failed. One example is the Obama 2012 Campaign: in a nine-hour period, they proactively replicated their entire environment from the US East (Northern Virginia) region to the US West (Northern California) region, providing cross-continent fault tolerance on demand. The Obama campaign was able to copy over 27 terabytes of data from East to West in less than four hours (watch the re:Invent video, Continuous Integration and Deployment Best Practices on AWS, to learn more). Leo Zhadanovsky, a DevOps engineer for the Obama Campaign & Democratic National Committee who now works for AWS, commented that “AWS’s scalable, on-demand capacity allowed Obama for America to quickly spin up a disaster-recovery copy of their infrastructure in another region in a matter of hours — something that would normally take weeks, or months in on premise environment.”
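To put the data-copy figure in perspective, here is a back-of-the-envelope calculation of the average transfer rate it implies. This is only a sketch: it assumes decimal terabytes (10^12 bytes) and a constant rate, and the function name is ours, not part of any AWS tool. Because the campaign copied "over" 27 terabytes in "less than" four hours, the result is a lower bound on the average rate.

```python
def required_throughput_gbps(terabytes: float, hours: float) -> float:
    """Minimum sustained rate, in gigabits per second, needed to move
    `terabytes` (decimal, 10**12 bytes each) within `hours`."""
    if hours <= 0:
        raise ValueError("hours must be positive")
    bits = terabytes * 10**12 * 8   # bytes -> bits
    seconds = hours * 3600
    return bits / seconds / 10**9   # bits per second -> gigabits per second


# 27 TB in 4 hours, the figures quoted above
rate = required_throughput_gbps(27, 4)
print(f"{rate:.1f} Gbit/s")  # prints "15.0 Gbit/s"
```

In other words, the copy averaged at least roughly 15 gigabits per second between the two regions.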
While AWS goes to great lengths to provide availability of the cloud, our customers share responsibility for ensuring availability within the cloud. The customers described above, and others like them, have succeeded because they designed for failure and adopted best practices for high availability, such as taking advantage of multiple Availability Zones and configuring Auto Scaling groups to replace unhealthy instances. The Building Fault-Tolerant Applications on AWS whitepaper is a great introduction to achieving high availability in the cloud. In addition, the AWS Well-Architected Framework codifies the experiences of thousands of customers, helping them assess and improve their cloud-based architectures and mitigate disruptions.
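As an illustration of the two practices named above, the sketch below builds the request for an Auto Scaling group that spans several Availability Zones and uses EC2 health checks so that unhealthy instances are replaced. It is a minimal example under stated assumptions, not a recommended production configuration: the helper name, the group and launch template names, the zone names, and the sizing values are all hypothetical, and it assumes a default VPC (otherwise subnets would be supplied through VPCZoneIdentifier instead of AvailabilityZones).

```python
def build_asg_request(name, launch_template_name, zones, desired):
    """Build keyword arguments for the Auto Scaling CreateAutoScalingGroup call.

    Hypothetical helper: it only assembles a dict and makes no AWS calls.
    """
    zones = list(zones)
    if len(set(zones)) < 2:
        # A group confined to one zone cannot survive the loss of that zone.
        raise ValueError("use at least two distinct Availability Zones")
    return {
        "AutoScalingGroupName": name,
        "LaunchTemplate": {
            "LaunchTemplateName": launch_template_name,
            "Version": "$Latest",
        },
        "MinSize": desired,           # never drop below the desired fleet size
        "MaxSize": desired * 2,       # illustrative headroom for scale-out
        "DesiredCapacity": desired,
        "AvailabilityZones": zones,   # instances are spread across these zones
        "HealthCheckType": "EC2",     # replace instances failing EC2 status checks
        "HealthCheckGracePeriod": 300,  # seconds to wait after launch before checking
    }


request = build_asg_request(
    "web-fleet", "web-template", ["us-east-1a", "us-east-1b", "us-east-1c"], 3
)
# With boto3 installed and credentials configured, the request could be sent as:
#   import boto3
#   boto3.client("autoscaling").create_auto_scaling_group(**request)
```

Keeping the request construction separate from the API call makes the multi-zone requirement easy to check before anything is created.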
As winter storms threaten the East Coast, AWS customers can rest assured that our services and Availability Zones provide a solid foundation upon which to build a reliable application. Together, we can build highly available and resilient applications in the cloud, ready to weather the storm.