Testing AWS GameDay with the AWS Well-Architected Framework – Continued Remediation

Editor’s note: This is the third in a three-part series about testing AWS GameDay. Read Part 1 >> Read Part 2 >>

By Ian Scofield, Juan Villa, and Mike Ruiz, Partner Solutions Architects at AWS

This is the third post in our series documenting a project to fix issues with the AWS GameDay architecture by using tenets of the AWS Well-Architected Framework. See Part 1 of the series for an overview of the process, a description of the initial review, and a list of the critical findings identified by the review team.

In Part 2 of the series, we remediated the critical findings found in our initial review. We still have other recommendations that we identified to address in our roadmap, and it’s been six months since we remediated our findings. A lot has happened since then, with all the new services and features that were announced at AWS re:Invent 2017.

In this post, we’ll cover how we plan on remediating the deficiencies found in our Disaster Recovery plan, as well as other optimizations we’ve made due to recent announcements. We will also discuss how to address another crucial component in our application development—testing.

Learn more about AWS GameDay >>

Disaster Recovery Deficiencies

As you’ll recall from Part 2, we ran a “war game” for our Disaster Recovery plan, which focused on the scenario of losing control of a production account. In these tests, we discovered that our move to AWS CloudFormation to deploy resources gave us the ability to quickly provision identical infrastructure in other accounts and regions. However, we still had work to do to refine the exact steps in our playbooks and Standard Operating Procedures (SOPs). The most important piece that required attention was our strategy for data stored in Amazon DynamoDB. We didn’t have automatic backups, and the few manual backups we did have weren’t being copied to other AWS accounts due to the complexity involved.

Our initial plan to back up DynamoDB was to leverage AWS Data Pipeline, which has a built-in template that takes a snapshot of a table and stores it in Amazon Simple Storage Service (Amazon S3). This would allow us to back up our tables, but it had some downsides, too. Behind the scenes, Data Pipeline runs an Amazon EMR cluster that retrieves all the items from our table and writes them to Amazon S3. For this reason, it isn’t as quick as we’d like, and the abstraction layer makes it hard to debug if anything goes wrong.

Additionally, this approach has a performance impact on our tables, since the export performs a scan operation that consumes read capacity. Also, since this is for our Disaster Recovery scenario, our desired end state is for our backups to end up in a different account: the one we will be failing over to in the event of a disaster. This can be done with Data Pipeline, but it requires a lot of additional configuration and adds more complexity.
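To give a feel for why a full-table scan is taxing, here is a rough back-of-the-envelope estimate. It assumes DynamoDB's standard read metering (4 KB per read unit, with eventually consistent reads costing half a unit) and ignores per-page rounding. The table size and the capacity set aside for the backup job are hypothetical numbers, not figures from our environment.

```python
import math


def estimate_scan(table_size_bytes, rcu_budget_per_second, eventually_consistent=True):
    """Return (total read capacity units, seconds) for one full-table scan."""
    # Reads are metered in 4 KB units; an eventually consistent read
    # costs half a unit, a strongly consistent read costs a full unit.
    four_kb_units = math.ceil(table_size_bytes / 4096)
    total_rcu = four_kb_units * (0.5 if eventually_consistent else 1.0)
    # Time to finish if the scan may only consume the given budget per second.
    seconds = total_rcu / rcu_budget_per_second
    return total_rcu, seconds


# Hypothetical table: 10 GiB, with 200 RCU/s set aside for the backup job.
total_rcu, seconds = estimate_scan(10 * 1024**3, 200)
print(f"{total_rcu:,.0f} RCUs, about {seconds / 60:.0f} minutes")
# prints "1,310,720 RCUs, about 109 minutes"
```

Any capacity the scan consumes is capacity the live workload cannot use, which is why this kind of export competes with production traffic.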

We then considered using DynamoDB Streams to write to a DynamoDB table in our other account, but this doesn’t protect us from propagating errors, and it doesn’t provide us with point-in-time recovery. We had been brainstorming a solution to this problem for some time when, at AWS re:Invent 2017, we announced the ability to take backups of DynamoDB tables with the click of a button or a simple API call. This feature allows us to take encrypted backups of DynamoDB tables without a performance impact, and it doesn’t require us to provision any additional infrastructure. However, as of this post, the missing piece is that these backups are only available within the same AWS account and region, and they cannot be copied to another account or region.
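As an illustration of the "simple API call," the sketch below wraps the DynamoDB CreateBackup operation as exposed by the AWS SDK for Python (boto3). The helper name, the timestamped naming scheme, and the table name in the usage note are our own assumptions for illustration. The client is passed in so the function can be exercised without AWS credentials.

```python
from datetime import datetime, timezone


def backup_table(dynamodb_client, table_name):
    """Request an on-demand backup and return the ARN DynamoDB reports."""
    # Timestamped name (an assumed convention) so repeated backups of the
    # same table stay distinguishable.
    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    response = dynamodb_client.create_backup(
        TableName=table_name,
        BackupName=f"{table_name}-{stamp}",
    )
    return response["BackupDetails"]["BackupArn"]
```

With boto3 installed and credentials configured, this could be called as backup_table(boto3.client("dynamodb"), "gameday-example-table"), where the table name is a placeholder. The resulting backup stays in the same account and region as the table.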

The DynamoDB team also introduced Global Tables, multi-master DynamoDB tables that span AWS regions. Global Tables currently replicate only across regions within the same account, not across accounts. By itself, this wouldn’t meet our Disaster Recovery requirements, as replication is not a substitute for backup (e.g., data corruption would be replicated). But combining this approach with the Backup/Restore feature described above would complete our Disaster Recovery strategy once the cross-account gaps are closed. Because AWS continues to release new features and enhancements, we communicated the need for these capabilities to our account team and will keep checking to see whether they are implemented.

In the meantime, we needed to come up with a viable solution. We stumbled upon a continuous DynamoDB backup solution that’s published on the AWS Labs GitHub repo. This solution restores individual items and offers point-in-time recovery. We are in the process of evaluating this solution, which provides us with better granularity of our backups, but it still requires some modification to fulfill our cross-account requirement.

Based on the lessons learned from our “war game” and continued refinement of our SOPs and playbooks, we have implemented quarterly testing to ensure we are prepared and have the necessary tooling in place.

Amazon EC2 – M5 Instances

When we moved to an Amazon Elastic Container Service (Amazon ECS) cluster, we selected the M4 Amazon Elastic Compute Cloud (Amazon EC2) instance family for our nodes. We chose this family after careful consideration, benchmarking, and cost comparison to other EC2 instance types. In particular, this instance family was a good fit because the Docker containers running on the cluster were not CPU-bound, I/O-bound, or memory-bound, but rather needed a balance of all three.

After AWS re:Invent 2017, we were excited to hear about the introduction of M5, the next-generation general purpose instance family. In Jeff Barr’s blog post, he announced the M5 could achieve a better price-performance ratio than the M4 family, by as much as 14 percent. Needless to say, we quickly dove into the console and began testing the M5 instance family in our development environment. We determined M5 provided approximately a 10 percent performance improvement over our current configuration while keeping costs the same.

At the end of the day, migrating from the M4 to the M5 instance family meant we could run fewer EC2 instances in our Amazon ECS cluster to achieve equivalent performance, and therefore lower our operating costs. Keep in mind there was no guarantee the M5 family would enable us to run our workload more cost effectively, as every workload is different. Benchmarking and extensive testing allowed us to safely and methodically calculate the benefits and make the final decision to migrate.
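The arithmetic behind "fewer instances for equivalent performance" can be sketched as follows. It assumes throughput scales linearly with instance count and that the per-instance price is unchanged. The starting cluster size of 20 is a hypothetical figure, not our actual cluster size.

```python
def instances_needed(current_instances, improvement_percent):
    """Instances required for the same total throughput after a per-instance gain."""
    # Work in whole percentages so the ceiling division is exact.
    numerator = current_instances * 100
    denominator = 100 + improvement_percent
    return -(-numerator // denominator)  # ceiling division on integers


before = 20  # hypothetical M4 cluster size
after = instances_needed(before, 10)  # roughly 10 percent faster per instance
print(f"{before} M4 -> {after} M5 instances")
# prints "20 M4 -> 19 M5 instances"
```

Because instances come in whole units, the realized saving depends on cluster size: a very small cluster may not be able to drop an instance at all.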

AWS Fargate

Another exciting announcement from AWS re:Invent 2017 was the launch of AWS Fargate, a managed and easy-to-use service for deploying and managing containers. This announcement was particularly exciting because we were in charge of deploying and managing our own Amazon ECS cluster to run and scale our GameDay workload.

The deployment of the Amazon ECS cluster for the GameDay workload is currently automated using CloudFormation and Amazon EC2 Auto Scaling groups. While this approach drastically reduced our day-to-day burden of operating the cluster, we still saw an opportunity for further improvement, simplification, and cost savings by considering AWS Fargate. Naturally, we investigated and dug in further.

The first thing we considered is how AWS Fargate could eliminate the burden of operating an Amazon ECS cluster. Because it is a managed service, we don’t have to provision or operate the cluster’s EC2 instances; we simply tell AWS Fargate how to deploy our container and how many copies we need. This means less opportunity for operator error on our end, and easier management of our infrastructure during production events. This also translates to cost savings, since we need fewer highly skilled engineers at the helm when running GameDay events.
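To make "how to deploy our container and how many we need" concrete, here is a minimal sketch that builds the arguments for the Amazon ECS CreateService call with the Fargate launch type. The helper name and every identifier passed to it are hypothetical. The dictionary keys follow the ECS API as we understand it, and keeping tasks off public IPs is our assumption rather than a requirement.

```python
def fargate_service_request(cluster, service_name, task_definition,
                            desired_count, subnets, security_groups):
    """Build keyword arguments for the ECS CreateService call on Fargate."""
    return {
        "cluster": cluster,
        "serviceName": service_name,
        "taskDefinition": task_definition,  # what to run
        "desiredCount": desired_count,      # how many copies to keep running
        "launchType": "FARGATE",            # no EC2 container instances to manage
        "networkConfiguration": {
            # Fargate tasks use the awsvpc network mode, so each task
            # needs subnets and security groups.
            "awsvpcConfiguration": {
                "subnets": list(subnets),
                "securityGroups": list(security_groups),
                "assignPublicIp": "DISABLED",  # assumption: private subnets
            }
        },
    }


# Placeholder identifiers for illustration only.
request = fargate_service_request(
    "gameday-example", "scoreboard", "scoreboard-task:1", 4,
    ["subnet-00000000"], ["sg-00000000"],
)
print(request["launchType"], request["desiredCount"])
```

With boto3, the request could then be submitted as boto3.client("ecs").create_service(**request). Scaling for a larger event becomes a matter of changing the desired count rather than resizing a fleet of instances.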

In addition to operational simplicity, there is an opportunity for additional cost savings, as AWS Fargate’s billing model means we pay only for the running containers. For smaller events, we pay only for the containers that are running rather than for a full-sized Amazon ECS cluster that costs the same whether we are running a small game or a large game. It’s also worth noting that AWS Fargate allows for seamless scaling, meaning we can scale for larger events very easily without having to scale our Amazon ECS clusters manually, which can be error prone.
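The comparison between the two billing models can be sketched as below. It assumes Fargate charges per vCPU-hour and per GB-hour of running tasks, while a fixed cluster is charged per instance-hour regardless of how busy it is. All prices and sizes in the example are made-up round numbers, not published AWS rates, so substitute current pricing for your region before drawing conclusions.

```python
def fargate_event_cost(tasks, hours, vcpu_per_task, gb_per_task,
                       price_per_vcpu_hour, price_per_gb_hour):
    """Cost of running `tasks` containers for `hours` under per-task billing."""
    per_task_hour = (vcpu_per_task * price_per_vcpu_hour
                     + gb_per_task * price_per_gb_hour)
    return tasks * hours * per_task_hour


def cluster_event_cost(instances, hours, price_per_instance_hour):
    """Cost of a fixed-size cluster, independent of how many containers run."""
    return instances * hours * price_per_instance_hour


# Placeholder prices and sizes for illustration only.
small_game = fargate_event_cost(tasks=10, hours=4, vcpu_per_task=1, gb_per_task=2,
                                price_per_vcpu_hour=0.05, price_per_gb_hour=0.0125)
fixed_cluster = cluster_event_cost(instances=6, hours=4, price_per_instance_hour=0.20)
print(f"small game on Fargate: {small_game:.2f}, fixed cluster: {fixed_cluster:.2f}")
```

The per-task cost grows with event size while the fixed cluster cost does not, so a large enough game could favor the cluster. That is one reason we want to test before migrating.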

As of this post, we are planning our strategy for testing and ultimately migrating toward AWS Fargate rather than deploying and managing our current Amazon ECS cluster.

Testing

Based on the changes we have implemented thus far, there have been some significant architectural modifications. Our team had an event coming up and we knew it would be wise to run several tests to ensure—from an infrastructure and application perspective—everything was working and that we hadn’t introduced any new issues. We found minor items in our testing that we were able to remediate, and we felt confident everything would perform as expected. When it came time to run the actual event, we encountered errors that we previously hadn’t found in our testing.

We ran our tests at high volume to stress the application and discover as many edge cases as possible, but we didn’t reach the level of scale we would see in the actual event. At that scale, we experienced a hot key issue with DynamoDB, which can’t be easily remedied during an event because it involves changing our schema. We have since fixed this, and our next iteration of testing will cover not only our estimated numbers but also additional load, to ensure we have room for unexpected growth.
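For readers unfamiliar with hot keys, one commonly used schema change is write sharding: appending a calculated suffix to the partition key so that traffic for one logical key spreads across several physical keys. The sketch below illustrates that general technique and is not necessarily the change we made. The key names, the shard count, and the choice of hash are all assumptions.

```python
import hashlib


def sharded_partition_key(base_key, item_id, shard_count=10):
    """Map one logical key onto one of shard_count physical partition keys."""
    # A stable hash of the item id picks the shard, so the same item
    # always lands on the same physical key.
    digest = hashlib.sha256(item_id.encode("utf-8")).digest()
    shard = int.from_bytes(digest[:8], "big") % shard_count
    return f"{base_key}#{shard}"


def all_shard_keys(base_key, shard_count=10):
    """Every physical key a reader must query to reassemble the logical key."""
    return [f"{base_key}#{n}" for n in range(shard_count)]


# Hypothetical key and item ids for illustration.
key = sharded_partition_key("event-scores", "team-42")
print(key in all_shard_keys("event-scores"))  # prints "True"
```

The trade-off is on the read side: reassembling the logical key means querying every shard and merging the results, which is part of why this kind of fix is a schema change rather than something to attempt mid-event.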

It Was Worth It

Our team was initially apprehensive about performing a Well-Architected review, as we knew areas of our application weren’t pretty and we didn’t want to draw attention to them. We knew these areas needed work, but we were struggling to prioritize fixing them, and we focused on new features instead of paying down our existing technical debt. Looking back, however, we now have a much stronger architecture and can honestly say we feel more prepared for a variety of future scenarios. We will continue to work with our account team to ensure we keep following best practices. We will also keep an eye out for new features and enhancements that can solve some of our existing challenges.

What’s Next

GameDay has taken off internally as a training and enablement tool, thanks to the feedback we have received from customers. While our team has continued to iterate on the version of GameDay outlined in this post, we have also been building a multi-tenant SaaS platform that enables other teams within AWS to run GameDay and makes it more self-service. This new platform powered GameDay at AWS re:Invent 2017 and will power new versions of GameDay moving forward.

We hope to see you at the next event!