Testing AWS GameDay with the AWS Well-Architected Framework – Review
By Ian Scofield, Juan Villa, and Mike Ruiz, Partner Solutions Architects at AWS
AWS GameDay is an immersive, team-based event we have hosted at AWS Summits and re:Invent over the past few years. The event has teams of players settling into a challenging—and hopefully entertaining—scenario as DevOps leads at Unicorn.Rentals, a popular startup minutes away from the very public launch of a widely anticipated product. For more information, see the GameDay website.
Of course, we have a lot going on behind the scenes to make GameDay work. Beyond all the enthusiastic acting and silly props, you’ll find a complex AWS infrastructure that includes a live score tracking engine, a single-instance load generator capable of dynamically varying the load over the course of the game, and various command and control functions. Overall, the infrastructure is simplistic in design but complex to operate, with room for improvement by incorporating the same best practices we encourage players to adopt during the course of the game.
Today, in an attempt to improve the player experience (and our quality of life), we have invited a team of AWS Partner Solutions Architects to review the GameDay architecture against a standard benchmark: the AWS Well-Architected Framework. The review team will work to understand the details of our architecture, ask detailed questions about our design and intent, and then deliver a document with prioritized findings.
In this post, we’ll cover the initial architecture review and the findings delivered from the review team. In future posts, we will share the process of making improvements and our plans to refine our architecture through continuous improvement and collaboration with AWS Solutions Architects.
We began the review session by providing the review team with an architectural overview of GameDay, using diagrams and other collateral to highlight various components and relationships where appropriate. To help you follow along, here’s a summary of the high-level details we shared regarding the architecture of GameDay:
The GameDay infrastructure runs in a master AWS account, with each team having their own player AWS account, as shown in Figure 1. Various components in the master account serve load to player accounts, and host other services such as the scoreboard and cost calculator. The master account utilizes an IAM Cross-Account Role in each player account that gives it the required permissions to perform administrative tasks throughout the day.
Figure 1: Master – player account relationship
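To make the cross-account relationship concrete, here is a minimal sketch of the kind of trust policy a role in a player account could carry so that the master account can assume it. The account ID and the helper name `build_trust_policy` are assumptions for illustration only; the actual GameDay role and its permissions are not described in this post.

```python
import json


def build_trust_policy(master_account_id: str) -> dict:
    """Build an IAM trust policy that lets the master account assume a role.

    The policy would be attached to a role in a player account. The account
    ID passed in is a placeholder, not a real GameDay account.
    """
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                # Trust principals in the master account.
                "Principal": {"AWS": f"arn:aws:iam::{master_account_id}:root"},
                "Action": "sts:AssumeRole",
            }
        ],
    }


if __name__ == "__main__":
    # 111122223333 is a documentation-style placeholder account ID.
    print(json.dumps(build_trust_policy("111122223333"), indent=2))
```

The trust policy only controls who may assume the role. The administrative permissions themselves would live in a separate permissions policy attached to the same role.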
The master account has the following components:
- Cost calculator – To encourage players to take cost optimization into account, we charge players for their Amazon Elastic Compute Cloud (Amazon EC2) utilization (as in the real world!). The cost calculator includes three AWS Lambda functions that deduct points proportional to each team’s consumption.
- Amazon DynamoDB – We use several Amazon DynamoDB tables to hold team information, score information, generic game configuration values, and other supporting information that is used by the master account components.
- Load generator – This is the heart of the game implementation. It is made up of a single EC2 instance. The load generator controls the game and initiates administrative actions.
- When a player account is dynamically created, a notification is posted to an Amazon Simple Notification Service (Amazon SNS) topic in the master account. PHP scripts on the load generator then register and provision the account based on these SNS messages.
- The load generator runs one process per team that initiates connections to the infrastructure running in each player’s account.
- The number of messages delivered to player accounts is scaled by creating additional processes per team within this load generator instance.
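The cost calculator in the list above deducts points in proportion to EC2 consumption. As a rough sketch of that idea, the function below converts instance-hours into a point deduction. The rate table, the default rate, and the function names are hypothetical; the post does not describe how the three Lambda functions actually compute the deduction.

```python
from decimal import Decimal

# Hypothetical point cost per instance-hour, keyed by instance type.
POINTS_PER_INSTANCE_HOUR = {
    "t2.micro": Decimal("1"),
    "m4.large": Decimal("8"),
}
# Hypothetical rate applied to instance types missing from the table.
DEFAULT_RATE = Decimal("10")


def points_to_deduct(usage_hours: dict) -> Decimal:
    """Return the points to deduct for a team's EC2 usage.

    usage_hours maps an instance type to the hours consumed, so the
    deduction grows linearly with consumption.
    """
    total = Decimal("0")
    for instance_type, hours in usage_hours.items():
        rate = POINTS_PER_INSTANCE_HOUR.get(instance_type, DEFAULT_RATE)
        total += rate * Decimal(str(hours))
    return total


def apply_cost(score: Decimal, usage_hours: dict) -> Decimal:
    """Return the team's score after the usage-based deduction."""
    return score - points_to_deduct(usage_hours)
```

Using `Decimal` rather than floats keeps repeated deductions from accumulating rounding error over a full day of scoring.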
Figure 2 shows a high-level overview of the master account architecture:
Figure 2: Initial architecture
Once they understood the architecture, the review team began a deep dive and asked clarifying questions on the various components based on the questions in the appendix of the Well-Architected Framework whitepaper. In particular, they were very interested in manual operations (especially in the operation of the load generator), disaster recovery (specifically the recovery timing for assets lost before an event), and the security of the application as a whole (specifically the security of customer data and credentials). On the whole, the review was comprehensive and took approximately three hours to complete.
The review team consolidated the data and provided us with a written report that outlined the various findings. In addition, they provided us with notes and prioritized recommendations for each finding, which would serve as a starting point for us to develop our remediation plan.
Looking at GameDay through the lens of the Well-Architected Framework, it was obvious that there were many opportunities for improvement. The AWS review team prioritized the findings into two sets: critical and recommended. Most of the findings were classified as recommended—these don’t pose an immediate risk and will be incorporated into our roadmap. However, the three elements that were identified as critical needed to be addressed immediately.
Here’s the text of the findings from the review team:
SEC11. How Are You Managing Keys?
The legacy administrative scripts for GameDay use AWS access keys and secret access keys that are stored in plain text in an Amazon DynamoDB table.
The legacy administrative scripts require the use of an AWS access key and secret access key in order to interact with the AWS API on the player’s account, and do not support cross-account roles. Currently, these keys are being stored in plain text in an Amazon DynamoDB table, which the scripts query to retrieve the keys. AWS access keys and secret access keys are long-lived credentials that do not expire until they are explicitly revoked. Storing them in plain text increases the probability of the keys being compromised, and in the current design, any person with read access to the DynamoDB table (through the application or application administrative interface, indirectly via backups or logs, or directly via the Amazon DynamoDB API) can read and exploit the keys.
Modify the legacy administrative scripts to support cross-account roles in order to avoid the need to store and use AWS access keys and secret access keys.
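As an illustration of this recommendation, the sketch below requests short-lived credentials by assuming a role in a player account instead of reading stored keys. It is written in Python rather than the PHP used by the existing scripts. The role name, session name, and helper names are assumptions, and the STS client is passed in as a parameter so the logic can be exercised without AWS access.

```python
from dataclasses import dataclass
from typing import Any


@dataclass(frozen=True)
class TemporaryCredentials:
    """Short-lived credentials returned by AWS STS."""
    access_key_id: str
    secret_access_key: str
    session_token: str


def role_arn(account_id: str, role_name: str) -> str:
    """Build the ARN of an IAM role in the given account."""
    return f"arn:aws:iam::{account_id}:role/{role_name}"


def assume_player_role(
    sts_client: Any,
    account_id: str,
    role_name: str,
    session_name: str = "gameday-admin",  # hypothetical session name
    duration_seconds: int = 900,          # shortest duration STS accepts
) -> TemporaryCredentials:
    """Assume a role in a player account and return temporary credentials.

    sts_client is expected to behave like boto3.client("sts").
    """
    response = sts_client.assume_role(
        RoleArn=role_arn(account_id, role_name),
        RoleSessionName=session_name,
        DurationSeconds=duration_seconds,
    )
    creds = response["Credentials"]
    return TemporaryCredentials(
        access_key_id=creds["AccessKeyId"],
        secret_access_key=creds["SecretAccessKey"],
        session_token=creds["SessionToken"],
    )
```

With boto3 installed, a call might look like `assume_player_role(boto3.client("sts"), "444455556666", "GameDayAdmin")`, where both the account ID and the role name are placeholders. Because the returned credentials expire on their own, nothing long-lived needs to be written to DynamoDB.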
REL 7. How Are You Planning for Disaster Recovery?
There is no clearly defined disaster recovery plan, recovery point objective (RPO), or recovery time objective (RTO). Additionally, because there is no plan, recovery cannot be periodically tested against an RPO or RTO.
GameDay was originally conceived as a set of instructions players would iteratively execute in a minimally configured account. As tooling and additional features were added over time, the GameDay team has not stepped back to consider the entire stack and how to protect it from accidental, malicious, or environmental faults. Although it’s just a game, GameDay customers invest a whole day to attend and deserve as good an experience as can be delivered; having to scramble to invent a recovery process in the run-up to an event or, worse, in the middle of a live game would be a bad experience for all involved.
- Define a disaster recovery plan, including RPO and RTO.
- Periodically test the plan against the defined objectives.
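To show what testing a plan against its objectives could look like, here is a small sketch that scores the timestamps from one recovery drill against an RPO and an RTO. The class names, field names, and the example values in the usage note are all assumptions; the review did not prescribe specific objectives or a test procedure.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class RecoveryObjectives:
    rpo: timedelta  # maximum tolerable window of lost data
    rto: timedelta  # maximum tolerable time to restore service


@dataclass(frozen=True)
class DrillResult:
    last_good_backup: datetime  # most recent recoverable state
    failure_time: datetime      # when the simulated fault occurred
    service_restored: datetime  # when the service was usable again


def evaluate_drill(objectives: RecoveryObjectives, drill: DrillResult) -> dict:
    """Compare one drill's measured data loss and downtime with the objectives."""
    data_loss_window = drill.failure_time - drill.last_good_backup
    downtime = drill.service_restored - drill.failure_time
    return {
        "data_loss_window": data_loss_window,
        "downtime": downtime,
        "rpo_met": data_loss_window <= objectives.rpo,
        "rto_met": downtime <= objectives.rto,
    }
```

For example, with a one-hour RPO and a 30-minute RTO, a drill whose last backup was 45 minutes before the fault and whose recovery took 45 minutes would meet the RPO but miss the RTO. Recording these results after each drill gives a simple history of whether the plan still holds.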
REL 2. How Does Your System Withstand Component Failures?
Currently the load generator is a single instance in a single Availability Zone, and no recovery options have been configured.
If this load generator instance were to fail or become unavailable either due to a hardware fault or in the (unlikely) event of an Availability Zone failure, the game would no longer be able to continue, because there is no automated process to recover the failed node. The load generator is currently not in an Auto Scaling group, nor does it have EC2 instance recovery configured. Additionally, the instance has been configured manually and doesn’t contain all the necessary settings and scripts. Lastly, all state is stored locally on the instance and will need to be moved off the instance when implementing a multi-instance architecture. Storing state externally will also alleviate the issue of losing state in the event of an instance failure.
- Implement an EC2 Auto Scaling group with a launch configuration, based on an Amazon Machine Image (AMI) that contains all necessary components. Optionally, you can use user data to pull down the necessary components at launch.
- Configure your Auto Scaling group to span multiple Availability Zones to increase the resiliency and fault tolerance of your architecture.
- Make your instances stateless to reduce the chance of losing information in the event of a failure.
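The first two recommendations can be sketched as a request for an Auto Scaling group that spans several Availability Zones. The function below only builds the parameters; the group name, launch configuration name, and zone names are placeholders, and fixing the group size at one instance is an assumption that suits a load generator that is not yet stateless.

```python
from typing import Any, Dict, List


def build_load_generator_asg_request(
    group_name: str,
    launch_configuration_name: str,
    availability_zones: List[str],
) -> Dict[str, Any]:
    """Build parameters for an Auto Scaling group that replaces a failed instance.

    The launch configuration is assumed to reference an AMI that already
    contains every component the load generator needs.
    """
    if len(availability_zones) < 2:
        raise ValueError("Provide at least two Availability Zones")
    return {
        "AutoScalingGroupName": group_name,
        "LaunchConfigurationName": launch_configuration_name,
        # A fixed size of one keeps a single load generator running while
        # letting Auto Scaling relaunch it, possibly in another zone.
        "MinSize": 1,
        "MaxSize": 1,
        "DesiredCapacity": 1,
        "AvailabilityZones": list(availability_zones),
        "HealthCheckType": "EC2",
        "HealthCheckGracePeriod": 300,  # seconds to wait before health checks
    }
```

With boto3 installed, the result could be passed to `boto3.client("autoscaling").create_auto_scaling_group(**request)`. Raising the size above one only makes sense once state has been moved off the instance, as the third recommendation suggests.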
Now that the review team has given us this feedback and the list of critical items that need to be resolved, we need to construct our remediation plan to correct these deficiencies. In our next blog post, we’ll go through this remediation plan and explain in depth how we plan to correct these items to improve the security and reliability of the GameDay application.