AWS for Games Blog
How The Pokémon Company International Orchestrates Zero-Downtime Deployments for Pokémon TCG Live – Part 2
This blog is co-authored by Kylie Yamamoto, DevOps Engineer at TPCi, and Sarath Kumar Kallayil Sreedharan and Jackie Jiang from AWS.
Introduction
Part 1 of this two-part blog series delved into the architectural design considerations that underpin Pokémon TCG Live. Part 2 will explain how zero-downtime deployment has been designed and implemented to ensure seamless gameplay.
Zero-downtime deployments
AWS Step Functions was the main orchestration tool that managed the entire zero-downtime deployment. The deployment was built as a state machine, where each state (represented by a box in the preceding image) can be a task, wait, or choice type. Each state saved its output to a specified ResultPath variable that was passed along as input to the next state. Following is a description of the state types:
- Task: In this implementation, all task states invoke a Lambda that performs an action like updating AWS resources or running status/health checks.
- Wait: Pauses the Step Function for a determined amount of time, used when waiting for ECS services to spin up/down.
- Choice: Evaluates output from previous states and determines which state to proceed to next.
- End states: Either a “succeed” or “failed” state indicates the result of the Step Function execution.
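As a concrete illustration of how a task state and ResultPath chaining fit together, the following is a minimal sketch of a task-state Lambda handler; the field names and the ResultPath value mentioned in the comments are assumptions for illustration, not the actual TPCi implementation.

```python
# A minimal sketch (not the actual TPCi code) of a task-state Lambda and how
# its return value flows to later states via ResultPath.
def handler(event, context):
    # 'event' carries the accumulated execution state, such as the deployment
    # parameters marshaled by earlier states. Field names here are illustrative.
    target_color = event.get("target_color", "green")

    # Whatever is returned is merged into the execution state under this
    # state's ResultPath (for example "$.deploy_result"), so a downstream
    # Choice state can branch on fields such as "status".
    return {"status": "deployed", "target_color": target_color}
```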
In addition to the typical parameters necessary for a deployment (for example, target environment, new service version, and region), the following parameters are specific to the ZDD pipeline and are passed in from the CI/CD job:
switch_active_cluster
- When set to true (the default), active traffic is switched from the currently active cluster to the newly deployed (previously inactive) cluster.
- If set to false, the new service version is deployed to the inactive cluster and active traffic is not switched. This is used when a new version needs additional testing prior to receiving active public traffic.
shutdown_inactive_cluster
- Controls whether the inactive (old) cluster is shut down after the deployment has finished and active traffic has been switched; the default is true.
- If set to false, the old cluster remains running. This is used when testers want to quickly switch active traffic between two different versions, or when there is a need to intentionally roll back to a previous version.
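For illustration, the following sketch shows how a CI/CD job could start the ZDD state machine with these parameters; the state machine ARN and every field other than switch_active_cluster and shutdown_inactive_cluster are placeholder assumptions.

```python
# A hedged sketch of starting the ZDD state machine from a CI/CD job.
import json

import boto3

sfn = boto3.client("stepfunctions")

sfn.start_execution(
    stateMachineArn="arn:aws:states:us-west-2:123456789012:stateMachine:zdd-pipeline",  # placeholder
    input=json.dumps(
        {
            "environment": "staging",           # target environment (illustrative)
            "service_version": "2.0.0",         # new service version (illustrative)
            "region": "us-west-2",              # deployment region (illustrative)
            "switch_active_cluster": True,      # deploy and shift traffic to the new cluster
            "shutdown_inactive_cluster": True,  # drain and stop the old cluster afterwards
        }
    ),
)
```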
Example zero-downtime deployment
This next section will walk through the state machine steps and how the resources are updated during a typical zero-downtime deployment.
1. Initial state
- With a typical zero-downtime deployment, the parameters are set to switch_active_cluster = true and shutdown_inactive_cluster = true.
- For the initial state in this example, version 1 is currently running in the blue cluster, as noted in the Parameter Store values and the cluster status.
2. Setup_parameters
- The Lambda marshals together the user-provided inputs and sets any default values.
3. Deploy_new_infrastructure
- The new green cluster resources are created/updated with the new service version, and the Parameter Store value is updated via Lambda.
4. Is_new_cluster_healthy, wait_for_cluster, check_new_cluster
- This loop waits until the green cluster health checks come back as healthy.
- The check_new_cluster Lambda uses API calls to check service health (a sketch of this kind of check follows the walkthrough).
5. Integration_tests, check_test_results
- Optional integration tests are run after the cluster is healthy.
- check_test_results ends the deployment if the tests fail and sends the deployment to error_handling/deploy_failed to marshal together the failure results.
- If the parameter switch_active_cluster were set to false and the integration tests passed, the deployment would be marked as complete and successful.
6. Switch_blue_green
- The Parameter Store value for current_active_color is updated to the new color.
- The App Mesh routes are updated to point active traffic to the new cluster (a sketch of this step also follows the walkthrough).
- Connected game clients on the blue cluster receive a pop-up message to reconnect. Clients only receive this message after they have completed games/purchases.
7. Shutdown_inactive_cluster
- If shutdown_inactive_cluster is set to true, the Step Function goes through the is_old_cluster_empty loop to wait for old connections to drain.
- If shutdown_inactive_cluster were set to false, the deployment would be marked as complete at this step and the old (blue) cluster would be left running.
8. Is_old_cluster_empty, wait_for_old_connections, check_old_connections
- While in this loop, the blue cluster is still running because connected players may be completing games/purchases.
- The check_old_connections Lambda uses an API call with the version 1 value set in the header to check the count of connected players on the blue (inactive) cluster.
- The loop exits once old connections have drained or a timeout is reached.
9. Cleanup_old_cluster
- ECS tasks in the blue cluster are shut down and the service node count is set to zero.
- After the old cluster is shut down, the zero-downtime deployment is complete.
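Step 4 above notes that the check_new_cluster Lambda uses API calls to check service health. The sketch below shows one plausible form of such a check, assuming the health criterion is that every ECS service in the green cluster has as many running tasks as desired; the input field names and criteria are assumptions rather than the actual TPCi implementation.

```python
# A minimal check_new_cluster-style health check (assumed implementation).
import boto3

ecs = boto3.client("ecs")

def handler(event, context):
    cluster = event["new_cluster_name"]   # assumed input field
    services = event["service_names"]     # assumed input field

    described = ecs.describe_services(cluster=cluster, services=services)["services"]
    healthy = all(
        svc["runningCount"] == svc["desiredCount"] and svc["runningCount"] > 0
        for svc in described
    )

    # The is_new_cluster_healthy Choice state can branch on this flag to
    # either exit the loop or return to wait_for_cluster.
    return {"healthy": healthy}
```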
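Step 6 updates the current_active_color parameter and repoints the App Mesh routes. The following sketch shows one way a switch_blue_green-style Lambda could do this with the AWS SDK; the mesh, router, route, and parameter names are placeholders, not the actual TPCi resources.

```python
# A hedged sketch of a switch_blue_green-style Lambda.
import boto3

ssm = boto3.client("ssm")
appmesh = boto3.client("appmesh")

def handler(event, context):
    new_color = event["target_color"]  # for example, "green"

    # Record the new active color in Parameter Store as the source of truth.
    ssm.put_parameter(
        Name="/tcg-live/current_active_color",  # placeholder parameter name
        Value=new_color,
        Type="String",
        Overwrite=True,
    )

    # Shift all active traffic to the new cluster's virtual node.
    appmesh.update_route(
        meshName="tcg-live-mesh",          # placeholder mesh name
        virtualRouterName="game-router",   # placeholder virtual router name
        routeName="game-route",            # placeholder route name
        spec={
            "httpRoute": {
                "match": {"prefix": "/"},
                "action": {
                    "weightedTargets": [
                        {"virtualNode": f"game-{new_color}", "weight": 100}
                    ]
                },
            }
        },
    )

    return {"active_color": new_color}
```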
A note on interfacing with Terraform
One of the hurdles the team had to overcome with using Step Functions and Lambda to orchestrate zero-downtime deployments was avoiding conflicts with the Infrastructure as Code (IaC) tooling, Terraform. Typically, when the pipeline changed a resource outside of Terraform, the next Terraform run would try to revert or recreate that resource. This could create downtime as services were spun up/down and would result in a poor user experience. To mitigate this, parameters were created in the Systems Manager Parameter Store that acted as a source of truth for updating resources, whether through Terraform or the Step Function ZDD pipeline.
This introduced the issue of new Task Definition revisions being created by either Terraform or the Step Function pipeline, so the ECS service always had to use the newest revision of the ECS Task Definition. To achieve this, the Task Definitions were created in Terraform but were also read back as data sources. This allowed Terraform to find the maximum revision (latest version) for the Task Definition family, whether that revision came from the Terraform-managed Task Definition or from the data source, which reflects the Task Definition registered by the Step Function.
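The snippet below sketches this pattern for the ECS service resource in Terraform; the resource, family, and cluster names are illustrative rather than the actual TPCi configuration.

```hcl
# Illustrative names; the pattern is the standard "use the newest revision" approach.
resource "aws_ecs_cluster" "game" {
  name = "game-green"
}

resource "aws_ecs_task_definition" "game" {
  family                = "game-service"
  container_definitions = file("task-definitions/game-service.json")
  # ... CPU, memory, and other settings omitted
}

# The data source resolves the latest registered revision for the family,
# including revisions registered outside Terraform by the Step Function pipeline.
data "aws_ecs_task_definition" "game" {
  task_definition = aws_ecs_task_definition.game.family
}

resource "aws_ecs_service" "game" {
  name    = "game-service"
  cluster = aws_ecs_cluster.game.id

  # Use whichever revision is newest: the Terraform-managed Task Definition or
  # the one registered by the zero-downtime deployment pipeline.
  task_definition = "${aws_ecs_task_definition.game.family}:${max(aws_ecs_task_definition.game.revision, data.aws_ecs_task_definition.game.revision)}"

  # ... desired_count, launch type, and network configuration omitted
}
```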
Future considerations
While the current setup for zero-downtime deployments has worked without issue, there are considerations for future iterations of the pipeline. Currently, there is a separate Step Function execution for each of the two regions the game is deployed to. These could be consolidated and managed by one Step Function that runs parallel executions, updating all regions in a more unified manner.
Another consideration is implementing a message-based approval system where the Step Function waits for approval before switching active traffic. After the Step Function brings up the new cluster, test engineers can spend additional time testing against the new cluster as needed. Once validation is complete, the team can signal the Step Function to proceed with the deployment to switch traffic and shut down the old cluster.
Additionally, the rollback procedure could be improved. After traffic has switched and before the old cluster is shut down, the new cluster's metrics can be monitored to verify that it remains healthy while receiving active public traffic. If the new cluster is not healthy or performing as expected, traffic can automatically be switched back to the old cluster. The current workaround is to roll forward and run another full zero-downtime deployment using the old version; automating rollback within the pipeline would reduce the time it takes to roll back.
Lastly, TPCi has been evaluating other control plane options such as Amazon ECS Service Connect and Amazon VPC Lattice, as well as third-party options such as Istio and Cilium.
Conclusion
This blog described how TPCi leveraged App Mesh, ECS, and Step Functions with Lambda to perform zero-downtime deployments for Pokémon TCG Live. The pipeline has been successfully implemented in the production environment and has run without issues. The Step Function and Lambda have proven reliable and behave consistently across all deployments. The testing experience has significantly improved, as validation tests can be performed without disrupting active traffic. And last, but certainly not least, this has improved the overall user experience. Delighting customers is one of the six core values at TPCi, and achieving seamless, zero-downtime deployments exemplifies that commitment to customers.
Explore Amazon ECS
Learn more about AWS Lambda
Learn more about AWS Step Functions