AWS for Games Blog
How The Pokémon Company International Orchestrates Zero-Downtime Deployments for Pokémon TCG Live – Part 1
This blog is co-authored by Kylie Yamamoto, DevOps Engineer at TPCi, and Sarath Kumar Kallayil Sreedharan and Jackie Jiang from AWS.
Introduction
Pokémon is one of the most popular and successful entertainment franchises in the world, encompassing video games, mobile apps, the Pokémon Trading Card Game (TCG), animation and movies, Play! Pokémon competitive events, and licensed products. First established in Japan in 1996 with the launch of the Pokémon Red and Pokémon Green video games for the Game Boy system, the world of Pokémon has since connected people across the globe and is beloved by kids, adults, and every Trainer in between!
The Pokémon Company International (TPCi), a subsidiary of The Pokémon Company in Japan, manages the property outside of Asia with a mission to delight its fans through excellent products and meaningful experiences. In June 2023, TPCi officially launched Pokémon TCG Live, an app that allows Trainers to enjoy the Pokémon TCG in an updated digital format. The free-to-play game also marks the first time the Pokémon TCG is playable across iOS, Android, macOS, and Windows devices.
This blog post describes how TPCi uses Amazon ECS, AWS App Mesh, AWS Step Functions, and AWS Lambda to achieve seamless zero-downtime deployments. Instead of experiencing eight hours of downtime every two weeks, fans now simply click a reconnect message to access the latest content.
Initial architecture and challenges
The initial architecture for the game followed a common microservices setup on AWS. Amazon Route 53 directed traffic to an Application Load Balancer (ALB) that pointed to a set of microservices hosted in Amazon ECS and running on Amazon EC2 instances. In this design, an Envoy front proxy served as the ingress gateway, managing traffic to containerized microservices such as account, user, inventory, and commerce. Each microservice ran an Envoy sidecar alongside its service container.
Since Pokémon TCG Live was still in development and not yet handling live traffic, this architecture met the developers' needs. However, TPCi recognized that once the game was released it would need more fine-grained control over its network connections to allow for canary and blue/green deployments. Additionally, the configuration for the Envoy front proxy was managed manually in a separate code repository and deployed via AWS CloudFormation, which made adding or removing microservices cumbersome and time-consuming. TPCi's aim was to find a solution that integrated seamlessly with a Continuous Integration/Continuous Delivery (CI/CD) pipeline while using Terraform as its Infrastructure as Code (IaC) tool.
The solution
To facilitate zero-downtime deployments (ZDD), the solution replaced the Envoy front proxy with AWS App Mesh as the control plane. AWS App Mesh is a service mesh that provides application-level networking, making it easier to monitor and control communication across services. This allowed AWS Step Functions to orchestrate each zero-downtime deployment, relying on a set of AWS Lambda functions to update the ECS and App Mesh resources. Using Lambda within Step Functions provided a custom approach to managing these unique resources during a zero-downtime deployment, and the solution granted TPCi complete control over its traffic and network connections through App Mesh.
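The detailed workflow is covered in Part 2, but as a rough, hypothetical sketch of the pattern, a Step Functions state machine defined in Terraform could chain Lambda task states that update the ECS services and App Mesh routes. The state machine name, Lambda function names, and IAM role below are placeholders, not TPCi's actual resources.

resource "aws_sfn_state_machine" "zdd" {
  name     = "zdd-orchestrator"        # hypothetical name
  role_arn = aws_iam_role.sfn_exec.arn # assumed IAM role

  definition = jsonencode({
    Comment = "Sketch of a zero-downtime deployment workflow"
    StartAt = "UpdateEcsServices"
    States = {
      UpdateEcsServices = {
        Type     = "Task"
        Resource = aws_lambda_function.update_ecs_services.arn   # hypothetical Lambda
        Next     = "ShiftAppMeshTraffic"
      }
      ShiftAppMeshTraffic = {
        Type     = "Task"
        Resource = aws_lambda_function.update_appmesh_routes.arn # hypothetical Lambda
        End      = true
      }
    }
  })
}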
The new ECS architecture remained similar to the infrastructure before ZDD was implemented, with two key exceptions. First, there were two ECS services for each microservice, named <service>-blue and <service>-green. This allowed two versions to run side by side so that a new service version could be tested and players could be moved from one to the other; traffic was routed to the respective blue or green service based on a header within the API call. The other major change was that the container stack was moved to AWS Fargate, which reduced the overhead of managing additional infrastructure.
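A minimal Terraform sketch of one such blue/green pair on Fargate follows; the microservice name, desired counts, and referenced resources (cluster, task definitions, subnets, security group, Cloud Map service) are illustrative assumptions rather than TPCi's actual configuration.

resource "aws_ecs_service" "account_blue" {
  name            = "account-blue"
  cluster         = aws_ecs_cluster.main.id
  task_definition = aws_ecs_task_definition.account_blue.arn
  desired_count   = 2
  launch_type     = "FARGATE"

  network_configuration {
    subnets         = var.private_subnet_ids
    security_groups = [aws_security_group.services.id]
  }

  # Tasks register in AWS Cloud Map so App Mesh can discover them
  service_registries {
    registry_arn = aws_service_discovery_service.account.arn
  }
}

resource "aws_ecs_service" "account_green" {
  name            = "account-green"
  cluster         = aws_ecs_cluster.main.id
  task_definition = aws_ecs_task_definition.account_green.arn
  desired_count   = 2
  launch_type     = "FARGATE"

  network_configuration {
    subnets         = var.private_subnet_ids
    security_groups = [aws_security_group.services.id]
  }

  service_registries {
    registry_arn = aws_service_discovery_service.account.arn
  }
}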
Architecture overview
1 – Envoy Ingress Gateway
The default AWS App Mesh Envoy image was used for all Envoy containers. After traffic was routed to the appropriate Region, the Application Load Balancer pointed directly to the Envoy ingress gateway tasks. This ingress gateway served as the entry point into the mesh and was a single ECS service managing a set of tasks. Each task ran an Envoy container with its environment variables configured so that the task acted as an App Mesh virtual gateway.
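As an illustration only (resource names, ports, and the Envoy image tag are assumptions), the virtual gateway and its ingress task definition could be expressed in Terraform roughly as follows; IAM roles and other required plumbing are omitted for brevity.

resource "aws_appmesh_virtual_gateway" "ingress" {
  name      = "ingress-gw"
  mesh_name = aws_appmesh_mesh.main.id

  spec {
    listener {
      port_mapping {
        port     = 8080
        protocol = "http"
      }
    }
  }
}

# Each ingress task runs only the App Mesh Envoy image; APPMESH_RESOURCE_ARN
# tells Envoy which virtual gateway it represents.
resource "aws_ecs_task_definition" "ingress_gateway" {
  family                   = "envoy-ingress-gateway"
  requires_compatibilities = ["FARGATE"]
  network_mode             = "awsvpc"
  cpu                      = 512
  memory                   = 1024

  container_definitions = jsonencode([
    {
      name         = "envoy"
      image        = "public.ecr.aws/appmesh/aws-appmesh-envoy:v1.27.2.0-prod" # pin a current tag
      essential    = true
      portMappings = [{ containerPort = 8080, protocol = "tcp" }]
      environment = [
        { name = "APPMESH_RESOURCE_ARN", value = aws_appmesh_virtual_gateway.ingress.arn }
      ]
    }
  ])
}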
2 – AWS App Mesh
The virtual gateway used a set of gateway routes to manage incoming traffic to the mesh. The gateway routes determined where to send traffic based on a URL prefix; for example, traffic with a prefix of /account/ was routed to the account virtual service. Each virtual service had its own gateway route.
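A hedged Terraform sketch of one such gateway route, using the /account/ prefix mentioned above (surrounding resource names are placeholders):

resource "aws_appmesh_gateway_route" "account" {
  name                 = "account-gateway-route"
  mesh_name            = aws_appmesh_mesh.main.id
  virtual_gateway_name = aws_appmesh_virtual_gateway.ingress.name

  spec {
    http_route {
      match {
        prefix = "/account/"
      }

      action {
        target {
          virtual_service {
            virtual_service_name = aws_appmesh_virtual_service.account.name
          }
        }
      }
    }
  }
}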
From the AWS App Mesh documentation, an App Mesh virtual service “is an abstraction of a real service that is provided by a virtual node directly or indirectly by means of a virtual router.” The virtual service was used to route all traffic destined for one microservice. For each App Mesh virtual service, a virtual router managed three separate routes that sent traffic to the appropriate virtual node. The virtual router was where traffic began to split: requests carrying a color header were routed to that color’s virtual node, while the active route was evaluated when no header was present and sent traffic to the active cluster.
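As a rough Terraform sketch of this layering for a single microservice (the header name, route names, and priorities are assumptions; the blue route is analogous to the green one and omitted):

resource "aws_appmesh_virtual_service" "account" {
  name      = "account.myapp.internal"
  mesh_name = aws_appmesh_mesh.main.id

  spec {
    provider {
      virtual_router {
        virtual_router_name = aws_appmesh_virtual_router.account.name
      }
    }
  }
}

resource "aws_appmesh_virtual_router" "account" {
  name      = "account-router"
  mesh_name = aws_appmesh_mesh.main.id

  spec {
    listener {
      port_mapping {
        port     = 8080
        protocol = "http"
      }
    }
  }
}

# Requests carrying the color header are pinned to that color's virtual node.
resource "aws_appmesh_route" "account_green" {
  name                = "account-green-route"
  mesh_name           = aws_appmesh_mesh.main.id
  virtual_router_name = aws_appmesh_virtual_router.account.name

  spec {
    priority = 1 # evaluated before the catch-all active route

    http_route {
      match {
        prefix = "/"
        header {
          name = "x-deployment-color" # hypothetical header name
          match {
            exact = "green"
          }
        }
      }

      action {
        weighted_target {
          virtual_node = aws_appmesh_virtual_node.account_green.name
          weight       = 100
        }
      }
    }
  }
}

# Requests without the header fall through to whichever color is currently active.
resource "aws_appmesh_route" "account_active" {
  name                = "account-active-route"
  mesh_name           = aws_appmesh_mesh.main.id
  virtual_router_name = aws_appmesh_virtual_router.account.name

  spec {
    priority = 10

    http_route {
      match {
        prefix = "/"
      }

      action {
        weighted_target {
          virtual_node = aws_appmesh_virtual_node.account_blue.name
          weight       = 100
        }
      }
    }
  }
}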
The last piece of the App Mesh setup was the virtual node. An App Mesh virtual node “acts as a logical pointer to a particular task group, such as an Amazon ECS service or a Kubernetes Deployment.” For this implementation, each ECS task contained an Envoy sidecar with its environment variables configured so that it acted as a virtual node within the specified mesh.
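A hedged sketch of the task-definition side of this wiring, showing the Envoy sidecar and the App Mesh proxy configuration (image tags, ports, and resource names are illustrative; the required IAM roles are omitted):

resource "aws_ecs_task_definition" "account_green" {
  family                   = "account-green"
  requires_compatibilities = ["FARGATE"]
  network_mode             = "awsvpc"
  cpu                      = 512
  memory                   = 1024

  # Redirects the application container's traffic through the Envoy sidecar
  proxy_configuration {
    type           = "APPMESH"
    container_name = "envoy"
    properties = {
      AppPorts         = "8080"
      EgressIgnoredIPs = "169.254.170.2,169.254.169.254"
      IgnoredUID       = "1337"
      ProxyEgressPort  = "15001"
      ProxyIngressPort = "15000"
    }
  }

  container_definitions = jsonencode([
    {
      name         = "account"
      image        = "example.com/account:1.2.3" # placeholder application image
      essential    = true
      portMappings = [{ containerPort = 8080, protocol = "tcp" }]
    },
    {
      name      = "envoy"
      image     = "public.ecr.aws/appmesh/aws-appmesh-envoy:v1.27.2.0-prod" # pin a current tag
      essential = true
      user      = "1337"
      environment = [
        # Tells the sidecar which App Mesh virtual node it represents
        { name = "APPMESH_RESOURCE_ARN", value = aws_appmesh_virtual_node.account_green.arn }
      ]
    }
  ])
}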
3 – AWS Cloud Map
AWS Cloud Map is a cloud resource discovery service: custom names can be defined for application resources, and the service keeps track of the location of these dynamically changing resources. The virtual nodes used AWS Cloud Map for service discovery. As new ECS tasks were created (regardless of their associated color), they were registered in AWS Cloud Map. The services used the internal namespace and the ECS service name, creating an endpoint like account.myapp.internal. To match the correct virtual node color to its respective tasks, AWS Cloud Map service discovery provided the option to filter instances based on custom metadata attributes. Instances were filtered on the ECS_TASK_DEFINITION_FAMILY attribute, since the task definition families were named with their respective colors.
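An illustrative Terraform sketch of that wiring, with placeholder names, might look like the following; the ECS_TASK_DEFINITION_FAMILY attribute is registered automatically by ECS when tasks join the Cloud Map service.

resource "aws_service_discovery_private_dns_namespace" "internal" {
  name = "myapp.internal"
  vpc  = var.vpc_id
}

resource "aws_service_discovery_service" "account" {
  name = "account" # yields account.myapp.internal

  dns_config {
    namespace_id   = aws_service_discovery_private_dns_namespace.internal.id
    routing_policy = "MULTIVALUE"

    dns_records {
      ttl  = 10
      type = "A"
    }
  }

  health_check_custom_config {
    failure_threshold = 1
  }
}

# The virtual node only selects Cloud Map instances whose task definition
# family matches its color.
resource "aws_appmesh_virtual_node" "account_green" {
  name      = "account-green"
  mesh_name = aws_appmesh_mesh.main.id

  spec {
    listener {
      port_mapping {
        port     = 8080
        protocol = "http"
      }
    }

    service_discovery {
      aws_cloud_map {
        namespace_name = aws_service_discovery_private_dns_namespace.internal.name
        service_name   = aws_service_discovery_service.account.name

        attributes = {
          ECS_TASK_DEFINITION_FAMILY = "account-green"
        }
      }
    }
  }
}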
To be continued
In Part 2 of this two-part blog series, we’re going to dive deep into how TPCi has achieved zero-downtime deployment for Pokémon TCG Live by leveraging native AWS services.