AWS for Games Blog

How Singularity 6’s ‘Palia’ Conquered Cross-Regional Gaming with Amazon EKS and Karpenter

Introduction

Developing an online multiplayer game capable of supporting thousands of concurrent players is an incredibly complex endeavor. Singularity 6 (S6), creators of the hit game “Palia,” which released across platforms in 2023 and 2024, needed to craft an immersive and captivating gaming experience while managing the vast resources and infrastructure required to keep the game’s services running smoothly.

Throughout the development process, unforeseen obstacles and hurdles arose, testing the team’s resilience and problem-solving abilities. Furthermore, launching globally across console and PC platforms introduced an additional layer of complexity, requiring careful planning and execution to ensure a seamless experience for players worldwide. To multiply the impact of their small infrastructure team, S6 leveraged the flexibility and elasticity of Amazon Elastic Kubernetes Service (Amazon EKS) and Karpenter while minimizing cost, allowing Palia to reach over 3 million players in six months, across three AWS Regions and four distribution platforms.

This article describes the process the team at Singularity 6 used to make the Palia launch smooth and successful.

Overview of Palia’s AWS infrastructure

Early on, Singularity 6 made the decision to host Palia on Amazon Web Services (AWS). As an online multiplayer game, the team understood that Palia needed a “cloud-first” approach. Today, Palia runs on Amazon EKS in three AWS Regions: US West (Oregon) – us-west-2, Asia Pacific (Tokyo) – ap-northeast-1, and Europe (Frankfurt) – eu-central-1.

Palia’s backend services run inside an EKS cluster, including the game servers powered by Unreal Engine. For persisting player state across a shared world, Palia relies on ScyllaDB as the database layer. ScyllaDB is highly scalable and designed from the start to be distributed and fast, trading strong consistency for speed with an eventually consistent model. Palia’s external traffic is routed through a geo-load-balanced Domain Name System (DNS) address to a Network Load Balancer (NLB).

Figure: Distributed game server architecture across three AWS Regions, with the Americas using us-west-2, Europe using eu-central-1, and Asia using ap-northeast-1. In each Region, game clients connect to EKS-hosted stateful game servers, stateless microservices, and a database cluster, allowing game services to scale regionally for low latency.

Releasing Palia

During Palia’s development, its cloud infrastructure was built entirely in a single AWS Region: US West (Oregon). Singularity 6 knew that the studio would eventually need to release Palia in other Regions to reduce latency and improve the player experience. Load testing their system was also necessary to ensure a smooth rollout, but there was concern that the alpha tests did not have enough players to accurately simulate launch traffic.

The team used this opportunity to clone part of their production cloud infrastructure to Frankfurt. As the launch window approached, Palia released in the US first while Frankfurt served as the load and stress testing cluster, and the team ran a copy of ScyllaDB there to evaluate query latency across Regions on different continents. This early “dark multi-regional rollout” greatly reduced risk when the game later began operating in multiple Regions, because the S6 team could gather information on real player behavior first.

Load testing

Before Palia had any players, Singularity 6 simulated them by writing load tests against their microservices using Locust. Their major focus was stress testing critical backend components, particularly ones that rely on other services to function. To generate sufficient load, S6 wrote a Locust test, ran it from within the Amazon EKS cluster, and then scaled it rapidly and effectively while gathering data that showed how the test systems responded. Using this method, the team identified inefficient code paths, balanced resource requests and limits, and tweaked auto scaling metrics.
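The post does not include S6’s actual test harness, but a minimal sketch of running Locust workers inside an EKS cluster might look like the following. The names (locust-worker, locust-master, locust-tests) and the replica count are hypothetical, and the sketch assumes a Locust master Service and a ConfigMap containing the locustfile already exist; scaling the Deployment scales the generated load.

```yaml
# Sketch: Locust workers running inside the cluster (hypothetical names).
# Assumes a "locust-master" Service and a "locust-tests" ConfigMap holding
# locustfile.py already exist.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: locust-worker
spec:
  replicas: 20                            # scale this number to generate more load
  selector:
    matchLabels:
      app: locust-worker
  template:
    metadata:
      labels:
        app: locust-worker
    spec:
      containers:
        - name: worker
          image: locustio/locust
          args:
            - --worker                     # run in worker mode
            - --master-host=locust-master  # connect to the master Service
            - -f
            - /mnt/locust/locustfile.py
          resources:
            requests:
              cpu: 250m
              memory: 256Mi
          volumeMounts:
            - name: locustfile
              mountPath: /mnt/locust
      volumes:
        - name: locustfile
          configMap:
            name: locust-tests
```

From there, a command such as `kubectl scale deployment locust-worker --replicas=100` ramps the generated load up or down in seconds.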

Chaos testing with Litmus

After Singularity 6 saw how their services responded to large volumes of requests, the team decided to inject some chaos into their tests. Their goal was to conduct experiments and observe how the rest of the system responded – especially for critical systems where an outage might cause a cascading failure. Using Litmus, S6 injected latency, created network isolation, and caused outages for specific services. Along the way, S6 constantly asked themselves, “Can we still play Palia?” Chaos testing helped build confidence because the team learned that once players were in-game, many types of chaos had little to no effect on them.
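As an illustration of this kind of fault injection, below is a hypothetical LitmusChaos ChaosEngine that adds network latency to one backend service. The namespace, labels, and durations are placeholders rather than Palia’s real configuration, and the sketch assumes the Litmus operator and the pod-network-latency ChaosExperiment are already installed in the cluster.

```yaml
# Hypothetical ChaosEngine: inject 2 seconds of network latency into a backend
# Deployment for 5 minutes. Assumes the Litmus operator and the
# pod-network-latency ChaosExperiment are installed.
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: backend-network-latency
  namespace: game-backend              # placeholder namespace
spec:
  engineState: active
  appinfo:
    appns: game-backend
    applabel: app=matchmaking          # placeholder label for the target service
    appkind: deployment
  chaosServiceAccount: litmus-admin
  experiments:
    - name: pod-network-latency
      spec:
        components:
          env:
            - name: NETWORK_LATENCY
              value: "2000"            # latency to inject, in milliseconds
            - name: TOTAL_CHAOS_DURATION
              value: "300"             # run the fault for 300 seconds
```

While an experiment like this runs, the question stays the same: can players still log in and play Palia?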

Release

Ahead of launch, Palia’s developers added and tested infrastructure to ensure they could match their players’ demands. Even with all this preparation, S6 encountered some rough edges alongside their successes during launch.

Issue: Poor VPC planning

When Singularity 6 started using Amazon EKS, the team already had a Virtual Private Cloud (VPC) module written in Terraform. The module was not perfect, but S6 decided it was good enough to move forward without major changes. However, the team realized that this was the wrong decision within hours of launching Palia, when an alert on pending pods revealed that the game had run out of IP addresses in its VPC’s subnets. S6 knew they could add additional subnets to the VPC, but they had not noticed a few lines in their EKS Terraform module that would eventually cause issues.

The issue was that they were concatenating all of their private and public subnet IDs into the list of subnet IDs passed to the EKS cluster. When the team added new subnets to the VPC, they were also expanding the EKS cluster’s control plane onto those subnets. This caused problems whenever the cluster needed to be upgraded or a managed node group needed to roll its instances.

In summary, an initial lack of VPC planning led to a proliferation of unique managed node groups in Singularity 6’s environment, which made changes to the EKS cluster harder as those node groups multiplied. Later in this post, Singularity 6 discusses how adopting Karpenter reduced their reliance on managed node groups and helped solve this issue.

Success: Templated resource requests and limits

During the early ramp-up, the team encountered an incident in which a component of their Linkerd stack was repeatedly killed for running out of memory (OOM). Despite running multiple replicas of this component, they saw knock-on effects from the OOMKill pod shutdowns, which meant they needed to act quickly. Within a few minutes, the team identified the component being killed and patched its resource requests and limits to higher values.

Luckily, S6 had prepared for this scenario by adding templating to their Helm charts and Kustomize configurations for setting resource requests and limits. An extra few hours of development work proved instrumental in patching the system while it was scaling, keeping the experience smooth for their players.
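The post does not share the actual templates, but the idea can be sketched with a hypothetical Helm fragment like the one below: requests and limits are rendered from values, so raising a memory limit is a one-line values change (or a `--set` override) instead of an emergency chart edit.

```yaml
# Hypothetical Helm deployment template fragment: resource requests and limits
# come from values, so they can be patched per environment without editing the chart.
containers:
  - name: backend                      # placeholder container name
    image: "{{ .Values.image.repository }}:{{ .Values.image.tag }}"
    resources:
      requests:
        cpu: {{ .Values.resources.requests.cpu | default "100m" }}
        memory: {{ .Values.resources.requests.memory | default "256Mi" }}
      limits:
        memory: {{ .Values.resources.limits.memory | default "512Mi" }}
```

During an incident like the one above, bumping the memory limit becomes a small, reviewable values change that can be rolled out in minutes.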

Post release: Adding new Regions

After Palia’s release in US West (Oregon), the team finished their infrastructure in Frankfurt and started work to expand further into Tokyo. This was in preparation for Palia’s Nintendo Switch launch, which would bring in a substantial number of new players from Asia. The team knew that providing game servers and services closer to their players would be instrumental to the game’s success, so getting Asia Pacific (Tokyo) online was a natural next step.

Issue: Lack of instance diversity

Palia’s backend, particularly the game servers, requires either a “many nodes with fewer vCPUs” approach or a “fewer nodes with more vCPUs” approach. In the Asia Pacific (Tokyo) Region, the team found that their preferred instance types had more limited availability than in either Oregon or Frankfurt. To solve this, they decided to explore Karpenter.

Success: Using Kustomize for DRY configuration

Adding new Regions revealed an obvious issue in their GitOps-style setup with ArgoCD: keeping their configuration (microservice version numbers, chart versions, resource requests and limits) in sync across multiple clusters.

To solve this, the team turned to Kustomize to further “customize” their DRY (don’t repeat yourself) configurations and centralize changes such as bumping version numbers. Kustomize allowed S6 to define a single file containing the versions of both the game server and the microservices, and ArgoCD has excelled at noticing Git repository changes and deploying updates to all Regions.
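A minimal sketch of such a layout, with hypothetical paths and image names, is shown below: each regional overlay pulls in the shared base plus a small component whose only job is to pin image versions, so a release is a one-file change that ArgoCD then rolls out to every Region.

```yaml
# overlays/eu-central-1/kustomization.yaml (hypothetical layout)
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - ../../base                        # shared game server and microservice manifests
components:
  - ../../components/versions         # shared, single source of truth for image tags
---
# components/versions/kustomization.yaml: the one file edited for a release
apiVersion: kustomize.config.k8s.io/v1alpha1
kind: Component
images:
  - name: registry.example.com/palia/game-server   # placeholder image names
    newTag: "1.42.0"
  - name: registry.example.com/palia/matchmaking
    newTag: "1.42.0"
```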

Using Karpenter

At the end of October 2023, Karpenter’s APIs moved to beta. At the time, Singularity 6 was using the Kubernetes Cluster Autoscaler and hypothesized that using Karpenter to manage their EKS instances closer to the control plane would provide several benefits, including:

  • Removing almost all of the managed node groups, reducing the burden of updating them as the cluster changed.
  • The ability to quickly test additional instance configurations, such as family, size, and architecture.
  • The flexibility to easily adapt to Regions that lack a particular instance type.
  • A more nuanced view of the cluster, starting instances just in time and reducing idle CPU.

Success: Flexibility, experimentation, and idle costs

Initially, Singularity 6 hypothesized that using Karpenter would give them flexibility across instance types, sizes, and even CPU architectures. For example, the team thought it would be straightforward to experiment with running Palia on Graviton instances (with Arm processors) without any Terraform changes to managed node groups. Karpenter also smoothed over instance availability constraints in a Region like Tokyo. Finally, by configuring Karpenter to use varied instance sizes and families, S6 was able to run a cluster without the overhead of multiple Terraform-defined managed node groups.

Using Karpenter and its NodePool resource reduced idle game server capacity in their clusters. By configuring a more diverse set of instance types, Singularity 6 was able to decrease idle cost by 75%.
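As an illustration, a NodePool along the following lines (a sketch with placeholder values, using the karpenter.sh/v1beta1 API introduced with the beta) lets Karpenter choose among many instance families, sizes, and both CPU architectures, and consolidate nodes when they sit underutilized:

```yaml
# Sketch of a diversified NodePool (placeholder values). Karpenter picks the
# cheapest allowed instance type that fits pending pods and consolidates
# underutilized nodes. Assumes an EC2NodeClass named "default" exists.
apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
  name: game-servers
spec:
  template:
    spec:
      nodeClassRef:
        name: default
      requirements:
        - key: kubernetes.io/arch
          operator: In
          values: ["amd64", "arm64"]              # allows Graviton experiments
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["on-demand"]
        - key: karpenter.k8s.aws/instance-category
          operator: In
          values: ["c", "m", "r"]
        - key: karpenter.k8s.aws/instance-cpu
          operator: In
          values: ["8", "16", "32"]               # "many small" or "fewer large" nodes
  limits:
    cpu: "2000"                                   # cap on total vCPUs the pool may launch
  disruption:
    consolidationPolicy: WhenUnderutilized
    expireAfter: 720h                             # recycle nodes after 30 days
```

Because these constraints live in a Kubernetes resource rather than in Terraform-defined node groups, widening or narrowing the instance mix does not require touching the cluster’s Terraform.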

Figure: Kubecost graph showing the change in idle cost (shown in gray) after the initial deployment of Karpenter in November.

Success: Karpenter accelerates Kubernetes upgrades

Switching to Karpenter removed the need for almost all of their managed node groups. They no longer had an EKS cluster with hundreds of instances across several managed node groups. This simplification, as well as using control_plane_subnet_ids, allowed the team to more easily upgrade their EKS clusters to newer versions of Kubernetes. With fewer instances to launch and less “node group overhead” in their Terraform module, S6 was able to upgrade from K8s v1.26 to v1.28 across six EKS clusters in a few hours.

Conclusion

From the player’s perspective, Palia’s PC launch went smoothly. The team performed load testing and chaos testing, which ensured the systems scaled properly. The elasticity of AWS offerings proved invaluable for this as they were able to double server capacity from 25 to 50 EC2 instances within hours across multiple Regions. Despite Singularity 6’s best efforts, they still ran into issues due to design choices made very early on.

Following Palia’s PC release, Singularity 6’s focus on improving their Terraform, Kustomize, and Karpenter configurations allowed them to rapidly add new Regions to the game and improve their launch on Nintendo Switch. Again, AWS availability in multiple Regions allowed the team to move from a single Region to worldwide infrastructure in a matter of months. Moreover, by moving configuration out of Terraform and EKS and into Karpenter, they were able to adapt to resource constraints and conduct experiments more quickly.

If you are curious about the details of the Palia software architecture, head over to the Singularity 6 blog for a more in-depth overview of this topic.

A special thank you to the Singularity 6 team who contributed to this architecture and blog post: Emily Price, Kyle Allan, Marc Tamsky, Matthew Walter, George Lin, Alec Nunn, Maciej Ciezki, Brian Tomlinson, and Stefano Mazzocchi.

Scott Flaster

Scott Flaster is a Technical Account Manager in North America who works with Games customers. He is passionate about building large-scale distributed applications to solve business problems using his knowledge in AI/ML, Security, and Infrastructure.

Scott Selinger

Scott Selinger is a Senior Solutions Architect in North America who works with Games customers. With his wealth of experience in the GameTech industry, he skillfully guides customers in leveraging AWS services to unlock the full potential of their endeavors.