AWS for Games Blog

Developer’s Guide to operate game servers on Kubernetes – Part 1

Introduction

Live operations is a strategy that maintains player interest through continuous updates and fresh content, enhancing dynamic engagement and driving game evolution across platforms. Game operations teams use live operations to deliver new expansions or events in multiplayer online games, enriching the online world.

Player customization, seasonal events, and community challenges boost retention, engagement, and collaborative gameplay. Players expect an evolving game that responds to their interactions and actions, letting those actions influence future developments. Dynamic in-game economy adjustments and regular content updates keep the game balanced and fresh, and extend player session times.

Through live operations, game operations teams ensure titles adapt to changing player preferences, enhancing longevity, and fostering a dedicated player base, which is essential for success in the gaming industry.

Traditional game infrastructure and its hurdles

The traditional virtual machine infrastructure poses unique challenges when game operations teams establish live operations. They are tasked with the provisioning and monitoring of extensive server fleets. A typical deployment involves the creation of an Amazon Virtual Private Cloud (Amazon VPC), and proper network configuration for the pools of virtual machines used to host game sessions.

Figure 1 - A diagram that shows a traditional game server deployment. In the picture, a game client is connecting to game servers running on Amazon EC2 instances. The EC2 instances are launched by Auto Scaling groups. Traffic from the game client to these private nodes running inside a VPC is controlled via a NAT Gateway. The EC2 instances are launched in multiple availability zones for resilience.


Figure 1 – A view of a traditional game server deployment.

When a game goes live, players are directed to regional server fleets, and game operations teams must dynamically scale server capacity based on player counts to optimize resource usage. This scaling process is crucial for game launches and ongoing live operations activities. However, scaling virtual machines (VMs) to handle increased player demand can be a daunting task: it involves extensive capacity planning, provisioning, and maintenance, which strains resources, especially for smaller studios.

Containers to the rescue

Containers emerge as heroes in game development, offering developers unparalleled agility and scalability for managing complex game demands.

Containerization enables rapid deployment of updates, seamless introduction of new in-game features, and swift rollout of new quests and environments, essential for maintaining continuous player engagement and keeping virtual worlds fresh. Containers provide isolation for microservices architectures, improving game performance and development workflow. Developers can scale services using containers to meet player demand, maintain stability during peak loads, and ensure a consistent gaming experience, making containers a cornerstone of modern game operations.

Benefits such as resource efficiency, cost savings, improved scalability, and reduced latency with edge-based architectures are explored in the blog post Optimize game servers hosting with containers.

Figure 2 shows containerized game servers deployed in two AWS regions. On the diagram we can see the game client sending traffic to an EKS cluster in each region. Each EKS cluster includes a namespace called gameservers, which hosts the pods running the game server containers.

Figure 2 – Containerized game servers deployed in two regions.

Diverse container offerings on AWS

Amazon Web Services (AWS) provides essential container services for live operations in game development. Developers can leverage Amazon Elastic Container Service (Amazon ECS), a fully managed container orchestration service, or Amazon Elastic Kubernetes Service (Amazon EKS) to tap into the Kubernetes ecosystem. AWS Fargate brings serverless container compute to Amazon ECS and Amazon EKS, freeing game operation teams from server management and allowing them to focus solely on the creation of compelling content.

With diverse container services, game operation teams can deploy updates, manage in-game events, and introduce new features, ensuring games remain dynamic and responsive to player interactions. Within the AWS ecosystem, containers are pivotal enablers of live operations, facilitating real-time game evolution and growth alongside their player communities.

This article focuses on the design of Kubernetes clusters to deploy game servers. The article outlines how to leverage infrastructure-as-code with Terraform to build clusters based on Amazon EKS best practices.

AWS networking considerations for container runtimes

The starting point for game operations teams before running game servers on Kubernetes is to plan the location and sizing of the clusters, and how traffic will flow to the game servers.

Since the game servers are hosted in multiple regions, each Kubernetes cluster runs in a VPC created in the region where players are based. The VPC defines public and private subnets with assigned IP address ranges.

The Terraform script for the Kubernetes cluster declares a module called “vpc”, which defines the availability zones, private subnets, and public subnets. The variable “vpc_cidr” is used to split the IP address range for the deployment between the private and public subnets.

A good practice when planning your container runtime deployment is to aim for a minimum of two availability zones, and ideally three (2n+1), to improve reliability. The list of availability zones used by the cluster is defined with the variable “azs”.

module "vpc" {
  #checkov:skip=CKV_TF_1:Module registry does not support commit hashes for versions
  source  = "terraform-aws-modules/vpc/aws"
  version = "~> 4.0"

  name = local.name
  cidr = local.vpc_cidr

  azs                     = local.azs
  private_subnets         = [for k, v in local.azs : cidrsubnet(local.vpc_cidr, 4, k)]
  public_subnets          = [for k, v in local.azs : cidrsubnet(local.vpc_cidr, 8, k + 48)]
  map_public_ip_on_launch = true

  enable_nat_gateway = true
  single_nat_gateway = true

  # Manage so we can name
  manage_default_network_acl    = true
  default_network_acl_tags      = { Name = "${local.name}-default" }
  manage_default_route_table    = true
  default_route_table_tags      = { Name = "${local.name}-default" }
  manage_default_security_group = true
  default_security_group_tags   = { Name = "${local.name}-default" }

  public_subnet_tags = {
    "kubernetes.io/role/elb" = 1
  }

  private_subnet_tags = {
    "kubernetes.io/role/internal-elb" = 1
  }

  tags = local.tags
}
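
To illustrate how the subnet expressions in the module carve up the address space, here is a minimal sketch assuming a hypothetical vpc_cidr of 10.0.0.0/16 and three availability zones; the actual values live in the locals of your deployment.

locals {
  vpc_cidr = "10.0.0.0/16"                                   # assumed value for illustration
  azs      = ["us-east-1a", "us-east-1b", "us-east-1c"]      # assumed region and zones

  # cidrsubnet(local.vpc_cidr, 4, k) adds 4 bits to the /16 prefix -> one /20 per availability zone
  example_private_subnets = [for k, v in local.azs : cidrsubnet(local.vpc_cidr, 4, k)]
  # -> ["10.0.0.0/20", "10.0.16.0/20", "10.0.32.0/20"]

  # cidrsubnet(local.vpc_cidr, 8, k + 48) adds 8 bits -> one /24 per availability zone,
  # offset by 48 so the public ranges do not overlap the private /20 blocks
  example_public_subnets = [for k, v in local.azs : cidrsubnet(local.vpc_cidr, 8, k + 48)]
  # -> ["10.0.48.0/24", "10.0.49.0/24", "10.0.50.0/24"]
}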

In addition, it is possible to define traffic rules directly in the Terraform code to restrict the origin, protocol, and ports for the game servers. The following code sample defines traffic rules for UDP, TCP, and custom game server webhooks:

node_security_group_additional_rules = {
    ingress_gameserver_udp = {
      description      = "Game Server UDP Ports"
      protocol         = "udp"
      from_port        = local.gameserver_minport
      to_port          = local.gameserver_maxport
      type             = "ingress"
      cidr_blocks      = ["0.0.0.0/0"]
      ipv6_cidr_blocks = ["::/0"]
    },
    ingress_gameserver_tcp = {
      description      = "Game Server TCP Ports"
      protocol         = "tcp"
      from_port        = local.gameserver_minport
      to_port          = local.gameserver_maxport
      type             = "ingress"
      cidr_blocks      = ["0.0.0.0/0"]
      ipv6_cidr_blocks = ["::/0"]
    },
    ingress_gameserver_webhook = {
      description                   = "Cluster API to node 8081/tcp control webhook"
      protocol                      = "tcp"
      from_port                     = 8081
      to_port                       = 8081
      type                          = "ingress"
      source_cluster_security_group = true
    }

  }

It is a best practice to evaluate the number of IP addresses required for the game servers and plan how IP addresses are managed during game operations.

The code sample assigns public IP addresses to game servers for demonstration purposes. The best practice is to use AWS Global Accelerator and create custom routes to send traffic to private IP addresses assigned to game servers.

IP prefixes can be used to prevent IPv4 address exhaustion on large clusters hosting game servers. Amazon EKS also supports IPv6, which is another option to ensure a large number of IP addresses is available for the game servers. Depending on the game requirements, you can customize the Terraform definition for the Amazon EKS cluster to build dual-stack Kubernetes clusters offering support for both IPv6 and IPv4. You can also opt for a setup where IPv6 addresses are assigned to the pods running the game servers and the game clients connect to the game servers using an IPv4 address.
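
As a minimal sketch, prefix delegation can be enabled on the Amazon VPC CNI add-on through the cluster add-on configuration of the Amazon EKS Terraform module introduced later in this post; the environment variables follow the documented VPC CNI settings, and the block is assumed to sit inside the “eks” module definition.

  # Inside the "eks" module definition (terraform-aws-modules/eks/aws)
  cluster_addons = {
    vpc-cni = {
      most_recent = true
      # Assign /28 IPv4 prefixes to nodes instead of individual secondary IPs,
      # increasing the number of pods (game servers) each node can host.
      configuration_values = jsonencode({
        env = {
          ENABLE_PREFIX_DELEGATION = "true"
          WARM_PREFIX_TARGET       = "1"
        }
      })
    }
  }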

Since the resulting infrastructure will run a large number of game servers, Amazon VPC IP Address Manager (IPAM) can be used to maintain the inventory of IP addresses.
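
The following sketch shows how an IPAM pool could be provisioned with Terraform; the region, CIDR, and netmask length are assumptions for illustration, and the resulting pool can then be referenced when allocating non-overlapping VPC CIDRs for each game cluster.

# Hypothetical IPAM setup for illustration: one pool serving game server VPCs in us-east-1
resource "aws_vpc_ipam" "gameservers" {
  description = "IPAM for game server clusters"
  operating_regions {
    region_name = "us-east-1"
  }
}

resource "aws_vpc_ipam_pool" "gameservers" {
  address_family = "ipv4"
  ipam_scope_id  = aws_vpc_ipam.gameservers.private_default_scope_id
  locale         = "us-east-1"
}

resource "aws_vpc_ipam_pool_cidr" "gameservers" {
  ipam_pool_id = aws_vpc_ipam_pool.gameservers.id
  cidr         = "10.0.0.0/8" # assumed top-level range for all game clusters
}

# A game cluster VPC can then draw a non-overlapping /16 from the pool
resource "aws_vpc" "game_cluster" {
  ipv4_ipam_pool_id   = aws_vpc_ipam_pool.gameservers.id
  ipv4_netmask_length = 16
  depends_on          = [aws_vpc_ipam_pool_cidr.gameservers]
}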

Figure 3 shows a view of IP address management with Amazon VPC IP Address Manager. We can see a dashboard showing the CIDRs for the resources and their state. We can also see overlapping CIDRs.

Figure 3 – IP address management on AWS.

With IPAM, game operations teams can prevent IP ranges from overlapping and impacting the performance of game servers. The full list of best practices for Amazon EKS networking is listed in this guide maintained by the Amazon EKS team.

Cluster-to-cluster communication

Since Kubernetes clusters running game servers can be in different regions, it is important to look at cluster-to-cluster communication and to factor data transfer costs into the architecture. As a rule, we do not recommend exposing services on each cluster via appliances or services over the internet. There are many secure options available to you, such as VPC peering, AWS Transit Gateway, or AWS PrivateLink, that will keep cluster traffic secure.

Transit Gateway is a great option that balances the total cost of ownership with scale and performance. The service offers multi-region connectivity over the AWS backbone and the ability to connect Amazon VPCs across accounts.
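
As an illustration, the following sketch attaches a game cluster VPC to a transit gateway and routes private traffic destined for a peer cluster through it; the peer CIDR is an assumption, and a cross-region deployment would additionally require transit gateway peering.

# Hypothetical transit gateway attachment for a game cluster VPC (illustrative values)
resource "aws_ec2_transit_gateway" "game_clusters" {
  description = "Connectivity between game server VPCs"
}

resource "aws_ec2_transit_gateway_vpc_attachment" "game_cluster" {
  transit_gateway_id = aws_ec2_transit_gateway.game_clusters.id
  vpc_id             = module.vpc.vpc_id
  subnet_ids         = module.vpc.private_subnets
}

# Route traffic destined for a peer cluster's VPC CIDR (assumed 10.1.0.0/16) through the transit gateway
resource "aws_route" "to_peer_cluster" {
  count                  = length(module.vpc.private_route_table_ids)
  route_table_id         = module.vpc.private_route_table_ids[count.index]
  destination_cidr_block = "10.1.0.0/16"
  transit_gateway_id     = aws_ec2_transit_gateway.game_clusters.id
}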

Game operations teams running clusters in the same region can also use Amazon VPC Lattice to set up communications between services. VPC Lattice is an application networking service that connects, monitors, and secures communications between services. With that approach, game operations teams can handle many-to-many VPC connectivity over the AWS backbone, with a viable option to control data transfer costs. In addition, the centralized management of interconnectivity and the ability to use AWS Identity and Access Management (IAM) policies to control access make VPC Lattice a way to reinforce security during game operations.
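
A minimal sketch of creating a VPC Lattice service network and associating a game cluster VPC with it follows; the resource names are illustrative assumptions, and service registration and auth policies would be layered on top.

# Hypothetical VPC Lattice service network shared by game backend services (illustrative names)
resource "aws_vpclattice_service_network" "game_services" {
  name      = "game-services"
  auth_type = "AWS_IAM" # require IAM authentication between services
}

# Associate the game cluster VPC so workloads in it can reach services on the network
resource "aws_vpclattice_service_network_vpc_association" "game_cluster" {
  vpc_identifier             = module.vpc.vpc_id
  service_network_identifier = aws_vpclattice_service_network.game_services.id
}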

Blueprint for Kubernetes

This blog post outlines the fundamental networking considerations when deploying game servers on Kubernetes clusters. Game operations teams can use infrastructure-as-code to standardize the definition and provisioning of Kubernetes clusters and make sure every deployment is consistent. Amazon EKS Blueprints, an open-source project, provides game operations teams with a collection of pre-configured and validated Amazon EKS cluster patterns, allowing them to bootstrap production-ready Amazon EKS clusters. These patterns have been tested and validated by Kubernetes experts, ensuring reliability and adherence to best practices.

The blueprints can be customized to meet the game requirements. In the code sample below, Amazon EKS Blueprints for Terraform is used to define an Amazon EKS cluster:

module "eks" {

versions
  source  = "terraform-aws-modules/eks/aws"
  version = "~> 19.13"

  cluster_name                   = local.name
  cluster_version                = local.cluster_version
  cluster_endpoint_public_access = true

  vpc_id     = module.vpc.vpc_id
  subnet_ids = concat(module.vpc.private_subnets, module.vpc.public_subnets)

  eks_managed_node_groups = {
    public_gameservers = {
      instance_types = var.gameservers_instance_types
      min_size       = var.gameservers_min_size
      max_size       = var.gameservers_max_size
      desired_size   = var.gameservers_desired_size
      labels = {
        "gameservers.dev/public-gameservers" = true
      }

      subnet_ids = module.vpc.public_subnets
    }

    gameservers_metrics = {
      instance_types = var.gameservers_metrics_instance_types
      labels = {
        "gameservers.dev/gameservers-metrics" = true
      }
      taints = {
        dedicated = {
          key    = "gameservers.dev/gameservers-metrics"
          value  = true
          effect = "NO_EXECUTE"
        }
      }
      min_size     = var.gameservers_metrics_min_size
      max_size     = var.gameservers_metrics_max_size
      desired_size = var.gameservers_metrics_desired_size

      subnet_ids = module.vpc.private_subnets
    }

  }
}

The Amazon EKS cluster blueprint uses the “vpc” module defined in the previous section. Amazon Elastic Compute Cloud (Amazon EC2) nodes for the Amazon EKS cluster are launched in the Amazon VPC created by the Terraform script. To facilitate game operations, the module reads key variables from input parameters defined during infrastructure creation or updates.

The code sample defines parameters for the name of the cluster, the Kubernetes version to use, the type of Amazon EC2 instances running the game servers in containers, and the minimum, maximum, and desired number of Kubernetes nodes.
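
The following sketch shows how these input parameters could be declared in a variables.tf file; the defaults are illustrative assumptions, not prescriptive sizing.

# Hypothetical variables.tf excerpt; defaults shown are illustrative only
variable "gameservers_instance_types" {
  description = "EC2 instance types for the public game server node group"
  type        = list(string)
  default     = ["c6g.4xlarge"]
}

variable "gameservers_min_size" {
  description = "Minimum number of nodes in the game server node group"
  type        = number
  default     = 2
}

variable "gameservers_max_size" {
  description = "Maximum number of nodes in the game server node group"
  type        = number
  default     = 10
}

variable "gameservers_desired_size" {
  description = "Desired number of nodes in the game server node group"
  type        = number
  default     = 3
}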

Game operation teams can control on which Kubernetes nodes game servers and dependent services should run. The code sample demonstrates this with an example where a node group called public_gameservers is used to isolate game servers. A separate node group called gameservers_metrics runs the components for an observability stack to monitor the game.

Amazon EKS allows for building a diverse cluster composed of different instance types, architectures (Graviton or Intel), and capacity models, such as spot instances. This flexibility enables optimizing the cluster’s cost-performance ratio by selecting the appropriate resource mix for each component. Kubernetes nodes are labeled and tainted to allow pods to selectively schedule themselves on appropriate nodes, optimizing resource utilization and workload distribution across the cluster.
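
To show how these labels and taints are consumed, the following sketch uses the Terraform Kubernetes provider and a hypothetical metrics-collector image to schedule an observability workload onto the gameservers_metrics node group while tolerating its NO_EXECUTE taint.

# Requires a configured "kubernetes" provider pointing at the EKS cluster.
# The deployment name and image are hypothetical; only the label and taint keys
# come from the gameservers_metrics node group definition above.
resource "kubernetes_deployment" "metrics_collector" {
  metadata {
    name = "metrics-collector"
  }

  spec {
    replicas = 1

    selector {
      match_labels = { app = "metrics-collector" }
    }

    template {
      metadata {
        labels = { app = "metrics-collector" }
      }

      spec {
        # Schedule only on nodes labeled for the observability stack
        node_selector = {
          "gameservers.dev/gameservers-metrics" = "true"
        }

        # Tolerate the NO_EXECUTE taint applied to the gameservers_metrics node group
        toleration {
          key      = "gameservers.dev/gameservers-metrics"
          operator = "Equal"
          value    = "true"
          effect   = "NoExecute"
        }

        container {
          name  = "metrics-collector"
          image = "example/metrics-collector:latest"
        }
      }
    }
  }
}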

Figure 4 shows a Kubernetes cluster with multiple node groups. Each node group is based on a specific instance type to better serve the pods running on it. The three node groups are the monitoring pool, the game servers pool, and the kube-system pool. The monitoring pool uses the m6g.medium instance type, the game servers pool uses the c6g.4xlarge instance type, and the kube-system pool uses the m6g.medium instance type.

Figure 4 – Kubernetes node groups.

Custom blueprints resulting from design iterations can be stored in a Git repository, offering the possibility to clone the infrastructure template and use a GitOps approach to launch standardized game clusters in specific regions.
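
For example, a regional cluster could be instantiated from the versioned blueprint by referencing the Git repository directly as a Terraform module source; the repository URL and tag below are placeholders, and the input variables match those defined earlier.

# Hypothetical consumption of a custom blueprint stored in Git (placeholder URL and tag)
module "game_cluster_us_east_1" {
  source = "git::https://example.com/game-studio/eks-gameservers-blueprint.git?ref=v1.2.0"

  gameservers_instance_types = ["c6g.4xlarge"]
  gameservers_min_size       = 2
  gameservers_max_size       = 10
  gameservers_desired_size   = 3
}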

Figure 5 shows a pipeline to manage containerized game server infrastructure. The code for the infrastructure is stored in AWS CodeCommit. Game operations teams can make changes to the versioned code and trigger an AWS CodePipeline flow that updates the definition of the cluster and the game server deployments. Instance types and node group definitions can be created or updated using that mechanism.

Figure 5 – A pipeline to manage containerized game server infrastructure.

Conclusion

This article presented best practices to guide game operations teams through the creation of Kubernetes clusters to host game servers. It also explained the importance of infrastructure-as-code to create reusable patterns for game launches.

We hope this blog has provided you with fundamental knowledge to improve the creation of Kubernetes clusters for games. Future articles will explore AWS solution guidance for hosting game servers on Amazon EKS with Agones and Open Match, two popular open-source frameworks for game server hosting and matchmaking on Kubernetes. Stay tuned!

Serge Poueme

Serge Poueme is a technology professional with over 15 years of experience driving strategic cloud initiatives and digital transformation. As a Senior Solutions Architect at AWS, he works with gaming customers on cloud adoption, cloud native architectures, and solution guidance for game servers hosting. Based in Montreal, Serge is passionate about technology, music, books, movies, and video games.

Badrish Shanbhag

Badrish Shanbhag is a Sr. Containers Specialist Solutions Architect at AWS. He has spent close to two decades helping customers with application design, architecture, and implementation of distributed systems using Microservices Architecture (MSA) and Service-Oriented Architecture (SOA). Currently, he serves as a principal tech advisor for the Media, Entertainment, and GameTech industries, and is responsible for influencing and shaping their application modernization journey using cloud-native and AWS container services.

Marcos Hauer

Marcos Hauer is a technology professional with extensive experience enabling Linux and Open Source adoption, assisting companies in cloud computing, DevOps, and SRE. With a background in networking and web technologies, he leverages expertise in Docker, Kubernetes, and CI/CD to aid digital transformation. He is passionate about games and improving gaming infrastructure.

Scott Flaster

Scott Flaster is a Technical Account Manager in North America who works with Games customers. He is passionate about building large-scale distributed applications to solve business problems using his knowledge in AI/ML, Security, and Infrastructure.