AWS Machine Learning Blog

Reducing player wait time and right sizing compute allocation using Amazon SageMaker RL and Amazon EKS

As a multiplayer game publisher, you may often need to either over-provision resources or manually manage compute allocation when launching or maintaining an online game to avoid long player wait times. You need to develop, configure, and deploy tools that help you monitor and control the compute allocation. This post demonstrates GameServer Autopilot, a new machine learning (ML) based tool that makes it easy to reduce both the time players wait and compute over-provisioning. It eliminates manual configuration decisions and reduces the opportunity for human errors.

An early version of the GameServer Autopilot used linear regression to estimate the compute capacity needs. For more information, see Autoscaling game servers with Machine Learning on YouTube. Even with predictions, optimizing compute resource allocation is non-trivial because it takes substantial time to prepare Amazon EC2 instances. The allocation algorithm must account for the time needed to spin up an EC2 instance and install game assets. Ever-changing usage patterns require a model that is adaptive to emerging player habits—the system needs to scale up and down in concert with changes in demand.

This post shows how to use Amazon SageMaker RL to apply reinforcement learning (RL) with Amazon EKS, Amazon DynamoDB, AWS Lambda functions, and Amazon API Gateway. The post describes an ML system that learns to allocate resources in response to player usage patterns. The hosted model directly predicts the required number of game server instances to initialize to reduce player wait time. The training process integrates with the game ecosystem, requires minimal configuration, and you can extend it to other compute resource allocation scenarios.

Allocating compute

The architecture of multiplayer games includes the compute hosting for game servers and the matchmaking system that matches players to game sessions based on location and skills. Players wait in a virtual game lobby for their match. In current multiplayer game systems, the number of players at different levels, together with a system like matchmaking, dictates the demand for game servers. EKS manages the hosting platform, which schedules game server jobs reactively when it receives requests from the matchmaking system. Game servers are vertically packed into available EC2 instances. The EKS cluster autoscaler spins up new EC2 instances when the current EC2 instances are packed with game servers. The following diagram depicts a typical dedicated game server farm and architecture. It illustrates the interactions between the player, the matchmaking, and game lobby application with the dedicated game servers hosted by EKS.

Challenges in autoscaling

Reactive allocation is preferable to static allocation, which must over-provision for temporal peaks. An Amazon EC2 Auto Scaling group contains a collection of EC2 instances that are treated as a logical grouping for automatic scaling and management. The reactive approach includes a rule-based dynamic scaling policy of the Auto Scaling group, for example, adding EC2 instances when the CPU usage of the existing instances goes beyond 60%.
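A rule of this kind can be sketched in a few lines of Python. This is an illustration only, not the Auto Scaling API: the function name reactive_desired_capacity and the step size are hypothetical, and only the 60% threshold comes from the example above.

```python
def reactive_desired_capacity(current_instances, avg_cpu_percent,
                              threshold=60.0, step=1):
    """Return the instance count a simple threshold rule would request.

    Hypothetical helper for illustration; a real Auto Scaling group
    evaluates such rules through its own scaling policies.
    """
    if avg_cpu_percent > threshold:
        # Capacity is added only after load is already high, so players
        # wait while the new instance boots.
        return current_instances + step
    return current_instances


print(reactive_desired_capacity(10, 72.5))  # 11
print(reactive_desired_capacity(10, 41.0))  # 10
```

The rule only reacts after utilization has crossed the threshold, which is the source of the player wait time discussed next.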

However, reactive scaling can lead to frustration because players may need to wait for the Auto Scaling group to spawn an EC2 instance to host a new game. Over-aggressive scale-down events may result in long wait times after short lulls in traffic or even shut down live game sessions. For example, if game servers are allocated based on requests from a matchmaking service and the matchmaking service experiences an outage or has delays in communication, a reactive system releases resources. When the matchmaking system recovers, it generates a significant load on the system based on the real compute allocation needed.

The following graph demonstrates the matchmaking system outage scenario and how a proactive allocation system such as the GameServer Autopilot could alleviate the impact of an outage. The current_gs_demand line is the demand for game servers from the matchmaking system. The num_of_gs line represents the number of active game servers that players connect to. In the graph, between 01:00 and 03:30, an outage in the matchmaking system caused the demand for game servers to drop to 32 from its average of 86. A reactive system would have released capacity in response. In contrast, the autopilot detected that the drop was a false-negative event and kept the compute capacity ready for players. When the matchmaking issue was resolved, there was no further impact on player experience. To summarize, a good dynamic scaling system needs to predict and schedule the correct number of game servers in advance.

GameServer Autopilot

The number of allocated game servers must scale as players join or leave multiplayer game sessions. You can extend the preceding architecture to allow the game server autopilot to directly learn compute allocation patterns based on the specifics of its matchmaking service. You can use Amazon SageMaker RL to train and deploy state-of-the-art reinforcement learning algorithms.

The training phase requires integration with the game server scheduling system. You can use EKS to deploy the game servers. When the training job ends, it saves the trained model to Amazon S3. The model deploys to an Amazon SageMaker hosting endpoint that communicates with an API Gateway and Lambda function for secure access. You can also use the Lambda function to incorporate guardrails; for example, to override decisions that may have a significant negative impact on customer experience, such as breaking existing game sessions. The autopilot server stores the observation history in a DynamoDB table and uses the history for calls to the Amazon SageMaker endpoint. For more information, see the GitHub repo.

The following diagram depicts the extended dedicated game server architecture with autopilot.

On the EKS side, the autopilot client queries the API Gateway endpoint and sets the size of the game server Kubernetes deployment. The Kubernetes scheduler proactively satisfies player requests by deploying the game servers needed based on the decisions of the RL model. This method bridges the EC2 instance allocation performed by the EC2 Auto Scaling group and the game server allocation, and translates game sessions to various EC2 instance types and sizes.
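The client loop described above can be sketched as a single reconcile step. This is a minimal sketch, not the autopilot client from the repo: reconcile_once and its three callables are hypothetical stand-ins, with fetch_prediction representing the HTTPS call to the API Gateway endpoint and get_replicas and set_replicas representing reads and writes of the game server deployment size.

```python
def reconcile_once(fetch_prediction, get_replicas, set_replicas):
    """Run one client iteration and return the replica count applied.

    All three callables are hypothetical stand-ins for the real
    API Gateway and Kubernetes calls.
    """
    desired = max(0, int(fetch_prediction()))  # never request a negative size
    if desired != get_replicas():
        set_replicas(desired)
    return desired


# Usage with in-memory stand-ins instead of real services.
state = {"replicas": 80}
applied = reconcile_once(
    fetch_prediction=lambda: 86,
    get_replicas=lambda: state["replicas"],
    set_replicas=lambda n: state.update(replicas=n),
)
print(applied, state["replicas"])  # 86 86
```

Passing the service calls in as functions keeps the decision logic testable without a cluster or an endpoint.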

Using reinforcement learning

In reinforcement learning (RL), an agent uses trial and error to learn to make a sequence of decisions that maximizes the total reward over time. In Amazon SageMaker RL, most of the components of an RL Markov Decision Process (states, actions, and rewards) are defined in an environment file. This post uses a custom environment developed with OpenAI Gym, a popular set of interfaces for defining RL environments that is fully integrated into Amazon SageMaker. For more information, see the GitHub repo. Because RL models learn by a continuous process of receiving rewards and punishments for every action taken by the agent, it is possible to train systems to make decisions under uncertainty, such as the uncertain availability of large numbers of On-Demand or Spot EC2 Instances. For more information, see Use Reinforcement Learning with Amazon SageMaker.
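The general shape of such an environment file can be sketched as follows. To stay self-contained, this sketch does not import gym; it only mirrors the classic Gym contract in which reset() returns an observation and step(action) returns an observation, a reward, a done flag, and an info dictionary. The class name, the demand_fn stand-in for the matchmaking query, the episode length, and the placeholder reward are all assumptions for illustration, not the environment from the repo.

```python
import random


class GameServerEnvSketch:
    """Gym-style environment exposing reset() and step(action) only.

    Hypothetical illustration; a real environment would subclass gym.Env
    and declare action_space and observation_space.
    """

    def __init__(self, demand_fn, max_steps=10):
        self.demand_fn = demand_fn  # stand-in for the matchmaking query
        self.max_steps = max_steps
        self.steps = 0
        self.curr_demand = 0

    def reset(self):
        self.steps = 0
        self.curr_demand = self.demand_fn()
        return [self.curr_demand]

    def step(self, action):
        # Placeholder reward: negative gap between allocation and demand.
        reward = -abs(action - self.curr_demand)
        self.steps += 1
        self.curr_demand = self.demand_fn()
        done = self.steps >= self.max_steps
        return [self.curr_demand], reward, done, {}


# One episode with a random policy, the starting point of trial and error.
rng = random.Random(0)
env = GameServerEnvSketch(demand_fn=lambda: 86, max_steps=5)
obs = env.reset()
total_reward, done = 0, False
while not done:
    action = rng.randint(60, 110)
    obs, reward, done, _ = env.step(action)
    total_reward += reward
print(total_reward)
```

A training algorithm replaces the random policy with one that is updated from the rewards, so that the total reward per episode rises over time.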

RL-based decision-making encourages actions that are appropriate for the long term (for example, pre-allocating servers) even if they increase costs in the short term. The agent learns to act in an uncertain environment, with hard-to-predict future arrivals and delayed rewards. Uncertainty manifests in demand for game servers and the time it takes to provision new compute to accommodate the players’ needs. Initially, the RL agent allocates game servers with no knowledge of the right game server allocation. Over time, the RL agent learns to reduce player wait time while trading it off against over-allocation.

The duration of game sessions varies from a few minutes to tens of minutes for first-person shooter (FPS), arcade, or massively multiplayer online (MMO) games. When a game session ends and the game server is idle, the server terminates itself, and the EC2 instance is eventually terminated due to low utilization. The session length dictates the frequency of the compute scale actions. For example, it takes 10–15 minutes to prepare an EC2 instance, a time that includes deploying game binaries and assets. It may sometimes take longer to fulfill a large-scale compute request or requests for Spot Instances. For more information, see Spot Instance Requests. Therefore, the algorithm needs to predict the number of servers over a time horizon. The horizon length is determined by both the game session length and the time it takes to prepare the EC2 instance to run the required game servers.
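One way to reason about the horizon, as a sketch under stated assumptions: if the agent makes one scaling decision per interval, and that interval is chosen to match the game session length, then the allocation has to look ahead at least as many decision steps as fit into the instance preparation time. The function name horizon_steps and the ceiling formula are illustrative and are not taken from the autopilot code.

```python
def horizon_steps(prep_minutes, decision_interval_minutes):
    """Decision steps that elapse while an EC2 instance is being prepared.

    Hypothetical helper: integer ceiling of preparation time over the
    decision interval.
    """
    if decision_interval_minutes <= 0:
        raise ValueError("decision_interval_minutes must be positive")
    return (prep_minutes + decision_interval_minutes - 1) // decision_interval_minutes


print(horizon_steps(15, 5))   # 3
print(horizon_steps(15, 10))  # 2
```

With a 15-minute preparation time, a game with 5-minute sessions needs a prediction three steps ahead, while a game with 10-minute sessions needs two.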

Using the model

For simplicity, this post distinguishes between two use cases: training and model deployment. The RL approach combines the two into a single phase. This post later describes how game server autopilot is a safe autoscaler that allows the combination of the training and deployment phases.


The game server autopilot model is trained using an Amazon SageMaker notebook sample on GitHub. Throughout the training, the RL environment requests the current demand for servers. You can configure a gs_inventory endpoint URL that returns the current demand for servers when the environment requests it. For more information, see the GitHub repo.

During training, each call to the step() function within the environment receives an action that indicates the number of game servers the algorithm predicts as needed. The RL environment queries external endpoints that interface with the EKS control plane and the matchmaking service. The current demand for servers is retrieved from the matchmaking service. To avoid public access to the matchmaking service, you can deploy a Lambda function and API Gateway that secure the call made by the RL environment. In a production setting, you should limit the access using API keys. For more information, see Create and Use Usage Plans with API Keys.

The step() function now has two numbers. The first is an action, denoted by N_action, which is the current prediction of the required number of game servers. The second is the actual number of servers as pulled from EKS, N_demand. The step() function calculates the reward from the ratio of allocation to demand by distinguishing between two cases: false negative (ratio more than 1, over-allocation) and false positive (ratio less than 1, under-allocation). A false positive is considered worse because lack of capacity causes long player wait times. Hence, a false positive is penalized five times as heavily as a false negative. You can adjust these weights depending on the business application, and the reward can be a more complicated function. See the following code:

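def step(self, action):
    # Assumption: ratio is allocation over demand; the snippet omits its definition
    ratio = self.curr_alloc / self.curr_demand
    if ratio > 1:
        # Over-allocation: penalize each surplus server once
        reward = -1 * (self.curr_alloc - self.curr_demand)
    elif ratio < 1:
        # Under-allocation: penalize each missing server five times
        reward = -5 * (self.curr_demand - self.curr_alloc)
    else:
        # Exact match; the original omits this branch body, so no penalty is assumed
        reward = 0
    reward -= (self.curr_demand - self.curr_alloc) * self.over_prov_factor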
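The asymmetric penalty described above can be exercised outside the environment with a few worked values. The following is a standalone sketch, not the environment's method: allocation_reward is a hypothetical name, it compares the two numbers directly instead of computing a ratio (which avoids dividing by zero demand), and both the zero reward for an exact match and the default over_prov_factor of 0 are assumptions.

```python
def allocation_reward(curr_alloc, curr_demand, over_prov_factor=0):
    """Reward for one step, with under-allocation penalized five times more.

    Hypothetical standalone version of the reward logic in step().
    """
    if curr_alloc > curr_demand:
        # Over-allocation: one point per surplus server.
        reward = -1 * (curr_alloc - curr_demand)
    elif curr_alloc < curr_demand:
        # Under-allocation: five points per missing server.
        reward = -5 * (curr_demand - curr_alloc)
    else:
        reward = 0  # assumption: an exact match carries no penalty
    reward -= (curr_demand - curr_alloc) * over_prov_factor
    return reward


print(allocation_reward(90, 86))  # -4
print(allocation_reward(82, 86))  # -20
print(allocation_reward(86, 86))  # 0
```

Being four servers short costs five times as much as being four servers over, which pushes the agent toward allocating ahead of demand.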

The time it takes to prepare an EC2 instance to host the needed game servers might be longer than the interval over which the demand for servers changes. Thus, the controller needs to weigh future demand appropriately by selecting the discount factor (gamma). For the same reason, the current server allocation (curr_alloc) in step() is taken from the oldest demand in the demand history (demand_observation[]). See the following code:

def step(self, action):
    # Allocate against the oldest entry in the demand history
    self.curr_alloc = self.demand_observation[0]
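A fixed-length history whose first element is always the oldest observation can be kept with a bounded deque. This is a minimal sketch: the history length of five and the record_demand helper are assumptions for illustration, and only the demand_observation name and the use of index 0 come from the snippet above.

```python
from collections import deque

HISTORY_LEN = 5  # hypothetical window size

# A bounded deque drops its oldest entry when a new one is appended.
demand_observation = deque(maxlen=HISTORY_LEN)


def record_demand(new_demand):
    """Append the latest demand and return the oldest one still in the window."""
    demand_observation.append(new_demand)
    return demand_observation[0]


for demand in [80, 82, 85, 90, 95, 100]:
    oldest = record_demand(demand)

print(list(demand_observation))  # [82, 85, 90, 95, 100]
print(oldest)                    # 82
```

Once the window is full, index 0 lags the latest demand by four steps, so the length of the window sets how far back the allocation looks.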

The following graph shows two metrics: the current demand (curr_demand) for game servers given by matchmaking and the predicted allocation (curr_alloc). The graph shows how the model over-provisions when the demand is trending up (the curr_alloc line rises in advance of the curr_demand line). It also under-provisions as the demand ramps down, causing false positive predictions (ratio less than 1). This behavior is acceptable because the model is critical only for scaling up to reduce player wait time. Scale down happens automatically, driven by game servers terminating upon completion and EC2 instances terminating due to low utilization.

The following graph shows the overall training progress. The graph shows three metrics: the current demand (curr_demand) for game servers needed by matchmaking, the predicted allocation (curr_alloc), and the model reward. The expected allocation is what you want the model to learn. The graph shows that the reward signal helps the current demand and the predicted allocation converge as the training progresses.

Model deployment

GameServer Autopilot includes server and client components. The server comprises the endpoint deployed via Amazon SageMaker hosting, a DynamoDB table to persist the history of inferences, and an API Gateway and Lambda function to simplify Amazon SageMaker runtime semantics. For more information, see the GitHub repo. The autopilot client deploys as a Kubernetes Pod that queries the autopilot server for the number of game servers that need to launch for the next 10 minutes. For more information, see the GitHub repo.

The notebook section deploys the model. See the following code:

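from sagemaker.tensorflow.serving import Model

# estimator, role, and instance_type are assumed to be defined earlier in the notebook
model_data = estimator.model_data  # S3 location of the trained model artifact
print("model name: %s" % model_data)
model = Model(model_data=model_data, role=role)
predictor = model.deploy(initial_instance_count=1, instance_type=instance_type)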

The autopilot server logic starts with a request from the client. The server pulls the history of five inferences from the last_observation DynamoDB table. For more information, see the GitHub repo. The server queries the current allocation in EKS to avoid erroneous scale-downs caused by a false positive. The autopilot server returns the safe inference number to the autopilot client, which sets the size of the game server deployment. Kubernetes makes sure the required number of servers is available by scheduling the game servers. If there are not enough EC2 instances available, the status of the server jobs is marked Pending. That signals the cluster autoscaler to prepare more EC2 instances for the next 10–15 minutes.
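The server flow above can be sketched with in-memory stand-ins. This is an illustration under several assumptions, not the Lambda function from the repo: the deque stands in for the last_observation DynamoDB table, predict stands in for the call to the Amazon SageMaker endpoint, and the guardrail shown (never return fewer servers than are currently allocated) is one possible reading of the safe inference described above.

```python
from collections import deque


def safe_inference(predicted, current_allocation):
    """Assumed guardrail: never return less than the current allocation."""
    return max(int(predicted), int(current_allocation))


def handle_request(history, predict, current_allocation):
    """Serve one client request and record the result in the history.

    history: bounded deque, stand-in for the DynamoDB table.
    predict: callable, stand-in for the model endpoint.
    """
    predicted = predict(list(history))
    result = safe_inference(predicted, current_allocation)
    history.append(result)  # assumption: the returned value is what gets persisted
    return result


# Usage: the model reacts to an outage-like dip, and the guardrail holds capacity.
history = deque([84, 85, 86, 86, 87], maxlen=5)
answer = handle_request(history, predict=lambda obs: 32, current_allocation=86)
print(answer)         # 86
print(list(history))  # [85, 86, 86, 87, 86]
```

Keeping the guardrail in a separate function makes it easy to swap in a stricter or looser rule without touching the request handling.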

The game server autopilot is a safe autoscaler. If the training phase is allowed to control the real server directly, this corresponds to an RL approach. If the scenario does not allow direct control of the live server, the algorithm learns to forecast the allocations of the current algorithm; in that case, the approach reduces to supervised regression or forecasting. See the following code:


   if (action<curr_demand):

Implementing the autopilot

To implement the game server autopilot, you need a compute farm that runs servers and a model that provides game server demand inferences. You can use EKS for compute farm management and Amazon SageMaker to train, evaluate, and deploy a model as an endpoint. You also need to integrate the two pieces using Lambda and API Gateway, and a DynamoDB table that persists the prediction history for continuous prediction of needed servers. This post assumes that the server farm is an existing live system that operates regardless of the game server autopilot. Therefore, you can build the EKS cluster and deploy the server example that serves players. For more information, see the GitHub repo.

This post provides a Jupyter notebook that includes the following:

  • The parameters needed for the Amazon SageMaker training jobs and other AWS resources required for the session-server environment
  • IAM role setup that creates an appropriate execution role and shows how to add more permissions to the role needed for specific AWS resources
  • An S3 bucket to store intermediate training job artifacts and the final model
  • The client application (environment): session-server environment
  • Trained model deployment with public endpoint
  • Visualization through Amazon CloudWatch throughout the training
  • Cleanup of resources

How do I try it out?


About the Author

Yahav Biran is a Solutions Architect in AWS, focused on Game tech at scale. Yahav enjoys contributing to open source projects and publishing in AWS blogs and academic journals. He currently contributes to the K8s Helm community, AWS databases and compute blogs, and the Journal of Systems Engineering. He delivers technical presentations at technology events and works with customers to design their applications in the Cloud. He received his Ph.D. (Systems Engineering) from Colorado State University.

Bharathan Balaji is a Research Scientist in AWS and his research interests lie in reinforcement learning systems and applications. He contributed to the launch of Amazon SageMaker RL and AWS DeepRacer. He received his Ph.D. in Computer Science and Engineering from University of California, San Diego.

Murali is a Senior Machine Learning Scientist in AWS. His research interests lie at the intersection of AI, optimization, learning, and inference, particularly using them to understand, model, and combat noise and uncertainty in real-world applications. He is particularly interested in Reinforcement Learning in practice and at scale. He contributed to the launches of Amazon Personalize and Amazon Forecast, and works with Amazon SageMaker RL. He received his PhD from Carnegie Mellon University.