AWS for Games Blog

Hyper-scale online games with a hybrid AWS Solution

Online multiplayer games, such as multiplayer online battle arenas (MOBA), are becoming increasingly popular. One option for game server hosting is to use on-premises data centers, which require multi-year contracts for a set number of resources. As the number of players for a given game grows, developers have to determine what to do when they reach the capacity limits of their data centers.

The AWS Cloud allows companies to scale beyond their on-premises resources on demand with pay-as-you-go pricing. Using AWS is a great way for game developers to scale with load when they have reached their data center capacity.

This article discusses how the AWS Cloud can scale session-based games beyond existing data center capacity limits without altering the essentials of the existing game ecosystem. It also explains how companies can run game servers in locations where they don’t have a data center by leveraging the global AWS Infrastructure. This approach reduces latency and improves the player experience by running game servers closer to the players in those Regions.

In addition, this post demonstrates how to establish a VPN connection from an on-premises data center to an AWS Transit Gateway to create a private connection between resources in the AWS Cloud and the on-premises game ecosystem. It also covers how to use Auto Scaling groups to scale a game server fleet across multiple AWS Regions.

Session-Based Online Game Architecture

Modern online games have at least three major components that are relevant for this article. The first one is the matchmaker, the second one is the game backend, and the third is the game server.

  • The matchmaker’s task is to find the best possible match of players for a new game session. This process is called matchmaking. Matchmaking takes multiple metrics into account when creating the next game session. One major metric is the latency from the game clients to the game server, which is crucial for the game experience, especially in fast-paced multiplayer online games. Other metrics such as player level, player experience, and queue waiting time are also commonly used; however, they won’t be discussed in this article. For more information on matchmaking, refer to the What is GameLift FlexMatch? documentation.
  • The game backend utilizes data from the matchmaker to generate game sessions based on the matchmaking logic and prepares the game server for the incoming connections. Game backends are similar to web services and are deployed as containers, on virtual machines, or as serverless functions. Game backends are typically stateless and do not run any game logic. In addition, the game backend knows the current demand for game sessions and takes care of scaling the game server fleet.
  • The game server is the heart of a game session. A set of game servers is also called a game server fleet. One game session is a collection of connected clients representing players and is valid until that game session is over or the game server is shut down. An online multiplayer game server uses low-level networking protocols like UDP and TCP for receiving and sending game events from and to clients. Typically, the game server is the single source of truth for a game session. For some types of games, like MOBAs, players who have lower latency to the game server have an advantage because their events arrive at the server before those of players with higher latency. Thus, it is crucial to have the game servers close to the players of one game session. Online multiplayer game servers run on virtual machines or as containers on a container orchestration service like Kubernetes. It is common to run multiple game server processes on one virtual machine, with a main process that verifies that the game server processes are alive. This main process also restarts game server processes if they are unhealthy. In a containerized environment, the container orchestration service takes care of terminating unhealthy containers.

These three components work closely together to create the game sessions. Typically, a matchmaker doesn’t have a low latency requirement. Thus, there is usually only one matchmaker per broad location (for example, Asia, North America, South America, and Europe). Game server fleets are distributed across these locations to reduce the latency for players within one game session. For example, one matchmaker and one game backend for Europe are based in Frankfurt, with game server fleets in Frankfurt, London, and Stockholm.

It is worth noting that there are other parts in a modern online game architecture. This article omits these to reduce its complexity and focus on the hyper scaling aspects of games.

Hybrid Architecture

Often, enterprise environments are a mix of cloud, on-premises data centers, and edge locations. Hybrid cloud architectures help organizations integrate their on-premises and cloud operations to support a broad spectrum of use cases using a common set of cloud services, tools, and APIs across on-premises and cloud environments.

A common use case of a hybrid architecture is data center extension. In this situation, a company operates on-premises networking, security, storage, and access control infrastructure seamlessly alongside AWS to enable data center extension to the cloud.

Combining hybrid architecture with a game architecture

Hybrid architecture allows scaling beyond the limits of the on-premises data center. It also helps cover local game server blind spots without the need to re-architect the game backend, matchmaker, or game server applications.

Setting up a hybrid network for a session-based multiplayer game

The first step in setting up a data center extension is configuring the networking components. This is required to establish private connectivity between the on-premises ecosystem and the AWS Cloud.

The following sections of this blog post describe how to:

  1. Set up a VPC with a Route Table and public Subnets to run a fleet of game servers in.
  2. Configure an AWS Transit Gateway and an AWS Site-to-Site VPN connection to an on-premises VPN server.
  3. Add a second Region to the hybrid network for a fleet of game servers in that Region.

Game server fleet VPC creation

Before setting up the private connection, a Virtual Private Cloud (VPC) for the game server fleet is required. A VPC is a logically isolated network space in the AWS Cloud where the game servers reside. Since the game server instances in this example are reachable from the internet, a simple VPC with two or three public subnets in different Availability Zones (AZs) within the desired Region is sufficient.

In the AWS Management Console, follow this procedure:

  • Navigate to VPC.
  • On the VPC Dashboard menu, choose Your VPCs.
  • Choose Create VPC.
  • For IPv4 CIDR block, enter 10.0.0.0/16.
  • (Optional) For Name tag, enter <region>-game-server-fleet.
  • (Optional) Select the option to use an IPv6 CIDR block.
  • Choose Create VPC.

This will create a VPC with the given CIDR, a default Route Table and a Network Access Control List (Network ACL).
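
If you prefer to script this step, the VPC can also be created with the AWS CLI. The following is a minimal sketch; the Region and Name tag are placeholders matching the console example:

aws ec2 create-vpc --region <region> --cidr-block 10.0.0.0/16
aws ec2 create-tags --region <region> --resources <vpc-id> --tags Key=Name,Value=<region>-game-server-fleet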

The next step is to divide the VPC into subnets. Each subnet should have a distinct CIDR block. It’s good practice to name a subnet according to the Availability Zone it resides in.

  • On the VPC Dashboard menu, choose Subnets.
  • Choose Create subnet.
  • For VPC ID, select the VPC with the name <region>-game-server-fleet.
  • For Subnet name, enter subnet-<region><az>-game-server-fleet.
  • For Availability Zone, select <region>a.
  • For IPv4 CIDR block, enter 10.0.1.0/24.
  • Choose Create subnet.

Repeat this process for the remaining subnets, each with a unique CIDR range. For the purpose of this blog, you may use 10.0.2.0/24 and 10.0.3.0/24.
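
The subnets can likewise be created with the AWS CLI, one call per Availability Zone. The VPC ID is a placeholder and the CIDR ranges match the values above:

aws ec2 create-subnet --vpc-id <vpc-id> --availability-zone <region>a --cidr-block 10.0.1.0/24
aws ec2 create-subnet --vpc-id <vpc-id> --availability-zone <region>b --cidr-block 10.0.2.0/24
aws ec2 create-subnet --vpc-id <vpc-id> --availability-zone <region>c --cidr-block 10.0.3.0/24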

As the game servers require an internet connection, next add an Internet Gateway to the VPC.

  • On the VPC Dashboard menu, choose Internet gateways.
  • Choose Create Internet gateway.
  • For Name tag, enter igw-<region>-game-server-fleet.
  • Choose Create Internet gateway.

  • From the Actions, choose Attach to VPC.
  • Select the previously created VPC from the Available VPCs.
  • Choose Attach Internet gateway.
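
These steps can also be performed with the AWS CLI; a minimal sketch with placeholder IDs:

aws ec2 create-internet-gateway
aws ec2 attach-internet-gateway --internet-gateway-id <igw-id> --vpc-id <vpc-id>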

The next step for a functional VPC is to edit the default route table to route traffic to the internet and associate the subnets with it.

  • On the VPC Dashboard menu, choose Route Tables.
  • Select the default route table associated with your VPC ID.
  • On the navigation bar, choose Routes.
  • Choose Edit routes.
  • For Destination, enter 0.0.0.0/0 and for Target select the Internet Gateway.
  • Choose Save routes.

Additionally, associate the subnets with the Route Table.

  • On the navigation bar of the route table, choose Subnet Associations.
  • Choose Edit subnet associations.
  • Select all subnets to associate.
  • Choose Save.
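
With the AWS CLI, the internet route and the subnet associations look roughly like this (route table, Internet Gateway, and subnet IDs are placeholders):

aws ec2 create-route --route-table-id <rtb-id> --destination-cidr-block 0.0.0.0/0 --gateway-id <igw-id>
aws ec2 associate-route-table --route-table-id <rtb-id> --subnet-id <subnet-1-id>
aws ec2 associate-route-table --route-table-id <rtb-id> --subnet-id <subnet-2-id>
aws ec2 associate-route-table --route-table-id <rtb-id> --subnet-id <subnet-3-id>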

Setting up the AWS Transit Gateway as the central network hub

Use an AWS Transit Gateway as a central entry point to the AWS Network to route traffic to multiple AWS Regions later on.

In the AWS Management Console:

  • Navigate to VPC.
  • On the VPC Dashboard menu, choose Transit Gateways.
  • Choose Create Transit Gateway.
  • (Optional) For Name tag and Description, enter tgw-<region>.
  • Choose Create Transit Gateway.

This creates an AWS Transit Gateway and a Transit Gateway Route Table. To route traffic to and from an AWS Transit Gateway, create Attachments to the AWS Transit Gateway. The first Attachment is the game server fleet VPC.

  • On the VPC Dashboard menu, choose Transit Gateway Attachments.
  • Choose Create Transit Gateway Attachment.
  • Select the Transit Gateway ID.
  • Select the VPC ID and all Subnet IDs.
  • (Optional) For Attachment name tag, enter tgw-vpc-attachment-<region>.
  • Choose Create attachment.

Note that this process updates the default route table of the AWS Transit Gateway so that traffic reaching the AWS Transit Gateway with the destination 10.0.0.0/16 is routed to the VPC Attachment.
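
For scripted setups, the AWS Transit Gateway and its VPC Attachment can also be created with the AWS CLI. A minimal sketch with placeholder IDs:

aws ec2 create-transit-gateway --description tgw-<region>
aws ec2 create-transit-gateway-vpc-attachment --transit-gateway-id <tgw-id> --vpc-id <vpc-id> --subnet-ids <subnet-1-id> <subnet-2-id> <subnet-3-id>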

To verify the route information:

  • Choose Transit Gateway Route Tables on the VPC Dashboard menu.
  • Choose the created Transit Gateway Route Table.
  • On the navigation bar, choose Routes.

VPN Connection

The next and last step of the networking setup is to establish a connection from the on-premises data center to the new game server fleet VPC. The focus of this article is on using a VPN connection; however, the VPN connection may be replaced with an AWS Direct Connect connection if a low-latency, low-jitter, and high-throughput connection is required.

To establish a VPN connection to an on-premises VPN server, open the AWS Management Console and complete the following:

  • Navigate to VPC.
  • On the VPC Dashboard menu, choose Customer Gateways.
  • Choose Create Customer Gateway.
  • For IP Address, enter the IP of your on-premises VPN Server.
  • (Optional) For Name, enter Customer VPN Device Name.
  • (Optional) For Certificate ARN, select your-certificate-arn.
  • (Optional) For Device, enter VPN-Device-Type.
  • Choose Create Customer Gateway.
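
The equivalent AWS CLI call is sketched below. The BGP ASN of 65000 is an assumption for a static-routing setup and should match the ASN configured on your on-premises device:

aws ec2 create-customer-gateway --type ipsec.1 --public-ip <on-premises-vpn-ip> --bgp-asn 65000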

After creating the Customer Gateway, set up the Site-to-Site VPN connection.

  • On the VPC Dashboard menu, choose Site-to-Site VPN Connections.
  • Choose Create VPN Connection.
  • From Target Gateway Type, select Transit Gateway.
  • For Transit Gateway, choose the previously created AWS Transit Gateway.
  • For Customer Gateway ID, choose the previously created Customer Gateway.
  • For Routing Options, choose Static, or Dynamic if your customer gateway device supports BGP.
  • (Optional) For Name tag, enter site-to-site-to-<on-premises-name>.
  • Choose Create VPN Connection.
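
A corresponding AWS CLI sketch, assuming static routing (set StaticRoutesOnly to false if your device supports BGP):

aws ec2 create-vpn-connection --type ipsec.1 --customer-gateway-id <cgw-id> --transit-gateway-id <tgw-id> --options '{"StaticRoutesOnly":true}'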

After a couple of minutes, the AWS Site-to-Site VPN connection is provisioned and the Transit Gateway Route Table can be adjusted so that traffic with an on-premises destination is routed to the VPN connection.

  • On the VPC Dashboard menu, choose Transit Gateway Route Tables.
  • Choose the route table of the previously created AWS Transit Gateway.
  • On the navigation bar of the route table, choose Routes.
  • Choose Create static route.
  • For CIDR, enter the CIDR of the on-premises network.
  • For Choose attachment, choose the VPN Attachment ID.
  • Choose Create static route.
  • (Optional) Configure the Site-to-Site VPN connection according to your customer gateway device requirements. For example, select dynamic routing if your device supports it.
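
With the AWS CLI, the static route toward on-premises is a single call (the route table and VPN Attachment IDs are placeholders):

aws ec2 create-transit-gateway-route --transit-gateway-route-table-id <tgw-rtb-id> --destination-cidr-block <on-premises-cidr> --transit-gateway-attachment-id <vpn-attachment-id>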

One step remains to achieve connectivity from the game server fleet VPC. Traffic whose destination is the on-premises CIDR range must be routed from the VPC to the AWS Transit Gateway, which then forwards it to the VPN Attachment.

  • On the VPC Dashboard menu, choose Route Tables.
  • Choose the route table of the previously created VPC.
  • On the navigation bar of the route table, choose Routes.
  • Choose Edit routes.
  • For Destination, enter the CIDR of the on-premises network.
  • For Target, choose the AWS Transit Gateway.
  • Choose Save routes.
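
The same route can be added with the AWS CLI:

aws ec2 create-route --route-table-id <vpc-rtb-id> --destination-cidr-block <on-premises-cidr> --transit-gateway-id <tgw-id>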

Now traffic from the VPC with the destination to on-premises is routed to the AWS Transit Gateway. The AWS Transit Gateway will route it to the VPN Server defined by the customer gateway device. Incoming traffic to the Transit Gateway destined for the game server is routed to the VPC.

Adding a second Region for game server fleets

Because AWS Transit Gateway is used as a central entry point, you can add a second Region for game server fleets by following the steps for the VPC, AWS Transit Gateway, and VPC Attachment creation in that Region. But instead of adding a VPN to the new AWS Transit Gateway, use a Transit Gateway peering attachment to peer the central Transit Gateway with the Transit Gateway in the second Region.

  • On the VPC Dashboard menu, choose Transit Gateway Attachments.
  • Choose Create Transit Gateway Attachment.
  • Select the Transit Gateway ID of the new Transit Gateway in the second game server fleet Region.
  • For Attachment type, select Peering Connection.
  • (Optional) For Attachment name tag, enter peering-to-<region>.
  • For Region in the Peering Connection Attachment, select the Region of the central Transit Gateway.
  • For Transit gateway (accepter) in the Peering Connection Attachment, enter the Transit Gateway ID of the central Transit Gateway.
  • Choose Create attachment.

This initializes a peering request to the Transit Gateway that must be accepted to function as desired.

  • On the navigation bar, choose the region of the central transit gateway.
  • Navigate to VPC.
  • On the VPC Dashboard menu, choose Transit Gateway Attachments.
  • Select the Transit Gateway Attachment with the resource type Peering.
  • From Actions, choose Accept to accept the peering connection.
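
As an AWS CLI sketch, creating the peering in the second Region and accepting it in the central Region looks like this (account ID, Regions, and gateway IDs are placeholders):

aws ec2 create-transit-gateway-peering-attachment --region <second-region> --transit-gateway-id <second-tgw-id> --peer-transit-gateway-id <central-tgw-id> --peer-account-id <account-id> --peer-region <central-region>
aws ec2 accept-transit-gateway-peering-attachment --region <central-region> --transit-gateway-attachment-id <peering-attachment-id>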

Peering two Transit Gateways doesn’t create a route between the resources. Thus, the route tables of the AWS Transit Gateway need to be adjusted so that traffic from the second Region, with the destination of on-premises, is routed to the peering and vice versa. Refer to the VPN Connection section above for the process of adding a route to the default Route Table.

The VPC’s Default Route Table in the second game server Region needs to route the traffic for the on-premises destination to the Transit Gateway in the second Region.

Utilizing Auto Scaling Groups to scale a game server fleet in a specific Region

Speed matters when scaling game server instances to meet the current number of players. The first step is to create an Amazon Machine Image (AMI) from a running instance with the game server already installed. This reduces startup time, since no installation or unzipping of the game server files is required when the instance launches. For how to create an AMI from an EC2 instance, refer to Creating an AMI from an Amazon EC2 Instance.
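
As a sketch, the AMI can be created from a running, prepared instance with a single AWS CLI call (the instance ID and image name are placeholders):

aws ec2 create-image --instance-id <instance-id> --name game-server-<version>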

Once the AMI is created, use an EC2 Launch Template to standardize the configuration of the Auto Scaling group. The template contains the configuration information needed to launch an instance, such as the AMI, the instance type, the key pair, and the security groups.
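
A minimal launch template sketch with the AWS CLI; the template name, AMI ID, instance type, key pair, and security group are placeholders and should be adjusted to your game server requirements:

aws ec2 create-launch-template --launch-template-name game-server-fleet-template --launch-template-data '{"ImageId":"<ami-id>","InstanceType":"<instance-type>","KeyName":"<key-pair>","SecurityGroupIds":["<sg-id>"]}'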

The last step to hyper-scaling a game server fleet in an AWS Region is to create an Auto Scaling group for the new game server fleet in the desired AWS Region. That Region needs to be set up with the networking components for hybrid connectivity as described before. The Auto Scaling group launches or terminates Amazon EC2 instances to meet the desired instance count, whether that count is set by a scaling policy or manually, and it distributes the instances evenly across the configured subnets.
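
An AWS CLI sketch for creating such an Auto Scaling group, spreading instances over the public subnets of the game server fleet VPC (the group name, sizes, and IDs are placeholders):

aws autoscaling create-auto-scaling-group --auto-scaling-group-name game-server-fleet-<region> --launch-template LaunchTemplateName=game-server-fleet-template,Version='$Latest' --min-size 0 --max-size 100 --desired-capacity 2 --vpc-zone-identifier "<subnet-1-id>,<subnet-2-id>,<subnet-3-id>"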

In this architecture, the desired instance count of an Auto Scaling group is the primary value used for scaling the game server fleet. It is adjusted either manually, based on a custom metric calculated by the matchmaker, or automatically by a scaling policy.

In the second step of the Auto Scaling group creation process, it is important to enable the collection of Amazon CloudWatch metrics. CloudWatch metrics help automatically adjust the instance count of the Auto Scaling group, and they give specific insights relevant to the group.
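
Metrics collection can also be enabled with the AWS CLI, for example:

aws autoscaling enable-metrics-collection --auto-scaling-group-name game-server-fleet-<region> --granularity "1Minute"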

Automatically scaling an AWS Auto Scaling Group to meet demand

One way to scale a game server fleet is based on a metric of the server. A common CloudWatch metric used for adjusting the desired instance count of an Auto Scaling group is CPU utilization. While this is good practice for web-based applications, it’s not suitable for session-based multiplayer games, as the game server instance metrics do not reflect the current demand for game sessions.

The game backend knows how many players are queued, how many servers are currently running, the current wait time for players, and much more. This means that you can calculate a metric from this data and publish it to Amazon CloudWatch on a regular basis. Depending on the value of the metric, you can use CloudWatch to raise alarms for high and low states. These alarms are used to trigger scaling policies that describe how the Auto Scaling group adjusts its desired capacity. For further details on Auto Scaling groups, visit AWS Blog Category: Auto Scaling.

As an example, a metric with a value in the range [0, 1] is published to CloudWatch: 0 means that 0% of the game servers are used and 1 means that 100% of the game servers are used. For this metric, two alarms are set up: one when the value is above 0.8 and the other one when it’s below 0.5. These alarms are used for a scale-out and a scale-in policy, resulting in a scale-out when 80% utilization is reached and a scale-in when fewer than 50% of the game servers are used.
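
A sketch of how this could be wired up with the AWS CLI: the game backend publishes the utilization value as a custom metric, and two alarms trigger a scale-out and a scale-in policy. The namespace, metric name, group name, and adjustment sizes are assumptions for illustration; the policy ARNs are returned by the put-scaling-policy calls.

# Published periodically by the game backend (assumed namespace and metric name)
aws cloudwatch put-metric-data --namespace GameBackend --metric-name FleetUtilization --value 0.85
# Scale-out policy and the alarm that triggers it at 80% utilization
aws autoscaling put-scaling-policy --auto-scaling-group-name game-server-fleet-<region> --policy-name scale-out --adjustment-type ChangeInCapacity --scaling-adjustment 2
aws cloudwatch put-metric-alarm --alarm-name fleet-utilization-high --namespace GameBackend --metric-name FleetUtilization --statistic Average --period 60 --evaluation-periods 2 --threshold 0.8 --comparison-operator GreaterThanThreshold --alarm-actions <scale-out-policy-arn>
# Scale-in policy and the alarm that triggers it below 50% utilization
aws autoscaling put-scaling-policy --auto-scaling-group-name game-server-fleet-<region> --policy-name scale-in --adjustment-type ChangeInCapacity --scaling-adjustment=-1
aws cloudwatch put-metric-alarm --alarm-name fleet-utilization-low --namespace GameBackend --metric-name FleetUtilization --statistic Average --period 60 --evaluation-periods 2 --threshold 0.5 --comparison-operator LessThanThreshold --alarm-actions <scale-in-policy-arn>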

Manually Scaling an Auto Scaling Group to meet demand

The second option is to use manual scaling by setting the desired instance count of the Auto Scaling group. This allows granular and fast adjustments to meet the current game server demand. If the game requires one game server for every 10 players, and currently 100 players are waiting for a session, you could increase the desired value to meet demand without waiting for CloudWatch to raise an alarm that triggers the Auto Scaling group. These adjustments are made with the AWS Command Line Interface (AWS CLI) or an AWS Software Development Kit (AWS SDK).

As mentioned previously, the game backend is the central manager for the game server instances. Thus, the game backend knows, for instance:

  • How many game servers there are;
  • How many active sessions are running and on which instance;
  • How many sessions are requested;
  • If there’s capacity on premises or if the next game session should be scheduled to the cloud;
  • And the region of a specific player.

All this information is used to determine where a game session is scheduled. Additionally, it can be used to determine the desired instance count of an Auto Scaling group of a specific region to meet the demand in that region.

Here is an AWS CLI command example of how to manually scale an Auto Scaling group:

aws autoscaling set-desired-capacity --auto-scaling-group-name <name> --desired-capacity <count>

Preventing scale-in on game servers with a running game session

In both scaling options, the Auto Scaling group’s task is to gracefully shut down game servers and start up new ones. This can have major implications for the players if not handled properly. For example, it’s not a good customer experience if a game session is stopped because the fleet is scaling in and that instance is shut down. The Auto Scaling group has several options and hooks for scaling in and out that are used to handle this case.

For the given architecture, the most important feature is instance scale-in protection, which prevents an active game server from being terminated during a scale-in. The game backend sets scale-in protection on an instance when launching a new game session on it. This can be done either with the AWS SDK or the AWS CLI using the following command.

aws autoscaling set-instance-protection --instance-ids <id1> <id2> <id3> --auto-scaling-group-name <name> --protected-from-scale-in

In the unfortunate event of a crash of an entire game server instance, a programmatic de-registration of the instance isn’t possible. You can use an Auto Scaling group lifecycle hook to be informed when instances shut down. This information can then be used to prevent the game backend from scheduling game sessions on instances that are shutting down.
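
One option is a lifecycle hook on instance termination, which the backend can consume, for example through an Amazon EventBridge rule or an SQS notification target (not shown here). The hook name and timeout below are placeholder assumptions:

aws autoscaling put-lifecycle-hook --lifecycle-hook-name game-server-terminating --auto-scaling-group-name game-server-fleet-<region> --lifecycle-transition autoscaling:EC2_INSTANCE_TERMINATING --heartbeat-timeout 300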

Cleanup

If you are finished experimenting for today and want to stop using the resources you created, follow these steps in the AWS Management Console to avoid incurring unwanted cost.

Delete the VPN:

  • Navigate to VPC in the region of the central AWS Transit Gateway.
  • On the VPC Dashboard menu, choose Site-to-Site VPN Connections.
  • Select the created VPN and from Actions choose Delete.

Delete the Customer Gateway:

  • Navigate to VPC in the region of the central AWS Transit Gateway.
  • On the VPC Dashboard menu, choose Customer Gateways.
  • Select the created Customer Gateway and from Actions choose Delete.

Delete the Transit Gateway Attachments:

  • On the VPC Dashboard menu, choose Transit Gateway Attachments.
  • Select all created attachments one by one and from Actions choose Delete.
  • Repeat these steps for the peered regions.

Delete the AWS Transit Gateway:

  • Navigate to VPC in the region of the central AWS Transit Gateway.
  • On the VPC Dashboard menu, choose Transit Gateways.
  • Select the AWS Transit Gateway and from Actions choose Delete.
  • Repeat these steps for the peered regions.

Delete the VPC:

  • Navigate to VPC in the region of the central AWS Transit Gateway.
  • On the VPC Dashboard menu, choose Your VPCs.
  • Select the created VPC and from Actions choose Delete VPC.
  • Repeat these steps for the peered regions.

Conclusion

This post explained how to set up a hybrid architecture to scale beyond existing data center capacity limits for online multiplayer games. You learned how to use AWS Transit Gateway as a central network hub into the AWS Cloud and what options you have for scaling a game server fleet using Auto Scaling groups. You also learned how to expand that hybrid architecture to enable scaling to multiple AWS Regions and how to prevent game server instances in those Regions from being stopped when scaling in.

To get started with your own hybrid network with AWS Transit Gateway, refer to Getting started with transit gateways. For a general overview about hybrid solutions, refer to Hybrid Cloud with AWS. And finally, to get started with scaling game server fleets on AWS using AWS Auto Scaling groups, refer to Getting started with Amazon EC2 Auto Scaling.