AWS for Games Blog

Amazon GameLift achieves 100 million concurrently connected users per game

Demonstrating the extreme scalability of Amazon GameLift, Amazon Web Services (AWS) has benchmark tested support for up to 100 million concurrent users (CCU) for a single game—a groundbreaking feat for the AWS managed solution. Amazon GameLift alleviates developers’ game server hosting concerns by dynamically scaling backend resources. The test also showcased how Amazon GameLift can add 100,000 players to a game each second and spin up more than 9,000 new compute instances each minute. This offers developers the ability to scale their games far beyond what’s been previously possible.

Online video game developers pour tremendous resources and passion—sometimes even blood, sweat and tears—into bringing new games and updates to life. On launch days, those efforts are put to the ultimate test as players flood the system, but developers don’t necessarily know how many players to expect. What developers do know is that the backend infrastructure supporting incoming traffic must be able to scale instantly without compromising player interactions, or the entire experience falls apart.

Chris Byskal, GM of AWS Game Services/Game Tech – Video Games & Immersive Tech, explained the ambitious benchmark: “While most developers don’t publish CCU numbers, unofficial tracking data indicates that the most popular games in the world today top out around 14 million CCU, and the latest SteamDB reporting puts the CCU for the entire platform at not quite 40 million. We chose 100 million CCU as our goal metric to highlight how Amazon GameLift can easily handle even the biggest games, several times over.”

Estimating CCU is a great starting point for developers to determine game scaling needs prior to launch. Additional scaling considerations include how to quickly handle spikes in traffic, as well as game session allocation and player geography. The AWS team behind Amazon GameLift has nearly a decade of experience helping customers run their games at scale while avoiding launch day missteps. This includes putting the solution through extreme scenarios. For a step-by-step guide on how they scaled for 10 million CCU and 100 million CCU, check out the following demonstration.

Preparing for scale

Developers can use projected CCU to determine the required virtual machine (VM) capacity for optimal game performance. For example, here’s how to calculate VM requirements to support 10 million CCU for a game, assuming one VM can support 12 game sessions before performance starts to degrade.

  • Total VMs = CCU / (players per game session x game sessions per VM)
  • Total VMs = 10,000,000 / (10 x 12) = 83,333
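
As a quick sanity check, this back-of-the-envelope math is easy to script. Here is a minimal bash sketch (the variable names are ours, for illustration only):

#!/bin/bash
CCU=10000000            # projected concurrent users
PLAYERS_PER_SESSION=10  # players in each game session
SESSIONS_PER_VM=12      # game sessions one VM can host before degrading

# Total VMs = CCU / (players per session x sessions per VM)
echo $(( CCU / (PLAYERS_PER_SESSION * SESSIONS_PER_VM) ))  # prints 83333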

Amazon GameLift has capacity available in 23 AWS Regions and 9 Local Zones, allowing it to comfortably scale to the 83,333 VMs needed to support 10 million CCU. Amazon GameLift can bring VMs into service in a matter of minutes, enabling developers to smoothly and quickly increase capacity from zero to ten million players. This autoscaling functionality minimizes idle capacity—keeping the overall costs down.

Creating game sessions quickly and sensibly

Next, we allocate players to spare game server capacity. Each game session must run on a game server process, and developers must avoid assigning multiple game sessions to the same process. Consider the throughput of these allocation requests:

  • Allocation requests/second = CCU / (players per game session x seconds per game session)
  • Allocation requests/second = 10,000,000 / (10 x 900) = 1,112 (rounded up)
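
The same arithmetic can be scripted; a minimal sketch, again with illustrative variable names:

#!/bin/bash
CCU=10000000               # projected concurrent users
PLAYERS_PER_SESSION=10     # players in each game session
SESSION_LENGTH_SECONDS=900 # 15-minute game sessions

# Allocation requests/second, rounded up to the next whole request
DENOM=$(( PLAYERS_PER_SESSION * SESSION_LENGTH_SECONDS ))
echo $(( (CCU + DENOM - 1) / DENOM ))  # prints 1112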

Here are other considerations:

  1. Game sessions: Assigning each session to a single game server process.
  2. Location: Assigning a game server with low latency to the game session’s players.
  3. Cost: If multiple hardware options are available, using the lower cost capacity first.
  4. Capacity: Leveraging autoscaling to avoid paying for more capacity than needed.
  5. VM usage: Monitoring which VMs have live games on them so they are not accidentally terminated, interrupting players’ games (see the protection example after this list).
  6. Capacity health: Validating that the capacity allocated is healthy (for example, avoiding non-responsive server processes).
  7. Number of VM game sessions: Starting too many game sessions on the same VM in a short period of time might exceed the compute capabilities of the VM if there are computationally heavy actions, like loading a map or initializing game state.
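
One way to address consideration 5 is game session protection, which prevents scale-down activity from terminating VMs that are hosting active game sessions. As a minimal sketch, once a Fleet exists it can be enabled fleet-wide with the AWS CLI (the $FLEET_ID placeholder matches the commands later in this post):

#!/bin/bash
# Prevent scale-down events from terminating instances that are
# hosting active game sessions
aws gamelift update-fleet-attributes \
    --fleet-id $FLEET_ID \
    --new-game-session-protection-policy FullProtection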

Many of these considerations can narrow down suitable VM choices for a specific game session. At this level of scale, it’s possible to have 100 game sessions wanting to start in a single AWS Region during the same second. Developers could try to pack these 100 game sessions onto two idle game server processes on a single VM. However, this means 98 of them are going to fail and need to locate new idle capacity. Meanwhile, more game sessions may have arrived and are waiting to start, compounding the problem.

These types of problems are minimal at low rates of throughput, but quickly become a huge issue as rates increase. This can lead to anything from elevated game start times to a complete collapse of availability, potentially ruining launch day.

Scaling with Amazon GameLift

Once you’ve integrated the game server with Amazon GameLift, upload the game server executable and any dependent assets as a Build and create a Fleet to run in AWS. This can be accomplished with the following AWS Command Line Interface (AWS CLI) code.

#!/bin/bash
aws gamelift upload-build --name BlogDemo \
                          --operating-system AMAZON_LINUX_2023 \
                          --build-root . \
                          --server-sdk-version 5.1.2 \
                          --build-version 1

aws gamelift create-fleet --name BlogDemo \
    --ec2-instance-type t3.medium \
    --build-id $BUILD_ID \
    --ec2-inbound-permissions \
'[{"FromPort":6250,"ToPort":6299,"IpRange":"0.0.0.0/0","Protocol":"UDP"},{"FromPort":22,"ToPort":22,"IpRange":"0.0.0.0/0","Protocol":"TCP"}]' \
    --runtime-configuration \
"ServerProcesses=[{LaunchPath=/local/game/gamelift-test-app,Parameters=port:6250,ConcurrentExecutions=1}, \
{LaunchPath=/local/game/gamelift-test-app,Parameters=port:6251,ConcurrentExecutions=1}, \
{LaunchPath=/local/game/gamelift-test-app,Parameters=port:6252,ConcurrentExecutions=1}, \
{LaunchPath=/local/game/gamelift-test-app,Parameters=port:6253,ConcurrentExecutions=1}, \
{LaunchPath=/local/game/gamelift-test-app,Parameters=port:6254,ConcurrentExecutions=1}, \
{LaunchPath=/local/game/gamelift-test-app,Parameters=port:6255,ConcurrentExecutions=1}, \
{LaunchPath=/local/game/gamelift-test-app,Parameters=port:6256,ConcurrentExecutions=1}, \
{LaunchPath=/local/game/gamelift-test-app,Parameters=port:6257,ConcurrentExecutions=1}, \
{LaunchPath=/local/game/gamelift-test-app,Parameters=port:6258,ConcurrentExecutions=1}, \
{LaunchPath=/local/game/gamelift-test-app,Parameters=port:6259,ConcurrentExecutions=1}, \
{LaunchPath=/local/game/gamelift-test-app,Parameters=port:6260,ConcurrentExecutions=1}, \
{LaunchPath=/local/game/gamelift-test-app,Parameters=port:6261,ConcurrentExecutions=1}]"

The upload-build command takes the files in the local directory and uploads them into the system for use on Amazon Linux 2023. The game is built with server SDK version 5.1.2.

The create-fleet command sets up a Multi-Location Fleet in Amazon GameLift, which is an abstraction combining hardware with the uploaded Build. This demonstration uses t3.medium Amazon Elastic Compute Cloud (Amazon EC2) VMs to minimize costs.

The runtime configuration provided in the prior code block sets us up with 12 game server processes for the game on each VM, each with different parameters (notably the port). Note: Inbound permissions are locked down by default, but we’ve allowed UDP on a range of ports, as well as SSH access (for any debugging) through port 22.

Next, scale up the Fleet to prepare for the game’s launch. We’ll want to be in a number of Regions so that our VMs are close to players and provide a low-latency experience. We can expand our Fleet to 14 Regions (the home Region plus 13 added locations) through the AWS CLI.

#!/bin/bash
aws gamelift create-fleet-locations --endpoint-url $ENDPOINT --fleet-id $FLEET_ID --locations \
    Location=us-east-1 Location=us-east-2 Location=us-west-1 \
    Location=eu-west-1 Location=eu-west-2 Location=eu-central-1 \
    Location=ap-south-1 Location=sa-east-1 Location=ap-northeast-1 \
    Location=ap-northeast-2 Location=ap-southeast-1 \
    Location=ap-southeast-2 Location=ca-central-1

Note that the specific Regions chosen here provide broad coverage (good for targeting a low player latency of under 50 ms, for example), but the list of Regions may be larger, smaller, or simply different depending upon the needs of the game. Generally speaking, more Regions will give better player latencies at the cost of potential fragmentation of the player base and a marginal increase in cost.

With the Fleet now set up in 14 Regions around the globe, the next step is to bring the VMs online. While autoscaling will help adapt the size of the Fleet, new VMs will still take a minute or two to spin up. This means that we want to have a set of VMs ready for the launch time. After a few minutes of traffic has come through, we can rely on autoscaling to adjust capacity to match our needs. With this in mind, let’s scale up the Fleet to desired levels with this bash script.

#!/bin/bash
for LOCATION in us-east-1 us-east-2 us-west-1 us-west-2 eu-west-1 \
                eu-west-2 eu-central-1 ap-south-1 sa-east-1 \
                ap-northeast-1 ap-northeast-2 ap-southeast-1 \
                ap-southeast-2 ca-central-1
do
  aws gamelift update-fleet-capacity --fleet-id $FLEET_ID \
    --desired-instances 1400 --max-size 6500 --min-size 1400 \
    --location $LOCATION
done

We’ve specified 1,400 as the minimum and desired number of VMs for each Region. This gives us 19,600 VMs, enough to support more than 2.3 million players. We’ve specified 6,500 as the maximum number of VMs for each Region. This allows for demand-based autoscaling up to 91,000 VMs, enough for 10.9 million players. Note that these additional VMs will not be introduced until needed, saving on cost.

The following graph (Figure 1) shows how quickly this happens, as we’re able to scale up from zero to 19,564 VMs in three minutes (the remaining 36 VMs arrived over the next few minutes). Note that we are (temporarily) keeping the minimum number of VMs to 1,400 in each Region to verify those are ready for the launch. We’ll reduce the minimum once the game is live and we’ve determined a suitable capacity floor.


Figure 1: Over 99.4% of 19,600 requested instances active within 3 minutes.

Next, we want to configure our Fleet to scale between the minimum and maximum VM counts. We can do this by attaching the following scaling policy to our Fleet.

#!/bin/bash
aws gamelift put-scaling-policy \
    --fleet-id $FLEET_ID \
    --name "My_Target_Policy_1" \
    --policy-type "TargetBased" \
    --metric-name "PercentAvailableGameSessions" \
    --target-configuration "TargetValue=40"

This scaling policy states that we will adjust our VM counts to target 40 percent of our game session capacity being available (60 percent actively hosting games). This helps us adapt to the initial player surge at release. We’ll reduce it significantly (to 5–10 percent) once traffic stabilizes in the following hours or days.
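
When traffic stabilizes, the target can be lowered by issuing put-scaling-policy again with a new value; the 10 percent target shown here is illustrative:

#!/bin/bash
# After launch traffic settles, tighten the buffer of idle game
# sessions to reduce idle capacity costs (illustrative target value)
aws gamelift put-scaling-policy \
    --fleet-id $FLEET_ID \
    --name "My_Target_Policy_1" \
    --policy-type "TargetBased" \
    --metric-name "PercentAvailableGameSessions" \
    --target-configuration "TargetValue=10"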

At this point, our VMs and corresponding game server processes are up and running with Amazon GameLift, and we are ready for the arrival of players. One of the features that Amazon GameLift offers to help with this aspect is Game Session Queues. A queue can contain one or more Fleets and helps route a set of players to the best (lowest latency, lowest cost) location amongst those Fleets. After setting up a queue for our Fleet using CreateGameSessionQueue, the game launch is open and players are free to join.
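
As a minimal sketch, a queue pointing at our Fleet might be created like this (the queue name, timeout value, and $FLEET_ARN placeholder are assumptions for illustration):

#!/bin/bash
# Create a queue that places game sessions onto our Fleet;
# $FLEET_ARN is a placeholder for the Fleet's ARN
aws gamelift create-game-session-queue \
    --name BlogDemoQueue \
    --timeout-in-seconds 60 \
    --destinations DestinationArn=$FLEET_ARN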

Testing

To simulate the arrival of players as the game is released, we’ll leverage two components:

  1. A load test generator ramps up game session requests to simulate a release of a game.
    • This is done through calls to the StartGameSessionPlacement API, which creates a game session request for a Game Session Queue.
  2. A player simulator listens to notifications about newly allocated game sessions and connects players to those games.

We will have our load test start at 100 game sessions for each second and ramp up to a high of 1,112 game sessions for each second (placing 1,000 and 11,120 simulated players for each second, respectively, into game sessions hosted on Amazon GameLift). Our simulated players include latency measurements when submitting game session requests, so the Game Session Queue system can route each session to the closest Region for its players. Each simulated player sends and receives one packet every few seconds for the 15-minute duration of the game session.
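
For reference, an individual placement request with per-Region latency hints might look like the following sketch (the queue name, placement ID, and latency values are illustrative):

#!/bin/bash
# Request a 10-player game session through the queue, including the
# requesting player's measured latency to each Region so the queue
# can choose the closest one (values shown are illustrative)
aws gamelift start-game-session-placement \
    --game-session-queue-name BlogDemoQueue \
    --placement-id $(uuidgen) \
    --maximum-player-session-count 10 \
    --player-latencies "PlayerId=player-1,RegionIdentifier=us-east-1,LatencyInMilliseconds=38" \
                       "PlayerId=player-1,RegionIdentifier=eu-west-1,LatencyInMilliseconds=95"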

Both of these components may need to be customized. The AWS Game Backend Framework Workshop includes a load testing component to help with this. There is also a UnityBotClient available which could serve as a starting point for producing simulated players.

The following graph (Figure 2) shows the number of active game sessions and connected players (CCU) over the duration of the test.


Figure 2: Increase of Game Session creation rate and CCU over time (to 66,667/minute and 10 million respectively). Amazon EC2 Instances adjust to support the traffic.

The graph demonstrates that we are scaled up for initial traffic levels at 19:00 and that players begin to arrive at 19:10. We hit our initial traffic estimates at approximately 19:20 and greatly exceed that level over the next 50 minutes, reaching a maximum CCU of 10 million at 20:10 (note that the VMs scaled up to 91,000 over the same period). During this time period, the configured autoscaling policy brings additional capacity into service to keep up with observed utilization. These traffic levels are sustained for 20 minutes (longer than the game duration of 15 minutes), demonstrating the ability to sustain allocation of over 1,100 game sessions for each second. All of this happens while 100 percent of players get into our game without delay, satisfying our seven game session placement considerations: no double-booking, minimized latency, reduced cost, appropriate autoscaling, no game interruptions, healthy capacity, and no overload.

In practical terms, this means that we had a successful launch even though our initial player expectations were exceeded by 10 times. Every player was placed into the lowest-latency game session available, without delay and without disruption. Let’s now examine how the same tests might perform under modified conditions.

Demonstration: Battle Royale

Imagine that the demonstrated scenario was replicated for a 100-player battle royale style game. We will assume the same Fleet configuration (including VM types/locations and Fleet sizes), and the same Game Session start rate, but each individual game has 100 players.

Since each game has 10 times as many players, we should expect to hit 100 million players. However, despite the massively higher player count (peaking at 111,200 new players for each second and 100 million CCU), the other metrics behave largely the same. This is because CCU has only a marginal effect on game session allocation; Game Session creation rate and VM deployment speed are the crucial factors, subject to the allocation constraints described earlier.

It’s also of interest to note how quickly we were able to get players into games even at this scale. In this Battle Royale test, 90% of players were allocated to games in less than 3 seconds, 99.9% in less than 5 seconds, and 100% in under 15 seconds.


Figure 3: Increase of Game Session creation rate and CCU over time (to 66,667/minute and 100 million respectively). Amazon EC2 Instances adjust to support the traffic.

Demonstration: Extreme game session throughput

As a final demonstration to highlight the actual driving factors in scaling game session management, let’s consider a scenario at the other extreme. Instead of 15-minute, 100-player games, consider a head-to-head fighting title with three-minute game sessions, such as Mortal Kombat 1, which uses Amazon GameLift.

Consider how this new scenario impacts our setup:

  • Fighting titles often have lower compute requirements than something like a first-person shooter. Instead of running 12 game sessions for each VM, we will run 50.
  • How many VMs do we need?
    • One million players in our 1v1 game means 500,000 games at any point in time.
    • At a density of 50 game sessions for each VM, this gives us a requirement of 10,000 VMs.
  • What is our Game Session creation rate?
    • Given our three-minute games (180 seconds) and 500,000 games in flight, we can expect a Game Session creation rate of 500,000/180 = 2,778 game sessions/second.

Note that even though some aspects of this setup seem minor compared to the earlier demonstrations (only 11 percent as many VMs and 1 percent as many players), the Game Session creation rate is 2.5 times as high. This scenario is the most demanding due to its high throughput and allocation contention. It highlights the importance of the cost and capacity requirements mentioned earlier.


Figure 4: Increase of Game Session creation rate and CCU over time (to 166,667/minute and one million respectively).


Cleanup

As a final step in this demonstration, we delete our Fleet so we don’t continue paying for the underlying VMs.

#!/bin/bash
aws gamelift delete-fleet --fleet-id $FLEET_ID

Summary

We’ve identified how to determine the scaling requirements for your game to support extremely large scenarios. We’ve also shown that CCU is an important data point, but only a first step in evaluating scale. The essential elements to successfully scaling for a game’s launch are the ability to rapidly introduce large numbers of VMs and assigning players to appropriate game servers quickly enough to avoid delaying those players.

We’ve also demonstrated how Multi-Location Fleets on Amazon GameLift and Game Session Queues can be leveraged to streamline your infrastructure. All while supporting some of the largest dimensions you might encounter in online gaming: 100 million CCU, 91,000 VMs, and over 2,700 games for each second.

You can learn more about Amazon GameLift features and prepare for your game’s launch with our launch checklist.

Contact an AWS Representative to learn how we can help accelerate your business.


Brian Schuster

Brian Schuster is a Principal Engineer at AWS for Amazon GameLift where he works on shaping the technical direction of the service. He has a deep focus on driving improvement in areas of availability and scalability in order to support the most demanding requirements of large-scale games.