# Building an AI-powered Battlesnake with reinforcement learning on Amazon SageMaker

by Jonathan Chung, Anna Luo, Bharathan Balaji, Vishaal Kapoor, and Xavier Raffin

Battlesnake is an AI competition based on the traditional snake game in which multiple AI-powered snakes compete to be the last snake surviving. Battlesnake attracts a community of developers at all levels. Hundreds of snakes compete and rise up in the ranks in the online Battlesnake global arena. Battlesnake also hosts several offline events that are attended by more than a thousand developers and non-developers alike and are streamed on Twitch. Teams of developers build snakes for the competition and learn new tech skills, learn to collaborate, and have fun. Teams can build snakes by using a variety of strategies ranging from state-of-the-art deep reinforcement learning (RL) algorithms to unique heuristics-based strategies.

This post shows how to use Amazon SageMaker to build an RL-based snake with the SageMaker Battlesnake Starter Pack, which reduces the time and effort required to build your snakes. The Starter Pack provides you with an AI-powered snake trained with RL. It also provides an environment for you to train your own RL policy and the tools to build custom heuristics-based rules on top of the RL algorithms. Furthermore, the web infrastructure to deploy and host your AI bot is generated automatically. The SageMaker Battlesnake Starter Pack allows you to focus on developing your AI instead of worrying about the infrastructure surrounding it.

The Amazon SageMaker Battlesnake Starter Pack uses quick-create links in AWS CloudFormation, which provide one-click deployment from the GitHub repo to your AWS Management Console. The creation of the stack deploys the following three layers and a development environment:

• The first layer is an Amazon API Gateway, which exposes the snake HTTP API to the Battlesnake engine
• The second layer is an AWS Lambda function that transcribes the Battlesnake API into the RL agent’s internal representation
• The third layer is an Amazon SageMaker endpoint that hosts the snake
• The development environment is made of Jupyter notebooks that run inside an Amazon SageMaker notebook instance

The following diagram illustrates the runtime AI infrastructure.

The SageMaker Battlesnake Starter Pack does the following:

• Allows you to deploy the framework, which includes the development environment and an AI-powered snake
• Provides you with tools to modify and evaluate new heuristic-based rules to alter your snake’s behavior
• Includes training scripts to retrain and optimize a custom RL agent

By the end of this post, you should understand the basics of RL and how to model Battlesnake in an RL framework, and have a snake that can compete against other snakes in the online global arena.

## Reinforcement learning with Battlesnake

This section reviews the basics of reinforcement learning and how the SageMaker Battlesnake Starter Pack models the Battlesnake environment in an RL framework.

### Introduction to reinforcement learning

Reinforcement learning develops strategies for sequential decision-making problems. Given a pre-specified reward signal, the RL agent interacts with the environment. Its goal is to choose actions based on the state it receives so as to maximize the expected cumulative reward. The strategy for choosing actions is called a policy, π(a|s). Assume you have a starting point t=0 and an ending point t=T (your snake dies). Each game creates an episode that contains a list of states ($s_t$), actions ($a_t$), rewards ($r_t$), and next states ($s_{t+1}$). Therefore, you can express the cumulative reward as the following equation:

$$R = \sum_{t=0}^{T} \gamma^{t} r_t$$

In this equation, γ is the discount factor for future rewards.
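As a small illustration of the discounted cumulative reward, the following sketch sums the rewards of one episode with a discount factor. The function name `discounted_return` and the sample rewards are assumptions made for this example and are not part of the Starter Pack.

```python
def discounted_return(rewards, gamma=0.99):
    """Return sum over t of gamma**t * rewards[t] for one episode."""
    total = 0.0
    # Walk the episode backwards so every earlier step adds one more factor of gamma.
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total


# Hypothetical episode: the snake earns a reward of 1 on each of three turns.
episode_rewards = [1.0, 1.0, 1.0]
print(discounted_return(episode_rewards, gamma=0.5))
```

A discount factor close to 1 makes the agent value long-term survival, whereas a smaller value makes it favor immediate rewards.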

For more information on RL concepts and mathematical formulation, see Automating financial decision making with deep reinforcement learning.

### Understanding the Battlesnake environment

Building a Battlesnake agent requires extending the described RL framework to accommodate multiple agents. The following diagram illustrates a multi-agent RL problem to train a Battlesnake agent.

With the SageMaker Battlesnake Starter Pack, you can customize your environment through the open-source OpenAI Gym interface. A custom file called snake_gym.py is located in the LocalEnv/battlesnake_gym/battlesnake_gym/ folder and specifies all the entities discussed previously. The following code defines the state space (which is defined by the map_size, number and location of snakes, and location of food, and is denoted as observation_space) and the action space (up, down, left, right):

    # Custom environment file in OpenAI Gym

    class BattlesnakeGym(gym.Env):
        def __init__(self, observation_type="flat-51s", map_size=(15, 15),
                     number_of_snakes=4,
                     snake_spawn_locations=[], food_spawn_locations=[],
                     verbose=False, initial_game_state=None, rewards=SimpleRewards()):

            # Action space
            self.action_space = MultiAgentActionSpace(
                [spaces.Discrete(4) for _ in range(number_of_snakes)])

            # Observation space
            self.observation_type = observation_type
            if "flat" in self.observation_type:
                self.observation_space = spaces.Box(low=0, high=2,
                                                    shape=(self.map_size[0],
                                                           self.map_size[1],
                                                           self.number_of_snakes+1),
                                                    dtype=np.uint8)
            elif "bordered" in self.observation_type:
                self.observation_space = spaces.Box(low=0, high=2,
                                                    shape=(self.map_size[0]+2,
                                                           self.map_size[1]+2,
                                                           self.number_of_snakes+1),
                                                    dtype=np.uint8)

            (...)

The default state representation of the Battlesnake gym is to use a multi-channel image, in which the first channel represents the food positions, the second channel indicates the position of the snake your agent is controlling, and the third channel represents the positions of the other snakes. The representation of the environment is defined in LocalEnv/battlesnake_gym/battlesnake_gym/snake.py. See the following image.
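To make the multi-channel layout more concrete, here is a minimal sketch that builds such an array with NumPy. It assumes one channel for food followed by one channel per snake, in line with the `number_of_snakes+1` channels of `observation_space`, and it marks every occupied cell with 1. The helper `build_observation` and this simplified encoding are assumptions for illustration; the gym's own encoding may, for example, distinguish heads from bodies.

```python
import numpy as np


def build_observation(map_size, food, snakes):
    """Encode a board as an array of shape (height, width, 1 + number_of_snakes).

    food:   list of (row, col) positions
    snakes: list of snake bodies, each a list of (row, col) positions
    """
    height, width = map_size
    obs = np.zeros((height, width, 1 + len(snakes)), dtype=np.uint8)
    # Channel 0 holds the food positions.
    for row, col in food:
        obs[row, col, 0] = 1
    # Channels 1..n hold one snake each (simplified: 1 for any occupied cell).
    for idx, body in enumerate(snakes):
        for row, col in body:
            obs[row, col, idx + 1] = 1
    return obs


# Hypothetical 5x5 board with one food item and two snakes.
obs = build_observation(
    map_size=(5, 5),
    food=[(0, 4)],
    snakes=[[(2, 2), (2, 1)], [(4, 0)]],
)
print(obs.shape)
```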

The step() method takes the actions performed by each snake and runs the state transition to get the resulting new state and rewards. You can define your state transitions with the following code:

    def step(self, actions, episodes=None):
        # setup reward dict
        reward = {}
        snake_info = {}

        # Reduce health and move
        for i, snake in enumerate(self.snakes.get_snakes()):
            reward[i] = 0
            if not snake.is_alive():
                continue

            # Reduce health by one
            snake.health -= 1
            if snake.health == 0:
                snake.kill_snake()
                reward[i] += self.rewards.get_reward("starved", i, episodes)
                snake_info[i] = "starved"
                continue

            action = actions[i]
            is_forbidden = snake.move(action)
            if is_forbidden:
                snake.kill_snake()
                reward[i] += self.rewards.get_reward("forbidden_move", i, episodes)
                snake_info[i] = "forbidden move"

        # check for food and collision
        (...)

        return self._get_observation(), reward, snake_alive_dict, {'current_turn': self.turn_count,
                                                                   'snake_health': snakes_health,
                                                                   'snake_info': snake_info}

You can customize your reward function in LocalEnv/battlesnake_gym/battlesnake_gym/rewards.py. By default, the gym supports the following reward definitions:

• Surviving another turn ("another_turn")
• Eating food ("ate_food")
• Winning the game ("won")
• Losing the game ("died")
• Eating another snake ("ate_another_snake")
• Hitting a wall ("hit_wall")
• Hitting another snake ("hit_other_snake")
• Hitting yourself ("hit_self")
• Being eaten by another snake ("was_eaten")
• Getting hit by another snake ("other_snake_hit_body")
• Performing a forbidden move, such as moving south when facing north ("forbidden_move")
• Dying by starving ("starved")
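The reward names in this list could be mapped to values with a small class such as the following sketch. The `get_reward(name, snake_id, episode)` signature mirrors the calls made in `step()`, but the class name `MyRewards` and every numeric value here are assumptions chosen for illustration, not the defaults shipped in rewards.py.

```python
class MyRewards:
    """Hypothetical reward table keyed by the event names the gym supports."""

    def __init__(self):
        # Illustrative values only: reward survival and winning, penalize every way of dying.
        self.reward_dict = {
            "another_turn": 1,
            "ate_food": 0,
            "won": 2,
            "died": -3,
            "ate_another_snake": 0,
            "hit_wall": 0,
            "hit_other_snake": 0,
            "hit_self": 0,
            "was_eaten": 0,
            "other_snake_hit_body": 0,
            "forbidden_move": 0,
            "starved": 0,
        }

    def get_reward(self, name, snake_id, episode):
        # snake_id and episode are accepted so that rewards could depend on
        # the snake or on training progress; this sketch ignores both.
        return self.reward_dict[name]


rewards = MyRewards()
print(rewards.get_reward("another_turn", snake_id=0, episode=0))
```

Keeping the values in one dictionary makes it easy to experiment, for example by rewarding "ate_food" only when you want a more aggressive snake.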

### Reinforcement learning algorithm

There are many algorithms with which to learn an RL policy. The SageMaker Battlesnake Starter Pack provides a classic method called deep Q-learning (DQN). The Q stands for quality and represents how good a given action a is in gaining future rewards given the current state s. Mathematically, the Q-value is defined as the following equation:

$$Q(s, a) = r(s, a) + \gamma \max_{a'} Q(s', a')$$

In this equation, s’ denotes the next state and a’ an action available in that state. Essentially, you consider all possible actions and all possible next states in the preceding equation, and then you take the maximum value given by taking a certain action.

Q(s’,a’) in turn depends on Q(s”,a”). Therefore, the Q-value depends on the Q-values of all future states. See the following equation:

$$Q(s, a) = r(s, a) + \gamma \max_{a'} \left[ r(s', a') + \gamma \max_{a''} Q(s'', a'') \right]$$

We learn the Q function using the following equation:

$$Q_{t+1}(s_t, a_t) = Q_t(s_t, a_t) + \alpha \left[ r_t + \gamma \max_{a} Q_t(s_{t+1}, a) - Q_t(s_t, a_t) \right]$$

In this equation, α is the learning rate. The updating process is called Q-learning. It enables you to update the Q-value based on the current Q-value at timestep t and the Q-value at t+1, and α controls to what extent newly acquired information overrides the old information. Theoretical analysis has proven that under mild conditions on α, the update converges to the optimal policy π*, assuming infinite random action selection.
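The Q-learning update can be sketched in tabular form as follows. This is a minimal illustration that assumes a small, discrete state space stored in a dictionary; the function name `q_learning_update`, the state labels, and the parameter values are assumptions for this example. The Starter Pack itself approximates the Q-values with a neural network instead of a table.

```python
from collections import defaultdict


def q_learning_update(q_table, state, action, reward, next_state,
                      alpha=0.1, gamma=0.9, n_actions=4):
    """Apply one Q-learning update to q_table and return the new Q-value."""
    # Best value obtainable from the next state under the current estimates.
    best_next = max(q_table[(next_state, a)] for a in range(n_actions))
    td_target = reward + gamma * best_next
    # Move the old estimate towards the target by a fraction alpha.
    q_table[(state, action)] += alpha * (td_target - q_table[(state, action)])
    return q_table[(state, action)]


# Unseen (state, action) pairs start with a Q-value of 0.
q_table = defaultdict(float)
new_value = q_learning_update(q_table, "s0", 3, 1.0, "s1", alpha=0.5, gamma=0.9)
print(new_value)
```

With α = 0.5, the first update moves the estimate halfway from 0 towards the target, which shows how the learning rate balances old and new information.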

In practice, the environment can have a considerable number of states, and it’s not feasible to record all Q-values in a table. This is also true for Battlesnake because the input is an image of the current state. The amount of memory and time required to save and update the Q-table is unrealistic because every input image can be different. To mitigate this issue, you can approximate the Q-values with a neural network, which leads to deep Q-learning. Specifically, the game state is the input to the network, and the Q-values of all possible actions are the output. This post provides an attention- and concatenation-based Q-network, which you can use as-is or as the starting point for your custom snake. You can find the network in qnetworks.py under LocalEnv/battlesnake_src/networks/.

The SageMaker Battlesnake Starter Pack includes a model trained on the described RL framework. This post includes the following steps:

1. Deploying this trained model and a development environment with which you can improve it
2. Evaluating new heuristic-based rules to customize your snake’s behavior
3. Customizing and tuning your RL algorithm

## Step 1: Deploying the Amazon SageMaker Battlesnake Starter Pack

This section contains the following steps:

1. Launch the Amazon SageMaker Battlesnake Starter Pack to deploy a starter snake and the development environment to customize the snake.
2. Link the API Gateway to the Battlesnake engine. The snake is then available to compete in the Battlesnake arena.

### Deploying the Battlesnake Starter Pack

To deploy the Starter Pack, complete the following steps:

1. Navigate to Deploy environment in the GitHub repo.
2. Choose deploy in your desired Region.

The Battlesnake engine runs on us-west-2; you may want to deploy in the same Region to have the lowest latency.

After you choose deploy, you should be in the CloudFormation stack creation process.

1. In Parameters, you can define the default Amazon EC2 instance types to use.

Instance types m5.xlarge and m4.xlarge are a part of the free tier of Amazon SageMaker. However, they are not available to new AWS accounts by default. If you are new to AWS, you can instead use a t2.medium instance, which is both cost-effective and sufficiently powerful for the Battlesnake use case.

1. In Parameters, you can also define your snake’s color, head style, and tail style.

You should see the following image of snakes in different colors and styles.

The following screenshot shows the different snake styles you can choose from.

1. In the Capabilities and transforms section, select all the permissions.
2. Choose Create stack.

You are now on the BattlesnakeEnvironment creation page. After about 10 minutes, you see the status CREATE_COMPLETE.

If you ever need to navigate to the stack again, navigate to the AWS CloudFormation console and choose BattlesnakeEnvironment.

1. On the Outputs tab, choose the link next to CheckSnakeStatus.

You should be redirected to a webpage indicating the status of the snake creation. The following lines show an example of the webpage when the snake is being created.

Sagemaker endpoint status : Creating

You can visit the Amazon SageMaker service page in the AWS Management Console to see detailed information.

After approximately 15 minutes, you should see snake status : ready, which indicates that the Amazon SageMaker endpoint creation is complete. You now have a deployed web server that can respond to the Battlesnake engine.

To create a snake and link it to the Battlesnake engine, complete the following steps:

1. Create a snake on the Battlesnake website.
2. For Name, enter your desired name.
3. For URL, enter the URL for your SnakeAPI.

The URL is available on the Outputs tab of your CloudFormation stack.

1. Optionally, enter a description and tags.
2. Optionally, select if others can add your snake to a game.
3. Choose Save.

You can now test your snake against existing snakes. The following video shows a game of Battlesnake.

The snake My SageMaker Snake is slow but steady, and ultimately wins. For more information, see the details of this game on the Battlesnake website.

The following screencast shows the full procedures of this step. The video has been edited to skip the wait time.

### Navigating the development environment

Now that you have a working snake, you can start exploring the development environment. On the Outputs tab of the CloudFormation stack, you can see the following keys:

• SourceEditionInNotebook – A link to the source directory of the development environment in a Jupyter notebook.
• HeuristicsDevEnvironment – A link to the heuristics development notebook. Details are in Step 2 of this post.
• ModelTrainingEnvironment – A link to the RL training notebook. Details are in Step 3 of this post.

### The Starter Pack environment

The Starter Pack creates a local development directory. You can access it by opening the SourceEditionInNotebook and navigating to battlesnake/LocalEnv.

In LocalEnv, you can modify the following files to customize the SageMaker Starter Pack training, heuristics development, and evaluations:

• battlesnake_gym/battlesnake_gym/ – Files in this directory define the Battlesnake RL environment based on the OpenAI gym. This directory is copied into the Amazon SageMaker training job. It contains the following files:
  • food.py handles the food spawning mechanism and food representation.
  • rewards.py defines the reward function.
  • snake.py defines the snake movement mechanism, representation, and death conditions.
  • snake_gym.py defines the interactions between the snakes, food, and environment. The RL agents also interact with this file.
• battlesnake_inference/ – Files in this directory define the Amazon SageMaker endpoint used for inference of the model. It includes the following files:
  • battlesnake_heuristics.py defines custom user-defined rules that can override the decision of the RL agent.
  • predict.py is the entry point to the Amazon SageMaker endpoint. It loads the network artifacts, obtains the decisions from the network, and activates the heuristics module.
• battlesnake_src/ – Files in this directory define the RL training jobs. This directory is copied into the Amazon SageMaker training job. It contains the following files:
  • train.py is the entry point to the training job. It defines the hyperparameters of the agent and the environment. It also initiates the training loop.
  • dqn_run.py defines the deep Q-learning training loop.
  • networks/agent.py defines the RL agent. The agent performs the interaction between the Q network and the environment (to get an action and to learn).
  • networks/qnetworks.py defines the neural network (Q network) that the RL agent uses.

## Step 2: Customizing your snake’s behavior

This section demonstrates how to write the heuristics that serve as ground rules for your snake. These ground rules override certain detrimental decisions that the deep learning model makes. For example, the following screenshot shows two snakes. If your snake (the shorter one) is in a situation where one decision leads to certain death (going down and hitting the longer snake), heuristics-based rules make sure that your snake goes up instead.

Other examples include rules that determine if a given movement decision results in a collision with a wall or if your snake can eat a shorter snake.

The heuristics-based rules are invoked within the Amazon SageMaker endpoint. The endpoint queries the deep learning model for an action. The action from the deep learning model and the state of the environment are fed into the heuristics code for any overriding actions.
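As an example of such an override, the following sketch rules out moves that would take the snake's head off the board and then picks the best remaining action. It assumes the Q-values are ordered [up, down, left, right], that "up" decreases the row index, and that the json dictionary carries "you" and "board" entries in the Battlesnake API format. The function name `avoid_walls` is hypothetical and not part of the Starter Pack.

```python
import numpy as np

# (row, column) offsets for the actions [up, down, left, right].
MOVES = {0: (-1, 0), 1: (1, 0), 2: (0, -1), 3: (0, 1)}


def avoid_walls(json, action):
    """Return the highest-Q action that keeps the head on the board."""
    head = json["you"]["body"][0]
    height = json["board"]["height"]
    width = json["board"]["width"]
    q_values = np.array(action, dtype=float)
    for move, (di, dj) in MOVES.items():
        i, j = head["y"] + di, head["x"] + dj
        if not (0 <= i < height and 0 <= j < width):
            # Moving here would hit a wall, so make sure argmax never selects it.
            q_values[move] = -np.inf
    return int(np.argmax(q_values))


# Hypothetical situation: the head is in the top-left corner and the model prefers "up".
sample_json = {
    "you": {"body": [{"x": 0, "y": 0}]},
    "board": {"height": 11, "width": 11},
}
print(avoid_walls(sample_json, [0.9, 0.1, 0.8, 0.2]))
```

Masking the Q-values instead of choosing a random safe move keeps the model's preference among the moves that remain legal.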

To develop your heuristics, open the heuristic development notebook (defined in HeuristicsDevEnvironment). This notebook simulates your model (with your heuristics), provides you with step-by-step visualization of your snakes, and deploys your new heuristics.

You can find a template for the heuristics code in LocalEnv/battlesnake_inference/battlesnake_heuristics.py. This consists of a class with a main run function. See the following code:

    import numpy as np
    import random

    class MyBattlesnakeHeuristics:
        '''
        The BattlesnakeHeuristics class allows you to define handcrafted rules of the snake.
        '''
        FOOD_INDEX = 0

        def __init__(self):
            pass

        def go_to_food_if_close(self, state, json):
            # Example heuristic to move towards food if it's close to you.

            # Get the position of the snake head
            your_snake_body = json["you"]["body"]
            i, j = your_snake_body[0]["y"], your_snake_body[0]["x"]

            # Set food_direction towards food
            food = state[:, :, self.FOOD_INDEX]

            # Note that there is a -1 border around state so i = i + 1, j = j + 1
            if -1 in state:
                i, j = i+1, j+1

            food_direction = None
            if food[i-1, j] == 1:
                food_direction = 0 # up
            if food[i+1, j] == 1:
                food_direction = 1 # down
            if food[i, j-1] == 1:
                food_direction = 2 # left
            if food[i, j+1] == 1:
                food_direction = 3 # right
            return food_direction

        def run(self, state, snake_id, turn_count, health, json, action):
            '''
            The main function of the heuristics.

            Parameters:
            -----------
            state: np.array of size (map_size[0]+2, map_size[1]+2, 1+number_of_snakes)
            Provides the current observation of the gym.
            Your target snake is state[:, :, snake_id+1]

            snake_id: int
            Indicates the id where id in [0...number_of_snakes]

            turn_count: int
            Indicates the number of elapsed turns

            health: dict
            Indicates the health of all snakes in the form of {snake_id (int): health (int)}

            json: dict
            Provides the same information as above, in the same format as the battlesnake engine.

            action: np.array of size 4
            The qvalues of the actions calculated. The 4 values correspond to [up, down, left, right]
            '''
            log_string = ""
            # The default best_action to take is the one that has the largest Q value.
            # If you think of something else, you can edit how best_action is calculated
            best_action = int(np.argmax(action))

            # Example heuristics to eat food that you are close to.
            if health[snake_id] < 30:
                food_direction = self.go_to_food_if_close(state, json)
                # Compare with None explicitly: 0 (up) is a valid direction but is falsy
                if food_direction is not None:
                    best_action = food_direction
                    log_string = "Went to food if close."

            assert best_action in [0, 1, 2, 3], "{} is not a valid action.".format(best_action)
            return best_action, log_string

The run function takes in the representation of the environment as the following arguments:

• state – An image format that can represent the environment
• json – A dictionary format that can also represent the environment
• snake_id – An integer indicating the ID of the snake
• turn_count – An integer indicating the current turn count
• health – A dictionary that represents the health of each snake
• action – A numpy array (of size 4) that provides the Q-values that represent the four possible actions (up, down, left, right), which is the output of the neural network.

In the preceding code, you can see a simple example heuristic, go_to_food_if_close. When your snake’s health is below 30 and there is food beside your snake, your snake moves towards the food. The purpose of this rule is to reduce the chances of your snake starving to death.

To write your own heuristics, you have to understand the advantages and disadvantages of using each representation of the environment. For example, the image representation is good when you are exploring possible moves or distances between different coordinates (such as determining if your snake is adjacent to a wall). However, the order of the snake body is lost in the image representation. The dictionary representation provides easy access to the head and tail of each snake. The example heuristics showcase the advantages of each representation method.

The snake head was obtained from the json argument (i, j = your_snake_body[0]["y"], your_snake_body[0]["x"]). Conversely, if you wanted to get the coordinates of the head from the image representation, you would need an O(m*n) algorithm to iterate through the image to search for the head.

The direction of movement towards nearby food is obtained from the image representation (if food[i-1, j] == 1: ...). To use the json argument to search for the food, you would have to use an O(n) algorithm to iterate through the list of foods for each possible direction.

Therefore, balancing between using the json list definitions and the state image allows you to write heuristic-based rules that override the decisions of the model.
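For comparison, the following sketch performs the nearby-food check using only the json argument, iterating over the food list as described above. It assumes the json dictionary follows the Battlesnake API layout with "you" and "board" entries, and it uses the [up, down, left, right] action order with "up" decreasing the y index. The function name `adjacent_food_direction` is hypothetical.

```python
def adjacent_food_direction(json):
    """Return 0-3 (up, down, left, right) if food is next to the head, else None."""
    head = json["you"]["body"][0]
    # Map (dy, dx) offsets from the head to action indices.
    offsets = {(-1, 0): 0, (1, 0): 1, (0, -1): 2, (0, 1): 3}
    # A single pass over the food list, so the cost grows with the number of food items.
    for food in json["board"]["food"]:
        delta = (food["y"] - head["y"], food["x"] - head["x"])
        if delta in offsets:
            return offsets[delta]
    return None


# Hypothetical board: one distant food item and one directly above the head.
sample_json = {
    "you": {"body": [{"x": 3, "y": 3}]},
    "board": {"food": [{"x": 1, "y": 3}, {"x": 3, "y": 2}]},
}
print(adjacent_food_direction(sample_json))
```

Because 0 is a valid action (up), callers of a helper like this should test the result with `is not None` instead of relying on truthiness.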

To evaluate your heuristics, you can use the heuristic development notebook. This notebook simulates the model with your heuristics and provides a step-by-step playback of all events. First, define the initial conditions of the environment in the Define the openAI gym section of the notebook. To define the initial condition, set USE_INITIAL_STATE to True and specify the coordinates of the food and snakes based on the Battlesnake API. Note that there is an easy way to define the initial_state here. The following code shows the cell with its default setting, USE_INITIAL_STATE = False, in which case initial_state is reset to None:

    USE_INITIAL_STATE = False

    # Sample initial state for the situation simulator
    initial_state = {
        "turn": 4,
        "board": {
            "height": 11,
            "width": 11,
            "food": [{"x": 1, "y": 3}],
            "snakes": [{
                    "health": 90,
                    "body": [{"x": 8, "y": 5}],
                },
                {
                    "health": 90,
                    "body": [{"x": 1, "y": 6}],
                },
                {
                    "health": 90,
                    "body": [{"x": 3, "y": 3}],
                },
                {
                    "health": 90,
                    "body": [{"x": 6, "y": 4}],
                },
            ]
        }
    }

    if not USE_INITIAL_STATE:
        initial_state = None

    map_size = (11, 11)
    number_of_snakes = 4
    env = BattlesnakeGym(map_size=map_size, number_of_snakes=number_of_snakes, observation_type="bordered-51s",
                         initial_game_state=initial_state)

Proceeding through the notebook loads the trained neural network and simulates the actions of each snake. You can see the step-by-step visualizer after you run the cells in the Playback the simulation section. The following screencast of the visualizer shows each step iterated step-by-step, then plays automatically.

During the visualization process, if you find that your snake cannot handle certain situations, you can run the following cell containing get_env_json(). This prints out a dictionary that represents the current state of the environment. You can directly copy the dictionary into the initial_state in the Define the openAI gym section and write specific heuristics to address that situation.

After you have finished with your heuristics, you can create an Amazon SageMaker endpoint with the new heuristics. Define the Amazon SageMaker session and only run the cell titled “Run if you retrained the model” if you have any trained model artifacts that you did not upload (see Step 3 in this post for more details). Continue to proceed through the notebook to update your Amazon SageMaker endpoint.

The deployment of your new Amazon SageMaker endpoint takes approximately 10–15 minutes. You do not need to update your snake on the Battlesnake engine.

## Step 3: Customizing the reinforcement learning training environment

This section shows how to train your deep RL model for Battlesnake, perform hyperparameter optimization (HPO) to find the best model, and deploy the updated model into an Amazon SageMaker endpoint.

In this step, you use the SagemakerModelTraining notebook (defined in ModelTrainingEnvironment). You can define the following two parameters of the notebook:

• The run_hpo flag determines whether you should run a single training job or run HPO.
• The map_size parameter determines the size of the map for the environment to train on. The current neural network is limited to a single map size.

See the following code:

run_hpo = False
map_size = (15, 15)

You can define the hyperparameters to train in the cell Define the hyperparameters of your job. For more information on the details of each parameter, see the GitHub repo. See the following code:

    map_size_string = "[{}, {}]".format(map_size[0], map_size[1])
    static_hyperparameters = {
        'qnetwork_type': "attention",
        'seed': 111,
        'number_of_snakes': 4,
        'episodes': 10000,
        'print_score_steps': 10,
        'activation_type': "softrelu",
        'state_type': 'one_versus_all',
        'sequence_length': 2,
        'repeat_size': 3,
        'kernel_size': 3,
        'starting_channels': 6,
        'map_size': map_size_string,
        'snake_representation': 'bordered-51s',
        'save_model_every': 700,
        'eps_start': 0.99,
        'models_to_save': 'local'
    }
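The name of the `eps_start` hyperparameter suggests an epsilon-greedy exploration schedule, which is common in deep Q-learning. Assuming that interpretation, the following sketch shows how such a schedule could work: the agent picks a random action with probability epsilon, and epsilon shrinks as training progresses. The parameters `eps_end` and `eps_decay`, their values, and both function names are assumptions for illustration; see the GitHub repo for the schedule the Starter Pack actually uses.

```python
import random


def decayed_epsilon(episode, eps_start=0.99, eps_end=0.01, eps_decay=0.995):
    """Exponentially decay epsilon from eps_start, never going below eps_end."""
    return max(eps_end, eps_start * eps_decay ** episode)


def epsilon_greedy(q_values, epsilon, rng=random):
    """Pick a random action with probability epsilon, else the highest-Q action."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])


# Early in training the agent mostly explores; later it mostly exploits.
print(decayed_epsilon(0), decayed_epsilon(10000))
print(epsilon_greedy([0.1, 0.9, 0.3, 0.2], epsilon=0.0))
```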

You can create an MXNet estimator in the Train your model here cell to launch a training job. This process could take several hours, depending on your instance type. See the following code:

    estimator = MXNet(entry_point="train.py",
                      source_dir='battlesnake_src',
                      dependencies=["battlesnake_gym/"],
                      role=role,
                      train_instance_type=train_instance_type,
                      train_instance_count=1,
                      output_path=s3_output_path,
                      framework_version="1.6.0",
                      py_version='py3',
                      base_job_name=job_name_prefix,
                      metric_definitions=metric_definitions,
                      hyperparameters=static_hyperparameters
                      )

    estimator.fit()

You should see a training job with the prefix job_name_prefix launched on the Training Jobs session tab in the Amazon SageMaker console. After you choose the training job, you can see detailed information about the training run, for example, Amazon CloudWatch logs, real-time visualization of the metrics defined by metric_definitions, and the Amazon S3 location link for model output and artifacts.
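Amazon SageMaker metric definitions are a list of dictionaries, each with a "Name" and a "Regex" whose capture group is applied to the training logs. The following sketch shows how such definitions extract values from a log line. The metric names, the regular expressions, and the sample log line are assumptions made for illustration; the notebook defines its own metric_definitions to match the output of train.py.

```python
import re

# Hypothetical metric definitions in the Name/Regex format that SageMaker expects.
metric_definitions = [
    {"Name": "timesteps", "Regex": r"Mean timesteps ([0-9]+\.?[0-9]*)"},
    {"Name": "reward", "Regex": r"Mean reward (-?[0-9]+\.?[0-9]*)"},
]


def extract_metrics(log_line, definitions):
    """Apply each regex to a log line and collect the captured values as floats."""
    metrics = {}
    for definition in definitions:
        match = re.search(definition["Regex"], log_line)
        if match:
            metrics[definition["Name"]] = float(match.group(1))
    return metrics


# Hypothetical line printed by a training script.
sample_line = "Episode 10: Mean timesteps 42.5, Mean reward 3.2"
print(extract_metrics(sample_line, metric_definitions))
```

Testing the regular expressions locally like this before launching a job helps avoid empty metric graphs in the console.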

If you run an HPO job, you have to define hyperparameter ranges to iterate through. You can also perform hyperparameter optimization on the static_hyperparameters that were previously defined. See the following code:

    hyperparameter_ranges = {
        'buffer_size': IntegerParameter(1000, 6000),
        'update_every': IntegerParameter(10, 20),
        'batch_size': IntegerParameter(16, 256),

        'lr_start': ContinuousParameter(1e-5, 1e-3),
        'lr_factor': ContinuousParameter(0.5, 1.0),
        'lr_step': IntegerParameter(5000, 30000),

        'tau': ContinuousParameter(1e-4, 1e-3),
        'gamma': ContinuousParameter(0.85, 0.99),

        'depth': IntegerParameter(10, 256),
        'depthS': IntegerParameter(10, 256),
    }

Whether you choose to run HPO or a single training job, the model artifacts are saved on Amazon S3. The notebook automatically downloads the trained model artifacts from Amazon S3, packages them, and uses them to update the Amazon SageMaker endpoint. You do not have to update your snake on the Battlesnake engine.

## Conclusion

The mission of Battlesnake is for developers of all levels to have fun with friends, learn new skills, and build a community. The SageMaker Battlesnake Starter Pack builds on this mission to encompass developers at all levels. Whether you want to build a snake based on decision trees or RL, you can use the SageMaker Battlesnake Starter Pack.

This post showed you how to deploy a snake in the Battlesnake arena to compete against other snakes, along with the development environment to modify and upgrade the models with your custom configuration. Historically, snakes built with heuristics-based decision trees outperformed machine learning-based snakes in the offline Battlesnake competitions. Lately, RL-based snakes have been rising up in the ranks in the online global arena. Try out the SageMaker Battlesnake Starter Pack and see if this year a hybrid RL and heuristics snake can take the win!

Jonathan Chung is an Applied Scientist in AWS. He works on applying deep learning to various applications including games and document analysis. He enjoys cooking and visiting historical cities around the world.

Xavier Raffin is a Solutions Architect at AWS, where he helps customers transform their businesses and build industry-leading cloud solutions. Xavier’s curiosity has pushed him to apply technology to many domains: public transportation, web mapping, IoT, aeronautics, and space. He has contributed to several open-source and open-data projects: OpenStreetMap, Navitia, and Transport APIs.

Anna Luo is an Applied Scientist in AWS. She works on utilizing reinforcement learning techniques for different domains including supply chain and recommender system. Her current personal goal is to master snowboarding.

Bharathan Balaji is a Research Scientist in AWS and his research interests lie in reinforcement learning systems and applications. He contributed to the launch of Amazon SageMaker RL and AWS DeepRacer. He likes to play badminton, cricket and board games during his spare time.

Vishaal Kapoor is a Senior Software Development Manager with AWS AI. He loves all things AI and works on building deep learning solutions using SageMaker. In his spare time, he mountain bikes, snowboards, and spends time with his family.
