ANT61 accelerates reinforcement learning for space robots using AWS

Introduction

ANT61 is a robotics start-up based in Sydney, Australia, that develops autonomous robots for space applications, such as in-orbit satellite servicing, so that human lives are not put at risk. Their robots use AI-based control systems that enable them to perform installation and servicing tasks in unpredictable environments where remote control is impossible.

In this post, you’ll learn how ANT61 uses simulation-based reinforcement learning to train its robots to perform tasks in space. You’ll understand how they use AWS to run simulations in parallel to reduce the time and cost of their development.

Robots learn by trial and error

Over the past few years, reinforcement learning has been used more and more to train AI models for robots. With reinforcement learning, robots improve continuously through trial and error. Instead of engineers writing code that moves the robot’s joints, the robot learns to perform a movement faster, more safely, and more reliably through feedback from its environment.
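To make "learning from feedback" concrete, here is a toy sketch: an epsilon-greedy learner that repeatedly tries one of three hypothetical actions, receives a noisy reward, and gradually favors the action that pays off best. This is a deliberately simplified stand-in (a multi-armed bandit), not ANT61’s control system; the function name, reward values, and noise level are all invented for illustration.

```python
import random


def epsilon_greedy(true_rewards, steps=2000, epsilon=0.1, seed=0):
    """Toy trial-and-error learner: find the best action from reward feedback alone."""
    rng = random.Random(seed)
    n = len(true_rewards)
    estimates = [0.0] * n  # running average of the reward seen for each action
    counts = [0] * n       # number of times each action has been tried
    for _ in range(steps):
        if rng.random() < epsilon:
            action = rng.randrange(n)  # explore: try a random action
        else:
            action = max(range(n), key=lambda a: estimates[a])  # exploit the best guess
        # Hypothetical environment: a noisy reward around each action's true value.
        reward = true_rewards[action] + rng.gauss(0.0, 0.1)
        counts[action] += 1
        estimates[action] += (reward - estimates[action]) / counts[action]
    return estimates, counts


# Hypothetical payoffs for three candidate movements; the learner is never told them.
estimates, counts = epsilon_greedy([0.2, 0.5, 0.9])
best = max(range(3), key=lambda a: estimates[a])
print("learned best action:", best)
```

No one writes the rule "pick action 2"; the preference emerges from the rewards. Real robot learning replaces the three discrete actions with continuous joint commands and the lookup table with a neural network.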

ANT61 is taking advantage of recent developments in AI to build autonomous robots for work in space and other dangerous environments. They predict that learned robotic skills will soon surpass human-coded behaviors, much as neural networks first outperformed hand-written algorithms at object recognition and then eventually outperformed humans as well. They use deep reinforcement learning to train general robotic skills for installation and assembly, as well as satellite-servicing skills such as docking, de-tumbling, refueling, and part replacement.

By relying on reinforcement learning, ANT61’s engineers can put their brains to much better use, solving problems that, for now, still require a human being.

Observations are the new Big Data

A decade ago, there was a race to accumulate as much data as possible. More data meant better machine learning models, more accurate predictions, happier customers, and a larger market share.

With reinforcement learning, observations are the data. Observations are the inputs to the neural network that is being trained to choose actions. For example, a robot that needs to locate an object it is trying to pick up can use images from its camera as observations. Observations are produced simply by the robot working and trying to solve a problem on its own. More observations mean a better control system, a more efficient robot, more tasks accomplished at lower cost, and more value generated.
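The sketch below shows what "observations as data" can look like in practice: every step of a robot’s attempt is recorded as an (observation, action, reward) record that later becomes training data. It assumes a hypothetical one-dimensional reaching task, with the observation reduced to two numbers (gripper position and object position) instead of camera images; the `Transition` record and `collect_episode` helper are illustrative names, not part of ANT61’s code.

```python
import random
from dataclasses import dataclass


@dataclass
class Transition:
    observation: tuple  # what the robot sensed: (gripper_x, object_x)
    action: float       # the movement the policy commanded
    reward: float       # feedback from the environment


def collect_episode(policy, steps=50, seed=0):
    """Run one hypothetical 1-D reaching episode and record every observation."""
    rng = random.Random(seed)
    gripper, target = 0.0, rng.uniform(-1.0, 1.0)  # hypothetical object position
    transitions = []
    for _ in range(steps):
        observation = (gripper, target)
        action = policy(observation)
        gripper += action
        reward = -abs(target - gripper)  # the closer to the object, the better
        transitions.append(Transition(observation, action, reward))
    return transitions


def random_policy(observation):
    """An untrained policy: even random behavior produces training data."""
    return random.uniform(-0.1, 0.1)


dataset = collect_episode(random_policy)
print(len(dataset), "observations collected")
```

Even an untrained, random policy fills the dataset; each extra episode adds more records, which is why more robot time translates directly into more training data.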

The cost of an experiment

Most companies use simulated environments to train their machine-learning models. ANT61 uses Gazebo as its primary simulation platform, allowing them to create worlds quickly.

Training using one simulation on one EC2 instance

When training a robot with reinforcement learning, the simulation may need to run thousands or even tens of thousands of iterations to teach the robot. Run serially on a desktop system, these simulations can take far too long to be practical for software developers. However, because reinforcement learning updates the model only after running thousands of simulations and aggregating their observations, it is a highly parallelizable workload, ideal for running across many cloud instances.
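The property that makes this workload parallel is that simulation runs are independent until their observations are pooled for an update. A minimal sketch of that fan-out/aggregate pattern follows; threads stand in for separate machines, and `run_episode` is a hypothetical placeholder for a real simulator run (a production setup would use separate processes or instances, for example via Ray).

```python
import random
from concurrent.futures import ThreadPoolExecutor


def run_episode(seed):
    """Stand-in for one independent simulation run (hypothetical)."""
    rng = random.Random(seed)
    # Each run returns its own list of (observation, reward) samples.
    return [(rng.random(), rng.random()) for _ in range(100)]


def collect_batch(num_episodes, max_workers=8):
    """Run independent simulations concurrently, then pool their observations."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = list(pool.map(run_episode, range(num_episodes)))
    # Aggregation step: one flat batch feeds a single model update.
    return [sample for episode in results for sample in episode]


batch = collect_batch(num_episodes=32)
print(len(batch))  # 3200
```

Because no episode depends on another, the only coordination point is the final aggregation, so adding workers shortens wall-clock time without changing what the model learns from.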

Initially, ANT61 simulated one robot per simulation environment, running at close to real-world speed (a real-time factor of 0.8). On the least expensive Amazon EC2 instance with a GPU, a single training run took two weeks to complete. That was too slow for engineers to iterate and complete projects on time.
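The real-time factor (RTF) is the ratio of simulated time to wall-clock time, so an RTF of 0.8 means the simulation runs slightly slower than reality. The small helpers below (hypothetical names) make the arithmetic explicit, under the simplifying assumption that the simulation runs continuously for the whole period.

```python
def wall_clock_hours(sim_hours, real_time_factor):
    """Wall-clock hours needed to simulate `sim_hours`.

    Real-time factor (RTF) = simulated time / wall-clock time, so an RTF
    below 1 means the simulation runs slower than reality.
    """
    return sim_hours / real_time_factor


def experience_hours(wall_hours, real_time_factor):
    """Simulated experience gathered in `wall_hours` of real time."""
    return wall_hours * real_time_factor


print(wall_clock_hours(1.0, 0.8))     # 1.25: each simulated hour takes 1.25 real hours
print(experience_hours(14 * 24, 0.8))  # one robot's simulated experience in two weeks
```

At this rate, a single simulated robot gathers under 270 hours of experience in two weeks, which is why adding more robots, rather than waiting longer, is the lever that matters.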

The next step was to scale horizontally by adding more servers, each running one robot. This allowed them to run numerous experiments simultaneously, but each experiment still took two weeks to finish.

Multi-agent training

Modern reinforcement learning algorithms such as TD3 and APPO can train one model from observations collected by multiple agents. ANT61 took advantage of this to reduce the calendar time of each experiment by having multiple robots train in parallel, pooling their observations to train a single shared model. Because every robot’s experience updates the same model, a behavior that one robot learns is passed on to the rest of the group.
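For an off-policy algorithm such as TD3, one common way to pool experience is a single replay buffer that every robot writes to and the learner samples from (on-policy methods such as APPO instead have workers stream fresh trajectories to a central learner). The class below is a minimal sketch of that shared-buffer idea; `SharedReplayBuffer` and the four-robot setup are hypothetical, not ANT61’s implementation.

```python
import random
from collections import deque


class SharedReplayBuffer:
    """Minimal sketch of one experience pool shared by many robots.

    Every robot appends its transitions here, and the learner samples
    mixed batches to update the single policy that all robots run.
    """

    def __init__(self, capacity=100_000):
        self.storage = deque(maxlen=capacity)  # oldest transitions are dropped first

    def add(self, robot_id, transition):
        self.storage.append((robot_id, transition))

    def sample(self, batch_size, rng=random):
        # A batch draws on every robot's experience, so what one robot
        # stumbles upon influences the policy used by all of them.
        return rng.sample(list(self.storage), batch_size)


buffer = SharedReplayBuffer()
for robot_id in range(4):  # four hypothetical robots collecting experience
    for step in range(250):
        buffer.add(robot_id, {"observation": step, "reward": 0.0})

batch = buffer.sample(64, rng=random.Random(0))
print(len(buffer.storage), "transitions from", len({r for r, _ in batch}), "robots in the batch")
```

Each training batch mixes transitions from several robots, which is the mechanism behind one robot’s discovery reaching the whole group.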

Multi-robot simulation-based training

After running several experiments, the engineers at ANT61 found that instead of running each robot in its own simulation application, it is much more efficient to run multiple instances of the same robot in one simulated world. Placing 12 robots in the scene yielded four times as many observations from the same iteration, bringing the experiment duration down from two weeks to four days.
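The sketch below illustrates why one world with many robots yields more data: a single `step()` call advances every robot and returns one observation per robot. `MultiRobotWorld` and its reaching task are hypothetical. The sketch ignores physics cost; in a real simulator each step gets slower as robots are added, which is likely why the measured gain (four times with 12 robots) is smaller than the robot count.

```python
import random


class MultiRobotWorld:
    """Hypothetical world that hosts several copies of the same robot.

    One call to step() advances the shared world once and returns one
    observation per robot, so a single simulator instance yields N samples.
    """

    def __init__(self, num_robots, seed=0):
        self.rng = random.Random(seed)
        self.positions = [0.0] * num_robots
        self.targets = [self.rng.uniform(-1.0, 1.0) for _ in range(num_robots)]

    def step(self, actions):
        assert len(actions) == len(self.positions), "one action per robot"
        observations, rewards = [], []
        for i, action in enumerate(actions):
            self.positions[i] += action
            observations.append((self.positions[i], self.targets[i]))
            rewards.append(-abs(self.targets[i] - self.positions[i]))
        return observations, rewards


world = MultiRobotWorld(num_robots=12)
total = 0
for _ in range(100):
    observations, rewards = world.step([0.01] * 12)
    total += len(observations)
print(total)  # 1200 observations from 100 steps of one simulation
```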

ANT61 uses the Ray library for training. Ray can create and manage clusters of EC2 instances on AWS and run machine learning training tasks on those instances. With this setup, each instance in a cluster trains several robots in parallel, generating millions of observations every hour. The primary neural network is trained on the observations gathered by these distributed tasks, so the robots become continuously smarter.

Thanks to horizontal scaling on AWS, running ten robots for 1,000 hours costs the same as running 10,000 robots for one hour. Using this parallel training, ANT61 decreased their experiment time from four days to four hours at the same cost. By running clusters of simulations, the team could scale their virtual robot fleet until they hit their monthly training budget. To get the most out of that budget, they then turned to optimizing instance costs.
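The cost claim follows from the bill scaling with robot-hours rather than with calendar time. The sketch below makes that explicit under two stated assumptions: pricing is linear in robot-hours, and the price per robot-hour is a hypothetical placeholder, not an AWS rate.

```python
PRICE_PER_ROBOT_HOUR = 0.05  # hypothetical USD figure, not an AWS price


def experiment_cost(num_robots, hours, price_per_robot_hour=PRICE_PER_ROBOT_HOUR):
    """Cost of an experiment, assuming the bill is linear in robot-hours."""
    return num_robots * hours * price_per_robot_hour


def fleet_multiplier(old_hours, new_hours):
    """How many times more robots are needed to finish the same work sooner."""
    return old_hours / new_hours


# 10 robots x 1,000 hours and 10,000 robots x 1 hour are both 10,000 robot-hours.
print(experiment_cost(10, 1_000) == experiment_cost(10_000, 1))  # True

# Going from four days (96 hours) to four hours needs 24 times as many robots.
print(fleet_multiplier(4 * 24, 4))  # 24.0
```

In practice the linearity only holds until fixed overheads (cluster start-up, the head node, the learner) become significant, so the equality is an approximation.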

Using an EC2 cluster for training

Initially, the ANT61 team used g4dn.xlarge instances to run both simulation and training workloads. These GPU-accelerated Amazon EC2 instances are great for neural network training; however, most of the time and resources in a simulation-based reinforcement learning workflow like this one are spent collecting observations, which is a CPU-intensive task that doesn’t require a GPU. That left room for further optimization by finding the best instance type for the simulations. After testing all CPU-focused instance types, the team found that the m5.large instance offered the lowest cost per observation of all the instance types and generations they tried. It turned out to be five times less expensive than the g4dn.xlarge instances used previously.
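The selection criterion described above, cost per observation rather than hourly price alone, can be sketched as below. All prices and throughput figures are hypothetical placeholders chosen for illustration; to use this for real decisions, substitute current prices for your Region and throughput measured with your own simulation.

```python
def cost_per_observation(hourly_price, observations_per_hour):
    """Price of one observation on a given instance type."""
    return hourly_price / observations_per_hour


def cheapest_instance(candidates):
    """Pick the instance type with the lowest cost per observation.

    `candidates` maps instance type -> (hourly price, measured observations/hour).
    """
    return min(candidates, key=lambda name: cost_per_observation(*candidates[name]))


# Hypothetical numbers for illustration only, not real prices or benchmarks.
candidates = {
    "g4dn.xlarge": (0.50, 10_000),
    "m5.large": (0.10, 10_000),
    "c5.large": (0.09, 7_000),
}
print(cheapest_instance(candidates))  # m5.large
```

In this made-up example the instance with the lowest hourly price is not the winner, because it produces fewer observations per hour; that is the reason to normalize by throughput.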

Clustered spot instances

Amazon EC2 Spot Instances let you take advantage of unused EC2 capacity in the AWS Cloud. Spot Instances are available at up to a 90% discount compared to On-Demand prices and are well suited to jobs that can be interrupted and restarted without data loss. Spot Instances are ideal for ANT61 because their simulations do not store persistent data on the worker nodes; instead, the workers send observations to the head node in real time. By using Spot Instances, ANT61 achieved a 62% cost reduction compared with On-Demand EC2 Instances.
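A minimal sketch of why stateless workers tolerate Spot interruptions: each observation is forwarded to the head node as soon as it is produced, so a reclaimed worker loses only the work in progress. Threads and an in-process queue stand in for EC2 worker nodes and the network link to the head node; the `worker` function and the interruption point are hypothetical.

```python
import queue
import threading


def worker(worker_id, out_queue, num_steps, interrupt_at=None):
    """Hypothetical stateless simulation worker.

    Each observation is sent to the head node immediately, so nothing already
    sent is lost if the worker is reclaimed (for example, a Spot interruption).
    """
    for step in range(num_steps):
        if interrupt_at is not None and step == interrupt_at:
            return  # simulate the instance being reclaimed mid-run
        out_queue.put((worker_id, step))


head_node_queue = queue.Queue()
threads = [
    threading.Thread(target=worker, args=(0, head_node_queue, 100)),
    threading.Thread(target=worker, args=(1, head_node_queue, 100, 40)),  # interrupted
]
for t in threads:
    t.start()
for t in threads:
    t.join()

received = []
while not head_node_queue.empty():
    received.append(head_node_queue.get())
print(len(received))  # 140: everything sent before the interruption was kept
```

Because the head node already holds every observation the interrupted worker produced, a replacement worker can simply start fresh; no checkpoint has to be recovered from the lost instance.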

Robot training machine

ANT61 has built a highly efficient machine for generating observations on AWS, one with enormous potential to scale. They can now run multiple experiments daily, waiting minutes instead of weeks for results, at one fifteenth of the cost of their first experiments on AWS just a few months earlier.

Next steps

ANT61 continues to innovate and improve their robot training workflow, and AWS remains an essential tool in their toolkit. Training robots in the real world is ideal, but real-world observations are much more expensive than simulated ones, in both cost and calendar time. After all, robots can’t learn faster than they can move in the physical world. Today, powerful simulation software such as NVIDIA’s Isaac Sim enables developers to create hyper-realistic environments. Perhaps in the future, someone will find a way to fuse reality and simulation and get the best of both worlds.

AWS Robotics Blog