What is reinforcement learning?

Reinforcement learning (RL) is a machine learning (ML) technique that trains software to make decisions to achieve the most optimal results. It mimics the trial-and-error learning process that humans use to achieve their goals. Software actions that work towards your goal are reinforced, while actions that detract from the goal are ignored. 

RL algorithms use a reward-and-punishment paradigm as they process data. They learn from the feedback of each action and self-discover the best processing paths to achieve final outcomes. The algorithms are also capable of delayed gratification. The best overall strategy may require short-term sacrifices, so the best approach they discover may include some punishments or backtracking along the way. RL is a powerful method to help artificial intelligence (AI) systems achieve optimal outcomes in unseen environments.

What are the benefits of reinforcement learning?

There are many benefits to using reinforcement learning (RL). However, these three often stand out.

Excels in complex environments

RL algorithms can be used in complex environments with many rules and dependencies. In the same environment, a human may not be capable of determining the best path to take, even with superior knowledge of the environment. Instead, model-free RL algorithms adapt quickly to continuously changing environments and find new strategies to optimize results.

Requires less human interaction

In traditional ML algorithms, humans must label data pairs to direct the algorithm. When you use an RL algorithm, this isn’t necessary. It learns by itself. At the same time, it offers mechanisms to integrate human feedback, allowing for systems that adapt to human preferences, expertise, and corrections.

Optimizes for long-term goals

RL inherently focuses on long-term reward maximization, which makes it apt for scenarios where actions have prolonged consequences. It is particularly well-suited for real-world situations where feedback isn't immediately available for every step, since it can learn from delayed rewards.

For example, decisions about energy consumption or storage might have long-term consequences. RL can be used to optimize long-term energy efficiency and cost. With appropriate architectures, RL agents can also generalize their learned strategies across similar but not identical tasks.

What are the use cases of reinforcement learning?

Reinforcement learning (RL) can be applied to a wide range of real-world use cases. We give some examples next.

Marketing personalization

In applications like recommendation systems, RL can customize suggestions to individual users based on their interactions. This leads to more personalized experiences. For example, an application may display ads to a user based on some demographic information. With each ad interaction, the application learns which ads to display to the user to optimize product sales.

Optimization challenges

Traditional optimization methods solve problems by evaluating and comparing possible solutions based on certain criteria. In contrast, RL introduces learning from interactions to find the best or close-to-best solutions over time.

For example, a cloud spend optimizing system uses RL to adjust to fluctuating resource needs and choose optimal instance types, quantities, and configurations. It makes decisions based on factors like current and available cloud infrastructure, spending, and utilization.

Financial predictions

The dynamics of financial markets are complex, with statistical properties that change over time. RL algorithms can optimize long-term returns by considering transaction costs and adapting to market shifts.

For instance, an algorithm could observe the rules and patterns of the stock market before it tests actions and records associated rewards. It dynamically creates a value function and develops a strategy to maximize profits.

How does reinforcement learning work?

The learning process of reinforcement learning (RL) algorithms is similar to animal and human reinforcement learning in the field of behavioral psychology. For instance, a child may discover that they receive parental praise when they help a sibling or clean but receive negative reactions when they throw toys or yell. Soon, the child learns which combination of activities results in the end reward.

An RL algorithm mimics a similar learning process. It tries different activities to learn the associated negative and positive values to achieve the end reward outcome.

Key concepts

In reinforcement learning, there are a few key concepts to familiarize yourself with:

  • The agent is the ML algorithm (or the autonomous system)
  • The environment is the adaptive problem space with attributes such as variables, boundary values, rules, and valid actions
  • The action is a step that the RL agent takes to navigate the environment
  • The state is the environment at a given point in time
  • The reward is the positive, negative, or zero value—in other words, the reward or punishment—for taking an action
  • The cumulative reward is the sum of all rewards or the end value

Algorithm basics

Reinforcement learning is based on the Markov decision process, a mathematical modeling of decision-making that uses discrete time steps. At every step, the agent takes a new action that results in a new environment state. Similarly, the current state is attributed to the sequence of previous actions.

Through trial and error in moving through the environment, the agent builds a set of if-then rules or policies. The policies help it decide which action to take next for optimal cumulative reward. The agent must also choose between further environment exploration to learn new state-action rewards or select known high-reward actions from a given state. This is called the exploration-exploitation trade-off.

What are the types of reinforcement learning algorithms?

There are various algorithms used in reinforcement learning (RL)—such as Q-learning, policy gradient methods, Monte Carlo methods, and temporal difference learning. Deep RL is the application of deep neural networks to reinforcement learning. One example of a deep RL algorithm is Trust Region Policy Optimization (TRPO).

All these algorithms can be grouped into two broad categories.

Model-based RL

Model-based RL is typically used when environments are well-defined and unchanging and where real-world environment testing is difficult.

The agent first builds an internal representation (model) of the environment. It uses this process to build this model:

  1. It takes actions within the environment and notes the new state and reward value
  2. It associates the action-state transition with the reward value.

Once the model is complete, the agent simulates action sequences based on the probability of optimal cumulative rewards. It then further assigns values to the action sequences themselves. The agent thus develops different strategies within the environment to achieve the desired end goal. 


Consider a robot learning to navigate a new building to reach a specific room. Initially, the robot explores freely and builds an internal model (or map) of the building. For instance, it might learn that it encounters an elevator after moving forward 10 meters from the main entrance. Once it builds the map, it can build a series of shortest-path sequences between different locations it visits frequently in the building.

Model-free RL 

Model-free RL is best to use when the environment is large, complex, and not easily describable. It’s also ideal when the environment is unknown and changing, and environment-based testing does not come with significant downsides.

The agent doesn’t build an internal model of the environment and its dynamics. Instead, it uses a trial-and-error approach within the environment. It scores and notes state-action pairs—and sequences of state-action pairs—to develop a policy. 


Consider a self-driving car that needs to navigate city traffic. Roads, traffic patterns, pedestrian behavior, and countless other factors can make the environment highly dynamic and complex. AI teams train the vehicle in a simulated environment in the initial stages. The vehicle takes actions based on its current state and receives rewards or penalties.

Over time, by driving millions of miles in different virtual scenarios, the vehicle learns which actions are best for each state without explicitly modeling the entire traffic dynamics. When introduced in the real world, the vehicle uses the learned policy but continues to refine it with new data.

What is the difference between reinforced, supervised, and unsupervised machine learning?

While supervised learning, unsupervised learning, and reinforcement learning (RL) are all ML algorithms in the field of AI, there are distinctions between the three.

Read about supervised and unsupervised learning »

Reinforcement learning vs. supervised learning

In supervised learning, you define both the input and the expected associated output. For instance, you can provide a set of images labeled dogs or cats, and the algorithm is then expected to identify a new animal image as a dog or cat.

Supervised learning algorithms learn patterns and relationships between the input and output pairs. Then, they predict outcomes based on new input data. It requires a supervisor, typically a human, to label each data record in a training data set with an output. 

In contrast, RL has a well-defined end goal in the form of a desired result but no supervisor to label associated data in advance. During training, instead of trying to map inputs with known outputs, it maps inputs with possible outcomes. By rewarding desired behaviors, you give weightage to the best outcomes. 

Reinforcement learning vs. unsupervised learning 

Unsupervised learning algorithms receive inputs with no specified outputs during the training process. They find hidden patterns and relationships within the data using statistical means. For instance, you could provide a set of documents, and the algorithm may group them into categories it identifies based on the words in the text. You do not get any specific outcomes; they fall within a range. 

Conversely, RL has a predetermined end goal. While it takes an exploratory approach, the explorations are continuously validated and improved to increase the probability of reaching the end goal. It can teach itself to reach very specific outcomes.

What are the challenges with reinforcement learning?

While reinforcement learning (RL) applications can potentially change the world, it may not be easy to deploy these algorithms. 


Experimenting with real-world reward and punishment systems may not be practical. For instance, testing a drone in the real world without testing in a simulator first would lead to significant numbers of broken aircraft. Real-world environments change often, significantly, and with limited warning. It can make it harder for the algorithm to be effective in practice.


Like any field of science, data science also looks at conclusive research and findings to establish standards and procedures. Data scientists prefer knowing how a specific conclusion was reached for provability and replication.

With complex RL algorithms, the reasons why a particular sequence of steps was taken may be difficult to ascertain. Which actions in a sequence were the ones that led to the optimal end result? This can be difficult to deduce, which causes implementation challenges.

How can AWS help with reinforcement learning?

Amazon Web Services (AWS) has many offerings that help you develop, train, and deploy reinforcement learning (RL) algorithms for real-world applications.

With Amazon SageMaker, developers and data scientists can quickly and easily develop scalable RL models. Combine a deep learning framework (like TensorFlow or Apache MXNet), an RL toolkit (like RL Coach or RLlib), and an environment to mimic a real-world scenario. You can use it to create and test your model.

With AWS RoboMaker, developers can run, scale, and automate simulation with RL algorithms for robotics without any infrastructure requirements.

Get hands-on experience with AWS DeepRacer, the fully autonomous 1/18th scale race car. It boasts a fully configured cloud environment that you can use to train your RL models and neural network configurations.

Get started with reinforcement learning on AWS by creating an account today.

Next Steps with AWS