Reinforcement Learning – Deep Learning with TensorFlow 2 and Keras – Second Edition


Reinforcement Learning

This chapter introduces reinforcement learning (RL), the least explored and yet most promising learning paradigm. Reinforcement learning is very different from both the supervised and unsupervised learning models we covered in earlier chapters. Starting from a clean slate (that is, with no prior information), the RL agent goes through multiple rounds of trial and error and learns to achieve a goal, with the only input being feedback from the environment. The latest RL research by OpenAI seems to suggest that continuous competition can be a cause for the evolution of intelligence. Many deep learning practitioners believe that RL will play an important role in the big AI dream: Artificial General Intelligence (AGI). This chapter delves into different RL algorithms; the following topics will be covered:

  • What RL is and its lingo
  • How to use the OpenAI Gym interface
  • Deep Q-Networks
  • Policy gradients


What is common between a baby learning to walk, birds learning to fly, or an RL agent learning to play an Atari game? Well, all three involve:

  • Trial and error: The child (or the bird) tries various ways, fails many times, and succeeds in some ways before it can really stand (or fly). The RL Agent plays many games, winning some and losing many, before it can become reliably successful.
  • Goal: The child has the goal to stand, the bird to fly, and the RL agent has the goal to win the game.
  • Interaction with the environment: The only feedback they have is from their environment.

So, the first question that arises is what is RL and how is it different from supervised and unsupervised learning? Anyone who owns a pet knows that the best strategy to train a pet is rewarding it for desirable behavior and punishing it for bad behavior. RL, also called learning with a critic, is a learning paradigm where the agent learns in the same manner. The agent here corresponds to our network (program); it can perform a set of Actions (a), which brings about a change in the State (s) of the environment and, in turn, the agent receives a reward or punishment from the environment.

For example, consider the case of training a dog to fetch the ball: here, the dog is our agent, the voluntary muscle movements that the dog makes are the actions, and the ground (including person and ball) is the environment; the dog perceives our reaction to its action in terms of giving it a bone as a reward. RL can be defined as a computational approach to goal-directed learning and decision making, from interaction with the environment, under some idealized conditions. The Agent can sense the state of the Environment, and the Agent can perform specific well-defined actions on the Environment. This causes two things: first, a change in the state of the environment, and second, a reward is generated (under ideal conditions). This cycle continues, and in theory the agent learns how to more frequently generate a reward over time:

Unlike supervised learning, the Agent is not presented with any training examples; it does not know what the correct action is.

And unlike unsupervised learning, the agent's goal is not to find some inherent structure in the input (the learning may find some structure, but that isn't the goal); instead, its only goal is to maximize the rewards (in the long run) and reduce the punishments.
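The perception-action-reward cycle described above can be sketched with a toy, hypothetical environment (this is an illustrative stand-in, not a Gym environment):

```python
import random

class ToyEnvironment:
    """A hypothetical 1-D world: the agent starts at position 0 and
    receives a reward of +1 when it reaches position 3 (the goal)."""
    def __init__(self):
        self.state = 0

    def step(self, action):
        # action is -1 (move left) or +1 (move right); the agent cannot go below 0
        self.state = max(0, self.state + action)
        reward = 1 if self.state == 3 else 0
        done = self.state == 3
        return self.state, reward, done

random.seed(42)
env = ToyEnvironment()
total_reward, done = 0, False
while not done:                      # the agent-environment interaction loop
    action = random.choice([-1, 1])  # an untrained agent acts randomly
    state, reward, done = env.step(action)
    total_reward += reward

print(total_reward)  # 1: the episode ends with the single goal reward
```

The agent senses the state, acts, and receives a reward; no labeled examples are ever shown to it.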

RL lingo

Before learning various RL algorithms, it is important we understand a few important terms. We will illustrate the terms with the help of two examples, first a robot in a maze, and second an agent controlling the wheels of a self-driving car (SDC). The two RL agents are shown as follows:

  • State, S: The state is the set of tokens (or representations) that can define all of the possible states the environment can be in. It can be continuous or discrete. In the case of the robot finding its path through a maze, the state can be represented by a 4 × 4 array, with elements telling whether a block is empty, occupied, or blocked. A block with a value of 1 means it is occupied by the robot, 0 means it is empty, and X represents that the block is impassable. Each element in this array S can have one of three discrete values, so the state is discrete in nature. Next, consider the agent controlling the steering wheel of a self-driving car. The agent takes the front-view image as input. The image contains continuous-valued pixels, so here the state is continuous.
  • Action, A(S): Actions are the set of all possible things that the agent can do in a particular state. The set of possible actions, A, depends on the present state, S. Actions may or may not result in a change of state. Like states, they can be discrete or continuous. The robot finding a path in the maze can perform five discrete actions [up, down, left, right, no change]. The SDC agent, on the other hand, can rotate the steering wheel in a continuous range of angles.
  • Reward R(S, A, S'): Rewards are a scalar value returned by the environment based on the agent's action(s); here S is the present state and S' the state of the environment after action A is taken. It is determined by the goal; the agent gets a higher reward if the action brings it nearer the goal, and a low (or even negative) reward otherwise. How we define a reward is totally up to us; in the case of the maze, we can define the reward as the negative of the Euclidean distance between the agent's current position and the goal, so that it increases as the agent nears the goal. The SDC agent's reward can be that the car is on the road (positive reward) or off the road (negative reward).
  • Policy π(S): The policy defines a mapping between each state and the action to take in that state. The policy can be deterministic, that is, for each state there is a well-defined action. In the case of the maze robot, a policy can be: if the top block is empty, move up. The policy can also be stochastic, that is, one where each action is taken with some probability. It can be implemented as a simple look-up table, or as a function of the present state. The policy is the core of the RL agent. In this chapter, we'll learn about the different algorithms that help the agent learn the policy.
  • Return Gt: This is the discounted sum of all future rewards starting from the current time, mathematically defined as:

    Gt = Rt + γRt+1 + γ²Rt+2 + … = Σk=0..∞ γᵏ Rt+k

    Here Rt is the reward at time t and γ is the discount factor; its value lies between (0,1). The discount factor determines how important future rewards are in deciding the policy. If it is near zero, the agent gives importance to immediate rewards. A high value of the discount factor, however, means the agent is looking far into the future. It may forgo immediate reward in favor of high future rewards, just as in chess you may sacrifice a pawn to checkmate the opponent.

  • Value function V(S): This defines the "goodness" of a state in the long run. It can be thought of as the total amount of reward the agent can expect to accumulate over time, starting from the state, S. You can think of it as a long-term good, as opposed to an immediate, but short-lived good. What do you think is more important, maximizing immediate reward or value function? You probably guessed right: just as in chess, we sometimes lose a pawn to win the game a few steps later, and so the agent should try to maximize value function.

    Normally, the value is defined either as the State-Value function Vπ(S) or the Action-Value function Qπ(S, A), where π is the policy followed. The state-value function is the expected return from the state S after following policy π:

    Vπ(s) = Eπ[Gt | St = s]

    Here E is the expectation, and St = s is the state at time t. The action-value function is the expected return from the state S, taking an action A = a and thereafter following the policy π:

    Qπ(s, a) = Eπ[Gt | St = s, At = a]

  • Model of the environment: This is an optional element. It mimics the behavior of the environment and contains the physics of the environment; in other words, it tells us how the environment will behave. The model of the environment is defined by the transition probabilities to the next state. Since it is optional, we can also have model-free reinforcement learning, where the transition probabilities are not needed to define the RL process.
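The return Gt introduced above can be computed by folding the reward list from the back; here is a minimal sketch:

```python
def discounted_return(rewards, gamma):
    """Compute Gt = Rt + gamma*Rt+1 + gamma^2*Rt+2 + ... for t = 0,
    by folding the reward list from the back."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

rewards = [0, 0, 0, 10]  # a single large reward, three steps in the future

print(discounted_return(rewards, 0.0))  # 0.0  - a myopic agent ignores it
print(discounted_return(rewards, 1.0))  # 10.0 - a far-sighted agent values it fully
print(discounted_return(rewards, 0.9))  # 10 * 0.9**3 = 7.29 (up to float rounding)
```

As the discount factor grows from 0 toward 1, the delayed reward contributes more to the return, which is exactly the pawn-for-checkmate tradeoff.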

In RL we assume that the state of the environment follows the Markov property, that is, each state depends solely on the preceding state, the action taken from the action space, and the corresponding reward. That is, if St+1 is the state of the environment at time t+1, then it is a function of the state St at time t, the action At taken at time t, and the corresponding reward Rt received at time t; no prior history is needed. If P(St+1|St) is the transition probability, mathematically the Markov property can be written as:

P(St+1|St) = P(St+1|S1,S2,…,St)

And thus, RL can be assumed to be a Markov Decision Process (MDP).

Deep reinforcement learning algorithms

The basic idea in Deep Reinforcement Learning (DRL) is that we can use a deep neural network to approximate either the policy function or the value function. In this chapter we will be studying some popular DRL algorithms. These algorithms can be classified into two classes, depending upon what they approximate:

  • Value-based methods: In these methods, the algorithms take the action that maximizes the value function. The agent here learns to predict how good a given state or action would be. An example of the value-based method is the Deep Q-Network.
    Consider, for example, our robot in a maze: assuming that the value of each state is the negative of the number of steps needed to reach the goal from that block, then, at each time step, the agent will choose the action that takes it to a state with the optimal value, as in the following diagram. So, starting from a value of -6, it'll move to -5, -4, -3, -2, -1, and eventually reach the goal with the value 0:
  • Policy-based methods: In these methods, the algorithms predict the optimal policy (the one that maximizes the expected return), without maintaining the value function estimates. The aim is to find the optimal policy, instead of optimal action. An example of the policy-based method is policy-gradients. Here, we approximate the policy function, which allows us to map each state to the best corresponding action. One advantage of policy-based methods over value-based is that we can use them even for continuous action spaces.
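The value-based walk through the maze example above can be sketched for a 1-D corridor (the state values are the hypothetical -6 to 0 sequence from the example):

```python
# Hypothetical state values for a 1-D corridor: the goal (state 6) has value 0,
# and every other state's value is the negative number of steps to the goal.
values = {0: -6, 1: -5, 2: -4, 3: -3, 4: -2, 5: -1, 6: 0}

def greedy_action(state):
    """Value-based control: move to the neighbouring state with the highest value."""
    candidates = [s for s in (state - 1, state + 1) if s in values]
    return max(candidates, key=lambda s: values[s])

# Starting at value -6, the agent walks -5, -4, ... down to the goal:
state, path = 0, []
while values[state] != 0:
    state = greedy_action(state)
    path.append(values[state])

print(path)  # [-5, -4, -3, -2, -1, 0]
```

The agent never stores a policy explicitly; the greedy step over the value estimates induces one.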

Besides the algorithms approximating either policy or value, there are a few questions we need to answer to make reinforcement learning work:

  • How does the agent choose its actions, especially when untrained?

    When the agent starts learning, it has no idea of the best way to determine an action, or which action will provide the best Q-value. So how do we go about it? We take a leaf out of nature's book. Like bees and ants, the agent strikes a balance between exploring new actions and exploiting the learned ones. Initially, when the agent starts, it has no idea which of the possible actions is better, so it makes random choices, but as it learns, it starts making use of the learned policy. This is called the Exploration vs Exploitation [2] tradeoff. Using exploration, the agent gathers more information, and later exploits the gathered information to make the best decisions.

  • The next question that arises is: how does the agent maintain a balance between exploration and exploitation? There are various strategies; one of the most widely employed is the Epsilon-Greedy (ε-greedy) policy. Here, depending upon the value of ε, at each step the agent selects a random action with probability ε, and with probability 1−ε selects the action that maximizes the value function. Normally, the value of ε decreases asymptotically toward zero. In Python the policy can be implemented as:
      if np.random.rand() <= epsilon:
            a = random.randrange(action_size)
      else:
            a = np.argmax(model.predict(s))

    where model is the deep neural network approximating the value/policy function, a is the action chosen from the action space of size action_size, and s is the state. Another way to perform exploration is to use noise; researchers have experimented with both Gaussian and Ornstein-Uhlenbeck noise with success.
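A minimal, self-contained sketch of an ε-greedy choice combined with the decaying ε schedule described above (the decay constants are illustrative assumptions):

```python
import numpy as np

def epsilon_greedy(q_values, epsilon):
    """With probability epsilon pick a random action, otherwise the greedy one."""
    if np.random.rand() < epsilon:
        return int(np.random.randint(len(q_values)))
    return int(np.argmax(q_values))

# Multiplicative decay (constants are illustrative): explore a lot early,
# exploit more and more as training proceeds.
epsilon, epsilon_min, epsilon_decay = 1.0, 0.01, 0.995
history = []
for step in range(1000):
    history.append(epsilon)
    epsilon = max(epsilon_min, epsilon * epsilon_decay)

print(history[0], history[-1])  # 1.0 at the start, clamped at epsilon_min later
```

Clamping at epsilon_min keeps a small amount of exploration alive even late in training.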

  • How to deal with the highly correlated input state space?

    The input to our RL model is the present state of the environment. Each action results in some change in the environment; however, the correlation between two consecutive states is very high. If we make our network learn from sequential states, this high correlation between consecutive inputs results in what is known in the literature as catastrophic forgetting. To mitigate it we can use the Experience Replay [3] method, an idea that dates back to Long-Ji Lin's work in the early 1990s and has been explored more recently, for example, by David Isele and Akansel Cosgun in 2018.

    In simplest terms, the learning algorithm first stores the MDP tuple: state, action, reward, and next state <S, A, R, S'> in a buffer/memory. Once a significant amount of memory is built, a batch is selected randomly to train the agent. The memory is continuously refreshed with new additions, and old deletions. The use of experience replay provides three-fold benefits:

    • First, it allows the same experience to be potentially used in many weight updates, hence increases data efficiency.
    • Second, the random selection of batches of experience removes the correlations between consecutive states presented to the network for training.
    • Third, it stops any unwanted feedback loops that may arise and cause the network to get stuck in local minima or diverge.

    A modified version of experience replay is the Prioritized Experience Replay (PER). Introduced in 2015 by Tom Schaul et al. [4], it derives from the idea that not all experiences (or, you might say, attempts) are equally important. Some attempts are better lessons than others. Thus, instead of selecting the experiences randomly, it will be much more efficient to assign higher priority to more educational experiences in selection for training. In the Schaul paper it was proposed that experiences in which the difference between the prediction and target is high should be given priority, as the agent could learn a lot in these cases.
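A minimal (uniform, not prioritized) replay buffer along the lines described above can be sketched with a deque; the capacity and batch size here are illustrative:

```python
import random
from collections import deque

class ReplayBuffer:
    """Minimal experience replay: store <S, A, R, S'> tuples and sample
    random minibatches, breaking the correlation between consecutive states."""
    def __init__(self, capacity=10000):
        self.buffer = deque(maxlen=capacity)  # old experiences drop off automatically

    def remember(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        return random.sample(self.buffer, min(len(self.buffer), batch_size))

buffer = ReplayBuffer(capacity=100)
for t in range(150):                   # overfill: only the last 100 transitions survive
    buffer.remember(t, 0, 0.0, t + 1, False)

batch = buffer.sample(32)
print(len(buffer.buffer), len(batch))  # 100 32
```

Prioritized experience replay would replace `random.sample` with sampling weighted by the prediction error of each stored transition.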

  • How to deal with the problem of moving targets?

    Unlike supervised learning, in RL the target is not known in advance. With a moving target, the agent tries to maximize the expected return, but the maximum value keeps changing as the agent learns. In essence, this is like trying to catch a butterfly: each time you approach it, it moves to a new location. The major reason for having a moving target is that the same network is used to estimate both the action values and the target values, and this can cause oscillations in learning.

    A solution to this was proposed by the DeepMind team in their 2015 paper titled Human-level Control through Deep Reinforcement Learning, published in Nature. The solution is that instead of a moving target, the agent has short-term fixed targets. The agent maintains two networks with exactly the same architecture: one called the local network, used at each step to estimate the present action, and one called the target network, used to get the target value. However, the networks have their own sets of weights. At each time step, the local network updates its weights so that its estimate moves closer to the target. After some number of time steps, the target network weights are updated. The update can be a hard update, where the weights of the local network are copied completely to the target network after N time steps, or it can be a soft update, in which the target network slowly (by a factor τ, with 0 < τ ≪ 1) moves its weights toward the local network.
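The hard and soft target updates can be sketched with plain NumPy arrays standing in for network weights (with Keras you would move the same arithmetic through model.get_weights() and model.set_weights(); the shapes here are illustrative):

```python
import numpy as np

# Stand-ins for network weights; with Keras you would pull these from
# model.get_weights() and push them back with model.set_weights().
local_weights = [np.ones((2, 2)), np.ones(2)]
target_weights = [np.zeros((2, 2)), np.zeros(2)]

def hard_update(local):
    """Hard update: copy the local network's weights verbatim."""
    return [w.copy() for w in local]

def soft_update(local, target, tau=0.1):
    """Soft update: move each target weight a fraction tau toward the local one."""
    return [tau * lw + (1.0 - tau) * tw for lw, tw in zip(local, target)]

target_weights = soft_update(local_weights, target_weights, tau=0.1)
print(target_weights[0][0, 0])  # 0.1 - the target edged toward the local network

target_weights = hard_update(local_weights)
print(target_weights[0][0, 0])  # 1.0 - exact copy after the hard update
```

A soft update with a small τ keeps the targets almost fixed between steps, which is what stabilizes learning.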

Reinforcement success in recent years

In the last few years, DRL has been successfully used in a variety of tasks, especially in game playing and robotics. Let us acquaint ourselves with some success stories of RL before learning its algorithms:

  • AlphaGo Zero: Developed by Google's DeepMind team and described in the paper Mastering the Game of Go without Human Knowledge, AlphaGo Zero starts from an absolutely blank slate (tabula rasa). It uses one neural network to approximate both the move probabilities and the value.

    This neural network takes as an input the raw board representation. It uses a Monte Carlo tree search guided by the neural network to select the moves. The reinforcement learning algorithm incorporates look-ahead search inside the training loop. It was trained for 40 days using a 40-block residual CNN and, over the course of training, it played about 29 million games (a big number!). The neural network was optimized on Google Cloud using TensorFlow, with 64 GPU workers and 19 CPU parameter servers. You can access the paper here:

  • AI controlled sailplanes: Microsoft developed a controller system that can run on many different autopilot hardware platforms such as Pixhawk and Raspberry Pi 3. It can keep the sailplane in the air without using a motor, by autonomously finding and catching rides on naturally occurring thermals. The controller helps the sailplane to operate on its own by detecting and using these thermals to travel without the aid of a motor or a person. They implemented it as a partially observable Markov decision process. They employed the Bayesian reinforcement learning and used the Monte Carlo tree search to search for the best action. They've divided the whole system into level planners—a high-level planner that makes a decision based on experience and a low-level planner that uses Bayesian reinforcement learning to detect and latch onto thermals in real time. You can see the sailplane in action at Microsoft News:
  • Locomotion behavior: In the paper Emergence of Locomotion Behaviours in Rich Environments, DeepMind researchers provided the agents with rich and diverse environments. The environments presented a spectrum of challenges at different levels of difficulty. The agent was presented with difficulties in increasing order; this led the agent to learn sophisticated locomotion skills without any reward engineering (that is, designing special reward functions).

It is really amazing to see how a DRL agent, without any prior knowledge of the game, learns to play, and even beat, humans in many specialized tasks. In the coming sections we will explore these fabulous DRL algorithms and see them play games with almost human efficiency within a few thousand epochs.

Introduction to OpenAI Gym

As mentioned earlier, trial and error is an important component of any RL algorithm. Therefore, it makes sense to first train our RL agent in a simulated environment.

Today there exists a large number of platforms that can be used for the creation of an environment. Some popular ones are:

  • OpenAI Gym: It contains a collection of environments that we can use to train our RL agents. In this chapter, we'll be using the OpenAI Gym interface.
  • Unity ML-Agents SDK: It allows developers to transform games and simulations created using the Unity editor into environments where intelligent agents can be trained using DRL, evolutionary strategies, or other machine learning methods through a simple-to-use Python API. It works with TensorFlow and provides the ability to train intelligent agents for 2D/3D and VR/AR games. You can learn more about it here:
  • Gazebo: In Gazebo, we can build three-dimensional worlds with physics-based simulation. Gazebo, along with the Robot Operating System (ROS) and the OpenAI Gym interface, forms gym-gazebo, which can be used to train RL agents. To know more about this, you can refer to the white paper:
  • Blender learning environment: This is a Python interface for the Blender game engine that also works over OpenAI Gym. At its base is Blender: free 3D modeling software with an integrated game engine, which provides an easy-to-use, powerful set of tools for creating games. The games themselves are designed in Blender, and we can then create a custom virtual environment to train an RL agent on a specific problem.
  • Malmö: Built by the Microsoft Team, Malmö is a platform for AI experimentation and research built on top of Minecraft. It provides a simple API for creating tasks and missions. You can learn more about Project Malmo here:

We will be using OpenAI Gym to provide an environment for our agent. OpenAI Gym is an open source toolkit to develop and compare RL algorithms. It contains a variety of simulated environments that can be used to train agents and develop new RL algorithms.

The first thing to do is install OpenAI Gym; the following command will install the minimal gym package:

pip install gym

If you want to install all (free) gym modules, add the [all] extra:

pip install gym[all]

The MuJoCo environment requires a paid license.

OpenAI Gym provides a variety of environments, from simple text-based to three-dimensional games. The environments supported can be grouped as follows:

  • Algorithms: Contains environments that involve performing computations such as addition. While we can easily perform the computations on a computer, what makes these problems interesting as an RL problem is that the agent learns these tasks purely by example.
  • Atari: This environment provides a wide variety of classic Atari/arcade games.
  • Box2D: Contains robotics tasks in two-dimensions such as a car racing agent or bipedal robot walk.
  • Classic control: This contains the classical control theory problems, such as balancing a cart pole.
  • MuJoCo: This is proprietary (you can get a one-month free trial). It supports various robot simulation tasks. The environment includes a physics engine, hence, it's used for training robotic tasks.
  • Robotics: This environment also uses the physics engine of MuJoCo. It simulates goal-based tasks for fetch and shadow-hand robots.
  • Toy text: A simple text-based environment—very good for beginners.

You can get a complete list of environments from the gym website. To see the list of all available environments in your installation, you can use the following code:

   from gym import envs
   print(envs.registry.all())

This will output a list of all the installed environments along with their environment IDs. The core interface provided by OpenAI Gym is the Unified Environment Interface. The agent can interact with the environment using three basic methods: reset, step, and render. The reset method resets the environment and returns the observation. The step method steps the environment by one timestep and returns observation, reward, done, and info. The render method renders one frame of the environment, for example by popping up a window. After importing the gym module, we can create any environment from the list of installed environments using the make command. Next, we create the "Breakout-v0" environment:

    import gym
    env_name = 'Breakout-v0'
    env = gym.make(env_name)

Let's get an observation of the environment once it is reset:

    obs = env.reset()

You can see the Breakout environment in the following screenshot; the render function pops up the environment window:

The Breakout environment

Alternatively, you can use Matplotlib inline and change the render command to plt.imshow(env.render(mode='rgb_array')). This will show the environment inline in the Jupyter Notebook.

You can learn more about the environment state space and its action space using env.observation_space and env.action_space. For our Breakout game we find that the state consists of a three-channel image of size 210 × 160, and the action space is discrete with four possible actions. Once you are done, do not forget to close the environment using:

    env.close()

Random agent playing Breakout

Let's have some fun and play the Breakout game. When I first played the game, I had no idea of the rules, or how to play, so I randomly chose the control buttons. Our novice agent will do the same; it will choose the actions randomly from the action space. Gym provides a function sample() which chooses a random action from the action space – we will be using this function. Also, we can save a replay of the game, to view it later. There are two ways to save the play, one using Matplotlib and another using OpenAI Gym Monitor wrapper. Let us first see the Matplotlib method.

We will first import the necessary modules; we will only need gym and matplotlib for now, as the agent will be playing random moves:

import gym
import matplotlib.pyplot as plt
import matplotlib.animation as animation

We create the Gym environment:

env_name = 'Breakout-v0'
env = gym.make(env_name)

Next we will run the game, one step at a time, choosing a random action, either for 300 steps or until the game is finished (whichever is earlier). The environment state (observation) space is saved at each step in the list frames:

frames = [] # array to store state space at each step
done = False
obs = env.reset()
for _ in range(300): 
    frames.append(obs)
    obs, reward, done, _ = env.step(env.action_space.sample())
    if done:
        break
Now comes the part of combining all the frames as a gif image using Matplotlib Animation. We create an image object, patch, and then define a function that sets image data to a particular frame index. The function is used by the Matplotlib Animation class to create an animation, which we finally save in the file random_agent.gif:

patch = plt.imshow(frames[0])
plt.axis('off')

def animate(i):
    patch.set_data(frames[i])

anim = animation.FuncAnimation(plt.gcf(), animate, \
    frames=len(frames), interval=10)
anim.save('random_agent.gif', writer='imagemagick')

Normally, an RL agent requires many steps for proper training, and as a result it is not feasible to store the state space at each step. Instead, we can choose to store every 500th step (or any other number you wish) in the preceding algorithm. OpenAI Gym provides the Wrapper class to save the game as a video. To do so, we need to first import wrappers, then create the environment, and finally use Monitor.

By default, it will store the video of episodes 0, 1, 8, 27, 64 (episode numbers that are perfect cubes), and then every 1,000th episode; each training run, by default, is saved in one folder. The code to do this is:

import gym
env = gym.make("Breakout-v0")
env = gym.wrappers.Monitor(env, 'recording', force=True)
observation = env.reset()
for _ in range(1000):
    action = env.action_space.sample()  # your agent here (this takes random actions)
    observation, reward, done, info = env.step(action)
    if done:
        observation = env.reset()
env.close()

For Monitor to work, FFmpeg support is required; depending upon your OS, you may need to install it in case it is missing.

This will save the videos in mp4 format in the recording folder. An important thing to note here is that you have to set the force=True option if you want to use the same folder for the next training session.

Deep Q-Networks

Deep Q-networks, DQNs for short, are deep learning neural networks designed to approximate the Q-function (the action-value function); DQN is one of the most popular value-based reinforcement learning algorithms. The model was proposed by Google's DeepMind in NIPS 2013, in the paper entitled Playing Atari with Deep Reinforcement Learning. The most important contribution of this paper was that they used the raw state space directly as input to the network; the input features were not hand-crafted as in earlier RL implementations. Also, they could train agents with exactly the same architecture to play different Atari games and obtain state-of-the-art results.

This model is an extension of the simple Q-learning algorithm. In Q-learning algorithms, a Q-table is maintained as a cheat sheet. After each action the Q-table is updated using the Bellman equation [5]:

Q(St, At) ← (1 − α) Q(St, At) + α [Rt+1 + γ maxa Q(St+1, a)]

Here α is the learning rate, and its value lies in the range [0,1]. The first term retains a component of the old Q value, and the second term moves it toward the target Q value. Q-learning is good if the number of states and the number of possible actions are small, but for large state spaces and action spaces, Q-learning is simply not scalable. A better alternative is to use a deep neural network as a function approximator, approximating the target Q-function for each possible action. The weights of the deep neural network in this case store the Q-table information. There is a separate output unit for each possible action. The network takes the state as its input and returns the predicted target Q value for all possible actions. The question arises: how do we train this network, and what should the loss function be? Well, since our network has to predict the target Q value:

Qtarget = Rt+1 + γ maxa Q(St+1, a)

the loss function should try to reduce the difference between the predicted Q value, Qpredicted, and the target Q value, Qtarget. We can do this by defining the loss function as:

Loss = E[(Qtarget − Qpredicted(S, A; W))²]

where W represents the training parameters of our deep Q-network, learned using gradient descent so that the loss function is minimized. The following is the general architecture of a DQN. The network takes the n-dimensional state as input and outputs the Q value of each possible action in the m-dimensional action space. Each layer (including the input) can be a convolutional layer (if we are taking raw pixels as input, convolutional layers make more sense) or a dense layer:
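The tabular Q-learning update that DQN generalizes can be sketched as follows (a toy example with a hypothetical 5-state, 2-action table; the learning rate and discount factor are illustrative):

```python
import numpy as np

n_states, n_actions = 5, 2
Q = np.zeros((n_states, n_actions))  # the Q-table "cheat sheet"
alpha, gamma = 0.1, 0.9              # learning rate and discount factor

def q_update(s, a, r, s_next):
    """One Bellman update: move Q(s, a) toward r + gamma * max_a' Q(s', a')."""
    target = r + gamma * np.max(Q[s_next])
    Q[s, a] = (1 - alpha) * Q[s, a] + alpha * target

# A single transition: from state 0, action 1 earns reward 1.0 and lands in state 1.
q_update(0, 1, 1.0, 1)
print(Q[0, 1])  # 0.1 * (1.0 + 0.9 * 0.0) = 0.1
```

A DQN replaces the table `Q` with a neural network that maps a state to one Q value per action, but the update target is exactly the same quantity.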

In the next section, we will try training a DQN. Our agent's task will be to stabilize a pole on a cart; the agent can move the cart left or right to maintain balance.

DQN for CartPole

CartPole is a classic OpenAI problem with a continuous state space and a discrete action space. In it, a pole is attached by an un-actuated joint to a cart, which moves along a frictionless track. The goal is to keep the pole standing on the cart by moving the cart left or right. A reward of +1 is given for each time step the pole remains standing. Once the pole is more than 12 degrees from the vertical, or the cart moves more than 2.4 units from the center, the game is over:

The code here is adapted from the best entry at OpenAI for the CartPole environment:

We start by importing the necessary modules. We require gym, obviously, to provide us with the CartPole environment, and tensorflow to build our DQN network. Besides these, we need the random and numpy modules:

import random
import gym
import math
import numpy as np
from collections import deque
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.optimizers import Adam

We set up the global values for the maximum episodes for which we will be training the agent (EPOCHS), the threshold value when we consider the environment solved (THRESHOLD) and a bool to indicate if we want to record the training or not (MONITOR). Please note that as per the official OpenAI documentation the CartPole environment is considered solved when the agent is able to maintain the pole in the vertical position for 195 time steps (ticks). In the following code for the sake of time we have reduced the THRESHOLD to 45:

EPOCHS = 1000
THRESHOLD = 45
MONITOR = True

Now let us build our DQN. We declare a class DQN and in its __init__() function declare all the hyperparameters and our model. We are creating the environment also inside the DQN class. As you can see, the class is quite general, and you can use it to train for any Gym environment whose state space information can be encompassed in a 1D array:

class DQN():
    def __init__(self, env_string, batch_size=64):
        self.memory = deque(maxlen=100000)
        self.env = gym.make(env_string)
        self.input_size = self.env.observation_space.shape[0]
        self.action_size = self.env.action_space.n
        self.batch_size = batch_size
        self.gamma = 1.0
        self.epsilon = 1.0
        self.epsilon_min = 0.01
        self.epsilon_decay = 0.995
        alpha = 0.01        # learning rate (value assumed; not shown in the excerpt)
        alpha_decay = 0.01  # learning rate decay (assumed)
        if MONITOR: self.env = gym.wrappers.Monitor(self.env, '../data/'+env_string, force=True)
        # Init model
        self.model = Sequential()
        self.model.add(Dense(24, input_dim=self.input_size, activation='tanh'))
        self.model.add(Dense(48, activation='tanh'))
        self.model.add(Dense(self.action_size, activation='linear'))
        self.model.compile(loss='mse', optimizer=Adam(lr=alpha, decay=alpha_decay))

The DQN that we have built is a three-layered perceptron; in the following output you can see the model summary. We use Adam optimizer with learning rate decay:

Figure 1: Summary of the DQN model

The variable list self.memory will contain our experience replay buffer. We need to add a method for saving the <S,A,R,S'> tuple into the memory and a method to get random samples from it in batches to train the agent. We perform these two functions by defining the class methods remember and replay:

def remember(self, state, action, reward, next_state, done):
        self.memory.append((state, action, reward, next_state, done))

def replay(self, batch_size):
        x_batch, y_batch = [], []
        minibatch = random.sample(self.memory, min(len(self.memory), batch_size))
        for state, action, reward, next_state, done in minibatch:
            y_target = self.model.predict(state)
            y_target[0][action] = reward if done else reward + self.gamma * np.max(self.model.predict(next_state)[0])
            x_batch.append(state[0])
            y_batch.append(y_target[0])
        self.model.fit(np.array(x_batch), np.array(y_batch), batch_size=len(x_batch), verbose=0)

Our agent will use the Epsilon Greedy policy when choosing the action. This is implemented in the following method:

def choose_action(self, state, epsilon):
        if np.random.random() <= epsilon:
            return self.env.action_space.sample()
        else:
            return np.argmax(self.model.predict(state))

Next, we write a method to train the agent. We define two lists to keep track of the scores. First, we fill the experience replay buffer and then we choose some samples from it to train the agent and hope that the agent will slowly learn to do better:

def train(self):
    scores = deque(maxlen=100)
    avg_scores = []
    for e in range(EPOCHS):
        state = self.env.reset()
        state = self.preprocess_state(state)
        done = False
        i = 0
        while not done:
            action = self.choose_action(state, self.epsilon)
            next_state, reward, done, _ = self.env.step(action)
            next_state = self.preprocess_state(next_state)
            self.remember(state, action, reward, next_state, done)
            state = next_state
            self.epsilon = max(self.epsilon_min, self.epsilon_decay*self.epsilon) # decrease epsilon
            i += 1
        scores.append(i)
        mean_score = np.mean(scores)
        avg_scores.append(mean_score)
        if mean_score >= THRESHOLD and e >= 100:
            print('Ran {} episodes. Solved after {} trials ✓'.format(e, e - 100))
            return avg_scores
        if e % 100 == 0:
            print('[Episode {}] - Mean survival time over last 100 episodes was {} ticks.'.format(e, mean_score))
        self.replay(self.batch_size)
    print('Did not solve after {} episodes :('.format(e))
    return avg_scores

Now that all the necessary methods are defined, we just need one more helper function to reshape the state of the CartPole environment so that the input to the model is in the correct shape. The state of the environment is described by four continuous variables: cart position (in the range [-2.4, 2.4]), cart velocity (-∞, ∞), pole angle (in the range [-41.8°, 41.8°]), and pole velocity at the tip (-∞, ∞):

  def preprocess_state(self, state):
        return np.reshape(state, [1, self.input_size])

Let us now instantiate our agent for the CartPole environment and train it:

env_string = 'CartPole-v0'
agent = DQN(env_string)
scores = agent.train()

In the following screenshot you can see the agent being trained on my system. The agent was able to achieve our set threshold of 45 in 254 episodes:

Figure 2: Agent training for the CartPole environment, achieving the target threshold within 254 episodes

We can plot the average reward as the agent learns using matplotlib:

import matplotlib.pyplot as plt
plt.plot(scores)
plt.show()

Figure 3: Average agent reward plot

Once the training is done, you can close the environment:

agent.env.close()
You can see that, starting with no information about how to balance the pole, the agent using DQN is able to balance the pole for longer and longer (on average) as it learns. Starting from a blank slate, the agent builds the information/knowledge needed to fulfill the required goal. Remarkable!

DQN to play a game of Atari

In the preceding section we used DQN to train for balancing the CartPole. It was a simple problem, and thus we could solve it using a perceptron model. But imagine if the environment state was just the CartPole visual as we humans see it. With raw pixel values as the input state space, our previous DQN will not work. What we need is a convolutional neural network. Next, we build one based on the seminal paper on DQN, Playing Atari with Deep Reinforcement Learning.

Most of the code will be similar to the DQN for CartPole, but there will be significant changes in the DQN network itself, and how we preprocess the state that we obtain from the environment.

First, let us see the change in the way state space is processed. In the following screenshot you can see one of the Atari games, Breakout:

Figure 4: A screenshot of the Atari game, Breakout

Now, if you look at the image, you can see that not all of it contains relevant information: the top part has redundant information about the score, the bottom part has unnecessary blank space, and the image is in color. To reduce the burden on our model, it is best to remove the unnecessary information, so we crop the image, convert it to grayscale, and resize it to a square of size 84 × 84 (as in the paper). Here is the code to preprocess the raw input pixels:

def preprocess_state(self, img):
    img_temp = img[31:195]  # Choose the important area of the image
    img_temp = tf.image.rgb_to_grayscale(img_temp)
    img_temp = tf.image.resize(img_temp, [self.IM_SIZE, self.IM_SIZE],
                               method=tf.image.ResizeMethod.NEAREST_NEIGHBOR)
    img_temp = tf.cast(img_temp, tf.float32)
    return img_temp[:, :, 0]

Another important issue: how can the agent know, from the image at a single time step, whether the ball is going up or down? One way could be to use an LSTM along with the CNN to keep a record of past frames and hence the ball's movement. The paper, however, used a simpler technique: instead of a single state frame, it concatenated the state space of the past four time steps together as one input to the CNN; that is, the network sees four past frames of the environment as its input. The following is the code for combining the present and previous states:

def combine_images(self, img1, img2):
    # img1: the newly preprocessed frame; img2: the previous stacked state
    if len(img2.shape) == 4 and img2.shape[3] == self.m:
        # Drop the oldest frame and append the newest along the channel axis
        im = np.append(img2[0, :, :, 1:], np.expand_dims(img1, 2), axis=2)
        return tf.expand_dims(im, 0)
    else:
        # At the start of an episode, repeat the first frame m times
        im = np.stack([img1] * self.m, axis=2)
        return tf.expand_dims(im, 0)
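The same sliding-window idea can be expressed more directly with a deque that always holds the m most recent frames. The following is an illustrative alternative (stack_frames is our own helper, not part of the chapter's class):

```python
import numpy as np
from collections import deque

def stack_frames(buffer, new_frame, m=4):
    # buffer: deque(maxlen=m) holding the most recent preprocessed frames
    if len(buffer) == 0:
        # Episode start: repeat the first frame m times
        buffer.extend([new_frame] * m)
    else:
        buffer.append(new_frame)  # maxlen=m automatically drops the oldest frame
    # Stack along the channel axis: m frames of (84, 84) -> (84, 84, m)
    return np.stack(buffer, axis=2)

frames = deque(maxlen=4)
state = stack_frames(frames, np.zeros((84, 84)))
print(state.shape)  # (84, 84, 4)
```

Keeping the frames in a deque makes the "drop oldest, append newest" step automatic.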

The model is defined in the __init__ function. We modify the function to have a CNN with an input of (84 × 84 × 4), representing four state frames, each of size 84 × 84:

# Note: this also requires Conv2D and Flatten from tensorflow.keras.layers
def __init__(self, env_string, batch_size=64, IM_SIZE=84, m=4, alpha=0.01, alpha_decay=0.01):
    self.memory = deque(maxlen=100000)
    self.env = gym.make(env_string)
    input_size = self.env.observation_space.shape[0]
    action_size = self.env.action_space.n
    self.batch_size = batch_size
    self.gamma = 1.0
    self.epsilon = 1.0
    self.epsilon_min = 0.01
    self.epsilon_decay = 0.995
    self.IM_SIZE = IM_SIZE
    self.m = m
    if MONITOR: self.env = gym.wrappers.Monitor(self.env, '../data/'+env_string, force=True)
    # Init model
    self.model = Sequential()
    self.model.add(Conv2D(32, 8, (4,4), activation='relu', padding='same', input_shape=(IM_SIZE, IM_SIZE, m)))
    self.model.add(Conv2D(64, 4, (2,2), activation='relu', padding='valid'))
    self.model.add(Conv2D(64, 3, (1,1), activation='relu', padding='valid'))
    self.model.add(Flatten())
    self.model.add(Dense(512, activation='elu'))
    self.model.add(Dense(action_size, activation='linear'))
    self.model.compile(loss='mse', optimizer=Adam(lr=alpha, decay=alpha_decay))

Lastly, we need to make a minor change in the train function: we call the new preprocess function, along with the combine_images function, to ensure that four frames are concatenated:

def train(self):
    scores = deque(maxlen=100)
    avg_scores = []
    for e in range(EPOCHS):
        state = self.env.reset()
        state = self.preprocess_state(state)
        state = self.combine_images(state, state)
        done = False
        i = 0
        while not done:
            action = self.choose_action(state, self.epsilon)
            next_state, reward, done, _ = self.env.step(action)
            next_state = self.preprocess_state(next_state)
            next_state = self.combine_images(next_state, state)
            self.remember(state, action, reward, next_state, done)
            state = next_state
            self.epsilon = max(self.epsilon_min, self.epsilon_decay*self.epsilon) # decrease epsilon
            i += reward
        scores.append(i)
        mean_score = np.mean(scores)
        avg_scores.append(mean_score)
        if mean_score >= THRESHOLD and e >= 100:
            print('Ran {} episodes. Solved after {} trials ✓'.format(e, e - 100))
            return avg_scores
        if e % 100 == 0:
            print('[Episode {}] - Score over last 100 episodes was {}.'.format(e, mean_score))
        self.replay(self.batch_size)
    print('Did not solve after {} episodes :('.format(e))
    return avg_scores

That's all – you can now train the agent to play Breakout. The complete code is available in this chapter's GitHub repository, in the file DQN_Atari.ipynb.

DQN variants

After the unprecedented success of DQN, interest in RL increased and many new RL algorithms came into being. Next, we look at some of the algorithms that use DQN as a base and build upon it.

Double DQN

In DQN, the agent uses the same Q value both to select an action and to evaluate it. This can cause a maximization bias in learning. For example, consider a state S for which all possible actions have true Q values of zero. Our DQN estimates will have some values above and some below zero, and since we choose the action with the maximum Q value and later evaluate the Q value of each action using the same (maximized) estimated value function, we overestimate Q; in other words, our agent is over-optimistic. This can lead to unstable training and a low-quality policy. To deal with this issue, van Hasselt et al. from DeepMind proposed the Double DQN algorithm in their paper Deep Reinforcement Learning with Double Q-Learning. In Double DQN we have two Q-networks with the same architecture but different weights. One of the Q-networks is used to determine the action using the epsilon greedy policy, and the other is used to determine its value (the Q-target).

If you recall, in DQN the Q-target was given by:

Q_target = R_{t+1} + γ max_A Q(S_{t+1}, A; W)

Here the action A was selected using the same DQN network Q(S, A; W), where W represents the training parameters of the network; that is, we write the Q value function along with its training parameters to emphasize the difference between vanilla DQN and Double DQN:

Q_target = R_{t+1} + γ Q(S_{t+1}, argmax_A Q(S_{t+1}, A; W); W)

In Double DQN, the equation for the target changes. Now the DQN Q(S, A; W) is used to determine the action and the DQN Q(S, A; W') is used to calculate the target (notice the different weights), so the preceding equation becomes:

Q_target = R_{t+1} + γ Q(S_{t+1}, argmax_A Q(S_{t+1}, A; W); W')

This simple change reduces the overestimation and helps us train the agent faster and more reliably.
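As a small numpy sketch of this idea (the function and variable names are ours for illustration), the Double DQN target for a batch of transitions uses the online network's Q-values for action selection and the target network's Q-values for evaluation:

```python
import numpy as np

def double_dqn_targets(rewards, dones, q_online_next, q_target_next, gamma=0.99):
    # q_online_next, q_target_next: shape (batch, n_actions),
    # holding Q(S', a; W) and Q(S', a; W') respectively.
    best_actions = np.argmax(q_online_next, axis=1)                   # select with W
    evaluated = q_target_next[np.arange(len(rewards)), best_actions]  # evaluate with W'
    return rewards + gamma * (1.0 - dones) * evaluated

q_online = np.array([[1.0, 2.0], [0.5, 0.1]])
q_target = np.array([[0.8, 1.5], [0.4, 0.2]])
targets = double_dqn_targets(np.array([1.0, 1.0]), np.array([0.0, 1.0]),
                             q_online, q_target, gamma=0.5)
print(targets)  # [1.75 1.  ] -- the second transition is terminal
```

Note that in the first transition the online network prefers action 1, but its value is taken from the target network (1.5), not from the (higher) online estimate.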

Dueling DQN

This architecture was proposed by Wang et al. in their paper Dueling Network Architectures for Deep Reinforcement Learning in 2015. Like DQN and Double DQN it is also a model-free algorithm.

Dueling DQN decouples the Q-function into the value function and the advantage function. The value function, which we discussed earlier, represents the value of the state independent of any action. The advantage function, on the other hand, provides a relative measure of the utility (advantage/goodness) of action A in state S. The Dueling DQN uses convolutional networks in the initial layers to extract features from raw pixels. However, in the later stages it is separated into two different networks, one approximating the value and the other approximating the advantage. This ensures that the network produces separate estimates for the value function and the advantage function:

Q(S, A; θ, α, β) = V(S; θ, β) + A(S, A; θ, α)

Here θ is an array of the training parameters of the shared convolutional network (it is shared by both V and A), while α and β are the training parameters of the Advantage and Value estimator networks, respectively. Later, the two networks are recombined using an aggregating layer to estimate the Q-value.

In the following diagram, you can see the architecture of Dueling DQN:

Figure 5: Visualizing the architecture of a Dueling DQN

You may be wondering: what is the advantage of doing all of this? Why decompose Q if we are just going to put it back together? Well, decoupling the value and advantage functions allows us to know which states are valuable without having to take into account the effect of each action in each state. There are many states that, irrespective of the action taken, are good or bad: for example, having breakfast with your loved ones in a good resort is always a good state, and being admitted to a hospital emergency ward is always a bad state. Thus, separating value and advantage allows one to get a more robust approximation of the value function. Next, you can see a figure from the paper highlighting how, in the Atari game Enduro, the value network learns to pay attention to the road, while the advantage network learns to pay attention only when there are cars immediately in front, so as to avoid collision:

Image source: Dueling Network Architectures for Deep Reinforcement Learning, Wang et al., 2015

The aggregate layer is implemented in a manner that allows one to recover both V and A from the given Q. This is achieved by enforcing that the advantage function estimator has zero advantage at the chosen action:

Q(S, A; θ, α, β) = V(S; θ, β) + (A(S, A; θ, α) − max_{A'} A(S, A'; θ, α))

In the paper, Wang et al. reported that the network is more stable if the max operation is replaced by an average over the actions:

Q(S, A; θ, α, β) = V(S; θ, β) + (A(S, A; θ, α) − (1/|A|) Σ_{A'} A(S, A'; θ, α))

This is because the advantages now only need to change as fast as their mean, instead of having to compensate for any change to the optimal (max) action's advantage.
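A minimal numpy sketch of the aggregating layer (our own illustration, not the paper's code) makes the averaged variant concrete:

```python
import numpy as np

def dueling_aggregate(value, advantages):
    # value: shape (batch, 1); advantages: shape (batch, n_actions)
    # Q(S, A) = V(S) + (A(S, A) - mean_A' A(S, A'))  -- the averaged variant
    return value + (advantages - advantages.mean(axis=1, keepdims=True))

v = np.array([[2.0]])
a = np.array([[1.0, -1.0, 0.0]])
print(dueling_aggregate(v, a))  # [[3. 1. 2.]]
```

Because the mean advantage is subtracted, the average of the resulting Q-values always equals V, so the value and advantage streams remain identifiable.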


Rainbow

Rainbow is the current state-of-the-art DQN variant. Technically, calling it a DQN variant would be wrong: in essence, it is an ensemble of many DQN variants combined into a single algorithm. It modifies the distributional RL [6] loss to a multi-step loss and combines it with Double DQN using greedy action selection. Quoting from the paper:

"The network architecture is a dueling network architecture adapted for use with return distributions. The network has a shared representation fξ(s), which is then fed into a value stream vη with Natoms outputs, and into an advantage stream aψ with Natoms × Nactions outputs, where aiψ(fξ(s), a) will denote the output corresponding to atom i and action a. For each atom zi, the value and advantage streams are aggregated, as in Dueling DQN, and then passed through a softmax layer to obtain the normalised parametric distributions used to estimate the returns' distributions."

Rainbow combines six different RL algorithms:

  • N-step returns
  • Distributional state-action value learning
  • Dueling networks
  • Noisy networks
  • Double DQN
  • Prioritized Experience Replay
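To illustrate just one of these ingredients, an n-step return replaces the one-step bootstrapped target with the discounted sum of the next n rewards plus the bootstrapped value of the state n steps ahead. A small sketch (the names are ours):

```python
def n_step_return(rewards, bootstrap_value, gamma=0.99):
    # rewards: the n rewards R_{t+1}..R_{t+n}
    # bootstrap_value: the estimated value of S_{t+n}
    g = bootstrap_value
    for r in reversed(rewards):
        g = r + gamma * g
    return g

print(n_step_return([1.0, 1.0], 0.0, gamma=0.5))  # 1.5
```

With n = 1 this reduces to the familiar one-step target r + γV(s').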

Deep deterministic policy gradient

DQN and its variants have been very successful in solving problems where the state space is continuous and the action space is discrete. For example, in Atari games, the input space consists of raw pixels, but the actions are discrete: [up, down, left, right, no-op]. How do we solve a problem with a continuous action space? For instance, an RL agent driving a car needs to turn its wheels, and this action has a continuous action space. One way to handle this situation is by discretizing the action space and continuing with DQN or its variants. However, a better solution is to use a policy gradient algorithm. In policy gradient methods, the policy is approximated directly.

A neural network is used to approximate the policy; in the simplest form, the neural network learns a policy for selecting actions that maximize the rewards by adjusting its weights using steepest gradient ascent, hence, the name: policy gradients.

In this section we will focus on the Deep Deterministic Policy Gradient (DDPG) algorithm, another successful RL algorithm by Google's DeepMind in 2015. DDPG is implemented using two networks; one called the actor network and the other called the critic network.

The actor network approximates the optimal policy deterministically; that is, it outputs the most preferred action for any given input state. In essence, the actor is learning the optimal policy. The critic, on the other hand, evaluates the optimal action value function using the actor's most preferred action. Before going further, let us contrast this with the DQN algorithm that we discussed in the preceding section. In the following diagram, you can see the general architecture of DDPG:

The actor network outputs the most preferred action; the critic takes as input both the input state and action taken and evaluates its Q-value. To train the critic network we follow the same procedure as DQN, that is we try to minimize the difference between the estimated Q-value and the target Q-value. The gradient of the Q-value over actions is then propagated back to train the actor network. So, if the critic is good enough, it will force the actor to choose actions with optimal value functions.
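The critic's target has the same shape as the DQN target, except that the maximization over actions is replaced by the target actor's deterministic choice. The following sketch uses placeholder callables (target_actor and target_critic stand in for the target networks; all names are ours for illustration):

```python
import numpy as np

def ddpg_critic_targets(rewards, dones, next_states, target_actor, target_critic, gamma=0.99):
    # y = r + gamma * Q'(s', mu'(s')) for non-terminal transitions
    next_actions = target_actor(next_states)          # mu'(s'): deterministic action choice
    q_next = target_critic(next_states, next_actions) # Q'(s', mu'(s'))
    return rewards + gamma * (1.0 - dones) * q_next

# Toy stand-ins: the "actor" doubles the state, the "critic" sums state and action
actor = lambda s: 2.0 * s
critic = lambda s, a: s + a
y = ddpg_critic_targets(np.array([1.0]), np.array([0.0]), np.array([1.0]),
                        actor, critic, gamma=0.5)
print(y)  # [2.5]
```

The critic is then trained to regress its Q-estimate toward y, exactly as in DQN, while the actor is updated by ascending the gradient of the critic's output with respect to the action.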


Reinforcement learning has seen a lot of progress in recent years; to summarize all of it in a single chapter is not possible. However, in this chapter we focused on recent successful RL algorithms. The chapter started by introducing the important concepts in the RL field, its challenges, and the solutions available for moving forward. Next, we delved into two important RL algorithms: DQN and DDPG. Toward the end of this chapter, the book has now covered the important topics in the field of deep learning. In the next chapter, we will move on to applying what we have learned to production.


  2. Coggan, Melanie. Exploration and Exploitation in Reinforcement Learning. Research supervised by Prof. Doina Precup, CRA-W DMP Project at McGill University (2004).
  3. Lin, Long-Ji. Reinforcement learning for robots using neural networks. No. CMU-CS-93-103. Carnegie-Mellon University Pittsburgh PA School of Computer Science, 1993.
  4. Schaul, Tom, John Quan, Ioannis Antonoglou, and David Silver. Prioritized Experience Replay. arXiv preprint arXiv:1511.05952 (2015).
  5. Chapter 4, Reinforcement Learning, Richard Sutton and Andrew Barto, MIT Press.
  6. Dabney, Will, Mark Rowland, Marc G. Bellemare, and Rémi Munos. Distributional Reinforcement Learning with Quantile Regression. In Thirty-Second AAAI Conference on Artificial Intelligence. 2018.
  7. Hessel, Matteo, Joseph Modayil, Hado Van Hasselt, Tom Schaul, Georg Ostrovski, Will Dabney, Dan Horgan, Bilal Piot, Mohammad Azar, and David Silver. Rainbow: Combining improvements in Deep Reinforcement Learning. In Thirty-Second AAAI Conference on Artificial Intelligence. 2018.
  9. The details about different environments can be obtained from
  10. Wiki pages are maintained for some environments at
  11. Details regarding installation instructions and dependencies can be obtained from
  13. Xavier Glorot and Yoshua Bengio, Understanding the difficulty of training deep feedforward neural networks. Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, 2010,
  14. A good read on why RL is still hard to crack: