Policy gradients are a family of algorithms that solve reinforcement learning problems by optimizing the policy directly in policy space. This is in stark contrast to value-based approaches (such as the Q-learning used by DeepMind to learn to play Atari games). Policy gradients have several appealing properties. For one, they produce stochastic policies (by learning a probability distribution over actions given observations), whereas value-based approaches are deterministic: they typically choose actions greedily with respect to the value function being learned, which can lead to under-exploration (one needs to introduce exploration strategies such as ϵ-greedy explicitly to get around this). Another advantage of policy gradients is their ability to tackle continuous action spaces without the discretization that value-based methods require. This is not to say that value-based approaches are useless: one of the biggest disadvantages of policy gradients is the high variance of their gradient estimates, which makes the updates very noisy and can destabilize the learning process. It is fair to say that much of the reinforcement learning research on policy gradients in the past few years has been about reducing the variance of these gradient updates to make the algorithms easier to train. It is also worth mentioning that state-of-the-art implementations commonly use so-called actor-critic algorithms, which combine a policy gradient with a value function estimator to get the best properties of both approaches.

Policy-Gradient methods are a subclass of Policy-Based methods that estimate the weights of an optimal policy through **gradient ascent**.

Intuitively, gradient ascent begins with an initial guess for the policy weights. The algorithm then evaluates the gradient of the expected return at that point, which indicates the direction of steepest increase, and takes a small step in that direction, hoping to end up at new weights for which the expected return is a little larger. It repeats this process of evaluating the gradient and taking steps until it considers that it has reached the maximum expected return.
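The loop just described can be sketched in a few lines of plain Python. The toy objective and the step size here are made up for illustration; they are not part of the REINFORCE method itself:

```python
# Toy example: maximize f(theta) = -(theta - 3)**2 with gradient ascent.
# The gradient is f'(theta) = -2 * (theta - 3), maximized at theta = 3.

def grad_f(theta):
    return -2.0 * (theta - 3.0)

theta = 0.0   # initial guess for the "weights"
alpha = 0.1   # step size
for _ in range(100):
    theta += alpha * grad_f(theta)   # small step in the gradient direction

print(round(theta, 4))   # converges to the maximizer theta = 3
```

In REINFORCE the objective is the expected return and the gradient is estimated from sampled trajectories, but the update rule has exactly this shape.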

Policy-based methods can learn either stochastic or deterministic policies. With a stochastic policy, our neural network’s output is an action vector that represents a probability distribution (rather than returning a single deterministic action).

The policy we will follow in the method presented in this Post is to select an action from this probability distribution. This means that if our Agent ends up in the same state twice, it may not take the same action every time. Representing actions as probabilities has many advantages, for instance smoothness: if we change our network weights a bit, the output of the neural network will change, but probably only a little.

In the case of a deterministic policy with a discrete output, even a small adjustment of the weights can lead to a jump to a different action. However, if the output is a probability distribution, a small change of weights will usually lead to a small change in the output distribution. This is a very important property, because gradient optimization methods are all about tweaking the parameters of a model a bit to improve the results.
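A quick numpy sketch of this smoothness property (the logit values and the perturbation are made up for illustration):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())   # subtract max for numerical stability
    return e / e.sum()

logits = np.array([1.0, 2.0, 0.5])   # hypothetical network outputs
probs = softmax(logits)

# Apply a tiny change to the network's outputs...
perturbed = logits + np.array([0.0, -0.02, 0.0])
# ...and the probability distribution only moves slightly:
delta = np.abs(softmax(perturbed) - probs).max()
print(delta)   # a small number, on the order of the perturbation
```

An argmax policy over the same logits would either not move at all or jump to a different action; the distribution, in contrast, changes continuously with the weights.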

But how can the network’s parameters be changed to improve the policy? If you remember from Post 6, we solved a very similar problem with the Cross-Entropy method: our network took observations as inputs and returned a probability distribution over the actions. In fact, the cross-entropy method is, in some sense, a preliminary version of the methods we will introduce in this Post.

The key idea underlying policy gradients is **reinforcing good actions:** to push up the probabilities of actions that lead to higher return, and push down the probabilities of actions that lead to a lower return, until you arrive at the optimal policy. The policy gradient method will iteratively amend the policy network weights (with smooth updates) to make state-action pairs that resulted in positive return more likely, and make state-action pairs that resulted in negative return less likely.

To introduce this idea we will start with a vanilla version (the basic version) of the policy gradient method, called the **REINFORCE algorithm** (original paper). This algorithm is the fundamental policy gradient algorithm on which nearly all advanced policy gradient algorithms are based.

Let’s look at a more mathematical definition of the algorithm, since it will help us understand the more advanced algorithms in the following Posts.

The first thing we need to define is a **trajectory**: simply a state-action sequence (we ignore the rewards in the notation). A trajectory is a little more flexible than an episode because there are no restrictions on its length; it can correspond to a full episode or just part of one. We denote its length with a capital *H*, so a trajectory is

τ = (s₀, a₀, s₁, a₁, … , s_H, a_H, s_{H+1})

The method REINFORCE is built upon **trajectories instead of episodes** because maximizing expected return over trajectories (instead of episodes) lets the method search for optimal policies for both episodic and continuing tasks.

That said, for the vast majority of episodic tasks, where a reward is only delivered at the end of the episode, it only makes sense to use the full episode as the trajectory; otherwise, we don’t have enough reward information to meaningfully estimate the expected return.

We denote the **return for a trajectory** *τ* by R(τ), the sum of the rewards collected along it:

R(τ) = r₁ + r₂ + ⋯ + r_{H+1}

The parameter *G_k* is called the total return, or future return, at time step *k*.

It is the return we expect to collect from time step *k* until the end of the trajectory, and it can be approximated by adding the discounted rewards from that step until the end of the episode, using the discount factor *γ*:

G_k = r_{k+1} + γ r_{k+2} + γ² r_{k+3} + ⋯
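As a concrete example, the future return can be computed from a list of rewards like this (the reward values and γ are chosen arbitrarily):

```python
def discounted_return(rewards, gamma):
    """G_k for the first step: r_{k+1} + gamma*r_{k+2} + gamma^2*r_{k+3} + ..."""
    G = 0.0
    for power, r in enumerate(rewards):
        G += (gamma ** power) * r
    return G

rewards = [1.0, 1.0, 1.0]   # rewards collected after step k
print(discounted_return(rewards, gamma=0.5))   # 1 + 0.5 + 0.25 = 1.75
```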

Gradient ascent is closely related to gradient descent; the difference is that gradient descent is designed to find the minimum of a function (it steps in the direction of the negative gradient), whereas gradient ascent finds the maximum (it steps in the direction of the gradient). We will use this approach in our code in PyTorch.

To apply this method, we need to be able to calculate the gradient ∇*U*(*θ*); however, we cannot calculate its exact value, since doing so would require considering every possible trajectory, which is computationally intractable in most cases.

Instead, the method samples trajectories using the policy and uses only those trajectories to estimate the gradient. This sampling is equivalent to the Monte Carlo approach presented in Post 13 of this series, and for this reason REINFORCE is also known as Monte Carlo Policy Gradients.
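The same Monte Carlo idea in miniature: estimating an expectation from samples rather than by enumerating every outcome (a fair die here stands in for the space of trajectories):

```python
import random

random.seed(0)

# Exact expectation of a fair six-sided die: (1+2+...+6)/6 = 3.5
exact = sum(range(1, 7)) / 6

# Monte Carlo estimate using samples only
samples = [random.randint(1, 6) for _ in range(100_000)]
estimate = sum(samples) / len(samples)

print(exact, round(estimate, 2))   # the estimate is close to 3.5
```

REINFORCE does the same thing with the gradient of the expected return: it replaces an intractable sum over all trajectories with an average over sampled ones.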

In summary, the pseudocode that describes in more detail the behavior of this method can be written as:
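A standard formulation of the REINFORCE loop, using the same symbols (*α*, *γ*, *H*) as the rest of this Post, is:

```
Initialize the policy weights θ at random
for each of MAX_TRAJECTORIES iterations:
    1. Sample a trajectory τ = (s₀, a₀, ..., s_H, a_H) by following
       the current policy π_θ
    2. For every step k of τ, compute the future return
       G_k = r_{k+1} + γ r_{k+2} + γ² r_{k+3} + ...
    3. Estimate the gradient: ∇U(θ) ≈ Σ_k ∇ log π_θ(a_k|s_k) · G_k
    4. Update the weights by gradient ascent: θ ← θ + α ∇U(θ)
```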

In gradient methods where we can formulate some probability p that should be maximized with respect to some parameters θ, we actually optimize the log probability log(p) instead of the probability p itself.

The reason is that, in general, optimizing log(p) works better than optimizing p, because the gradient of log(p) is usually better scaled. Remember that probabilities are bounded between 0 and 1 by definition, so the range of values the optimizer can operate over is limited and small.

In this case, probabilities may sometimes be extremely tiny or very close to 1, which runs into numerical issues on a computer with limited numerical precision. If we instead use a surrogate objective, namely log(p) (the natural logarithm), we get an objective with a larger “dynamic range” than raw probability space, since log probabilities range over (−∞, 0), and this makes the computation numerically easier.
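A small illustration of the numerical issue (the probability values are arbitrary):

```python
import math

# Multiplying many tiny probabilities underflows to 0.0 in float64...
probs = [1e-20] * 20
product = 1.0
for p in probs:
    product *= p
print(product)    # 0.0 -- underflow, all information lost

# ...while summing log-probabilities stays perfectly representable:
log_sum = sum(math.log(p) for p in probs)
print(log_sum)    # about -921, well within floating-point range
```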

Now, we will explore an implementation of REINFORCE to solve OpenAI Gym’s *CartPole* environment.

The entire code of this post can be found on GitHub and can be run as a Google Colab notebook using this link.

First, we will import all necessary packages with the following lines of code:

```python
import numpy as np
import torch
import gym
from matplotlib import pyplot as plt
```

And also the OpenAI Gym’s Cartpole Environment:

```python
env = gym.make('CartPole-v0')
```

We will build a neural network that serves as a policy network. The policy network will accept a state vector as input, and it will produce a (discrete) probability distribution over the possible actions.

```python
obs_size = env.observation_space.shape[0]
n_actions = env.action_space.n

HIDDEN_SIZE = 256
model = torch.nn.Sequential(
    torch.nn.Linear(obs_size, HIDDEN_SIZE),
    torch.nn.ReLU(),
    torch.nn.Linear(HIDDEN_SIZE, n_actions),
    torch.nn.Softmax(dim=-1)  # normalize over the action dimension,
                              # for single states and batches alike
)
```

The model has only two linear layers, with a ReLU activation after the first layer and a Softmax on the output. By default, the layers are initialized with random weights.

```python
print(model)
```

With the result of the neural network, the Agent samples from the probability distribution to take an action that will be executed in the Environment.

```python
act_prob = model(torch.from_numpy(curr_state).float())
action = np.random.choice(np.array([0,1]), p=act_prob.data.numpy())
prev_state = curr_state
curr_state, _, done, info = env.step(action)
```

The second line of this code samples an action from the probability distribution produced by the policy network in the first line. Then, in the last line, the Agent takes the action.

The training loop trains the policy network by updating the parameters *θ*, following the pseudocode steps described in the previous section.

First we define the optimizer and initialize some variables:

```python
learning_rate = 0.003
optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)

Horizon = 500
MAX_TRAJECTORIES = 500
gamma = 0.99
score = []
```

where `learning_rate` is the step size *α*, `Horizon` is *H*, and `gamma` is *γ* from the previous pseudocode. Using these variables, the main loop, whose number of iterations is given by `MAX_TRAJECTORIES`, is coded as:

```python
for trajectory in range(MAX_TRAJECTORIES):
    curr_state = env.reset()
    done = False
    transitions = []

    for t in range(Horizon):
        act_prob = model(torch.from_numpy(curr_state).float())
        action = np.random.choice(np.array([0,1]), p=act_prob.data.numpy())
        prev_state = curr_state
        curr_state, _, done, info = env.step(action)
        transitions.append((prev_state, action, t+1))
        if done:
            break
    score.append(len(transitions))

    reward_batch = torch.Tensor([r for (s,a,r) in transitions]).flip(dims=(0,))

    batch_Gvals = []
    for i in range(len(transitions)):
        new_Gval = 0
        power = 0
        for j in range(i, len(transitions)):
            new_Gval = new_Gval + ((gamma**power)*reward_batch[j]).numpy()
            power += 1
        batch_Gvals.append(new_Gval)
    expected_returns_batch = torch.FloatTensor(batch_Gvals)
    expected_returns_batch /= expected_returns_batch.max()

    state_batch = torch.Tensor([s for (s,a,r) in transitions])
    action_batch = torch.Tensor([a for (s,a,r) in transitions])

    pred_batch = model(state_batch)
    prob_batch = pred_batch.gather(dim=1,
                                   index=action_batch.long().view(-1,1)).squeeze()

    loss = -torch.sum(torch.log(prob_batch) * expected_returns_batch)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

With the `score` list we keep track of the trajectory length over training time. We keep track of the states and actions of the current trajectory in the list `transitions`.

Next, we compute the expected return for each transition (code snippet from the previous listing):

```python
batch_Gvals = []
for i in range(len(transitions)):
    new_Gval = 0
    power = 0
    for j in range(i, len(transitions)):
        new_Gval = new_Gval + ((gamma**power)*reward_batch[j]).numpy()
        power += 1
    batch_Gvals.append(new_Gval)
expected_returns_batch = torch.FloatTensor(batch_Gvals)
expected_returns_batch /= expected_returns_batch.max()
```

The list `batch_Gvals` is used to compute the expected return for each transition, as indicated in the previous pseudocode. The list `expected_returns_batch` stores the expected returns for all the transitions of the current trajectory. Finally, the code normalizes the returns to lie within the [0, 1] interval to improve numerical stability.

The loss function requires an array of action probabilities, `prob_batch`, for the actions that were taken, together with the discounted returns:

```python
loss = -torch.sum(torch.log(prob_batch) * expected_returns_batch)
```

For this purpose, we recompute the action probabilities for all the states in the trajectory and select the probabilities associated with the actions that were actually taken, with the following two lines of code:

```python
pred_batch = model(state_batch)
prob_batch = pred_batch.gather(dim=1,
                               index=action_batch.long().view(-1,1)).squeeze()
```
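What `gather` is doing here can be mimicked in numpy with `take_along_axis`, shown on a made-up batch of action probabilities:

```python
import numpy as np

# Hypothetical per-state action probabilities (3 states, 2 actions)
pred_batch = np.array([[0.7, 0.3],
                       [0.2, 0.8],
                       [0.6, 0.4]])
actions_taken = np.array([0, 1, 1])   # action chosen at each step

# Pick, for each row, the probability of the action actually taken
prob_batch = np.take_along_axis(pred_batch,
                                actions_taken.reshape(-1, 1),
                                axis=1).squeeze()
print(prob_batch)   # [0.7 0.8 0.4]
```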

An important detail is the **minus sign **in the loss function of this code:

```python
loss = -torch.sum(torch.log(prob_batch) * expected_returns_batch)
```

Why did we introduce a minus sign in front of the log probability? In general, we prefer to set things up so that we are minimizing an objective function rather than maximizing one, since that plays nicely with PyTorch’s built-in optimizers (which perform stochastic gradient descent). So instead of maximizing log π, we tell PyTorch to minimize −log π. This loss approaches 0 as π nears 1, so we are encouraging the gradients to maximize π for the actions we took.

Also, remember that we use a surrogate objective, namely −log π (where log is the natural logarithm), because it has a larger dynamic range than raw probability space (which is bounded between 0 and 1 by definition): the log of probability space ranges over (−∞, 0), which makes the computation numerically easier.

Finally, note that we included the following lines of code to monitor the progress of the training loop:

```python
if trajectory % 50 == 0 and trajectory > 0:
    print('Trajectory {}\tAverage Score: {:.2f}'
          .format(trajectory, np.mean(score[-50:-1])))
```

We can visualize the results by running the following code:

```python
def running_mean(x):
    N = 50
    kernel = np.ones(N)
    conv_len = x.shape[0] - N
    y = np.zeros(conv_len)
    for i in range(conv_len):
        y[i] = kernel @ x[i:i+N]
        y[i] /= N
    return y

score = np.array(score)
avg_score = running_mean(score)

plt.figure(figsize=(15,7))
plt.ylabel("Trajectory Duration", fontsize=12)
plt.xlabel("Training Epochs", fontsize=12)
plt.plot(score, color='gray', linewidth=1)
plt.plot(avg_score, color='blue', linewidth=3)
plt.scatter(np.arange(score.shape[0]), score,
            color='green', linewidth=0.3)
```

You should be able to obtain a plot with a nicely increasing trend of the trajectory duration.

We can also render how the Agent applies the policy with the following code:

```python
from IPython import display  # needed for inline rendering in a notebook

def watch_agent():
    env = gym.make('CartPole-v0')
    state = env.reset()
    rewards = []
    img = plt.imshow(env.render(mode='rgb_array'))
    for t in range(2000):
        pred = model(torch.from_numpy(state).float())
        action = np.random.choice(np.array([0,1]), p=pred.data.numpy())
        img.set_data(env.render(mode='rgb_array'))
        plt.axis('off')
        display.display(plt.gcf())
        display.clear_output(wait=True)
        state, reward, done, _ = env.step(action)
        rewards.append(reward)
        if done:
            print("Reward:", sum([r for r in rewards]))
            break
    env.close()

watch_agent()
```

Now that we know the two families of methods, what are the main differences between them?

- Policy methods, such as REINFORCE, directly optimize the policy. Value methods, such as DQN, do the same indirectly: they learn the value first and obtain the policy from it.
- Policy methods are on-policy and require fresh samples from the Environment (obtained with the current policy). Value methods, instead, can benefit from old data obtained with an old policy.
- Policy methods are usually less sample-efficient, which means they require more interaction with the Environment. Remember that value methods can benefit from large replay buffers.

Policy methods will be the more natural choice in some situations, and in others, value methods will be the better option. In any case, as we will see in the next Post, both families of methods can be combined into hybrid methods that take advantage of the properties of each.

In this post, we have explained in detail the REINFORCE algorithm, and we have coded it. As a stochastic gradient method, REINFORCE works well in simple problems and has good theoretical convergence properties.

References:

https://www.janisklaise.com/post/rl-policy-gradients/

https://towardsdatascience.com/policy-gradient-reinforce-algorithm-with-baseline-e95ace11c1c4

https://medium.com/@thechrisyoon/deriving-policy-gradients-and-implementing-reinforce-f887949bd63

https://julien-vitay.net/deeprl/PolicyGradient.html

https://towardsdatascience.com/policy-gradient-methods-104c783251e0