Deep Reinforcement Learning: Policy Gradient and Actor-Critic

9 minute read


In this post, we review the basic policy gradient algorithm for deep reinforcement learning and the actor-critic algorithm. Most of the contents are derived from CS 285 at UC Berkeley.

graph TD; id1[fit a model/estimate the return]-->id2[improve the policy]; id2-->id3[generate samples]; id3-->id1;

Policy Gradient Introduction

In this part, we will first focus on the fully observable model with finite horizon, i.e., and is bounded. At a very high level, the policy gradient method is very simple: The policy gradient algorithm parametrized the policy by , and use samples (interactions with the environment) to update the parameter .

Next, we first derive the policy gradient algorithm (REINFORCE) and introduce some techniques to improve the performance of policy graident.

Evaluating the objective

First, we use to denote the trajectory of 1 trial, to denote the reward of trajectory , and to denote the probability that trajectory appears under the policy .

Recall that our reinforcement learning objective can be written as

We define , and the objective becomes . We want to compute the gradient of , and then we can update our policy .

First, we have

From the basic calculus computations, we have

Then, plug into the the derivative of , we can get

Then, recall that from the Markovian assumption,we have

Taking the logarithms on both sides, we can get

Taking the gradient, the first term and the third term are thrown away, and we have the following formula for the policy gradient.

Evaluating the policy graident and REINFORCE algorithm

Now given the formula for the policy gradient, we can estimate the expectation in policy gradient by samples. Specifically, we have the following formula for the estimatation.

where denotes the index of the trajectory.

Then, given an estimation of the policy gradient, we can then improve our policy by the gradient method .

Putting them all together, we have a simple REINFORCE algorithm (Monte-Carlo policy gradient).

  1. For
    1. sample from (run the policy)

Note that this algorithm may behave not very well because of the high variance. We need to use variance reduction techniques to improve this algorithm. But before we introduce the variance reduction techniques, we may first see an example of the REINFORCE algorithm and see its generalization into the partially observable case.

Now, we compute the policy gradient for Gaussian policies

\begin{align} \pi_{\theta}(a_{t}|s_{t}) =& \mathcal N(f_{\theta}(s_{t});\Sigma), \newline \log\pi_{\theta}(a_{t}|s_{t}) =& -\frac{1}{2}(f_{\theta}(s_{t})-a_{t})^T\Sigma^{-1}(f_{\theta}(s_{t})-a_{t}) + const, \newline \nabla_{\theta}\log\pi_{\theta}(a_{t}|s_{t}) =& -\Sigma^{-1}(f_{\theta}(s_{t})-a_{t})\frac{df}{d\theta}. \end{align}

The REINFORCE algorithm simply tends to increase the probability of a good trajectory with high reward, and decrease the probability of a bad trajectory with low reward. The intuition behind REINFORCE can be summarized as: good stuff is made more likely, bad stuff is mode less likely, simply formalized the notion of “trial and error”.

It is also very easy to generalize the REINFORCE algorithm into the partially observable case. For partially observable cases, we just change into in the policy part. The estimation of the policy gradient is given as follow

Variance Reduction Technqiues

What is the problem with the policy gradient? As mentioned before, the policy gradient suffers from very high variance.

Recall that the policy gradient is

The variance is vary large because is very large. Moreover, adding the reward function by a constant will change the result. For example if we add a large constant to the reward function and let all rewards to be positive, then in each iteration, the policy gradient method will try to increase the probability of all sampled trajectories.

There are 2 methods that can reduce the variance of the policy gradient method, causality and baseline, and we will introduce them one by one.


The intuition of causality is very simple: the policy at time cannot affect reward at time . Then instead of using the whole to update the at time , we can only focus on the reward that happens after time . Formally, we have the following formula

\begin{align} \nabla_{\theta} J(\theta)\approx & \frac{1}{N}\sum_{i=1}^N\left(\nabla_{\theta}\log\pi_{\theta}(a_{i,t}|s_{i,t})\right)\left(\sum_{t=1}^T r(s_{i,t},a_{i,t})\right) \newline = & \frac{1}{N}\sum_{i=1}^N\nabla_{\theta}\log\pi_{\theta}(a_{i,t}|s_{i,t})\left(\sum_{t’=t}^T r(s_{i,t’},a_{i,t’})\right) \end{align}

The last term can be viewed as “reward to go”, and we use to denote that quantity.

It is easy to know that the causality helps to reduce the variance, because smaller number leads to smaller variance. In practice this basically always help.


The baselines technique simply means to add constants to the reward function.

We want

where is some constant. We may simply choose .

Then, we show that adding constants to the reward function does not change the policy gradient. We have \begin{align} \mathbb E\left[\nabla_{\theta}\log\pi_{\theta}(\tau)b\right] = & \int\pi_{\theta}(\tau)\nabla_{\theta}\log\pi_{\theta}(\tau)bd\tau \newline = & b\nabla_{\theta}\int \pi_{\theta}(\tau)d\tau \newline = & b\nabla_{\theta}1 = 0. \end{align}

The previous argument shows that subtracting a baseline is unbiased in expectation. The average reward is not the best baseline, but it is often good in practice. A more careful analysis will show that is the best baseline, where .

Off-policy learning and importance sampling

Policy gradient is on-policy

Changing the policy will change the probability distribution for the expectation. Neural networks change only a litle bit with rach gradient step, and on-policy learning can be extremely inefficient.

Now recall that , what if we don’t have samples from but have samples from $$\bar\pi(\tau) instead?

Importance sampling \begin{align} \mathbb E_{s\sim p(x)}[f(x)] =& \int p(x)f(x)dx \newline =& \int\frac{q(x)}{q(x)}p(x)f(x)dx \newline =& \int q(x)\frac{p(x)}{q(x)}f(x)dx \newline =& \mathbb E_{x\sim q(x)}\left[\frac{p(x)}{q(x)}f(x)\right] \end{align}

Now using the importance sampling,


Now recall that $$\pi_{\theta}(\tau) = p(s_1)\prod_{t=1}^T \pi_{\theta}(a_{t}s_{t})p(s_{t+1}s_{t},a_{t})$$, we know that

Deriving the policy gradient with importance sampling

can we estimate the value of some new parameters ?

Taking the gradient

When , it is exactly the policy gradient. If ,

where the last line uses causality.

Policy gradient with automatic differentiation

Just implement a “pseudo-loss” as a weighted maximum likelihood:

Actor-Critic Introduction

Review of policy gradient and intuition for actor-critic

: estimate of expected reward if we take action in state . Can we get a better estimate?

$$Q(s_{t},a_{t}) = \sum_{t’=t}^{T}\mathbb E_{\pi_{\theta}}[r(s_{t’},a_{t’})s_{t},a_{t}]$$ is the true expected reward to go.

What about the baseline? Previously, we set , and now we want to set the baseline .

Note that the Q-functions, the value functions are all related to a policy , and we have

$$Q^{\pi}(s_{t},a_{t}) = \sum_{t’=t}^T\mathbb E_{\pi_{\theta}}[r(s_{t’},a_{t’})s_{t},a_{t}]$$,
$$V^{\pi}(s_{t}) = \mathbb E_{a_{t}\sim \pi_{\theta}(a_{t}s_{t})}[Q^{\pi}(s_{t},a_{t})]$$,

: the advantage, how much is.


The better the estimate, the lower the variance.

Value function fitting What to fit?

$$Q^{\pi}(s_{t},a_{t}) = r(s_{t},a_{t}) + \sum_{t’=t+1}^T\mathbb E_{\pi_{\theta}}[r(s_{t’},a_{t’})s_{t},a_{t}]\approx r(s_{t},a_{t}) + V^{\pi}(s_{t+1})$$,

where the last step is unbiased. Then, we can write out the advantage as

Let’s fit . We use a neural network to represent with parameter . Currently, and , the parameters for the policy and the advantage are not the same (Later, combine them).

Policy evaluation

$$V^{\pi}(s_{t}) = \sum_{t’=t}^T\mathbb E_{\pi_{\theta}}[r(s_{t’},a_{t’})s_{t}]$$,

how can we perform policy evaluation? Monte Carlo policy evaluation (this is what policy gradient does)

Need to “reset” the simulators.

Monte Carlo evaluation with function approximation

not as good as but is still pretty good.

Training data:

Supervised regression: $$\mathcal L(\phi) = \frac{1}{2}\sum_{j} \hat V^{\pi}_{\phi}(s_j)-y_j ^2$$.

Can we do better?

Ideal target: $$y_{i,t} = \sum_{t’=t}^T \mathbb E_{\pi_{\theta}}[r(s_{t’},a_{t’})s_{i,t}] \approx r(s_{i,t},a_{i,t}) + V^{\pi}(s_{i,t+1})\approx r(s_{i,t},a_{i,t})+V^{\pi}{\phi}(s{i,t+1})$$

The last step means to directly use the previous fitted value function. Smaller variance.

Monte Carlo target:

Training data:

Supervised regression: $$\mathcal L(\phi) = \frac{1}{2}\sum_{j} \hat V^{\pi}_{\phi}(s_j)-y_j ^2$$.

This is sometimes referred to as a “bootstrapped” estimate.

Actor-critic algorithm

batch actor-critic algorithm:

  1. sample from $$\pi_{\theta}(a_{t}s_{t})$$ (run the policy)
  2. fit to sampled reward sums
  3. evaluate

Here, recall that we have training data:

Supervised regression: $$\mathcal L(\phi) = \frac{1}{2}\sum_{j} \hat V^{\pi}_{\phi}(s_j)-y_j ^2$$.

Aside: discount factors

When the trajectories are likely to go to infinity or become very large, use a discout factor. We set

, discount factor (and 0.99 works well)

with critic

batch actor-critic algorithm with discount:

  1. sample from $$\pi_{\theta}(a_{t}s_{t})$$ (run the policy)
  2. fit to sampled reward sums
  3. evaluate

online actor-critic algorithm with discount:

  1. take action $$a\sim \pi_{\theta}(as)(s,a,a’,r)$$
  2. update using target
  3. evaluate

Architecture design

two network design: simple and stable, but not feature sharing

one network design: the output contains 2 networks, harder to tune the hyper-parameters

online actor-critic algorithm in practice

online actor-critic will not work for deep neural networks, in step 2, would like to update the critic by a batch of data. In practice, get the data in parallel.

Critics as state-dependent baselines.

For the actor-critic algorithm, the gradient is estimated by

The good thing is this representation has low variance (due to the critic), but it is not unbiased (if the critic is not perfect).

For the original policy gradient

This method has no bias, but has a higher variance (because of the single-sample estimate). Combine them together

There is no bias and has a lower variance.

n-step estimation

choosing often works better

Generalized advantage estimation


can choose


[1] CS 285 at UC Berkeley