deterministic policy gradient

Introduction

This post mainly introduces the deterministic policy gradient (DPG) algorithm. It belongs to the policy gradient family, simply replacing the stochastic policy with a deterministic one; like a stochastic policy, a deterministic policy has a corresponding deterministic policy gradient theorem.
In fact, the deterministic policy gradient is the expectation of the gradient of the action value function, which makes the deterministic policy gradient far more efficient to estimate than the stochastic policy gradient. The post also introduces an off-policy actor-critic algorithm that learns a deterministic target policy while guaranteeing sufficient exploration.

stochastic policy gradient vs deterministic policy gradient

stochastic policy gradient

The basic idea of policy gradient methods is to represent the policy by a parametric probability distribution $\pi_\theta(a|s) = \mathbb{P}\left[a|s,\theta\right]$, so that at state $s$ an action $a$ is sampled from the policy $\pi$ given by $\theta$. Policy gradient methods sample from this policy, use the stochastic policy gradient theorem to approximate the gradient of $J$ with respect to the parameters $\theta$, and then adjust $\theta$ in the direction that increases $J$.

deterministic policy gradient

Let $a=\mu_\theta(s)$ denote a deterministic policy. Given the stochastic policy gradient theorem, it is natural to ask whether there is a deterministic policy gradient theorem, i.e. a way to update the policy parameters in the direction that increases the cumulative reward. The DPG paper proves that the deterministic policy gradient does exist and can be obtained from the gradient of the action value function. Under certain conditions, the deterministic policy gradient can also be viewed as a special (limiting) case of the stochastic policy gradient.

spg vs dpg

Differences and connections between spg and dpg:

  1. The first notable difference between spg and dpg lies in the space being integrated over. In spg, the policy gradient integrates over both the state and action spaces, whereas the dpg integrates over the state space only. As a result, computing the spg may require more samples, especially when the action space is high dimensional.
  2. Exploring the full state and action space requires a stochastic policy. When the policy is deterministic, ensuring continued exploration calls for an off-policy algorithm: a stochastic behaviour policy generates the samples, while the target policy remains deterministic. Using the deterministic policy gradient, the authors derive an off-policy actor-critic method that estimates action values with a differentiable function approximator and then updates the policy parameters along the gradient of this function; to keep the policy gradient unbiased, a compatible function approximator is used.

stochastic policy gradient theorem

A detailed introduction can be found in the policy gradient post. The basic idea of spg is to update the policy parameters in the direction of the gradient of $J$. Differentiating $J(\pi_{\theta})$ with respect to $\theta$ gives:
\begin{align*}
\nabla_{\theta} J(\pi_{\theta})&=\int_S\rho^{\pi} (s) \int_A\nabla_\theta\pi_\theta (a|s)Q^{\pi} (s,a) dads \tag{1}\\
&=\mathbb{E}_{s\sim \rho^{\pi} , a\sim \pi_\theta}\left[\nabla_\theta \log\pi_\theta(a|s)Q^{\pi} (s,a)\right] \tag{2}
\end{align*}
This is the policy gradient theorem. A key point here is that the state distribution $\rho^{\pi}(s)$ depends on the policy parameters, so computing the gradient of $J$ with respect to $\theta$ would in principle require the gradient of the state distribution, which is hard to obtain; the policy gradient theorem removes this dependence on $\nabla_\theta\rho^{\pi}$, which is of great practical value. At the same time, the log derivative trick turns the performance gradient into an expectation that can be estimated by sampling. The theorem still involves $Q^{\pi} (s,a)$, and different ways of estimating $Q$ yield different algorithms; for example, using the sampled return $G_t$ to estimate $Q^{\pi} (s, a)$ gives the REINFORCE algorithm.
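
As a concrete example, a minimal REINFORCE sketch for a tabular softmax policy might look like this (the toy `(state, action, reward)` episode format and all names here are assumptions of this sketch, not from the original text):

```python
import numpy as np

# Minimal REINFORCE sketch: theta is an (n_states, n_actions) matrix of logits.

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def grad_log_pi(theta, s, a):
    """Score function ∇_θ log π_θ(a|s) for the tabular softmax policy."""
    probs = softmax(theta[s])
    g = np.zeros_like(theta)
    g[s] = -probs          # ∂ log π(a|s) / ∂θ[s, b] = 1{b=a} - π(b|s)
    g[s, a] += 1.0
    return g

def reinforce_update(theta, episode, alpha=0.01, gamma=0.99):
    """One episode of REINFORCE: the sampled return G_t estimates Q^π(s_t, a_t)."""
    G, grad = 0.0, np.zeros_like(theta)
    for s, a, r in reversed(episode):   # episode = [(s_0, a_0, r_1), ...]
        G = r + gamma * G               # return following (s_t, a_t)
        grad += grad_log_pi(theta, s, a) * G
    return theta + alpha * grad         # ascend the sampled gradient, per eq. (2)
```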

stochastic actor-critic algorithm

Actor-critic is a framework based on the policy gradient theorem, consisting of a policy and a value function. The actor adjusts the parameters $\theta$ of the stochastic policy $\pi_\theta(a|s)$ by stochastic gradient ascent; the critic approximates $Q^\pi (s,a)$ with an action-value function $Q^w (s,a)$, where $w$ are the parameters of the function approximator. The critic is usually learned with a policy evaluation method such as TD or MC, giving $Q^w (s,a)\approx Q^\pi (s,a)$. In general, replacing the true $Q^\pi (s,a)$ with $Q^w (s,a)$ introduces bias; however, if the function approximator is compatible, i.e. it satisfies the following two conditions, there is no bias:

  1. $Q^w (s,a) = \nabla_\theta \log\pi_\theta(a|s)^T w$
  2. the parameters $w$ minimize the mean squared error $\epsilon^2 (w)=\mathbb{E}_{s\sim \rho^\pi ,a\sim \pi_\theta}\left[(Q^w (s, a)-Q^\pi (s,a))^2 \right]$; then the gradient is unbiased, i.e.:
    $$\nabla_\theta J(\pi_\theta)=\mathbb{E}_{s\sim \rho^\pi, a\sim \pi_\theta}\left[\nabla_\theta \log\pi_\theta(a|s)Q^w (s,a)\right] \tag{3}$$

Intuitively, condition 1 says that a compatible function approximator is linear in the "features" of the stochastic policy, $\nabla_\theta \log\pi_\theta(a|s)$, and condition 2 requires $Q^w (s,a)$ to be the linear regression solution for $Q^\pi (s,a)$ on these features. In practice, condition 2 is usually relaxed, and the value function is estimated more efficiently with a policy evaluation algorithm such as TD. In fact, if conditions 1 and 2 are both satisfied exactly, the overall algorithm is equivalent to not using a critic at all, much like REINFORCE.
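
For intuition, here is what a compatible critic looks like in code for the softmax policy sketched above; `grad_log_pi` is the score function from that sketch, and fitting $w$ by TD(0) rather than by the exact least-squares solution of condition 2 reflects the relaxation just mentioned. A minimal sketch, with all names assumed:

```python
import numpy as np

def compatible_q(theta, w, s, a, grad_log_pi):
    """Condition 1: Q^w(s,a) = ∇_θ log π_θ(a|s)^T w, linear in the policy's score features."""
    phi = grad_log_pi(theta, s, a).ravel()
    return phi @ w

def compatible_critic_td0(theta, w, transition, grad_log_pi,
                          alpha_w=0.05, gamma=0.99):
    """Relaxed condition 2: fit w by TD(0) on the compatible features
    instead of solving the linear regression exactly."""
    s, a, r, s2, a2 = transition
    phi  = grad_log_pi(theta, s,  a ).ravel()
    phi2 = grad_log_pi(theta, s2, a2).ravel()
    delta = r + gamma * (phi2 @ w) - (phi @ w)   # TD error
    return w + alpha_w * delta * phi
```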

off-policy actor-critic

In the off-policy setting, the performance objective is usually modified to be the value function of the target policy averaged over the state distribution of the behaviour policy; denoting the behaviour policy by $\beta(a|s)$:
\begin{align*}
J_\beta(\pi_\theta) &= \int_S\rho^\beta (s) V^\pi (s)ds\tag{4}\\
&=\int_S\int_A\rho^\beta (s)\pi_\theta(a|s)Q^\pi (s,a)dads \tag{5}\\
\end{align*}
Differentiating and approximating gives:
\begin{align*}
\nabla_\theta J_\beta(\pi_\theta) &\approx \int_S\int_A\rho^\beta (s)\nabla_\theta\pi_\theta(a|s) Q^\pi (s,a)dads\tag{6} \\
&=\mathbb{E}_{s\sim \rho^\beta, a\sim \beta}\left[\frac{\pi_\theta(a|s)}{\beta(a|s)}\nabla_\theta \log\pi_\theta(a|s) Q^\pi (s,a) \right]\tag{7}
\end{align*}
This approximation drops a term, $\nabla_\theta Q^\pi (s,a)$; it has been shown that the gradient still converges to a local optimum without this term. The off-policy actor-critic algorithm generates trajectories with the behaviour policy $\beta$; the critic estimates the state value function $V^v (s)\approx V^\pi (s)$ off-policy from those trajectories, while the actor updates the policy parameters $\theta$ off-policy by stochastic gradient ascent from the same trajectories, using the TD error as an estimate of $Q^\pi (s,a)$. Both the actor and the critic use the importance sampling ratio $\frac{\pi_\theta(a|s)}{\beta(a|s)}$.
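
A per-step sketch of these off-policy updates. The linear critic $V^v(s)=\phi(s)^T v$, the use of the TD error $\delta$ in place of $Q^\pi(s,a)$, and all names are assumptions of this sketch; the full OffPAC algorithm additionally uses eligibility traces:

```python
import numpy as np

def off_policy_ac_step(theta, v, phi_s, phi_s2, r,
                       pi_prob, beta_prob, grad_log_pi_sa,
                       alpha_v=0.05, alpha_theta=0.01, gamma=0.99):
    """One off-policy actor-critic step for a sample (s, a, r, s') drawn from β.
    pi_prob, beta_prob: π_θ(a|s) and β(a|s) for the sampled action;
    phi_s, phi_s2: state features for the linear critic V^v(s) = φ(s)^T v."""
    rho = pi_prob / beta_prob                       # importance sampling ratio
    delta = r + gamma * (phi_s2 @ v) - (phi_s @ v)  # TD error, used in place of Q^π
    v = v + alpha_v * rho * delta * phi_s           # critic: off-policy TD(0)
    theta = theta + alpha_theta * rho * delta * grad_log_pi_sa  # actor, per eq. (7)
    return theta, v
```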

action value gradients

Most model-free RL algorithms are instances of generalized policy iteration (GPI), which interleaves policy evaluation with policy improvement. In continuous action spaces, greedy policy improvement becomes infeasible, because it requires a global maximization at every step. A practical alternative is to move the policy in the direction of the gradient of $Q$ instead of globally maximizing $Q$. Concretely, for every visited state $s$, the policy parameters $\theta^{k+1}$ are updated in proportion to $\nabla_{\theta} Q^{ {\mu}^k } (s, \mu_{\theta}(s) )$. Since each state suggests a different direction of improvement, we average over the state distribution $\rho^{\mu} (s)$:
$$\theta^{k+1} = \theta^k + \alpha \mathbb{E}_{s\sim \rho^{ {\mu}^k } }\left[\nabla_\theta Q^{ {\mu}^k } (s, \mu_{\theta}(s))\right] \tag{8}$$
By the chain rule, this policy improvement step decomposes into the gradient of the action value with respect to the action and the gradient of the policy with respect to the policy parameters:
$$\theta^{k+1} = \theta^k + \alpha \mathbb{E}_{s\sim \rho^{ {\mu}^k } }\left[\nabla_{\theta}\mu_{\theta}(s)\nabla_a Q^{ {\mu}^k } (s,a)|_{a=\mu_\theta(s)}\right] \tag{9}$$
By convention, $\nabla_{\theta}\mu_{\theta}(s)$ is a Jacobian matrix whose $d$th column is the gradient $\nabla_\theta\left[\mu_\theta(s)\right]_d$ of the $d$th action dimension of the policy with respect to $\theta$. Changing the policy also changes the state distribution $\rho^{\mu}$, so it is not obvious that this update improves the policy when the change in distribution is ignored. Fortunately, as with the spg, there is a deterministic policy gradient theorem showing that the parameters can be updated without computing the gradient of the state distribution.
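
To make the chain-rule decomposition in $(9)$ concrete, here is a toy numpy sketch with a linear policy $\mu_\theta(s)=\Theta s$ and a hand-written quadratic critic; both choices (and all names) are illustrative assumptions, not part of the original algorithm:

```python
import numpy as np

# Policy improvement along ∇_a Q, eq. (9):
#   θ ← θ + α ∇_θ μ_θ(s) ∇_a Q(s, a)|_{a=μ_θ(s)}
# Toy setup: linear policy μ_θ(s) = Θ s and quadratic critic
# Q(s, a) = -||a - a_star||², whose action gradient is -2(a - a_star).

def dpg_style_update(Theta, s, a_star, alpha=0.1):
    a = Theta @ s                      # a = μ_θ(s)
    dq_da = -2.0 * (a - a_star)        # ∇_a Q(s, a) at a = μ_θ(s)
    # For μ_θ(s) = Θ s, contracting the Jacobian ∇_Θ μ with ∇_a Q
    # gives the outer product (∇_a Q) sᵀ.
    return Theta + alpha * np.outer(dq_da, s)

# One step moves μ_θ(s) toward the maximizer a_star:
Theta = np.zeros((2, 3))
s = np.array([1.0, 0.5, -0.2])
print(dpg_style_update(Theta, s, a_star=np.array([1.0, -1.0])))
```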

deterministic policy gradient theorem

New notation

  • $\rho_0(s)$: the initial state distribution.
  • $p(s\rightarrow s', k, \mu)$: the probability of transitioning from state $s$ to $s'$ in $k$ steps under policy $\mu$.
  • $\rho^{\mu} (s')$: the discounted state distribution, defined as:
    $$\rho^{\mu} (s') = \int_S \sum_{k=0}^{\infty} \gamma^{k} \rho_0(s) p(s\rightarrow s', k, \mu)ds\tag{10}$$

theorem and proof

A deterministic policy is a function $\mu_\theta: S\rightarrow A$ with $\theta \in \mathbb{R}^n$. Define the performance objective $J(\mu_\theta) =\mathbb{E}\left[G_0|\mu\right]$ and write it as an expectation:
\begin{align*}
J(\mu_\theta) & = \int_S\rho^\mu (s) R(s,\mu_\theta(s)) ds\tag{11}\\
&= \mathbb{E}_{s\sim \rho^\mu }\left[R(s,\mu_\theta(s))\right] \tag{12}
\end{align*}
This gives the deterministic policy gradient theorem:
Suppose the MDP is such that $\nabla_\theta\mu_\theta(s)$ and $\nabla_a Q^\mu (s,a)$ exist. Then the deterministic policy gradient exists, and
\begin{align*}
\nabla_\theta J(\mu_\theta) &= \int_S\rho^\mu (s) \nabla_\theta\mu_\theta(s)\nabla_aQ^\mu (s,a)|_{a=\mu_\theta(s)}ds\tag{13}\\
&=\mathbb{E}_{s\sim \rho^\mu }\left[\nabla_\theta\mu_\theta(s)\nabla_aQ^\mu (s,a)|_{a=\mu_\theta(s)}\right] \tag{14}
\end{align*}
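
As a quick sanity check of $(14)$ on a one-step example (all choices here are mine, for illustration): take a single state $s$ with $Q^\mu(s,a) = -(a-a^* )^2$ and $\mu_\theta(s)=\theta$. Then $\nabla_\theta\mu_\theta(s)=1$ and $\nabla_a Q^\mu (s,a)|_{a=\theta} = -2(\theta-a^* )$, so $\nabla_\theta J = -2(\theta-a^* )$, which points toward the maximizing action and vanishes exactly at $\theta=a^*$.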
Proof:
\begin{align*}
\nabla_{\theta}V^{\mu} (s) & = \nabla_{\theta} Q^{\mu} (s, \mu(s))\tag{15}\\
& = \nabla_{\theta} \left( R(s,\mu(s)) + \int_S \gamma p(s'|s,\mu(s))V^{\mu} (s') ds' \right)\tag{16}\\
& = \nabla_{\theta}\mu(s) \nabla_a R(s,a)|_{a=\mu(s)} + \nabla_{\theta}\int_S \gamma p(s'|s,\mu(s))V^{\mu} (s') ds'\tag{17}\\
& = \nabla_{\theta}\mu(s) \nabla_a R(s,a)|_{a=\mu(s)} + \int_S \gamma \left(p(s'|s,\mu(s))\nabla_{\theta} V^{\mu} (s') + \nabla_{\theta}\mu(s)\nabla_a p(s'|s, a)|_{a=\mu(s)} V^{\mu} (s') \right) ds'\tag{18}\\
& = \nabla_{\theta}\mu(s) \nabla_a \left( R(s,a) + \int_S \gamma p(s'|s, a) V^{\mu} (s') ds' \right)\Big|_{a=\mu(s)} + \int_S \gamma p(s'|s,\mu(s))\nabla_{\theta} V^{\mu} (s') ds'\tag{19}\\
& = \nabla_{\theta}\mu(s) \nabla_a Q^{\mu}(s,a) |_{a=\mu(s)} + \int_S \gamma p(s\rightarrow s', 1, \mu)\nabla_{\theta} V^{\mu} (s') ds'\tag{20}\\
& = \nabla_{\theta}\mu(s) \nabla_a Q^{\mu}(s,a) |_{a=\mu(s)}\\
&\qquad+ \int_S \gamma p(s\rightarrow s', 1, \mu) \left(\nabla_{\theta}\mu(s') \nabla_a Q^{\mu} (s', a') |_{a' = \mu(s')} + \int_S \gamma p(s' \rightarrow s'', 1, \mu) \nabla_{\theta} V^{\mu} (s'') ds'' \right) ds' \tag{21}\\
& = \nabla_{\theta}\mu(s) \nabla_a Q^{\mu}(s,a) |_{a=\mu(s)}\\
&\qquad + \int_S \gamma p(s\rightarrow s', 1, \mu) \nabla_{\theta}\mu(s') \nabla_a Q^{\mu} (s', a') |_{a' = \mu(s')}ds' \\
&\qquad + \int_S \gamma p(s\rightarrow s', 1, \mu) \int_S \gamma p(s' \rightarrow s'', 1, \mu)\nabla_{\theta} V^{\mu} (s'')ds'' ds' \tag{22}\\
& = \nabla_{\theta}\mu(s) \nabla_a Q^{\mu}(s,a) |_{a=\mu(s)}\\
&\qquad + \int_S \gamma p(s\rightarrow s', 1, \mu) \nabla_{\theta}\mu(s') \nabla_a Q^{\mu} (s', a') |_{a' = \mu(s')} ds'\\
&\qquad + \int_S \gamma^2 p(s\rightarrow s', 2, \mu)\nabla_{\theta} V^{\mu} (s') ds' \tag{23}\\
& = \int_S \sum_{t=0}^{\infty} \gamma^t p(s\rightarrow s', t, \mu)\nabla_{\theta} \mu(s')\nabla_{a} Q^{\mu} (s', a)|_{a=\mu(s')} ds' \tag{24}\\
\end{align*}
Repeatedly unrolling the recursion in $(23)$ gives $(24)$. Therefore:
\begin{align*}
\nabla_{\theta} J(\mu) & = \nabla_{\theta} \int_S \rho_0(s) V^{\mu} (s) ds \tag{25}\\
& = \int_S \rho_0(s) \nabla_{\theta} V^{\mu} (s) ds \tag{26}\\
& = \int_S\int_S \sum_{t=0}^{\infty} \gamma^t\rho_0(s) p(s\rightarrow s', t, \mu)\nabla_{\theta} \mu(s')\nabla_{a} Q^{\mu} (s', a)|_{a=\mu(s')} ds' ds\tag{27}\\
& = \int_S\rho^{\mu} (s) \nabla_{\theta} \mu(s)\nabla_{a} Q^{\mu} (s, a)|_{a=\mu(s)} ds\tag{28}\\
& = \mathbb{E}_{s\sim \rho^\mu}\left[ \nabla_{\theta}\mu(s)\nabla_{a} Q^{\mu} (s, a)|_{a=\mu(s)} \right]\tag{29}\\
\end{align*}
The step from $(25)$ to $(26)$ holds because $\rho_0(s)$ does not depend on $\mu$ (and hence not on $\theta$).

dpg as the limit of the spg

The dpg can in fact be viewed as a special case of the spg. Parameterize a stochastic policy $\pi_{\mu_\theta,\sigma}$ by a deterministic policy $\mu_\theta:S\rightarrow A$ and a variance parameter $\sigma$, such that $\pi_{\mu_\theta, 0} \equiv \mu_\theta$ when $\sigma = 0$; then, as $\sigma \rightarrow 0$, the stochastic policy gradient converges to the deterministic policy gradient.
Concretely, consider a stochastic policy $\pi_{\mu_\theta,\sigma}$ with $\pi_{\mu_\theta,\sigma}(a|s)=v_\sigma(\mu_\theta(s),a)$, where $\sigma$ controls the variance. If $v_\sigma$ satisfies Condition B.1 and the MDP satisfies Conditions A.1 and A.2 (conditions stated in the paper), then
$$\lim_{\sigma\rightarrow 0}\nabla_\theta J(\pi_{\mu_\theta, \sigma}) = \nabla_\theta J(\mu_\theta) \tag{31} $$
where the gradient on the left is the standard stochastic policy gradient and the gradient on the right is the deterministic policy gradient. This means that many spg techniques carry over to the dpg setting.
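
A quick numerical illustration of this limit on a one-state, one-dimensional-action problem. The Gaussian policy and the choice $Q(a)=\sin(a)$ below are my own assumptions, picked so that both gradients have closed forms (spg $= \cos(\theta)e^{-\sigma^2/2}$, dpg $= \cos(\theta)$):

```python
import numpy as np

# Numerically check lim_{σ→0} ∇_θ J(π_{μ_θ,σ}) = ∇_θ J(μ_θ).
# Policy: π(a) = N(θ, σ²) with μ_θ = θ;  critic: Q(a) = sin(a).
rng = np.random.default_rng(0)
theta = 0.5
dpg = np.cos(theta)                         # ∇_θ μ · ∇_a Q|_{a=θ}

for sigma in [1.0, 0.5, 0.2, 0.05]:
    a = rng.normal(theta, sigma, size=2_000_000)
    score = (a - theta) / sigma**2          # ∇_θ log π_θ(a)
    spg = np.mean(score * np.sin(a))        # REINFORCE estimate of the spg
    print(f"sigma={sigma:4.2f}  spg≈{spg:+.4f}  dpg={dpg:+.4f}")
```

As $\sigma$ shrinks, the printed spg estimate approaches the dpg value, matching $(31)$.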

deterministic actor-critic

Next we use the dpg theorem to derive on-policy and off-policy actor-critic algorithms. We start with the simplest on-policy update, which uses a Sarsa critic, and then present the off-policy algorithm, which uses a Q-learning critic, to convey the core idea. In practice these algorithms may have convergence problems, since function approximation introduces bias and off-policy learning introduces instability; compatible function approximation and gradient TD learning can be used to address these problems.

on-policy deterministic actor-critic

A deterministic behaviour policy cannot guarantee sufficient exploration and may produce suboptimal solutions. We nevertheless begin with the on-policy actor-critic algorithm, because it is instructive for understanding the core idea behind these methods, and it can even be adequate in environments noisy enough to ensure sufficient exploration under a deterministic behaviour policy.
Like the stochastic actor-critic, the deterministic actor-critic consists of an actor and a critic: the actor adjusts the parameters $\theta$ of $\mu_\theta(s)$ by gradient ascent, while the critic estimates $Q^w (s,a) \approx Q^{\mu} (s, a)$ with a policy evaluation method and thereby guides the actor's updates. For example, the critic can estimate the action value with Sarsa:
\begin{align*}
\delta_t & = R_{t+1} + \gamma Q^w (s_{t+1}, a_{t+1}) - Q^w(s_t, a_t) \tag{32}\\
w_{t+1} & = w_t + \alpha_w \delta_t \nabla_w Q^w (s_t,a_t)\tag{33}\\
\theta_{t+1} & = \theta_t + \alpha_{\theta} \nabla_{\theta} \mu_{\theta}(s_t) \nabla_a Q^w(s_t, a)|_{a=\mu_\theta(s_t)} \tag{34}\\
\end{align*}
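
Equations $(32)$-$(34)$ as a single per-step update with a linear critic (the feature map, step sizes, and the way $\nabla_a Q^w$ is supplied are all assumptions of this sketch):

```python
import numpy as np

def on_policy_dac_step(theta, w, phi_sa, phi_s2a2, r,
                       jac_mu, dq_da, alpha_w=0.05, alpha_theta=0.01, gamma=0.99):
    """One step of the on-policy deterministic actor-critic, eqs. (32)-(34).
    phi_sa, phi_s2a2: critic features φ(s_t, a_t) and φ(s_{t+1}, a_{t+1})
                      for a linear critic Q^w(s,a) = φ(s,a)^T w;
    jac_mu:           Jacobian ∇_θ μ_θ(s_t), shape (n_params, action_dim);
    dq_da:            ∇_a Q^w(s_t, a)|_{a=μ_θ(s_t)}, shape (action_dim,)."""
    delta = r + gamma * (phi_s2a2 @ w) - (phi_sa @ w)  # eq. (32): Sarsa TD error
    w = w + alpha_w * delta * phi_sa                   # eq. (33): critic update
    theta = theta + alpha_theta * (jac_mu @ dq_da)     # eq. (34): actor ascends ∇_a Q^w
    return theta, w
```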

off-policy deterministic actor-critic

Now consider the off-policy actor-critic method. The performance objective is the value function of the target policy, averaged over the state distribution of the behaviour policy:
\begin{align*}
J_{\beta}(\mu) & = \int_S \rho^{\beta}(s) V^{\mu} (s) ds\\
& = \int_S \rho^{\beta} (s) Q^{\mu} (s, \mu_{\theta}(s) ) ds \tag{35}\\
\end{align*}
Taking the gradient and approximating gives:
\begin{align*}
\nabla_{\theta}J_{\beta}(\mu_\theta) & \approx \int_S\rho^\beta (s) \nabla_{\theta}\mu(s)\nabla_{a} Q^{\mu} (s, a)|_{a=\mu(s)} ds \\
& = \mathbb{E}_{s\sim \rho^\beta }\left[ \nabla_{\theta}\mu(s)\nabla_{a} Q^{\mu} (s, a)|_{a=\mu(s)} \right]\tag{36}\\
\end{align*}
The critic updates the action value function with Q-learning:
\begin{align*}
\delta_t & = R_{t+1} + \gamma Q^w (s_{t+1}, \mu_{\theta}(s_{t+1})) - Q^w(s_t, a_t) \tag{37}\\
w_{t+1} & = w_t + \alpha_w \delta_t \nabla_w Q^w (s_t,a_t)\tag{38}\\
\theta_{t+1} & = \theta_t + \alpha_{\theta} \nabla_{\theta} \mu_{\theta}(s_t) \nabla_a Q^w(s_t, a)|_{a=\mu_\theta(s_t)} \tag{39}\\
\end{align*}
In the stochastic off-policy actor-critic, both the actor and the critic use importance sampling. In the deterministic off-policy actor-critic, the deterministic policy removes the integral over actions, so the actor avoids importance sampling, and using one-step Q-learning removes importance sampling from the critic as well.
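
Equations $(37)$-$(39)$ in the same style: the only changes from the on-policy sketch are that the critic bootstraps from $\mu_\theta(s_{t+1})$ (Q-learning) rather than the executed $a_{t+1}$, and that the transitions may come from any behaviour policy $\beta$. Again, all names and the linear critic are assumptions of the sketch:

```python
import numpy as np

def off_policy_dac_step(theta, w, q_features, mu, jac_mu, dq_da,
                        s, a, r, s2, alpha_w=0.05, alpha_theta=0.01, gamma=0.99):
    """One step of the off-policy deterministic actor-critic, eqs. (37)-(39).
    q_features(s, a): φ(s, a) for a linear critic Q^w(s,a) = φ(s,a)^T w;
    mu(theta, s):     the deterministic target policy μ_θ;
    jac_mu, dq_da:    as in the on-policy sketch.
    (s, a, r, s2) may be generated by any behaviour policy β; note that
    no importance sampling ratio appears in either update."""
    a2 = mu(theta, s2)                                  # bootstrap from the target policy
    phi, phi2 = q_features(s, a), q_features(s2, a2)
    delta = r + gamma * (phi2 @ w) - (phi @ w)          # eq. (37): Q-learning TD error
    w = w + alpha_w * delta * phi                       # eq. (38): critic update
    theta = theta + alpha_theta * (jac_mu(theta, s) @ dq_da(w, theta, s))  # eq. (39)
    return theta, w
```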

Summary

The deterministic policy gradient outperforms the stochastic policy gradient, especially on high dimensional tasks, and dpg requires no extra computation. Moreover, some applications provide a differentiable policy but offer no way to inject noise into it; in such cases dpg is the more suitable choice.
The objective functions and their gradients are summarized below:

stochastic policy gradient theorem:

For the detailed proof, see the policy gradient post.
\begin{align*}
J(\pi_\theta) & = \int_S\rho^{\pi} (s)\int_A\pi_\theta(a|s) R(s,a) da ds \tag{40}\\
& = \mathbb{E}_{s\sim \rho^{\pi} , a\sim \pi_{\theta}} \left [R(s,a) \right]\tag{41}\\
\end{align*}

\begin{align*}
\nabla_{\theta}J(\pi_\theta) & = \int_S\rho^{\pi} (s)\int_A \nabla_\theta\pi_\theta(a|s) Q^{\pi} (s,a) da ds\tag{42}\\
& = \mathbb{E}_{s\sim \rho^{\pi} , a\sim \pi_{\theta}} \left [\nabla_{\theta} \log \pi_\theta(a|s)Q^{\pi} (s,a) \right]\tag{43}\\
\end{align*}

stochastic off-policy actor-critic

For the detailed proof, see Linear off-policy actor-critic.
\begin{align*}
J_\beta(\pi_\theta) & = \int_S\rho^{\beta}(s) V^{\pi} (s) ds\tag{44}\\
& = \int_S \int_A\rho^{\beta}(s) \pi_\theta(a|s) Q^{\pi} (s, a) da ds\tag{45}\\
\end{align*}

\begin{align*}
\nabla_{\theta}J_\beta(\pi_\theta) & \approx \int_S\int_A\rho^{\beta} (s) \nabla_{\theta} \pi_{\theta}(a|s) Q^{\pi} (s,a) da ds\tag{46}\\
& = \mathbb{E}_{s\sim \rho^{\beta}, a\sim \beta} \left[ \frac{\pi_{\theta}(a|s) }{\beta(a|s) } \nabla_{\theta}\log \pi_{\theta}(a|s) Q^{\pi} (s,a) \right]\tag{47}\\
\end{align*}

deterministic policy gradient theorem:

See the first half of this post for the detailed proof. One question I still have: in the policy gradient theorem, $p(s', r|s, a)$ is a function of $a$, and here $a = \mu_\theta(s)$, so shouldn't $p(s',r|s, a)$ also be a function of $\theta$?
\begin{align*}
J(\mu_{\theta}) & = \int_S\rho^{\mu} (s) R(s, \mu_{\theta}(s)) ds\tag{48}\\
& = \mathbb{E}_{s\sim \rho^{\mu} } \left[R(s, \mu_{\theta}(s)) \right]\tag{49}\\
\end{align*}

\begin{align*}
\nabla_{\theta} J(\mu_\theta) & = \int_S\rho^{\mu} (s)\nabla_{\theta}\mu_{\theta}(s) \nabla_a Q^{\mu} (s, a)|_{a=\mu_{\theta}(s)} ds\tag{50}\\
& = \mathbb{E}_{s\sim \rho^{\mu} } \left[ \nabla_{\theta}\mu_{\theta}(s) \nabla_a Q^{\mu} (s, a)|_{a=\mu_{\theta}(s)} \right]\tag{51}\\
\end{align*}

deterministic off-policy actor-critic

For the detailed proof, see Linear off-policy actor-critic.
\begin{align*}
J_{\beta}(\mu_\theta) & = \int_S\rho^{\beta} (s) V^{\mu} (s) ds\tag{52}\\
& = \int_S\rho^{\beta} (s) Q^{\mu} (s, \mu_{\theta}(s)) ds\tag{53}\\
\end{align*}
\begin{align*}
\nabla_{\theta} J_{\beta} (\mu_\theta) & \approx \int_S\rho^{\beta} (s)\nabla_{\theta}\mu_{\theta}(s) \nabla_a Q^{\mu} (s, a)|_{a=\mu_{\theta}(s)} ds\tag{54}\\
& = \mathbb{E}_{s\sim \rho^{\beta}}\left[\nabla_{\theta}\mu_{\theta}(s) \nabla_a Q^{\mu} (s, a)|_{a=\mu_{\theta}(s)} \right]\tag{55}\\
\end{align*}

References

Policy Gradient
1. https://arxiv.org/pdf/1509.02971.pdf
2. http://www0.cs.ucl.ac.uk/staff/d.silver/web/Publications_files/deterministic-policy-gradients.pdf