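The article on value function approximation algorithms in reinforcement learning described how to approximate the state value with a parametric function. Can the policy be parametrized as well? A policy can be viewed as a mapping from state to action, $a \leftarrow \pi(s)$.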
Parametric Policy
We can parametrize a policy as $\pi_{\theta}(a|s)$. It can be a deterministic policy, written $a = \pi_{\theta}(s)$, or a stochastic one in probabilistic form, $\pi_{\theta}(a|s) = P(a|s;\theta)$. For a stochastic policy, the parametrized policy outputs a probability distribution over actions. Here $\theta$ denotes the parameters of the policy. Approximating the policy with parameters gives the whole model stronger generalization ability (generalize from seen states to unseen states).
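As a minimal sketch of the two forms above, the snippet below builds a stochastic and a deterministic policy from the same parameters. The linear score `theta @ s`, the softmax, and the sizes (3 actions, 4 state features) are illustrative assumptions, not something the text prescribes.

```python
import numpy as np

def stochastic_policy(theta, s):
    """pi_theta(a|s): softmax over linear scores theta[a] @ s (an assumed form)."""
    scores = theta @ s                 # one score per action
    scores = scores - scores.max()     # shift for numerical stability
    probs = np.exp(scores)
    return probs / probs.sum()

def deterministic_policy(theta, s):
    """a = pi_theta(s): pick the highest-scoring action."""
    return int(np.argmax(theta @ s))

rng = np.random.default_rng(0)
theta = rng.normal(size=(3, 4))        # hypothetical sizes: 3 actions, 4 state features
s = rng.normal(size=4)                 # any state vector, seen before or not

probs = stochastic_policy(theta, s)            # distribution over the 3 actions
action = rng.choice(len(probs), p=probs)       # sample a ~ pi_theta(.|s)
greedy = deterministic_policy(theta, s)        # single action, no randomness
```

Because the policy is a function of state features rather than a lookup table, it returns a valid distribution for states it has never encountered, which is the generalization property mentioned above.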
Policy-based RL
Compared with value-based RL, policy-based RL has some attractive properties, for example:
Advantages
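Better convergence properties. (Although the policy improves only a little at each step, it always moves in a good direction, whereas value-function methods may keep oscillating slightly around the optimal value function without converging.)
Effective in continuous action spaces. (Value-function methods need a max operation: at the next state $s_{t+1}$ they must find which action maximizes the value function. Policy-based methods avoid this and are more efficient in a continuous action space.)
Can learn stochastic policies. (Value-function methods always take the max or act greedily, whereas policy-based methods can work with a distribution over actions.)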
Disadvantages
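Typically converge to a local rather than global optimum. (A linear model can reach its global optimum, while a model such as a neural network generally cannot reach the global optimum of its parameter space. Still, a local optimum of a complex model is often better than the global optimum of a linear model.)
Evaluating a policy is typically inefficient and of high variance. (The algorithm relies on sampling and on an estimate of the next value, so the variance tends to be high.)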
stochastic policy
For a stochastic policy $\pi_{\theta}(a|s) = P(a|s;\theta)$:
Intuition
Lower the probability of the action that leads to low value/reward.
Raise the probability of the action that leads to high value/reward.
In other words, if an action earns more reward, it is reinforced; otherwise it is weakened. This is also the idea behind behaviorism.
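The intuition can be illustrated with a toy example. The sketch below assumes a single state with three actions, made-up rewards, a softmax policy, and exact gradient ascent on the expected reward with an arbitrary step size of 0.5; the gradient expression used is the one for a softmax over action preferences.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

r = np.array([0.0, 1.0, 5.0])      # hypothetical reward of each action
theta = np.zeros(3)                # start from a uniform policy
history = [softmax(theta)]

for _ in range(50):
    pi = softmax(theta)
    # Gradient of sum_a pi(a) r_a w.r.t. theta: pi(a) * (r_a - expected reward).
    theta += 0.5 * pi * (r - pi @ r)
    history.append(softmax(theta))

first, last = history[0], history[-1]
```

Comparing `first` and `last` shows the probability of the highest-reward action growing and that of the lowest-reward action shrinking, since an action's preference rises exactly when its reward exceeds the current expected reward.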
Policy Gradient in One-Step MDPs
Consider the following environment, a one-step MDP: the starting state is drawn as $s \sim d(s)$, and the episode terminates after one time-step with reward $r_{sa}$.
The policy's expected value can then be written as:
$$
J(\theta) = \mathbb{E}_{\pi_{\theta}}[r] = \sum_{s \in S} d(s) \sum_{a \in A} \pi_{\theta}(a|s) r_{sa}
$$
Taking the partial derivative with respect to the parameters $\theta$ gives:
$$
\frac{\partial J(\theta)}{\partial \theta} = \sum_{s \in S} d(s) \sum_{a \in A} \frac{\partial \pi_{\theta}(a|s)}{\partial \theta} r_{sa}
$$
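To make the two formulas concrete, here is a small numerical sketch. It assumes a tabular softmax policy (one row of parameters per state), three states, two actions, and made-up values for $d(s)$ and $r_{sa}$; none of these come from the text.

```python
import numpy as np

def softmax_rows(theta):
    z = theta - theta.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def expected_reward(theta, d, r):
    """J(theta) = sum_s d(s) sum_a pi_theta(a|s) r_sa."""
    pi = softmax_rows(theta)
    return float(np.sum(d[:, None] * pi * r))

def expected_reward_grad(theta, d, r):
    """dJ/dtheta = sum_s d(s) sum_a dpi_theta(a|s)/dtheta * r_sa.

    For a tabular softmax, dpi(a|s)/dtheta[s, b] = pi(a|s) (1[a=b] - pi(b|s)),
    so the inner sum collapses to pi(b|s) * (r_sb - sum_a pi(a|s) r_sa).
    """
    pi = softmax_rows(theta)
    mean_r = np.sum(pi * r, axis=1, keepdims=True)
    return d[:, None] * pi * (r - mean_r)

d = np.array([0.5, 0.3, 0.2])                        # hypothetical start-state distribution
r = np.array([[1.0, 0.0], [0.0, 2.0], [0.5, 0.5]])   # hypothetical rewards r_sa
theta = np.zeros((3, 2))                             # uniform policy in every state

J = expected_reward(theta, d, r)
grad = expected_reward_grad(theta, d, r)
```

In the third state both actions pay the same reward, so the corresponding row of `grad` is zero: changing the policy there cannot change the expected reward.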
Likelihood Ratio
$\pi_{\theta}$ is a distribution, so how do we differentiate a distribution?
The approach relies on a mathematical trick called the likelihood ratio.
Likelihood ratios exploit the following identity, which rewrites the derivative in an exactly equivalent form:
$$
\begin{aligned}
\frac{\partial \pi_{\theta}(a | s)}{\partial \theta} &= \pi_{\theta}(a | s) \frac{1}{\pi_{\theta}(a | s)} \frac{\partial \pi_{\theta}(a | s)}{\partial \theta} \\
&= \pi_{\theta}(a | s) \frac{\partial \log \pi_{\theta}(a | s)}{\partial \theta}
\end{aligned}
$$
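Thus the gradient of the policy's expected value can be written as an expectation under the policy:
$$
\frac{\partial J(\theta)}{\partial \theta} = \sum_{s \in S} d(s) \sum_{a \in A} \pi_{\theta}(a|s) \frac{\partial \log \pi_{\theta}(a|s)}{\partial \theta} r_{sa} = \mathbb{E}_{\pi_{\theta}}\left[\frac{\partial \log \pi_{\theta}(a|s)}{\partial \theta} r_{sa}\right]
$$
Everything so far still concerns the one-step MDP.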
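The practical payoff of the likelihood-ratio identity is that the gradient becomes an average over actions sampled from the policy. The sketch below checks this numerically under assumed conditions: a single state, three actions, a softmax policy over preferences `theta`, made-up rewards, and 200,000 samples.

```python
import numpy as np

def softmax(theta):
    e = np.exp(theta - theta.max())
    return e / e.sum()

def grad_log_pi(theta, a):
    """d log pi_theta(a) / d theta for a softmax over theta: onehot(a) - pi."""
    g = -softmax(theta)
    g[a] += 1.0
    return g

theta = np.array([0.1, -0.2, 0.3])
r = np.array([1.0, 0.0, 2.0])          # hypothetical reward r_a of each action
pi = softmax(theta)

# Direct gradient of J = sum_a pi(a) r_a, using dpi(a)/dtheta_b = pi(a) (1[a=b] - pi(b)).
exact = pi * (r - pi @ r)

# Same quantity written as an expectation: sum_a pi(a) * grad log pi(a) * r_a.
as_expectation = sum(pi[a] * grad_log_pi(theta, a) * r[a] for a in range(3))

# Monte Carlo version: average grad log pi(a) * r_a over sampled actions.
rng = np.random.default_rng(0)
actions = rng.choice(3, size=200_000, p=pi)
one_hot = np.eye(3)[actions]
estimate = ((one_hot - pi) * r[actions][:, None]).mean(axis=0)
```

`as_expectation` matches `exact` up to floating-point error, while `estimate` only approximates it, with an error that shrinks as the number of samples grows.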
Policy Gradient Theorem
Reinforcement learning, however, usually involves MDPs with many steps. If we replace the immediate reward $r_{sa}$ with the value function, the result above still holds:
$$
\frac{\partial J(\theta)}{\partial \theta} = \mathbb{E}_{\pi_{\theta}}\left[\frac{\partial \log \pi_{\theta}(a|s)}{\partial \theta} Q^{\pi_{\theta}}(s,a)\right]
$$
In this theorem the policy $\pi$ contains the parameters $\theta$, and $Q^{\pi_{\theta}}(s,a)$ depends on $\theta$ as well, so why is $Q^{\pi_{\theta}}(s,a)$ not differentiated too? The theorem says precisely that the equality above holds for any of the policy objective functions $J = J_{1}, J_{avR}, J_{avV}$. The derivation below shows this for the average-reward case.
We first define the baseline $J(\pi)$. Following the current policy $\pi$ while interacting with the environment, the average reward obtained is defined as $J_{avR}(\pi)$:
$$
J(\pi) = \lim_{n \rightarrow \infty} \frac{1}{n} \mathbb{E}\left[r_{1} + r_{2} + \cdots + r_{n} \mid \pi\right] = \sum_{s} d^{\pi}(s) \sum_{a} \pi(a | s) r(s, a)
$$
This is the expected average reward per time step. $d^{\pi}(s)$ is the probability of state $s$ being sampled under the current policy (the stationary state distribution under $\pi$).
The state-action value can be written as:
$$
Q^{\pi}(s, a) = \sum_{t=1}^{\infty} \mathbb{E}\left[r_{t} - J(\pi) \mid s_{0} = s, a_{0} = a, \pi\right]
$$
From this we can derive the following (the last step uses $\sum_{a} \pi(a|s) = 1$ and the fact that $r(s,a)$ does not depend on $\theta$):
$$
\begin{aligned}
\frac{\partial V^{\pi}(s)}{\partial \theta} & \stackrel{\text{def}}{=} \frac{\partial}{\partial \theta} \sum_{a} \pi(a | s) Q^{\pi}(s, a), \quad \forall s \\
&= \sum_{a}\left[\frac{\partial \pi(a | s)}{\partial \theta} Q^{\pi}(s, a) + \pi(a | s) \frac{\partial}{\partial \theta} Q^{\pi}(s, a)\right] \\
&= \sum_{a}\left[\frac{\partial \pi(a | s)}{\partial \theta} Q^{\pi}(s, a) + \pi(a | s) \frac{\partial}{\partial \theta}\left(r(s, a) - J(\pi) + \sum_{s^{\prime}} P_{s s^{\prime}}^{a} V^{\pi}(s^{\prime})\right)\right] \\
&= \sum_{a}\left[\frac{\partial \pi(a | s)}{\partial \theta} Q^{\pi}(s, a) + \pi(a | s)\left(-\frac{\partial J(\pi)}{\partial \theta} + \frac{\partial}{\partial \theta} \sum_{s^{\prime}} P_{s s^{\prime}}^{a} V^{\pi}(s^{\prime})\right)\right] \\
\Rightarrow \frac{\partial J(\pi)}{\partial \theta} &= \sum_{a}\left[\frac{\partial \pi(a | s)}{\partial \theta} Q^{\pi}(s, a) + \pi(a | s) \sum_{s^{\prime}} P_{s s^{\prime}}^{a} \frac{\partial V^{\pi}(s^{\prime})}{\partial \theta}\right] - \frac{\partial V^{\pi}(s)}{\partial \theta}
\end{aligned}
$$
That is, we obtain:
$$
\frac{\partial J(\pi)}{\partial \theta} = \sum_{a}\left[\frac{\partial \pi(a | s)}{\partial \theta} Q^{\pi}(s, a) + \pi(a | s) \sum_{s^{\prime}} P_{s s^{\prime}}^{a} \frac{\partial V^{\pi}(s^{\prime})}{\partial \theta}\right] - \frac{\partial V^{\pi}(s)}{\partial \theta}
$$
Next we apply a simple transformation. We can sum $\frac{\partial J(\pi)}{\partial \theta}$ over the states $s$, weighted by $d^{\pi}(s)$: $J(\pi)$ already sums over all $s$ and $a$, so $\frac{\partial J(\pi)}{\partial \theta}$ does not depend on any particular state, and $\sum_{s} d^{\pi}(s) = 1$. Hence $\frac{\partial J(\pi)}{\partial \theta} = \sum_{s} d^{\pi}(s) \frac{\partial J(\pi)}{\partial \theta}$, which gives:
$$
\sum_{s} d^{\pi}(s) \frac{\partial J(\pi)}{\partial \theta} = \sum_{s} d^{\pi}(s) \sum_{a} \frac{\partial \pi(a | s)}{\partial \theta} Q^{\pi}(s, a) + \sum_{s} d^{\pi}(s) \sum_{a} \pi(a | s) \sum_{s^{\prime}} P_{s s^{\prime}}^{a} \frac{\partial V^{\pi}(s^{\prime})}{\partial \theta} - \sum_{s} d^{\pi}(s) \frac{\partial V^{\pi}(s)}{\partial \theta}
$$
The trailing part, $\sum_{s} d^{\pi}(s) \sum_{a} \pi(a | s) \sum_{s^{\prime}} P_{s s^{\prime}}^{a} \frac{\partial V^{\pi}(s^{\prime})}{\partial \theta} - \sum_{s} d^{\pi}(s) \frac{\partial V^{\pi}(s)}{\partial \theta}$, is in fact equal to 0. The proof is as follows:
$$
\begin{aligned}
\sum_{s} d^{\pi}(s) \sum_{a} \pi(a | s) \sum_{s^{\prime}} P_{s s^{\prime}}^{a} \frac{\partial V^{\pi}(s^{\prime})}{\partial \theta}
&= \sum_{s} \sum_{a} \sum_{s^{\prime}} d^{\pi}(s) \pi(a | s) P_{s s^{\prime}}^{a} \frac{\partial V^{\pi}(s^{\prime})}{\partial \theta} \\
&= \sum_{s} \sum_{s^{\prime}} d^{\pi}(s)\left(\sum_{a} \pi(a | s) P_{s s^{\prime}}^{a}\right) \frac{\partial V^{\pi}(s^{\prime})}{\partial \theta} \\
&= \sum_{s} \sum_{s^{\prime}} d^{\pi}(s) P_{s s^{\prime}} \frac{\partial V^{\pi}(s^{\prime})}{\partial \theta} \\
&= \sum_{s^{\prime}}\left(\sum_{s} d^{\pi}(s) P_{s s^{\prime}}\right) \frac{\partial V^{\pi}(s^{\prime})}{\partial \theta} \\
&= \sum_{s^{\prime}} d^{\pi}(s^{\prime}) \frac{\partial V^{\pi}(s^{\prime})}{\partial \theta}
\end{aligned}
$$
Here $P_{s s^{\prime}} = \sum_{a} \pi(a|s) P_{s s^{\prime}}^{a}$ is the state transition probability under $\pi$, and the last equality uses the stationarity of $d^{\pi}$, that is, $\sum_{s} d^{\pi}(s) P_{s s^{\prime}} = d^{\pi}(s^{\prime})$.
Therefore:
$$
\begin{aligned}
&\Rightarrow \sum_{s} d^{\pi}(s) \frac{\partial J(\pi)}{\partial \theta} = \sum_{s} d^{\pi}(s) \sum_{a} \frac{\partial \pi(a | s)}{\partial \theta} Q^{\pi}(s, a) + \sum_{s^{\prime}} d^{\pi}(s^{\prime}) \frac{\partial V^{\pi}(s^{\prime})}{\partial \theta} - \sum_{s} d^{\pi}(s) \frac{\partial V^{\pi}(s)}{\partial \theta} \\
&\Rightarrow \frac{\partial J(\pi)}{\partial \theta} = \sum_{s} d^{\pi}(s) \sum_{a} \frac{\partial \pi(a | s)}{\partial \theta} Q^{\pi}(s, a)
\end{aligned}
$$
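The final formula can be checked numerically on a small example. The sketch below assumes a randomly generated MDP with 3 states and 2 actions, a tabular softmax policy, and least-squares solves for the stationary distribution and the differential value function; all of these are illustrative choices rather than part of the derivation.

```python
import numpy as np

def softmax_rows(theta):
    e = np.exp(theta - theta.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def average_reward_quantities(theta, P, r):
    """Return pi, stationary distribution d, average reward J and differential Q."""
    n = theta.shape[0]
    pi = softmax_rows(theta)
    P_pi = np.einsum("sa,sat->st", pi, P)     # P_{ss'} = sum_a pi(a|s) P^a_{ss'}
    r_pi = np.sum(pi * r, axis=1)             # expected one-step reward per state
    # Stationary distribution: d^T P_pi = d^T together with sum(d) = 1.
    A = np.vstack([P_pi.T - np.eye(n), np.ones(n)])
    b = np.concatenate([np.zeros(n), [1.0]])
    d = np.linalg.lstsq(A, b, rcond=None)[0]
    J = float(d @ r_pi)
    # Differential value: (I - P_pi) V = r_pi - J, determined only up to a constant.
    V = np.linalg.lstsq(np.eye(n) - P_pi, r_pi - J, rcond=None)[0]
    Q = r - J + P @ V                         # Q(s,a) = r(s,a) - J + sum_s' P^a_{ss'} V(s')
    return pi, d, J, Q

def policy_gradient(theta, P, r):
    """dJ/dtheta = sum_s d(s) sum_a dpi(a|s)/dtheta Q(s,a) for a tabular softmax."""
    pi, d, _, Q = average_reward_quantities(theta, P, r)
    return d[:, None] * pi * (Q - np.sum(pi * Q, axis=1, keepdims=True))

rng = np.random.default_rng(0)
n_states, n_actions = 3, 2
P = rng.random((n_states, n_actions, n_states))
P /= P.sum(axis=2, keepdims=True)             # each P[s, a] is a distribution over s'
r = rng.random((n_states, n_actions))
theta = rng.normal(size=(n_states, n_actions))

grad = policy_gradient(theta, P, r)
```

Note that `policy_gradient` never differentiates `Q`, `V`, or `d`, yet a finite-difference derivative of `J` with respect to `theta` should agree with it. The arbitrary constant in `V` is harmless because $\sum_{a} \frac{\partial \pi(a|s)}{\partial \theta} = 0$.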
Monte-Carlo Policy Gradient (REINFORCE)
How do we compute this $Q$ value function? The simplest way is Monte Carlo sampling, using the return $G_{t}$ as an unbiased sample of $Q^{\pi_{\theta}}(s,a)$:
$$
\Delta \theta_{t} = \alpha \frac{\partial \log \pi_{\theta}(a_{t}|s_{t})}{\partial \theta} G_{t}
$$
If another model is used to approximate $Q^{\pi_{\theta}}(s,a)$, the method is called the Actor-Critic algorithm.
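A minimal sketch of the REINFORCE update on one recorded episode is given below. It assumes a tabular softmax policy, an episode stored as `(state, action, reward received after the action)` tuples, and an optional discount `gamma` defaulting to 1; the function names and the toy episode are hypothetical.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def returns(rewards, gamma=1.0):
    """G_t = r_t + gamma * r_{t+1} + ..., computed backwards over one episode."""
    G, out = 0.0, []
    for r in reversed(rewards):
        G = r + gamma * G
        out.append(G)
    return out[::-1]

def reinforce_update(theta, episode, alpha=0.1, gamma=1.0):
    """Apply delta_theta_t = alpha * grad log pi(a_t|s_t) * G_t for every step."""
    states, actions, rewards = zip(*episode)
    theta = theta.copy()
    for s, a, G in zip(states, actions, returns(rewards, gamma)):
        grad_log = -softmax(theta[s])   # tabular softmax: onehot(a) - pi(.|s)
        grad_log[a] += 1.0
        theta[s] += alpha * G * grad_log
    return theta

theta = np.zeros((2, 2))                    # 2 states, 2 actions
episode = [(0, 1, 0.0), (1, 0, 1.0)]        # toy episode with a reward only at the end
new_theta = reinforce_update(theta, episode)
```

Both visited actions are followed by a return of 1, so both have their preferences raised, even though only the second one was immediately rewarded. This credit assignment through $G_{t}$ is also the source of the high variance noted earlier.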
Softmax Stochastic Policy
How do we compute $\frac{\partial \log \pi_{\theta}(a|s)}{\partial \theta}$?
As an example, we can build the policy with a softmax.
The softmax policy is a very commonly used stochastic policy:
$$
\pi_{\theta}(a | s) = \frac{e^{f_{\theta}(s, a)}}{\sum_{a^{\prime}} e^{f_{\theta}(s, a^{\prime})}}
$$
where $f_{\theta}(s,a)$ is the score function of a state-action pair, parametrized by $\theta$, which can be defined with domain knowledge.
The gradient of its log-likelihood:
$$
\begin{aligned}
\frac{\partial \log \pi_{\theta}(a | s)}{\partial \theta} &= \frac{\partial f_{\theta}(s, a)}{\partial \theta} - \frac{1}{\sum_{a^{\prime}} e^{f_{\theta}(s, a^{\prime})}} \sum_{a^{\prime \prime}} e^{f_{\theta}(s, a^{\prime \prime})} \frac{\partial f_{\theta}(s, a^{\prime \prime})}{\partial \theta} \\
&= \frac{\partial f_{\theta}(s, a)}{\partial \theta} - \mathbb{E}_{a^{\prime} \sim \pi_{\theta}(a^{\prime} | s)}\left[\frac{\partial f_{\theta}(s, a^{\prime})}{\partial \theta}\right]
\end{aligned}
$$
For example, we define the linear score function
$$
f_{\theta}(s, a) = \theta^{\top} x(s, a)
$$
$$
\begin{aligned}
\frac{\partial \log \pi_{\theta}(a | s)}{\partial \theta} &= \frac{\partial f_{\theta}(s, a)}{\partial \theta} - \mathbb{E}_{a^{\prime} \sim \pi_{\theta}(a^{\prime} | s)}\left[\frac{\partial f_{\theta}(s, a^{\prime})}{\partial \theta}\right] \\
&= x(s, a) - \mathbb{E}_{a^{\prime} \sim \pi_{\theta}(a^{\prime} | s)}\left[x(s, a^{\prime})\right]
\end{aligned}
$$
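For the linear score function, the gradient of the log-likelihood is just the feature vector of the chosen action minus the policy-weighted average feature vector. The sketch below computes it for one state, assuming a matrix `X` whose rows are the features $x(s, a^{\prime})$ of 4 actions with 3 features each; the sizes and random values are made up.

```python
import numpy as np

def log_pi(theta, X, a):
    """log pi_theta(a|s), where X[a'] = x(s, a') and f_theta(s, a') = theta @ x(s, a')."""
    scores = X @ theta
    m = scores.max()
    return scores[a] - (m + np.log(np.sum(np.exp(scores - m))))

def grad_log_pi(theta, X, a):
    """x(s, a) - E_{a' ~ pi_theta(.|s)}[x(s, a')]."""
    scores = X @ theta
    pi = np.exp(scores - scores.max())
    pi /= pi.sum()
    return X[a] - pi @ X

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 3))      # hypothetical features: 4 actions, 3 features
theta = rng.normal(size=3)

g = grad_log_pi(theta, X, a=2)
```

A finite-difference derivative of `log_pi` with respect to `theta` gives a simple way to validate `g`. The gradients of all actions, weighted by their probabilities, sum to zero, which follows from $\sum_{a} \pi_{\theta}(a|s) = 1$.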
APPENDIX
Policy gradient theorem: Start Value Setting
Start state value objective
$$
\begin{aligned}
J(\pi) &= \mathbb{E}\left[\sum_{t=1}^{\infty} \gamma^{t-1} r_{t} \mid s_{0}, \pi\right] \\
Q^{\pi}(s, a) &= \mathbb{E}\left[\sum_{k=1}^{\infty} \gamma^{k-1} r_{t+k} \mid s_{t} = s, a_{t} = a, \pi\right]
\end{aligned}
$$
$$
\begin{aligned}
\frac{\partial V^{\pi}(s)}{\partial \theta} & \stackrel{\text{def}}{=} \frac{\partial}{\partial \theta} \sum_{a} \pi(s, a) Q^{\pi}(s, a), \quad \forall s \\
&= \sum_{a}\left[\frac{\partial \pi(s, a)}{\partial \theta} Q^{\pi}(s, a) + \pi(s, a) \frac{\partial}{\partial \theta} Q^{\pi}(s, a)\right] \\
&= \sum_{a}\left[\frac{\partial \pi(s, a)}{\partial \theta} Q^{\pi}(s, a) + \pi(s, a) \frac{\partial}{\partial \theta}\left(r(s, a) + \sum_{s^{\prime}} \gamma P_{s s^{\prime}}^{a} V^{\pi}(s^{\prime})\right)\right] \\
&= \sum_{a} \frac{\partial \pi(s, a)}{\partial \theta} Q^{\pi}(s, a) + \sum_{a} \pi(s, a) \gamma \sum_{s^{\prime}} P_{s s^{\prime}}^{a} \frac{\partial V^{\pi}(s^{\prime})}{\partial \theta}
\end{aligned}
$$
$$
\frac{\partial V^{\pi}(s)}{\partial \theta} = \sum_{a} \frac{\partial \pi(s, a)}{\partial \theta} Q^{\pi}(s, a) + \sum_{a} \pi(s, a) \gamma \sum_{s_{1}} P_{s s_{1}}^{a} \frac{\partial V^{\pi}(s_{1})}{\partial \theta}
$$
$$
\begin{aligned}
\sum_{a} \frac{\partial \pi(s, a)}{\partial \theta} Q^{\pi}(s, a) &= \gamma^{0} \operatorname{Pr}(s \rightarrow s, 0, \pi) \sum_{a} \frac{\partial \pi(s, a)}{\partial \theta} Q^{\pi}(s, a) \\
\sum_{a} \pi(s, a) \gamma \sum_{s_{1}} P_{s s_{1}}^{a} \frac{\partial V^{\pi}(s_{1})}{\partial \theta} &= \sum_{s_{1}} \sum_{a} \pi(s, a) \gamma P_{s s_{1}}^{a} \frac{\partial V^{\pi}(s_{1})}{\partial \theta} \\
&= \sum_{s_{1}} \gamma P_{s s_{1}} \frac{\partial V^{\pi}(s_{1})}{\partial \theta} = \gamma^{1} \sum_{s_{1}} \operatorname{Pr}(s \rightarrow s_{1}, 1, \pi) \frac{\partial V^{\pi}(s_{1})}{\partial \theta}
\end{aligned}
$$
$$
\frac{\partial V^{\pi}(s_{1})}{\partial \theta} = \sum_{a} \frac{\partial \pi(s_{1}, a)}{\partial \theta} Q^{\pi}(s_{1}, a) + \gamma^{1} \sum_{s_{2}} \operatorname{Pr}(s_{1} \rightarrow s_{2}, 1, \pi) \frac{\partial V^{\pi}(s_{2})}{\partial \theta}
$$
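My WeChat official account: 深度学习与先进智能决策 (ID: MultiAgent1024). It mainly shares content on deep learning, machine game playing, reinforcement learning, and related topics. You are welcome to follow it and to learn and discuss together.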