Paper: Deterministic Policy Gradient Algorithms
What problem does it solve?

Stochastic policy methods contain inherent randomness, so they are sample-inefficient and suffer from high variance. A deterministic policy is more sample-efficient than a stochastic one, but it cannot explore the environment on its own, so it has to be trained off-policy.
Background
Previously, the actor output a distribution over actions, $\pi_{\theta}(a|s)$; the authors instead propose outputting a deterministic policy, $a = \mu_{\theta}(s)$.

In the stochastic case, the policy gradient integrates over both state and action spaces, whereas in the deterministic case it only integrates over the state space.
Stochastic Policy Gradient
Earlier work used off-policy stochastic policy methods with a behaviour policy $\beta(a|s) \neq \pi_{\theta}(a|s)$:
$$
\begin{aligned} J_{\beta}\left(\pi_{\theta}\right) &=\int_{\mathcal{S}} \rho^{\beta}(s) V^{\pi}(s)\, \mathrm{d} s \\ &=\int_{\mathcal{S}} \int_{\mathcal{A}} \rho^{\beta}(s)\, \pi_{\theta}(a | s)\, Q^{\pi}(s, a)\, \mathrm{d} a\, \mathrm{d} s \end{aligned}
$$
Differentiating the performance objective and applying an approximation gives the off-policy policy gradient (Degris et al., 2012b):
$$
\begin{aligned} \nabla_{\theta} J_{\beta}\left(\pi_{\theta}\right) & \approx \int_{\mathcal{S}} \int_{\mathcal{A}} \rho^{\beta}(s)\, \nabla_{\theta} \pi_{\theta}(a | s)\, Q^{\pi}(s, a)\, \mathrm{d} a\, \mathrm{d} s \\ &=\mathbb{E}_{s \sim \rho^{\beta}, a \sim \beta}\left[\frac{\pi_{\theta}(a | s)}{\beta_{\theta}(a | s)} \nabla_{\theta} \log \pi_{\theta}(a | s)\, Q^{\pi}(s, a)\right] \end{aligned}
$$
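As a sanity check, the importance-weighted estimator can be verified numerically on a toy problem. Everything below (the softmax logits, the Q-values, the uniform behaviour policy, a single state) is made up for illustration; the Monte Carlo estimate under the behaviour policy should match the exact on-policy gradient:

```python
import numpy as np

# Toy check of the importance-sampled off-policy policy gradient:
# softmax target policy over 3 discrete actions, uniform behaviour policy.

rng = np.random.default_rng(0)
n_actions = 3
theta = np.array([0.5, -0.2, 0.1])    # softmax logits, one per action
Q = np.array([1.0, 0.0, -1.0])        # stand-in for Q^pi(s, .)

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

p = softmax(theta)                           # pi_theta(a)
beta = np.full(n_actions, 1.0 / n_actions)   # uniform behaviour policy
# grad_theta log pi_theta(a) = e_a - p for the softmax parameterisation
G = np.eye(n_actions) - p

# Sample actions from beta; weight each term by the ratio pi/beta.
n_samples = 100_000
actions = rng.choice(n_actions, size=n_samples, p=beta)
counts = np.bincount(actions, minlength=n_actions)
grad_est = ((counts / n_samples) * (p / beta) * Q) @ G

# Exact gradient: sum_a pi(a) * grad log pi(a) * Q(a)
grad_exact = (p * Q) @ G
print(np.allclose(grad_est, grad_exact, atol=0.02))   # → True
```

The ratio $\pi/\beta$ exactly cancels the mismatch between the sampling distribution and the target policy, which is why the estimate agrees with the on-policy expectation.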
This approximation drops a term that depends on the action-value gradient $\nabla_{\theta}Q^{\pi}(s,a)$ (Degris et al., 2012b).

Update rule for the deterministic policy $\mu_{\theta}(s)$:
$$
\theta^{k+1}=\theta^{k}+\alpha\, \mathbb{E}_{s \sim \rho^{\mu^{k}}} \left[\nabla_{\theta} Q^{\mu^{k}}\left(s, \mu_{\theta}(s)\right)\right]
$$
Applying the chain rule:
$$
\theta^{k+1}=\theta^{k}+\alpha\, \mathbb{E}_{s \sim \rho^{\mu^{k}}} \left[\nabla_{\theta} \mu_{\theta}(s)\, \nabla_{a}Q^{\mu^{k}}\left(s, a\right) \big|_{a=\mu_{\theta}(s)} \right]
$$
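The chain-rule update can be illustrated on a hypothetical toy problem (everything here is made up): scalar states, a linear policy $\mu_\theta(s)=\theta s$, and a known critic $Q(s,a)=-(a-2s)^2$ whose maximiser is $a^*=2s$, so gradient ascent should drive $\theta$ toward 2:

```python
import numpy as np

def dQ_da(s, a):
    # grad_a Q(s, a) for the assumed critic Q(s, a) = -(a - 2s)^2
    return -2.0 * (a - 2.0 * s)

theta, alpha = 0.0, 0.05
rng = np.random.default_rng(0)
for _ in range(2000):
    s = rng.uniform(0.5, 1.5)   # sample a state
    a = theta * s               # deterministic action a = mu_theta(s)
    # chain rule: grad_theta mu_theta(s) = s, times grad_a Q at a = mu_theta(s)
    theta += alpha * s * dQ_da(s, a)

print(round(theta, 3))   # → 2.0
```

Note that only the gradient of $Q$ with respect to the action is needed, evaluated at the policy's own action; no sampling over actions occurs.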
What method is used?
On-Policy Deterministic Actor-Critic
If the environment itself contains plenty of noise to give the agent exploration, this algorithm is workable. The critic is updated with SARSA, using $Q^{w}(s,a)$ to approximate the true action-value function $Q^{\mu}$:
$$
\begin{aligned} \delta_{t} &=r_{t}+\gamma Q^{w}\left(s_{t+1}, a_{t+1}\right)-Q^{w}\left(s_{t}, a_{t}\right) \\ w_{t+1} &=w_{t}+\alpha_{w}\, \delta_{t}\, \nabla_{w} Q^{w}\left(s_{t}, a_{t}\right) \\ \theta_{t+1} &=\theta_{t}+\left.\alpha_{\theta}\, \nabla_{\theta} \mu_{\theta}\left(s_{t}\right)\, \nabla_{a} Q^{w}\left(s_{t}, a_{t}\right)\right|_{a=\mu_{\theta}(s)} \end{aligned}
$$
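A hypothetical single iteration of the three updates above, with linear approximators on a scalar problem: a critic $Q^w(s,a) = w \cdot \phi(s,a)$ with hand-picked features $\phi(s,a) = [s, a, sa]$ and an actor $\mu_\theta(s) = \theta s$. All numeric values are illustrative:

```python
import numpy as np

def phi(s, a):
    return np.array([s, a, s * a])

def Q(w, s, a):
    return w @ phi(s, a)

def dQ_da(w, s):
    # grad_a of w . [s, a, s*a] = w[1] + w[2]*s (independent of a here)
    return w[1] + w[2] * s

w = np.array([0.1, -0.3, 0.2])
theta, gamma = 0.5, 0.9
alpha_w, alpha_theta = 0.01, 0.01

s_t, a_t, r_t = 1.0, 0.4, 0.7    # executed action a_t includes environment noise
s_t1 = 1.2
a_t1 = theta * s_t1              # on-policy (SARSA) next action

delta = r_t + gamma * Q(w, s_t1, a_t1) - Q(w, s_t, a_t)   # TD error
w = w + alpha_w * delta * phi(s_t, a_t)                   # critic step
theta = theta + alpha_theta * s_t * dQ_da(w, s_t)         # actor step, grad_theta mu = s_t

print(round(delta, 4))   # → 0.7156
```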
Off-Policy Deterministic Actor-Critic
We modify the performance objective to be the value function of the target policy, averaged over the state distribution of the behaviour policy:
$$
\begin{aligned} J_{\beta}\left(\mu_{\theta}\right) &=\int_{\mathcal{S}} \rho^{\beta}(s)\, V^{\mu}(s)\, \mathrm{d} s \\ &=\int_{\mathcal{S}} \rho^{\beta}(s)\, Q^{\mu}\left(s, \mu_{\theta}(s)\right)\, \mathrm{d} s \end{aligned}
$$
$$
\begin{aligned} \nabla_{\theta} J_{\beta}\left(\mu_{\theta}\right) & \approx \int_{\mathcal{S}} \rho^{\beta}(s)\, \nabla_{\theta} \mu_{\theta}(s)\, \nabla_{a}Q^{\mu}(s,a)\big|_{a =\mu_{\theta}(s)}\, \mathrm{d} s \\ &=\mathbb{E}_{s \sim \rho^{\beta}} \left[\nabla_{\theta} \mu_{\theta}(s)\, \nabla_{a}Q^{\mu}(s,a)\big|_{a =\mu_{\theta}(s)}\right] \end{aligned}
$$
This yields the off-policy deterministic actor-critic (OPDAC) algorithm:
$$
\begin{aligned} \delta_{t} &=r_{t}+\gamma Q^{w}\left(s_{t+1}, \mu_{\theta}\left(s_{t+1}\right)\right)-Q^{w}\left(s_{t}, a_{t}\right) \\ w_{t+1} &=w_{t}+\alpha_{w}\, \delta_{t}\, \nabla_{w} Q^{w}\left(s_{t}, a_{t}\right) \\ \theta_{t+1} &=\theta_{t}+\left.\alpha_{\theta}\, \nabla_{\theta} \mu_{\theta}\left(s_{t}\right)\, \nabla_{a} Q^{w}\left(s_{t}, a_{t}\right)\right|_{a=\mu_{\theta}(s)} \end{aligned}
$$
Unlike the stochastic off-policy algorithm, the policy here is deterministic, so importance sampling is not needed.
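To make the difference concrete, here is a hypothetical end-to-end sketch of OPDAC on a made-up continuing task: states $s \sim U(-1,1)$ drawn i.i.d., reward $r(s,a) = -(a-2s)^2$, so the best deterministic policy is $\mu(s)=2s$. The behaviour policy adds Gaussian exploration noise, while the critic target uses the target policy's action $\mu_\theta(s_{t+1})$, with no importance weights anywhere:

```python
import numpy as np

rng = np.random.default_rng(0)

def phi(s, a):
    # features chosen so the true Q^mu is exactly representable here
    return np.array([1.0, s * s, s * a, a * a])

w = np.zeros(4)
theta, gamma = 0.0, 0.9
alpha_w, alpha_theta = 0.02, 0.01

s = rng.uniform(-1, 1)
for _ in range(30000):
    a = theta * s + rng.normal(0.0, 0.3)    # behaviour action (exploration noise)
    r = -(a - 2.0 * s) ** 2
    s2 = rng.uniform(-1, 1)                 # next state (i.i.d. for simplicity)
    a2 = theta * s2                         # target policy's action, not the behaviour's
    delta = r + gamma * (w @ phi(s2, a2)) - w @ phi(s, a)   # Q-learning-style TD error
    w = w + alpha_w * delta * phi(s, a)
    a_mu = theta * s
    dQ_da = w[2] * s + 2.0 * w[3] * a_mu    # grad_a Q^w at a = mu_theta(s)
    theta = theta + alpha_theta * s * dQ_da # grad_theta mu_theta(s) = s
    s = s2

print(theta)   # drifts toward the optimal slope 2.0
```

The only place the behaviour action $a_t$ appears is in the TD error and the critic features; the bootstrap target and the actor update both use the deterministic target policy.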
What results were achieved?
Publication and author information?

This paper was published at ICML 2014. The first author, David Silver, is a research scientist at Google DeepMind. He completed his undergraduate and graduate studies at the University of Cambridge and his PhD at the University of Alberta, joined DeepMind in 2013, and is one of the founders and the project lead of AlphaGo.
Reference links

Reference: Degris, T., White, M., and Sutton, R. S. (2012b). Linear off-policy actor-critic. In 29th International Conference on Machine Learning.
Further reading
Suppose the true action-value function is $Q^{\pi}(s,a)$, approximated by a function $Q^{w}(s,a) \approx Q^{\pi}(s,a)$. If the function approximator is compatible, such that

1. $Q^{w}(s, a)=\nabla_{\theta} \log \pi_{\theta}(a | s)^{\top} w$ (linear in these "features"), and
2. the parameters $w$ are chosen to minimise the mean-squared error $\varepsilon^{2}(w) = \mathbb{E}_{s \sim \rho^{\pi},a \sim \pi_{\theta}}\left[(Q^{w}(s,a)-Q^{\pi}(s,a))^{2}\right]$ (a linear regression problem in these features),

then there is no bias (Sutton et al., 1999):
$$
\nabla_{\theta} J\left(\pi_{\theta}\right)=\mathbb{E}_{s \sim \rho^{\pi}, a \sim \pi_{\theta}}\left[\nabla_{\theta} \log \pi_{\theta}(a | s)\, Q^{w}(s, a)\right]
$$
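The no-bias claim can be checked numerically on a toy softmax policy over three actions in a single state (all values made up). With features $\nabla_\theta \log \pi_\theta(a|s)$ and $w$ fit by $\pi$-weighted least squares, the policy gradient computed from $Q^w$ matches the one computed from the true $Q$:

```python
import numpy as np

theta = np.array([0.3, -0.1, 0.4])   # softmax logits
Q_true = np.array([1.0, 0.5, -0.2])  # stand-in for Q^pi(s, .)

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

p = softmax(theta)
# compatible features: grad_theta log pi_theta(a) = e_a - p
X = np.eye(3) - p

# Minimise E_{a~pi}[(X[a].w - Q_true[a])^2]: pi-weighted least-squares
# normal equations A w = b (A is singular, so take the min-norm solution).
A = X.T @ (p[:, None] * X)
b = X.T @ (p * Q_true)
w = np.linalg.lstsq(A, b, rcond=None)[0]

grad_from_Qw = A @ w       # E[grad log pi * Q^w]
grad_from_Qtrue = b        # E[grad log pi * Q^pi]
print(np.allclose(grad_from_Qw, grad_from_Qtrue))   # → True: no bias
```

The key point is that minimising the $\pi$-weighted squared error makes the approximation residual orthogonal to the score features, so substituting $Q^w$ for $Q^\pi$ leaves the gradient expectation unchanged.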
Finally, the paper presents a compatible linear function approximation theorem for DPG, together with some theoretical foundations.
Reference: Sutton, R. S., McAllester, D. A., Singh, S. P., and Mansour, Y. (1999). Policy gradient methods for reinforcement learning with function approximation. In Neural Information Processing Systems 12, pages 1057–1063.
I should read this paper again when I have time; some of the proofs need careful working through.