RL note

Problem setup

  • The reinforcement learning four-tuple is $E=\langle X,A,P,R\rangle$, where $x\in X$ is a state, $a\in A$ is an action, $P: X\times A\times X\rightarrow \Bbb R$ is the state-transition probability, and $R: X\times A\times X\rightarrow \Bbb R$ is the reward.
  • $\pi$ is a policy; $\pi(x,a)$ is the probability that policy $\pi$ chooses action $a$ in state $x$, with $\sum_a\pi(x,a)=1$.
  • The reinforcement learning task is to learn a policy $\pi$, from which the action to execute is $a=\pi(x)$. The learning objective is to maximize cumulative reward. Two common choices are the $T$-step cumulative reward $\Bbb E(\frac{1}{T}\sum_{t=1}^{T}r_t)$ and the $\gamma$-discounted cumulative reward $\Bbb E(\sum_{t=0}^{\infty}\gamma^t r_{t+1})$, where $r_t$ is the reward obtained at step $t$.

Multi-armed bandit

$\epsilon$-greedy algorithm
Input: number of arms $K$, reward function $R$, number of trials $T$, exploration probability $\epsilon$
Process:
$r=0,\ Q(i)=0,\ cnt(i)=0$
for $t=1,2,\cdots,T$:
  if $rand()<\epsilon$:
    $k=randint(1,K)$  # exploration only
  else:
    $k=\arg\max_i Q(i)$  # exploitation only
  $v=R(k)$
  $r=r+v$
  $Q(k)=\frac{Q(k)\cdot cnt(k)+v}{cnt(k)+1}$
  $cnt(k)=cnt(k)+1$
Output: $r$
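A minimal Python sketch of this loop, assuming the reward is exposed as a callable `R(k)` that returns a stochastic payoff for arm `k` (the function name and interface are illustrative):

```python
import numpy as np

def epsilon_greedy_bandit(R, K, T, eps, rng=None):
    """Epsilon-greedy K-armed bandit; R(k) is assumed to return a stochastic reward."""
    rng = rng or np.random.default_rng()
    r, Q, cnt = 0.0, np.zeros(K), np.zeros(K)
    for _ in range(T):
        if rng.random() < eps:
            k = int(rng.integers(K))       # explore: pick a random arm
        else:
            k = int(np.argmax(Q))          # exploit: pick the best arm so far
        v = R(k)
        r += v
        Q[k] = (Q[k] * cnt[k] + v) / (cnt[k] + 1)   # incremental average reward
        cnt[k] += 1
    return r

# e.g. Bernoulli arms: epsilon_greedy_bandit(lambda k: float(np.random.rand() < [0.2, 0.5, 0.8][k]), 3, 1000, 0.1)
```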

As time goes on the policy improves, so the exploration probability $\epsilon$ can be decreased over time, e.g. $\epsilon=\frac{1}{\sqrt{t}}$. Alternatively, arms can be sampled directly according to their $Q(k)$ values, which is the softmax algorithm; arm $k$ is drawn with probability:
$$P(k)=\frac{e^{Q(k)/\tau}}{\sum_{i=1}^K e^{Q(i)/\tau}}$$

Softmax algorithm
Input: number of arms $K$, reward function $R$, number of trials $T$, temperature parameter $\tau$
Process:
$r=0,\ Q(i)=0,\ cnt(i)=0$
for $t=1,2,\cdots,T$:
  sample $k$ according to $P(k)$
  $v=R(k)$
  $r=r+v$
  $Q(k)=\frac{Q(k)\cdot cnt(k)+v}{cnt(k)+1}$
  $cnt(k)=cnt(k)+1$
Output: $r$
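A corresponding sketch of the softmax (Boltzmann exploration) bandit under the same assumed `R(k)` interface; the max subtraction inside the softmax is only for numerical stability:

```python
import numpy as np

def softmax_bandit(R, K, T, tau, rng=None):
    """Softmax bandit: arms are sampled with probability proportional to exp(Q/tau)."""
    rng = rng or np.random.default_rng()
    r, Q, cnt = 0.0, np.zeros(K), np.zeros(K)
    for _ in range(T):
        logits = Q / tau
        p = np.exp(logits - logits.max())
        p /= p.sum()                       # P(k) = exp(Q(k)/tau) / sum_i exp(Q(i)/tau)
        k = int(rng.choice(K, p=p))
        v = R(k)
        r += v
        Q[k] = (Q[k] * cnt[k] + v) / (cnt[k] + 1)
        cnt[k] += 1
    return r
```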

Model-based learning

When the four-tuple $E=\langle X,A,P,R\rangle$ is known, we are in the model-based setting. State-value functions:
$$\left\{\begin{array}{l} V_T^\pi(x)=\Bbb E_\pi[\frac{1}{T}\sum_{t=1}^T r_t\mid x_0=x]\\ V_\gamma^\pi(x)=\Bbb E_\pi[\sum_{t=0}^\infty \gamma^t r_{t+1}\mid x_0=x] \end{array}\right.$$
State-action value functions:
$$\left\{\begin{array}{l} Q_T^\pi(x,a)=\Bbb E_\pi[\frac{1}{T}\sum_{t=1}^T r_t\mid x_0=x,a_0=a]\\ Q_\gamma^\pi(x,a)=\Bbb E_\pi[\sum_{t=0}^\infty \gamma^t r_{t+1}\mid x_0=x,a_0=a] \end{array}\right.$$
Bellman equations (taking the $\gamma$-discounted case as an example):
$$\left\{\begin{array}{l} Q(x,a)=\sum_{x'}P_{x\to x'}^a\left(R_{x\to x'}^a+\gamma V(x')\right)\\ V(x)=\sum_a\pi(x,a)Q(x,a) \end{array}\right.$$

Policy evaluation algorithm
Input: $E=\langle X,A,P,R\rangle$; the policy $\pi$ to be evaluated
Process:
$\forall x,\ V(x)=0$
for $t=1,2,\cdots$:
  $\forall x,\ V'(x)=\sum_a\pi(x,a)\sum_{x'}P_{x\to x'}^a(R_{x\to x'}^a+\gamma V(x'))$
  if $\max_x|V'(x)-V(x)|<thr$:
    break
  $V=V'$
Output: state-value function $V$
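A minimal sketch of this loop, assuming the known model is stored as arrays `P[x, a, x']` and `R[x, a, x']` and the policy as `pi[x, a]` (these array layouts are an assumption, not part of the original note):

```python
import numpy as np

def policy_evaluation(P, R, pi, gamma, thr=1e-6):
    """Iterative policy evaluation for a known model P[x,a,x'], R[x,a,x'] and policy pi[x,a]."""
    V = np.zeros(P.shape[0])
    while True:
        # V'(x) = sum_a pi(x,a) sum_x' P(x,a,x') * (R(x,a,x') + gamma * V(x'))
        V_new = np.einsum('xa,xay,xay->x', pi, P, R + gamma * V[None, None, :])
        if np.max(np.abs(V_new - V)) < thr:
            return V_new
        V = V_new
```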

Optimal Bellman equations ($\gamma$-discounted case):
$$\left\{\begin{array}{l} V(x)=\max_a Q(x,a)\\ Q(x,a)=\sum_{x'}P_{x\to x'}^a\left(R_{x\to x'}^a+\gamma \max_{a'}Q(x',a')\right) \end{array}\right.$$
Policy improvement: $\pi'(x)=\arg\max_a Q(x,a)$

Policy iteration algorithm
Input: $E=\langle X,A,P,R\rangle$
Process:
$\forall x,\ V(x)=0,\ \pi(x,a)=1/|A|$
while True:
  for $t=1,2,\cdots$:
    $\forall x,\ V'(x)=\sum_a\pi(x,a)\sum_{x'}P_{x\to x'}^a(R_{x\to x'}^a+\gamma V(x'))$
    if $\max_x|V'(x)-V(x)|<thr$:
      break
    $V=V'$
  $\forall x,\ \pi'(x)=\arg\max_a Q(x,a)$  # compute $Q$ from $V$ via the Bellman equation
  if $\pi'(x)=\pi(x),\ \forall x$:
    break
  $\pi=\pi'$
Output: optimal policy $\pi$
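A sketch of policy iteration under the same assumed array layout, reusing the evaluation loop above:

```python
import numpy as np

def policy_iteration(P, R, gamma, thr=1e-6):
    """Alternate full policy evaluation with greedy policy improvement."""
    n_states, n_actions = P.shape[0], P.shape[1]
    pi = np.full((n_states, n_actions), 1.0 / n_actions)
    V = np.zeros(n_states)
    while True:
        # policy evaluation
        while True:
            V_new = np.einsum('xa,xay,xay->x', pi, P, R + gamma * V[None, None, :])
            delta = np.max(np.abs(V_new - V))
            V = V_new
            if delta < thr:
                break
        # policy improvement: pi'(x) = argmax_a Q(x, a)
        Q = np.einsum('xay,xay->xa', P, R + gamma * V[None, None, :])
        pi_new = np.eye(n_actions)[np.argmax(Q, axis=1)]   # deterministic greedy policy
        if np.array_equal(pi_new, pi):
            return pi, V
        pi = pi_new
```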

In policy iteration, each policy improvement requires a full policy evaluation, so the policy is updated slowly. The policy update can instead be folded into the value-function iteration:

Value iteration algorithm
Input: $E=\langle X,A,P,R\rangle$
Process:
$\forall x,\ V(x)=0$
for $t=1,2,\cdots$:
  $\forall x,\ V'(x)=\max_a\sum_{x'}P_{x\to x'}^a(R_{x\to x'}^a+\gamma V(x'))$
  if $\max_x|V'(x)-V(x)|<thr$:
    break
  $V=V'$
Output: optimal policy $\pi(x)=\arg\max_a Q(x,a)$, with $Q$ computed from the final $V$
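A sketch of value iteration with the same assumed model arrays; the greedy policy is read off the final $Q$:

```python
import numpy as np

def value_iteration(P, R, gamma, thr=1e-6):
    """Value iteration: apply the optimal Bellman backup until V converges."""
    V = np.zeros(P.shape[0])
    while True:
        Q = np.einsum('xay,xay->xa', P, R + gamma * V[None, None, :])   # Q(x, a)
        V_new = Q.max(axis=1)                                           # V'(x) = max_a Q(x, a)
        if np.max(np.abs(V_new - V)) < thr:
            return np.argmax(Q, axis=1), V_new    # greedy policy and optimal values
        V = V_new
```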

Model-free learning

In practice $P$ and $R$ are hard to know, and even the number of states may be unknown; learning algorithms that do not rely on a model of the environment are called model-free. With the model unknown, we start from the initial state and sample with some policy, obtaining a trajectory $\langle x_0,a_0,r_1,x_1,a_1,r_2,\cdots,x_{T-1},a_{T-1},r_T,x_T\rangle$.

Monte Carlo reinforcement learning

On-policy Monte Carlo algorithm
Input: $A,x_0,T$
Process:
$Q(x,a)=0,\ cnt(x,a)=0,\ \pi(x,a)=\frac{1}{|A|}$
for $s=1,2,\cdots$:
  execute policy $\pi^\epsilon$ to obtain a trajectory $\langle x_0,a_0,r_1,x_1,a_1,r_2,\cdots,x_{T-1},a_{T-1},r_T,x_T\rangle$
  for $t=0,\cdots,T-1$:
    $R=\frac{1}{T-t}\sum_{i=t+1}^{T}r_i$
    $Q(x_t,a_t)=\frac{Q(x_t,a_t)\cdot cnt(x_t,a_t)+R}{cnt(x_t,a_t)+1}$
    $cnt(x_t,a_t)=cnt(x_t,a_t)+1$
  $\pi^\epsilon(x)=\left\{\begin{array}{ll} \arg\max_a Q(x,a) & \text{with probability }1-\epsilon\\ \text{a uniformly random action in }A & \text{with probability }\epsilon \end{array}\right.$
Output: policy $\pi^\epsilon$
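A sketch of on-policy Monte Carlo control; `env.run_episode(policy)` is an assumed helper that rolls out one episode with the given policy and returns a list of `(x, a, r)` transitions:

```python
import random
from collections import defaultdict

def on_policy_mc(env, actions, episodes, eps=0.1):
    """Every-visit on-policy Monte Carlo control with an epsilon-greedy policy."""
    Q, cnt = defaultdict(float), defaultdict(int)

    def pi_eps(x):
        if random.random() < eps:
            return random.choice(actions)                       # explore
        return max(actions, key=lambda a: Q[(x, a)])            # greedy w.r.t. current Q

    for _ in range(episodes):
        traj = env.run_episode(pi_eps)            # [(x0, a0, r1), (x1, a1, r2), ...]
        rewards = [r for (_, _, r) in traj]
        T = len(traj)
        for t, (x, a, _) in enumerate(traj):
            R = sum(rewards[t:]) / (T - t)        # average reward from step t onwards
            Q[(x, a)] = (Q[(x, a)] * cnt[(x, a)] + R) / (cnt[(x, a)] + 1)
            cnt[(x, a)] += 1
    return pi_eps
```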

On-policy Monte Carlo produces an $\epsilon$-greedy policy. We would prefer to use the $\epsilon$-greedy policy only for evaluation (sampling), while improving the original policy in the improvement step.
The expectation of a function $f$ under a distribution $p$:
$$\Bbb E(f)=\int_x f(x)p(x)dx=\int_x f(x)\frac{p(x)}{q(x)}q(x)dx$$

Sampling from $p$ gives $(x_1,x_2,\cdots,x_m)$, from which the expectation of $f$ under $p$ can be estimated as:
$$\hat{\Bbb E}(f)=\frac{1}{m}\sum_{i=1}^m f(x_i)$$

Sampling from $q$ gives $(x_1',x_2',\cdots,x_m')$, from which the expectation of $f$ under $p$ can also be estimated (importance sampling):
$$\hat{\Bbb E}(f)=\frac{1}{m}\sum_{i=1}^m f(x_i')\frac{p(x_i')}{q(x_i')}$$
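A tiny numeric check of the importance-sampling estimator: it estimates $\Bbb E_p[x^2]=1$ for $p=\mathcal N(0,1)$ while drawing samples from $q=\mathcal N(0,2^2)$ (the particular $f$, $p$, $q$ here are illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)
f = lambda x: x ** 2
p = lambda x: np.exp(-x**2 / 2) / np.sqrt(2 * np.pi)      # N(0, 1) density
q = lambda x: np.exp(-x**2 / 8) / np.sqrt(8 * np.pi)      # N(0, 4) density

xs = rng.normal(0.0, 2.0, size=100_000)                   # samples drawn from q
estimate = np.mean(f(xs) * p(xs) / q(xs))                 # importance-weighted mean
print(estimate)                                           # close to E_p[x^2] = 1
```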

In the same way, we can sample with $\pi^\epsilon$ and estimate the expected return $Q$ under $\pi$:
$$Q(x,a)=\frac{1}{m}\sum_{i=1}^m R_i\frac{P_i^\pi}{P_i^{\pi^\epsilon}}$$

$P^\pi=\prod_{i=0}^{T-1}\pi(x_i,a_i)P_{x_i\to x_{i+1}}^{a_i}$ and $P^{\pi^\epsilon}=\prod_{i=0}^{T-1}\pi^\epsilon(x_i,a_i)P_{x_i\to x_{i+1}}^{a_i}$, so:
$$\frac{P^\pi}{P^{\pi^\epsilon}}=\prod_{i=0}^{T-1}\frac{\pi(x_i,a_i)}{\pi^\epsilon(x_i,a_i)}$$

where $\pi(x_i,a_i)=\Bbb I(a_i=\pi(x_i))$ and $\pi^\epsilon(x_i,a_i)=\left\{\begin{array}{ll} 1-\epsilon+\frac{\epsilon}{|A|} & a_i=\pi(x_i)\\ \frac{\epsilon}{|A|} & a_i\ne\pi(x_i) \end{array}\right.$. Because of the indicator, the product is zero whenever any $a_i\ne\pi(x_i)$, so the off-policy Monte Carlo algorithm below is for reference only and cannot be computed literally this way.

Off-policy Monte Carlo algorithm
Input: $A,x_0,T$
Process:
$Q(x,a)=0,\ cnt(x,a)=0,\ \pi(x,a)=\frac{1}{|A|}$
for $s=1,2,\cdots$:
  execute policy $\pi^\epsilon$ to obtain a trajectory $\langle x_0,a_0,r_1,x_1,a_1,r_2,\cdots,x_{T-1},a_{T-1},r_T,x_T\rangle$
  for $t=0,\cdots,T-1$:
    $R=\left(\frac{1}{T-t}\sum_{i=t+1}^{T}r_i\right)\left(\prod_{i=t}^{T-1}\frac{\pi(x_i,a_i)}{\pi^\epsilon(x_i,a_i)}\right)$
    $Q(x_t,a_t)=\frac{Q(x_t,a_t)\cdot cnt(x_t,a_t)+R}{cnt(x_t,a_t)+1}$
    $cnt(x_t,a_t)=cnt(x_t,a_t)+1$
  $\pi(x)=\arg\max_a Q(x,a)$
Output: policy $\pi$

Temporal-difference learning

Monte Carlo methods do not exploit the MDP structure and are therefore inefficient. Temporal-difference (TD) learning combines ideas from dynamic programming and Monte Carlo and is more efficient. The Monte Carlo update of $Q$ can be rewritten as:
$$Q(x,a)=Q(x,a)+\frac{1}{c+1}(R-Q(x,a))=Q(x,a)+\alpha_c(R-Q(x,a))$$

Fixing $\alpha_c=\alpha$ and sampling a transition $\langle x,a,r,x',a'\rangle$ gives:
$$Q(x,a)=Q(x,a)+\alpha\left(r+\gamma Q(x',a')-Q(x,a)\right)$$

Sarsa (on-policy) algorithm
Input: $A,x_0,\gamma,\alpha$
Process:
$Q(x,a)=0,\ \pi(x,a)=\frac{1}{|A|},\ x=x_0,\ a=\pi(x)$
for $t=1,2,\cdots$:
  execute $a$, obtaining $r,x'$
  $a'=\pi^\epsilon(x')$
  $Q(x,a)=Q(x,a)+\alpha(r+\gamma Q(x',a')-Q(x,a))$
  $\pi(x)=\arg\max_{a''}Q(x,a'')$
  $x=x',\ a=a'$
Output: policy $\pi$
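A sketch of Sarsa over an assumed tabular environment interface `env.reset() -> x` and `env.step(x, a) -> (r, x_next, done)`:

```python
import random
from collections import defaultdict

def sarsa(env, actions, episodes, gamma=0.99, alpha=0.1, eps=0.1):
    """Sarsa: the next action a' comes from the same epsilon-greedy policy (on-policy)."""
    Q = defaultdict(float)

    def pi_eps(x):
        if random.random() < eps:
            return random.choice(actions)
        return max(actions, key=lambda a: Q[(x, a)])

    for _ in range(episodes):
        x, done = env.reset(), False
        a = pi_eps(x)
        while not done:
            r, x2, done = env.step(x, a)
            a2 = pi_eps(x2)
            Q[(x, a)] += alpha * (r + gamma * Q[(x2, a2)] - Q[(x, a)])   # TD update
            x, a = x2, a2
    return {x: max(actions, key=lambda a: Q[(x, a)]) for (x, _) in list(Q)}   # greedy policy
```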

Q-Learning (off-policy) algorithm
Input: $A,x_0,\gamma,\alpha$
Process:
$Q(x,a)=0,\ \pi(x,a)=\frac{1}{|A|},\ x=x_0,\ a=\pi(x)$
for $t=1,2,\cdots$:
  execute $a$, obtaining $r,x'$
  $a'=\pi(x')$
  $Q(x,a)=Q(x,a)+\alpha(r+\gamma Q(x',a')-Q(x,a))$
  $\pi(x)=\arg\max_{a''}Q(x,a'')$
  $x=x',\ a=\pi^\epsilon(x')$
Output: policy $\pi$
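The same sketch with the Q-Learning change: behaviour stays $\epsilon$-greedy while the update bootstraps on the greedy target $\max_a Q(x',a)$ (same hypothetical `env` interface as above):

```python
import random
from collections import defaultdict

def q_learning(env, actions, episodes, gamma=0.99, alpha=0.1, eps=0.1):
    """Q-Learning: behave epsilon-greedily, but update towards the greedy target (off-policy)."""
    Q = defaultdict(float)
    for _ in range(episodes):
        x, done = env.reset(), False
        while not done:
            if random.random() < eps:
                a = random.choice(actions)                     # behaviour policy: explore
            else:
                a = max(actions, key=lambda b: Q[(x, b)])      # behaviour policy: exploit
            r, x2, done = env.step(x, a)
            target = r + gamma * max(Q[(x2, b)] for b in actions)   # target policy is greedy
            Q[(x, a)] += alpha * (target - Q[(x, a)])
            x = x2
    return {x: max(actions, key=lambda b: Q[(x, b)]) for (x, _) in list(Q)}
```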

Policy Gradient

The algorithms below are deep RL (see the PPT). An actor network $a=\pi_\theta(x)$ outputs $a$ as a probability-distribution vector over actions. After interacting with the environment for one trajectory, we obtain the training data:
$$\{\{x_t,a_t\},A_t\mid t=0,\cdots,T-1\}$$

where $A_t=\sum_{i=t}^{T-1}\gamma^{i-t}r_{i+1}-b$, i.e. the (baselined) cumulative reward is used as the weight of the sample. Treating $a_t$ as a one-hot vector and taking the cross-entropy $e_t=CE(\pi_\theta(x_t),a_t)$, the loss is:
$$L=\sum_{t=0}^{T-1}A_t e_t$$

Taking the gradient:
$$\nabla_\theta L=-\sum_{t=0}^{T-1}A_t \nabla_\theta\ln\pi_\theta(x_t,a_t)$$

Policy Gradient (on-policy) algorithm
Process:
initialize $\theta=\theta_0$
for $i=1,2,\cdots,N$:
  training data: interact with the environment using $\pi_\theta$ to obtain $\{\{x_t,a_t\},A_t\mid t=0,\cdots,T-1\}$
  compute the loss $L=\sum_{t=0}^{T-1}A_t e_t$
  update the parameters: $\theta=\theta-\eta\nabla_\theta L=\theta+\eta\sum_{t=0}^{T-1}A_t \nabla_\theta\ln\pi_\theta(x_t,a_t)$
Output: network parameters $\theta$
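A sketch of one such update in PyTorch (an assumed framework choice); `policy_net(states)` is assumed to return action logits, and the tensors come from a single rollout:

```python
import torch

def policy_gradient_update(policy_net, optimizer, states, actions, advantages):
    """One REINFORCE-style update: minimize L = -sum_t A_t * ln pi_theta(x_t, a_t)."""
    logits = policy_net(states)                                     # [T, |A|]
    log_probs = torch.log_softmax(logits, dim=-1)
    logp_a = log_probs.gather(1, actions.unsqueeze(1)).squeeze(1)   # ln pi_theta(x_t, a_t)
    loss = -(advantages * logp_a).sum()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```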

Proximal Policy Optimization

PPO = the off-policy form of Policy Gradient + a constraint on the parameter update

off-policy PG

Write $p_\theta(a|x)=\pi_\theta(x,a)$; let $\theta$ be the parameters being updated and $\theta'$ the parameters used for sampling. Then:
$$\begin{aligned} -\nabla_\theta L&=\Bbb E_{x,a\sim\pi_\theta}[A(x,a)\nabla_\theta \ln\pi_\theta(x,a)]\\ &=\Bbb E_{x,a\sim\pi_{\theta'}}\left[\frac{p_\theta(x,a)}{p_{\theta'}(x,a)}A(x,a)\nabla_\theta \ln\pi_\theta(x,a)\right]\\ &=\Bbb E_{x,a\sim\pi_{\theta'}}\left[\frac{p_\theta(a|x)}{p_{\theta'}(a|x)}A(x,a)\nabla_\theta \ln p_\theta(a|x)\right]\\ &=\Bbb E_{x,a\sim\pi_{\theta'}}\left[\frac{\nabla_\theta p_\theta(a|x)}{p_{\theta'}(a|x)}A(x,a)\right]\\ &=\nabla_\theta J^{\theta'}(\theta) \end{aligned}$$

This gives the surrogate objective to optimize (the step replacing $p_\theta(x,a)$ with $p_\theta(a|x)$ assumes the state distributions under $\theta$ and $\theta'$ are approximately equal and cancel):
$$J^{\theta'}(\theta)=\Bbb E_{x,a\sim\pi_{\theta'}}\left[\frac{p_\theta(a|x)}{p_{\theta'}(a|x)}A(x,a)\right]$$

constraint

During optimization we do not want $\theta$ to drift too far from $\theta'$, so a constraint on the parameters can be added:
$$J_{PPO}^{\theta'}(\theta)=J^{\theta'}(\theta)-\beta\, KL(\theta,\theta')$$

Another way to limit the update is clipping:
$$J_{PPO2}^{\theta'}(\theta)=\Bbb E_{x,a\sim\pi_{\theta'}}\left[\min\left(\frac{p_\theta(a|x)}{p_{\theta'}(a|x)}A(x,a),\ \mathrm{clip}\left(\frac{p_\theta(a|x)}{p_{\theta'}(a|x)},1-\epsilon,1+\epsilon\right)A(x,a)\right)\right]$$

PPO algorithm

PPO (off-policy) algorithm
Process:
initialize $\theta=\theta_0,\ \theta'=\theta$
for $i=1,2,\cdots,N$:
  training data: interact with the environment using $\pi_{\theta'}$ to obtain $\{\{x_t,a_t\},A_t\mid t=0,\cdots,T-1\}$
  update the parameters: $\theta=\arg\max_\theta J_{PPO}^{\theta'}(\theta)$
  $\theta'=\theta$
Output: network parameters $\theta$

A code implementation can be found here.
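As a complement, here is a minimal sketch of the PPO-clip update on one batch, assuming `old_log_probs` (the $\ln p_{\theta'}(a|x)$ values stored at sampling time) and a `policy_net` that returns action logits:

```python
import torch

def ppo_clip_update(policy_net, optimizer, states, actions, advantages,
                    old_log_probs, clip_eps=0.2, epochs=4):
    """PPO-clip: reuse a batch collected with theta' for several gradient steps on theta."""
    for _ in range(epochs):
        logits = policy_net(states)
        log_probs = torch.log_softmax(logits, dim=-1)
        logp_a = log_probs.gather(1, actions.unsqueeze(1)).squeeze(1)
        ratio = torch.exp(logp_a - old_log_probs)                  # p_theta(a|x) / p_theta'(a|x)
        unclipped = ratio * advantages
        clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantages
        loss = -torch.min(unclipped, clipped).mean()               # maximize J_PPO2
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```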

Actor-Critic

State-value function estimation

When the state space is discrete, the policy-evaluation algorithm above computes $V$; when the state space is continuous, a network $V^\phi(x)$ is used to represent the value function. There are two ways to estimate $V^\phi(x)$: Monte Carlo (MC) and temporal difference (TD).

  • MC: sample training data $\{x_t, R_t\mid t=0,\cdots,T-1\}$ with targets $V^\phi(x_t)=R_t=\sum_{i=t}^{T-1}\gamma^{i-t}r_{i+1}$, and train the $V^\phi(x)$ network on them
  • TD: train with the target $V^\phi(x_t)=r_{t+1}+\gamma V^\phi(x_{t+1})$

MC estimates have higher variance but lower bias; TD estimates have lower variance but higher bias.

Actor-Critic

Replace $b$ in $A_t$ with $V^\phi(x_t)$: $A_t$ then measures how much better the current action's return is than average. If it is above average the action should be encouraged; if below average it should be discouraged. In the TD view, the cumulative-reward term in $A_t$ can be approximated by $r_{t+1}+\gamma V^\phi(x_{t+1})$, so:
$$A_t=r_{t+1}+\gamma V^\phi(x_{t+1})-V^\phi(x_t)$$

$\pi^\theta$ is the Actor, with loss $L^\theta=A_t e_t$ as above; $V^\phi$ is the Critic, with loss $L^\phi=\frac{1}{2}|A_t|^2$.

Actor-Critic algorithm
Process:
initialize $\theta=\theta_0,\ \phi=\phi_0,\ x=x_0$
for $i=1,2,\cdots$:
  choose an action $a\sim\pi^\theta(x)$
  obtain the reward and next state: $r,x'$
  $A=r+\gamma V^\phi(x')-V^\phi(x)$
  update the Critic: $\phi=\phi-\eta\nabla_\phi L^\phi=\phi+\eta A\nabla_\phi V^\phi(x)$
  update the Actor: $\theta=\theta-\eta\nabla_\theta L^\theta=\theta+\eta A\nabla_\theta\ln\pi^\theta(x,a)$
  $x=x'$
Output: network parameters $\theta,\phi$

A code implementation can be found here.
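As a complement, a sketch of one Actor-Critic step in PyTorch (assumed framework), where `actor(x)` is assumed to return action logits and `critic(x)` a scalar value estimate; the TD target is detached so the update matches the semi-gradient used in the pseudocode:

```python
import torch

def actor_critic_step(actor, critic, opt_actor, opt_critic, x, a, r, x_next, gamma=0.99):
    """One update for a single transition (x, a, r, x')."""
    with torch.no_grad():
        target = r + gamma * critic(x_next)          # TD target, treated as a constant
    A = target - critic(x)                           # advantage = TD error

    critic_loss = 0.5 * A.pow(2).mean()              # L_phi = 1/2 |A|^2
    opt_critic.zero_grad()
    critic_loss.backward()
    opt_critic.step()

    logp_a = torch.log_softmax(actor(x), dim=-1)[..., a]
    actor_loss = -(A.detach() * logp_a).mean()       # L_theta = -A * ln pi_theta(x, a)
    opt_actor.zero_grad()
    actor_loss.backward()
    opt_actor.step()
```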

DQN

DQN algorithm
Process:
initialize the parameters of $Q,\hat Q$: $\theta=\theta_0,\ \hat\theta=\theta$; replay queue $q$; current state $x$ and action $a=\pi^\epsilon(x)$
for $i=1,2,\cdots$:
  execute $a$, obtaining $r,x'$
  $q.append((x,a,r,x'))$
  sample $\{(x_t,a_t,r_t,x_t')\mid t=1,\cdots,B\}$ from $q$
  $y_t=r_t+\gamma \max_a\hat Q(x_t',a)$
  $\theta=\theta+\alpha\sum_t\nabla_\theta Q(x_t,a_t)\cdot(y_t-Q(x_t,a_t))$
  every $C$ steps, update $\hat\theta=\theta$
  $\pi(x)=\arg\max_{a''}Q(x,a'')$
  $x=x',\ a=\pi^\epsilon(x')$
Output: network parameters $\theta$
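A sketch of the replay-and-update core of DQN in PyTorch (assumed framework); `Q` and `Q_hat` are assumed to map a batch of state feature vectors to per-action values, and the buffer holds `(x, a, r, x', done)` tuples:

```python
import random
import torch
import torch.nn.functional as F

def dqn_update(Q, Q_hat, optimizer, buffer, batch_size=32, gamma=0.99):
    """One DQN gradient step on a minibatch sampled from the replay buffer."""
    batch = random.sample(buffer, batch_size)
    x, a, r, x2, done = zip(*batch)
    x, x2 = torch.tensor(x, dtype=torch.float32), torch.tensor(x2, dtype=torch.float32)
    a, r = torch.tensor(a, dtype=torch.int64), torch.tensor(r, dtype=torch.float32)
    done = torch.tensor(done, dtype=torch.float32)
    with torch.no_grad():
        y = r + gamma * Q_hat(x2).max(dim=1).values * (1.0 - done)   # y_t = r_t + gamma max_a Q_hat(x'_t, a)
    q_sa = Q(x).gather(1, a.unsqueeze(1)).squeeze(1)                 # Q(x_t, a_t)
    loss = F.mse_loss(q_sa, y)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# the replay buffer can be a collections.deque(maxlen=10_000);
# every C steps sync the target network: Q_hat.load_state_dict(Q.state_dict())
```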

Tips:

  • Double DQN
    • DQN tends to overestimate Q values
    • DDQN changes only one line: $y_t=r_t+\gamma \hat Q(x_t',\arg\max_a Q(x_t',a))$
  • Dueling DQN
    • only changes the network structure
  • Prioritized Replay
    • samples in the buffer with a larger TD error ($y_t-Q(x_t,a_t)$) are drawn with higher probability
  • Multi-step
    • sample $(x_t,a_t,r_t,\cdots,x_{t+N},a_{t+N})$
    • target: $Q(x_t,a_t)=\sum_{i=0}^{N-1}\gamma^i r_{t+i+1}+\gamma^N\hat Q(x_{t+N},a_{t+N})$
  • Noisy Net
    • noise on actions (epsilon-greedy)
    • noise on parameters
      • $a=\arg\max_a\tilde Q(x,a)$
  • Distributional DQN
  • Rainbow
    • combines all of the tips above

reward shaping

imitation


Reprinted from blog.csdn.net/dragonchow123/article/details/127626672