Problem Setup
- A reinforcement learning task is described by a quadruple $E=\langle X,A,P,R\rangle$, where $x\in X$ is a state, $a\in A$ is an action, $P:X\times A\times X\rightarrow \mathbb R$ gives the state-transition probabilities, and $R:X\times A\times X\rightarrow \mathbb R$ gives the rewards.
- $\pi$ is a policy; $\pi(x,a)$ is the probability that policy $\pi$ selects action $a$ in state $x$, with $\sum_a\pi(x,a)=1$.
- The learning task is to find a policy $\pi$, which then determines the action to execute, $a=\pi(x)$. The objective is to maximize the cumulative reward. Common choices are the $T$-step cumulative reward $\mathbb E[\frac{1}{T}\sum_{t=1}^{T}r_t]$ and the $\gamma$-discounted cumulative reward $\mathbb E[\sum_{t=0}^{\infty}\gamma^t r_{t+1}]$, where $r_t$ is the reward obtained at step $t$.
Multi-Armed Bandit
$\epsilon$-Greedy Algorithm
Input: number of arms $K$, reward function $R$, number of trials $T$, exploration probability $\epsilon$
Process:
$r=0,\ Q(i)=0,\ cnt(i)=0$
for $t=1,2,\cdots,T$:
$\quad$ if $rand()<\epsilon$:
$\qquad k=randint(1,K)$  # exploration only
$\quad$ else:
$\qquad k=\arg\max_i Q(i)$  # exploitation only
$\quad v=R(k)$
$\quad r=r+v$
$\quad Q(k)=\frac{Q(k)\cdot cnt(k)+v}{cnt(k)+1}$
$\quad cnt(k)=cnt(k)+1$
Output: $r$
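The procedure above can be sketched in a few lines of Python; the Gaussian arm rewards below are a made-up example, not part of the original setup:

```python
import random

def epsilon_greedy(K, reward, T, eps):
    """Run the epsilon-greedy bandit for T trials; return total reward r."""
    r = 0.0
    Q = [0.0] * K          # running mean reward of each arm
    cnt = [0] * K          # pull count of each arm
    for t in range(T):
        if random.random() < eps:
            k = random.randrange(K)                  # explore
        else:
            k = max(range(K), key=lambda i: Q[i])    # exploit
        v = reward(k)
        r += v
        Q[k] = (Q[k] * cnt[k] + v) / (cnt[k] + 1)    # incremental mean update
        cnt[k] += 1
    return r

# Example: arm k pays a noisy reward centered at k / 5 (an assumed toy setup).
random.seed(0)
total = epsilon_greedy(K=5, reward=lambda k: k / 5 + random.gauss(0, 0.1),
                       T=2000, eps=0.1)
```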
As trials accumulate the estimates $Q(k)$ improve, so the exploration probability $\epsilon$ can be decayed over time, e.g. $\epsilon=\frac{1}{\sqrt{t}}$. Alternatively, arms can be sampled directly with probabilities derived from $Q(k)$, which is the softmax algorithm; arm $k$ is drawn with probability:
$$P(k)=\frac{e^{Q(k)/\tau}}{\sum_{i=1}^K e^{Q(i)/\tau}}$$
Softmax Algorithm
Input: number of arms $K$, reward function $R$, number of trials $T$, temperature parameter $\tau$
Process:
$r=0,\ Q(i)=0,\ cnt(i)=0$
for $t=1,2,\cdots,T$:
$\quad$ sample $k$ according to $P(k)$
$\quad v=R(k)$
$\quad r=r+v$
$\quad Q(k)=\frac{Q(k)\cdot cnt(k)+v}{cnt(k)+1}$
$\quad cnt(k)=cnt(k)+1$
Output: $r$
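Sampling from $P(k)$ is a weighted draw over $e^{Q(k)/\tau}$; a minimal sketch (the $Q$ values below are arbitrary):

```python
import math
import random

def softmax_pick(Q, tau):
    """Sample an arm index with probability proportional to exp(Q[k]/tau)."""
    weights = [math.exp(q / tau) for q in Q]
    return random.choices(range(len(Q)), weights=weights)[0]

# A low temperature concentrates picks on the best arm; a high temperature
# approaches uniform random choice.
random.seed(1)
picks = [softmax_pick([0.1, 0.9, 0.5], tau=0.1) for _ in range(1000)]
```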
Model-Based Learning
When the quadruple $E=\langle X,A,P,R\rangle$ is known, we are in the model-based setting. State-value functions:
$$\left\{\begin{array}{l} V_T^\pi(x)=\mathbb E_\pi\left[\frac{1}{T}\sum_{t=1}^T r_t \mid x_0=x\right]\\ V_\gamma^\pi(x)=\mathbb E_\pi\left[\sum_{t=0}^\infty \gamma^t r_{t+1} \mid x_0=x\right] \end{array}\right.$$
State-action value functions:
$$\left\{\begin{array}{l} Q_T^\pi(x,a)=\mathbb E_\pi\left[\frac{1}{T}\sum_{t=1}^T r_t \mid x_0=x,a_0=a\right]\\ Q_\gamma^\pi(x,a)=\mathbb E_\pi\left[\sum_{t=0}^\infty \gamma^t r_{t+1} \mid x_0=x,a_0=a\right] \end{array}\right.$$
Bellman equations (for the $\gamma$-discounted case):
$$\left\{\begin{array}{l} Q(x,a)=\sum_{x'}P_{x\to x'}^a\left(R_{x\to x'}^a+\gamma V(x')\right)\\ V(x)=\sum_a\pi(x,a)Q(x,a) \end{array}\right.$$
Policy Evaluation Algorithm
Input: $E=\langle X,A,P,R\rangle$; policy $\pi$ to evaluate
Process:
$\forall x,\ V(x)=0$
for $t=1,2,\cdots$:
$\quad \forall x,\ V'(x)=\sum_a\pi(x,a)\sum_{x'}P_{x\to x'}^a(R_{x\to x'}^a+\gamma V(x'))$
$\quad$ if $\max_x|V'(x)-V(x)|<thr$:
$\qquad$ break
$\quad V=V'$
Output: state-value function $V$
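As a concrete illustration, the loop above on a hypothetical two-state, two-action MDP (the $P$ and $R$ tables, the policy, and $\gamma$ are all invented for the example):

```python
# States 0,1; actions 0,1. P[x][a][x2] is the transition probability,
# R[x][a][x2] the reward for the transition x --a--> x2.
P = [[[0.8, 0.2], [0.2, 0.8]],
     [[0.5, 0.5], [0.1, 0.9]]]
R = [[[1.0, 0.0], [0.0, 2.0]],
     [[0.0, 1.0], [1.0, 0.0]]]
pi = [[0.5, 0.5], [0.5, 0.5]]   # uniform random policy to evaluate
gamma, thr = 0.9, 1e-8

V = [0.0, 0.0]
while True:
    # one sweep of the Bellman expectation backup
    V_new = [sum(pi[x][a] * sum(P[x][a][x2] * (R[x][a][x2] + gamma * V[x2])
                                for x2 in range(2))
                 for a in range(2))
             for x in range(2)]
    if max(abs(V_new[x] - V[x]) for x in range(2)) < thr:
        break
    V = V_new
```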
Optimal Bellman equations (for the $\gamma$-discounted case):
$$\left\{\begin{array}{l} V(x)=\max_a Q(x,a)\\ Q(x,a)=\sum_{x'}P_{x\to x'}^a(R_{x\to x'}^a+\gamma \max_{a'}Q(x',a')) \end{array}\right.$$
Policy improvement: $\pi'(x)=\arg\max_a Q(x,a)$
Policy Iteration Algorithm
Input: $E=\langle X,A,P,R\rangle$
Process:
$\forall x,\ V(x)=0,\ \pi(x,a)=1/|A|$
while True:
$\quad$ for $t=1,2,\cdots$:
$\qquad \forall x,\ V'(x)=\sum_a\pi(x,a)\sum_{x'}P_{x\to x'}^a(R_{x\to x'}^a+\gamma V(x'))$
$\qquad$ if $\max_x|V'(x)-V(x)|<thr$:
$\quad\qquad$ break
$\qquad V=V'$
$\quad \forall x,\ \pi'(x)=\arg\max_a Q(x,a)$  # $Q$ computed via the Bellman equation
$\quad$ if $\pi'(x)=\pi(x)\ \forall x$:
$\qquad$ break
$\quad \pi=\pi'$
Output: optimal policy $\pi$
In policy iteration, the policy improves slowly: every improvement step requires a full policy evaluation. The policy update can instead be folded into the value-function iteration:
Value Iteration Algorithm
Input: $E=\langle X,A,P,R\rangle$
Process:
$\forall x,\ V(x)=0$
for $t=1,2,\cdots$:
$\quad \forall x,\ V'(x)=\max_a\sum_{x'}P_{x\to x'}^a(R_{x\to x'}^a+\gamma V(x'))$
$\quad$ if $\max_x|V'(x)-V(x)|<thr$:
$\qquad$ break
$\quad V=V'$
Output: optimal policy $\pi(x)=\arg\max_a Q(x,a)$
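The same kind of toy MDP illustrates value iteration (tables invented for the example); the greedy policy is read off from the converged $V$:

```python
# Value iteration on a hypothetical two-state, two-action MDP (made-up tables).
P = [[[0.8, 0.2], [0.2, 0.8]],
     [[0.5, 0.5], [0.1, 0.9]]]
R = [[[1.0, 0.0], [0.0, 2.0]],
     [[0.0, 1.0], [1.0, 0.0]]]
gamma, thr = 0.9, 1e-8

V = [0.0, 0.0]
while True:
    # one sweep of the Bellman optimality backup: max over actions
    V_new = [max(sum(P[x][a][x2] * (R[x][a][x2] + gamma * V[x2])
                     for x2 in range(2))
                 for a in range(2))
             for x in range(2)]
    if max(abs(V_new[x] - V[x]) for x in range(2)) < thr:
        break
    V = V_new

# Greedy policy read off from the converged V via the Bellman equation.
pi = [max(range(2),
          key=lambda a: sum(P[x][a][x2] * (R[x][a][x2] + gamma * V[x2])
                            for x2 in range(2)))
      for x in range(2)]
```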
Model-Free Learning
In practice $P$ and $R$ are hard to obtain, and even the number of states may be unknown; learning algorithms that do not depend on a model of the environment are called model-free. With the model unknown, we start from an initial state and sample with some policy, obtaining a trajectory: $\langle x_0,a_0,r_1,x_1,a_1,r_2,\cdots,x_{T-1},a_{T-1},r_T,x_T\rangle$
Monte Carlo Reinforcement Learning
On-Policy Monte Carlo Algorithm
Input: $A,x_0,T$
Process:
$Q(x,a)=0,\ cnt(x,a)=0,\ \pi(x,a)=\frac{1}{|A|}$
for $s=1,2,\cdots$:
$\quad$ execute policy $\pi^\epsilon$ to obtain a trajectory $\langle x_0,a_0,r_1,x_1,a_1,r_2,\cdots,x_{T-1},a_{T-1},r_T,x_T\rangle$
$\quad$ for $t=0,\cdots,T-1$:
$\qquad R=\frac{1}{T-t}\sum_{i=t+1}^{T}r_i$
$\qquad Q(x_t,a_t)=\frac{Q(x_t,a_t)\cdot cnt(x_t,a_t)+R}{cnt(x_t,a_t)+1}$
$\qquad cnt(x_t,a_t)=cnt(x_t,a_t)+1$
$\quad \pi^\epsilon(x)=\left\{\begin{array}{ll} \arg\max_a Q(x,a) & \text{with probability } 1-\epsilon\\ \text{an action drawn uniformly from } A & \text{with probability } \epsilon \end{array}\right.$
Output: policy $\pi^\epsilon$
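The per-step return $R=\frac{1}{T-t}\sum_{i=t+1}^{T}r_i$ used inside the loop can be checked on a short made-up reward sequence:

```python
def step_returns(rewards):
    """rewards holds [r_1, ..., r_T]; returns R_t for t = 0, ..., T-1,
    where R_t is the average of the rewards collected after step t."""
    T = len(rewards)
    return [sum(rewards[t:]) / (T - t) for t in range(T)]

R = step_returns([1.0, 0.0, 2.0])
# R_0 = (1+0+2)/3, R_1 = (0+2)/2, R_2 = 2/1
```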
On-policy Monte Carlo produces an $\epsilon$-greedy policy. What we actually want is to use the $\epsilon$-greedy policy only for evaluation (i.e., for sampling), while the policy being improved is the original one.
The expectation of a function $f$ under a distribution $p$:
$$\mathbb E(f)=\int_x f(x)p(x)dx=\int_x f(x)\frac{p(x)}{q(x)}q(x)dx$$
Sampling $(x_1,x_2,\cdots,x_m)$ from $p$, the expectation of $f$ under $p$ can be estimated as:
$$\hat{\mathbb E}(f)=\frac{1}{m}\sum_{i=1}^m f(x_i)$$
Sampling $(x_1',x_2',\cdots,x_m')$ from $q$, the expectation of $f$ under $p$ can also be estimated (importance sampling):
$$\hat{\mathbb E}(f)=\frac{1}{m}\sum_{i=1}^m f(x_i')\frac{p(x_i')}{q(x_i')}$$
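A numerical sanity check of the importance-sampling estimator, with distributions picked arbitrarily for illustration: estimate $\mathbb E_p[f]$ for $f(x)=x^2$ and $p=\mathcal N(0,1)$ (true value 1), using samples drawn from $q=\mathcal N(1,2)$:

```python
import math
import random

def pdf(x, mu, sigma):
    """Density of a normal distribution N(mu, sigma^2)."""
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

random.seed(0)
m = 200_000
est = 0.0
for _ in range(m):
    x = random.gauss(1, 2)                          # sample from q
    est += (x ** 2) * pdf(x, 0, 1) / pdf(x, 1, 2)   # f(x) * p(x)/q(x)
est /= m
# est should be close to E_p[x^2] = 1 despite sampling from q.
```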
Likewise, we can sample with $\pi^\epsilon$ to estimate the expectation of $Q$ under $\pi$:
$$Q(x,a)=\frac{1}{m}\sum_{i=1}^m R_i\frac{P_i^\pi}{P_i^{\pi^\epsilon}}$$
Since $P^\pi=\prod_{i=0}^{T-1}\pi(x_i,a_i)P_{x_i\to x_{i+1}}^{a_i}$ and $P^{\pi^\epsilon}=\prod_{i=0}^{T-1}\pi^\epsilon(x_i,a_i)P_{x_i\to x_{i+1}}^{a_i}$, we have:
$$\frac{P^\pi}{P^{\pi^\epsilon}}=\prod_{i=0}^{T-1}\frac{\pi(x_i,a_i)}{\pi^\epsilon(x_i,a_i)}$$
where $\pi(x_i,a_i)=\mathbb I(a_i=\pi(x_i))$ and $\pi^\epsilon(x_i,a_i)=\left\{\begin{array}{ll} 1-\epsilon+\frac{\epsilon}{|A|} & a_i=\pi(x_i)\\ \frac{\epsilon}{|A|} & a_i\ne\pi(x_i) \end{array}\right.$. Since any step with $a_i\ne\pi(x_i)$ zeroes the numerator, this product is very often 0, so the off-policy Monte Carlo algorithm below is for reference only; in practice it cannot be computed this way.
Off-Policy Monte Carlo Algorithm
Input: $A,x_0,T$
Process:
$Q(x,a)=0,\ cnt(x,a)=0,\ \pi(x,a)=\frac{1}{|A|}$
for $s=1,2,\cdots$:
$\quad$ execute policy $\pi^\epsilon$ to obtain a trajectory $\langle x_0,a_0,r_1,x_1,a_1,r_2,\cdots,x_{T-1},a_{T-1},r_T,x_T\rangle$
$\quad$ for $t=0,\cdots,T-1$:
$\qquad R=\left(\frac{1}{T-t}\sum_{i=t+1}^{T}r_i\right)\left(\prod_{i=t}^{T-1}\frac{\pi(x_i,a_i)}{\pi^\epsilon(x_i,a_i)}\right)$
$\qquad Q(x_t,a_t)=\frac{Q(x_t,a_t)\cdot cnt(x_t,a_t)+R}{cnt(x_t,a_t)+1}$
$\qquad cnt(x_t,a_t)=cnt(x_t,a_t)+1$
$\quad \pi(x)=\arg\max_a Q(x,a)$
Output: policy $\pi$
Temporal-Difference Learning
Monte Carlo does not exploit the MDP structure and is therefore inefficient. Temporal-difference (TD) learning combines ideas from dynamic programming and Monte Carlo and is more efficient. The Monte Carlo update of $Q$ can be rewritten as:
$$Q(x,a)=Q(x,a)+\frac{1}{c+1}(R-Q(x,a))=Q(x,a)+\alpha_c(R-Q(x,a))$$
Fixing $\alpha_c=\alpha$ and sampling a transition $\langle x,a,r,x',a'\rangle$ gives:
$$Q(x,a)=Q(x,a)+\alpha(r+\gamma Q(x',a')-Q(x,a))$$
Sarsa (On-Policy) Algorithm
Input: $A,x_0,\gamma,\alpha$
Process:
$Q(x,a)=0,\ \pi(x,a)=\frac{1}{|A|},\ x=x_0,\ a=\pi(x)$
for $t=1,2,\cdots$:
$\quad$ execute $a \Rightarrow r,x'$
$\quad a'=\pi^\epsilon(x')$
$\quad Q(x,a)=Q(x,a)+\alpha(r+\gamma Q(x',a')-Q(x,a))$
$\quad \pi(x)=\arg\max_{a''}Q(x,a'')$
$\quad x=x',\ a=a'$
Output: policy $\pi$
Q-Learning (Off-Policy) Algorithm
Input: $A,x_0,\gamma,\alpha$
Process:
$Q(x,a)=0,\ \pi(x,a)=\frac{1}{|A|},\ x=x_0,\ a=\pi(x)$
for $t=1,2,\cdots$:
$\quad$ execute $a \Rightarrow r,x'$
$\quad a'=\pi(x')$
$\quad Q(x,a)=Q(x,a)+\alpha(r+\gamma Q(x',a')-Q(x,a))$
$\quad \pi(x)=\arg\max_{a''}Q(x,a'')$
$\quad x=x',\ a=\pi^\epsilon(x')$
Output: policy $\pi$
Policy Gradient
The algorithms from here on are deep RL (PPT). The actor network is $a=\pi_\theta(x)$, where $a$ is viewed as a probability distribution over actions. After interacting with the environment for one trajectory, we obtain training data:
$$\{\{x_t,a_t\},A_t \mid t=0,\cdots,T-1\}$$
where $A_t=\sum_{i=t}^{T-1}\gamma^{i-t}r_{i+1}-b$, i.e. the cumulative reward serves as the weight of each sample. Treating $a_t$ as a one-hot vector, with cross-entropy $e_t=CE(\pi_\theta(x_t),a_t)$, the loss is:
$$L=\sum_{t=0}^{T-1}A_t e_t$$
Taking the gradient:
$$\nabla_\theta L=-\sum_{t=0}^{T-1}A_t \nabla_\theta\ln\pi_\theta(x_t,a_t)$$
Policy Gradient (On-Policy) Algorithm
Process:
initialize $\theta=\theta_0$
for $i=1,2,\cdots,N$:
$\quad$ training data: interact with the environment using $\pi_\theta$ to obtain $\{\{x_t,a_t\},A_t \mid t=0,\cdots,T-1\}$
$\quad$ compute the loss: $L=\sum_{t=0}^{T-1}A_t e_t$
$\quad$ update the parameters: $\theta=\theta-\eta\nabla_\theta L=\theta+\eta\sum_{t=0}^{T-1}A_t \nabla_\theta\ln\pi_\theta(x_t,a_t)$
Output: network parameters $\theta$
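The update rule can be made concrete with a tabular softmax policy on a hypothetical one-state, two-action problem (the whole setup is invented for illustration; a real actor would be a neural network):

```python
import math
import random

random.seed(0)
theta = [0.0, 0.0]        # logits of a tabular softmax policy (single state)
eta, b = 0.1, 0.5         # learning rate and a constant baseline

def pi():
    z = [math.exp(t) for t in theta]
    s = sum(z)
    return [v / s for v in z]

def reward(a):
    return 1.0 if a == 1 else 0.0   # made-up reward: action 1 is better

for episode in range(500):
    p = pi()
    a = random.choices([0, 1], weights=p)[0]   # sample an action from pi_theta
    A = reward(a) - b                          # advantage A_t (one-step episode)
    for i in range(2):
        # gradient of ln pi(a) w.r.t. the logits: one-hot(a) - pi
        g = (1.0 if i == a else 0.0) - p[i]
        theta[i] += eta * A * g                # theta += eta * A * grad ln pi
```

After training, the policy concentrates on the better action.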
Proximal Policy Optimization
PPO = the off-policy form of Policy Gradient + a constraint on the parameters
Off-Policy PG
Let $p_\theta(a|x)=\pi_\theta(x,a)$, with $\theta$ the parameters being updated and $\theta'$ the parameters used for sampling. Then:
$$\begin{aligned} -\nabla_\theta L &= \mathbb E_{x,a\sim\pi_\theta}[A(x,a)\nabla\ln\pi_\theta(x,a)]\\ &= \mathbb E_{x,a\sim\pi_{\theta'}}\left[\frac{p_\theta(x,a)}{p_{\theta'}(x,a)}A(x,a)\nabla\ln\pi_\theta(x,a)\right]\\ &= \mathbb E_{x,a\sim\pi_{\theta'}}\left[\frac{p_\theta(a|x)}{p_{\theta'}(a|x)}A(x,a)\nabla\ln p_\theta(a|x)\right]\\ &= \mathbb E_{x,a\sim\pi_{\theta'}}\left[\frac{\nabla p_\theta(a|x)}{p_{\theta'}(a|x)}A(x,a)\right]\\ &= \nabla_\theta J^{\theta'}(\theta) \end{aligned}$$
This yields the objective to optimize:
$$J^{\theta'}(\theta)=\mathbb E_{x,a\sim\pi_{\theta'}}\left[\frac{p_\theta(a|x)}{p_{\theta'}(a|x)}A(x,a)\right]$$
Constraint
During optimization we do not want $\theta$ to drift too far from $\theta'$, so a penalty on the divergence between them can be added:
$$J_{PPO}^{\theta'}(\theta)=J^{\theta'}(\theta)-\beta\,KL(\theta,\theta')$$
An alternative bounds the probability ratio with a clip:
$$J_{PPO2}^{\theta'}(\theta)=\mathbb E_{x,a\sim\pi_{\theta'}}\left[\min\left(\frac{p_\theta(a|x)}{p_{\theta'}(a|x)}A(x,a),\ \mathrm{clip}\left(\frac{p_\theta(a|x)}{p_{\theta'}(a|x)},1-\epsilon,1+\epsilon\right)A(x,a)\right)\right]$$
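The clipped term is simple arithmetic; for single samples with made-up ratio and advantage values:

```python
def ppo2_term(ratio, A, eps=0.2):
    """Per-sample contribution to the PPO2 clipped objective."""
    clipped = min(max(ratio, 1 - eps), 1 + eps)   # clip(ratio, 1-eps, 1+eps)
    return min(ratio * A, clipped * A)

# Positive advantage: the incentive is capped once the ratio exceeds 1+eps.
a = ppo2_term(1.5, 2.0)    # min(3.0, 1.2 * 2.0) = 2.4
# Negative advantage: min keeps the larger penalty, so large ratio moves
# are still discouraged.
b = ppo2_term(1.5, -2.0)   # min(-3.0, -2.4) = -3.0
```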
PPO Algorithm
PPO (Off-Policy) Algorithm
Process:
initialize $\theta=\theta_0,\ \theta'=\theta$
for $i=1,2,\cdots,N$:
$\quad$ training data: interact with the environment using $\pi_{\theta'}$ to obtain $\{\{x_t,a_t\},A_t \mid t=0,\cdots,T-1\}$
$\quad$ update the parameters: $\theta=\arg\max_\theta J_{PPO}^{\theta'}(\theta)$
$\quad \theta'=\theta$
Output: network parameters $\theta$
A code implementation is available here.
Actor-Critic
State-Value Function Estimation
When the state space is discrete, the policy evaluation algorithm above computes $V$. When the state space is continuous, a network $V^\phi(x)$ represents the value function, and there are two ways to estimate it, Monte Carlo (MC) and temporal difference (TD):
- MC: sample training data $\{x_t,R_t \mid t=0,\cdots,T-1\}$ with targets $V^\phi(x_t)=R_t=\sum_{i=t}^{T-1}\gamma^{i-t}r_{i+1}$, and fit the $V^\phi(x)$ network
- TD: train with the targets $V^\phi(x_t)=r_{t+1}+\gamma V^\phi(x_{t+1})$
MC has high variance but low bias; TD has low variance but higher bias.
Actor-Critic
Replacing $b$ in $A_t$ with $V^\phi(x_t)$, $A_t$ measures how much more reward the current action earns than average: if above average, the action should be encouraged; if below average, it should be avoided. In TD, the cumulative-reward term of $A_t$ can be approximated by $r_{t+1}+\gamma V^\phi(x_{t+1})$, giving:
$$A_t=r_{t+1}+\gamma V^\phi(x_{t+1})-V^\phi(x_t)$$
$\pi^\theta$ is the actor, with the loss $L^\theta=A_t e_t$ as above; $V^\phi$ is the critic, with the loss $L^\phi=\frac{1}{2}|A_t|^2$
Actor-Critic Algorithm
Process:
initialize $\theta=\theta_0,\ \phi=\phi_0,\ x=x_0$
for $i=1,2,\cdots$:
$\quad$ choose an action $a\sim\pi^\theta(x)$
$\quad$ obtain the reward and the next state: $r,x'$
$\quad A=r+\gamma V^\phi(x')-V^\phi(x)$
$\quad$ update the parameters: $\phi=\phi-\eta\nabla_\phi L^\phi=\phi+\eta A\nabla_\phi V^\phi(x)$
$\quad$ update the parameters: $\theta=\theta-\eta\nabla_\theta L^\theta=\theta+\eta A\nabla_\theta\ln\pi^\theta(x,a)$
$\quad x=x'$
Output: network parameters $\theta,\phi$
A code implementation is available here.
DQN
DQN Algorithm
Process:
initialize the parameters $\theta=\theta_0$ of $Q$ and $\hat\theta=\theta$ of $\hat Q$; replay queue $q$; $x=x_0,\ a=\pi^\epsilon(x)$
for $i=1,2,\cdots$:
$\quad$ execute $a\Rightarrow r,x'$
$\quad q.append((x,a,r,x'))$
$\quad$ sample a batch $\{(x_t,a_t,r_t,x_t') \mid t=1,\cdots,B\}$ from $q$
$\quad y_t=r_t+\gamma\max_a\hat Q(x_t',a)$
$\quad \theta=\theta+\alpha\sum_t\nabla_\theta Q(x_t,a_t)\,(y_t-Q(x_t,a_t))$
$\quad$ every $C$ steps, update $\hat\theta=\theta$
$\quad \pi(x)=\arg\max_{a''}Q(x,a'')$
$\quad x=x',\ a=\pi^\epsilon(x')$
Output: network parameters $\theta$
Tips:
- Double DQN
  - DQN tends to overestimate Q values
  - DDQN changes only one line: $y_t=r_t+\gamma\hat Q(x_t',\arg\max_a Q(x_t',a))$
- Dueling DQN
  - only changes the network structure
- Prioritized Replay
  - samples in the queue with larger TD error ($y_t-Q(x_t,a_t)$) are selected with higher probability
- Multi-step
  - sample $(x_t,a_t,r_t,\cdots,x_{t+N},a_{t+N})$
  - $Q(x_t,a_t)=\sum_{i=0}^{N-1}\gamma^i r_{t+i}+\gamma^N\hat Q(x_{t+N},a_{t+N})$
- Noisy Net
  - noise on actions ($\epsilon$-greedy)
  - noise on parameters: $a=\arg\max_a\tilde Q(x,a)$
- Distributional DQN
- Rainbow
  - combines all of the tips above
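The Double DQN change in the first tip can be illustrated numerically: the online net picks the action, the target net evaluates it, so the DDQN target never exceeds the plain DQN target $r+\gamma\max_a\hat Q(x',a)$ (the Q-values below are invented for illustration):

```python
gamma, r = 0.9, 0.0
# Hypothetical Q-values at the next state x'.
Q_online = {0: 1.0, 1: 1.5, 2: 9.0}   # online net with a noisy spike at a=2
Q_target = {0: 1.1, 1: 1.4, 2: 1.0}   # target net

# DQN target: max over the target net.
y_dqn = r + gamma * max(Q_target.values())

# DDQN target: argmax from the online net, value from the target net.
a_star = max(Q_online, key=Q_online.get)
y_ddqn = r + gamma * Q_target[a_star]
# y_ddqn <= y_dqn always, which curbs the overestimation.
```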