RL (Zhao) Part 3, Model-Based: The Bellman Optimality Equation (BOE) [the BOE satisfies the contraction mapping theorem, so the optimal state values can be obtained by an iterative algorithm, which in turn yields the optimal policy]

The goal of reinforcement learning is to find optimal policies. When studying the Bellman optimality equation, two concepts and one basic tool deserve particular attention:

  • Two concepts: the optimal state value and the optimal policy
  • One basic tool: the Bellman optimality equation (BOE)

一、Motivating examples

First, we calculate the state values of the given policy. In particular, the Bellman equation of this policy is

$$
\begin{aligned}
v_\pi(s_1) &= -1+\gamma v_\pi(s_2),\\
v_\pi(s_2) &= +1+\gamma v_\pi(s_4),\\
v_\pi(s_3) &= +1+\gamma v_\pi(s_4),\\
v_\pi(s_4) &= +1+\gamma v_\pi(s_4).
\end{aligned}
$$

Let γ = 0.9. It can be easily solved that

$$
v_\pi(s_4)=v_\pi(s_3)=v_\pi(s_2)=10,\qquad v_\pi(s_1)=8.
$$
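To make the arithmetic concrete, here is a minimal NumPy sketch that solves the same linear system (the transition matrix and reward vector below are read off from the four equations above, so they are assumptions about the underlying grid-world example):

```python
import numpy as np

# Under the given policy: s1 -> s2 (reward -1), s2 -> s4 (reward +1),
# s3 -> s4 (reward +1), s4 -> s4 (reward +1).
gamma = 0.9
P = np.array([[0, 1, 0, 0],
              [0, 0, 0, 1],
              [0, 0, 0, 1],
              [0, 0, 0, 1]], dtype=float)   # P[s, s'] under the policy
r = np.array([-1.0, 1.0, 1.0, 1.0])         # expected immediate reward per state

# Bellman equation in matrix form: v = r + gamma * P v  =>  (I - gamma P) v = r
v = np.linalg.solve(np.eye(4) - gamma * P, r)
print(v)   # [ 8. 10. 10. 10.]
```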
Second, we calculate the action values for state $s_1$:

$$
\begin{aligned}
q_\pi(s_1,a_1) &= -1+\gamma v_\pi(s_1)=6.2,\\
q_\pi(s_1,a_2) &= -1+\gamma v_\pi(s_2)=8,\\
q_\pi(s_1,a_3) &= 0+\gamma v_\pi(s_3)=9,\\
q_\pi(s_1,a_4) &= -1+\gamma v_\pi(s_1)=6.2,\\
q_\pi(s_1,a_5) &= 0+\gamma v_\pi(s_1)=7.2.
\end{aligned}
$$
It is notable that action $a_3$ has the greatest action value:

$$
q_\pi(s_1,a_3)\geq q_\pi(s_1,a_i),\quad\text{for all }i\neq 3.
$$

Therefore, we can update the policy to select $a_3$ at $s_1$.

If, in every iteration, every state selects the action with the greatest action value, we eventually obtain the optimal policy.

二、Optimal Policy

State values can be used to evaluate how good a policy is. For two policies $\pi_1$ and $\pi_2$, if $v_{\pi_1}(s)\geq v_{\pi_2}(s)$ for every state $s$, then $\pi_1$ is better than $\pi_2$. The optimal policy $\pi^*$ is therefore the one that is better than every other policy $\pi$ at every state $s$.

The definition above indicates that, compared with all other policies, an optimal policy has the greatest state value for every state. It also raises several questions:

  • Existence: does an optimal policy exist?
  • Uniqueness: is the optimal policy unique?
  • Stochasticity: is the optimal policy stochastic or deterministic?
  • Algorithm: how do we obtain the optimal policy and the optimal state values?

三、The Bellman Optimality Equation (BOE)

1、The Bellman Equation

In the Bellman equation, the state values are solved under a given policy $\pi$.

2、The Bellman Optimality Equation

In the Bellman optimality equation, the policy $\pi$ is not given; it is an unknown to be solved for.

$$
v(s)=\max_{\pi(s)\in\Pi(s)}\sum_{a\in\mathcal{A}}\pi(a|s)\left(\sum_{r\in\mathcal{R}}p(r|s,a)\,r+\gamma\sum_{s'\in\mathcal{S}}p(s'|s,a)\,v(s')\right)
$$

The meaning of this formula: the state value $v(s)$ attains its maximum when the policy $\pi$ is chosen optimally.

The BOE is tricky yet elegant!

  • Why elegant? It describes the optimal policy and the optimal state values in an elegant way.
  • Why tricky? There is a maximization on the right-hand side, and it is not immediately obvious how to compute it.
  • Many questions remain to be answered:
    • Algorithm: how to solve this equation?
    • Existence: does this equation have a solution?
    • Uniqueness: is the solution unique?
    • Optimality: how is the solution related to the optimal policy?

3、The Contraction Mapping Theorem

Fixed point

Contraction mapping

$$
\|f(x_1)-f(x_2)\|\leq\gamma\|x_1-x_2\|
$$

  • where $\gamma\in(0,1)$;
  • $\gamma$ must be strictly less than 1 so that limits such as $\gamma^k\to 0$ as $k\to\infty$ hold;
  • $\|\cdot\|$ can be any vector norm.

Examples: a scalar case and a vector case (both are worked out under "Two examples" below).

Preliminaries: the contraction mapping theorem

For any equation of the form $x=f(x)$, if $f$ is a contraction mapping, then:

  • Existence: there exists a fixed point $x^*$ satisfying $f(x^*)=x^*$;
  • Uniqueness: the fixed point $x^*$ is unique;
  • Algorithm: the iterative algorithm $x_{k+1}=f(x_k)$ converges to the fixed point $x^*$.

Two examples

  • For a scalar: $x=0.5x$, where $f(x)=0.5x$ and $x\in\mathbb{R}$. Here $x^*=0$ is the unique fixed point, and it can be found iteratively by

$$x_{k+1}=0.5x_k$$

  • For a vector: $x=Ax$, where $f(x)=Ax$, $x\in\mathbb{R}^n$, $A\in\mathbb{R}^{n\times n}$, and $\|A\|<1$. Here $x^*=0$ is the unique fixed point, and it can be found iteratively by

$$x_{k+1}=Ax_k$$
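A minimal numeric sketch of the iterative algorithm for the vector case (the matrix below is a made-up example chosen so that $\|A\|<1$):

```python
import numpy as np

# Fixed-point iteration x_{k+1} = A x_k for a contraction f(x) = A x.
A = np.array([[0.5, 0.1],
              [0.2, 0.3]])        # infinity norm 0.6 < 1, so f is a contraction
x = np.array([10.0, -4.0])        # any starting point works

for k in range(50):
    x = A @ x                     # x_{k+1} = f(x_k)

print(x)                          # approaches the unique fixed point x* = 0
```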

4、Solving the Bellman Optimality Equation

4.1 Maximizing the right-hand side of the BOE

First fix $v(s')$. Since the model parameters $p(r|s,a)$ and $p(s'|s,a)$ are known, and the reward $r$ and the discount rate $\gamma$ are given, $q(s,a)$ is a constant once $v(s')$ is fixed. To maximize the right-hand side, the policy $\pi$ that attains the maximum can then be determined.

4.2 Solving the Bellman optimality equation

From the analysis in 4.1, once $v(s')$ is fixed, the maximum of the right-hand side of the BOE is determined.

This maximum therefore depends on $v(s')$; in other words, the right-hand side is a function of $v$, written $f(v)$, so the BOE becomes $v=f(v)$.

Here $f(v)$ is a vector whose entry corresponding to state $s$ is $[f(v)]_s=\max_\pi\sum_a\pi(a|s)\,q(s,a)$.

4.3 Solving the BOE with the contraction mapping theorem

First, one must show that the mapping $f$ in the BOE $v=f(v)$ is a contraction mapping.

It can be proved that $\|f(v_1)-f(v_2)\|\leq\gamma\|v_1-v_2\|$, where $\gamma$ is the discount rate; therefore $f$ is a contraction mapping and the contraction mapping theorem applies to $v=f(v)$.

Since the BOE satisfies the conditions of the contraction mapping theorem:

  • Existence: a solution $v^*$ exists;
  • Uniqueness: the solution $v^*$ is unique;
  • Algorithm: the state values can be computed by the iteration $v_{k+1}=f(v_k)=\max_\pi(r_\pi+\gamma P_\pi v_k)$, which converges to the unique solution $v^*$.


The solution procedure is as follows (a code sketch follows this list):

  • For every $s$, the current estimate is $v_k(s)$.
  • For every $a\in\mathcal{A}(s)$, compute
    $$q_k(s,a)=\sum_r p(r|s,a)\,r+\gamma\sum_{s'}p(s'|s,a)\,v_k(s')$$
  • Compute the greedy policy $\pi_{k+1}$: for every $s$,
    $$\pi_{k+1}(a|s)=\begin{cases}1, & a=a_k^*(s)\\ 0, & a\neq a_k^*(s)\end{cases}$$
    where $a_k^*(s)=\arg\max_a q_k(s,a)$, i.e., at iteration $k$, $a_k^*(s)$ is the action that maximizes the action value $q_k(s,a)$ at state $s$.
  • Compute
    $$v_{k+1}(s)=\max_a q_k(s,a)$$
    i.e., the state value of $s$ at iteration $k+1$ equals the greatest action value at $s$ in iteration $k$.
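A minimal NumPy sketch of this procedure (value iteration), under the assumption that the model is given as hypothetical tabular arrays `P[s, a, s'] = p(s'|s,a)` and `R[s, a]` (the expected immediate reward):

```python
import numpy as np

def value_iteration(P, R, gamma=0.9, tol=1e-8):
    """Iterate v_{k+1} = f(v_k) as described above; returns v* and a greedy policy."""
    n_states, n_actions, _ = P.shape
    v = np.zeros(n_states)                      # v_0: any initial guess
    while True:
        q = R + gamma * (P @ v)                 # q_k(s, a) for all (s, a)
        v_new = q.max(axis=1)                   # v_{k+1}(s) = max_a q_k(s, a)
        if np.max(np.abs(v_new - v)) < tol:     # contraction guarantees convergence
            return v_new, q.argmax(axis=1)      # greedy (deterministic) policy a*(s)
        v = v_new
```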

4.4 Example: solving the BOE by hand

Solving the BOE by hand:

  • Actions: $a_l, a_0, a_r$ denote moving left, staying put, and moving right, respectively.
  • Rewards: entering the target area gives $+1$; attempting to step outside the boundary gives $-1$.

Using the relation between action values and state values, $q_\pi(s,a)=\sum_{r\in\mathcal{R}}p(r|s,a)\,r+\gamma\sum_{s'\in\mathcal{S}}p(s'|s,a)\,v_\pi(s')$, we can compute the action value of every state-action pair. For example:

  • If the agent takes the "move left" action at state $s_1$, it bounces back to $s_1$, so:
    $q_0(s_1,a_l)=p(r|s_1,a_l)\,r+\gamma\,p(s_1|s_1,a_l)\,v_0(s_1)=1\times(-1)+\gamma\times 0=-1$
  • If the agent takes the "stay" action at state $s_1$:
    $q_0(s_1,a_0)=p(r|s_1,a_0)\,r+\gamma\,p(s_1|s_1,a_0)\,v_0(s_1)=1\times 0+\gamma\times 0=0$
  • If the agent takes the "move right" action at state $s_1$ and enters $s_2$:
    $q_0(s_1,a_r)=p(r|s_1,a_r)\,r+\gamma\,p(s_2|s_1,a_r)\,v_0(s_2)=1\times 1+\gamma\times 0=1$

After iteration 0:

  • $v_1(s_1)=\max_a q_0(s_1,a)=q_0(s_1,a_r)=1$ (the maximum, 1, is attained at $a=a_r$)
  • $v_1(s_2)=\max_a q_0(s_2,a)=q_0(s_2,a_0)=1$ (the maximum, 1, is attained at $a=a_0$)
  • $v_1(s_3)=\max_a q_0(s_3,a)=q_0(s_3,a_l)=1$ (the maximum, 1, is attained at $a=a_l$)

Although the policy $\pi$ obtained at this point is already optimal, the state values $v$ have not yet converged to the solution of the BOE, so the iteration continues.

Using the same relation, $q(s,a)=\sum_{r\in\mathcal{R}}p(r|s,a)\,r+\gamma\sum_{s'\in\mathcal{S}}p(s'|s,a)\,v(s')$, we compute the next round of action values (a code sketch of the full example follows this list). For example:

  • If the agent takes the "move left" action at state $s_1$, it bounces back to $s_1$, so:
    $q_1(s_1,a_l)=p(r|s_1,a_l)\,r+\gamma\,p(s_1|s_1,a_l)\,v_1(s_1)=1\times(-1)+0.9\times 1=-0.1$
  • If the agent takes the "stay" action at state $s_1$:
    $q_1(s_1,a_0)=p(r|s_1,a_0)\,r+\gamma\,p(s_1|s_1,a_0)\,v_1(s_1)=1\times 0+0.9\times 1=0.9$
  • If the agent takes the "move right" action at state $s_1$ and enters $s_2$:
    $q_1(s_1,a_r)=p(r|s_1,a_r)\,r+\gamma\,p(s_2|s_1,a_r)\,v_1(s_2)=1\times 1+0.9\times 1=1.9$
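A short NumPy sketch that reproduces these hand calculations. The deterministic three-state model below (s1, s2, s3 in a row with s2 the target) is reconstructed from the example's description, so treat it as an assumption:

```python
import numpy as np

gamma = 0.9
# Actions: 0 = a_l (left), 1 = a_0 (stay), 2 = a_r (right).
next_state = np.array([[0, 0, 1],      # s1: left bounces back, right enters s2
                       [0, 1, 2],      # s2: the target
                       [1, 2, 2]])     # s3: left enters s2, right bounces back
reward = np.array([[-1.0, 0.0,  1.0],
                   [ 0.0, 1.0,  0.0],
                   [ 1.0, 0.0, -1.0]])

v = np.zeros(3)                                  # v_0
for k in range(3):
    q = reward + gamma * v[next_state]           # q_k(s, a)
    print(f"k={k}, q_k(s1, .) = {q[0]}")
    v = q.max(axis=1)                            # v_{k+1}(s)
# Prints q_0(s1,.) = [-1, 0, 1] and q_1(s1,.) = [-0.1, 0.9, 1.9], matching the text.
```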

四、Optimal Policy

The Bellman optimality equation is a special Bellman equation.

The policy associated with the BOE is an optimal policy.

Suppose $v^*$ is the solution of the BOE, i.e., it satisfies
$$v^{*}=\max_{\pi}\left(r_{\pi}+\gamma P_{\pi}v^{*}\right)$$
Define
$$\pi^*=\arg\max_{\pi}\left(r_\pi+\gamma P_\pi v^*\right)$$
Then
$$v^*=r_{\pi^*}+\gamma P_{\pi^*}v^*$$

The equation $v^*=r_{\pi^*}+\gamma P_{\pi^*}v^*$ is exactly a Bellman equation: a Bellman equation always solves for the state values under a given policy, and here that given policy is $\pi^*$, so $v^*=v_{\pi^*}$ is the state value of $\pi^*$.

Therefore, the Bellman optimality equation is a special Bellman equation.

Conclusions:

The $v^*$ obtained by solving the BOE is the greatest (optimal) state value, and $\pi^*$ is the corresponding optimal policy.

What does $\pi^*$ look like?

At every state $s$, $\pi^*$ selects the action with the greatest action value $q(s,a)$.

  • The policy is deterministic (at every state and every iteration step, the chosen action is uniquely determined by the action values $q(s,a)$);
  • The policy is greedy (at every state and every iteration step, it selects the action that maximizes the current action value $q(s,a)$, without looking further ahead).

五、What Determines the Optimal Policy

What determines the optimal policy? The answer can be found in the BOE itself.

Three factors:

  • Reward design: $r$
  • System model: $p(s'|s,a),\ p(r|s,a)$
  • Discount rate: $\gamma$

The unknowns to be computed:

  • $v(s)$
  • $v(s')$
  • $\pi(a|s)$

When $\gamma$ is large, the agent is far-sighted: rewards far in the future carry relatively more weight in the return.

When $\gamma$ is small, the agent is short-sighted: near-term rewards carry relatively more weight in the return.
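As a rough numeric illustration (not from the original post): in the discounted return $G_t=\sum_{k\geq 0}\gamma^k R_{t+k+1}$, a reward received 10 steps in the future is weighted by $\gamma^{10}$, and
$$\gamma^{10}\approx 0.35\ \ (\gamma=0.9),\qquad \gamma^{10}\approx 0.001\ \ (\gamma=0.5),$$
so a small $\gamma$ makes distant rewards almost irrelevant to the return.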

1、Examples

We now use an example to show how changing $r$ and $\gamma$ changes the optimal policy. The optimal policy and the corresponding optimal state values are obtained by solving the BOE.

The optimal policy takes a risk: it passes through the forbidden area.

If $\gamma = 0.9$ is changed to $\gamma = 0.5$:

After $\gamma$ is reduced, the optimal policy becomes short-sighted and avoids entering any forbidden area.

现在将 γ γ γ 降为 0 0 0:

在这里插入图片描述

现在的最优策略变得 extremely short-sighted。选择具有最大的immediate reward的action,不能到达target!

If the punishment for entering the forbidden area is increased (from $r_{forbidden}=-1$ to $r_{forbidden}=-10$):

The optimal policy now goes around the forbidden area (here $\gamma = 0.9$).

Consider one more case: if every reward is transformed as $r\to ar+b$, will the optimal policy change? For example, suppose the rewards

$$r_{boundary}=r_{forbidden}=-1,\qquad r_{target}=1$$

(with the implicit $r_{otherstep}=0$) are changed to

$$r_{boundary}=r_{forbidden}=0,\qquad r_{target}=2,\qquad r_{otherstep}=1$$

The optimal policy does not change. What matters is not the absolute reward values but the relative reward values.

2、Theorem: Optimal Policy Invariance
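The theorem behind this observation (as presented in the referenced course; paraphrased here) states roughly the following: if every reward is transformed affinely, $r\to\alpha r+\beta$ with $\alpha>0$, then the optimal state values are transformed affinely as well,
$$v'^{*}=\alpha v^{*}+\frac{\beta}{1-\gamma}\mathbf{1},$$
where $\mathbf{1}$ is the all-ones vector, while the optimal policies remain unchanged: scaling by a positive constant and adding the same constant to every action value do not change which action attains $\max_a q(s,a)$.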

Another example:

Clearly, policy (a) on the left is optimal while policy (b) on the right is not.
The question is: why is policy (b) not optimal? Why does the optimal policy not take meaningless detours? Because even though taking a detour incurs no punishment, the discount rate still penalizes it!

When designing rewards, even if the default per-step reward is set to 0, the optimal policy will not take detours: besides the reward itself, the discount rate $\gamma$ also discourages detours, because the longer the detour, the later the rewards arrive and the smaller the resulting return.
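A quick way to see this (assuming, as in these grid-world examples, that the per-step reward is 0 and the agent keeps collecting $r_{target}=1$ after it reaches the target): if the target is first reached after $k$ steps, the discounted return is, up to the indexing convention,
$$G=\gamma^{k}+\gamma^{k+1}+\cdots=\frac{\gamma^{k}}{1-\gamma},$$
which strictly decreases as the detour length $k$ grows, for any $\gamma\in(0,1)$.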




3.1 Motivating example: How to improve policies?

Consider the policy shown in Figure 3.2. Here, the orange and blue cells represent the forbidden and target areas, respectively. The policy here is not good because it selects $a_2$ (rightward) at state $s_1$. How can we improve the given policy to obtain a better policy? The answer lies in state values and action values.

Intuition: It is intuitively clear that the policy can improve if it selects $a_3$ (downward) instead of $a_2$ (rightward) at $s_1$. This is because moving downward enables the agent to avoid entering the forbidden area.

Mathematics: The above intuition can be realized based on the calculation of state values and action values.

This example illustrates that we can obtain a better policy if we update the policy to select the action with the greatest action value. This is the basic idea of many reinforcement learning algorithms.

This example is very simple in the sense that the given policy is only not good at state $s_1$. If the policy is also not good at other states, will selecting the action with the greatest action value still generate a better policy? Moreover, do optimal policies always exist? What does an optimal policy look like? We will answer all of these questions in this chapter.

3.2 Optimal state values and optimal policies

While the ultimate goal of reinforcement learning is to obtain optimal policies, it is necessary to first define what an optimal policy is.

The definition is based on state values.

In particular, consider two given policies $\pi_1$ and $\pi_2$. If the state value of $\pi_1$ is greater than or equal to that of $\pi_2$ for every state, then $\pi_1$ is said to be better than $\pi_2$:

$$
v_{\pi_1}(s)\geq v_{\pi_2}(s),\quad\text{for all }s\in\mathcal{S}.
$$

Furthermore, if a policy is better than all the other possible policies, then this policy is optimal.

Definition 3.1 (Optimal policy and optimal state value). A policy $\pi^*$ is optimal if $v_{\pi^*}(s)\geq v_\pi(s)$ for all $s\in\mathcal{S}$ and for any other policy $\pi$. The state values of $\pi^*$ are the optimal state values.

The above definition indicates that an optimal policy has the greatest state value for every state compared to all the other policies. This definition also leads to many questions:

  • Existence: Does the optimal policy exist?
  • Uniqueness: Is the optimal policy unique?
  • Stochasticity: Is the optimal policy stochastic or deterministic?
  • Algorithm: How to obtain the optimal policy and the optimal state values?

These fundamental questions must be clearly answered to thoroughly understand optimal policies.

For example, regarding the existence of optimal policies, if optimal policies do not exist, then we do not need to bother to design algorithms to find them.

3.3 Bellman optimality equation

The tool for analyzing optimal policies and optimal state values is the Bellman optimality equation (BOE).

By solving this equation, we can obtain optimal policies and optimal state values.

We next present the expression of the BOE and then analyze it in detail.

The Bellman equation:
$$
\begin{aligned}
v_{\pi}(s)&=\mathbb{E}[R_{t+1}\mid S_t=s]+\gamma\,\mathbb{E}[G_{t+1}\mid S_t=s]\\
&=\underbrace{\sum_{a\in\mathcal{A}}\pi(a|s)\sum_{r\in\mathcal{R}}p(r|s,a)\,r}_{\text{mean of immediate rewards}}+\underbrace{\gamma\sum_{a\in\mathcal{A}}\pi(a|s)\sum_{s'\in\mathcal{S}}p(s'|s,a)\,v_{\pi}(s')}_{\text{mean of future rewards}}\\
&=\sum_{a\in\mathcal{A}}\pi(a|s)\left[\sum_{r\in\mathcal{R}}p(r|s,a)\,r+\gamma\sum_{s'\in\mathcal{S}}p(s'|s,a)\,v_{\pi}(s')\right],\quad\text{for all }s\in\mathcal{S}.
\end{aligned}
$$
For every $s\in\mathcal{S}$, the elementwise expression of the BOE is

$$
\begin{aligned}
v(s)&=\max_{\pi(s)\in\Pi(s)}\sum_{a\in\mathcal{A}}\pi(a|s)\left(\sum_{r\in\mathcal{R}}p(r|s,a)\,r+\gamma\sum_{s'\in\mathcal{S}}p(s'|s,a)\,v(s')\right)\\
&=\max_{\pi(s)\in\Pi(s)}\sum_{a\in\mathcal{A}}\pi(a|s)\,q(s,a),
\end{aligned}
$$

where $v(s)$ and $v(s')$ are unknown variables to be solved, and

$$
q(s,a)\doteq\sum_{r\in\mathcal{R}}p(r|s,a)\,r+\gamma\sum_{s'\in\mathcal{S}}p(s'|s,a)\,v(s').
$$

Here, $\pi(s)$ denotes a policy for state $s$, and $\Pi(s)$ is the set of all possible policies for $s$.

The BOE is an elegant and powerful tool for analyzing optimal policies.

However, it may be nontrivial to understand this equation. For example, this equation has two unknown variables, $v(s)$ and $\pi(a|s)$.

It may be confusing to beginners how to solve two unknown variables from one equation.

Moreover, the BOE is actually a special Bellman equation.

However, this is nontrivial to see, since its expression is quite different from that of the Bellman equation.

We also need to answer the following fundamental questions about the BOE.

  • Existence: Does this equation have a solution?
  • Uniqueness: Is the solution unique?
  • Algorithm: How to solve this equation?
  • Optimality: How is the solution related to optimal policies?

Once we can answer these questions, we will clearly understand optimal state values and optimal policies.

3.3.1 Maximization of the right-hand side of the BOE

We next clarify how to solve the maximization problem on the right-hand side of the BOE.

At first glance, it may be confusing to beginners how to solve two unknown variables, $v(s)$ and $\pi(a|s)$, from one equation.

In fact, these two unknown variables can be solved one by one.

This idea is illustrated by the following example.

Example 3.1. Consider two unknown variables $x,y\in\mathbb{R}$ that satisfy

$$x=\max_{y\in\mathbb{R}}(2x-1-y^2).$$

The first step is to solve $y$ on the right-hand side of the equation. Regardless of the value of $x$, we always have $\max_y(2x-1-y^2)=2x-1$, where the maximum is achieved when $y=0$. The second step is to solve $x$. When $y=0$, the equation becomes $x=2x-1$, which leads to $x=1$. Therefore, $y=0$ and $x=1$ are the solutions of the equation.

We now turn to the maximization problem on the right-hand side of the BOE. The BOE in (3.1) can be written concisely as

$$
v(s)=\max_{\pi(s)\in\Pi(s)}\sum_{a\in\mathcal{A}}\pi(a|s)\,q(s,a),\quad s\in\mathcal{S}.
$$

Inspired by Example 3.1, we can first solve the optimal $\pi$ on the right-hand side. How to do that? The following example demonstrates its basic idea.
Example 3.2. Given $q_1,q_2,q_3\in\mathbb{R}$, we would like to find the optimal values of $c_1,c_2,c_3$ to maximize

$$\sum_{i=1}^3 c_iq_i=c_1q_1+c_2q_2+c_3q_3,$$

where $c_1+c_2+c_3=1$ and $c_1,c_2,c_3\geq 0$.

Without loss of generality, suppose that $q_3\geq q_1,q_2$. Then, the optimal solution is $c_3^*=1$ and $c_1^*=c_2^*=0$. This is because

$$q_3=(c_1+c_2+c_3)q_3=c_1q_3+c_2q_3+c_3q_3\geq c_1q_1+c_2q_2+c_3q_3$$

for any $c_1,c_2,c_3$.

Inspired by the above example, since $\sum_a\pi(a|s)=1$, we have

$$
\sum_{a\in\mathcal{A}}\pi(a|s)\,q(s,a)\leq\sum_{a\in\mathcal{A}}\pi(a|s)\max_{a\in\mathcal{A}}q(s,a)=\max_{a\in\mathcal{A}}q(s,a),
$$

where equality is achieved when

$$
\pi(a|s)=\begin{cases}1,&a=a^*,\\0,&a\neq a^*,\end{cases}
$$

Here, $a^*=\arg\max_a q(s,a)$. In summary, the optimal policy $\pi(s)$ is the one that selects the action that has the greatest value of $q(s,a)$.
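A tiny numeric check of this argument (the action values below are the ones computed for $s_1$ in the motivating example; the random stochastic policy is just an arbitrary assumption):

```python
import numpy as np

q = np.array([6.2, 8.0, 9.0, 6.2, 7.2])       # q(s1, a_i) from the earlier example
rng = np.random.default_rng(0)

pi_random = rng.dirichlet(np.ones(len(q)))    # an arbitrary stochastic policy pi(.|s1)
pi_greedy = np.eye(len(q))[q.argmax()]        # one-hot policy on a* = argmax_a q(s1, a)

print(pi_random @ q <= q.max())               # True: a weighted average never beats the max
print(pi_greedy @ q == q.max())               # True: the greedy policy attains the max
```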

3.3.2 Matrix-vector form of the BOE

The BOE refers to a set of equations defined for all states. If we combine these equations, we can obtain a concise matrix-vector form, which will be extensively used in this chapter.

The matrix-vector form of the BOE is

$$
v=\max_{\pi\in\Pi}(r_\pi+\gamma P_\pi v),
$$

where $v\in\mathbb{R}^{|\mathcal{S}|}$ and the maximization is performed in an elementwise manner. The structures of $r_\pi$ and $P_\pi$ are the same as those in the matrix-vector form of the normal Bellman equation:

$$
[r_\pi]_s\doteq\sum_{a\in\mathcal{A}}\pi(a|s)\sum_{r\in\mathcal{R}}p(r|s,a)\,r,\qquad [P_\pi]_{s,s'}=p(s'|s)\doteq\sum_{a\in\mathcal{A}}\pi(a|s)\,p(s'|s,a).
$$

Since the optimal value of $\pi$ is determined by $v$, the right-hand side of (3.2) is a function of $v$, denoted as

$$
f(v)\doteq\max_{\pi\in\Pi}(r_\pi+\gamma P_\pi v).
$$
Then, the BOE can be expressed in a concise form as

$$
v=f(v).
$$

3.3.3 Contraction mapping theorem

Since the BOE can be expressed as a nonlinear equation $v=f(v)$, we next introduce the contraction mapping theorem [6] to analyze it. The contraction mapping theorem is a powerful tool for analyzing general nonlinear equations. It is also known as the fixed-point theorem. Readers who already know this theorem can skip this part. Otherwise, the reader is advised to become familiar with this theorem, since it is the key to analyzing the BOE.

Consider a function $f(x)$, where $x\in\mathbb{R}^d$ and $f:\mathbb{R}^d\to\mathbb{R}^d$.

A point $x^*$ is called a fixed point if

$$f(x^*)=x^*.$$

The interpretation of the above equation is that the map of $x^*$ is itself. This is the reason why $x^*$ is called "fixed". The function $f$ is a contraction mapping (or contractive function) if there exists $\gamma\in(0,1)$ such that

$$\|f(x_1)-f(x_2)\|\leq\gamma\|x_1-x_2\|$$

for any $x_1,x_2\in\mathbb{R}^d$. In this book, $\|\cdot\|$ denotes a vector or matrix norm.
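As a numeric spot-check (not a proof) of the property used earlier, namely that $f(v)=\max_\pi(r_\pi+\gamma P_\pi v)$ is a $\gamma$-contraction in the infinity norm, one can sample a random tabular model (an arbitrary assumption) and compare both sides:

```python
import numpy as np

rng = np.random.default_rng(1)
n_states, n_actions, gamma = 4, 3, 0.9

P = rng.random((n_states, n_actions, n_states))
P /= P.sum(axis=2, keepdims=True)            # each p(.|s, a) sums to 1
R = rng.normal(size=(n_states, n_actions))   # expected rewards

f = lambda v: (R + gamma * (P @ v)).max(axis=1)   # [f(v)]_s = max_a q(s, a)

for _ in range(5):
    v1, v2 = rng.normal(size=n_states), rng.normal(size=n_states)
    lhs = np.abs(f(v1) - f(v2)).max()        # ||f(v1) - f(v2)||_inf
    rhs = gamma * np.abs(v1 - v2).max()      # gamma * ||v1 - v2||_inf
    print(lhs <= rhs + 1e-12)                # expected: True every time
```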




【强化学习的数学原理】课程:从零开始到透彻理解(完结)
MathFoundationRL/Book-Mathmatical-Foundation-of-Reinforcement-Learning
学习笔记-强化学习4-用Banach不动点定理证明Value-based RL收敛性
【强化学习】强化学习数学基础:贝尔曼最优公式
学习心得-强化学习【贝尔曼最优公式】


Reposted from blog.csdn.net/u013250861/article/details/134797110