Chapter 3 Markov Decision Processes (MDP)

This chapter draws on *Reinforcement Learning: An Introduction* and David Silver's reinforcement learning lecture course. Most of it follows David Silver's slides; I recommend reading the slides directly, and here I only point out the places where it is easy to make mistakes.


Markov processes are the foundation of reinforcement learning.


Finite Markov Decision Processes

Markov property

A state $S_t$ is Markov if and only if

$$P[S_{t+1} \mid S_t] = P[S_{t+1} \mid S_1, \ldots, S_t]$$

  • The state captures all relevant information from the history
  • Once the state is known, the history may be thrown away
  • i.e. The state is a sufficient statistic of the future

A Markov process is a memoryless random process, i.e. a sequence of random states $S_1, S_2, \ldots$ with the Markov property.
Markov Process

A Markov Process (or Markov Chain) is a tuple $\langle S, P \rangle$

  • S is a (finite) set of states
  • P is a state transition probability matrix, $P_{ss'} = P[S_{t+1} = s' \mid S_t = s]$

A Markov reward process is a Markov chain with values.
Markov Reward Process

A Markov Reward Process is a tuple $\langle S, P, R, \gamma \rangle$

  • S is a (finite) set of states
  • P is a state transition probability matrix, $P_{ss'} = P[S_{t+1} = s' \mid S_t = s]$
  • R is a reward function, $R_s = E[R_{t+1} \mid S_t = s]$
  • $\gamma$ is a discount factor, $\gamma \in [0, 1]$

Note the definition of $P_{ss'}$ here: it is the probability of transitioning from state $s$ to state $s'$.

Because of the name (return), this definition is easy to forget later; note that it is not the same as the single-step reward above.
Return

The return $G_t$ is the total discounted reward from time-step $t$.

$$G_t = R_{t+1} + \gamma R_{t+2} + \cdots = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}$$

  • The discount $\gamma \in [0, 1]$ is the present value of future rewards
  • The value of receiving reward $R$ after $k+1$ time-steps is $\gamma^k R$
    • $\gamma$ close to 0 leads to "myopic" evaluation
    • $\gamma$ close to 1 leads to "far-sighted" evaluation
      Many of the methods introduced later take the far-sighted view.
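As a small illustration of the return and the effect of the discount factor, here is a minimal Python sketch (the reward list and the helper name `discounted_return` are made up for this example):

```python
def discounted_return(rewards, gamma):
    """Compute G_t = sum_k gamma^k * R_{t+k+1} for a finite reward sequence.

    `rewards` holds R_{t+1}, R_{t+2}, ... observed after time-step t.
    """
    g = 0.0
    # Accumulate backwards: G_t = R_{t+1} + gamma * G_{t+1}.
    for r in reversed(rewards):
        g = r + gamma * g
    return g


rewards = [1.0, 0.0, 2.0, -1.0]
print(discounted_return(rewards, gamma=0.1))   # nearly myopic: ~1.019
print(discounted_return(rewards, gamma=0.99))  # far-sighted: ~1.99
```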

Value Function

The state-value function $v(s)$ of an MRP is the expected return starting from state $s$

$$v(s) = E[G_t \mid S_t = s]$$

It is worth taking a look at the Bellman equation for an MRP and comparing it with the MDP case. An MRP involves no notion of actions at all. Since the MDP is the real protagonist of reinforcement learning, I skip the MRP example from David Silver's slides, because it can easily cause confusion when you get to the MDP later.
A quick look at the Bellman equation:

$$v(s) = E[G_t \mid S_t = s] = E[R_{t+1} + \gamma v(S_{t+1}) \mid S_t = s]$$

State transitions in an MRP are not influenced by any action; we will consider the influence of actions in the MDP later.
MRP state transition:
$$v(s) = R_s + \gamma \sum_{s' \in S} P_{ss'} v(s')$$

Observe that the computation above is dynamic programming; noting that the Bellman equation is also called the dynamic programming equation, the computation becomes easy to understand.
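Because the MRP Bellman equation is linear in $v$, it can also be solved in closed form as $v = (I - \gamma P)^{-1} R$. Here is a minimal NumPy sketch with a made-up 3-state MRP (the transition matrix, reward vector, and discount are all hypothetical):

```python
import numpy as np

# Hypothetical 3-state MRP: transition matrix P, reward vector R, discount gamma.
P = np.array([[0.5, 0.5, 0.0],
              [0.0, 0.5, 0.5],
              [0.0, 0.0, 1.0]])   # each row sums to 1
R = np.array([1.0, 2.0, 0.0])     # R_s = E[R_{t+1} | S_t = s]
gamma = 0.9

# Bellman equation: v = R + gamma * P v  <=>  (I - gamma * P) v = R
v = np.linalg.solve(np.eye(3) - gamma * P, R)
print(v)
```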


A Markov decision process (MDP) is a Markov reward process with decisions. It is an environment in which all states are Markov.
Markov Decision Process

A Markov Decision Process is a tuple $\langle S, A, P, R, \gamma \rangle$

  • S is a (finite) set of states
  • A is a finite set of actions
  • P is a state transition probability matrix, $P_{ss'}^a = P[S_{t+1} = s' \mid S_t = s, A_t = a]$
  • R is a reward function, $R_s^a = E[R_{t+1} \mid S_t = s, A_t = a]$
  • $\gamma$ is a discount factor, $\gamma \in [0, 1]$

Student example for MDP
Note the difference from the MRP above: here the black dots are the intermediate states reached after taking an action, later described by $q(s, a)$; the probability of moving from a black dot to the next state $s'$ is exactly the $P_{ss'}^a = P[S_{t+1} = s' \mid S_t = s, A_t = a]$ defined for the MDP above.

Policy

A policy π is a distribution over actions given states,

$$\pi(a \mid s) = P[A_t = a \mid S_t = s]$$

  • A policy fully defines the behaviour of an agent
  • MDP policies depend on the current state (not the history)
  • i.e. Policies are stationary (time-independent), $A_t \sim \pi(\cdot \mid S_t), \ \forall t > 0$
  • Given an MDP $\mathcal{M} = \langle S, A, P, R, \gamma \rangle$ and a policy $\pi$
  • The state sequence $S_1, S_2, \ldots$ is a Markov process $\langle S, P^\pi \rangle$
  • The state and reward sequence $S_1, R_2, S_2, \ldots$ is a Markov reward process $\langle S, P^\pi, R^\pi, \gamma \rangle$
  • where
    $$P_{s,s'}^\pi = \sum_{a \in A} \pi(a \mid s) P_{ss'}^a, \qquad R_s^\pi = \sum_{a \in A} \pi(a \mid s) R_s^a$$

Pay particular attention to the definition of a policy as a distribution: in the off-policy methods discussed later, the policy that generates the samples and the target policy are different.
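As a concrete version of the $P^\pi$ and $R^\pi$ formulas above, here is a minimal NumPy sketch that collapses an MDP under a fixed policy into an MRP. The array layout (P indexed as [action, state, next state], R as [state, action]) is an assumption made for this example:

```python
import numpy as np

def mdp_to_mrp(P, R, pi):
    """Average the MDP dynamics over a stochastic policy.

    P:  shape [A, S, S], P[a, s, s2] = P^a_{s s2}
    R:  shape [S, A],    R[s, a]     = R^a_s
    pi: shape [S, A],    pi[s, a]    = pi(a | s)
    Returns (P_pi, R_pi) with shapes [S, S] and [S].
    """
    # P^pi_{ss'} = sum_a pi(a|s) * P^a_{ss'}
    P_pi = np.einsum('sa,asn->sn', pi, P)
    # R^pi_s = sum_a pi(a|s) * R^a_s
    R_pi = (pi * R).sum(axis=1)
    return P_pi, R_pi


# Tiny made-up example: 2 states, 2 actions, uniform random policy.
P = np.array([[[0.8, 0.2], [0.1, 0.9]],     # action 0
              [[0.5, 0.5], [0.6, 0.4]]])    # action 1
R = np.array([[1.0, 0.0],
              [0.0, 2.0]])
pi = np.full((2, 2), 0.5)
P_pi, R_pi = mdp_to_mrp(P, R, pi)
print(P_pi, R_pi)
```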

Value Function (this one is for the MDP)

The state-value function $v_\pi(s)$ of an MDP is the expected return starting from state $s$, and then following policy $\pi$

$$v_\pi(s) = E_\pi[G_t \mid S_t = s]$$

The action-value function $q_\pi(s, a)$ is the expected return
starting from state $s$, taking action $a$, and then following policy $\pi$

$$q_\pi(s, a) = E_\pi[G_t \mid S_t = s, A_t = a]$$

Bellman Expectation Equation for $V^{\pi}$

$$v_\pi(s) = \sum_{a \in A} \pi(a \mid s) q_\pi(s, a)$$

Bellman Expectation Equation for $Q^{\pi}$

$$q_\pi(s, a) = R_s^a + \gamma \sum_{s' \in S} P_{ss'}^a v_\pi(s')$$

Bellman Expectation Equation for $v_{\pi}$ (2)

$$v_\pi(s) = \sum_{a \in A} \pi(a \mid s) \Big( R_s^a + \gamma \sum_{s' \in S} P_{ss'}^a v_\pi(s') \Big)$$

Bellman Expectation Equation for $q_{\pi}$ (2)

$$q_\pi(s, a) = R_s^a + \gamma \sum_{s' \in S} P_{ss'}^a \sum_{a' \in A} \pi(a' \mid s') q_\pi(s', a')$$
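These two equations are exactly what iterative policy evaluation sweeps over until convergence. A minimal sketch, reusing the hypothetical array layout from the earlier sketch (P as [A, S, S], R as [S, A], pi as [S, A]):

```python
import numpy as np

def policy_evaluation(P, R, pi, gamma=0.9, tol=1e-8):
    """Repeat v(s) <- sum_a pi(a|s) * (R^a_s + gamma * sum_s' P^a_{ss'} v(s')) until it converges."""
    v = np.zeros(R.shape[0])
    while True:
        # q[s, a] = R^a_s + gamma * sum_s' P^a_{ss'} v(s')
        q = R + gamma * np.einsum('asn,n->sa', P, v)
        # v_pi(s) = sum_a pi(a|s) q_pi(s, a)
        v_new = (pi * q).sum(axis=1)
        if np.max(np.abs(v_new - v)) < tol:
            return v_new
        v = v_new
```

Calling `policy_evaluation(P, R, pi)` with the arrays from the `mdp_to_mrp` sketch returns (up to the tolerance) the same values as solving the induced MRP with the closed-form formula.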

Optimal Value Function

The optimal state-value function $v_*(s)$ is the maximum value function over all policies

$$v_*(s) = \max_\pi v_\pi(s)$$

The optimal action-value function $q_*(s, a)$ is the maximum action-value function over all policies

$$q_*(s, a) = \max_\pi q_\pi(s, a)$$

Once $q_*$ is known the problem is solved, which is more convenient than knowing $v_*$. Also note that above we choose, among all policies $\pi$, the one that maximizes $q$; this is how the values give rise to the notion of an optimal policy. Of course there is no direct way to obtain the result, and later chapters introduce various approximation methods for this problem.

Optimal Policy
Define a partial ordering over policies

$$\pi \geq \pi' \ \text{ if } \ v_\pi(s) \geq v_{\pi'}(s), \ \forall s$$

Finding an Optimal Policy
An optimal policy can be found by maximising over $q_*(s, a)$,

$$\pi_*(a \mid s) = \begin{cases} 1 & \text{if } a = \arg\max_{a \in A} q_*(s, a) \\ 0 & \text{otherwise} \end{cases}$$

If we know $q_*(s, a)$, we can immediately obtain the optimal policy.
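A minimal sketch of this argmax step, assuming $q_*$ is given as a hypothetical [states, actions] NumPy array:

```python
import numpy as np

def greedy_policy(q_star):
    """pi*(a|s) = 1 for the argmax action in each state, 0 otherwise (ties go to the first maximum)."""
    n_states, n_actions = q_star.shape
    pi = np.zeros((n_states, n_actions))
    pi[np.arange(n_states), q_star.argmax(axis=1)] = 1.0
    return pi


q_star = np.array([[1.0, 2.0],
                   [0.5, 0.1]])
print(greedy_policy(q_star))  # [[0. 1.] [1. 0.]]
```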

Bellman Expectation Equation (Sutton & Barto notation)

$$
\begin{aligned}
v_\pi(s) &\doteq E_\pi[G_t \mid S_t = s] = E_\pi\Big[\sum_{k=0}^{\infty} \gamma^k R_{t+k+1} \,\Big|\, S_t = s\Big] \\
&= \sum_a \pi(a \mid s) \sum_{s'} \sum_r p(s', r \mid s, a) \big[ r + \gamma E_\pi[G_{t+1} \mid S_{t+1} = s'] \big] \\
&= \sum_a \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a) \big[ r + \gamma v_\pi(s') \big], \quad \text{for all } s \in S
\end{aligned}
$$

The Agent-Environment Interface
  • The learner and decision maker is called the agent.
  • The thing it interacts with, comprising everything outside the agent, is called the environment.

The MDP and the agent together give rise to a sequence, or trajectory:

$$S_0, A_0, R_1, S_1, A_1, R_2, S_2, A_2, R_3, \ldots$$

The following function defines the dynamics of the MDP: the agent is in some state $s$, takes action $a$ in that state, then arrives in state $s'$ and receives reward $r$. This formula is the key to the MDP; everything can be derived from this four-argument function.

$$p(s', r \mid s, a) \doteq \Pr\{S_t = s', R_t = r \mid S_{t-1} = s, A_{t-1} = a\}$$

for all $s', s \in S$, $r \in R$, and $a \in A(s)$.

(Figure: the agent-environment interaction in a Markov decision process.)

We also have

$$\sum_{s' \in S} \sum_{r \in R} p(s', r \mid s, a) = 1, \quad \text{for all } s \in S,\ a \in A(s)$$
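To make the "everything can be derived from the four-argument function" point concrete, here is a minimal sketch with a made-up tabular $p(s', r \mid s, a)$ stored as a dictionary; the state and action names and the helper functions are hypothetical:

```python
from collections import defaultdict

# Hypothetical tabular dynamics: p[(s, a)] is a list of (s_next, r, prob) triples.
p = {
    ('s0', 'a0'): [('s0', 0.0, 0.5), ('s1', 1.0, 0.5)],
    ('s0', 'a1'): [('s1', 2.0, 1.0)],
}

def check_normalised(p):
    """Each (s, a) entry must satisfy sum_{s', r} p(s', r | s, a) = 1."""
    return all(abs(sum(prob for _, _, prob in outcomes) - 1.0) < 1e-12
               for outcomes in p.values())

def transition_prob(p, s, a):
    """Marginalise out the reward: p(s' | s, a) = sum_r p(s', r | s, a)."""
    probs = defaultdict(float)
    for s_next, _, prob in p[(s, a)]:
        probs[s_next] += prob
    return dict(probs)

def expected_reward(p, s, a):
    """r(s, a) = sum_{s', r} r * p(s', r | s, a)."""
    return sum(r * prob for _, r, prob in p[(s, a)])

print(check_normalised(p))             # True
print(transition_prob(p, 's0', 'a0'))  # {'s0': 0.5, 's1': 0.5}
print(expected_reward(p, 's0', 'a0'))  # 0.5
```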

3.2 Goals and Rewards

The agent's goal is to maximize the total reward it receives.

3.5 Policies and Value Functions

state-value function for policy $\pi$

$$
\begin{aligned}
v_\pi(s) &\doteq E_\pi[G_t \mid S_t = s] = E_\pi\Big[\sum_{k=0}^{\infty} \gamma^k R_{t+k+1} \,\Big|\, S_t = s\Big] \\
&= \sum_a \pi(a \mid s) \sum_{s'} \sum_r p(s', r \mid s, a) \big[ r + \gamma E_\pi[G_{t+1} \mid S_{t+1} = s'] \big] \\
&= \sum_a \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a) \big[ r + \gamma v_\pi(s') \big], \quad \text{for all } s \in S
\end{aligned}
$$

action-value function for policy $\pi$

$$q_\pi(s, a) \doteq E_\pi[G_t \mid S_t = s, A_t = a] = E_\pi\Big[\sum_{k=0}^{\infty} \gamma^k R_{t+k+1} \,\Big|\, S_t = s, A_t = a\Big]$$

For any policy $\pi$ and any state $s$, the Bellman equation for $v_\pi$ above expresses a consistency condition between the value of a state and the values of its possible successor states.

3.6 Optimal Policies and Optimal Value Functions

optimal state-value function

$$v_*(s) \doteq \max_\pi v_\pi(s)$$

optimal action-value function
$$q_*(s, a) \doteq \max_\pi q_\pi(s, a)$$

Writing $q_*$ in terms of $v_*$:

$$q_*(s, a) = E[R_{t+1} + \gamma v_*(S_{t+1}) \mid S_t = s, A_t = a]$$

Bellman optimality equation

$$v_*(s) = \max_{a \in A(s)} q_{\pi_*}(s, a)$$


Bellman Optimality Equation for $V^*$

$$
\begin{aligned}
v_*(s) &= \max_{a \in A(s)} q_{\pi_*}(s, a) \\
&= \max_a E_{\pi_*}[G_t \mid S_t = s, A_t = a] \\
&= \max_a E_{\pi_*}[R_{t+1} + \gamma G_{t+1} \mid S_t = s, A_t = a] \\
&= \max_a E[R_{t+1} + \gamma v_*(S_{t+1}) \mid S_t = s, A_t = a] \\
&= \max_a \sum_{s', r} p(s', r \mid s, a) \big[ r + \gamma v_*(s') \big]
\end{aligned}
$$
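The last line of this derivation can be turned directly into an update rule, which gives value iteration. A minimal sketch, again assuming the hypothetical [A, S, S] transition array and an [S, A] expected-reward array used in the earlier sketches (i.e. the expectation over $r$ has already been taken):

```python
import numpy as np

def value_iteration(P, R, gamma=0.9, tol=1e-8):
    """Iterate v(s) <- max_a [ R^a_s + gamma * sum_s' P^a_{ss'} v(s') ] until convergence."""
    v = np.zeros(R.shape[0])
    while True:
        # q[s, a] = R^a_s + gamma * sum_s' P^a_{ss'} v(s')
        q = R + gamma * np.einsum('asn,n->sa', P, v)
        v_new = q.max(axis=1)
        if np.max(np.abs(v_new - v)) < tol:
            return v_new, q   # q can be fed to the greedy-policy sketch above
        v = v_new
```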


Bellman Optimality Equation for $Q^*$

$$q_*(s, a) = E\Big[R_{t+1} + \gamma \max_{a'} q_*(S_{t+1}, a') \,\Big|\, S_t = s, A_t = a\Big] = \sum_{s', r} p(s', r \mid s, a) \big[ r + \gamma \max_{a'} q_*(s', a') \big]$$
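The same fixed-point idea can also be applied to $q_*$ directly, which is convenient because the greedy policy can be read off the result immediately. A minimal sketch under the same hypothetical array layout:

```python
import numpy as np

def q_value_iteration(P, R, gamma=0.9, tol=1e-8):
    """Iterate q(s,a) <- R^a_s + gamma * sum_s' P^a_{ss'} max_a' q(s', a') until convergence."""
    q = np.zeros_like(R)
    while True:
        # max over a' of q(s', a'), then back up through the dynamics
        q_new = R + gamma * np.einsum('asn,n->sa', P, q.max(axis=1))
        if np.max(np.abs(q_new - q)) < tol:
            return q_new
        q = q_new
```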


Reposted from blog.csdn.net/dengyibing/article/details/80456077