This chapter draws on *Reinforcement Learning: An Introduction* and David Silver's reinforcement learning course. It mostly follows David Silver's slides; I recommend reading the slides directly, and here I only point out the places where it is easy to go wrong.
Markov processes are the foundation of reinforcement learning.
Finite Markov Decision Processes
Markov property
A state $S_t$ is Markov if and only if

$$\mathbb{P}[S_{t+1} \mid S_t] = \mathbb{P}[S_{t+1} \mid S_1, \ldots, S_t]$$
- The state captures all relevant information from the history
- Once the state is know, the history may be thrown away
- i.e. The state is a sufficient statistic of the future
A Markov process is a memoryless random process, i.e. a sequence of random states $S_1, S_2, \ldots$ with the Markov property.
Markov Process
A Markov Process (or Markov Chain) is a tuple $\langle \mathcal{S}, \mathcal{P} \rangle$
- $\mathcal{S}$ is a (finite) set of states
- $\mathcal{P}$ is a state transition probability matrix, $\mathcal{P}_{ss'} = \mathbb{P}[S_{t+1} = s' \mid S_t = s]$
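To make this concrete, here is a minimal Python sketch of sampling a trajectory from a Markov chain. The state names and the transition matrix are made-up toy values, not an example from the slides.

```python
import numpy as np

states = ["Sunny", "Rainy"]              # hypothetical state names
P = np.array([[0.9, 0.1],                # P[s, s'] = Pr(S_{t+1} = s' | S_t = s)
              [0.5, 0.5]])               # each row sums to 1

rng = np.random.default_rng(seed=0)
s = 0                                    # start in "Sunny"
trajectory = [states[s]]
for _ in range(10):
    s = rng.choice(len(states), p=P[s])  # the next state depends only on the current one
    trajectory.append(states[s])
print(" -> ".join(trajectory))
```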
A Markov reward process is a Markov chain with values.
Markov Reward Process
A Markov Reward Process is a tuple $\langle \mathcal{S}, \mathcal{P}, \mathcal{R}, \gamma \rangle$
- $\mathcal{S}$ is a (finite) set of states
- $\mathcal{P}$ is a state transition probability matrix, $\mathcal{P}_{ss'} = \mathbb{P}[S_{t+1} = s' \mid S_t = s]$
- $\mathcal{R}$ is a reward function, $\mathcal{R}_s = \mathbb{E}[R_{t+1} \mid S_t = s]$
- $\gamma$ is a discount factor, $\gamma \in [0, 1]$
Note the definition of $\mathcal{P}_{ss'}$ here: it is the probability of moving from state $s$ to state $s'$.
The definition of the return $G_t$ below is easy to forget later because of its name; it is not the same thing as the single reward $R$ above.
Return
The return $G_t$ is the total discounted reward from time-step $t$,

$$G_t = R_{t+1} + \gamma R_{t+2} + \cdots = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}$$
- The discount $\gamma \in [0, 1]$ is the present value of future rewards
- The value of receiving reward $R$ after $k+1$ time-steps is $\gamma^k R$
- $\gamma$ close to 0 leads to "myopic" evaluation
- $\gamma$ close to 1 leads to "far-sighted" evaluation
Many of the methods introduced later look far ahead, i.e. they are of the far-sighted kind.
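A small sketch of computing the return for a finite reward sequence, using the recursion $G_t = R_{t+1} + \gamma G_{t+1}$; the reward values are made up for illustration.

```python
def discounted_return(rewards, gamma):
    """Compute G_0 for a finite reward sequence R_1, R_2, ..., R_T."""
    g = 0.0
    for r in reversed(rewards):                # G_t = R_{t+1} + gamma * G_{t+1}
        g = r + gamma * g
    return g

rewards = [1.0, 0.0, 2.0, 3.0]                 # made-up rewards R_1..R_4
print(discounted_return(rewards, gamma=0.0))   # myopic: only R_1 counts -> 1.0
print(discounted_return(rewards, gamma=1.0))   # far-sighted: plain sum -> 6.0
```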
Value Function
The state value function $v(s)$ of an MRP is the expected return starting from state $s$,

$$v(s) = \mathbb{E}[G_t \mid S_t = s]$$
It is worth looking at the Bellman Equation for the MRP and comparing it with the MDP version. The MRP involves nothing about actions. Since the MDP is the real protagonist of reinforcement learning, I skip the MRP example in David Silver's slides; dwelling on it can easily cause confusion when it comes to understanding MDPs later.
Let's take a quick look at the Bellman Equation:

$$v(s) = \mathbb{E}[R_{t+1} + \gamma v(S_{t+1}) \mid S_t = s] = \mathcal{R}_s + \gamma \sum_{s' \in \mathcal{S}} \mathcal{P}_{ss'} v(s')$$
State transitions in the MRP are not influenced by any action; we will take actions into account later in the MDP.
In fact, looking at the equation above, what is being computed is dynamic programming; once you note that the Bellman Equation is also known as the dynamic programming equation, the computation above becomes easy to understand.
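Because the Bellman equation is linear, it can also be written in matrix form, $v = \mathcal{R} + \gamma \mathcal{P} v$, and solved directly as $v = (I - \gamma \mathcal{P})^{-1} \mathcal{R}$. A minimal sketch with made-up numbers:

```python
import numpy as np

P = np.array([[0.9, 0.1],        # toy transition matrix
              [0.5, 0.5]])
R = np.array([1.0, -1.0])        # toy expected immediate reward per state
gamma = 0.9

v = np.linalg.solve(np.eye(2) - gamma * P, R)   # v = (I - gamma P)^(-1) R
print(v)                                        # state values v(s)
```

The direct solution costs $O(n^3)$ in the number of states, which is why iterative methods matter for large problems.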
A Markov decision process (MDP) is a Markov reward process with decisions. It is an environment in which all states are Markov.
Markov Decision Process
A Markov Decision Process is a tuple $\langle \mathcal{S}, \mathcal{A}, \mathcal{P}, \mathcal{R}, \gamma \rangle$
- $\mathcal{S}$ is a (finite) set of states
- $\mathcal{A}$ is a finite set of actions
- $\mathcal{P}$ is a state transition probability matrix, $\mathcal{P}^a_{ss'} = \mathbb{P}[S_{t+1} = s' \mid S_t = s, A_t = a]$
- $\mathcal{R}$ is a reward function, $\mathcal{R}^a_s = \mathbb{E}[R_{t+1} \mid S_t = s, A_t = a]$
- $\gamma$ is a discount factor, $\gamma \in [0, 1]$
Note the difference from the MRP above: in the backup diagram, the black dot is the intermediate node reached after taking an action. We later use $(s, a)$ to denote this node, and the probability of going from the black dot to a successor state $s'$ is exactly the $\mathcal{P}^a_{ss'}$ defined in the MDP above.
Policy
A policy $\pi$ is a distribution over actions given states,

$$\pi(a \mid s) = \mathbb{P}[A_t = a \mid S_t = s]$$
- A policy fully defines the behaviour of an agent
- MDP policies depend on the current state (not the history)
- i.e. Policies are stationary (time-independent), $A_t \sim \pi(\cdot \mid S_t), \forall t > 0$
- Given an MDP $\mathcal{M} = \langle \mathcal{S}, \mathcal{A}, \mathcal{P}, \mathcal{R}, \gamma \rangle$ and a policy $\pi$
- The state sequence $S_1, S_2, \ldots$ is a Markov process $\langle \mathcal{S}, \mathcal{P}^\pi \rangle$
- The state and reward sequence $S_1, R_2, S_2, \ldots$ is a Markov reward process $\langle \mathcal{S}, \mathcal{P}^\pi, \mathcal{R}^\pi, \gamma \rangle$
- where $\mathcal{P}^\pi_{ss'} = \sum_{a \in \mathcal{A}} \pi(a \mid s) \mathcal{P}^a_{ss'}$ and $\mathcal{R}^\pi_s = \sum_{a \in \mathcal{A}} \pi(a \mid s) \mathcal{R}^a_s$
Pay special attention to the definition of the policy as a distribution: in the off-policy methods discussed later, the policy that generates the samples and the target policy are different.
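The reduction above (a fixed policy turns an MDP into an MRP) is easy to compute explicitly. Below is a minimal sketch; the two-state, two-action MDP and the policy are made-up toy values.

```python
import numpy as np

# toy MDP: 2 states, 2 actions (all numbers made up)
P = np.array([[[0.8, 0.2],               # P[a, s, s'] = P^a_{ss'}
               [0.3, 0.7]],
              [[0.5, 0.5],
               [0.1, 0.9]]])
R = np.array([[1.0, 0.0],                # R[a, s] = R^a_s
              [0.5, 2.0]])
pi = np.array([[0.6, 0.4],               # pi[s, a] = pi(a|s)
               [0.2, 0.8]])

P_pi = np.einsum("sa,asn->sn", pi, P)    # P^pi_{ss'} = sum_a pi(a|s) P^a_{ss'}
R_pi = np.einsum("sa,as->s", pi, R)      # R^pi_s    = sum_a pi(a|s) R^a_s
print(P_pi)                              # rows still sum to 1
print(R_pi)
```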
Value Function
This one is defined for an MDP.
The state-value function $v_\pi(s)$ of an MDP is the expected return starting from state $s$, and then following policy $\pi$

$$v_\pi(s) = \mathbb{E}_\pi[G_t \mid S_t = s]$$
The action-value function $q_\pi(s, a)$ is the expected return starting from state $s$, taking action $a$, and then following policy $\pi$

$$q_\pi(s, a) = \mathbb{E}_\pi[G_t \mid S_t = s, A_t = a]$$
Bellman Expectation Equation for $v_\pi$

$$v_\pi(s) = \mathbb{E}_\pi[R_{t+1} + \gamma v_\pi(S_{t+1}) \mid S_t = s]$$

Bellman Expectation Equation for $q_\pi$

$$q_\pi(s, a) = \mathbb{E}_\pi[R_{t+1} + \gamma q_\pi(S_{t+1}, A_{t+1}) \mid S_t = s, A_t = a]$$
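A minimal sketch of iterative policy evaluation, i.e. repeatedly applying the Bellman expectation equations until $v_\pi$ and $q_\pi$ converge; it reuses the same made-up MDP as the previous sketch (re-declared so the snippet runs on its own).

```python
import numpy as np

# same made-up MDP as above
P = np.array([[[0.8, 0.2], [0.3, 0.7]],
              [[0.5, 0.5], [0.1, 0.9]]])     # P[a, s, s']
R = np.array([[1.0, 0.0], [0.5, 2.0]])       # R[a, s]
pi = np.array([[0.6, 0.4], [0.2, 0.8]])      # pi[s, a]
gamma = 0.9

v = np.zeros(2)
for _ in range(10_000):
    q = R + gamma * np.einsum("asn,n->as", P, v)  # Bellman expectation for q_pi
    v_new = np.einsum("sa,as->s", pi, q)          # v_pi(s) = sum_a pi(a|s) q_pi(s,a)
    if np.max(np.abs(v_new - v)) < 1e-10:
        break
    v = v_new
print("v_pi:", v)
print("q_pi:", q)
```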
Optimal Value Function
The optimal state-value function $v_*(s)$ is the maximum value function over all policies

$$v_*(s) = \max_\pi v_\pi(s)$$

The optimal action-value function $q_*(s, a)$ is the maximum action-value function over all policies

$$q_*(s, a) = \max_\pi q_\pi(s, a)$$
Once $q_*$ is known, the problem is solved, and it is more convenient than knowing $v_*$. Also note that above we pick, among all policies $\pi$, the one that maximizes $v_\pi$ (or $q_\pi$); this is the sense in which the optimal value function gives the optimal policy. Of course there is no direct way to compute it; various approximation methods for this problem will be introduced later.
Optimal Policy
Define a partial ordering over policies

$$\pi \geq \pi' \text{ if } v_\pi(s) \geq v_{\pi'}(s), \forall s$$
Finding an Optimal Policy
An optimal policy can be found by maximising over $q_*(s, a)$,

$$\pi_*(a \mid s) = \begin{cases} 1 & \text{if } a = \underset{a \in \mathcal{A}}{\operatorname{argmax}}\, q_*(s, a) \\ 0 & \text{otherwise} \end{cases}$$
If we know $q_*(s, a)$, we can immediately read off the optimal policy.
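Reading off the greedy policy from $q_*$ is a one-liner; the $q_*$ table below is made up for illustration.

```python
import numpy as np

q_star = np.array([[1.0, 3.0],   # q_star[s, a], toy values
                   [2.0, 0.5]])
pi_star = np.zeros_like(q_star)
pi_star[np.arange(len(q_star)), q_star.argmax(axis=1)] = 1.0
print(pi_star)                   # one-hot greedy action per state
```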
Bellman Optimality Equation
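These equations relate $v_*$ and $q_*$ to each other:

$$v_*(s) = \max_a q_*(s, a)$$

$$q_*(s, a) = \mathcal{R}^a_s + \gamma \sum_{s' \in \mathcal{S}} \mathcal{P}^a_{ss'} v_*(s')$$

Substituting one into the other gives the recursive form, e.g. $v_*(s) = \max_a \left( \mathcal{R}^a_s + \gamma \sum_{s' \in \mathcal{S}} \mathcal{P}^a_{ss'} v_*(s') \right)$. Unlike the Bellman expectation equation, this one is non-linear (because of the max) and has no closed-form solution in general.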
The Agent-Environment Interface
- The learner and decision maker is called the agent.
- The thing it interacts with, comprising everything outside the agent, is called the environment.
The MDP and the agent together generate a sequence, or trajectory:

$$S_0, A_0, R_1, S_1, A_1, R_2, S_2, A_2, R_3, \ldots$$
The following function defines the dynamics of the MDP: the agent is in some state $s$, takes action $a$ in that state, then arrives in state $s'$ and receives reward $r$. This formula is the key to the MDP; from this four-argument function, everything else can be derived.
$$p(s', r \mid s, a) \doteq \Pr\{S_t = s', R_t = r \mid S_{t-1} = s, A_{t-1} = a\}$$

for all $s', s \in \mathcal{S}$, $r \in \mathcal{R}$, and $a \in \mathcal{A}(s)$, where

$$\sum_{s' \in \mathcal{S}} \sum_{r \in \mathcal{R}} p(s', r \mid s, a) = 1, \quad \text{for all } s \in \mathcal{S}, a \in \mathcal{A}(s)$$
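One natural way to represent this four-argument function in code is a table mapping $(s, a)$ to the possible $(s', r)$ outcomes with their probabilities; the states, actions, and rewards below are made-up toy values.

```python
# p[(s, a)] -> list of (s_next, r, prob) triples; all values are made up
p = {
    ("s0", "a0"): [("s0", 0.0, 0.7), ("s1", 1.0, 0.3)],
    ("s0", "a1"): [("s1", 2.0, 1.0)],
    ("s1", "a0"): [("s0", 0.0, 1.0)],
    ("s1", "a1"): [("s1", -1.0, 0.4), ("s0", 5.0, 0.6)],
}

# normalization: the probabilities must sum to 1 for every (s, a) pair
for (s, a), outcomes in p.items():
    total = sum(prob for _, _, prob in outcomes)
    assert abs(total - 1.0) < 1e-12, (s, a)
```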
3.2 Goals and Rewards
The agent's goal is to maximize the total amount of reward it receives.
3.5 Policies and Value Functions
state-value function for policy $\pi$

$$v_\pi(s) \doteq \mathbb{E}_\pi[G_t \mid S_t = s]$$

action-value function for policy $\pi$

$$q_\pi(s, a) \doteq \mathbb{E}_\pi[G_t \mid S_t = s, A_t = a]$$
For any policy $\pi$ and any state $s$, the following consistency condition holds between the state-value of $s$ and the state-values of its possible successor states:

$$v_\pi(s) = \sum_a \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a) \left[ r + \gamma v_\pi(s') \right]$$
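The same consistency condition can be turned directly into a policy-evaluation sweep in Sutton & Barto's four-argument notation. This sketch reuses the toy dynamics dict from above (re-declared to stay self-contained), with a made-up uniform random policy.

```python
states, actions, gamma = ["s0", "s1"], ["a0", "a1"], 0.9

# p[(s, a)] -> list of (s_next, r, prob); same made-up dynamics as above
p = {
    ("s0", "a0"): [("s0", 0.0, 0.7), ("s1", 1.0, 0.3)],
    ("s0", "a1"): [("s1", 2.0, 1.0)],
    ("s1", "a0"): [("s0", 0.0, 1.0)],
    ("s1", "a1"): [("s1", -1.0, 0.4), ("s0", 5.0, 0.6)],
}
pi = {(s, a): 0.5 for s in states for a in actions}   # uniform random policy

v = {s: 0.0 for s in states}
for _ in range(10_000):
    # v(s) <- sum_a pi(a|s) sum_{s',r} p(s',r|s,a) [r + gamma v(s')]
    v_new = {s: sum(pi[(s, a)] * prob * (r + gamma * v[sn])
                    for a in actions
                    for sn, r, prob in p[(s, a)])
             for s in states}
    if max(abs(v_new[s] - v[s]) for s in states) < 1e-10:
        break
    v = v_new
print(v)
```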
3.6 Optimal Policies and Optimal Value Functions
optimal state-value function

$$v_*(s) \doteq \max_\pi v_\pi(s)$$

optimal action-value function

$$q_*(s, a) \doteq \max_\pi q_\pi(s, a)$$
Writing out the Bellman optimality equation for $v_*$:

$$v_*(s) = \max_a \sum_{s', r} p(s', r \mid s, a) \left[ r + \gamma v_*(s') \right]$$
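Iterating this equation as an update rule is exactly value iteration. A minimal sketch on the same made-up dynamics dict as above:

```python
states, actions, gamma = ["s0", "s1"], ["a0", "a1"], 0.9
p = {
    ("s0", "a0"): [("s0", 0.0, 0.7), ("s1", 1.0, 0.3)],
    ("s0", "a1"): [("s1", 2.0, 1.0)],
    ("s1", "a0"): [("s0", 0.0, 1.0)],
    ("s1", "a1"): [("s1", -1.0, 0.4), ("s0", 5.0, 0.6)],
}

v = {s: 0.0 for s in states}
for _ in range(10_000):
    # v(s) <- max_a sum_{s',r} p(s',r|s,a) [r + gamma v(s')]
    v_new = {s: max(sum(prob * (r + gamma * v[sn])
                        for sn, r, prob in p[(s, a)])
                    for a in actions)
             for s in states}
    if max(abs(v_new[s] - v[s]) for s in states) < 1e-10:
        break
    v = v_new
print(v)   # approximates v_*(s)
```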