Thanks to Richard S. Sutton and Andrew G. Barto for their great work, Reinforcement Learning: An Introduction (2nd Edition).
Here we summarize some basic notions and formulations that appear in most reinforcement learning problems. This note does NOT include a detailed explanation of each notion. Refer to the reference above if you want deeper insight.
Markov decision processes (MDPs) are a classical formalization of sequential decision making, where actions influence not just immediate rewards but also subsequent situations, or states, and, through those, future rewards. MDPs are a mathematically idealized form of the reinforcement learning problem.
Agent-Environment Interface
MDPs are meant to be a straightforward framing of the problem of learning from interaction to achieve a goal. The learner and decision maker is called the agent. The thing it interacts with, comprising everything outside the agent, is called the environment. These interact continually, the agent selecting actions and the environment responding to these actions and presenting new situations (states) to the agent. The environment also gives rise to rewards, special numerical values that the agent seeks to maximize over time through its choice of actions.
More specifically, the agent and environment interact at each of a sequence of discrete time steps, $t = 0, 1, 2, 3, \dots$. At each time step $t$, the agent receives some representation of the environment's state, $S_t \in \mathcal{S}$, and on that basis selects an action, $A_t \in \mathcal{A}(s)$. One time step later, in part as a consequence of its action, the agent receives a numerical reward, $R_{t+1} \in \mathcal{R} \subset \mathbb{R}$, and finds itself in a new state, $S_{t+1}$. The MDP and agent together thereby give rise to a sequence or trajectory that begins like this:

$$S_0, A_0, R_1, S_1, A_1, R_2, S_2, A_2, R_3, \dots$$
In a finite MDP, the sets of states, actions, and rewards ($\mathcal{S}$, $\mathcal{A}$, and $\mathcal{R}$) all have a finite number of elements. In this case, the random variables $R_t$ and $S_t$ have well-defined discrete probability distributions dependent only on the preceding state and action:

$$p(s', r \mid s, a) \doteq \Pr\{S_t = s', R_t = r \mid S_{t-1} = s, A_{t-1} = a\}$$
One can compute anything else one might want to know about the environment, such as the state-transition probabilities:

$$p(s' \mid s, a) \doteq \Pr\{S_t = s' \mid S_{t-1} = s, A_{t-1} = a\} = \sum_{r \in \mathcal{R}} p(s', r \mid s, a)$$
We can also compute the expected rewards for state-action pairs:

$$r(s, a) \doteq \mathbb{E}[R_t \mid S_{t-1} = s, A_{t-1} = a] = \sum_{r \in \mathcal{R}} r \sum_{s' \in \mathcal{S}} p(s', r \mid s, a)$$
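As a concrete illustration, the four-argument dynamics function $p(s', r \mid s, a)$ of a small finite MDP can be stored as a table and the two derived quantities computed from it. The two-state MDP below is a made-up example, not one from the book:

```python
# Dynamics of a small, made-up finite MDP, stored as
# p[(s, a)] = {(s_next, r): probability}.
p = {
    ("s0", "go"): {("s1", 1.0): 0.8, ("s0", 0.0): 0.2},
    ("s1", "go"): {("s0", 0.0): 1.0},
}

def state_transition_prob(p, s, a, s_next):
    """p(s' | s, a): marginalize the joint dynamics over rewards."""
    return sum(prob for (sn, r), prob in p[(s, a)].items() if sn == s_next)

def expected_reward(p, s, a):
    """r(s, a): expected one-step reward, summing r * p(s', r | s, a)."""
    return sum(r * prob for (sn, r), prob in p[(s, a)].items())

print(state_transition_prob(p, "s0", "go", "s1"))  # 0.8
print(expected_reward(p, "s0", "go"))              # 0.8
```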
Goals and Rewards
All of what we mean by goals and purposes can be well thought of as the maximization of the expected value of the cumulative sum of a received scalar signal, called the reward. This is known as the reward hypothesis.
The agent always learns to maximize its reward. If we want it to do something for us, we must provide rewards to it in such a way that in maximizing them the agent will also achieve our goals. It is thus critical that the rewards we set up truly indicate what we want accomplished.
Returns and Episodes
We seek to maximize the expected return, where the return, denoted $G_t$, is defined as some specific function of the reward sequence. In the simplest case the return is the sum of the rewards:

$$G_t \doteq R_{t+1} + R_{t+2} + R_{t+3} + \dots + R_T,$$

where $T$ is a final time step. This applies to episodic tasks, in which the agent-environment interaction breaks naturally into subsequences called episodes.
On the other hand, in many cases the agent-environment interaction does not break naturally into identifiable episodes, but goes on continually without limit. We call these continuing tasks. For them we introduce an additional concept, discounting. According to this approach, the agent tries to select actions so that the sum of the discounted rewards it receives over the future is maximized. In particular, it chooses $A_t$ to maximize the expected discounted return:

$$G_t \doteq R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1},$$

where $\gamma$, $0 \le \gamma \le 1$, is called the discount rate.
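For a finite reward sequence the discounted return is easy to compute directly; the small helper below is my own sketch, not from the book:

```python
def discounted_return(rewards, gamma):
    """Discounted return G_t = sum_k gamma^k * R_{t+k+1} over a reward list."""
    return sum((gamma ** k) * r for k, r in enumerate(rewards))

# Three rewards of 1.0 with gamma = 0.9: 1 + 0.9 + 0.81 = 2.71.
g = discounted_return([1.0, 1.0, 1.0], 0.9)
```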
Policies and Value Functions
Almost all reinforcement learning algorithms involve estimating value functions of states (or of state-action pairs) that estimate how good it is for the agent to be in a given state (or how good it is to perform a given action in a given state). The notion of “how good” here is defined in terms of future rewards that can be expected, or, to be precise, in terms of expected return.
A policy is a mapping from states to probabilities of selecting each possible action. If the agent is following policy $\pi$ at time $t$, then $\pi(a \mid s)$ is the probability that $A_t = a$ if $S_t = s$.
The value of a state $s$ under a policy $\pi$, denoted $v_\pi(s)$, is the expected return when starting in $s$ and following $\pi$ thereafter:

$$v_\pi(s) \doteq \mathbb{E}_\pi[G_t \mid S_t = s]$$
Similarly, we define the value of taking action $a$ in state $s$ under a policy $\pi$, denoted $q_\pi(s, a)$, as the expected return starting from $s$, taking the action $a$, and thereafter following policy $\pi$:

$$q_\pi(s, a) \doteq \mathbb{E}_\pi[G_t \mid S_t = s, A_t = a]$$
The value functions can be estimated from experience.
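For instance, if an agent follows $\pi$ and keeps an average, for each state, of the actual returns that followed visits to that state, the average converges to $v_\pi(s)$ as the number of visits grows. A minimal every-visit Monte Carlo sketch (the episode format and state names are my own assumptions):

```python
from collections import defaultdict

def mc_state_values(episodes, gamma):
    """Every-visit Monte Carlo estimate of v_pi from sampled episodes.

    Each episode is a list of (state, reward) pairs, where `reward` is the
    reward received on the transition out of `state`.
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for episode in episodes:
        g = 0.0
        # Walk backwards so g holds the return that followed each state.
        for state, reward in reversed(episode):
            g = reward + gamma * g
            totals[state] += g
            counts[state] += 1
    return {s: totals[s] / counts[s] for s in totals}

# Two made-up episodes gathered under some policy, gamma = 1.0.
episodes = [[("s0", 1.0), ("s1", 0.0)], [("s0", 0.0), ("s1", 1.0)]]
values = mc_state_values(episodes, gamma=1.0)
```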
Solving a reinforcement learning task means, roughly, finding a policy that achieves a lot of reward over the long run. There is always at least one policy that is better than or equal to all other policies, i.e., an optimal policy, denoted by $\pi_*$. Although there may be more than one, they share the same state-value function, called the optimal state-value function, denoted $v_*$ and defined as

$$v_*(s) \doteq \max_\pi v_\pi(s)$$
Because $v_*$ is the value function for a policy, it must satisfy the self-consistency condition given by the Bellman equation. Because it is the optimal value function, however, $v_*$'s consistency condition can be written in a special form without reference to any specific policy. This is the Bellman optimality equation. Intuitively, the Bellman optimality equation expresses the fact that the value of a state under an optimal policy must equal the expected return for the best action from that state:

$$v_*(s) = \max_a \sum_{s', r} p(s', r \mid s, a)\big[r + \gamma v_*(s')\big]$$

The Bellman optimality equation for $q_*$ is

$$q_*(s, a) = \sum_{s', r} p(s', r \mid s, a)\big[r + \gamma \max_{a'} q_*(s', a')\big]$$
Dynamic Programming
Assume that the environment is a finite MDP.
Policy Evaluation - Prediction Problem
First we consider how to compute the state-value function $v_\pi$ for an arbitrary policy $\pi$. This is called policy evaluation in the dynamic programming literature. If the environment's dynamics are completely known, the initial approximation, $v_0$, is chosen arbitrarily (except that the terminal state, if any, must be given value $0$), and each successive approximation is obtained by using the Bellman equation for $v_\pi$ as an update rule:

$$v_{k+1}(s) \doteq \sum_a \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a)\big[r + \gamma v_k(s')\big]$$
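A minimal sketch of iterative policy evaluation, assuming the dynamics are stored as a dictionary `p[(s, a)] = {(s', r): probability}` and the policy as `pi[s] = {a: probability}`; the two-state MDP in the usage example is made up for illustration:

```python
def policy_evaluation(states, actions, p, pi, gamma=0.9, theta=1e-8):
    """In-place iterative policy evaluation for a finite MDP.

    Repeats the Bellman-expectation sweep until the largest change
    across states falls below theta.
    """
    v = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            new_v = sum(
                pi[s][a] * sum(prob * (r + gamma * v[sn])
                               for (sn, r), prob in p[(s, a)].items())
                for a in actions
            )
            delta = max(delta, abs(new_v - v[s]))
            v[s] = new_v
        if delta < theta:
            return v

# Two-state deterministic cycle: s0 --(+1)--> s1 --(0)--> s0, one action.
states = ["s0", "s1"]
actions = ["go"]
p = {("s0", "go"): {("s1", 1.0): 1.0}, ("s1", "go"): {("s0", 0.0): 1.0}}
pi = {"s0": {"go": 1.0}, "s1": {"go": 1.0}}
v = policy_evaluation(states, actions, p, pi, gamma=0.5)
# Solving v0 = 1 + 0.5*v1 and v1 = 0.5*v0 gives v0 = 4/3, v1 = 2/3.
```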
Policy Improvement
Our reason for computing the value function for a policy is to help find better policies. Suppose we have determined the value function $v_\pi$ for an arbitrary deterministic policy $\pi$. For some state $s$ we would like to know whether or not we should change the policy to deterministically choose an action $a \ne \pi(s)$. We know how good it is to follow the current policy from $s$, that is $v_\pi(s)$, but would it be better or worse to change to the new policy? One way to answer this question is to consider selecting $a$ in $s$ and thereafter following the existing policy $\pi$:

$$q_\pi(s, a) \doteq \mathbb{E}[R_{t+1} + \gamma v_\pi(S_{t+1}) \mid S_t = s, A_t = a] = \sum_{s', r} p(s', r \mid s, a)\big[r + \gamma v_\pi(s')\big]$$

The key criterion is whether this is greater than or less than $v_\pi(s)$. If it is greater, that is, if it is better to select $a$ once in $s$ and thereafter follow $\pi$ than it would be to follow $\pi$ all the time, then one would expect it to be better still to select $a$ every time $s$ is encountered, and that the new policy would in fact be a better one overall. That this is true is a special case of a general result called the policy improvement theorem.
The process of making a new policy that improves on an original policy, by making it greedy with respect to the value function of the original policy, is called policy improvement.
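A sketch of this greedy-policy construction, $\pi'(s) = \arg\max_a \sum_{s', r} p(s', r \mid s, a)[r + \gamma v(s')]$, assuming the dynamics are stored as a dictionary `p[(s, a)] = {(s', r): probability}`:

```python
def greedy_improvement(states, actions, p, v, gamma=0.9):
    """New deterministic policy, greedy with respect to value function v."""
    def q(s, a):
        # One-step lookahead: q(s, a) = sum_{s', r} p(s', r | s, a)[r + gamma v(s')].
        return sum(prob * (r + gamma * v[sn]) for (sn, r), prob in p[(s, a)].items())
    return {s: max(actions, key=lambda a, s=s: q(s, a)) for s in states}

# Made-up two-state MDP: "right" from s0 earns +1, everything else earns 0.
states = ["s0", "s1"]
actions = ["left", "right"]
p = {
    ("s0", "left"):  {("s0", 0.0): 1.0},
    ("s0", "right"): {("s1", 1.0): 1.0},
    ("s1", "left"):  {("s0", 0.0): 1.0},
    ("s1", "right"): {("s1", 0.0): 1.0},
}
v = {"s0": 0.0, "s1": 0.0}
pi = greedy_improvement(states, actions, p, v)
# pi["s0"] -> "right"
```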
Policy Iteration
Once a policy, $\pi$, has been improved using $v_\pi$ to yield a better policy, $\pi'$, we can then compute $v_{\pi'}$ and improve it again to yield an even better $\pi''$. We can thus obtain a sequence of monotonically improving policies and value functions:

$$\pi_0 \xrightarrow{E} v_{\pi_0} \xrightarrow{I} \pi_1 \xrightarrow{E} v_{\pi_1} \xrightarrow{I} \pi_2 \xrightarrow{E} \dots \xrightarrow{I} \pi_* \xrightarrow{E} v_*,$$

where $\xrightarrow{E}$ denotes a policy evaluation and $\xrightarrow{I}$ a policy improvement. This way of finding an optimal policy is called policy iteration.
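Putting evaluation and improvement together gives the full loop. The sketch below assumes deterministic policies and a dictionary dynamics format `p[(s, a)] = {(s', r): probability}`; the two-state, two-action MDP is invented for illustration:

```python
def policy_iteration(states, actions, p, gamma=0.9, theta=1e-8):
    """Policy iteration: alternate full evaluation and greedy improvement
    until the (deterministic) policy is stable."""
    def q(s, a, v):
        return sum(prob * (r + gamma * v[sn]) for (sn, r), prob in p[(s, a)].items())

    pi = {s: actions[0] for s in states}  # arbitrary initial policy
    while True:
        # Policy evaluation: iterate the Bellman expectation update to v_pi.
        v = {s: 0.0 for s in states}
        while True:
            delta = 0.0
            for s in states:
                new_v = q(s, pi[s], v)
                delta = max(delta, abs(new_v - v[s]))
                v[s] = new_v
            if delta < theta:
                break
        # Policy improvement: act greedily with respect to v_pi.
        new_pi = {s: max(actions, key=lambda a, s=s: q(s, a, v)) for s in states}
        if new_pi == pi:
            return pi, v
        pi = new_pi

# Made-up MDP: "right" from s0 earns +1; the optimal policy cycles
# s0 -> s1 -> s0, collecting +1 every other step.
states = ["s0", "s1"]
actions = ["left", "right"]
p = {
    ("s0", "left"):  {("s0", 0.0): 1.0},
    ("s0", "right"): {("s1", 1.0): 1.0},
    ("s1", "left"):  {("s0", 0.0): 1.0},
    ("s1", "right"): {("s1", 0.0): 1.0},
}
pi, v = policy_iteration(states, actions, p, gamma=0.9)
# pi -> {"s0": "right", "s1": "left"}
```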
One drawback to policy iteration is that each of its iterations involves policy evaluation, which may itself be a protracted iterative computation requiring multiple sweeps through the state set. In fact, the policy evaluation step of policy iteration can be truncated in several ways without losing the convergence guarantees of policy iteration. One important special case is when policy evaluation is stopped after just one sweep (one update of each state). This algorithm is called value iteration. It can be written as a particularly simple update operation that combines the policy improvement and truncated policy evaluation steps:

$$v_{k+1}(s) \doteq \max_a \sum_{s', r} p(s', r \mid s, a)\big[r + \gamma v_k(s')\big]$$
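A sketch of value iteration under the same assumed dictionary format for the dynamics, with a greedy policy read off at the end; the two-state MDP is invented for illustration:

```python
def value_iteration(states, actions, p, gamma=0.9, theta=1e-8):
    """Value iteration: sweep v(s) <- max_a sum_{s',r} p(s',r|s,a)[r + gamma v(s')]
    until convergence, then extract a greedy policy from the values."""
    def q(s, a, v):
        return sum(prob * (r + gamma * v[sn]) for (sn, r), prob in p[(s, a)].items())

    v = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            new_v = max(q(s, a, v) for a in actions)
            delta = max(delta, abs(new_v - v[s]))
            v[s] = new_v
        if delta < theta:
            break
    pi = {s: max(actions, key=lambda a, s=s: q(s, a, v)) for s in states}
    return v, pi

# Made-up two-state MDP: "right" from s0 earns +1; the optimal policy
# cycles between the two states.
states = ["s0", "s1"]
actions = ["left", "right"]
p = {
    ("s0", "left"):  {("s0", 0.0): 1.0},
    ("s0", "right"): {("s1", 1.0): 1.0},
    ("s1", "left"):  {("s0", 0.0): 1.0},
    ("s1", "right"): {("s1", 0.0): 1.0},
}
v, pi = value_iteration(states, actions, p, gamma=0.9)
# v["s0"] solves v0 = 1 + 0.81*v0, i.e. 1 / 0.19.
```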
We use the term generalized policy iteration (GPI) to refer to the general idea of letting policy evaluation and policy improvement processes interact, independent of the granularity and other details of the two processes. Almost all reinforcement learning methods are well described as GPI. That is, all have identifiable policies and value functions, with the policy always being improved with respect to the value function and the value function always being driven toward the value function for the policy.
The evaluation and improvement processes in GPI can be viewed as both competing and cooperating. They compete in the sense that they pull in opposing directions. Making the policy greedy with respect to the value function typically makes the value function incorrect for the changed policy, and making the value function consistent with the policy typically causes that policy no longer to be greedy.
DP may not be practical for very large problems, but compared with other methods for solving MDPs, DP methods are actually quite efficient.
Convergence Proof
Here we give a proof of the convergence of the policy evaluation process. The proof is based on the contraction mapping and fixed-point principle, but we do not discuss the mathematical background in detail.
- Definition: Let $(X, d)$ be a metric space with metric $d$, and let $T: X \to X$ be a mapping. If there exists $q \in [0, 1)$ satisfying $d(T(x), T(y)) \le q \, d(x, y)$ for all $x, y \in X$, then $T$ is a contraction mapping on the space $X$.
- Definition: If there exists $x^* \in X$ satisfying $T(x^*) = x^*$, then $x^*$ is a fixed point of $T$.
- Theorem: A contraction mapping on a complete metric space has exactly ONE fixed point, and the iteration $x_{n+1} = T(x_n)$ converges to it from any starting point.
To prove that some iteration sequence is convergent, we only have to prove that the corresponding mapping is a contraction mapping.
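A sketch of how this applies to policy evaluation (the standard argument, not spelled out in the summary above): for $0 \le \gamma < 1$, the Bellman expectation operator $T_\pi$ is a $\gamma$-contraction in the sup norm on the space of value functions, since the transition probabilities under $\pi$ sum to one:

```latex
(T_\pi v)(s) \doteq \sum_a \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a)\bigl[r + \gamma v(s')\bigr]
\\[6pt]
\bigl| (T_\pi v)(s) - (T_\pi u)(s) \bigr|
  = \gamma \Bigl| \sum_a \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a)\bigl[v(s') - u(s')\bigr] \Bigr|
  \le \gamma \, \lVert v - u \rVert_\infty
\\[6pt]
\Rightarrow\ \lVert T_\pi v - T_\pi u \rVert_\infty \le \gamma \, \lVert v - u \rVert_\infty
```

Since the space of bounded value functions with the sup metric is complete, the theorem gives a unique fixed point, which by the Bellman equation is $v_\pi$, and the policy evaluation iterates $v_{k+1} = T_\pi v_k$ converge to it from any initialization.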