Thanks to Richard S. Sutton and Andrew G. Barto for their great work, Reinforcement Learning: An Introduction (2nd Edition).
Here we summarize some basic notions and formulations that appear in most reinforcement learning problems. This note does NOT include a detailed explanation of each notion. Refer to the reference above if you want deeper insight.
Markov decision processes (MDPs) are a classical formalization of sequential decision making, where actions influence not just immediate rewards but also subsequent situations, or states, and, through those, future rewards. MDPs are a mathematically idealized form of the reinforcement learning problem.
Agent-Environment Interface
MDPs are meant to be a straightforward framing of the problem of learning from interaction to achieve a goal. The learner and decision maker is called the agent. The thing it interacts with, comprising everything outside the agent, is called the environment. These interact continually, the agent selecting actions and the environment responding to these actions and presenting new situations (states) to the agent. The environment also gives rise to rewards, special numerical values that the agent seeks to maximize over time through its choice of actions.
More specifically, the agent and environment interact at each of a sequence of discrete time steps, $t = 0, 1, 2, 3, \dots$. At each time step $t$, the agent receives some representation of the environment's state, $S_t \in \mathcal{S}$, and on that basis selects an action, $A_t \in \mathcal{A}(s)$. One time step later, in part as a consequence of its action, the agent receives a numerical reward, $R_{t+1} \in \mathcal{R} \subset \mathbb{R}$, and finds itself in a new state, $S_{t+1}$. The MDP and agent together thereby give rise to a sequence or trajectory that begins like this:

$$S_0, A_0, R_1, S_1, A_1, R_2, S_2, A_2, R_3, \dots$$
In a finite MDP, the sets of states, actions, and rewards ($\mathcal{S}$, $\mathcal{A}$, and $\mathcal{R}$) all have a finite number of elements. In this case, the random variables $R_t$ and $S_t$ have well-defined discrete probability distributions dependent only on the preceding state and action:

$$p(s', r \mid s, a) \doteq \Pr\{S_t = s', R_t = r \mid S_{t-1} = s, A_{t-1} = a\}$$
One can compute anything else one might want to know about the environment, such as the state-transition probabilities:

$$p(s' \mid s, a) \doteq \Pr\{S_t = s' \mid S_{t-1} = s, A_{t-1} = a\} = \sum_{r \in \mathcal{R}} p(s', r \mid s, a)$$
We can also compute the expected rewards for state-action pairs:

$$r(s, a) \doteq \mathbb{E}[R_t \mid S_{t-1} = s, A_{t-1} = a] = \sum_{r \in \mathcal{R}} r \sum_{s' \in \mathcal{S}} p(s', r \mid s, a)$$
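As a concrete illustration, the four-argument dynamics function $p(s', r \mid s, a)$ of a small finite MDP can be stored as a table and the two derived quantities computed from it. The two-state MDP below is a made-up example, not one from the book:

```python
# Dynamics of a small, made-up finite MDP, stored as
# p[(s, a)] = {(s_next, r): probability}.
p = {
    ("s0", "go"): {("s1", 1.0): 0.8, ("s0", 0.0): 0.2},
    ("s1", "go"): {("s0", 0.0): 1.0},
}

def state_transition_prob(p, s, a, s_next):
    """p(s' | s, a): marginalize the joint dynamics over rewards."""
    return sum(prob for (sn, r), prob in p[(s, a)].items() if sn == s_next)

def expected_reward(p, s, a):
    """r(s, a): expected one-step reward, summing r * p(s', r | s, a)."""
    return sum(r * prob for (sn, r), prob in p[(s, a)].items())

print(state_transition_prob(p, "s0", "go", "s1"))  # 0.8
print(expected_reward(p, "s0", "go"))              # 0.8
```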
Goals and Rewards
All of what we mean by goals and purposes can be well thought of as the maximization of the expected value of the cumulative sum of a received scalar signal, called the reward. This is known as the reward hypothesis.
The agent always learns to maximize its reward. If we want it to do something for us, we must provide rewards to it in such a way that in maximizing them the agent will also achieve our goals. It is thus critical that the rewards we set up truly indicate what we want accomplished.
Returns and Episodes
We seek to maximize the expected return, where the return, denoted $G_t$, is defined as some specific function of the reward sequence. In the simplest case the return is the sum of the rewards:

$$G_t \doteq R_{t+1} + R_{t+2} + R_{t+3} + \dots + R_T,$$

where $T$ is a final time step. This applies to episodic tasks, in which the agent-environment interaction breaks naturally into subsequences called episodes.
On the other hand, in many cases the agent-environment interaction does not break naturally into identifiable episodes, but goes on continually without limit. We call these continuing tasks. For them we introduce an additional concept, discounting. According to this approach, the agent tries to select actions so that the sum of the discounted rewards it receives over the future is maximized. In particular, it chooses $A_t$ to maximize the expected discounted return:

$$G_t \doteq R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1},$$

where $\gamma$, $0 \le \gamma \le 1$, is called the discount rate.
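For a finite reward sequence the discounted return is easy to compute directly; the small helper below is my own sketch, not from the book:

```python
def discounted_return(rewards, gamma):
    """Discounted return G_t = sum_k gamma^k * R_{t+k+1} over a reward list."""
    return sum((gamma ** k) * r for k, r in enumerate(rewards))

# Three rewards of 1.0 with gamma = 0.9: 1 + 0.9 + 0.81 = 2.71.
g = discounted_return([1.0, 1.0, 1.0], 0.9)
```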
Policies and Value Functions
Almost all reinforcement learning algorithms involve estimating value functions of states (or of state-action pairs) that estimate how good it is for the agent to be in a given state (or how good it is to perform a given action in a given state). The notion of “how good” here is defined in terms of future rewards that can be expected, or, to be precise, in terms of expected return.
A policy is a mapping from states to probabilities of selecting each possible action. If the agent is following policy $\pi$ at time $t$, then $\pi(a \mid s)$ is the probability that $A_t = a$ if $S_t = s$.
The value of a state $s$ under a policy $\pi$, denoted $v_\pi(s)$, is the expected return when starting in $s$ and following $\pi$ thereafter:

$$v_\pi(s) \doteq \mathbb{E}_\pi[G_t \mid S_t = s]$$
Similarly, we define the value of taking action $a$ in state $s$ under a policy $\pi$, denoted $q_\pi(s, a)$, as the expected return starting from $s$, taking the action $a$, and thereafter following policy $\pi$:

$$q_\pi(s, a) \doteq \mathbb{E}_\pi[G_t \mid S_t = s, A_t = a]$$
The value functions can be estimated from experience.
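For instance, if an agent follows $\pi$ and keeps an average, for each state, of the actual returns that followed visits to that state, the average converges to $v_\pi(s)$ as the number of visits grows. A minimal every-visit Monte Carlo sketch (the episode format and state names are my own assumptions):

```python
from collections import defaultdict

def mc_state_values(episodes, gamma):
    """Every-visit Monte Carlo estimate of v_pi from sampled episodes.

    Each episode is a list of (state, reward) pairs, where `reward` is the
    reward received on the transition out of `state`.
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for episode in episodes:
        g = 0.0
        # Walk backwards so g holds the return that followed each state.
        for state, reward in reversed(episode):
            g = reward + gamma * g
            totals[state] += g
            counts[state] += 1
    return {s: totals[s] / counts[s] for s in totals}

# Two made-up episodes gathered under some policy, gamma = 1.0.
episodes = [[("s0", 1.0), ("s1", 0.0)], [("s0", 0.0), ("s1", 1.0)]]
values = mc_state_values(episodes, gamma=1.0)
```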
Solving a reinforcement learning task means, roughly, finding a policy that achieves a lot of reward over the long run. There is always at least one policy that is better than or equal to all other policies, i.e., an optimal policy, denoted by $\pi_*$. Although there may be more than one, they share the same state-value function, called the optimal state-value function, denoted $v_*$ and defined as

$$v_*(s) \doteq \max_\pi v_\pi(s)$$
Because $v_*$ is the value function for a policy, it must satisfy the self-consistency condition given by the Bellman equation. Because it is the optimal value function, however, $v_*$'s consistency condition can be written in a special form without reference to any specific policy. This is the Bellman optimality equation. Intuitively, the Bellman optimality equation expresses the fact that the value of a state under an optimal policy must equal the expected return for the best action from that state:

$$v_*(s) = \max_a \sum_{s', r} p(s', r \mid s, a)\big[r + \gamma v_*(s')\big]$$

The Bellman optimality equation for $q_*$ is

$$q_*(s, a) = \sum_{s', r} p(s', r \mid s, a)\big[r + \gamma \max_{a'} q_*(s', a')\big]$$
Dynamic Programming
Assume that the environment is a finite MDP.
Policy Evaluation - Prediction Problem
First we consider how to compute the state-value function $v_\pi$ for an arbitrary policy $\pi$. This is called policy evaluation in the dynamic programming literature. If the environment's dynamics are completely known, the initial approximation, $v_0$, is chosen arbitrarily (except that the terminal state, if any, must be given value $0$), and each successive approximation is obtained by using the Bellman equation for $v_\pi$ as an update rule:

$$v_{k+1}(s) \doteq \sum_a \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a)\big[r + \gamma v_k(s')\big]$$
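A minimal sketch of iterative policy evaluation, assuming the dynamics are stored as a dictionary `p[(s, a)] = {(s', r): probability}` and the policy as `pi[s] = {a: probability}`; the two-state MDP in the usage example is made up for illustration:

```python
def policy_evaluation(states, actions, p, pi, gamma=0.9, theta=1e-8):
    """In-place iterative policy evaluation for a finite MDP.

    Repeats the Bellman-expectation sweep until the largest change
    across states falls below theta.
    """
    v = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            new_v = sum(
                pi[s][a] * sum(prob * (r + gamma * v[sn])
                               for (sn, r), prob in p[(s, a)].items())
                for a in actions
            )
            delta = max(delta, abs(new_v - v[s]))
            v[s] = new_v
        if delta < theta:
            return v

# Two-state deterministic cycle: s0 --(+1)--> s1 --(0)--> s0, one action.
states = ["s0", "s1"]
actions = ["go"]
p = {("s0", "go"): {("s1", 1.0): 1.0}, ("s1", "go"): {("s0", 0.0): 1.0}}
pi = {"s0": {"go": 1.0}, "s1": {"go": 1.0}}
v = policy_evaluation(states, actions, p, pi, gamma=0.5)
# Solving v0 = 1 + 0.5*v1 and v1 = 0.5*v0 gives v0 = 4/3, v1 = 2/3.
```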
Policy Improvement
Our reason for computing the value function for a policy is to help find better policies. Suppose we have determined the value function $v_\pi$ for an arbitrary deterministic policy $\pi$. For some state $s$ we would like to know whether or not we should change the policy to deterministically choose an action $a \ne \pi(s)$. We know how good it is to follow the current policy from $s$, that is $v_\pi(s)$, but would it be better or worse to change to the new policy? One way to answer this question is to consider selecting $a$ in $s$ and thereafter following the existing policy $\pi$:

$$q_\pi(s, a) \doteq \mathbb{E}[R_{t+1} + \gamma v_\pi(S_{t+1}) \mid S_t = s, A_t = a] = \sum_{s', r} p(s', r \mid s, a)\big[r + \gamma v_\pi(s')\big]$$

The key criterion is whether this is greater than or less than $v_\pi(s)$. If it is greater, that is, if it is better to select $a$ once in $s$ and thereafter follow $\pi$ than it would be to follow $\pi$ all the time, then one would expect it to be better still to select $a$ every time $s$ is encountered, and that the new policy would in fact be a better one overall. That this is true is a special case of a general result called the policy improvement theorem.
The process of making a new policy that improves on an original policy, by making it greedy with respect to the value function of the original policy, is called policy improvement.
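A sketch of this greedy-policy construction, $\pi'(s) = \arg\max_a \sum_{s', r} p(s', r \mid s, a)[r + \gamma v(s')]$, assuming the dynamics are stored as a dictionary `p[(s, a)] = {(s', r): probability}`:

```python
def greedy_improvement(states, actions, p, v, gamma=0.9):
    """New deterministic policy, greedy with respect to value function v."""
    def q(s, a):
        # One-step lookahead: q(s, a) = sum_{s', r} p(s', r | s, a)[r + gamma v(s')].
        return sum(prob * (r + gamma * v[sn]) for (sn, r), prob in p[(s, a)].items())
    return {s: max(actions, key=lambda a, s=s: q(s, a)) for s in states}

# Made-up two-state MDP: "right" from s0 earns +1, everything else earns 0.
states = ["s0", "s1"]
actions = ["left", "right"]
p = {
    ("s0", "left"):  {("s0", 0.0): 1.0},
    ("s0", "right"): {("s1", 1.0): 1.0},
    ("s1", "left"):  {("s0", 0.0): 1.0},
    ("s1", "right"): {("s1", 0.0): 1.0},
}
v = {"s0": 0.0, "s1": 0.0}
pi = greedy_improvement(states, actions, p, v)
# pi["s0"] -> "right"
```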
Policy Iteration
Once a policy, $\pi$, has been improved using $v_\pi$ to yield a better policy, $\pi'$, we can then compute $v_{\pi'}$ and improve it again to yield an even better $\pi''$. We can thus obtain a sequence of monotonically improving policies and value functions:

$$\pi_0 \xrightarrow{E} v_{\pi_0} \xrightarrow{I} \pi_1 \xrightarrow{E} v_{\pi_1} \xrightarrow{I} \pi_2 \xrightarrow{E} \dots \xrightarrow{I} \pi_* \xrightarrow{E} v_*,$$

where $\xrightarrow{E}$ denotes a policy evaluation and $\xrightarrow{I}$ a policy improvement. This way of finding an optimal policy is called policy iteration.
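Putting evaluation and improvement together gives the full loop. The sketch below assumes deterministic policies and a dictionary dynamics format `p[(s, a)] = {(s', r): probability}`; the two-state, two-action MDP is invented for illustration:

```python
def policy_iteration(states, actions, p, gamma=0.9, theta=1e-8):
    """Policy iteration: alternate full evaluation and greedy improvement
    until the (deterministic) policy is stable."""
    def q(s, a, v):
        return sum(prob * (r + gamma * v[sn]) for (sn, r), prob in p[(s, a)].items())

    pi = {s: actions[0] for s in states}  # arbitrary initial policy
    while True:
        # Policy evaluation: iterate the Bellman expectation update to v_pi.
        v = {s: 0.0 for s in states}
        while True:
            delta = 0.0
            for s in states:
                new_v = q(s, pi[s], v)
                delta = max(delta, abs(new_v - v[s]))
                v[s] = new_v
            if delta < theta:
                break
        # Policy improvement: act greedily with respect to v_pi.
        new_pi = {s: max(actions, key=lambda a, s=s: q(s, a, v)) for s in states}
        if new_pi == pi:
            return pi, v
        pi = new_pi

# Made-up MDP: "right" from s0 earns +1; the optimal policy cycles
# s0 -> s1 -> s0, collecting +1 every other step.
states = ["s0", "s1"]
actions = ["left", "right"]
p = {
    ("s0", "left"):  {("s0", 0.0): 1.0},
    ("s0", "right"): {("s1", 1.0): 1.0},
    ("s1", "left"):  {("s0", 0.0): 1.0},
    ("s1", "right"): {("s1", 0.0): 1.0},
}
pi, v = policy_iteration(states, actions, p, gamma=0.9)
# pi -> {"s0": "right", "s1": "left"}
```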
One drawback to policy iteration is that each of its iterations involves policy evaluation, which may itself be a protracted iterative computation requiring multiple sweeps through the state set. In fact, the policy evaluation step of policy iteration can be truncated in several ways without losing the convergence guarantees of policy iteration. One important special case is when policy evaluation is stopped after just one sweep (one update of each state). This algorithm is called value iteration. It can be written as a particularly simple update operation that combines the policy improvement and truncated policy evaluation steps:

$$v_{k+1}(s) \doteq \max_a \sum_{s', r} p(s', r \mid s, a)\big[r + \gamma v_k(s')\big]$$
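A sketch of value iteration under the same assumed dictionary format for the dynamics, with a greedy policy read off at the end; the two-state MDP is invented for illustration:

```python
def value_iteration(states, actions, p, gamma=0.9, theta=1e-8):
    """Value iteration: sweep v(s) <- max_a sum_{s',r} p(s',r|s,a)[r + gamma v(s')]
    until convergence, then extract a greedy policy from the values."""
    def q(s, a, v):
        return sum(prob * (r + gamma * v[sn]) for (sn, r), prob in p[(s, a)].items())

    v = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            new_v = max(q(s, a, v) for a in actions)
            delta = max(delta, abs(new_v - v[s]))
            v[s] = new_v
        if delta < theta:
            break
    pi = {s: max(actions, key=lambda a, s=s: q(s, a, v)) for s in states}
    return v, pi

# Made-up two-state MDP: "right" from s0 earns +1; the optimal policy
# cycles between the two states.
states = ["s0", "s1"]
actions = ["left", "right"]
p = {
    ("s0", "left"):  {("s0", 0.0): 1.0},
    ("s0", "right"): {("s1", 1.0): 1.0},
    ("s1", "left"):  {("s0", 0.0): 1.0},
    ("s1", "right"): {("s1", 0.0): 1.0},
}
v, pi = value_iteration(states, actions, p, gamma=0.9)
# v["s0"] solves v0 = 1 + 0.81*v0, i.e. 1 / 0.19.
```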
We use the term generalized policy iteration (GPI) to refer to the general idea of letting policy evaluation and policy improvement processes interact, independent of the granularity and other details of the two processes. Almost all reinforcement learning methods are well described as GPI. That is, all have identifiable policies and value functions, with the policy always being improved with respect to the value function and the value function always being driven toward the value function for the policy.
The evaluation and improvement processes in GPI can be viewed as both competing and cooperating. They compete in the sense that they pull in opposing directions. Making the policy greedy with respect to the value function typically makes the value function incorrect for the changed policy, and making the value function consistent with the policy typically causes that policy no longer to be greedy.
DP may not be practical for very large problems, but compared with other methods for solving MDPs, DP methods are actually quite efficient.
Convergence Proof
Here we give a proof of the convergence of the policy evaluation process. The proof is based on the contraction mapping and fixed-point principle, but we do not discuss the mathematical background in detail.
- Definition: Let $(X, d)$ be a metric space with metric $d$, and let $T: X \to X$ be a mapping. If there exists $q \in [0, 1)$ satisfying $d(T(x), T(y)) \le q \, d(x, y)$ for all $x, y \in X$, then $T$ is a contraction mapping on the space $X$.
- Definition: If there exists $x^* \in X$ satisfying $T(x^*) = x^*$, then $x^*$ is a fixed point of $T$.
- Theorem: A contraction mapping on a complete metric space has exactly ONE fixed point, and the iteration $x_{n+1} = T(x_n)$ converges to it from any starting point.
To prove that some iteration sequence is convergent, we only have to prove that the corresponding mapping is a contraction mapping.
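A sketch of how this applies to policy evaluation (the standard argument, not spelled out in the summary above): for $0 \le \gamma < 1$, the Bellman expectation operator $T_\pi$ is a $\gamma$-contraction in the sup norm on the space of value functions, since the transition probabilities under $\pi$ sum to one:

```latex
(T_\pi v)(s) \doteq \sum_a \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a)\bigl[r + \gamma v(s')\bigr]
\\[6pt]
\bigl| (T_\pi v)(s) - (T_\pi u)(s) \bigr|
  = \gamma \Bigl| \sum_a \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a)\bigl[v(s') - u(s')\bigr] \Bigr|
  \le \gamma \, \lVert v - u \rVert_\infty
\\[6pt]
\Rightarrow\ \lVert T_\pi v - T_\pi u \rVert_\infty \le \gamma \, \lVert v - u \rVert_\infty
```

Since the space of bounded value functions with the sup metric is complete, the theorem gives a unique fixed point, which by the Bellman equation is $v_\pi$, and the policy evaluation iterates $v_{k+1} = T_\pi v_k$ converge to it from any initialization.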