Key Concepts and Terminology
overview
An agent interacts with the environment in a loop:
1. the agent observes the environment and takes an action,
2. the action changes the environment's current state, and meanwhile the agent receives a reward,
3. go back to step 1.
In an episode, the return the agent gets is defined as the cumulative reward.
Briefly speaking, the goal of RL is to find an agent with the best possible policy, i.e., one that receives maximal return.
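The loop above can be sketched in code. This is a minimal sketch assuming a Gym-style interface with `reset()` and `step()`; `ToyEnv` and the policy here are hypothetical stand-ins, not a real environment:

```python
class ToyEnv:
    """A hypothetical stand-in for a Gym-style environment."""
    def __init__(self):
        self.steps = 0

    def reset(self):
        self.steps = 0
        return 0.0  # initial observation

    def step(self, action):
        self.steps += 1
        obs = float(self.steps)           # next observation
        reward = 1.0 if action == 1 else 0.0
        done = self.steps >= 5            # episode ends after 5 steps
        return obs, reward, done

def run_episode(env, policy):
    """One episode of the agent-environment loop; returns the (undiscounted) return."""
    obs = env.reset()
    ret = 0.0
    done = False
    while not done:
        action = policy(obs)                   # 1. agent observes and acts
        obs, reward, done = env.step(action)   # 2. env transitions, agent gets a reward
        ret += reward                          # return = cumulative reward
    return ret

always_one = lambda obs: 1
print(run_episode(ToyEnv(), always_one))  # -> 5.0
```

The episode's return is exactly the cumulative reward accumulated over the loop, matching the definition above.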
states and observations
concept | meaning |
---|---|
state | full description of the environment |
observation | partial description of the environment |
Both states and observations can be represented as vectors, matrices, or higher-dimensional tensors; e.g., visual observations often use RGB matrices, and robots in Gym often use high-dimensional vectors that encapsulate joint angles and velocities.
However, in the literature the two concepts are not clearly differentiated. Often when researchers mention states in a paper, their experimental subjects actually have observations instead, because normally the subjects only have partial access to the environment.
action spaces
The set of all valid actions is termed action space.
concept | meaning | representations |
---|---|---|
discrete action space | finite options of actions | integer indices |
continuous action space | infinite, smooth action changes | real-valued vectors |
As the goal of RL is to find optimal policies, whose responsibility is to give appropriate actions upon observing the environment, the type of action space matters to the choice of method for finding them.
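For concreteness, the two representations from the table can be sketched as follows (the dimensions and bounds here are made up for illustration; Gym exposes analogous `Discrete` and `Box` spaces):

```python
import random

# Discrete action space: 4 actions, each identified by an integer index.
n_actions = 4
discrete_action = random.randrange(n_actions)  # one of 0, 1, 2, 3

# Continuous action space: a real-valued vector, here 2 torques in [-1, 1].
continuous_action = [random.uniform(-1.0, 1.0) for _ in range(2)]

assert 0 <= discrete_action < n_actions
assert all(-1.0 <= a <= 1.0 for a in continuous_action)
```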
policies
The policy is what an agent conforms to in order to give actions when observing the environment. It takes the current state (or observation) as input and outputs an action.
concept | notation |
---|---|
deterministic policy | $a_t = \mu(s_t)$ |
stochastic policy | $a_t \sim \pi(\cdot \mid s_t)$ |
I would interpret stochastic to mean that actions are randomly sampled from a distribution conditioned on the state.
Note that policy sometimes takes the place of agent in the literature. This is reasonable since all an agent does is follow its policy.
In deep RL, we deal with parameterized policies. A parameterized policy is a computable function, e.g., in the form of a neural network, so that we can obtain a policy that produces optimal actions conditioned on given states by adjusting the parameters with some optimization method.
deterministic policies
A typical example is several stacked dense layers with activations, followed by a final dense layer that outputs logits.
The above example looks like a vanilla finite-class classifier. The essential difference from stochastic policies, from my perspective, is that it takes an argmax over the logits instead of sampling, rendering it a deterministic function.
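A minimal NumPy sketch of such a classifier-style deterministic policy; the layer sizes and random weights are made up (a real policy would be trained):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical weights for a 2-layer MLP: 3-dim state -> 4 action logits.
W1, b1 = rng.standard_normal((3, 16)), np.zeros(16)
W2, b2 = rng.standard_normal((16, 4)), np.zeros(4)

def deterministic_policy(state):
    """Stacked dense layers with an activation, then a logits layer + argmax."""
    h = np.tanh(state @ W1 + b1)   # hidden dense layer with activation
    logits = h @ W2 + b2           # final dense layer: one logit per action
    return int(np.argmax(logits))  # argmax instead of sampling => deterministic

state = np.array([0.1, -0.2, 0.3])
a1 = deterministic_policy(state)
a2 = deterministic_policy(state)
assert a1 == a2  # the same state always yields the same action
```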
To think further: must such a policy necessarily be categorical?
My answer, for this logits-and-argmax construction, is yes. The question is in essence asking whether it is possible to have an infinite action space for such a deterministic policy.
At first glance, I thought I would just need an infinite action space, and once I had the logits for each action, I would simply take an argmax and my work would be over.
But how can we have logits for an infinite action space? Only via a distribution's formula, right? But recall how exactly we use a distribution's density function: we feed a point into it and get back a probability density. There is no way to enumerate all possible actions (each of which is a real-valued vector) and obtain their logits, since the action space is infinite.
(That said, a deterministic policy need not go through logits at all: for a continuous action space, the network can output the action vector directly, as in DDPG.)
stochastic policies
concept | action space |
---|---|
categorical policies | discrete |
diagonal Gaussian policies | continuous |
There are two essential tasks: 1) sampling actions, and 2) computing log-likelihoods of actions.
1. categorical policies
the common practice is:
- via network inference we get logits for the finite actions.
- sample from the softmax-ed logits: `tf.multinomial` is meant for the sampling result of n trials over (each) k options with specified probabilities. Recall that
>> the Bernoulli distribution models the outcome of a single trial with two options
>> the binomial distribution models the outcome of n trials with two options
>> the multinomial distribution models the outcome of n trials with k options
- log-likelihood: use the sampled action to index the log-softmax of the logits.
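The two tasks can be sketched without TensorFlow. A NumPy version of sampling from softmax-ed logits and then indexing the log-probabilities (the logits here are made up in place of a network's output):

```python
import numpy as np

rng = np.random.default_rng(1)
logits = np.array([2.0, 0.5, -1.0])   # hypothetical network output: 3 actions

# Softmax: exponentiate (shifted for numerical stability) and normalize.
probs = np.exp(logits - logits.max())
probs /= probs.sum()

# 1) sampling: one trial over k options
#    (tf.multinomial generalizes this to n trials)
action = rng.choice(len(logits), p=probs)

# 2) log-likelihood: index the log-softmax with the sampled action
log_probs = np.log(probs)
log_likelihood = log_probs[action]

assert 0 <= action < 3
assert np.isclose(np.exp(log_probs).sum(), 1.0)  # probabilities sum to 1
```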
2. Diagonal Gaussian policies
As the name suggests, the distribution from which actions are sampled is a diagonal multivariate Gaussian distribution. As a special case of a general multivariate Gaussian, diagonal here refers to a diagonal covariance matrix, meaning the dimensions are independent of each other.
Normally such a distribution is represented by a mean vector and a diagonal covariance matrix (equivalent to a variance vector).
As common practice, the mean vector of the policy is modeled by a neural network that takes a state vector as input and outputs the mean action vector used for the distribution of actions.
Two options are offered for modeling the variance vector (in practice, the log standard deviation):
- it can be a vector of standalone parameters independent of the state, i.e. $\log \sigma$,
- or dependent on the state, $\log \sigma_\theta(s)$, where the function is parameterized by $\theta$.
Note we use the log-standard-deviation because it takes on any value in $(-\infty, +\infty)$ while the std-dev must be non-negative, and training turns out to be easier if we enforce no constraints.
The sampling process can be done by `tf.random_normal` in TensorFlow, following exactly $a = \mu_\theta(s) + \sigma_\theta(s) \odot z$, where $\mu_\theta(s)$ and $\sigma_\theta(s)$ are respectively the mean and std-dev, and $z \sim \mathcal{N}(0, I)$.
The log-likelihood of a $k$-dimensional action $a$ is obtained via the following:

$$\log \pi_\theta(a \mid s) = -\frac{1}{2}\left(\sum_{i=1}^{k}\left(\frac{(a_i - \mu_i)^2}{\sigma_i^2} + 2 \log \sigma_i\right) + k \log 2\pi\right)$$
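Both tasks for a diagonal Gaussian policy, sketched in NumPy; the mean and log-std here are made-up constants standing in for network outputs:

```python
import numpy as np

rng = np.random.default_rng(2)
mu = np.array([0.5, -0.3])       # mean action vector (would come from the network)
log_std = np.array([-1.0, 0.2])  # unconstrained log-standard-deviation
std = np.exp(log_std)            # exp recovers a strictly positive std-dev

# 1) sampling: a = mu + sigma * z, with z ~ N(0, I)
z = rng.standard_normal(mu.shape)
action = mu + std * z

# 2) log-likelihood of the k-dim action under the diagonal Gaussian
k = mu.shape[0]
log_likelihood = -0.5 * (np.sum(((action - mu) / std) ** 2 + 2 * log_std)
                         + k * np.log(2 * np.pi))

# cross-check: equals the sum of per-dimension univariate normal log-densities
per_dim = (-0.5 * ((action - mu) / std) ** 2
           - log_std - 0.5 * np.log(2 * np.pi))
assert np.isclose(log_likelihood, per_dim.sum())
```

Since the covariance is diagonal, the joint log-density factorizes into a sum of per-dimension terms, which is what the cross-check verifies.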
trajectories
def
start-state
state-transition
- mdp
- deterministic
- stochastic
alias
reward and return
reward function
why return
finite undiscounted return
infinite discounted return
- why discount
common practice: optimize undiscounted, value functions use discounted
the RL problem
goal: expected return
(stochastic case) probability of a T-step trajectory
expected return
optimal policy
value functions
def. what does the value mean
four types
on-policy off-policy
two lemma value and q
the optimal Q-function and the optimal action
def. > rel.
note. the policy with respect to optimal action
Bellman Equations
in recurrence
the optimal value and q
bellman backup
Advantage function
meaning
form