Key Concepts and Terminology
overview
An agent interacts with the environment in a loop:
1. the agent observes the environment and takes an action,
2. the action changes the environment's current state, and meanwhile the agent receives a reward,
3. go back to step 1.
In an episode, the return the agent gets is defined as the cumulative reward.
Briefly speaking, the goal of RL is to find an agent with the best possible policy, i.e., one that receives maximal return.
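The loop above can be sketched in code. This is a minimal sketch assuming a Gym-style interface with `reset()` and `step()`; `ToyEnv` and the policy here are hypothetical stand-ins, not a real environment:

```python
class ToyEnv:
    """A hypothetical stand-in for a Gym-style environment."""
    def __init__(self):
        self.steps = 0

    def reset(self):
        self.steps = 0
        return 0.0  # initial observation

    def step(self, action):
        self.steps += 1
        obs = float(self.steps)           # next observation
        reward = 1.0 if action == 1 else 0.0
        done = self.steps >= 5            # episode ends after 5 steps
        return obs, reward, done

def run_episode(env, policy):
    """One episode of the agent-environment loop; returns the (undiscounted) return."""
    obs = env.reset()
    ret = 0.0
    done = False
    while not done:
        action = policy(obs)                   # 1. agent observes and acts
        obs, reward, done = env.step(action)   # 2. env transitions, agent gets a reward
        ret += reward                          # return = cumulative reward
    return ret

always_one = lambda obs: 1
print(run_episode(ToyEnv(), always_one))  # -> 5.0
```

The episode's return is exactly the cumulative reward accumulated over the loop, matching the definition above.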
states and observations
concept | meaning |
---|---|
state | full description of the environment |
observation | partial description of the environment |
Both states and observations can be represented as vectors, matrices, or higher-dimensional tensors; e.g., visual observations often use RGB matrices, and robots in Gym often use high-dimensional vectors that encapsulate joint angles and velocities.
However, in the literature the two concepts are not clearly differentiated. Often when researchers mention states in a paper, their experimental subjects actually have observations instead, because normally the subjects only have partial access to the environment.
action spaces
The set of all valid actions is termed action space.
concept | meaning | representations |
---|---|---|
discrete action space | finite options of actions | integer indices |
continuous action space | infinite, smooth action changes | real-valued vectors |
As the goal of RL is to find optimal policies, whose responsibility is to give appropriate actions upon observing the environment, the type of action space matters to the choice of method for finding them.
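For concreteness, the two representations from the table can be sketched as follows (the dimensions and bounds here are made up for illustration; Gym exposes analogous `Discrete` and `Box` spaces):

```python
import random

# Discrete action space: 4 actions, each identified by an integer index.
n_actions = 4
discrete_action = random.randrange(n_actions)  # one of 0, 1, 2, 3

# Continuous action space: a real-valued vector, here 2 torques in [-1, 1].
continuous_action = [random.uniform(-1.0, 1.0) for _ in range(2)]

assert 0 <= discrete_action < n_actions
assert all(-1.0 <= a <= 1.0 for a in continuous_action)
```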
policies
The policy is what an agent conforms to in order to give actions when observing the environment. It takes the current state (or observation) as input and outputs an action.
concept | notation |
---|---|
deterministic policy | $a_t = \mu(s_t)$ |
stochastic policy | $a_t \sim \pi(\cdot \mid s_t)$ |
I would interpret stochastic to mean that actions are randomly sampled from a distribution conditioned on the state.
Note that policy sometimes takes the place of agent in the literature. This is reasonable since all an agent does is follow its policy.
In deep RL, we deal with parameterized policies. A parameterized policy is a computable function, e.g., in the form of a neural network, so that we can obtain a policy that produces optimal actions conditioned on given states by adjusting the parameters with some optimization method.
deterministic policies
A typical example is several stacked dense layers with activations, followed by a final dense layer that outputs logits.
The above example looks like a vanilla finite-class classifier. The essential difference from stochastic policies, from my perspective, is that it takes an argmax over the logits instead of sampling, rendering it a deterministic function.
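A minimal NumPy sketch of such a classifier-style deterministic policy; the layer sizes and random weights are made up (a real policy would be trained):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical weights for a 2-layer MLP: 3-dim state -> 4 action logits.
W1, b1 = rng.standard_normal((3, 16)), np.zeros(16)
W2, b2 = rng.standard_normal((16, 4)), np.zeros(4)

def deterministic_policy(state):
    """Stacked dense layers with an activation, then a logits layer + argmax."""
    h = np.tanh(state @ W1 + b1)   # hidden dense layer with activation
    logits = h @ W2 + b2           # final dense layer: one logit per action
    return int(np.argmax(logits))  # argmax instead of sampling => deterministic

state = np.array([0.1, -0.2, 0.3])
a1 = deterministic_policy(state)
a2 = deterministic_policy(state)
assert a1 == a2  # the same state always yields the same action
```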
To think further: must such a policy necessarily be categorical?
My answer, for this logits-and-argmax construction, is yes. The question is in essence asking whether it is possible to have an infinite action space for such a deterministic policy.
At first glance, I thought I would just need an infinite action space, and once I had the logits for each action, I would simply take an argmax and my work would be over.
But how can we have logits for an infinite action space? Only via a distribution's formula, right? But recall how exactly we use a distribution's density function: we feed a point into it and get back a probability density. There is no way to enumerate all possible actions (each of which is a real-valued vector) and obtain their logits, since the action space is infinite.
(That said, a deterministic policy need not go through logits at all: for a continuous action space, the network can output the action vector directly, as in DDPG.)
stochastic policies
concept | action space |
---|---|
categorical policies | discrete |
diagonal Gaussian policies | continuous |
There are two essential tasks: 1) sampling actions, and 2) computing log-likelihoods of actions.
1. categorical policies
the common practice is:
- via network inference we get logits for the finite actions.
- sample from the softmax-ed logits: `tf.multinomial` is meant for the sampling result of n trials over (each) k options with specified probabilities. Recall that
>> the Bernoulli distribution models the outcome of a single trial with two options
>> the binomial distribution models the outcome of n trials with two options
>> the multinomial distribution models the outcome of n trials with k options
- log-likelihood: use the sampled action to index the log-softmax of the logits.
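The two tasks can be sketched without TensorFlow. A NumPy version of sampling from softmax-ed logits and then indexing the log-probabilities (the logits here are made up in place of a network's output):

```python
import numpy as np

rng = np.random.default_rng(1)
logits = np.array([2.0, 0.5, -1.0])   # hypothetical network output: 3 actions

# Softmax: exponentiate (shifted for numerical stability) and normalize.
probs = np.exp(logits - logits.max())
probs /= probs.sum()

# 1) sampling: one trial over k options
#    (tf.multinomial generalizes this to n trials)
action = rng.choice(len(logits), p=probs)

# 2) log-likelihood: index the log-softmax with the sampled action
log_probs = np.log(probs)
log_likelihood = log_probs[action]

assert 0 <= action < 3
assert np.isclose(np.exp(log_probs).sum(), 1.0)  # probabilities sum to 1
```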
2. Diagonal Gaussian policies
As the name suggests, the distribution from which actions are sampled is a diagonal multivariate Gaussian distribution. As a special case of a general multivariate Gaussian, diagonal here refers to a diagonal covariance matrix, meaning the dimensions are independent of each other.
Normally such a distribution is represented by a mean vector and a diagonal covariance matrix (equivalent to a variance vector).
As common practice, the mean vector of the policy is modeled by a neural network that takes a state vector as input and outputs the mean action vector used for the distribution of actions.
Two options are offered for modeling the variance vector (in practice, the log standard deviation):
- it can be a vector of standalone parameters independent of the state, i.e. $\log \sigma$,
- or dependent on the state, $\log \sigma_\theta(s)$, where the function is parameterized by $\theta$.
Note we use the log-standard-deviation because it takes on any value in $(-\infty, +\infty)$ while the std-dev must be non-negative, and training turns out to be easier if we enforce no constraints.
The sampling process can be done by `tf.random_normal` in TensorFlow, following exactly $a = \mu_\theta(s) + \sigma_\theta(s) \odot z$, where $\mu_\theta(s)$ and $\sigma_\theta(s)$ are respectively the mean and std-dev, and $z \sim \mathcal{N}(0, I)$.
The log-likelihood of a $k$-dimensional action $a$ is obtained via the following:

$$\log \pi_\theta(a \mid s) = -\frac{1}{2}\left(\sum_{i=1}^{k}\left(\frac{(a_i - \mu_i)^2}{\sigma_i^2} + 2 \log \sigma_i\right) + k \log 2\pi\right)$$
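Both tasks for a diagonal Gaussian policy, sketched in NumPy; the mean and log-std here are made-up constants standing in for network outputs:

```python
import numpy as np

rng = np.random.default_rng(2)
mu = np.array([0.5, -0.3])       # mean action vector (would come from the network)
log_std = np.array([-1.0, 0.2])  # unconstrained log-standard-deviation
std = np.exp(log_std)            # exp recovers a strictly positive std-dev

# 1) sampling: a = mu + sigma * z, with z ~ N(0, I)
z = rng.standard_normal(mu.shape)
action = mu + std * z

# 2) log-likelihood of the k-dim action under the diagonal Gaussian
k = mu.shape[0]
log_likelihood = -0.5 * (np.sum(((action - mu) / std) ** 2 + 2 * log_std)
                         + k * np.log(2 * np.pi))

# cross-check: equals the sum of per-dimension univariate normal log-densities
per_dim = (-0.5 * ((action - mu) / std) ** 2
           - log_std - 0.5 * np.log(2 * np.pi))
assert np.isclose(log_likelihood, per_dim.sum())
```

Since the covariance is diagonal, the joint log-density factorizes into a sum of per-dimension terms, which is what the cross-check verifies.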
trajectories
def
start-state
state-transition
- mdp
- deterministic
- stochastic
alias
reward and return
reward function
why return
finite undiscounted return
infinite discounted return
- why discount
common practice: optimize undiscounted, value functions use discounted
the RL problem
goal: expected return
(stochastic case) probability of a T-step trajectory
expected return
optimal policy
value functions
def. what does the value mean
four types
on-policy off-policy
two lemma value and q
the optimal Q-function and the optimal action
def. > rel.
note. the policy with respect to optimal action
Bellman Equations
in recurrence
the optimal value and q
bellman backup
Advantage function
meaning
form