[OpenAI SpinningUp] Key Concepts and Terminology

An agent interacts with the environment in a loop:

  1. the agent observes the environment and takes an action,
  2. the action changes the environment’s current state, and meanwhile the agent receives a reward,
  3. back to 1.

Within an episode, the return the agent gets is defined as the cumulative reward.
Briefly speaking, the goal of RL is to find an agent whose policy achieves the maximal expected return.
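
The loop above can be sketched in a few lines. This is only a minimal illustration: `CoinFlipEnv`, `run_episode` and `always_heads` are hypothetical names made up for this note, and the return here is the plain undiscounted sum of rewards.

```python
import random


class CoinFlipEnv:
    """Hypothetical toy environment: guess a hidden coin flip at each step."""

    def __init__(self, horizon=5, seed=0):
        self.horizon = horizon
        self.rng = random.Random(seed)

    def reset(self):
        self.t = 0
        self.coin = self.rng.randint(0, 1)
        return self.t  # observation: the step index only, the coin stays hidden

    def step(self, action):
        reward = 1.0 if action == self.coin else 0.0
        self.t += 1
        self.coin = self.rng.randint(0, 1)  # the environment state changes
        done = self.t >= self.horizon
        return self.t, reward, done


def run_episode(env, policy):
    """Roll out one episode and return the undiscounted sum of rewards."""
    obs = env.reset()
    episode_return, done = 0.0, False
    while not done:
        action = policy(obs)                  # 1. observe, then act
        obs, reward, done = env.step(action)  # 2. state changes, reward arrives
        episode_return += reward              # accumulate the return
    return episode_return


def always_heads(obs):
    """A trivial policy that ignores the observation."""
    return 1


print(run_episode(CoinFlipEnv(), always_heads))
```

Finding a better policy than `always_heads` is impossible here because the reward is pure chance, but the same loop applies unchanged to environments where the observation is informative.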

states and observations

| concept | meaning |
| --- | --- |
| state | full description of the environment |
| observation | partial description of the environment |

Both states and observations can be represented as vectors, matrices, or higher-dimensional tensors. For example, visual observations are often RGB pixel matrices, and robots in Gym often use high-dimensional vectors that encapsulate angles and velocities.

However, the literature does not clearly differentiate the two concepts. Often when researchers mention a state in a paper, their experimental subjects actually receive an observation instead, because the subjects normally have only partial access to the environment.
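
To make the distinction concrete, here is a small sketch with a hypothetical 1-D cart (`CartState` and `observe` are illustrative names, not from any library). The state holds position and velocity, while the agent only gets to see the position.

```python
from typing import NamedTuple, Tuple


class CartState(NamedTuple):
    """Full state of a hypothetical 1-D cart."""
    position: float
    velocity: float


def observe(state: CartState) -> Tuple[float]:
    """Partial observation: the agent sees the position, the velocity is hidden."""
    return (state.position,)


s1 = CartState(position=0.5, velocity=+1.0)
s2 = CartState(position=0.5, velocity=-1.0)

# Two different states can produce the same observation.
print(observe(s1) == observe(s2), s1 == s2)  # prints: True False
```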

action spaces

The set of all valid actions is termed the action space.

| concept | meaning | representation |
| --- | --- | --- |
| discrete action space | finite number of action options | integer indices |
| continuous action space | infinitely many, smoothly varying actions | real-valued vectors |

The goal of RL is to find optimal policies, whose job is to produce appropriate actions upon observing the environment, so the kind of action space determines which methods for finding optimal policies apply.


policies

The policy is the rule an agent follows to choose actions when observing the environment. It takes the current state (or observation) as input and outputs an action.

| concept | notation |
| --- | --- |
| deterministic policy | $a_t=\pi(s_t)$ |
| stochastic policy | $a_t\sim\pi(\cdot\mid s_t)$ |

I interpret stochastic to mean that actions are randomly sampled from a distribution that depends on the state.

Note that policy sometimes takes the place of agent in the literature. This is reasonable, since all an agent does is follow its policy.

In deep RL, we deal with parameterized policies. The policy is a computable function with parameters, for example a neural network, so that by adjusting the parameters with some optimization method we can obtain a policy that produces optimal actions conditioned on the given states.

deterministic policies

A typical example is a stack of several dense layers with activations, followed by a final dense layer that outputs logits.

The above example looks like a vanilla classifier over a finite number of classes. The essential difference from stochastic policies, from my perspective, is that it takes an argmax over the logits instead of sampling, which renders it a deterministic function.
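
A minimal sketch of such an argmax policy, written in plain Python so it runs without any deep learning framework. The layer sizes, the tanh activation, the random initialization and all function names are my own assumptions for illustration, not the reference implementation.

```python
import math
import random


def dense(x, weights, biases, activation=None):
    """One dense layer: y_j = act(sum_i x_i * W[i][j] + b_j)."""
    out = []
    for j in range(len(biases)):
        pre = sum(x[i] * weights[i][j] for i in range(len(x))) + biases[j]
        out.append(activation(pre) if activation else pre)
    return out


def init_layer(n_in, n_out, rng):
    """Random weights of shape (n_in, n_out) and zero biases."""
    weights = [[rng.gauss(0.0, 0.5) for _ in range(n_out)] for _ in range(n_in)]
    return weights, [0.0] * n_out


def logits_net(obs, params):
    """Hidden dense layers with tanh, then a final linear layer producing logits."""
    h = obs
    for weights, biases in params[:-1]:
        h = dense(h, weights, biases, activation=math.tanh)
    weights, biases = params[-1]
    return dense(h, weights, biases)


def deterministic_policy(obs, params):
    """Return the index of the largest logit: same input, same action, every time."""
    logits = logits_net(obs, params)
    return max(range(len(logits)), key=logits.__getitem__)


rng = random.Random(0)
sizes = [4, 8, 3]  # obs_dim=4, one hidden layer of 8 units, 3 discrete actions
params = [init_layer(a, b, rng) for a, b in zip(sizes[:-1], sizes[1:])]
obs = [0.1, -0.2, 0.3, 0.0]
print(deterministic_policy(obs, params))
```

Calling `deterministic_policy` twice on the same `obs` always gives the same index, which is exactly what separates it from sampling.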

To think further: must a deterministic policy be categorical?
My first answer was yes, but that only holds for the argmax-over-logits construction. The question is in essence whether a deterministic policy can have an infinite action space.
At first glance, I thought I just needed an infinite action space: get the logits for every action, take an argmax, and the work is done.
But how can we have logits for an infinite action space? Only via a distribution’s density function, which we feed a point and get back a probability density. There is no way to enumerate all possible actions (each of which is a real-valued vector) and compare their values, since the action space is infinite. The way out is to skip the logits entirely: the final dense layer can output the action vector itself, $a_t=\mu_\theta(s_t)$, which gives a deterministic policy over a continuous action space.

stochastic policies

| concept | action space |
| --- | --- |
| categorical policies | discrete |
| diagonal Gaussian policies | continuous |

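Using and training a stochastic policy involves two essential computations: 1) sampling actions from the policy, and 2) computing the log-likelihood of particular actions, $\log\pi_\theta(a\mid s)$.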

1. categorical policies

the common practice is:

  1. via network inference we get logits for the finite actions.

  2. sample from the softmax-ed logits:
    tf.multinomial draws samples from a categorical distribution: given one row of k logits per state, it returns the indices of the sampled options. It takes the unnormalized logits directly and applies the softmax internally.

    recall that
    >> Bernoulli distribution models the outcome of a single trial of two options
    >> binomial distribution models the outcome of n trials of two options
    >> multinomial distribution models the outcome of n trials of k options

  3. log-likelihood: use the sampled action to index the log-probabilities, i.e. the log-softmax of the logits.
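
The three steps can be sketched without TensorFlow. In this sketch the logits are a hard-coded stand-in for the network output, the function names are hypothetical, and the log-likelihood is read from the log-softmax of the logits, because raw logits are unnormalized.

```python
import math
import random


def log_softmax(logits):
    """Numerically stable log-softmax: subtract the max before exponentiating."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(l - m) for l in logits))
    return [l - log_z for l in logits]


def sample_action(logits, rng):
    """Draw one action index with probability softmax(logits)."""
    probs = [math.exp(lp) for lp in log_softmax(logits)]
    return rng.choices(range(len(logits)), weights=probs, k=1)[0]


def log_likelihood(logits, action):
    """log pi(a|s): index the log-probabilities by the chosen action."""
    return log_softmax(logits)[action]


rng = random.Random(0)
logits = [2.0, 0.5, -1.0]  # stand-in for the network output at one state
action = sample_action(logits, rng)
print(action, log_likelihood(logits, action))
```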

2. Diagonal Gaussian policies

As the name suggests, the distribution from which actions are sampled is a diagonal multivariate Gaussian distribution. It is a special case of the multivariate Gaussian: diagonal refers to a diagonal covariance matrix, meaning the action dimensions are uncorrelated (and, being jointly Gaussian, independent of each other).

Normally such a distribution is represented by a mean vector and a diagonal covariance matrix (equivalently, a variance vector).

As common practice, the mean vector is modeled by a neural network that takes a state vector as input and outputs the mean action vector of the action distribution.

Two options are offered for modeling the variance vector.

  1. a standalone parameter vector independent of the state, i.e. $\log\sigma$
  2. dependent on the state, $\log\sigma_t=\phi_\theta(s_t)$, where the function $\phi$ is parameterized by $\theta$.

Note that we use the log standard deviation because it can take any value in $(-\infty,+\infty)$, while the standard deviation must be non-negative, and training turns out to be easier if we do not have to enforce constraints.

The sampling process can be done with tf.random_normal in TensorFlow, following exactly $a=\mu_\theta(s_t)+\sigma_\theta(s_t)\odot z$, where $\mu$ and $\sigma$ are respectively the mean and std-dev, and $z\sim N(0,I)$.
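
The same sampling rule in plain Python, as a rough sketch: `mu` is a hard-coded stand-in for the mean network output, `log_std` plays the role of the state-independent log-std parameters, and `sample_diag_gaussian` is a name I made up.

```python
import math
import random


def sample_diag_gaussian(mu, log_std, rng):
    """a = mu + sigma * z, with z ~ N(0, I) drawn independently per dimension."""
    return [m + math.exp(ls) * rng.gauss(0.0, 1.0) for m, ls in zip(mu, log_std)]


rng = random.Random(0)
mu = [0.0, 1.0]         # stand-in for the mean network output at one state
log_std = [-0.5, -0.5]  # state-independent log-std parameters

samples = [sample_diag_gaussian(mu, log_std, rng) for _ in range(10000)]
mean_est = [sum(s[i] for s in samples) / len(samples) for i in range(2)]
print(mean_est)  # the empirical mean should land close to mu
```

Because the noise enters only through `z`, the sample is a differentiable function of the mean and log-std, which is why this form is convenient when the parameters are trained.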

The log-likelihood is obtained via the following
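
For a $k$-dimensional action $a$ with mean $\mu=\mu_\theta(s)$ and standard deviation $\sigma=\sigma_\theta(s)$, the standard log-density of a diagonal Gaussian is

$$\log\pi_\theta(a\mid s)=-\frac{1}{2}\left(\sum_{i=1}^{k}\left(\frac{(a_i-\mu_i)^2}{\sigma_i^2}+2\log\sigma_i\right)+k\log 2\pi\right)$$

A minimal sketch of this formula, assuming the policy stores the log standard deviation as described above (`gaussian_log_likelihood` is a hypothetical helper name):

```python
import math


def gaussian_log_likelihood(action, mu, log_std):
    """Sum of per-dimension log N(a_i; mu_i, sigma_i^2), with sigma_i = exp(log_std_i)."""
    total = 0.0
    for a, m, ls in zip(action, mu, log_std):
        z = (a - m) / math.exp(ls)  # standardized residual
        total += -0.5 * (z * z + 2.0 * ls + math.log(2.0 * math.pi))
    return total


# At the mean with unit std-dev in k=2 dimensions, the value is -log(2*pi).
print(gaussian_log_likelihood([0.0, 1.0], [0.0, 1.0], [0.0, 0.0]))
```

Summing over dimensions is valid only because the covariance is diagonal, so the joint density factorizes into per-dimension densities.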



