【RL-Notes】Stochastic Dynamic Programming

Stochastic DP method

Given an initial state $x_0$ and a policy $\pi=\{\mu_0, \dots, \mu_{N-1}\}$, the future states $x_k$ and disturbances $w_k$ are random variables with distributions defined through the system equation

$$x_{k+1}=f_k\big(x_k, \mu_k(x_k), w_k\big), \quad k=0, 1, \dots, N-1$$

Thus, for given functions $g_k$, $k=0, 1, \dots, N$, the expected cost of $\pi$ starting at $x_0$ is

$$J_\pi(x_0)=\mathbb{E}\bigg\{g_N(x_N)+\sum_{k=0}^{N-1}g_k\big(x_k, \mu_k(x_k), w_k\big)\bigg\}$$

An optimal policy $\pi^*$ is one that minimizes this cost, i.e.,

$$J_{\pi^*}(x_0)=\min_{\pi\in\Pi} J_\pi(x_0)$$

where $\Pi$ is the set of all policies.
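
To make the expected-cost definition concrete, here is a minimal Python sketch that estimates $J_\pi(x_0)$ by Monte Carlo simulation of the system equation. The horizon, the dynamics `f`, the costs `g` and `g_N`, the disturbance distribution, and the policy `mu` are all illustrative placeholders chosen for this sketch, not taken from the source.

```python
import numpy as np

# Sketch: estimate J_pi(x0) by simulating x_{k+1} = f_k(x_k, mu_k(x_k), w_k)
# many times and averaging the accumulated cost.  All model pieces below are
# toy assumptions for illustration.

N = 10                          # horizon length
rng = np.random.default_rng(0)

def f(x, u, w):                 # system equation f_k (time-invariant here)
    return x + u + w

def g(x, u, w):                 # stage cost g_k
    return x**2 + u**2

def g_N(x):                     # terminal cost
    return x**2

def mu(x):                      # a fixed (not necessarily optimal) policy mu_k
    return -0.5 * x

def estimate_policy_cost(x0, num_rollouts=10_000):
    """Monte Carlo estimate of J_pi(x0) = E{ g_N(x_N) + sum_k g_k(x_k, mu(x_k), w_k) }."""
    total = 0.0
    for _ in range(num_rollouts):
        x, cost = x0, 0.0
        for _k in range(N):
            w = rng.normal(0.0, 0.1)    # disturbance w_k
            u = mu(x)
            cost += g(x, u, w)
            x = f(x, u, w)
        cost += g_N(x)
        total += cost
    return total / num_rollouts

print(estimate_policy_cost(x0=1.0))
```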

The optimal cost depends on $x_0$ and is denoted by $J^*(x_0)$; i.e.,

$$J^*(x_0)=\min_{\pi\in\Pi}J_\pi(x_0)$$

DP algorithm for stochastic finite horizon problems

Start with

$$J_N^*(x_N)=g_N(x_N)$$

and for $k=0, \dots, N-1$, let

$$J_k^*(x_k)=\min_{u_k\in U_k(x_k)}\mathbb{E}\bigg\{g_k(x_k, u_k, w_k)+J_{k+1}^*\big(f_k(x_k, u_k, w_k)\big)\bigg\}$$

If $u_k^*=\mu_k^*(x_k)$ minimizes the right side of this equation for each $x_k$ and $k$, the policy $\pi^*=\{\mu_0^*, \dots, \mu_{N-1}^*\}$ is optimal.

Simultaneously with the off-line computation of the optimal cost-to-go functions $J_0^*, \dots, J_N^*$, we can compute and store an optimal policy $\pi^*=\{\mu_0^*, \dots, \mu_{N-1}^*\}$. We can then use this policy on-line to retrieve from memory and apply the control $\mu_k^*(x_k)$ once we reach state $x_k$.
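
As a concrete illustration of this off-line computation, the following sketch runs the backward DP recursion over a small finite model, filling in the tables $J_k^*$ and the stored policy $\mu_k^*$. The states, controls, disturbance probabilities, and the functions `f`, `g`, `g_N` are made-up toy assumptions; `f` is chosen so that it always maps back into the state set.

```python
# Backward DP over a toy finite state/control/disturbance model (illustrative only).

states = [0, 1, 2, 3]
controls = [-1, 0, 1]
disturbances = [(-1, 0.25), (0, 0.5), (1, 0.25)]   # pairs (w, probability)
N = 5

def f(x, u, w):                       # dynamics, clipped so x_{k+1} stays in `states`
    return min(max(x + u + w, 0), 3)

def g(x, u, w):                       # stage cost
    return x**2 + abs(u)

def g_N(x):                           # terminal cost
    return x**2

# J[k][x] holds J_k*(x); policy[k][x] holds mu_k*(x).
J = [{x: None for x in states} for _ in range(N + 1)]
policy = [{x: None for x in states} for _ in range(N)]

for x in states:
    J[N][x] = g_N(x)                  # terminal condition J_N*(x_N) = g_N(x_N)

for k in range(N - 1, -1, -1):        # backward in time
    for x in states:
        best_cost, best_u = float("inf"), None
        for u in controls:
            # E{ g_k(x,u,w) + J_{k+1}*(f_k(x,u,w)) } over the disturbance distribution
            q = sum(p * (g(x, u, w) + J[k + 1][f(x, u, w)]) for w, p in disturbances)
            if q < best_cost:
                best_cost, best_u = q, u
        J[k][x], policy[k][x] = best_cost, best_u

print(J[0], policy[0])                # optimal cost-to-go and control at stage 0
```

On-line use then amounts to a table lookup: at stage $k$ in state $x_k$, apply `policy[k][x_k]`.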

Q-factors for stochastic problems

The Q-factors for a stochastic problem are defined, similarly to the deterministic case, as the expressions minimized on the right-hand side of the stochastic DP equation:

$$Q_k^*(x_k, u_k)=\mathbb{E}\bigg\{g_k(x_k, u_k, w_k)+J_{k+1}^*\big(f_k(x_k, u_k, w_k)\big)\bigg\}$$
The optimal cost-to-go functions $J_k^*$ can be recovered from the optimal Q-factors $Q_k^*$ by means of

$$J_k^*(x_k)=\min_{u_k\in U_k(x_k)} Q_k^*(x_k, u_k)$$

and the DP algorithm can be written in terms of Q-factors as

$$Q_k^*(x_k, u_k)=\mathbb{E}\bigg\{g_k(x_k, u_k, w_k)+\min_{u_{k+1}} Q_{k+1}^*\big(f_k(x_k, u_k, w_k), u_{k+1}\big)\bigg\}$$
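
The same backward recursion can be carried out directly on Q-factors. The sketch below repeats the toy model from the DP sketch above (again purely illustrative) so that the block is self-contained, and recovers $J_0^*$ and $\mu_0^*$ via $J_k^*(x_k)=\min_{u_k} Q_k^*(x_k, u_k)$.

```python
# Backward recursion on Q-factors for the same toy finite model (illustrative only).

states = [0, 1, 2, 3]
controls = [-1, 0, 1]
disturbances = [(-1, 0.25), (0, 0.5), (1, 0.25)]   # pairs (w, probability)
N = 5

def f(x, u, w):                       # dynamics, clipped so the next state stays in `states`
    return min(max(x + u + w, 0), 3)

def g(x, u, w):                       # stage cost
    return x**2 + abs(u)

def g_N(x):                           # terminal cost
    return x**2

Q = [dict() for _ in range(N)]        # Q[k][(x, u)] holds Q_k*(x, u)

def J_star(k, x):
    """J_k*(x) = min_u Q_k*(x, u), with terminal condition J_N*(x) = g_N(x)."""
    if k == N:
        return g_N(x)
    return min(Q[k][(x, u)] for u in controls)

for k in range(N - 1, -1, -1):        # backward in time
    for x in states:
        for u in controls:
            # Q_k*(x,u) = E{ g_k(x,u,w) + min_{u'} Q_{k+1}*(f_k(x,u,w), u') }
            Q[k][(x, u)] = sum(
                p * (g(x, u, w) + J_star(k + 1, f(x, u, w))) for w, p in disturbances
            )

# Recover the optimal cost-to-go and policy at stage 0 from the Q-factors.
J0 = {x: J_star(0, x) for x in states}
mu0 = {x: min(controls, key=lambda u: Q[0][(x, u)]) for x in states}
print(J0, mu0)
```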

Source

RL & OC

Reposted from blog.csdn.net/qq_18822147/article/details/121107577