Stochastic DP method
Given an initial state $x_0$ and a policy $\pi=\{\mu_0, \dots, \mu_{N-1}\}$, the future states $x_k$ and disturbances $w_k$ are random variables with distributions defined through the system equation

$$x_{k+1}=f_k(x_k, \mu_k(x_k), w_k), \quad k=0, 1, \dots, N-1$$
Thus, for given functions $g_k,\ k=0, 1, \dots, N$, the expected cost of $\pi$ starting at $x_0$ is

$$J_\pi(x_0)=\mathbb{E}\bigg\{g_N(x_N)+\sum_{k=0}^{N-1}g_k(x_k, \mu_k(x_k), w_k)\bigg\}$$
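For intuition, the expected cost of a fixed policy can be approximated by Monte Carlo simulation. Below is a minimal sketch; the callables `f(k, x, u, w)`, `g(k, x, u, w)`, `g_N(x)`, and the disturbance sampler `sample_w(k, x, u)` are hypothetical names that simply mirror the notation above, not an interface from the source.

```python
def evaluate_policy(x0, policy, f, g, g_N, sample_w, N, num_rollouts=10_000):
    """Monte Carlo estimate of J_pi(x0): average the total cost of
    simulated trajectories under x_{k+1} = f_k(x_k, mu_k(x_k), w_k)."""
    total = 0.0
    for _ in range(num_rollouts):
        x, cost = x0, 0.0
        for k in range(N):
            u = policy[k](x)           # u_k = mu_k(x_k)
            w = sample_w(k, x, u)      # draw disturbance w_k
            cost += g(k, x, u, w)      # stage cost g_k(x_k, u_k, w_k)
            x = f(k, x, u, w)          # system equation
        cost += g_N(x)                 # terminal cost g_N(x_N)
        total += cost
    return total / num_rollouts
```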
An optimal policy $\pi^*$ is one that minimizes this cost, i.e.,

$$J_{\pi^*}(x_0)=\min_{\pi\in\Pi} J_\pi(x_0)$$

where $\Pi$ is the set of all policies.
The optimal cost depends on $x_0$ and is denoted by $J^*(x_0)$; i.e.,

$$J^*(x_0)=\min_{\pi\in\Pi}J_\pi(x_0)$$
DP algorithm for stochastic finite horizon problems
Start with

$$J_N^*(x_N)=g_N(x_N)$$

and, going backwards, for $k=N-1, \dots, 0$, let

$$J_k^*(x_k)=\min_{u_k\in U_k(x_k)}\mathbb{E}\bigg\{g_k(x_k, u_k, w_k)+J_{k+1}^*(f_k(x_k, u_k, w_k))\bigg\}$$
If $u_k^*=\mu_k^*(x_k)$ minimizes the right side of this equation for each $x_k$ and $k$, the policy $\pi^*=\{\mu_0^*, \dots, \mu_{N-1}^*\}$ is optimal.
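As a concrete illustration, here is a minimal sketch of the backward recursion for a problem with finite state, control, and disturbance spaces. The interface (`p_w`, `controls`, and so on) is hypothetical and just mirrors the notation above; for continuous spaces the minimization and expectation would need discretization or function approximation.

```python
def stochastic_dp(states, controls, p_w, f, g, g_N, N):
    """Backward DP over finite spaces. Returns the optimal cost-to-go
    tables J[k][x] and an optimal policy mu[k][x].

    p_w(k, x, u) -> dict {w: prob}   disturbance distribution
    f(k, x, u, w) -> next state      system equation
    g(k, x, u, w) -> stage cost,     g_N(x) -> terminal cost
    controls(k, x) -> feasible set U_k(x_k)
    """
    J = [dict() for _ in range(N + 1)]
    mu = [dict() for _ in range(N)]
    J[N] = {x: g_N(x) for x in states}            # J_N^*(x_N) = g_N(x_N)
    for k in range(N - 1, -1, -1):                # k = N-1, ..., 0
        for x in states:
            best_u, best_q = None, float("inf")
            for u in controls(k, x):
                # E{ g_k(x,u,w) + J_{k+1}^*(f_k(x,u,w)) }
                q = sum(p * (g(k, x, u, w) + J[k + 1][f(k, x, u, w)])
                        for w, p in p_w(k, x, u).items())
                if q < best_q:
                    best_u, best_q = u, q
            J[k][x] = best_q
            mu[k][x] = best_u                     # mu_k^*(x_k)
    return J, mu
```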
Simultaneously with the off-line computation of the optimal cost-to-go functions $J_0^*, \dots, J_N^*$, we can compute and store an optimal policy $\pi^*=\{\mu_0^*, \dots, \mu_{N-1}^*\}$. We can then use this policy on-line to retrieve from memory and apply the control $\mu_k^*(x_k)$ once we reach state $x_k$, as in the short sketch below.
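Given the tables returned by the sketch above, the on-line phase reduces to a table lookup at each visited state (again assuming the hypothetical `f` and `sample_w` from before):

```python
# Off-line (once): J, mu = stochastic_dp(states, controls, p_w, f, g, g_N, N)
x = x0
for k in range(N):
    u = mu[k][x]            # retrieve the stored control mu_k^*(x_k)
    w = sample_w(k, x, u)   # disturbance realized by the system
    x = f(k, x, u, w)       # move to x_{k+1}
```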
Q-factors for stochastic problems
The Q-factors for a stochastic problem are defined, similarly to the deterministic case, as the expressions minimized on the right-hand side of the stochastic DP equation:

$$Q_k^*(x_k, u_k)=\mathbb{E}\bigg\{g_k(x_k, u_k, w_k)+J_{k+1}^*(f_k(x_k, u_k, w_k))\bigg\}$$
The optimal cost-to-go functions $J_k^*$ can be recovered from the optimal Q-factors $Q_k^*$ by means of

$$J_k^*(x_k)=\min_{u_k\in U_k(x_k)} Q_k^*(x_k, u_k)$$
and the DP algorithm can be written in terms of Q-factors as

$$Q_k^*(x_k, u_k)=\mathbb{E}\bigg\{g_k(x_k, u_k, w_k)+\min_{u_{k+1}\in U_{k+1}(f_k(x_k, u_k, w_k))} Q_{k+1}^*\big(f_k(x_k, u_k, w_k), u_{k+1}\big)\bigg\}$$
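The same finite-space sketch can be run purely on Q-factors, recovering $J_{k+1}^*$ by minimization at each step (same hypothetical interface as above):

```python
def q_factor_dp(states, controls, p_w, f, g, g_N, N):
    """Backward recursion on optimal Q-factors over finite spaces,
    using J_{k+1}^*(x) = min_u Q_{k+1}^*(x, u) at each step."""
    J_next = {x: g_N(x) for x in states}          # J_N^* = g_N
    Q = [dict() for _ in range(N)]
    for k in range(N - 1, -1, -1):
        for x in states:
            for u in controls(k, x):
                # Q_k^*(x,u) = E{ g_k(x,u,w) + J_{k+1}^*(f_k(x,u,w)) }
                Q[k][(x, u)] = sum(
                    p * (g(k, x, u, w) + J_next[f(k, x, u, w)])
                    for w, p in p_w(k, x, u).items())
        # Recover J_k^*(x) = min_{u in U_k(x)} Q_k^*(x, u)
        J_next = {x: min(Q[k][(x, u)] for u in controls(k, x))
                  for x in states}
    return Q
```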
Source: RL & OC