【RL-Notes】Stochastic Dynamic Programming

Stochastic DP method

Given an initial state $x_0$ and a policy $\pi=\{\mu_0, \dots, \mu_{N-1}\}$, the future states $x_k$ and disturbances $w_k$ are random variables with distributions defined through the system equation

$$x_{k+1}=f_k\big(x_k, \mu_k(x_k), w_k\big), \quad k=0, 1, \dots, N-1$$

Thus, for given functions $g_k$, $k=0, 1, \dots, N$, the expected cost of $\pi$ starting at $x_0$ is

$$J_\pi(x_0)=\mathbb{E}\bigg\{g_N(x_N)+\sum_{k=0}^{N-1}g_k\big(x_k, \mu_k(x_k), w_k\big)\bigg\}$$

An optimal policy $\pi^*$ is one that minimizes this cost, i.e.,

$$J_{\pi^*}(x_0)=\min_{\pi\in\Pi} J_\pi(x_0)$$

where $\Pi$ is the set of all policies.
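
To make the expected-cost definition concrete, here is a minimal Python sketch that estimates $J_\pi(x_0)$ by Monte Carlo simulation of the system equation. The horizon, the dynamics `f`, the costs `g` and `g_N`, the disturbance distribution, and the policy `mu` are all illustrative placeholders chosen for this sketch, not taken from the source.

```python
import numpy as np

# Sketch: estimate J_pi(x0) by simulating x_{k+1} = f_k(x_k, mu_k(x_k), w_k)
# many times and averaging the accumulated cost.  All model pieces below are
# toy assumptions for illustration.

N = 10                          # horizon length
rng = np.random.default_rng(0)

def f(x, u, w):                 # system equation f_k (time-invariant here)
    return x + u + w

def g(x, u, w):                 # stage cost g_k
    return x**2 + u**2

def g_N(x):                     # terminal cost
    return x**2

def mu(x):                      # a fixed (not necessarily optimal) policy mu_k
    return -0.5 * x

def estimate_policy_cost(x0, num_rollouts=10_000):
    """Monte Carlo estimate of J_pi(x0) = E{ g_N(x_N) + sum_k g_k(x_k, mu(x_k), w_k) }."""
    total = 0.0
    for _ in range(num_rollouts):
        x, cost = x0, 0.0
        for _k in range(N):
            w = rng.normal(0.0, 0.1)    # disturbance w_k
            u = mu(x)
            cost += g(x, u, w)
            x = f(x, u, w)
        cost += g_N(x)
        total += cost
    return total / num_rollouts

print(estimate_policy_cost(x0=1.0))
```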

The optimal cost depends on $x_0$ and is denoted by $J^*(x_0)$; i.e.,

$$J^*(x_0)=\min_{\pi\in\Pi}J_\pi(x_0)$$

DP algorithm for stochastic finite horizon problems

Start with

$$J_N^*(x_N)=g_N(x_N)$$

and for $k=0, \dots, N-1$, let

$$J_k^*(x_k)=\min_{u_k\in U_k(x_k)}\mathbb{E}\bigg\{g_k(x_k, u_k, w_k)+J_{k+1}^*\big(f_k(x_k, u_k, w_k)\big)\bigg\}$$

If $u_k^*=\mu_k^*(x_k)$ minimizes the right side of this equation for each $x_k$ and $k$, the policy $\pi^*=\{\mu_0^*, \dots, \mu_{N-1}^*\}$ is optimal.

Simultaneously with the off-line computation of the optimal cost-to-go functions $J_0^*, \dots, J_N^*$, we can compute and store an optimal policy $\pi^*=\{\mu_0^*, \dots, \mu_{N-1}^*\}$. We can then use this policy on-line to retrieve from memory and apply the control $\mu_k^*(x_k)$ once we reach state $x_k$.
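
As a concrete illustration of this off-line computation, the following sketch runs the backward DP recursion over a small finite model, filling in the tables $J_k^*$ and the stored policy $\mu_k^*$. The states, controls, disturbance probabilities, and the functions `f`, `g`, `g_N` are made-up toy assumptions; `f` is chosen so that it always maps back into the state set.

```python
# Backward DP over a toy finite state/control/disturbance model (illustrative only).

states = [0, 1, 2, 3]
controls = [-1, 0, 1]
disturbances = [(-1, 0.25), (0, 0.5), (1, 0.25)]   # pairs (w, probability)
N = 5

def f(x, u, w):                       # dynamics, clipped so x_{k+1} stays in `states`
    return min(max(x + u + w, 0), 3)

def g(x, u, w):                       # stage cost
    return x**2 + abs(u)

def g_N(x):                           # terminal cost
    return x**2

# J[k][x] holds J_k*(x); policy[k][x] holds mu_k*(x).
J = [{x: None for x in states} for _ in range(N + 1)]
policy = [{x: None for x in states} for _ in range(N)]

for x in states:
    J[N][x] = g_N(x)                  # terminal condition J_N*(x_N) = g_N(x_N)

for k in range(N - 1, -1, -1):        # backward in time
    for x in states:
        best_cost, best_u = float("inf"), None
        for u in controls:
            # E{ g_k(x,u,w) + J_{k+1}*(f_k(x,u,w)) } over the disturbance distribution
            q = sum(p * (g(x, u, w) + J[k + 1][f(x, u, w)]) for w, p in disturbances)
            if q < best_cost:
                best_cost, best_u = q, u
        J[k][x], policy[k][x] = best_cost, best_u

print(J[0], policy[0])                # optimal cost-to-go and control at stage 0
```

On-line use then amounts to a table lookup: at stage $k$ in state $x_k$, apply `policy[k][x_k]`.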

Q-factors for stochastic problems

The Q-factors for a stochastic problem are defined, similarly to the deterministic case, as the expressions minimized on the right-hand side of the stochastic DP equation:

$$Q_k^*(x_k, u_k)=\mathbb{E}\bigg\{g_k(x_k, u_k, w_k)+J_{k+1}^*\big(f_k(x_k, u_k, w_k)\big)\bigg\}$$
The optimal cost-to-go functions $J_k^*$ can be recovered from the optimal Q-factors $Q_k^*$ by means of

$$J_k^*(x_k)=\min_{u_k\in U_k(x_k)} Q_k^*(x_k, u_k)$$

and the DP algorithm can be written in terms of Q-factors as

$$Q_k^*(x_k, u_k)=\mathbb{E}\bigg\{g_k(x_k, u_k, w_k)+\min_{u_{k+1}} Q_{k+1}^*\big(f_k(x_k, u_k, w_k), u_{k+1}\big)\bigg\}$$
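
The same backward recursion can be carried out directly on Q-factors. The sketch below repeats the toy model from the DP sketch above (again purely illustrative) so that the block is self-contained, and recovers $J_0^*$ and $\mu_0^*$ via $J_k^*(x_k)=\min_{u_k} Q_k^*(x_k, u_k)$.

```python
# Backward recursion on Q-factors for the same toy finite model (illustrative only).

states = [0, 1, 2, 3]
controls = [-1, 0, 1]
disturbances = [(-1, 0.25), (0, 0.5), (1, 0.25)]   # pairs (w, probability)
N = 5

def f(x, u, w):                       # dynamics, clipped so the next state stays in `states`
    return min(max(x + u + w, 0), 3)

def g(x, u, w):                       # stage cost
    return x**2 + abs(u)

def g_N(x):                           # terminal cost
    return x**2

Q = [dict() for _ in range(N)]        # Q[k][(x, u)] holds Q_k*(x, u)

def J_star(k, x):
    """J_k*(x) = min_u Q_k*(x, u), with terminal condition J_N*(x) = g_N(x)."""
    if k == N:
        return g_N(x)
    return min(Q[k][(x, u)] for u in controls)

for k in range(N - 1, -1, -1):        # backward in time
    for x in states:
        for u in controls:
            # Q_k*(x,u) = E{ g_k(x,u,w) + min_{u'} Q_{k+1}*(f_k(x,u,w), u') }
            Q[k][(x, u)] = sum(
                p * (g(x, u, w) + J_star(k + 1, f(x, u, w))) for w, p in disturbances
            )

# Recover the optimal cost-to-go and policy at stage 0 from the Q-factors.
J0 = {x: J_star(0, x) for x in states}
mu0 = {x: min(controls, key=lambda u: Q[0][(x, u)]) for x in states}
print(J0, mu0)
```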

Source

RL & OC

Reposted from blog.csdn.net/qq_18822147/article/details/121107577