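Reinforcement learning algorithms can be classified along several axes: by whether an environment model is used (model-based vs. model-free methods); by learning objective (value-based, policy-based, and Actor-Critic methods); by update scheme (Monte Carlo vs. temporal-difference methods); and by whether the policy that collects samples is the same as the policy being optimized (on-policy vs. off-policy methods). Classification by learning objective is the most commonly used.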
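The on-policy/off-policy distinction can be made concrete by comparing the one-step temporal-difference targets of SARSA and Q-learning. The following is a minimal sketch, not taken from any cited paper's code; the function names and the numeric values are illustrative assumptions.

```python
def sarsa_target(reward, gamma, q_next, next_action):
    """On-policy TD target: bootstrap from the action actually sampled in s'."""
    return reward + gamma * q_next[next_action]


def q_learning_target(reward, gamma, q_next):
    """Off-policy TD target: bootstrap from the greedy action in s',
    regardless of which action the behaviour policy sampled."""
    return reward + gamma * max(q_next)


# Hypothetical Q(s', a) estimates for two actions.
q_next = [1.0, 3.0]

# The behaviour policy happened to explore action 0 in s'.
sarsa_y = sarsa_target(reward=1.0, gamma=0.5, q_next=q_next, next_action=0)
q_learning_y = q_learning_target(reward=1.0, gamma=0.5, q_next=q_next)

print(sarsa_y, q_learning_y)
```

With the same transition, SARSA evaluates the policy that generated the data, while Q-learning evaluates the greedy policy, so the two targets differ whenever the sampled action is not the greedy one.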
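Value-based methods typically optimize the action-value function Q. Their advantages are high sample efficiency, low variance in the value estimates, and a lower tendency to get stuck in local optima. However, they generally cannot handle continuous action spaces, and the max operation in the DQN[1] update target (DQN explores with an ε-greedy policy) also tends to overestimate Q-values. Representative value-based methods include Q-learning[2], DQN, and the improvements built on DQN: prioritized experience replay[3] (weights transitions by their TD error to improve sample efficiency); Dueling DQN[4] (decomposes the action-value function Q into a state-value function V and an advantage function A to improve approximation capacity); Double DQN[5] (uses networks with different parameters to select and to evaluate actions, which mitigates overestimation); Retrace[6] (corrects how the Q-value target is computed and reduces the variance of the value estimate); Noisy DQN[7] (adds noise to the network parameters to strengthen exploration); and Distributional DQN[8] (estimates the distribution of returns instead of only the Q-value).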
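Two of the ideas above reduce to short formulas: the Double DQN target, which selects the next action with the online network and evaluates it with the target network, and the dueling aggregation Q(s, a) = V(s) + A(s, a) - mean_a A(s, a). The sketch below uses plain lists in place of network outputs; all names and numbers are illustrative assumptions.

```python
def dqn_target(reward, gamma, q_target_next):
    """DQN target: the target network both selects and evaluates the action."""
    return reward + gamma * max(q_target_next)


def double_dqn_target(reward, gamma, q_online_next, q_target_next):
    """Double DQN target: the online network selects the action,
    the target network evaluates it."""
    best_action = max(range(len(q_online_next)), key=lambda a: q_online_next[a])
    return reward + gamma * q_target_next[best_action]


def dueling_q_values(state_value, advantages):
    """Dueling aggregation: Q(s, a) = V(s) + A(s, a) - mean over a of A(s, a)."""
    mean_advantage = sum(advantages) / len(advantages)
    return [state_value + adv - mean_advantage for adv in advantages]


# Hypothetical next-state estimates from the two networks.
q_online_next = [2.0, 1.0]   # online network prefers action 0
q_target_next = [0.5, 4.0]   # target network has an inflated value for action 1

dqn_y = dqn_target(1.0, 0.5, q_target_next)
ddqn_y = double_dqn_target(1.0, 0.5, q_online_next, q_target_next)
dueling_q = dueling_q_values(1.0, [1.0, -1.0])

print(dqn_y, ddqn_y, dueling_q)
```

In this toy case the plain DQN target picks up the inflated estimate through the max, while the Double DQN target does not, which is the intuition behind its reduced overestimation.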
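Policy-based methods optimize the policy directly, iteratively updating it to maximize the cumulative return it obtains. Compared with value-based methods, they offer simple policy parameterization, good convergence properties, and applicability to continuous or high-dimensional action spaces. Representative policy-based methods include PG[9], TRPO[10], and PPO[11]; TRPO and PPO constrain the size of each PG-based update step to avoid policy collapse, which makes training more stable.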
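As one illustration of how the update step can be limited, the clipped surrogate objective from the PPO paper takes, per sample, the minimum of the unclipped and clipped probability-ratio terms. The sketch below evaluates it for a single sample; the function name and the default `clip_eps=0.2` are assumptions chosen for illustration.

```python
def ppo_clip_objective(ratio, advantage, clip_eps=0.2):
    """Per-sample PPO clipped surrogate.

    ratio     : pi_new(a|s) / pi_old(a|s)
    advantage : advantage estimate for the sampled action
    """
    clipped_ratio = max(1.0 - clip_eps, min(ratio, 1.0 + clip_eps))
    return min(ratio * advantage, clipped_ratio * advantage)


# Positive advantage: gains from pushing the ratio above 1 + clip_eps are cut off.
gain_capped = ppo_clip_objective(ratio=1.5, advantage=2.0)

# Negative advantage: gains from pushing the ratio below 1 - clip_eps are cut off.
loss_capped = ppo_clip_objective(ratio=0.5, advantage=-1.0)

print(gain_capped, loss_capped)
```

Because the objective stops improving once the ratio leaves the interval [1 - clip_eps, 1 + clip_eps] in the favourable direction, a single update has little incentive to move the policy far from the one that collected the data.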
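Actor-Critic methods combine the strengths of value-based and policy-based methods. They use a value-based component to learn a Q-function or V-function, which improves sample efficiency, and a policy-based component to learn the policy, which makes them applicable to both discrete and continuous action spaces. These methods can be viewed either as an extension of value-based methods to continuous action spaces or as a variance-reducing refinement of policy-based methods. They also inherit weaknesses from both families: for example, the Critic suffers from overestimation and the Actor from insufficient exploration. Representative methods include Actor-Critic (AC)[12] and its improvements: A3C[13] (extends AC to asynchronous parallel learning, which breaks the correlation between samples and speeds up data collection and training); DDPG[14] (inherits the target network from DQN and uses a deterministic policy as the Actor); TD3[15] (introduces Clipped Double Q-learning and delayed policy updates to address overestimation); and SAC[16] (adds entropy regularization to the Q-value estimate to improve exploration).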
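The Critic-side remedies mentioned above can be written as one-line targets. The sketch below shows a Clipped Double Q-learning target in the style of TD3 (bootstrap from the smaller of two critic estimates) and a simplified entropy-regularized target in the style of SAC. Scalars stand in for network outputs, and the names `td3_target`, `soft_target`, and `alpha` are illustrative assumptions rather than any library's API.

```python
def td3_target(reward, gamma, done, q1_next, q2_next):
    """Clipped Double Q-learning: bootstrap from the minimum of two critics."""
    return reward + gamma * (1.0 - done) * min(q1_next, q2_next)


def soft_target(reward, gamma, done, q_next, log_prob_next, alpha):
    """Entropy-regularized target: subtract alpha * log pi(a'|s') so that
    less likely (higher-entropy) actions receive a bonus."""
    return reward + gamma * (1.0 - done) * (q_next - alpha * log_prob_next)


# Hypothetical critic estimates for the next state-action pair.
y_td3 = td3_target(reward=1.0, gamma=0.5, done=0.0, q1_next=4.0, q2_next=2.0)

# Terminal transition: no bootstrapping, the target is just the reward.
y_terminal = td3_target(reward=1.0, gamma=0.5, done=1.0, q1_next=4.0, q2_next=2.0)

y_soft = soft_target(reward=1.0, gamma=0.5, done=0.0,
                     q_next=2.0, log_prob_next=-1.0, alpha=0.5)

print(y_td3, y_terminal, y_soft)
```

Taking the minimum biases the target downward, which counteracts the Critic's tendency to overestimate, while the entropy term rewards the Actor for keeping its action distribution spread out.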
Table 1. Value-based RL algorithms

| Algorithm | Policy type | Action space | Year | Paper title |
| --- | --- | --- | --- | --- |
| Q-learning | off-policy | discrete | 1992 | Q-learning |
| SARSA | on-policy | discrete | 1994 | Online Q-learning using connectionist systems[17] |
| REINFORCE | on-policy | discrete or continuous | 1988 | On the use of backpropagation in associative reinforcement learning[18] |
| DQN | off-policy | discrete | 2015 | Human-level control through deep reinforcement learning |
| Dueling DQN | off-policy | discrete | 2015 | Dueling network architectures for deep reinforcement learning |
| Double DQN | off-policy | discrete | 2016 | Deep reinforcement learning with double Q-learning |
| Noisy DQN | off-policy | discrete | 2017 | Noisy networks for exploration |
| Distributional DQN | off-policy | discrete | 2017 | A distributional perspective on reinforcement learning |
Table 2. Policy-based RL algorithms

| Algorithm | Policy type | Action space | Year | Paper title |
| --- | --- | --- | --- | --- |
| TRPO | on-policy | discrete or continuous | 2015 | Trust region policy optimization |
| PPO | on-policy | discrete or continuous | 2017 | Proximal policy optimization algorithms |
| DPPO | on-policy | discrete or continuous | 2017 | Emergence of locomotion behaviours in rich environments |
| ACKTR | on-policy | discrete or continuous | 2017 | Scalable trust-region method for deep reinforcement learning using Kronecker-factored approximation[19] |
Table 3. Actor-Critic RL algorithms

| Algorithm | Policy type | Action space | Year | Paper title |
| --- | --- | --- | --- | --- |
| Actor-critic (QAC) | on-policy | discrete or continuous | 2000 | Actor-critic algorithms |
| CE method | on-policy | discrete or continuous | 2004 | The cross-entropy method: A unified approach to Monte Carlo simulation, randomized optimization and machine learning[20] |
| A3C | on-policy | discrete or continuous | 2016 | Asynchronous methods for deep reinforcement learning |
| DDPG | off-policy | continuous | 2016 | Continuous control with deep reinforcement learning |
| TD3 | off-policy | continuous | 2018 | Addressing function approximation error in actor-critic methods |
| SAC | off-policy | discrete or continuous | 2018 | Soft actor-critic algorithms and applications |
References:
- [1] Mnih V, Kavukcuoglu K, Silver D, et al. Human-level control through deep reinforcement learning[J]. Nature, 2015, 518(7540): 529-533.
- [2] Watkins C J C H, Dayan P. Q-learning[J]. Machine Learning, 1992, 8(3-4): 279-292.
- [3] Schaul T, Quan J, Antonoglou I, et al. Prioritized experience replay[J]. arXiv preprint arXiv:1511.05952, 2015.
- [4] Wang Z, Schaul T, Hessel M, et al. Dueling network architectures for deep reinforcement learning[C]//International Conference on Machine Learning. 2016: 1995-2003.
- [5] Van Hasselt H, Guez A, Silver D. Deep reinforcement learning with double Q-learning[J]. arXiv preprint arXiv:1509.06461, 2015.
- [6] Munos R, Stepleton T, Harutyunyan A, et al. Safe and efficient off-policy reinforcement learning[C]//Advances in Neural Information Processing Systems. 2016: 1054-1062.
- [7] Fortunato M, Azar M G, Piot B, et al. Noisy networks for exploration[J]. arXiv preprint arXiv:1706.10295, 2017.
- [8] Bellemare M G, Dabney W, Munos R. A distributional perspective on reinforcement learning[J]. arXiv preprint arXiv:1707.06887, 2017.
- [9] Sutton R S, McAllester D A, Singh S P, et al. Policy gradient methods for reinforcement learning with function approximation[C]//Advances in Neural Information Processing Systems. 2000: 1057-1063.
- [10] Schulman J, Levine S, Abbeel P, et al. Trust region policy optimization[C]//International Conference on Machine Learning. 2015: 1889-1897.
- [11] Schulman J, Wolski F, Dhariwal P, et al. Proximal policy optimization algorithms[J]. arXiv preprint arXiv:1707.06347, 2017.
- [12] Sutton R S, Barto A G. Reinforcement learning: An introduction[M]. MIT Press, 2018.
- [13] Mnih V, Badia A P, Mirza M, et al. Asynchronous methods for deep reinforcement learning[C]//International Conference on Machine Learning. 2016: 1928-1937.
- [14] Lillicrap T P, Hunt J J, Pritzel A, et al. Continuous control with deep reinforcement learning[J]. arXiv preprint arXiv:1509.02971, 2015.
- [15] Fujimoto S, Van Hoof H, Meger D. Addressing function approximation error in actor-critic methods[J]. arXiv preprint arXiv:1802.09477, 2018.
- [16] Haarnoja T, Zhou A, Abbeel P, et al. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor[J]. arXiv preprint arXiv:1801.01290, 2018.
- [17] Rummery G A, Niranjan M. On-line Q-learning using connectionist systems[M]. Cambridge, UK: University of Cambridge, Department of Engineering, 1994.
- [18] Williams R J. On the use of backpropagation in associative reinforcement learning[C]//ICNN. 1988: 263-270.
- [19] Wu Y, Mansimov E, Grosse R B, et al. Scalable trust-region method for deep reinforcement learning using Kronecker-factored approximation[C]//Advances in Neural Information Processing Systems. 2017: 5279-5288.
- [20] Rubinstein R Y, Kroese D P. The cross-entropy method: A unified approach to Monte Carlo simulation, randomized optimization and machine learning[J]. Information Science & Statistics, Springer Verlag, NY, 2004.