[BM95] Boyan and Moore. Generalization in reinforcement learning: Safely approximating the value function. In Advances in Neural Information Processing Systems 7. Morgan Kaufmann, 1995.
Problems that arise when a multilayer perceptron is used to represent the value function.
[EPG05] D. Ernst, P. Geurts, and L. Wehenkel. Tree-based batch mode reinforcement learning. Journal of Machine Learning Research, 6:503–556, 2005.
NFQ is a special realisation of the 'Fitted Q Iteration' framework described here.
[Gor95] G. J. Gordon. Stable function approximation in dynamic programming. In A. Prieditis and S. Russell, editors, Proceedings of the Twelfth International Conference on Machine Learning (ICML), San Francisco, CA, 1995.
The fitted value iteration algorithm, on which NFQ is based.
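The batch-mode loop that fitted value iteration and NFQ share can be sketched as follows. This is a minimal illustration only: the `regressor_factory` argument and the encoding scheme are assumptions, and any supervised regressor with `fit`/`predict` can stand in for the multilayer perceptron that NFQ actually uses.

```python
import numpy as np

def fitted_q_iteration(transitions, n_actions, regressor_factory,
                       gamma=0.95, n_iterations=20):
    """Fitted Q iteration (sketch).

    transitions: list of (state, action, reward, next_state) tuples,
    with states as 1-D numpy arrays. regressor_factory() must return
    a fresh supervised regressor exposing fit(X, y) and predict(X);
    in NFQ this role is played by a multilayer perceptron trained
    with Rprop.
    """
    def encode(s, a):
        # Input pattern: state concatenated with a one-hot action code.
        one_hot = np.zeros(n_actions)
        one_hot[a] = 1.0
        return np.concatenate([s, one_hot])

    X = np.array([encode(s, a) for s, a, _, _ in transitions])
    q = None
    for _ in range(n_iterations):
        targets = []
        for s, a, r, s_next in transitions:
            if q is None:
                targets.append(r)  # first pass: Q is the immediate cost/reward
            else:
                q_next = [q.predict(encode(s_next, b)[None, :])[0]
                          for b in range(n_actions)]
                # Cost-minimisation convention, as in NFQ.
                targets.append(r + gamma * min(q_next))
        q = regressor_factory()
        q.fit(X, np.array(targets))  # supervised regression on Bellman targets
    return q
```

The key point of the scheme is that each iteration reduces to an ordinary supervised regression problem over the whole batch of transitions.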
[Lin92] L.-J. Lin. Self-improving reactive agents based on reinforcement learning, planning and teaching. Machine Learning, 8:293–321, 1992.
A successful example of representing the value function with a multilayer perceptron; introduces the 'experience replay' technique.
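A minimal sketch of the 'experience replay' idea from [Lin92]: past transitions are stored and replayed to the learner instead of being used once and discarded. The class name, capacity, and buffer layout below are illustrative assumptions, not details from the cited paper.

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-capacity store of past transitions for repeated reuse."""

    def __init__(self, capacity=10000, seed=None):
        self.buffer = deque(maxlen=capacity)  # oldest entries drop out first
        self.rng = random.Random(seed)

    def add(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # Replay a random batch of stored experiences for a learning update.
        return self.rng.sample(list(self.buffer), min(batch_size, len(self.buffer)))
```

Reusing stored transitions in this way is also what makes batch methods such as NFQ data-efficient: every collected experience can contribute to many updates.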
[LP03] M. Lagoudakis and R. Parr. Least-squares policy iteration. Journal of Machine Learning Research, 4:1107–1149, 2003.
Source of the samples, system equations, and parameters needed for the inverted pendulum (Section 5.1); the LSPI method and its results.
[RB93] M. Riedmiller and H. Braun. A direct adaptive method for faster backpropagation learning: The RPROP algorithm. In H. Ruspini, editor, Proceedings of the IEEE International Conference on Neural Networks (ICNN), pages 586–591, San Francisco, 1993.
The Rprop algorithm, a supervised batch-learning method, used here to train the Q-function.
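The Rprop update rule can be sketched as follows: per-weight step sizes are adapted from the sign of successive gradients, while gradient magnitudes are ignored. This is the simplified variant without weight backtracking, shown on a generic gradient function; all names and the default constants are illustrative, not taken from [RB93] verbatim.

```python
import numpy as np

def rprop_minimize(grad_fn, w0, n_steps=200,
                   eta_plus=1.2, eta_minus=0.5,
                   delta0=0.1, delta_min=1e-6, delta_max=50.0):
    """Rprop sketch (variant without weight backtracking)."""
    w = np.array(w0, dtype=float)
    delta = np.full_like(w, delta0)      # per-weight step sizes
    prev_grad = np.zeros_like(w)
    for _ in range(n_steps):
        g = grad_fn(w)
        sign_change = prev_grad * g
        # Same sign twice: grow the step; sign flip: shrink and skip the move.
        delta = np.where(sign_change > 0,
                         np.minimum(delta * eta_plus, delta_max), delta)
        delta = np.where(sign_change < 0,
                         np.maximum(delta * eta_minus, delta_min), delta)
        g = np.where(sign_change < 0, 0.0, g)  # suppress update after a flip
        w -= np.sign(g) * delta
        prev_grad = g
    return w
```

Because only gradient signs are used, Rprop is insensitive to gradient scaling, which makes it a robust choice for the batch training inside NFQ.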
[Rie00] M. Riedmiller. Concepts and facilities of a neural reinforcement learning control architecture for technical process control. Journal of Neural Computing and Application, 8:323–338, 2000.
A successful example of representing the value function with a multilayer perceptron.
[SB98] R. S. Sutton and A. G. Barto. Reinforcement Learning: An Introduction. MIT Press, Cambridge, MA, 1998.
The mountain-car model; the cart-pole model.
[Tes92] G. Tesauro. Practical issues in temporal difference learning. Machine Learning, 8:257–277, 1992.
A successful example of representing the value function with a multilayer perceptron.