"Reinforcement Learning" Reading Notes 2: Multi-armed Bandits

Other 2018-09-22 10:17:01

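Copyright notice: this is an original article by the blogger. Discussion and sharing are welcome; do not repost without the blogger's permission. https://blog.csdn.net/qjf42/article/details/79655483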

"Reinforcement Learning: An Introduction" reading notes - Table of contents

The difference between Reinforcement Learning and Supervised Learning

evaluate vs instruct

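That is, in RL the effect of an action is not black-or-white (right or wrong); each time an action is taken it may lead to a different outcome (feedback, reward)
- Not i.i.d.: the outcome depends on the environment and/or on previous actions
- The reward may be random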

Problem definition (k-armed bandit problem)

k actions => k stationary distributions of the reward $R$
Goal
- $\max\ E(\sum_t R_t)$

Some concepts

`exploitation vs exploration (EE)`

exploitation: greedy move
exploration: nongreedy trial

reward & value

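the value of an action $a$, denoted $q_*(a)$, is the expected reward given that $a$ is selected

i.e. $q_*(a) = E[R_t | A_t = a]$
Approximate it with the empirical (sample-average) estimate:
- $Q_t(a) = \frac{\sum_{i=1}^{t-1}R_i \cdot 1_{A_i=a}}{\sum_{i=1}^{t-1}1_{A_i=a}}$
- Incremental form (after taking action $a$ for the $n$-th time and receiving reward $R_n$, with $Q_n$ the estimate from the first $n-1$ rewards): $Q_{n+1} = Q_n + \frac{1}{n}(R_n - Q_n) = Q_n + \alpha(n)(R_n - Q_n)$, where $n$ counts the selections of $a$ rather than the global time step
More generally,
- $NewEstimate \leftarrow OldEstimate + StepSize \cdot (Target - OldEstimate)$
- Here, StepSize can be monotonically decreasing, constant (exponential smoothing), ...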
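A minimal sketch of the update rule above, assuming a hypothetical helper `update_estimate` and a made-up reward sequence. It contrasts the sample-average step size $1/n$ with a constant step size:

```python
def update_estimate(q, reward, step_size):
    """NewEstimate = OldEstimate + StepSize * (Target - OldEstimate)."""
    return q + step_size * (reward - q)

# Made-up rewards observed for one action (illustration only).
rewards = [1.0, 0.0, 2.0, 3.0]

# Sample average: step size 1/n, where n counts selections of this action.
q_avg = 0.0
for n, r in enumerate(rewards, start=1):
    q_avg = update_estimate(q_avg, r, 1.0 / n)

# Constant step size: recent rewards weigh more (exponential smoothing).
q_const = 0.0
for r in rewards:
    q_const = update_estimate(q_const, r, 0.5)

print(q_avg, q_const)
```

With step size $1/n$ the result equals the plain mean of the rewards; with a constant step size the initial estimate is never fully forgotten, but its weight decays geometrically.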

Several methods

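$\varepsilon$-greedy

Algorithm
- With probability $p = 1-\varepsilon$, take the greedy action (exploitation)
- With probability $p = \varepsilon$, take a nongreedy action chosen at random (exploration)
Advantages
- Simple to implement
- Performance is not too bad, even when the distributions are nonstationary
Disadvantages
- Usually converges rather slowly
- Even after convergence, plain $\varepsilon$-greedy takes the optimal (greedy) action only a fraction $1-\varepsilon < 1$ of the time
Possible improvements
- Decrease $\varepsilon$ over time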
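- Choose a larger initial value $Q_1(a)$ (optimistic initial values)
  - This encourages exploration; choose it large enough to guarantee that the whole action space gets covered
  - It does no harm in the nonstationary case, because the effect of the initial values is only temporary; the extra exploration is temporary too, so it does not help track later changes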
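The $\varepsilon$-greedy procedure above can be sketched as follows. This is an illustrative simulation, not the author's code: the Gaussian rewards with unit variance, the arm means, and the names `epsilon_greedy_action` and `run_bandit` are all assumptions. As in Sutton and Barto, the exploratory choice here is uniform over all arms, so it can occasionally pick the greedy arm too. The `initial_q` argument is there to try optimistic initial values.

```python
import random

def epsilon_greedy_action(q_estimates, epsilon, rng):
    """Explore with probability epsilon, otherwise act greedily."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_estimates))  # uniform over all arms
    best = max(q_estimates)
    # Break ties among greedy arms at random.
    return rng.choice([a for a, q in enumerate(q_estimates) if q == best])

def run_bandit(true_means, epsilon, steps, seed=0, initial_q=0.0):
    """Simulate one run; rewards are assumed Gaussian with unit variance."""
    rng = random.Random(seed)
    k = len(true_means)
    q = [initial_q] * k
    counts = [0] * k
    total = 0.0
    for _ in range(steps):
        a = epsilon_greedy_action(q, epsilon, rng)
        reward = rng.gauss(true_means[a], 1.0)
        counts[a] += 1
        q[a] += (reward - q[a]) / counts[a]  # sample-average update
        total += reward
    return q, counts, total / steps

q, counts, avg_reward = run_bandit([0.2, 0.8, 1.5], epsilon=0.1, steps=5000)
print(counts, round(avg_reward, 3))
```

Calling `run_bandit(..., epsilon=0.0, initial_q=5.0)` gives a purely greedy agent whose optimistic start forces it to try every arm early on.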

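UCB (Upper-Confidence-Bound)

Algorithm
- $A_t = \operatorname{argmax}_a \left( Q_t(a) + c\sqrt{\frac{\ln t}{N_t(a)+\epsilon}} \right)$
- $N_t(a)$ is the number of times $a$ has been selected before time $t$
- $\epsilon \rightarrow 0$ or $\epsilon = 1$; it keeps the bonus finite when $N_t(a) = 0$
- $c$ is the parameter that balances exploration and exploitation (analogous to a confidence level)
Disadvantages
- Less widely applicable than $\varepsilon$-greedy, e.g. it handles nonstationary distributions poorly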
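A small sketch of the UCB selection rule above. The function name `ucb_action` and the defaults `c=2.0` and `eps=1e-8` are assumptions chosen for illustration; `eps` plays the role of $\epsilon$ in the formula.

```python
import math

def ucb_action(q_estimates, counts, t, c=2.0, eps=1e-8):
    """Return argmax_a of Q(a) + c * sqrt(ln(t) / (N(a) + eps)), for t >= 1."""
    scores = [
        q + c * math.sqrt(math.log(t) / (n + eps))
        for q, n in zip(q_estimates, counts)
    ]
    return max(range(len(scores)), key=scores.__getitem__)

# An arm that has never been tried gets a very large bonus and is picked first.
print(ucb_action([1.0, 0.0], counts=[10, 0], t=11))
# With equal counts the bonuses cancel and the higher estimate wins.
print(ucb_action([1.0, 0.0], counts=[1000, 1000], t=2001))
```

The bonus shrinks as $N_t(a)$ grows and rises slowly with $\ln t$, so rarely tried arms are revisited from time to time.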

Gradient Bandit

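Algorithm
- Definitions
  - $H_t(a)$ is the preference for action $a$
  - $\pi_t(a) = P(A_t = a) = \frac{e^{H_t(a)}}{\sum_b e^{H_t(b)}}$ (softmax over the preferences); the action is sampled from $\pi_t$, not chosen by argmax
- Update
  - $H_{t+1}(A_t) = H_t(A_t) + \alpha (R_t - \bar R_t)(1 - \pi_t(A_t))$
  - $H_{t+1}(a) = H_t(a) - \alpha (R_t - \bar R_t)\pi_t(a), \text{ for all } a \ne A_t$
  - $\bar R_t$ is the average of the rewards received so far, used as a baseline
- Derivation
  - $E(R_t) =\sum_x \pi_t(x) q_*(x)$
  - $H_{t+1}(a) = H_t(a) + \alpha \frac{\partial E(R_t)}{\partial H_t(a) } = \dots$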
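The steps behind the $\dots$ can be sketched as follows, using the standard stochastic gradient ascent argument. $B_t$ is an arbitrary baseline that does not depend on $x$ (a symbol assumed here, not used elsewhere in these notes), and $1_{a=x}$ is the same indicator as in the sample-average estimate:

$$
\begin{aligned}
\frac{\partial \pi_t(x)}{\partial H_t(a)} &= \pi_t(x)\,(1_{a=x} - \pi_t(a)) \\
\frac{\partial E(R_t)}{\partial H_t(a)} &= \sum_x q_*(x)\,\frac{\partial \pi_t(x)}{\partial H_t(a)} = \sum_x (q_*(x) - B_t)\,\frac{\partial \pi_t(x)}{\partial H_t(a)} \\
&= \sum_x \pi_t(x)\,(q_*(x) - B_t)\,(1_{a=x} - \pi_t(a)) \\
&= E\big[(R_t - \bar R_t)\,(1_{a=A_t} - \pi_t(a))\big]
\end{aligned}
$$

Subtracting $B_t$ changes nothing because $\sum_x \frac{\partial \pi_t(x)}{\partial H_t(a)} = 0$ (the probabilities always sum to 1). The last line sets $B_t = \bar R_t$ and uses $E[R_t | A_t] = q_*(A_t)$. Replacing the expectation by the single sample observed at step $t$ gives the two update rules: the factor is $1 - \pi_t(A_t)$ for $a = A_t$ and $-\pi_t(a)$ otherwise.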
Advantages
- A general idea that carries over to the full RL problem later on
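A minimal sketch of one gradient bandit step, assuming hypothetical helper names (`softmax`, `sample_action`, `gradient_bandit_step`) and a baseline supplied by the caller, for example a running average of past rewards:

```python
import math
import random

def softmax(preferences):
    """Turn preferences H(a) into action probabilities pi(a)."""
    m = max(preferences)  # subtract the max for numerical stability
    exps = [math.exp(h - m) for h in preferences]
    total = sum(exps)
    return [e / total for e in exps]

def sample_action(preferences, rng):
    """Sample an action from pi instead of taking the argmax."""
    pi = softmax(preferences)
    return rng.choices(range(len(pi)), weights=pi)[0]

def gradient_bandit_step(preferences, action, reward, baseline, alpha):
    """Apply one preference update and return the new preferences."""
    pi = softmax(preferences)
    new_h = []
    for a, h in enumerate(preferences):
        if a == action:
            new_h.append(h + alpha * (reward - baseline) * (1.0 - pi[a]))
        else:
            new_h.append(h - alpha * (reward - baseline) * pi[a])
    return new_h

h = gradient_bandit_step([0.0, 0.0, 0.0], action=1, reward=1.0,
                         baseline=0.0, alpha=0.1)
print(h, sample_action(h, random.Random(0)))
```

A reward above the baseline raises the preference of the chosen action and lowers the others by amounts that sum to the same total, so the preferences keep a constant sum.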

Other

Bayesian methods(posterior sampling/Thompson sampling)

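Assume the value follows some (unknown) stationary distribution $f$
Assume a (known) prior distribution $f_{pri}$; take a sequence of actions and, from the results, obtain the posterior distribution $f_{post}$ (which converges to $f$)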
e.g
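One common instance, sketched here under assumptions that are not in the original notes: rewards are Bernoulli (0 or 1), each arm starts with a uniform Beta(1, 1) prior, and the names `thompson_step` and `run_thompson` are made up for illustration. At each step a mean is sampled from every arm's posterior and the arm with the largest sample is played.

```python
import random

def thompson_step(successes, failures, rng):
    """Sample a mean for each arm from its Beta posterior; pick the largest."""
    samples = [rng.betavariate(s + 1, f + 1)
               for s, f in zip(successes, failures)]
    return max(range(len(samples)), key=samples.__getitem__)

def run_thompson(true_probs, steps, seed=0):
    """Simulate Bernoulli arms; the posterior of arm a is Beta(s+1, f+1)."""
    rng = random.Random(seed)
    k = len(true_probs)
    successes, failures = [0] * k, [0] * k
    for _ in range(steps):
        a = thompson_step(successes, failures, rng)
        if rng.random() < true_probs[a]:
            successes[a] += 1
        else:
            failures[a] += 1
    return successes, failures

successes, failures = run_thompson([0.2, 0.5, 0.8], steps=2000)
pulls = [s + f for s, f in zip(successes, failures)]
print(pulls)
```

Arms with few pulls have wide posteriors and still get sampled high now and then, which is where the exploration comes from.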

How to compare (parameters & algorithms)

learning curve
- The x-axis is the parameter value and the y-axis is the average reward (e.g. averaged over 1000 experiments)
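A sketch of how such a comparison could be produced for $\varepsilon$-greedy. Everything here is an assumption for illustration: the Gaussian testbed, the number of runs and steps, and the names `average_reward` and `parameter_study`. The returned dictionary maps each parameter value (x-axis) to the reward averaged over all runs (y-axis).

```python
import random

def average_reward(epsilon, true_means, steps, rng):
    """Run one epsilon-greedy experiment and return its average reward."""
    k = len(true_means)
    q, counts, total = [0.0] * k, [0] * k, 0.0
    for _ in range(steps):
        if rng.random() < epsilon:
            a = rng.randrange(k)
        else:
            a = max(range(k), key=q.__getitem__)
        r = rng.gauss(true_means[a], 1.0)
        counts[a] += 1
        q[a] += (r - q[a]) / counts[a]
        total += r
    return total / steps

def parameter_study(epsilons, runs=200, steps=500, k=10, seed=0):
    """Average the per-run reward over many randomly drawn bandit problems."""
    rng = random.Random(seed)
    results = {}
    for eps in epsilons:
        scores = []
        for _ in range(runs):
            means = [rng.gauss(0.0, 1.0) for _ in range(k)]
            scores.append(average_reward(eps, means, steps, rng))
        results[eps] = sum(scores) / runs
    return results

study = parameter_study([0.01, 0.1, 0.5])
for eps, score in study.items():
    print(eps, round(score, 3))
```

Other algorithms can be compared the same way by sweeping their own parameter (for example $c$ for UCB or $\alpha$ for the gradient bandit) and plotting all curves on one chart.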

Other points

associative search (contextual bandits)

These are problems that involve different situations (environments), where the situation still does not depend on former actions

If actions are allowed to affect the next situation as well as the reward, then we have the full reinforcement learning problem.


Reposted from blog.csdn.net/qjf42/article/details/79655483
