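Notes on my learning process
【3.2 The Simplest and Most Commonly Used Decision Trees [Stanford Fall 2021: Practical Machine Learning, Chinese Edition]】
Decision trees can be used for both classification and regression tasks (classification trees and regression trees).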
Decision Trees
- Pros
  - Explainable
  - Can handle both numerical features (greater-than / less-than tests) and categorical features
- Cons
  - Very non-robust: easily affected by noise in the data (ensembles help)
  - Complex trees cause over-fitting (pruning the trees helps)
  - Not easy to parallelize: building a tree is a sequential process, so performance suffers somewhat
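To make the "explainable" and "numerical plus categorical" points concrete, here is a minimal sketch of a hand-written tree. The tree itself, the feature names (`age`, `owns_home`), the threshold, and the labels are all hypothetical and chosen only for illustration; a real tree would be learned from data.

```python
# Hypothetical toy tree: every feature name, threshold, and label is made up.
TREE = {
    "feature": "age", "kind": "numerical", "threshold": 30,
    "left": {"label": "reject"},                  # taken when age <= 30
    "right": {                                    # taken when age > 30
        "feature": "owns_home", "kind": "categorical", "value": "yes",
        "left": {"label": "approve"},             # taken when owns_home == "yes"
        "right": {"label": "reject"},             # taken otherwise
    },
}


def predict(node, example, path=None):
    """Walk the tree; return (label, list of human-readable decisions taken)."""
    path = [] if path is None else path
    if "label" in node:                           # leaf node: stop here
        return node["label"], path
    value = example[node["feature"]]
    if node["kind"] == "numerical":
        go_left = value <= node["threshold"]      # numerical: threshold test
        test = f'{node["feature"]} <= {node["threshold"]}'
    else:
        go_left = value == node["value"]          # categorical: equality test
        test = f'{node["feature"]} == {node["value"]!r}'
    path.append(f"{test}: {go_left}")
    return predict(node["left"] if go_left else node["right"], example, path)


label, path = predict(TREE, {"age": 42, "owns_home": "yes"})
print(label)   # the predicted label
print(path)    # the sequence of tests that led to it
```

The returned `path` is what makes the prediction explainable: it lists exactly which tests were applied and how each one came out.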
Random Forest
- Train multiple decision trees to improve robustness
- Each tree is trained independently
- Majority voting for classification, averaging for regression
- The price is a somewhat higher training cost
- Where does the randomness come from? (two sources)
  - Bagging: randomly sample training examples with replacement (draw one sample, train a tree on it alone, and repeat this to train n trees)
    - E.g. [1,2,3,4,5] → [1,2,2,3,4] (a sample may contain duplicates)
  - Randomly select a subset of features (each tree sees only part of the feature set, not all of it)
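The two sources of randomness and the voting step can be sketched in plain Python. This is only an illustration under simplifying assumptions: each "tree" is a one-split decision stump, and the helper names (`fit_stump`, `fit_forest`, `predict_forest`) and the toy data are hypothetical, not taken from the lecture.

```python
import random
from collections import Counter


def fit_stump(X, y, feature_ids):
    """Depth-1 tree: pick the (feature, threshold) with the fewest training errors."""
    best = None
    for j in feature_ids:
        for t in sorted({row[j] for row in X}):
            left = [yi for row, yi in zip(X, y) if row[j] <= t]
            right = [yi for row, yi in zip(X, y) if row[j] > t]
            left_label = Counter(left).most_common(1)[0][0]
            right_label = Counter(right).most_common(1)[0][0] if right else left_label
            errors = (sum(yi != left_label for yi in left)
                      + sum(yi != right_label for yi in right))
            if best is None or errors < best[0]:
                best = (errors, j, t, left_label, right_label)
    _, j, t, left_label, right_label = best
    return lambda row: left_label if row[j] <= t else right_label


def fit_forest(X, y, n_trees=25, n_features=1, seed=0):
    rng = random.Random(seed)
    n, d = len(X), len(X[0])
    trees = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]    # bagging: sample with replacement
        feats = rng.sample(range(d), n_features)      # random subset of features
        trees.append(fit_stump([X[i] for i in idx], [y[i] for i in idx], feats))
    return trees


def predict_forest(trees, row):
    """Classification: majority vote over the independently trained trees."""
    votes = Counter(tree(row) for tree in trees)
    return votes.most_common(1)[0][0]


# Toy data (made up): both features separate class 0 from class 1.
X = [[1, 1], [2, 2], [3, 1], [6, 7], [7, 8], [8, 7]]
y = [0, 0, 0, 1, 1, 1]
trees = fit_forest(X, y)
print(predict_forest(trees, [0.5, 0.5]), predict_forest(trees, [9, 9]))
```

For regression, the vote in `predict_forest` would be replaced by the mean of the trees' outputs. Because each tree depends only on its own bootstrap sample, the loop in `fit_forest` could run in parallel.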
Gradient Boosting Decision Trees
- Train multiple trees sequentially (many trees are trained, as before, but one after another rather than independently; together they form one larger model)
- At step $t = 1, \ldots$, denote by $F_t(x)$ the sum of the past trained trees (each tree is a function, and the model's output is the sum of the outputs of all trees trained so far)
  - Train a new tree $f_t$ on the residuals $\{(x_i,\ y_i - F_t(x_i))\}_{i=1,\ldots}$ (the new tree is trained not on the original data but on the residuals, i.e. the difference between the true values and the current predictions, which is the part the model has not yet fit well); this moves the prediction closer to the true values
  - $F_{t+1}(x) = F_t(x) + f_t(x)$
- The residual equals $-\partial L / \partial F$ when mean squared error is used as the loss, so the method is called gradient boosting. For example, with $L = \frac{1}{2}(y - F(x))^2$ we get $-\partial L / \partial F = y - F(x)$, so each new tree is trained to fit the negative gradient (gradient descent is defined in a separate section)
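The update $F_{t+1}(x) = F_t(x) + f_t(x)$ can be traced in a short plain-Python sketch. It assumes squared-error loss, 1-D inputs, and a one-split regression stump as each tree; the function names and the toy data are hypothetical. Like the notes, it uses no learning rate, although practical libraries usually shrink each tree's contribution.

```python
def fit_regression_stump(xs, targets):
    """One-split regression tree on 1-D inputs, minimizing squared error."""
    best = None
    for t in sorted(set(xs)):
        left = [r for x, r in zip(xs, targets) if x <= t]
        right = [r for x, r in zip(xs, targets) if x > t]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right) if right else left_mean
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, t, left_mean, right_mean)
    _, t, left_mean, right_mean = best
    return lambda x: left_mean if x <= t else right_mean


def fit_gbdt(xs, ys, n_trees=20):
    trees = []
    F = [0.0] * len(xs)                 # current predictions; no trees trained yet
    history = []                        # training MSE before each new tree
    for _ in range(n_trees):
        residuals = [y - f for y, f in zip(ys, F)]       # y_i - F_t(x_i)
        history.append(sum(r * r for r in residuals) / len(xs))
        tree = fit_regression_stump(xs, residuals)       # f_t is fit to the residuals
        trees.append(tree)
        F = [f + tree(x) for f, x in zip(F, xs)]         # F_{t+1} = F_t + f_t
    return trees, history


def predict_gbdt(trees, x):
    """The model's output is the sum of the outputs of all trees."""
    return sum(tree(x) for tree in trees)


# Toy data (made up for illustration).
xs = [1, 2, 3, 4, 5, 6]
ys = [1.0, 1.2, 3.1, 2.9, 5.0, 5.2]
trees, history = fit_gbdt(xs, ys)
print(history[0], history[-1])          # training error before the first and last tree
```

Each tree depends on the residuals left by all previous trees, which is why this loop, unlike the random forest one, cannot be parallelized across trees.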
Summary
- Decision tree: an explainable model for classification/regression
- Ensemble trees to reduce bias and variance (a single decision tree is very sensitive to noise in the data; bias and variance are defined in a separate section)
  - Random forest: trees trained in parallel with randomness
  - Gradient boosting trees: trees trained sequentially on residuals (each new tree keeps fitting the part that the previous trees predicted poorly)
- Trees are widely used in industry
  - Simple to train, few hyperparameters to tune, and they often give satisfactory results