[Practical Machine Learning] 3.2 Decision Trees

Decision trees can be used for both classification and regression tasks (classification trees and regression trees).

Decision Trees

Pros
- Explainable
- Can handle both numerical and categorical features (numerical features are split by greater-than / less-than comparisons, categorical features by category)
Cons
- Not robust: easily affected by noise in the data (ensemble learning helps)
- Complex trees cause overfitting (pruning helps)
- Not easy to parallelize (building a tree is a sequential process, so performance suffers somewhat)
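How a single tree node can handle both feature types can be sketched as follows. This is a minimal illustration, not the course's implementation: the Gini impurity criterion, the one-vs-rest treatment of categories, the function names, and the toy data are all assumptions made for the example.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels (0.0 means the node is pure)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def split_impurity(left, right):
    """Size-weighted Gini impurity of the two children of a split."""
    n = len(left) + len(right)
    return (len(left) * gini(left) + len(right) * gini(right)) / n

def best_numerical_split(values, labels):
    """Try 'value <= threshold' at each midpoint between sorted distinct values."""
    distinct = sorted(set(values))
    best = (None, float("inf"))
    for lo, hi in zip(distinct, distinct[1:]):
        threshold = (lo + hi) / 2
        left = [y for v, y in zip(values, labels) if v <= threshold]
        right = [y for v, y in zip(values, labels) if v > threshold]
        score = split_impurity(left, right)
        if score < best[1]:
            best = (threshold, score)
    return best

def best_categorical_split(values, labels):
    """Try 'value == category' against all other categories (one-vs-rest)."""
    best = (None, float("inf"))
    for category in sorted(set(values)):
        left = [y for v, y in zip(values, labels) if v == category]
        right = [y for v, y in zip(values, labels) if v != category]
        score = split_impurity(left, right)
        if score < best[1]:
            best = (category, score)
    return best

# Toy data, made up for illustration: age is numerical, colour is categorical.
ages = [22, 25, 31, 40, 47, 52]
colours = ["red", "blue", "red", "green", "blue", "green"]
labels = ["no", "no", "no", "yes", "yes", "yes"]

age_threshold, age_score = best_numerical_split(ages, labels)
colour_value, colour_score = best_categorical_split(colours, labels)
print("age split:", age_threshold, age_score)
print("colour split:", colour_value, colour_score)
```

On this toy data the age threshold separates the two classes perfectly, while no single colour does, so a tree would split on age first. A full tree repeats this search recursively in each child node.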

Random Forest

Train multiple decision trees to improve robustness
- Each tree is trained independently
- Majority voting for classification, average for regression
- The price is a somewhat higher training cost
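The two aggregation rules can be sketched in a few lines. This is only an illustration: the function names and the per-tree outputs below are hypothetical, standing in for the predictions of five independently trained trees on one input.

```python
from collections import Counter
from statistics import mean

def majority_vote(predictions):
    """Return the most common class label among the trees' predictions.

    Ties are broken in favour of the label seen first.
    """
    return Counter(predictions).most_common(1)[0][0]

def average(predictions):
    """Return the mean of the trees' numeric predictions."""
    return mean(predictions)

# Hypothetical outputs of five trees for a single input.
class_votes = ["cat", "dog", "cat", "cat", "dog"]
regression_outputs = [2.0, 2.5, 3.0, 2.5, 2.0]

forest_class = majority_vote(class_votes)
forest_value = average(regression_outputs)
print(forest_class, forest_value)
```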
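Where does the randomness come from? There are two sources:
- Bagging: randomly sample training examples with replacement (draw one such sample, train a tree on it alone, and repeat the process to train n trees)
  - E.g. [1,2,3,4,5] $\rightarrow$ [1,2,2,3,4] (a sample may contain duplicates)
- Randomly select a subset of features (each tree sees a random subset rather than the full feature set)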
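Both sources of randomness can be sketched with the standard library alone. The helper names, the sizes, and the fixed seed are assumptions for illustration. The sketch draws one feature subset per tree; the notes do not say whether the subset is drawn per tree or per split, so treat that as a design choice.

```python
import random

def bootstrap_indices(n_examples, rng):
    """Draw n_examples row indices with replacement (duplicates are expected)."""
    return [rng.randrange(n_examples) for _ in range(n_examples)]

def feature_subset(n_features, k, rng):
    """Pick k distinct feature indices without replacement."""
    return sorted(rng.sample(range(n_features), k))

rng = random.Random(0)  # fixed seed only so that the sketch is reproducible
n_trees, n_examples, n_features, k = 3, 5, 4, 2

# One (rows, columns) recipe per tree: each tree would be trained only on
# the sampled rows and the selected feature columns.
recipes = [
    (bootstrap_indices(n_examples, rng), feature_subset(n_features, k, rng))
    for _ in range(n_trees)
]
for rows, cols in recipes:
    print("rows:", rows, "features:", cols)
```

Because each tree gets its own rows and columns, the trees make different errors, which is what makes voting or averaging over them worthwhile.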

Gradient Boosting Decision Trees

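Train multiple trees sequentially (as before, many trees are trained, but one after another rather than independently; together they form one larger model)
At step $t = 1, \ldots$, denote by $F_t(x)$ the sum of the trees trained so far (each tree is a function, and the model's output is the sum of the outputs of all those trees)
- Train a new tree $f_t$ on residuals: $\left\{\left(x_i, y_i-F_t\left(x_i\right)\right)\right\}_{i=1, \ldots}$ (the new tree at step $t$ is trained not on the original data but on the residuals, i.e. the differences between the true values and the current predictions, which is the part the model has not yet fit well). Each step therefore moves the prediction closer to the true value.
- $F_{t+1}(x)=F_t(x)+f_t(x)$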
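The loop above can be sketched in plain Python. This is a minimal illustration under stated assumptions: each tree is a depth-1 regression tree (a stump) on a single numerical feature, $F_1(x) = 0$, and the function names and toy data are made up for the example.

```python
def fit_stump(xs, targets):
    """Fit a depth-1 regression tree: one threshold, one mean prediction per side."""
    best = None
    distinct = sorted(set(xs))
    for lo, hi in zip(distinct, distinct[1:]):
        threshold = (lo + hi) / 2
        left = [t for x, t in zip(xs, targets) if x <= threshold]
        right = [t for x, t in zip(xs, targets) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = (sum((t - left_mean) ** 2 for t in left)
                 + sum((t - right_mean) ** 2 for t in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean

def boost(xs, ys, n_steps):
    """Return the trees f_1, ..., f_T; the model F is the sum of their outputs."""
    trees = []
    predictions = [0.0] * len(xs)  # F_1(x_i) = 0: no trees trained yet
    for _ in range(n_steps):
        residuals = [y - p for y, p in zip(ys, predictions)]  # y_i - F_t(x_i)
        tree = fit_stump(xs, residuals)                       # f_t
        trees.append(tree)
        # F_{t+1}(x_i) = F_t(x_i) + f_t(x_i)
        predictions = [p + tree(x) for p, x in zip(predictions, xs)]
    return trees

def predict(trees, x):
    """Evaluate the summed model at x."""
    return sum(tree(x) for tree in trees)

def mean_squared_error(trees, xs, ys):
    return sum((y - predict(trees, x)) ** 2 for x, y in zip(xs, ys)) / len(xs)

# Toy regression data, made up for illustration.
xs = [1, 2, 3, 4, 5, 6]
ys = [1.0, 1.2, 3.1, 3.0, 5.2, 5.1]

errors = []
for n_steps in (1, 2, 5, 10):
    trees = boost(xs, ys, n_steps)
    errors.append(mean_squared_error(trees, xs, ys))
print(errors)
```

The training error shrinks as trees are added, because each new stump is fit to exactly what the current sum still gets wrong.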
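The residual equals $-\partial L / \partial F$ (up to a constant factor) when $L$ is the mean squared error loss, which is why the method is called gradient boosting: each new tree is trained to fit the negative gradient (see the gradient descent section for its definition).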
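A short derivation makes this concrete. It assumes the per-example squared error is written with a factor of $\frac{1}{2}$, a common convention that the notes do not state:

$$L\left(y_i, F_t(x_i)\right)=\frac{1}{2}\left(y_i-F_t(x_i)\right)^2 \quad\Longrightarrow\quad -\frac{\partial L}{\partial F_t(x_i)}=y_i-F_t(x_i)$$

Without the $\frac{1}{2}$, or when averaging over $n$ examples, the negative gradient is the residual multiplied by a positive constant, so the direction being fit is the same.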

Summary

Decision tree: an explainable model for classification/regression
Ensemble trees to reduce bias and variance (decision trees are very sensitive to noise in the data; see the bias and variance section for the definitions)
- Random forest: trees trained in parallel with randomness
- Gradient boosting trees: trained sequentially on residuals (each new tree keeps fitting the part that the previous trees predicted poorly)
Trees are widely used in industry
- Simple, easy to tune, and often give satisfactory results (training is simple, there are few hyperparameters to tune, and good results come easily)