Decision Trees

Hyperparameters

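Maximum depth
The maximum depth of the decision tree.
Minimum samples per leaf
The minimum number of samples that each leaf must contain.
- Integer: the minimum number of samples on a single leaf.
- Float: the minimum fraction of samples on a single leaf, measured against the total number of training samples. A split is rejected if it would leave a leaf with a smaller fraction than this value.
Minimum samples per split
Like the minimum samples per leaf, except that it applies to a node at the time it is split.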
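The integer and float forms of the leaf-size limit can be compared with a minimal sketch. It assumes scikit-learn's documented behavior, where a float `min_samples_leaf` is converted to `ceil(fraction * n_samples)` with `n_samples` the size of the training set. The synthetic dataset and the helper `smallest_leaf` are illustrative assumptions, not part of the original notes.

```python
import math

from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic data (hypothetical, for illustration only).
X, y = make_classification(n_samples=200, n_features=5, random_state=0)


def smallest_leaf(clf):
    """Return the number of training samples in the smallest leaf."""
    tree = clf.tree_
    is_leaf = tree.children_left == -1  # leaves have no children
    return int(tree.n_node_samples[is_leaf].min())


# Integer: every leaf must hold at least 10 samples.
by_count = DecisionTreeClassifier(min_samples_leaf=10, random_state=0).fit(X, y)

# Float: every leaf must hold at least ceil(0.1 * n_samples) samples.
by_fraction = DecisionTreeClassifier(min_samples_leaf=0.1, random_state=0).fit(X, y)

print(smallest_leaf(by_count), smallest_leaf(by_fraction))
```

`min_samples_split` accepts the same two forms, but the check is applied to the node about to be split rather than to the resulting leaves.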
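Maximum features
Limits the number of features considered when searching for each split.
Summary
- A large depth tends to cause overfitting, because a tree that is too deep can memorize the data. A small depth makes the model too simple, which leads to underfitting.
- When the minimum number of samples per leaf is small, leaves may end up holding very few samples, so the model memorizes the data and overfits. When the minimum is large, the tree does not have enough flexibility to grow, which may lead to underfitting.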
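The effect of depth can be sketched on synthetic data. The dataset below is a hypothetical stand-in with 10% label noise, so a tree that reaches 100% training accuracy must be memorizing noise. The exact scores depend on the data and are not claims about any real dataset.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Noisy synthetic data (hypothetical): flip_y randomly flips 10% of the labels.
X, y = make_classification(
    n_samples=600, n_features=10, n_informative=4, flip_y=0.1, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

scores = {}
for depth in (1, 4, None):  # None lets the tree grow until every leaf is pure
    clf = DecisionTreeClassifier(max_depth=depth, random_state=0)
    clf.fit(X_train, y_train)
    scores[depth] = (clf.score(X_train, y_train), clf.score(X_test, y_test))
    print(depth, scores[depth])
```

A large gap between the training and test scores at `max_depth=None` is the usual symptom of overfitting, while low scores on both sets at `max_depth=1` suggest underfitting.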

Grid Search

DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None, max_features=None, max_leaf_nodes=None, min_impurity_decrease=0.0, min_impurity_split=None, min_samples_leaf=1, min_samples_split=2, min_weight_fraction_leaf=0.0, presort=False, random_state=None, splitter='best')

from sklearn.metrics import make_scorer, f1_score
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

clf = DecisionTreeClassifier(random_state=42)

# TODO: Create the parameters list you wish to tune.
parameters = {'max_depth':[2,4,6,8,10],'min_samples_leaf':[2,4,6,8,10], 'min_samples_split':[2,4,6,8,10]}

# TODO: Make a scoring object (here based on f1_score, i.e. F-beta with beta = 1).
scorer = make_scorer(f1_score)

# TODO: Perform grid search on the classifier using 'scorer' as the scoring method.
grid_obj = GridSearchCV(clf, parameters, scoring=scorer)

# TODO: Fit the grid search object to the training data and find the optimal parameters.
grid_fit = grid_obj.fit(X_train, y_train)

# Get the estimator.
best_clf = grid_fit.best_estimator_

# Fit the new model (GridSearchCV already refits the best estimator by default, so this is redundant but harmless).
best_clf.fit(X_train, y_train)

# Make predictions using the new model.
best_train_predictions = best_clf.predict(X_train)
best_test_predictions = best_clf.predict(X_test)

# Calculate the f1_score of the new model (arguments are y_true, then y_pred).
print('The training F1 Score is', f1_score(y_train, best_train_predictions))
print('The testing F1 Score is', f1_score(y_test, best_test_predictions))

# Plot the new model (plot_model, X and y are assumed to be defined earlier in the course notebook).
plot_model(X, y, best_clf)

# Let's also explore what parameters ended up being used in the new model.
best_clf

Naive Bayes

Bayes Theorem

$P(A | R)=\frac{P(A)P(R | A)}{P(A)P(R | A) + P(B)P(R | B)}$

$P(B | R)=\frac{P(B)P(R | B)}{P(A)P(R | A) + P(B)P(R | B)}$

$P(A)$ and $P(B)$ are the prior probabilities: the probabilities of events $A$ and $B$ computed without knowing whether $R$ occurred.

$P(A | R)$ and $P(B | R)$ are the posterior probabilities: the probabilities of events $A$ and $B$ inferred once $R$ is known to have occurred.

Watch out for false positives.
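A worked example may help show why false positives matter. The sketch below applies Bayes' theorem to made-up numbers: $A$ is "sick", $B$ is "healthy", and $R$ is "the test is positive". The prevalence and test rates are hypothetical, chosen only to make the arithmetic easy to follow.

```python
from fractions import Fraction


def posterior(prior_a, likelihood_a, prior_b, likelihood_b):
    """P(A | R) from the priors P(A), P(B) and likelihoods P(R | A), P(R | B)."""
    joint_a = prior_a * likelihood_a
    joint_b = prior_b * likelihood_b
    return joint_a / (joint_a + joint_b)


# Hypothetical numbers, not taken from any real test.
p_sick = Fraction(1, 10000)             # prior P(A)
p_healthy = 1 - p_sick                  # prior P(B)
p_pos_given_sick = Fraction(99, 100)    # P(R | A), true positive rate
p_pos_given_healthy = Fraction(1, 100)  # P(R | B), false positive rate

p_sick_given_pos = posterior(
    p_sick, p_pos_given_sick, p_healthy, p_pos_given_healthy
)
print(p_sick_given_pos, float(p_sick_given_pos))
```

With these numbers the posterior is 1/102, a little under 1%: because healthy people vastly outnumber sick ones, most positive results are false positives even though the test is right 99% of the time.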

SVM

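Error Function

$Error = \text{Classification Error} + \text{Margin Error}$

Classification Error
Instead of a single decision boundary, the model uses boundaries separated by a margin. The error is computed for each class separately, and the per-class errors are then added together.
Margin Error
Points that fall in the region between these boundaries count as errors. The margin error is the sum of the errors over all points in that region.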

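The C Parameter

$Error = C*\text{Classification Error} + \text{Margin Error}$
$C$ is the coefficient placed in front of the classification error.
With a large $C$, we care more about the classification error than about finding a good margin; with a small $C$, we care more about a good margin than about classifying every point correctly.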
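The trade-off controlled by $C$ can be observed with a minimal sketch. It assumes scikit-learn's `SVC` with a linear kernel and uses the standard margin width $2 / \lVert w \rVert$. The synthetic clusters and the helper `margin_width` are illustrative assumptions.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Two overlapping clusters (synthetic, for illustration only).
X, y = make_blobs(n_samples=200, centers=2, cluster_std=1.0, random_state=0)


def margin_width(C):
    """Fit a linear SVM and return the width of its margin, 2 / ||w||."""
    clf = SVC(kernel="linear", C=C).fit(X, y)
    return 2.0 / np.linalg.norm(clf.coef_)


wide = margin_width(0.01)    # small C: a wide margin is preferred
narrow = margin_width(100)   # large C: classification error dominates
print(wide, narrow)
```

The small-$C$ model accepts more misclassified or in-margin points in exchange for a wider margin, while the large-$C$ model narrows the margin to classify more training points correctly.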

Polynomial Kernel

Maps the features into a higher-dimensional space, where the data becomes separable.
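A one-dimensional toy case illustrates the idea. The points below are hypothetical. The explicit map $x \mapsto (x, x^2)$ stands in for the feature space that a degree-2 polynomial kernel works in implicitly, and `separable_by_threshold` is a helper written only for this sketch.

```python
# 1-D points: class 1 sits on both sides of class 0, so no single
# threshold on x separates the two classes.
xs = [-3.0, -2.0, -0.5, 0.0, 0.5, 2.0, 3.0]
labels = [1, 1, 0, 0, 0, 1, 1]


def separable_by_threshold(values, labels):
    """True if one threshold puts all of one class below it and the other above."""
    ordered = [label for _, label in sorted(zip(values, labels))]
    # Separable iff the sorted labels change value at most once.
    changes = sum(a != b for a, b in zip(ordered, ordered[1:]))
    return changes <= 1


# Lift each point to (x, x^2); the second coordinate alone now separates them.
lifted = [(x, x * x) for x in xs]
squares = [x2 for _, x2 in lifted]

print(separable_by_threshold(xs, labels))       # False
print(separable_by_threshold(squares, labels))  # True
```

In the lifted space a horizontal line such as $x^2 = 2$ separates the classes, which corresponds to the two cut points $x = \pm\sqrt{2}$ back on the original axis.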

RBF (Radial Basis Function) Kernel

A superposition of small "hills" (for example, Gaussians), one centered on each point.
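The "sum of hills" picture can be sketched directly. This is only an illustration of the intuition: it assumes every training point contributes a Gaussian with equal weight, whereas a trained SVM learns a separate weight for each support vector plus a bias term. The points and `gamma` value are hypothetical.

```python
import math


def rbf(x, center, gamma=1.0):
    """Gaussian 'hill' centered on one training point."""
    return math.exp(-gamma * (x - center) ** 2)


# Hypothetical 1-D training points as (position, label) with labels +1 / -1.
points = [(-3.0, -1), (-0.5, +1), (0.5, +1), (3.0, -1)]


def score(x, gamma=1.0):
    """Sum of hills (positive points) and valleys (negative points) at x."""
    return sum(label * rbf(x, center, gamma) for center, label in points)


def predict(x, gamma=1.0):
    """Classify x by the sign of the summed surface."""
    return 1 if score(x, gamma) > 0 else -1


print(predict(0.0), predict(3.0))  # 1 -1
```

A larger `gamma` makes each hill narrower, so the boundary hugs individual points more tightly; a smaller `gamma` gives broader hills and a smoother boundary.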

Project 2: Finding Donors for CharityML

import numpy as np
import pandas as pd
from IPython.display import display

# `data` is assumed to be the census DataFrame loaded earlier in the notebook.

# TODO: Number of respondents whose income is greater than $50,000
n_greater_50k = len(data[data['income']=='>50K'])

# Split the data into features and the corresponding labels
income_raw = data['income']
features_raw = data.drop('income', axis = 1)

# Apply a log transform to the skewed features
skewed = ['capital-gain', 'capital-loss']
features_raw[skewed] = data[skewed].apply(lambda x: np.log(x + 1))

from sklearn.preprocessing import MinMaxScaler

# Initialize a scaler and apply it to the numerical features.
# Scale features_raw (not data) so the log transform above is preserved.
scaler = MinMaxScaler()
numerical = ['age', 'education-num', 'capital-gain', 'capital-loss', 'hours-per-week']
features_raw[numerical] = scaler.fit_transform(features_raw[numerical])

# Show one example record after scaling
display(features_raw.head(n = 1))

# TODO: One-hot encode the 'features_raw' data using pandas.get_dummies()
features = pd.get_dummies(features_raw)

# TODO: Encode 'income_raw' as numerical values
income = income_raw.replace(['<=50K', '>50K'], [0, 1])

# Print the number of features after one-hot encoding
encoded = list(features.columns)
print("{} total features after one-hot encoding.".format(len(encoded)))

# Uncomment the following line to see the encoded feature names
#print(encoded)

# Import train_test_split
from sklearn.model_selection import train_test_split

# Split 'features' and 'income' into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(features, income, test_size = 0.2, random_state = 0,stratify = income)
# Further split 'X_train' and 'y_train' into training and validation sets
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.2, random_state=0,stratify = y_train)
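The preprocessing steps can be traced on a tiny table. The four-row DataFrame below is made up for illustration and only reuses two column names from the project. The sketch applies the log transform, then min-max scaling on the transformed values, then one-hot encoding.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Hypothetical toy records, not taken from the census data.
toy = pd.DataFrame({
    "capital-gain": [0.0, 0.0, 500.0, 99999.0],
    "workclass": ["Private", "State-gov", "Private", "Self-emp"],
})

# Log-transform first, then scale the transformed column to [0, 1].
toy["capital-gain"] = np.log1p(toy["capital-gain"])
toy[["capital-gain"]] = MinMaxScaler().fit_transform(toy[["capital-gain"]])

# One-hot encode the remaining categorical column.
encoded = pd.get_dummies(toy)
print(encoded)
```

After the log transform a gain of 500 scales to roughly 0.54. Scaling the raw values instead would map it to about 0.005, nearly indistinguishable from zero, which is why the order of the two steps matters.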

$\text{Accuracy}=\frac{\text{Correctly classified points}}{\text{All points}}$

$\text{Precision}=\frac{\text{Correctly predicted positive points}}{\text{All points predicted positive}}$

$\text{Recall}=\frac{\text{Correctly predicted positive points}}{\text{All actually positive points}}$
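The three metrics can be checked by hand on a small example. The sketch below uses the standard definitions, precision = TP / (TP + FP) and recall = TP / (TP + FN), on made-up labels and predictions.

```python
# Hypothetical ground truth and predictions (1 = positive, 0 = negative).
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]

pairs = list(zip(y_true, y_pred))
tp = sum(t == 1 and p == 1 for t, p in pairs)  # true positives
fp = sum(t == 0 and p == 1 for t, p in pairs)  # false positives
fn = sum(t == 1 and p == 0 for t, p in pairs)  # false negatives
tn = sum(t == 0 and p == 0 for t, p in pairs)  # true negatives

accuracy = (tp + tn) / len(pairs)
precision = tp / (tp + fp)  # of the points predicted positive, how many are right
recall = tp / (tp + fn)     # of the actually positive points, how many were found

print(accuracy, precision, recall)
```

Here 2 of the 3 positive predictions are correct (precision 2/3), but only 2 of the 4 actual positives are found (recall 1/2), even though accuracy is 0.7.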

Limits of the $F_\beta$ score

$F_\beta=(1+\beta^2)\frac{\text{Precision}*\text{Recall}}{\beta^2*\text{Precision}+\text{Recall}}$

$\lim_{\beta\to0}F_\beta=\text{Precision}$

$\lim_{\beta\to\infty}F_\beta=\text{Recall}$

The value of $\beta$ determines whether the score leans toward precision or toward recall. When $\beta = 1$, the score is the harmonic mean of the two.
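The limiting behavior is easy to confirm numerically. The sketch below implements the $F_\beta$ formula as written and evaluates it for a very small, a unit, and a very large $\beta$. The precision and recall values are arbitrary examples.

```python
def f_beta(precision, recall, beta):
    """F-beta score computed from precision and recall."""
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Arbitrary example values.
precision, recall = 0.8, 0.4

for beta in (0.001, 1, 1000):
    print(beta, f_beta(precision, recall, beta))

# For beta = 1 the score equals the harmonic mean of precision and recall.
harmonic_mean = 2 / (1 / precision + 1 / recall)
print(harmonic_mean)
```

As $\beta$ shrinks the score approaches the precision (0.8), and as it grows the score approaches the recall (0.4).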

References

Udacity course
