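Model Building: Ensemble Models

Build four models (random forest, GBDT, XGBoost, and LightGBM) and score each one. Any scoring method may be used, for example accuracy or the AUC value.

All code in this post assumes that the code from the previous post ([Data Analysis Practice] Task1.1 Model Building) has already been run, which is where X_train, X_test, y_train, and y_test come from.

Building the Ensemble Models

Model training: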

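from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier

# Fit each ensemble with default hyperparameters and a fixed seed for reproducibility
rf_clf = RandomForestClassifier(random_state=2018).fit(X_train, y_train)
gbdt = GradientBoostingClassifier(random_state=2018).fit(X_train, y_train)
xgb = XGBClassifier(random_state=2018).fit(X_train, y_train)
lgbm = LGBMClassifier(random_state=2018).fit(X_train, y_train)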

Model Evaluation

Accuracy

from sklearn.metrics import accuracy_score

models = {'Random Forest': rf_clf,
          'GBDT': gbdt,
          'XGBoost': xgb,
          'LightGBM': lgbm}

for name, model in models.items():
    # accuracy_score expects (y_true, y_pred)
    accuracy = accuracy_score(y_test, model.predict(X_test)) * 100
    print('%s accuracy: %.2f%%' % (name, accuracy))

Random Forest accuracy: 77.08%
GBDT accuracy: 78.07%
XGBoost accuracy: 78.56%
LightGBM accuracy: 77.01%

The ensemble models achieve higher accuracy than the base models used in the previous post (SVM, logistic regression, decision tree).

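Confusion Matrix

Each row of the confusion matrix represents an actual class and each column represents a predicted class. A perfect classifier produces only true negatives and true positives, so the larger the values on the main diagonal (top left to bottom right) and the smaller the off-diagonal values, the better. In a binary classification task with the negative class (0) listed first, which is the order scikit-learn's confusion_matrix(y_true, y_pred) uses, the four entries are:

True Negative (TN)	False Positive (FP)
False Negative (FN)	True Positive (TP)

True Positive (TP): the number of actual positives correctly predicted as positive
False Positive (FP): the number of actual negatives incorrectly predicted as positive
False Negative (FN): the number of actual positives incorrectly predicted as negative
True Negative (TN): the number of actual negatives correctly predicted as negative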
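
To make the four counts concrete, here is a minimal sketch in plain Python that tallies them from a list of true labels and a list of predicted labels. The function name confusion_counts and the toy labels are invented for illustration; 1 stands for the positive class (overdue) and 0 for the negative class, and the result is laid out with one row per actual class.

```python
def confusion_counts(y_true, y_pred):
    """Return (tn, fp, fn, tp) for binary labels where 1 is the positive class."""
    tn = fp = fn = tp = 0
    for actual, predicted in zip(y_true, y_pred):
        if actual == 1 and predicted == 1:
            tp += 1      # positive, predicted positive
        elif actual == 0 and predicted == 1:
            fp += 1      # negative, wrongly predicted positive
        elif actual == 1 and predicted == 0:
            fn += 1      # positive, wrongly predicted negative
        else:
            tn += 1      # negative, predicted negative
    return tn, fp, fn, tp


# Toy labels, invented for this example only.
y_true = [0, 0, 0, 0, 1, 1, 1, 0]
y_pred = [0, 1, 0, 0, 1, 0, 1, 0]

tn, fp, fn, tp = confusion_counts(y_true, y_pred)
# Rows are actual classes, columns are predicted classes, negative class first.
matrix = [[tn, fp],
          [fn, tp]]
print(matrix)
```

Passing the arguments in the other order swaps FP and FN, that is, it transposes the matrix, so it is worth checking the argument order whenever a confusion matrix looks surprising.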

import numpy as np
import matplotlib.pyplot as plt

def plot_confusion_matrix(conf, classes, title, cmap=plt.cm.gray):
    # Draw the matrix as a grayscale image with one tick per class
    plt.imshow(conf, interpolation='nearest', cmap=cmap)
    plt.title(title)
    tick_marks = np.arange(len(classes))
    plt.xticks(tick_marks, classes)
    plt.yticks(tick_marks, classes)
    plt.ylabel('True label')
    plt.xlabel('Predicted label')
    plt.tight_layout()

from sklearn.metrics import confusion_matrix

fig = plt.figure(figsize=(8, 6))
classes = ['Not overdue', 'Overdue']
for i, (name, model) in enumerate(models.items(), start=1):
    # confusion_matrix expects (y_true, y_pred); entries are raw sample counts
    conf = confusion_matrix(y_test, model.predict(X_test))
    print('%s confusion matrix:\n%s' % (name, conf))
    plt.subplot(2, 2, i)
    plot_confusion_matrix(conf, classes, name)
plt.subplots_adjust(wspace=0, hspace=0.4)
plt.show()

Random Forest confusion matrix:
[[1005   63]
 [ 264   95]]
GBDT confusion matrix:
[[988  80]
 [233 126]]
XGBoost confusion matrix:
[[993  75]
 [231 128]]
LightGBM confusion matrix:
[[973  95]
 [233 126]]

[Figure: confusion matrix plots for the four models]

Precision

Precision is defined as:
$Precision=\frac{TP}{TP+FP}$
That is, among all test samples predicted as positive, the proportion that are actually positive.

from sklearn.metrics import precision_score

for name, model in models.items():
    precision = precision_score(y_test, model.predict(X_test)) * 100
    print('%s precision: %.2f%%' % (name, precision))

Random Forest precision: 60.13%
GBDT precision: 61.17%
XGBoost precision: 63.05%
LightGBM precision: 57.01%

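Recall

Recall is defined as:
$Recall=\frac{TP}{TP+FN}$
That is, among all test samples that are actually positive, the proportion that are correctly predicted as positive.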

from sklearn.metrics import recall_score

for name, model in models.items():
    recall = recall_score(y_test, model.predict(X_test)) * 100
    print('%s recall: %.2f%%' % (name, recall))

Random Forest recall: 26.46%
GBDT recall: 35.10%
XGBoost recall: 35.65%
LightGBM recall: 35.10%

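F1 Score

The F1 score is defined as:
$F1\space Score=\frac{2\times precision\times recall}{precision + recall}$
The F1 score is the harmonic mean of precision and recall; it combines the two under the assumption that they carry equal weight (importance).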
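
As a quick sanity check on the three formulas, the sketch below computes precision, recall, and F1 directly from confusion-matrix counts. The helper name precision_recall_f1 is made up for this example, and the counts (TP=95, FP=63, FN=264) are illustrative values chosen to be consistent with the Random Forest percentages reported in this post.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall and F1 from confusion-matrix counts."""
    # Guard against empty denominators (no predicted or no actual positives).
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Illustrative counts (assumption, see the note above).
precision, recall, f1 = precision_recall_f1(tp=95, fp=63, fn=264)
print('precision=%.2f%% recall=%.2f%% F1=%.2f%%'
      % (precision * 100, recall * 100, f1 * 100))
# prints: precision=60.13% recall=26.46% F1=36.75%
```

Because the harmonic mean is pulled toward the smaller of its two inputs, a model with decent precision but low recall still ends up with a low F1, which is the pattern all four models show here.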

from sklearn.metrics import f1_score

for name, model in models.items():
    f1 = f1_score(y_test, model.predict(X_test)) * 100
    print('%s F1 score: %.2f%%' % (name, f1))

Random Forest F1 score: 36.75%
GBDT F1 score: 44.60%
XGBoost F1 score: 45.55%
LightGBM F1 score: 43.45%

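ROC Curve

First consider two quantities: the true positive rate (TPR) and the false positive rate (FPR). TPR has the same definition as recall, while FPR is defined as:
$FPR = \frac{FP}{FP+TN}=1-TNR$
where $TNR=\frac{TN}{TN+FP}$ is the true negative rate (specificity). Sweeping the decision threshold over the predicted probabilities and plotting FPR on the x-axis against TPR on the y-axis gives the ROC curve.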
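
The following sketch shows, under simplifying assumptions, where the points of a ROC curve come from: each distinct score is used in turn as the decision threshold, and the resulting (FPR, TPR) pair becomes one point. The function name roc_points and the four toy samples are invented for illustration, and the code assumes both classes are present in y_true. In practice scikit-learn's roc_curve does this work.

```python
def roc_points(y_true, scores):
    """Return (fpr, tpr) pairs obtained by lowering the threshold score by score."""
    positives = sum(1 for y in y_true if y == 1)
    negatives = len(y_true) - positives
    # Threshold above every score: nothing is predicted positive.
    points = [(0.0, 0.0)]
    for threshold in sorted(set(scores), reverse=True):
        # A sample is predicted positive when its score reaches the threshold.
        tp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 1)
        fp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 0)
        points.append((fp / negatives, tp / positives))
    return points


# Toy data, invented for this example only.
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]

points = roc_points(y_true, scores)
for fpr, tpr in points:
    print('FPR=%.2f TPR=%.2f' % (fpr, tpr))
```

A curve that rises quickly toward the top-left corner means the model reaches a high TPR while keeping FPR low; the diagonal from (0, 0) to (1, 1) corresponds to random guessing.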

from sklearn.metrics import roc_curve

def plot_roc_curve(fpr, tpr, label=None):
    plt.plot(fpr, tpr, label=label)
    plt.axis([0, 1, 0, 1])
    plt.xlabel('False Positive Rate')
    plt.ylabel('True Positive Rate')

fig = plt.figure(figsize=(8, 6))
for name, model in models.items():
    # Probability of the positive class (column 1 of predict_proba)
    proba = model.predict_proba(X_test)[:, 1]
    fpr, tpr, thresholds = roc_curve(y_test, proba)
    plot_roc_curve(fpr, tpr, label=name)
plt.plot([0, 1], [0, 1], 'k--')  # chance-level diagonal, drawn once
plt.legend()
plt.show()
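
The task statement mentions the AUC value as a possible score, and the ROC plot alone does not reduce each model to a single number. One way to read AUC is as the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one, with ties counting as one half. The sketch below computes it that way on toy data; the function name auc_by_pairs and the data are invented for illustration, and the quadratic loop is only meant for small examples. With the fitted models, scikit-learn's roc_auc_score(y_test, proba) from sklearn.metrics would give the corresponding value.

```python
def auc_by_pairs(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # a tie counts as half
    return wins / (len(pos) * len(neg))


# Toy data, invented for this example only.
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]

auc = auc_by_pairs(y_true, scores)
print('AUC = %.2f' % auc)
# prints: AUC = 0.75
```

An AUC of 0.5 corresponds to the dashed diagonal in the plot, and 1.0 to a classifier that ranks every positive above every negative.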