XGBoost is the heavy-duty library built by Tianqi Chen. I struggled for ages trying to install it on my Mac and went through all sorts of installation tutorials, until I found a one-line install via another heavy hitter, Anaconda. Lifesaver.
Once installed, it works out of the box. XGBoost is an upgraded version of GBDT: it performs better and can train in parallel. For a couple of years it basically dominated Kaggle, crushing the other algorithms.
import numpy as np
import pandas as pd
from xgboost import XGBClassifier
from xgboost import plot_importance
from sklearn import metrics
from sklearn.model_selection import train_test_split
import os
os.environ["KMP_DUPLICATE_LIB_OK"] = "TRUE"  # work around the duplicate-OpenMP-runtime crash on macOS
column_names = ['uin', 'gender', 'age', 'play_cnt', 'share_cnt', 'influence_pv', 'ds1', 'ds2', 'ds3', 'label']
data = pd.read_csv('lr_feature.csv', usecols=[0, 1, 2, 3, 4, 5, 6, 7, 8, 9], names=column_names)
print(data.head(10))
# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(data[column_names[1:6]], data[column_names[9]],
test_size=0.25, random_state=3)
model = XGBClassifier(learning_rate=0.01,
                      n_estimators=10,       # number of trees: build the ensemble with 10 trees
                      max_depth=3,           # depth of each tree
                      min_child_weight=1,    # minimum leaf-node weight
                      gamma=0.,              # coefficient on the leaf-count term of the penalty
                      subsample=1,           # use all samples to build each tree
                      colsample_bytree=1,    # use all features to build each tree
                      scale_pos_weight=1,    # handle class imbalance
                      random_state=27,       # random seed
                      silent=0               # logging (renamed to verbosity in newer xgboost)
                      )
model.fit(X_train, y_train)
# Predict
y_pred = model.predict(X_test)
print("Accuracy: %.4g" % metrics.accuracy_score(y_test, y_pred))
print("F1_score: %.4g" % metrics.f1_score(y_test, y_pred))
print("Recall: %.4g" % metrics.recall_score(y_test, y_pred))
y_train_proba = model.predict_proba(X_train)[:, 1]
print("AUC Score (Train): %f" % metrics.roc_auc_score(y_train, y_train_proba))
y_proba = model.predict_proba(X_test)[:, 1]
print("AUC Score (Test): %f" % metrics.roc_auc_score(y_test, y_proba))
Output:
uin gender age play_cnt share_cnt influence_pv ds1 ds2 ds3 label
0 1889812 2 67 2 1 0 0 2 2 0.0
1 1966339 2 69 747 92 194 15 15 30 1.0
2 1982539 2 66 1165 104 40 12 12 24 1.0
3 2131170 3 78 53 146 117 9 3 12 1.0
4 4471700 3 81 2 0 0 1 3 4 0.0
5 4921331 3 79 1634 176 178 15 15 30 1.0
6 5441180 3 68 0 4 0 0 4 4 0.0
7 6144422 2 79 109 23 25 10 14 24 1.0
8 6807020 3 72 418 54 90 11 11 22 1.0
9 7015648 3 76 144 9 15 11 7 18 1.0
Accuracy: 0.9668
F1_score: 0.97
Recall: 0.9693
AUC Score (Train): 0.989206
AUC Score (Test): 0.988982
Since we only use a handful of features, the tree depth is set to just 3 and the number of trees to 10; the other parameters are mostly left at their defaults. When the feature count is large, tuning matters much more: a well-chosen set of parameters can easily beat the same time and effort spent on feature engineering. See references 1 and 2 for tuning details; plenty of blog posts online cover it as well.
References:
- https://zhuanlan.zhihu.com/p/52501965
- https://zhuanlan.zhihu.com/p/68864414
- https://blog.csdn.net/sinat_20177327/article/details/81090324
- https://blog.csdn.net/han_xiaoyang/article/details/52665396