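Task9 - Use one shared dataset with a 70/30 train/test split and random seed 2018, take AUC as the evaluation metric, and compare the scores of the single models against the fused (stacked) model.
The full code is on GitHub.
Approach
Load the raw data, standardize the features, tune each model's hyperparameters, then fuse the models.
1. Loading the data
Load the data and standardize the features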
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
import warnings
warnings.filterwarnings("ignore")
# Load the data
data = pd.read_csv('data_all.csv')
y = data['status']
X = data.drop('status', axis=1)
# Split into training and test sets (70/30)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=2018)
# Standardize the features; the scaler is fitted on the training set only
std = StandardScaler()
X_train = std.fit_transform(X_train)
X_test = std.transform(X_test)
2. Performance evaluation function
from sklearn.metrics import accuracy_score, roc_auc_score
def model_metrics(clf, X_train, X_test, y_train, y_test):
    # Predicted labels and positive-class probabilities
    y_train_pred = clf.predict(X_train)
    y_test_pred = clf.predict(X_test)
    y_train_proba = clf.predict_proba(X_train)[:, 1]
    y_test_proba = clf.predict_proba(X_test)[:, 1]
    # Accuracy
    print('[Accuracy]', end=' ')
    print('train:', '%.4f' % accuracy_score(y_train, y_train_pred), end=' ')
    print('test:', '%.4f' % accuracy_score(y_test, y_test_pred))
    # AUC: use roc_auc_score directly, or auc on the ROC curve
    print('[AUC]', end=' ')
    print('train:', '%.4f' % roc_auc_score(y_train, y_train_proba), end=' ')
    print('test:', '%.4f' % roc_auc_score(y_test, y_test_proba))
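The metrics function notes that AUC can be obtained either with roc_auc_score or with auc. A minimal sketch of the two routes on made-up labels and scores (the arrays below are illustrative, not taken from the dataset) shows that they agree:

```python
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve, auc

# Toy labels and predicted positive-class probabilities (illustrative only).
y_true = np.array([0, 0, 1, 1, 0, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.7])

# Route 1: compute AUC directly from labels and scores.
auc_direct = roc_auc_score(y_true, y_score)

# Route 2: build the ROC curve first, then integrate it with the trapezoidal rule.
fpr, tpr, _ = roc_curve(y_true, y_score)
auc_from_curve = auc(fpr, tpr)

print(auc_direct, auc_from_curve)
```

roc_auc_score is the shorter call when only the number is needed; the roc_curve plus auc route is useful when the curve itself is also going to be plotted.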
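3. Model optimization
Tuning: first scan a wide range coarsely, then refine within a narrower interval.
For models with many parameters, such as xgb and lgb, first fix the other parameters at commonly used values, then scan a few parameters at a time, and repeat the cycle until the score stops improving.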
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn import svm
from sklearn.tree import DecisionTreeClassifier
from xgboost.sklearn import XGBClassifier
from lightgbm.sklearn import LGBMClassifier
from mlxtend.classifier import StackingClassifier
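The coarse-then-fine procedure described above can be sketched as follows. This is only an illustration under stated assumptions: the data is synthetic (make_classification), the model is a plain L2 logistic regression, and the helper search_C and the grid values are hypothetical rather than taken from the original notebook.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the real data (assumption: binary labels, numeric features).
X_demo, y_demo = make_classification(n_samples=400, n_features=20, random_state=2018)

def search_C(values):
    """Grid-search C for an L2 logistic regression and return the fitted search."""
    gs = GridSearchCV(LogisticRegression(max_iter=1000),
                      param_grid={'C': [float(v) for v in values]},
                      scoring='roc_auc', cv=5)
    return gs.fit(X_demo, y_demo)

# Stage 1: coarse scan on a logarithmic grid.
coarse = search_C([1e-3, 1e-2, 1e-1, 1, 10, 100])
c0 = coarse.best_params_['C']

# Stage 2: fine scan within one decade on either side of the coarse optimum.
# The coarse winner is appended so the fine stage can never do worse.
fine_grid = np.append(np.geomspace(c0 / 10, c0 * 10, num=9), c0)
fine = search_C(fine_grid)

print('coarse:', c0, coarse.best_score_)
print('fine  :', fine.best_params_['C'], fine.best_score_)
```

Keeping the coarse optimum inside the fine grid is a small design choice that makes the two stages directly comparable.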
3.1 LR model
Tuned parameters: C (the inverse of the regularization strength) and the regularization type penalty.
lr = LogisticRegression(solver='liblinear', random_state=2018)  # liblinear supports both l1 and l2
# param = {'C': [1e-3,0.01,0.1,1,10,100,1e3], 'penalty':['l1', 'l2']}
param = {'C': [i/100 for i in range(1,21)], 'penalty':['l1', 'l2']}
gsearch = GridSearchCV(lr, param_grid=param, scoring='roc_auc', cv=5)
gsearch.fit(X_train, y_train)
print('Best parameters:', gsearch.best_params_)
print('Best cross-validation score on the training set:', gsearch.best_score_)
print('Score on the test set:', gsearch.score(X_test, y_test))
lr = LogisticRegression(C=0.04, penalty='l1', solver='liblinear', random_state=2018)
lr.fit(X_train, y_train)
model_metrics(lr, X_train, X_test, y_train, y_test)
Output
[Accuracy] train: 0.8016 test: 0.7884
[AUC] train: 0.8080 test: 0.7831
3.2 SVM model
# Linear SVM
svm_linear = svm.SVC(kernel='linear', probability=True, random_state=2018)
param = {'C': [0.01, 0.05, 0.1, 0.5, 1]}
gsearch = GridSearchCV(svm_linear, param_grid=param, scoring='roc_auc', cv=5)
gsearch.fit(X_train, y_train)
print('Best parameters:', gsearch.best_params_)
print('Best cross-validation score on the training set:', gsearch.best_score_)
print('Score on the test set:', gsearch.score(X_test, y_test))
svm_linear = svm.SVC(C=0.01, kernel='linear', probability=True, random_state=2018)
svm_linear.fit(X_train, y_train)
model_metrics(svm_linear, X_train, X_test, y_train, y_test)
Output
[Accuracy] train: 0.7992 test: 0.7765
[AUC] train: 0.8152 test: 0.7790
The other three (svm_poly, svm_rbf and svm_sigmoid) are not shown here; see GitHub for details.
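3.3 Decision tree model
1) First look at the result with default parameters. (The tuned model should end up better than this baseline.)
dt = DecisionTreeClassifier(random_state=2018)
dt.fit(X_train, y_train)
model_metrics(dt, X_train, X_test, y_train, y_test)
Output
[Accuracy] train: 1.0000 test: 0.6854
[AUC] train: 1.0000 test: 0.5956
2) The tuning procedure: scan a wide range first, then a narrow one. After the last parameter, go back to the first and cycle again until no parameter needs to change.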
param = {'max_depth':range(3,14,2), 'min_samples_split':range(100,801,200)}
#param = {'min_samples_split':range(50,1000,100), 'min_samples_leaf':range(60,101,10)}
#param = {'min_samples_split':range(100,401,10), 'min_samples_leaf':range(40,101,10)}
#param = {'max_features':range(7,20,2)}
#param = {'max_features':[18,19,20]}
gsearch = GridSearchCV(DecisionTreeClassifier(max_depth=9, min_samples_split=100, min_samples_leaf=90,
                                              max_features=9, random_state=2018),
                       param_grid=param, scoring='roc_auc', cv=5)
gsearch.fit(X_train, y_train)
# gsearch.cv_results_ holds the per-candidate scores (grid_scores_ in older scikit-learn)
print(gsearch.best_params_, gsearch.best_score_)
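The manual cycle of uncommenting one grid at a time can also be written as a loop. The sketch below is an assumption-laden illustration, not the procedure actually used: it runs on synthetic data, and the grids and starting values are hypothetical. Each pass scans one parameter while the others stay fixed, and the loop stops once a full pass changes nothing.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the real data.
X_demo, y_demo = make_classification(n_samples=600, n_features=20, random_state=2018)

# Hypothetical grids, scanned one at a time while the other parameters stay fixed.
grids = [
    {'max_depth': [3, 5, 7, 9]},
    {'min_samples_split': [20, 50, 100]},
    {'min_samples_leaf': [5, 10, 20, 40]},
]
current = {'max_depth': 5, 'min_samples_split': 50, 'min_samples_leaf': 10}

for sweep in range(5):                      # cap the number of full passes
    previous = dict(current)
    for grid in grids:
        base = DecisionTreeClassifier(random_state=2018, **current)
        gs = GridSearchCV(base, param_grid=grid, scoring='roc_auc', cv=5)
        gs.fit(X_demo, y_demo)
        current.update(gs.best_params_)     # keep the winner, move to the next grid
    if current == previous:                 # a full pass changed nothing: stop
        break

print(current)
```

Scanning one parameter at a time is much cheaper than a joint grid over all three, at the cost of possibly missing interactions between parameters.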
3) Final tuned result
dt = DecisionTreeClassifier(max_depth=9, min_samples_split=100, min_samples_leaf=90, max_features=9, random_state=2018)
dt.fit(X_train, y_train)
model_metrics(dt, X_train, X_test, y_train, y_test)
Output
[Accuracy] train: 0.7812 test: 0.7561
[AUC] train: 0.7721 test: 0.6946
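3.4 XGBoost model
1) First look at the result with default parameters.
import warnings
warnings.filterwarnings("ignore")
xgb0 = XGBClassifier(random_state=2018)
xgb0.fit(X_train, y_train)
model_metrics(xgb0, X_train, X_test, y_train, y_test)
2) The tuning procedure: cycle through the parameters below. Finally, lower the learning rate to 0.01 and retune n_estimators to see whether performance improves.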
param_test = {'n_estimators':range(20,200,20)}
# param_test = {'n_estimators':range(40,81,10)}
# param_test = {'max_depth':range(3,10,2), 'min_child_weight':range(1,12,2)}
# param_test = {'max_depth':[2,3,4], 'min_child_weight':[10,11,12]}
# param_test = {'gamma':[i/10 for i in range(6)]}
# param_test = {'subsample':[i/10 for i in range(5,10)], 'colsample_bytree':[i/10 for i in range(5,10)]}
# param_test = { 'subsample':[i/100 for i in range(60,81,5)], 'colsample_bytree':[i/100 for i in range(70,91,5)]}
#param_test = {'reg_alpha':[1e-5, 1e-2, 0.1, 0, 1, 100]}
# param_test = {'reg_lambda': [0.2, 0.4, 0.6, 0.8, 1]}
# The iid argument was removed from GridSearchCV in newer scikit-learn releases, so it is omitted here
gsearch = GridSearchCV(estimator=XGBClassifier(learning_rate=0.1, n_estimators=60, max_depth=3,
                                               min_child_weight=11, gamma=0, subsample=0.7,
                                               colsample_bytree=0.8, objective='binary:logistic',
                                               nthread=4, scale_pos_weight=1, random_state=2018),
                       param_grid=param_test, scoring='roc_auc', n_jobs=4, cv=5)
gsearch.fit(X_train, y_train)
# gsearch.cv_results_ holds the per-candidate scores (grid_scores_ in older scikit-learn)
print(gsearch.best_params_, gsearch.best_score_)
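The last step of the procedure, lowering the learning rate and retuning n_estimators, might look roughly like the sketch below. To keep it self-contained it swaps in scikit-learn's GradientBoostingClassifier for XGBoost and uses synthetic data; the helper best_n_estimators and both grids are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the real data.
X_demo, y_demo = make_classification(n_samples=300, n_features=20, random_state=2018)

def best_n_estimators(learning_rate, candidates):
    """Return (best n_estimators, best CV AUC) for a fixed learning rate."""
    model = GradientBoostingClassifier(learning_rate=learning_rate, max_depth=3,
                                       random_state=2018)
    gs = GridSearchCV(model, {'n_estimators': list(candidates)},
                      scoring='roc_auc', cv=3)
    gs.fit(X_demo, y_demo)
    return gs.best_params_['n_estimators'], gs.best_score_

# A smaller learning rate usually needs more trees, so the second grid is shifted upward.
n_fast, auc_fast = best_n_estimators(0.1, range(20, 101, 20))
n_slow, auc_slow = best_n_estimators(0.01, range(100, 501, 100))

print('lr=0.1 :', n_fast, round(auc_fast, 4))
print('lr=0.01:', n_slow, round(auc_slow, 4))
```

Comparing the two cross-validated scores indicates whether the slower learning rate is worth the extra trees on a given dataset.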
3) Final tuned result
xgb = XGBClassifier(learning_rate=0.1, n_estimators=60, max_depth=3, min_child_weight=11,
                    gamma=0, subsample=0.7, colsample_bytree=0.8, objective='binary:logistic',
                    nthread=4, scale_pos_weight=1, random_state=2018)
xgb.fit(X_train, y_train)
model_metrics(xgb, X_train, X_test, y_train, y_test)
Output
[Accuracy] train: 0.8302 test: 0.7891
[AUC] train: 0.8710 test: 0.7780
3.5 LightGBM model
The procedure is similar to XGBoost.
lgb0 = LGBMClassifier(random_state =2018)
lgb0.fit(X_train, y_train)
model_metrics(lgb0, X_train, X_test, y_train, y_test)
param_test = {'n_estimators':range(20,200,20)}
# param_test = {'n_estimators':range(30,51,10)}
# param_test = {'max_depth':range(3,10,2), 'min_child_weight':range(1,12,2)}
# param_test = {'max_depth':[2,3,4], 'min_child_weight':[6,7,8]}
# param_test = {'gamma':[i/10 for i in range(6)]}
# param_test = {'subsample':[i/10 for i in range(5,10)], 'colsample_bytree':[i/10 for i in range(5,10)]}
# param_test = { 'subsample':[i/100 for i in range(60,81,5)], 'colsample_bytree':[i/100 for i in range(70,91,5)]}
#param_test = {'reg_alpha':[1e-5, 1e-2, 0.1, 0, 1, 100]}
# param_test = {'reg_lambda': [0.2, 0.4, 0.6, 0.8, 1]}
# Cycle through the grids above, then lower the learning rate
# Note: gamma is an XGBoost parameter name; LightGBM's counterpart appears to be min_split_gain,
# so gamma=0 below most likely has no effect
# The iid argument was removed from GridSearchCV in newer scikit-learn releases, so it is omitted here
gsearch = GridSearchCV(estimator=LGBMClassifier(learning_rate=0.1, n_estimators=50, max_depth=3,
                                                min_child_weight=7, gamma=0, subsample=0.5,
                                                colsample_bytree=0.8, reg_alpha=1e-5,
                                                nthread=4, scale_pos_weight=1, random_state=2018),
                       param_grid=param_test, scoring='roc_auc', n_jobs=4, cv=5)
gsearch.fit(X_train, y_train)
# gsearch.cv_results_ holds the per-candidate scores (grid_scores_ in older scikit-learn)
print(gsearch.best_params_, gsearch.best_score_)
lgb = LGBMClassifier(learning_rate=0.1, n_estimators=50, max_depth=3, min_child_weight=7,
                     gamma=0, subsample=0.5, colsample_bytree=0.8, reg_alpha=1e-5,
                     nthread=4, scale_pos_weight=1, random_state=2018)
lgb.fit(X_train, y_train)
model_metrics(lgb, X_train, X_test, y_train, y_test)
Output
[Accuracy] train: 0.8269 test: 0.7877
[AUC] train: 0.8741 test: 0.7746
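3.6 Model fusion
1. Models need to be screened before they are fused.
2. Parameter settings of StackingClassifier
With average_probas=True, the classifiers' probabilities are averaged: p=[0.25,0.45,0.35]
With average_probas=False, all classifiers' probabilities are kept as new features: p=[0.2,0.5,0.3,0.3,0.4,0.4]
Setting average_probas=True gave better results here. In addition, the decision tree and svm_poly perform poorly as single models, so I tried stacking without them.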
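The two settings can be reproduced by hand with NumPy. This is only a sketch of the arithmetic for one sample and two hypothetical first-level classifiers, using the same numbers as in the text above; it does not call mlxtend itself.

```python
import numpy as np

# Class probabilities from two hypothetical first-level classifiers
# for one sample with three classes.
p1 = np.array([0.2, 0.5, 0.3])
p2 = np.array([0.3, 0.4, 0.4])

# average_probas=True: element-wise mean, so the meta-features keep 3 columns.
averaged = np.mean([p1, p2], axis=0)

# average_probas=False: concatenation, so the meta-features grow to 2 * 3 columns.
stacked = np.concatenate([p1, p2])

print(averaged)
print(stacked)
```

Averaging keeps the meta-classifier's input small, whereas concatenation lets it weight each base classifier separately.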
lr = LogisticRegression(C=0.04, penalty='l1', solver='liblinear', random_state=2018)
svm_linear = svm.SVC(C=0.01, kernel='linear', probability=True, random_state=2018)
svm_poly = svm.SVC(C=0.01, kernel='poly', probability=True, random_state=2018)
svm_rbf = svm.SVC(gamma=0.01, C=0.01, probability=True, random_state=2018)
svm_sigmoid = svm.SVC(C=0.01, kernel='sigmoid', probability=True, random_state=2018)
dt = DecisionTreeClassifier(max_depth=9, min_samples_split=100, min_samples_leaf=90, max_features=9,
                            random_state=2018)
xgb = XGBClassifier(learning_rate=0.1, n_estimators=60, max_depth=3, min_child_weight=11,
                    gamma=0, subsample=0.7, colsample_bytree=0.8, objective='binary:logistic',
                    nthread=4, scale_pos_weight=1, random_state=2018)
lgb = LGBMClassifier(learning_rate=0.1, n_estimators=50, max_depth=3, min_child_weight=7,
                     gamma=0, subsample=0.5, colsample_bytree=0.8, reg_alpha=1e-5,
                     nthread=4, scale_pos_weight=1, random_state=2018)
Use the class probabilities output by the first-level classifiers as new features
sclf = StackingClassifier(classifiers=[lr, svm_linear, svm_rbf, xgb, lgb],
                          meta_classifier=lr, use_probas=True, average_probas=True)
sclf.fit(X_train, y_train.values)
model_metrics(sclf, X_train, X_test, y_train, y_test)
Output
[Accuracy] train: 0.8161 test: 0.7821
[AUC] train: 0.8556 test: 0.7861
4. Comparison and analysis of results
Model | Parameters | Train AUC | Test AUC |
---|---|---|---|
LR | C = 0.04, penalty = 'l1' | 0.8080 | 0.7831 |
svm_linear | C = 0.01 | 0.8152 | 0.7790 |
svm_poly | C = 0.01 | 0.8626 | 0.7347 |
svm_rbf | gamma = 0.01, C = 0.01 | 0.8522 | 0.7708 |
svm_sigmoid | C = 0.01 | 0.7660 | 0.7590 |
Decision tree | max_depth=9, min_samples_split=100, min_samples_leaf=90, max_features=9 | 0.7721 | 0.6946 |
XGBoost | learning_rate=0.1, n_estimators=60, max_depth=3, min_child_weight=11, gamma=0, subsample=0.7, colsample_bytree=0.8 | 0.8710 | 0.7780 |
LightGBM | learning_rate=0.1, n_estimators=50, max_depth=3, min_child_weight=7, gamma=0, subsample=0.5, colsample_bytree=0.8 | 0.8741 | 0.7746 |
Stacking | - | 0.8750 | 0.7861 |
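Analysis
Among the single models, the best test-set AUC is the LR model at 0.7831; the stacked model is slightly higher at 0.7861.
LR also reaches its best result with L1 regularization, which suggests that further feature selection may be needed.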
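If feature selection is pursued, one possible starting point is to reuse the L1-penalized LR itself, since L1 drives some coefficients to zero. The sketch below is an assumption, not part of the original experiment: it runs on synthetic data and only borrows C = 0.04 from the tuned model.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in: 30 features, only 5 of them informative.
X_demo, y_demo = make_classification(n_samples=500, n_features=30, n_informative=5,
                                     random_state=2018)

# L1-penalized LR; liblinear is one of the solvers that supports the L1 penalty.
l1_lr = LogisticRegression(C=0.04, penalty='l1', solver='liblinear', random_state=2018)
l1_lr.fit(X_demo, y_demo)

# Keep only the features whose coefficient magnitude is at least the threshold.
selector = SelectFromModel(l1_lr, prefit=True, threshold=1e-5)
kept = selector.get_support(indices=True)
X_selected = selector.transform(X_demo)

print('zeroed coefficients:', int(np.sum(l1_lr.coef_[0] == 0)))
print('kept feature indices:', kept)
print('reduced shape:', X_selected.shape)
```

The reduced matrix could then be fed back into the same tuning steps to check whether the test-set AUC actually improves.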
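5. Problems encountered
After tuning, best_score_ differs from the AUC computed on the test set.
Is this because best_score_ comes from cross-validation on the training set, so its data split differs from the final test split? For a few models the gap is about 0.2, which seems rather large.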
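A small experiment on synthetic data (an illustration, not the original dataset) makes the distinction visible: best_score_ is the mean score of the best candidate over the validation folds taken from the training set, while gsearch.score evaluates the refitted model on the separate test set, so the two numbers need not match.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for the real data, split 70/30 as in the post.
X_demo, y_demo = make_classification(n_samples=600, n_features=20, random_state=2018)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3,
                                          random_state=2018)

gs = GridSearchCV(LogisticRegression(max_iter=1000), {'C': [0.01, 0.1, 1]},
                  scoring='roc_auc', cv=5)
gs.fit(X_tr, y_tr)

# best_score_ is the mean AUC over the 5 validation folds carved out of X_tr.
fold_scores = [gs.cv_results_[f'split{i}_test_score'][gs.best_index_] for i in range(5)]
print('per-fold AUC :', np.round(fold_scores, 4))
print('best_score_  :', round(gs.best_score_, 4))

# gs.score uses the model refit on all of X_tr and evaluates it on the held-out test set.
print('test-set AUC :', round(gs.score(X_te, y_te), 4))
```

The spread of the per-fold scores gives a rough sense of how much variation between best_score_ and a single test-set score is to be expected from the split alone.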