文章目录

思路

1. 导入数据
2. 性能评估函数
3. 模型优化

3.1 LR模型
3.2 SVM模型
3.3 决策树模型
3.4 XGBoost模型
3.5 LightGBM模型
3.6 模型融合

4. 结果对比和分析
5. 遇到的问题

Task9 - 统一数据，数据三七分，随机种子2018，用AUC作为模型评价指标，对比单模型和融合模型的比分。

具体代码见Github

思路

导入原始数据，特征归一化后，调参，然后模型融合。

1. 导入数据

导入数据和特征归一化

import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

import warnings
warnings.filterwarnings("ignore")

# 导入数据
data = pd.read_csv('data_all.csv')
y = data['status']
data.drop('status', axis = 1, inplace = True)
X = data

# 划分训练集测试集
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,random_state=2018)

# 特征归一化
std = StandardScaler()
X_train = std.fit_transform(X_train)
X_test = std.transform(X_test)

2. 性能评估函数

from sklearn.metrics import accuracy_score, roc_auc_score

def model_metrics(clf, X_train, X_test, y_train, y_test):
    # 预测
    y_train_pred = clf.predict(X_train)
    y_test_pred = clf.predict(X_test)
    
    y_train_proba = clf.predict_proba(X_train)[:,1]
    y_test_proba = clf.predict_proba(X_test)[:,1]
    
    # 准确率
    print('[准确率]', end = ' ')
    print('训练集：', '%.4f'%accuracy_score(y_train, y_train_pred), end = ' ')
    print('测试集：', '%.4f'%accuracy_score(y_test, y_test_pred))
    
    # auc取值：用roc_auc_score或auc
    print('[auc值]', end = ' ')
    print('训练集：', '%.4f'%roc_auc_score(y_train, y_train_proba), end = ' ')
    print('测试集：', '%.4f'%roc_auc_score(y_test, y_test_proba))

3. 模型优化

调参：首先大范围粗略地调，然后细分区间调。

对于包含较多参数的模型，例如xgb和lgb。首先固定其他参数为常用值，然后扫描某几个参数，循环扫描，直到分数不再增加。

from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn import svm
from sklearn.tree import DecisionTreeClassifier
from xgboost.sklearn import XGBClassifier
from lightgbm.sklearn import LGBMClassifier
from mlxtend.classifier import StackingClassifier

3.1 LR模型

调参：正则化因子C和正则化方式penalty。

lr = LogisticRegression(random_state = 2018)
# param = {'C': [1e-3,0.01,0.1,1,10,100,1e3], 'penalty':['l1', 'l2']}
param = {'C': [i/100 for i in range(1,21)], 'penalty':['l1', 'l2']}

gsearch = GridSearchCV(lr, param_grid = param,scoring ='roc_auc', cv=5)
gsearch.fit(X_train, y_train)

print('最佳参数：',gsearch.best_params_)
print('训练集的最佳分数：', gsearch.best_score_)
print('测试集的最佳分数：', gsearch.score(X_test, y_test))

lr = LogisticRegression(C = 0.04, penalty = 'l1',random_state = 2018)
lr.fit(X_train, y_train)
model_metrics(lr, X_train, X_test, y_train, y_test)

输出

[准确率] 训练集： 0.8016 测试集： 0.7884
[auc值] 训练集： 0.8080 测试集： 0.7831

3.2 SVM模型

# 线性SVM
svm_linear = svm.SVC(kernel = 'linear', probability=True, random_state = 2018)
param = {'C':[0.01, 0.05, 0.1, 0.5, 1]}

gsearch = GridSearchCV(svm_linear, param_grid = param,scoring ='roc_auc', cv=5)
gsearch.fit(X_train, y_train)

print('最佳参数：',gsearch.best_params_)
print('训练集的最佳分数：', gsearch.best_score_)
print('测试集的最佳分数：', gsearch.score(X_test, y_test))

svm_linear = svm.SVC(C = 0.01, kernel = 'linear', probability=True,random_state = 2018)
svm_linear.fit(X_train, y_train)
model_metrics(svm_linear, X_train, X_test, y_train, y_test)

输出

[准确率] 训练集： 0.7992 测试集： 0.7765
[auc值] 训练集： 0.8152 测试集： 0.7790

其他三个svm_poly、svm_rbf和svm_sigmoid 此处不再展示，具体参见Github。

3.3 决策树模型

1）首先观察一下默认参数的结果。（最后调参完肯定要比默认参数好才对）

dt = DecisionTreeClassifier(random_state = 2018)
dt.fit(X_train, y_train)
model_metrics(dt, X_train, X_test, y_train, y_test)

输出

[准确率] 训练集： 1.0000 测试集： 0.6854
[auc值] 训练集： 1.0000 测试集： 0.5956

2）具体调参过程如下：先大范围调，再小范围调。调完后，再回到起始位置循环调，直到参数不再需要变化。

param = {'max_depth':range(3,14,2), 'min_samples_split':range(100,801,200)}
#param = {'min_samples_split':range(50,1000,100), 'min_samples_leaf':range(60,101,10)}
#param = {'min_samples_split':range(100,401,10), 'min_samples_leaf':range(40,101,10)}
#param = {'max_features':range(7,20,2)}
#param = {'max_features':[18,19,20]}
gsearch = GridSearchCV(DecisionTreeClassifier(max_depth=9,min_samples_split=100,min_samples_leaf=90, max_features=9, random_state = 2018),
                       param_grid = param,scoring ='roc_auc', cv=5)

gsearch.fit(X_train, y_train)
# gsearch.grid_scores_, 
gsearch.best_params_, gsearch.best_score_

3）调参最终结果

dt = DecisionTreeClassifier(max_depth=9,min_samples_split=100,min_samples_leaf=90, max_features=9, random_state = 2018)
dt.fit(X_train, y_train)
model_metrics(dt, X_train, X_test, y_train, y_test)

输出

[准确率] 训练集： 0.7812 测试集： 0.7561
[auc值] 训练集： 0.7721 测试集： 0.6946

3.4 XGBoost模型

1）首先观察一下默认参数的结果。

import warnings
warnings.filterwarnings("ignore")

xgb0 = XGBClassifier(random_state =2018)
xgb0.fit(X_train, y_train)

model_metrics(xgb0, X_train, X_test, y_train, y_test)

2）具体调参过程：下面的参数循环调。最后降低学习速率为0.01，调一下n_estimators，看一下性能有没有改善。

param_test = {'n_estimators':range(20,200,20)}
# param_test = {'n_estimators':range(40,81,10)}
# param_test = {'max_depth':range(3,10,2), 'min_child_weight':range(1,12,2)}
# param_test = {'max_depth':[2,3,4], 'min_child_weight':[10,11,12]}
# param_test = {'gamma':[i/10 for i in range(6)]}
# param_test = {'subsample':[i/10 for i in range(5,10)], 'colsample_bytree':[i/10 for i in range(5,10)]}
# param_test = { 'subsample':[i/100 for i in range(60,81,5)], 'colsample_bytree':[i/100 for i in range(70,91,5)]}
#param_test = {'reg_alpha':[1e-5, 1e-2, 0.1, 0, 1, 100]}
# param_test = {'reg_lambda': [0.2, 0.4, 0.6, 0.8, 1]}

gsearch = GridSearchCV(estimator = XGBClassifier(learning_rate =0.1, n_estimators=60, max_depth=3, 
                                                  min_child_weight=11, gamma=0, subsample=0.7, 
                                                  colsample_bytree=0.8, objective= 'binary:logistic', 
                                                  nthread=4,scale_pos_weight=1, random_state =2018), 
                        param_grid = param_test, scoring='roc_auc',n_jobs=4,iid=False, cv=5)

gsearch.fit(X_train, y_train)
#gsearch.grid_scores_, 
gsearch.best_params_, gsearch.best_score_

3）调参最终结果

xgb = XGBClassifier(learning_rate =0.1, n_estimators=60, max_depth=3, min_child_weight=11, 
                    gamma=0, subsample=0.7,colsample_bytree=0.8, objective= 'binary:logistic',
                    nthread=4,scale_pos_weight=1, random_state =2018)
xgb.fit(X_train, y_train)
model_metrics(xgb, X_train, X_test, y_train, y_test)

输出

[准确率] 训练集： 0.8302 测试集： 0.7891
[auc值] 训练集： 0.8710 测试集： 0.7780

3.5 LightGBM模型

类似XGBoost

lgb0 = LGBMClassifier(random_state =2018)
lgb0.fit(X_train, y_train)

model_metrics(lgb0, X_train, X_test, y_train, y_test)

param_test = {'n_estimators':range(20,200,20)}
# param_test = {'n_estimators':range(30,51,10)}
# param_test = {'max_depth':range(3,10,2), 'min_child_weight':range(1,12,2)}
# param_test = {'max_depth':[2,3,4], 'min_child_weight':[6,7,8]}
# param_test = {'gamma':[i/10 for i in range(6)]}
# param_test = {'subsample':[i/10 for i in range(5,10)], 'colsample_bytree':[i/10 for i in range(5,10)]}
# param_test = { 'subsample':[i/100 for i in range(60,81,5)], 'colsample_bytree':[i/100 for i in range(70,91,5)]}
#param_test = {'reg_alpha':[1e-5, 1e-2, 0.1, 0, 1, 100]}
# param_test = {'reg_lambda': [0.2, 0.4, 0.6, 0.8, 1]}
# 上述循环调整, 然后降低学习速率
gsearch = GridSearchCV(estimator = LGBMClassifier(learning_rate =0.1, n_estimators=50, max_depth=3, 
                                                  min_child_weight=7, gamma=0, subsample=0.5, 
                                                  colsample_bytree=0.8, reg_alpha = 1e-5,
                                                  nthread=4,scale_pos_weight=1, random_state =2018), 
                        param_grid = param_test, scoring='roc_auc',n_jobs=4,iid=False, cv=5)

gsearch.fit(X_train, y_train)
# gsearch.grid_scores_, 
gsearch.best_params_, gsearch.best_score_

lgb = LGBMClassifier(learning_rate =0.1, n_estimators=50, max_depth=3, min_child_weight=7, 
                    gamma=0, subsample=0.5, colsample_bytree=0.8, reg_alpha=1e-5, 
                    nthread=4,scale_pos_weight=1,random_state =2018)
lgb.fit(X_train, y_train)
model_metrics(lgb, X_train, X_test, y_train, y_test)

输出

[准确率] 训练集： 0.8269 测试集： 0.7877
[auc值] 训练集： 0.8741 测试集： 0.7746

3.6 模型融合

1.在融合的时候需要对模型进行筛选
2.StackingClassifier的参数设置

如果average_probas=True，则对分类器的结果求平均，得到：p=[0.25,0.45,0.35]
如果average_probas=False，则分类器的所有结果都保留作为新的特征：p=[0.2,0.5,0.3,0.3,0.4,0.4]

average_probas尝试True后, 效果更好。其次, 决策树和svm_poly单模型效果并不好, 尝试去掉两者后再Stacking

lr = LogisticRegression(C = 0.04, penalty = 'l1',random_state = 2018)
svm_linear =svm.SVC(C = 0.01, kernel = 'linear', probability=True,random_state = 2018)
svm_poly =  svm.SVC(C = 0.01, kernel = 'poly', probability=True,random_state = 2018)
svm_rbf =  svm.SVC(gamma = 0.01, C =0.01 , probability=True,random_state = 2018)
svm_sigmoid =  svm.SVC(C = 0.01, kernel = 'sigmoid',probability=True,random_state = 2018)
dt = DecisionTreeClassifier(max_depth=9,min_samples_split=100,min_samples_leaf=90, max_features=9, 
                            random_state = 2018)
xgb = XGBClassifier(learning_rate =0.1, n_estimators=60, max_depth=3, min_child_weight=11, 
                    gamma=0, subsample=0.7,colsample_bytree=0.8, objective= 'binary:logistic',
                    nthread=4,scale_pos_weight=1, random_state =2018)
lgb = LGBMClassifier(learning_rate =0.1, n_estimators=50, max_depth=3, min_child_weight=7, 
                    gamma=0, subsample=0.5, colsample_bytree=0.8, reg_alpha=1e-5, 
                    nthread=4,scale_pos_weight=1,random_state =2018)

将初级分类器产生的输出类概率作为新特征

sclf = StackingClassifier(classifiers=[lr, svm_linear, svm_rbf, xgb, lgb], 
                            meta_classifier=lr, use_probas=True,average_probas=True)
                            
sclf.fit(X_train, y_train.values)
model_metrics(sclf, X_train, X_test, y_train, y_test)

输出

[准确率] 训练集： 0.8161 测试集： 0.7821
[auc值] 训练集： 0.8556 测试集： 0.7861

4. 结果对比和分析

模型	参数	auc值
LR	C = 0.04, penalty = ‘l1’	训练集： 0.8080 测试集： 0.7831
svm_linear	C = 0.01	训练集： 0.8152 测试集： 0.7790
svm_poly	C = 0.01	训练集： 0.8626 测试集： 0.7347
svm_rbf	gamma = 0.01, C =0.01	训练集： 0.8522 测试集： 0.7708
svm_sigmoid	C = 0.01	训练集： 0.7660 测试集： 0.7590
决策树	max_depth=9,min_samples_split=100,min_samples_leaf=90, max_features=9	训练集： 0.7721 测试集： 0.6946
XGBoost	learning_rate =0.1, n_estimators=60, max_depth=3, min_child_weight=11, gamma=0, subsample=0.7,colsample_bytree=0.8	训练集： 0.8710 测试集： 0.7780
LightGBM	learning_rate =0.1, n_estimators=50, max_depth=3, min_child_weight=7, gamma=0, subsample=0.5, colsample_bytree=0.8	训练集： 0.8741 测试集： 0.7746
Stacking	-	训练集： 0.8750 测试集： 0.7861

分析
测试集最好情况是LR模型0.7831。
并且可以看到LR取最好结果时, 选择的是L1正则化。所以猜测需要进一步进行特征选择。

5. 遇到的问题

调参后best_score_的分数和求得的测试集auc值不相同。

原因是best_score_时使用交叉验证，和最终的test数据切分不一样吗？感觉有几个模型差了0.2，有点多。

ML - 贷款用户逾期情况分析6 - Final