1 Overview
Back when I first worked on logistic regression, I had already tried grid search to find the best feature combination.
Code link: https://github.com/spareribs/kaggleSpareribs/blob/master/Overdue/ml/for beginner/stacking.py
2 How to use the code
- [Required] First run base.py in features to preprocess the data [PS: adjust it to your actual situation]
- [Optional] Then use sklearn_gcv.py in code to find the optimal parameters, and update the model parameter configuration in sklearn_config.py accordingly
- [Required] Then run stacking.py in for beginner to fuse the models and output the result
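The parameter-search step in sklearn_gcv.py is not shown in this write-up; a minimal sketch of what such a search might look like (the toy data and the parameter grid here are assumptions, the real grids live in sklearn_gcv.py):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Toy data standing in for the features produced by base.py
x_train, y_train = make_classification(n_samples=200, n_features=10, random_state=1)

# Hypothetical grid; tune ranges per model as needed
param_grid = {"C": [0.01, 0.1, 1.0], "penalty": ["l1", "l2"]}
gcv = GridSearchCV(LogisticRegression(solver="liblinear"), param_grid,
                   cv=5, scoring="f1")
gcv.fit(x_train, y_train)
print(gcv.best_params_)  # copy the winning values into sklearn_config.py
```

The best parameters found this way are what get hard-coded into the `clfs` dict below.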
3 Core code walkthrough
Model parameters specified in sklearn_config.py:
import lightgbm as lgb
import xgboost as xgb
from sklearn.ensemble import (AdaBoostClassifier, BaggingClassifier,
                              GradientBoostingClassifier, RandomForestClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC, SVC

# base_clf (the base estimator for bagging/adaboost) is defined elsewhere in sklearn_config.py
clfs = {
    'lr': LogisticRegression(penalty='l1', C=0.1, tol=0.0001),
    'svm': LinearSVC(C=0.05, penalty='l2', dual=True),
    'svm_linear': SVC(kernel='linear', probability=True),
    'svm_ploy': SVC(kernel='poly', probability=True),
    'bagging': BaggingClassifier(base_estimator=base_clf, n_estimators=60, max_samples=1.0,
                                 max_features=1.0, random_state=1, n_jobs=1, verbose=1),
    'rf': RandomForestClassifier(n_estimators=40, criterion='gini', max_depth=9),
    'adaboost': AdaBoostClassifier(base_estimator=base_clf, n_estimators=50, algorithm='SAMME'),
    'gbdt': GradientBoostingClassifier(),
    'xgb': xgb.XGBClassifier(learning_rate=0.1, max_depth=3, n_estimators=50),
    'lgb': lgb.LGBMClassifier(boosting_type='gbdt', learning_rate=0.01, max_depth=5,
                              n_estimators=250, num_leaves=90),
}
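Before fusing, each entry in `clfs` can be sanity-checked on its own with cross-validation; a quick sketch (toy data assumed in place of the real features):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# One entry from the config dict, evaluated in isolation
clfs = {"rf": RandomForestClassifier(n_estimators=40, criterion="gini", max_depth=9)}

x, y = make_classification(n_samples=300, n_features=10, random_state=1)
scores = cross_val_score(clfs["rf"], x, y, cv=5, scoring="f1")
print(scores.mean())  # per-model F1 helps decide which models are worth stacking
```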
Model fusion:
from mlxtend.classifier import StackingCVClassifier
from sklearn.metrics import f1_score, roc_auc_score

lr_clf = clfs["lr"]  # also reused as the meta classifier
svm_clf = clfs["svm_ploy"]
rf_clf = clfs["rf"]
xgb_clf = clfs["xgb"]
lgb_clf = clfs["lgb"]

# Straightforward library call ~~~ `classifiers` lists the base models to fuse,
# `meta_classifier` is the second-level model trained on their outputs
sclf = StackingCVClassifier(classifiers=[lr_clf, svm_clf, rf_clf, xgb_clf, lgb_clf],
                            meta_classifier=lr_clf, use_probas=True, verbose=3)
sclf.fit(x_train, y_train)

print("Model & parameters:\n{0}".format(sclf))
print("=" * 20)
pre_train = sclf.predict(x_train)
print("Training accuracy: {0:.4f}".format(sclf.score(x_train, y_train)))
print("Training F1: {0:.4f}".format(f1_score(y_train, pre_train)))
print("Training AUC: {0:.4f}".format(roc_auc_score(y_train, pre_train)))
Output:
Model & parameters:
StackingCVClassifier(classifiers=[LogisticRegression(C=0.1, class_weight=None, dual=False, fit_intercept=True,
intercept_scaling=1, max_iter=100, multi_class='warn',
n_jobs=None, penalty='l1', random_state=None, solver='warn',
tol=0.0001, verbose=0, warm_start=False), SVC(C=1.0, cache_size=....0, reg_lambda=0.0, silent=True,
subsample=1.0, subsample_for_bin=200000, subsample_freq=0)],
cv=2,
meta_classifier=LogisticRegression(C=0.1, class_weight=None, dual=False, fit_intercept=True,
intercept_scaling=1, max_iter=100, multi_class='warn',
n_jobs=None, penalty='l1', random_state=None, solver='warn',
tol=0.0001, verbose=0, warm_start=False),
shuffle=True, store_train_meta_features=False, stratify=True,
use_clones=True, use_features_in_secondary=False,
use_probas=True, verbose=3)
====================
Training accuracy: 0.8220
Training F1: 0.5331
Training AUC: 0.6833
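Note that all three scores above are computed on the training set, which is typically optimistic. A fairer estimate comes from a held-out split; a minimal sketch (using sklearn's StackingClassifier in place of mlxtend's StackingCVClassifier, and toy data in place of the real features):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, roc_auc_score
from sklearn.model_selection import train_test_split

x, y = make_classification(n_samples=400, n_features=10, random_state=1)
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=1)

stack = StackingClassifier(
    estimators=[("lr", LogisticRegression(max_iter=1000)),
                ("rf", RandomForestClassifier(n_estimators=40, random_state=1))],
    final_estimator=LogisticRegression(max_iter=1000),
    stack_method="predict_proba")  # analogous to use_probas=True in mlxtend

stack.fit(x_train, y_train)
pre_test = stack.predict(x_test)
print("Test accuracy: {0:.4f}".format(stack.score(x_test, y_test)))
print("Test F1: {0:.4f}".format(f1_score(y_test, pre_test)))
print("Test AUC: {0:.4f}".format(roc_auc_score(y_test, stack.predict_proba(x_test)[:, 1])))
```

Scoring AUC on `predict_proba` rather than on hard predictions also gives a more meaningful AUC than the training snippet above.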
4 Open questions
- Question 1: I don't fully understand the implementation approach in https://blog.csdn.net/qq1483661204/article/details/80157365
- Question 2: using train_test_split on the full dataset raises an error, so for now I only have training data and no test set:
Traceback (most recent call last):
File "D:/SoftwareData/Dropbox/MachineLearning/kaggleSpareribs/Overdue/ml/for beginner/stacking.py", line 42, in <module>
sclf.fit(x_train, y_train)
File "C:\ProgramData\Anaconda3\lib\site-packages\mlxtend\classifier\stacking_cv_classification.py", line 216, in fit
model.fit(X[train_index], y[train_index])
File "C:\ProgramData\Anaconda3\lib\site-packages\sklearn\linear_model\logistic.py", line 1284, in fit
accept_large_sparse=solver != 'liblinear')
File "C:\ProgramData\Anaconda3\lib\site-packages\sklearn\utils\validation.py", line 753, in check_X_y
_assert_all_finite(y)
File "C:\ProgramData\Anaconda3\lib\site-packages\sklearn\utils\validation.py", line 56, in _assert_all_finite
raise ValueError(msg_err.format(type_err, X.dtype))
ValueError: Input contains NaN, infinity or a value too large for dtype('float64').
- Question 3: the scores look a bit high; I suspect it is once again a problem with how the model is being scored
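Regarding the ValueError in Question 2: the message says the labels or features contain NaN or infinity, so checking and cleaning the frame before splitting usually resolves it. A hedged sketch (the column names here are assumptions; the real columns come from base.py):

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the processed data from base.py
df = pd.DataFrame({"f1": [1.0, np.nan, 3.0],
                   "f2": [0.5, 0.2, np.inf],
                   "label": [0, 1, np.nan]})

# Map inf to NaN so a single check catches both problem values
df = df.replace([np.inf, -np.inf], np.nan)
print(df.isna().sum())            # locate the offending columns

df = df.dropna(subset=["label"])  # rows without a label cannot be trained on
df = df.fillna(df.median())       # simple per-column median imputation for features
```

After this, train_test_split and sclf.fit should no longer trip the finiteness check.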
References
https://blog.csdn.net/qq1483661204/article/details/80157365
https://mp.weixin.qq.com/s/DG5VbDgOcqlRhbEbiP9_aw
https://blog.csdn.net/LAW_130625/article/details/78573736