Data Mining Algorithms and Practice (22): A LightGBM Ensemble Example (Breast Cancer Dataset)

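This section walks through a simple example of modeling the breast cancer dataset from sklearn.datasets with LightGBM. For background on ensemble learning, see Data Mining Algorithms and Practice (18): Ensemble Learning Algorithms (Boosting, Bagging). LightGBM (LGBM) is a very widely used algorithm.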

1. Import common packages

import datetime
import numpy as np
import pandas as pd
import lightgbm as lgb
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
import matplotlib.pyplot as plt
# Jupyter notebook magic: render plots inline (omit this line when running as a plain script)
%matplotlib inline

2. Load the dataset

# Load the dataset
breast = load_breast_cancer()
# Get the feature matrix and the target values
X, y = breast.data, breast.target
# Get the feature names
feature_name = breast.feature_names
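
As a quick sanity check before modeling, the loaded arrays can be inspected. The sketch below is an addition to the original walkthrough; it relies only on attributes of the object returned by load_breast_cancer, and the variable name counts is illustrative.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer

breast = load_breast_cancer()
X, y = breast.data, breast.target

# Shape of the feature matrix: (samples, features)
print(X.shape)

# Class names in label order: label 0 is 'malignant', label 1 is 'benign'
print(list(breast.target_names))

# Number of samples carrying each class label
counts = np.bincount(y)
print(dict(zip(breast.target_names, counts)))
```

Knowing which label is the positive class matters later when reading the AUC values and the predicted scores.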

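3. Data preprocessing

This is a fairly standard toy dataset, so no elaborate preprocessing is needed.

# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Convert to the LightGBM Dataset format
lgb_train = lgb.Dataset(X_train, y_train)
lgb_eval = lgb.Dataset(X_test, y_test, reference=lgb_train)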

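4. Modeling and parameters

# Parameter settings
boost_round = 50  # maximum number of boosting iterations
early_stop_rounds = 10  # stop early if the validation metric has not improved for this many rounds

params = {
    'boosting_type': 'gbdt',  # boosting type
    'objective': 'regression',  # objective function (fits the 0/1 labels as a continuous target; 'binary' is the usual choice for classification)
    'metric': {'l2', 'auc'},  # evaluation metrics
    'num_leaves': 31,  # maximum number of leaves per tree
    'learning_rate': 0.05,  # learning rate
    'feature_fraction': 0.9,  # fraction of features sampled for each tree
    'bagging_fraction': 0.8,  # fraction of samples drawn for each tree
    'bagging_freq': 5,  # k means bagging is performed every k iterations
    'verbose': 1  # <0: fatal only, =0: errors (warnings), >0: info
}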

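# Train the model with early stopping enabled
# Note: lgb.train accepts early_stopping_rounds and evals_result only in LightGBM versions before 4.0
results = {}
gbm = lgb.train(params,
                lgb_train,
                num_boost_round=boost_round,
                valid_sets=(lgb_eval, lgb_train),
                valid_names=('validate', 'train'),
                early_stopping_rounds=early_stop_rounds,
                evals_result=results)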

Training output:

[LightGBM] [Warning] Auto-choosing col-wise multi-threading, the overhead of testing was 0.001093 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 4548
[LightGBM] [Info] Number of data points in the train set: 455, number of used features: 30
[LightGBM] [Info] Start training from score 0.637363
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[1]	train's auc: 0.984943	train's l2: 0.21292	validate's auc: 0.98825	validate's l2: 0.225636
Training until validation scores don't improve for 10 rounds
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[2]	train's auc: 0.990805	train's l2: 0.196278	validate's auc: 0.992855	validate's l2: 0.208124
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[3]	train's auc: 0.990324	train's l2: 0.181505	validate's auc: 0.992379	validate's l2: 0.192562
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[4]	train's auc: 0.990439	train's l2: 0.168012	validate's auc: 0.993966	validate's l2: 0.178022
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[5]	train's auc: 0.990376	train's l2: 0.15582	validate's auc: 0.993014	validate's l2: 0.164942
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[6]	train's auc: 0.990752	train's l2: 0.144636	validate's auc: 0.993649	validate's l2: 0.152745
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[7]	train's auc: 0.991641	train's l2: 0.134404	validate's auc: 0.993331	validate's l2: 0.142248
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[8]	train's auc: 0.992571	train's l2: 0.124721	validate's auc: 0.992379	validate's l2: 0.132609
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[9]	train's auc: 0.992884	train's l2: 0.116309	validate's auc: 0.991743	validate's l2: 0.123573
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[10]	train's auc: 0.992989	train's l2: 0.108757	validate's auc: 0.992696	validate's l2: 0.115307
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[11]	train's auc: 0.993156	train's l2: 0.101871	validate's auc: 0.991743	validate's l2: 0.108458
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[12]	train's auc: 0.99348	train's l2: 0.0954168	validate's auc: 0.99222	validate's l2: 0.101479
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[13]	train's auc: 0.993396	train's l2: 0.0897573	validate's auc: 0.99222	validate's l2: 0.0956762
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[14]	train's auc: 0.993605	train's l2: 0.0846034	validate's auc: 0.992855	validate's l2: 0.0898012
Early stopping, best iteration is:
[4]	train's auc: 0.990439	train's l2: 0.168012	validate's auc: 0.993966	validate's l2: 0.178022
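
The log stops at iteration 14 and reports iteration 4 as the best. The sketch below replays the validation AUC values from the log through a simplified patience rule to show why. The function patience_stop is a hypothetical helper written for this illustration; it is not LightGBM's implementation and it tracks a single metric only.

```python
# Validation AUC per iteration, copied from the training log above
validate_auc = [
    0.98825, 0.992855, 0.992379, 0.993966, 0.993014, 0.993649, 0.993331,
    0.992379, 0.991743, 0.992696, 0.991743, 0.99222, 0.99222, 0.992855,
]


def patience_stop(scores, patience):
    """Return (best_iteration, stop_iteration), both 1-based.

    Simplified rule: stop once `patience` iterations have passed
    without beating the best score seen so far (higher is better).
    stop_iteration is None if the rule never triggers.
    """
    best_iter, best_score = 0, float('-inf')
    for i, score in enumerate(scores, start=1):
        if score > best_score:
            best_iter, best_score = i, score
        elif i - best_iter >= patience:
            return best_iter, i
    return best_iter, None


best_iter, stop_iter = patience_stop(validate_auc, patience=10)
print(best_iter, stop_iter)  # prints: 4 14
```

The validation l2 keeps decreasing through iteration 14, so it is the AUC that triggers the stop here.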

5. Applying and evaluating the model

# Predict on the test set using the best iteration found by early stopping
y_pred = gbm.predict(X_test, num_iteration=gbm.best_iteration)
y_pred
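
Because the model is trained with the regression objective, gbm.predict returns continuous scores rather than class labels. One common way to obtain labels, and to make use of the accuracy_score imported at the top, is to threshold the scores. The 0.5 cutoff and the small arrays below are illustrative assumptions standing in for y_pred and y_test; they are not results from this model.

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Hypothetical stand-ins for y_pred (continuous scores) and y_test (true labels)
scores = np.array([0.91, 0.12, 0.67, 0.48, 0.85, 0.30])
y_true = np.array([1, 0, 1, 1, 1, 0])

# Assumed cutoff: scores at or above 0.5 become label 1 (benign), the rest label 0 (malignant)
labels = (scores >= 0.5).astype(int)

# Fraction of samples whose thresholded label matches the true label
acc = accuracy_score(y_true, labels)
print(labels, acc)
```

With the real arrays, the same two lines would read labels = (y_pred >= 0.5).astype(int) and accuracy_score(y_test, labels).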

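# Model evaluation: plot a recorded metric over the boosting iterations
# (two metrics were recorded; pass metric='auc' or metric='l2' to choose which one is drawn)
lgb.plot_metric(results)
plt.show()

# Plot feature importance (number of times each feature is used in a split)
lgb.plot_importance(gbm, importance_type="split")
plt.show()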
