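ML: Cross-validated training and performance comparison of several tree-based algorithms on the Titanic dataset (one-hot encoding / label encoding + DT/RF/XGBoost/LightGBM/CatBoost, focusing on how each algorithm handles categorical features)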
Related articles
ML: Cross-validated training and performance comparison of several tree-based algorithms on the Titanic dataset (this article)
ML: Cross-validated training and performance comparison of several tree-based algorithms on the Titanic dataset: implementation code
# 1. Define the dataset
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 PassengerId 891 non-null int64
1 Survived 891 non-null int64
2 Pclass 891 non-null int64
3 Name 891 non-null object
4 Sex 891 non-null object
5 Age 714 non-null float64
6 SibSp 891 non-null int64
7 Parch 891 non-null int64
8 Ticket 891 non-null object
9 Fare 891 non-null float64
10 Cabin 204 non-null object
11 Embarked 889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB
None
PassengerId Survived Pclass ... Fare Cabin Embarked
0 1 0 3 ... 7.2500 NaN S
1 2 1 1 ... 71.2833 C85 C
2 3 1 3 ... 7.9250 NaN S
3 4 1 1 ... 53.1000 C123 S
4 5 0 3 ... 8.0500 NaN S
[5 rows x 12 columns]
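The output above looks like the result of `df.info()` followed by `df.head()`. A minimal sketch of that step, assuming pandas and using a few hand-written rows in the Titanic schema instead of the real file (the rows and the `csv_text` name are illustrative, and the real file path is not given in the source):

```python
import io

import pandas as pd

# Hand-written rows in the Titanic schema (illustrative only, not the real data).
csv_text = """PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
1,0,3,Passenger A,male,22,1,0,T1,7.25,,S
2,1,1,Passenger B,female,38,1,0,T2,71.2833,C85,C
3,1,3,Passenger C,female,,0,0,T3,7.925,,S
"""

# With the real data this would be a pd.read_csv(...) call on the CSV file.
df = pd.read_csv(io.StringIO(csv_text))

df.info()         # prints column names, non-null counts and dtypes
print(df.head())  # prints the first rows
```

`df.info()` prints its report directly and returns `None`, so wrapping it in `print()` produces the extra `None` line seen in the output.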
# Define the model input features
after featuresIN………………………………………………
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 6 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Survived 891 non-null int64
1 Pclass 891 non-null int64
2 Age 714 non-null float64
3 Fare 891 non-null float64
4 Sex 891 non-null object
5 Embarked 889 non-null object
dtypes: float64(2), int64(2), object(2)
memory usage: 41.9+ KB
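The reduced frame keeps the label plus five features. A small sketch of that selection, assuming pandas; the toy rows and the names `features_in` and `df_model` are illustrative, while the column order follows the `info()` output above:

```python
import pandas as pd

# Toy rows in the Titanic schema (illustrative values, not the real data).
df = pd.DataFrame({
    "PassengerId": [1, 2, 3],
    "Survived": [0, 1, 1],
    "Pclass": [3, 1, 3],
    "Name": ["A", "B", "C"],
    "Sex": ["male", "female", "female"],
    "Age": [22.0, 38.0, None],
    "Fare": [7.25, 71.2833, 7.925],
    "Embarked": ["S", "C", None],
})

label = "Survived"
features_in = ["Pclass", "Age", "Fare", "Sex", "Embarked"]

# Keep only the label and the selected features; copy to avoid chained-assignment issues later.
df_model = df[[label] + features_in].copy()
print(df_model.dtypes)
```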
# 2. Data preprocessing
# 2.1 Missing-value handling
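Age (714 non-null) and Embarked (889 non-null) are the two selected columns with gaps, and both are complete after this step. The fill strategy is not shown in the source; the sketch below assumes the common choice of median for the numeric column and mode for the categorical one:

```python
import pandas as pd

# Toy column values (illustrative, not the real data).
df = pd.DataFrame({
    "Age": [22.0, None, 26.0, 35.0],
    "Embarked": ["S", "C", None, "S"],
})

# Assumption: median for numeric, most frequent value for categorical.
df["Age"] = df["Age"].fillna(df["Age"].median())
df["Embarked"] = df["Embarked"].fillna(df["Embarked"].mode()[0])

print(df)
```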
# 2.2 Feature encoding
# T1. One-hot encoding
OHEncode………………………………………………
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 9 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Survived 891 non-null int64
1 Pclass 891 non-null int64
2 Age 891 non-null float64
3 Fare 891 non-null float64
4 Sex_female 891 non-null uint8
5 Sex_male 891 non-null uint8
6 Embarked_C 891 non-null uint8
7 Embarked_Q 891 non-null uint8
8 Embarked_S 891 non-null uint8
dtypes: float64(2), int64(2), uint8(5)
memory usage: 32.3 KB
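The generated column names (`Sex_female`, `Embarked_C`, and so on) match what `pandas.get_dummies` produces, so the encoding step was probably something close to the sketch below. This is an assumption, since the call itself is not shown. Recent pandas versions return `bool` dummy columns by default, so `dtype=np.uint8` is passed here to reproduce the `uint8` columns in the output:

```python
import numpy as np
import pandas as pd

# Toy rows (illustrative, not the real data).
df = pd.DataFrame({
    "Survived": [0, 1, 1, 0],
    "Pclass": [3, 1, 3, 2],
    "Sex": ["male", "female", "female", "male"],
    "Embarked": ["S", "C", "Q", "S"],
})

# One indicator column per category value; the original Sex/Embarked columns are dropped.
df_ohe = pd.get_dummies(df, columns=["Sex", "Embarked"], dtype=np.uint8)
print(df_ohe.columns.tolist())
```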
# T2. Label encoding
LBEncode………………………………………………
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 6 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Survived 891 non-null int64
1 Pclass 891 non-null int64
2 Age 891 non-null float64
3 Fare 891 non-null float64
4 Sex 891 non-null int32
5 Embarked 891 non-null int32
dtypes: float64(2), int32(2), int64(2)
memory usage: 34.9 KB
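Label encoding keeps one column per feature and replaces each category with an integer code. A minimal sketch, assuming scikit-learn's `LabelEncoder` (the source does not name the encoder; the `encoders` dict is an illustrative addition that keeps the fitted mappings):

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy rows (illustrative, not the real data).
df = pd.DataFrame({
    "Sex": ["male", "female", "female", "male"],
    "Embarked": ["S", "C", "Q", "S"],
})

encoders = {}
for col in ["Sex", "Embarked"]:
    enc = LabelEncoder()
    # Classes are sorted, so codes follow alphabetical order of the category values.
    df[col] = enc.fit_transform(df[col])
    encoders[col] = enc

print(df)
print({col: list(enc.classes_) for col, enc in encoders.items()})
```

The integer codes imply an ordering that the categories do not really have, which is why the next step considers telling the algorithm which columns are categorical.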
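# 2.3 Whether to declare categorical features to the algorithm
cat_features and cat_features_indices only need to be set when label encoding (LBEncode) is used. After one-hot encoding the categorical columns are already 0/1 indicator columns, so there is nothing left to declare as categorical.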
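The two variables are presumably the categorical column names and their positions in the feature matrix. A small sketch of how they could be derived; the `feature_cols` list is an assumption based on the selected features, and how the values are wired into each model is not shown in the source:

```python
# Feature columns after label encoding, in model input order (label excluded).
feature_cols = ["Pclass", "Age", "Fare", "Sex", "Embarked"]

# Columns that hold category codes rather than real numeric quantities.
cat_features = ["Sex", "Embarked"]

# Positional indices, for APIs that expect positions instead of names.
cat_features_indices = [feature_cols.index(col) for col in cat_features]
print(cat_features_indices)  # [3, 4]

# For reference: CatBoost takes a `cat_features` argument, and LightGBM's
# scikit-learn `fit` takes `categorical_feature`. Both accept such lists.
```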
# 2.4 Separate features and label
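Splitting the frame into a feature matrix and a label vector can be sketched as follows, assuming pandas and a label-encoded toy frame (the values are illustrative):

```python
import pandas as pd

# Toy label-encoded frame (illustrative, not the real data).
df = pd.DataFrame({
    "Survived": [0, 1, 1],
    "Pclass": [3, 1, 3],
    "Age": [22.0, 38.0, 26.0],
    "Fare": [7.25, 71.2833, 7.925],
    "Sex": [1, 0, 0],
    "Embarked": [2, 0, 2],
})

label = "Survived"
X = df.drop(columns=[label])  # everything except the label
y = df[label]                 # the label column as a Series
print(X.shape, y.shape)
```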
# 3. Model training and validation
# T1. Cross-validate and evaluate the models
# 3.1 Define 5-fold cross-validation and split the dataset
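A sketch of a 5-fold splitter, assuming scikit-learn. Whether the original used stratified folds, shuffling, or a particular seed is not stated, so `StratifiedKFold`, `shuffle=True` and `random_state=42` are assumptions:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Toy data: 20 samples, balanced binary label (illustrative only).
X = np.arange(40).reshape(20, 2)
y = np.array([0, 1] * 10)

# Stratified folds keep the class ratio roughly equal in every fold.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
folds = list(cv.split(X, y))

for i, (train_idx, valid_idx) in enumerate(folds, start=1):
    print(f"fold {i}: train={len(train_idx)}, valid={len(valid_idx)}, "
          f"positives in valid={int(y[valid_idx].sum())}")
```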
# 3.2 Model training and evaluation
Each configuration has two score columns, apparently AUC followed by F1, as in the single-split output.

| Model | OHEncode + not_set_para_cat_features | | LBEncode + not_set_para_cat_features | | LBEncode + set_para_cat_features | | OHEncode + set_para_cat_features | |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| DT | 0.7787 | 0.7229 | 0.7808 | 0.7252 | 0.7808 | 0.7252 | 0.7787 | 0.7229 |
| RF | 0.8042 | 0.7541 | 0.8069 | 0.7589 | 0.8069 | 0.7589 | 0.8042 | 0.7541 |
| XGBoost | 0.7938 | 0.7422 | 0.8055 | 0.7576 | 0.8055 | 0.7576 | 0.7938 | 0.7422 |
| LightGBM | 0.8022 | 0.7529 | 0.8028 | 0.7539 | 0.8009 | 0.7509 | 0.8022 | 0.7529 |
| CatBoost | 0.8034 | 0.7510 | 0.7972 | 0.7434 | 0.7910 | 0.7340 | 0.8011 | 0.7490 |
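A comparison loop of this kind can be sketched with scikit-learn's `cross_validate`. Everything specific here is an assumption: the data is synthetic, only DT and RF are included to keep the example dependency-light, and the hyperparameters, metric definitions and fold settings of the original run are not given. The scikit-learn style wrappers of XGBoost, LightGBM and CatBoost could be added to the same `models` dict.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the encoded feature matrix (not the Titanic data).
X, y = make_classification(n_samples=300, n_features=5, random_state=0)

# Assumed hyperparameters, chosen only for illustration.
models = {
    "DT": DecisionTreeClassifier(max_depth=4, random_state=42),
    "RF": RandomForestClassifier(n_estimators=50, random_state=42),
}

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

results = {}
for name, model in models.items():
    # Each metric is computed on every validation fold, then averaged.
    scores = cross_validate(model, X, y, cv=cv, scoring=["roc_auc", "f1"])
    results[name] = (scores["test_roc_auc"].mean(), scores["test_f1"].mean())
    print(f"{name} - AUC: {results[name][0]:.4f}, F1: {results[name][1]:.4f}")
```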
# T2. Train and evaluate the models on a single train/test split
# 3.1 Split the dataset
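A single hold-out split can be sketched with `train_test_split`. The split ratio, stratification and seed used in the original are not stated, so `test_size=0.2`, `stratify=y` and `random_state=42` are assumptions, and the data is synthetic:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the encoded feature matrix (not the Titanic data).
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

# Hold out 20% of the rows; stratify keeps the class ratio similar in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
print(X_train.shape, X_test.shape)
```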
# 3.2 Model training and evaluation
LBEncode + set_para_cat_features
DT - AUC: 0.8046, F1: 0.7500
RF - AUC: 0.8232, F1: 0.7752
XGBoost - AUC: 0.8353, F1: 0.7907
LightGBM - AUC: 0.8594, F1: 0.8217
CatBoost - AUC: 0.8341, F1: 0.7934
OHEncode + set_para_cat_features
DT - AUC: 0.8364, F1: 0.7883
RF - AUC: 0.8276, F1: 0.7813
XGBoost - AUC: 0.8265, F1: 0.7786
LightGBM - AUC: 0.8594, F1: 0.8217
CatBoost - AUC: 0.8331, F1: 0.7903
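The single-split evaluation can be sketched as below, printing in the same `name - AUC, F1` format. This is a sketch under assumptions: the data is synthetic, only DT and RF are shown, the hyperparameters are illustrative, and AUC is computed here from predicted probabilities, which may differ from how the original scores were obtained.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the encoded feature matrix (not the Titanic data).
X, y = make_classification(n_samples=300, n_features=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Assumed hyperparameters, chosen only for illustration.
models = {
    "DT": DecisionTreeClassifier(max_depth=4, random_state=42),
    "RF": RandomForestClassifier(n_estimators=50, random_state=42),
}

single_split = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    proba = model.predict_proba(X_test)[:, 1]  # probability of the positive class
    pred = model.predict(X_test)               # hard 0/1 predictions
    single_split[name] = (roc_auc_score(y_test, proba), f1_score(y_test, pred))
    print(f"{name} - AUC: {single_split[name][0]:.4f}, F1: {single_split[name][1]:.4f}")
```

A single split depends heavily on which rows land in the test set, which is one reason these scores can rank the models differently from the cross-validated ones.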