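ML / FE / FS: Wrapper and Embedded feature selection on the Titanic dataset (with all user-specified categorical features encoded in one step), an application case using the permutation-importance-based Wrapper_PFI_RF and Embedded_ETC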
# 1. Define the dataset
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 PassengerId 891 non-null int64
1 Survived 891 non-null int64
2 Pclass 891 non-null int64
3 Name 891 non-null object
4 Sex 891 non-null object
5 Age 714 non-null float64
6 SibSp 891 non-null int64
7 Parch 891 non-null int64
8 Ticket 891 non-null object
9 Fare 891 non-null float64
10 Cabin 204 non-null object
11 Embarked 889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB
# Define the required features
PassengerId Survived Pclass ... Fare Cabin Embarked
0 1 0 3 ... 7.2500 NaN S
1 2 1 1 ... 71.2833 C85 C
2 3 1 3 ... 7.9250 NaN S
3 4 1 1 ... 53.1000 C123 S
4 5 0 3 ... 8.0500 NaN S
[5 rows x 12 columns]
# 2. Feature engineering / data preprocessing
# 2.1 Fill missing values
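The code for this step is not shown in this section. The following is a minimal sketch, assuming numeric gaps such as `Age` are filled with the column median. The tiny DataFrame and the choice of the median are assumptions for illustration. `Embarked` is left untouched here because the encoded output in 2.3 still contains an `Embarked_nan` column, which suggests its missing values were kept as a category of their own.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the Titanic frame; the values are illustrative only.
df = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0, np.nan],
    "Fare": [7.25, 71.2833, 7.925, 53.1, 8.05],
    "Embarked": pd.Series(["S", "C", np.nan, "S", "S"], dtype=object),
})

# Assumption: numeric gaps are filled with the per-column median.
num_cols = ["Age", "Fare"]
df[num_cols] = df[num_cols].fillna(df[num_cols].median())

# Remaining missing values per column (Embarked is deliberately not filled).
print(df.isna().sum())
```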
# 2.2 Separate features and labels
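A minimal sketch of this step, assuming `Survived` is the label and the features are the seven raw columns that reappear after encoding (`Pclass`, `Sex`, `Age`, `SibSp`, `Parch`, `Fare`, `Embarked`). The names `label_name` and `feature_cols` are hypothetical, and the rows are placeholders rather than real records.

```python
import pandas as pd

# Hypothetical rows with the same 12 columns as the raw frame; values are placeholders.
df = pd.DataFrame({
    "PassengerId": [1, 2, 3],
    "Survived": [0, 1, 1],
    "Pclass": [3, 1, 3],
    "Name": ["a", "b", "c"],
    "Sex": ["male", "female", "female"],
    "Age": [22.0, 38.0, 26.0],
    "SibSp": [1, 1, 0],
    "Parch": [0, 0, 0],
    "Ticket": ["t1", "t2", "t3"],
    "Fare": [7.25, 71.2833, 7.925],
    "Cabin": [None, "C85", None],
    "Embarked": ["S", "C", "S"],
})

label_name = "Survived"
# Assumption: identifier-like and sparse columns (PassengerId, Name, Ticket, Cabin) are dropped.
feature_cols = ["Pclass", "Sex", "Age", "SibSp", "Parch", "Fare", "Embarked"]

X = df[feature_cols].copy()  # feature matrix
y = df[label_name].copy()    # label vector

print(X.shape, y.shape)
```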
# 2.3 Feature encoding: apply one encoding step to all user-specified categorical features
cat_cols2encode
encoder_cols 9 ['Pclass_1', 'Pclass_2', 'Pclass_3', 'Sex_female', 'Sex_male', 'Embarked_C', 'Embarked_Q', 'Embarked_S', 'Embarked_nan']
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 13 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Age 891 non-null float64
1 SibSp 891 non-null int64
2 Parch 891 non-null int64
3 Fare 891 non-null float64
4 Pclass_1 891 non-null float64
5 Pclass_2 891 non-null float64
6 Pclass_3 891 non-null float64
7 Sex_female 891 non-null float64
8 Sex_male 891 non-null float64
9 Embarked_C 891 non-null float64
10 Embarked_Q 891 non-null float64
11 Embarked_S 891 non-null float64
12 Embarked_nan 891 non-null float64
dtypes: float64(11), int64(2)
memory usage: 90.6 KB
# 2.4 Split the dataset
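The split ratio and random seed are not shown in this section. The sketch below uses `train_test_split` with `test_size=0.3`, `random_state=42`, and stratification on the label. All three are assumptions, and the data is synthetic.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the encoded features and the binary label.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))
y = np.array([0, 1] * 50)

# Assumption: hold out 30% of rows and keep the class ratio equal in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

print(X_train.shape, X_test.shape)
```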
# 2.5 Feature selection
# T2. Wrapper method
# T2.4 Permutation feature importance with a random forest (PFI_RF)
………………Wrapper………………
feature_importances_: [ 0.06268657 -0.00149254 0.00746269 0.02238806 -0.00820896 -0.00671642
0.01940299 0.07238806 0.03432836 -0.00447761 -0.00074627 0.00671642
0. ]
feature_importances_std_: [0.02399692 0.00606272 0.00235991 0.01454745 0.00641965 0.00279228
0.01039432 0.0185219 0.0083101 0.00597015 0.00279228 0.00435146
0. ]
features features_weight feature_importances_std_
7 Sex_female 0.072388 0.018522
0 Age 0.062687 0.023997
8 Sex_male 0.034328 0.008310
3 Fare 0.022388 0.014547
6 Pclass_3 0.019403 0.010394
2 Parch 0.007463 0.002360
11 Embarked_S 0.006716 0.004351
12 Embarked_nan 0.000000 0.000000
10 Embarked_Q -0.000746 0.002792
1 SibSp -0.001493 0.006063
9 Embarked_C -0.004478 0.005970
5 Pclass_2 -0.006716 0.002792
4 Pclass_1 -0.008209 0.006420
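A table like the one above can be produced with `sklearn.inspection.permutation_importance` on a fitted random forest. The sketch below runs on synthetic data, and the hyperparameters, including `n_repeats=5`, are assumptions. The column names `features_weight` and `feature_importances_std_` mirror the output above. Each weight is the mean drop in score when one column is shuffled, so a negative weight means shuffling happened to raise the score slightly. Values within about one standard deviation of zero are best read as noise.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic stand-in: only "signal" determines the label, "noise" does not.
rng = np.random.default_rng(0)
X = pd.DataFrame({
    "signal": rng.normal(size=400),
    "noise": rng.normal(size=400),
})
y = (X["signal"] > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_train, y_train)

# Shuffle each column n_repeats times on held-out data and record the score drop.
result = permutation_importance(
    model, X_test, y_test, n_repeats=5, random_state=0
)

pfi = pd.DataFrame({
    "features": X.columns,
    "features_weight": result.importances_mean,
    "feature_importances_std_": result.importances_std,
}).sort_values("features_weight", ascending=False)

print(pfi)
```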
# T3. Embedded method
# T1. Feature importance evaluation
………………Embedded………………
feature importances
0 Age 0.247548
3 Fare 0.231313
7 Sex_female 0.159077
8 Sex_male 0.140638
6 Pclass_3 0.050483
2 Parch 0.042369
1 SibSp 0.041065
4 Pclass_1 0.038140
5 Pclass_2 0.019780
11 Embarked_S 0.012757
9 Embarked_C 0.009893
10 Embarked_Q 0.006751
12 Embarked_nan 0.000186
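The ranking above can be sketched with impurity-based importances from a tree ensemble. This assumes that `ETC` stands for scikit-learn's `ExtraTreesClassifier`, which the section does not confirm. The synthetic data and the above-the-mean selection rule are illustrative assumptions as well.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import ExtraTreesClassifier

# Synthetic stand-in: only "signal" determines the label.
rng = np.random.default_rng(0)
X = pd.DataFrame({
    "signal": rng.normal(size=400),
    "noise_a": rng.normal(size=400),
    "noise_b": rng.normal(size=400),
})
y = (X["signal"] > 0).astype(int)

etc = ExtraTreesClassifier(n_estimators=200, random_state=0)
etc.fit(X, y)

# Impurity-based importances are non-negative and sum to 1 across features.
embedded = pd.DataFrame({
    "feature": X.columns,
    "importances": etc.feature_importances_,
}).sort_values("importances", ascending=False)

# Assumption: keep features whose importance exceeds the mean importance.
threshold = embedded["importances"].mean()
selected = embedded.loc[embedded["importances"] > threshold, "feature"].tolist()

print(embedded)
print("selected:", selected)
```

Unlike permutation importance, these scores are computed on the training data and cannot be negative, which is one reason the two rankings in this article differ in detail (for example, `Fare` ranks second here but fourth under PFI).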