Feature Engineering Practice (Iris and Titanic)


Feature Engineering

  • Feature ranking: use recursive feature elimination (RFE) to rank the features of the iris dataset

Import the required libraries:

from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import load_iris
import warnings
warnings.filterwarnings("ignore")

Load the data: the iris dataset

iris = load_iris()
selector = RFE(estimator=LogisticRegression(), n_features_to_select=1).fit(iris.data, iris.target)
data = selector.transform(iris.data)
print(data[0:5])
print(selector.ranking_)
[[1.4]
 [1.4]
 [1.3]
 [1.5]
 [1.4]]
[4 3 1 2]
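Since the ranking is positional, pairing it with iris.feature_names makes it easier to read (a small readability addition):

for name, rank in zip(iris.feature_names, selector.ranking_):
    print(f'{name}: rank {rank}')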

Summary:

RFE works by repeated training with a base model: after each round, the features with the smallest weight coefficients are eliminated, and the next round trains on the remaining feature set, until the requested number of features is left. Here sklearn's RFE is used with a logistic-regression base model; n_features_to_select=1 is chosen so that the resulting ranking assigns a distinct rank to every feature.

Drawback: it requires many rounds of model training, so the time and compute cost is high.
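A cheaper alternative, sketched here for contrast (not part of the original experiment), is to rank features from a single fitted model instead of retraining repeatedly, e.g. by the magnitude of the logistic-regression coefficients:

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import load_iris

iris = load_iris()
# one fit only; rank features by the mean absolute coefficient across the three classes
model = LogisticRegression(max_iter=1000).fit(iris.data, iris.target)
importance = np.abs(model.coef_).mean(axis=0)
for i in importance.argsort()[::-1]:  # most important first
    print(iris.feature_names[i], importance[i])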

Data Visualization

  • Plot the function y = x^3 + 10 with matplotlib
import matplotlib.pyplot as plt
import numpy as np
x = np.linspace(-2,2,50)
y = x ** 3 + 10
plt.plot(x,y)

[Figure: curve of y = x^3 + 10]

  • Draw a scatter plot
X = np.linspace(-2, 2, 20)
Y = 2 * X + 1
plt.scatter(X, Y)

[Figure: scatter plot of Y = 2X + 1]

Using the iris dataset

  • Analyze the statistics and condition of the four features
  • Predict the flower species with a k-nearest-neighbors (KNN) classifier, k = 5, and try to improve the prediction score with feature expansion or feature selection

Load the data and inspect its statistics and condition

iris = load_iris()
X = iris.data    # feature matrix
y = iris.target  # class labels
print(f'Dataset description: \n {iris.DESCR}')
Dataset description: 
 .. _iris_dataset:

Iris plants dataset
--------------------

**Data Set Characteristics:**

    :Number of Instances: 150 (50 in each of three classes)
    :Number of Attributes: 4 numeric, predictive attributes and the class
    :Attribute Information:
        - sepal length in cm
        - sepal width in cm
        - petal length in cm
        - petal width in cm
        - class:
                - Iris-Setosa
                - Iris-Versicolour
                - Iris-Virginica
                
    :Summary Statistics:

    ============== ==== ==== ======= ===== ====================
                    Min  Max   Mean    SD   Class Correlation
    ============== ==== ==== ======= ===== ====================
    sepal length:   4.3  7.9   5.84   0.83    0.7826
    sepal width:    2.0  4.4   3.05   0.43   -0.4194
    petal length:   1.0  6.9   3.76   1.76    0.9490  (high!)
    petal width:    0.1  2.5   1.20   0.76    0.9565  (high!)
    ============== ==== ==== ======= ===== ====================

    :Missing Attribute Values: None
    :Class Distribution: 33.3% for each of 3 classes.

Data analysis: the dataset has 150 samples, 4 features, and three classes (50 samples each), with no missing values. From the summary statistics, the features "petal length" and "petal width" correlate strongly with the class.
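As a quick sanity check (my addition, not in the original write-up), the class correlations quoted in the summary can be reproduced with np.corrcoef:

for i, name in enumerate(iris.feature_names):
    # Pearson correlation between each feature and the class label
    print(f'{name}: {np.corrcoef(X[:, i], y)[0, 1]:.4f}')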

Now train and test. We use the KNN algorithm with n_neighbors = 5; because the dataset is small, cross-validation is used so that the reported accuracy is more trustworthy.

from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

knn = KNeighborsClassifier(n_neighbors=5)
scores = cross_val_score(knn, X, y, cv=5, scoring='accuracy')
print('Accuracy: {:.4f}%'.format(np.mean(scores)*100))
Accuracy: 97.3333%

To push the accuracy further, we try feature selection and feature expansion:

  • Feature selection: the Feature Engineering section above already gave the feature ranking [4 3 1 2], so let's try training on just the last three features
knn = KNeighborsClassifier(n_neighbors=5)
scores = cross_val_score(knn, X[:,-3:], y, cv=5, scoring='accuracy')
print('Accuracy: {:.4f}%'.format(np.mean(scores)*100))
Accuracy: 96.6667%

As it turns out, this is worse than training with all four features.

  • Feature expansion: since the last two features are the most important, add a new feature equal to the mean of the last two
import pandas as pd

df = pd.DataFrame(X, columns=['E1','E2','E3','E4'])
# new feature: mean of the last two columns
df['E5'] = (df['E3'] + df['E4']) / 2
X_1 = np.array(df)
knn = KNeighborsClassifier(n_neighbors=5)
scores = cross_val_score(knn, X_1, y, cv=5, scoring='accuracy')
print('Accuracy: {:.4f}%'.format(np.mean(scores)*100))
Accuracy: 97.3333%

Adding this feature leaves the accuracy essentially unchanged compared with the original features; let's also try adding the mean of the last three columns.

df = pd.DataFrame(X, columns=['E1','E2','E3','E4'])
df['E5'] = (df['E3'] + df['E4']) / 2             # mean of the last two features
df['E6'] = (df['E2'] + df['E3'] + df['E4']) / 3  # mean of the last three features
X_2 = np.array(df)
knn = KNeighborsClassifier(n_neighbors=5)
scores = cross_val_score(knn, X_2, y, cv=5, scoring='accuracy')
print('Accuracy: {:.4f}%'.format(np.mean(scores)*100))
Accuracy: 96.6667%

Conclusion: adding yet more features actually hurt performance, which suggests the original features already do well. Still, the number of possible feature selections and expansions is essentially unlimited, so a better combination may well exist.
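Because hand-made expansions are endless, one systematic option, sketched here as a possible follow-up rather than something tried above, is sklearn's PolynomialFeatures, which generates all degree-2 feature products at once:

from sklearn.preprocessing import PolynomialFeatures

X_poly = PolynomialFeatures(degree=2, include_bias=False).fit_transform(X)
knn = KNeighborsClassifier(n_neighbors=5)
scores = cross_val_score(knn, X_poly, y, cv=5, scoring='accuracy')
print('Accuracy: {:.4f}%'.format(np.mean(scores) * 100))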

Titanic Survival Prediction

Problem statement:

This project applies supervised learning to analyze the probability of surviving the Titanic disaster, studying how attributes such as sex and age affect that probability.

Data Loading and Analysis

PassengerId => passenger ID
Survived => survival (1 = survived, 0 = did not survive)
Pclass => passenger class (1st/2nd/3rd)
Name => passenger name
Sex => sex
Age => age
SibSp => number of siblings/spouses aboard
Parch => number of parents/children aboard
Ticket => ticket number
Fare => fare
Cabin => cabin
Embarked => port of embarkation

data_train = pd.read_csv("data/train.csv")
print('Columns:', '\n', data_train.columns)
print('Training set column types and null counts:')
print(data_train.info())
data_test = pd.read_csv("data/test.csv")
print('Test set column types and null counts:')
print(data_test.info())
Columns: 
 Index(['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp',
       'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked'],
      dtype='object')
Training set column types and null counts:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId    891 non-null int64
Survived       891 non-null int64
Pclass         891 non-null int64
Name           891 non-null object
Sex            891 non-null object
Age            714 non-null float64
SibSp          891 non-null int64
Parch          891 non-null int64
Ticket         891 non-null object
Fare           891 non-null float64
Cabin          204 non-null object
Embarked       889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.6+ KB
Test set column types and null counts:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 418 entries, 0 to 417
Data columns (total 11 columns):
PassengerId    418 non-null int64
Pclass         418 non-null int64
Name           418 non-null object
Sex            418 non-null object
Age            332 non-null float64
SibSp          418 non-null int64
Parch          418 non-null int64
Ticket         418 non-null object
Fare           417 non-null float64
Cabin          91 non-null object
Embarked       418 non-null object
dtypes: float64(2), int64(4), object(5)
memory usage: 36.0+ KB

Data condition: the training set has 891 rows, with missing values in Age and Cabin; the test set has 418 rows, and Age and Cabin again contain missing values.

Data Type Analysis

Next we analyze the data types. Prediction requires numeric computation, and object-typed columns cannot be computed on directly, so we first examine the object columns in detail.

data = data_train.values
print('Training set:')
for idx, itr in enumerate(data_train.dtypes):
    if itr == object:
        print("Distinct values in column %d:" % idx, len(set(data[:, idx])))
data = data_test.values
print('Test set:')
for idx, itr in enumerate(data_test.dtypes):
    if itr == object:
        print("Distinct values in column %d:" % idx, len(set(data[:, idx])))
Training set:
Distinct values in column 3: 891
Distinct values in column 4: 2
Distinct values in column 8: 681
Distinct values in column 10: 148
Distinct values in column 11: 4
Test set:
Distinct values in column 2: 418
Distinct values in column 3: 2
Distinct values in column 7: 363
Distinct values in column 9: 77
Distinct values in column 10: 3

Clearly some columns contain a very large number of distinct strings. Such columns are unsuitable for this classification setup, so we remove them (some algorithms can tolerate data like this, but not the ones used here).

Therefore the Name and Ticket features should be dropped. PassengerId is merely an identifier and contributes nothing useful to the prediction, so it is deleted as well.

data_train.drop(['PassengerId','Name', 'Ticket'], axis=1, inplace=True)
data_test.drop(['PassengerId','Name', 'Ticket'], axis=1, inplace=True)
data_train.head()
   Survived  Pclass     Sex   Age  SibSp  Parch     Fare Cabin Embarked
0         0       3    male  22.0      1      0   7.2500   NaN        S
1         1       1  female  38.0      1      0  71.2833   C85        C
2         1       3  female  26.0      0      0   7.9250   NaN        S
3         1       1  female  35.0      1      0  53.1000  C123        S
4         0       3    male  35.0      0      0   8.0500   NaN        S
data_test.head()
   Pclass     Sex   Age  SibSp  Parch     Fare Cabin Embarked
0       3    male  34.5      0      0   7.8292   NaN        Q
1       3  female  47.0      1      0   7.0000   NaN        S
2       2    male  62.0      0      0   9.6875   NaN        Q
3       3    male  27.0      0      0   8.6625   NaN        S
4       3  female  22.0      1      1  12.2875   NaN        S

Handling Missing Values

As established above, the Age and Cabin features contain many missing values and need treatment.

For the missing Age values, we build a model: train it on the rows whose age is known and predict the unknown ages. Age serves as the label, and 'Fare', 'Parch', 'SibSp' and 'Pclass' serve as the inputs.

For the missing Cabin values, too much is missing, and everyday experience suggests the feature is only weakly related to the others, so the model-based imputation above does not apply; instead we simply collapse Cabin into two categories: known and unknown.

def set_missing_ages(df):
    # random-forest regressor for age imputation
    from sklearn.ensemble import RandomForestRegressor
    # take the numeric features relevant to age
    age_df = df[['Age', 'Fare', 'Parch', 'SibSp', 'Pclass']]

    # split passengers into known-age and unknown-age groups
    known_age = age_df[age_df.Age.notnull()].values
    unknown_age = age_df[age_df.Age.isnull()].values

    # y is the target: age
    y = known_age[:, 0]

    # X holds the feature values
    X = known_age[:, 1:]

    # fit the RandomForestRegressor
    rfr = RandomForestRegressor(random_state=0, n_estimators=2000, n_jobs=-1)
    rfr.fit(X, y)

    # predict the unknown ages with the fitted model
    predictedAges = rfr.predict(unknown_age[:, 1:])

    # fill the missing values with the predictions
    df.loc[(df.Age.isnull()), 'Age'] = predictedAges

    return df


def set_Cabin_type(df):
    df.loc[(df.Cabin.notnull()), 'Cabin'] = "Yes"
    df.loc[(df.Cabin.isnull()), 'Cabin'] = "No"
    return df

A small hiccup occurred here: imputing missing ages on the training set went smoothly, but on the test set it raised "Input contains NaN, infinity or a value too large for dtype('float32').". Some feature still contained a null; looking back at the test-set info, "Fare 417 non-null float64" shows one missing Fare value, so we fill it with the mean fare.

data_train = set_missing_ages(data_train)
data_train = set_Cabin_type(data_train)
data_test.loc[(data_test.Fare.isnull()),'Fare'] = data_test.Fare.mean() 
data_test = set_Cabin_type(data_test)
data_test = set_missing_ages(data_test)
data_train.to_csv("data_noNan.csv")
data_train.head()
   Survived  Pclass     Sex   Age  SibSp  Parch     Fare Cabin Embarked
0         0       3    male  22.0      1      0   7.2500    No        S
1         1       1  female  38.0      1      0  71.2833   Yes        C
2         1       3  female  26.0      0      0   7.9250    No        S
3         1       1  female  35.0      1      0  53.1000   Yes        S
4         0       3    male  35.0      0      0   8.0500    No        S
data_test.head()
   Pclass     Sex   Age  SibSp  Parch     Fare Cabin Embarked
0       3    male  34.5      0      0   7.8292    No        Q
1       3  female  47.0      1      0   7.0000    No        S
2       2    male  62.0      0      0   9.6875    No        Q
3       3    male  27.0      0      0   8.6625    No        S
4       3  female  22.0      1      1  12.2875    No        S

Discretizing/Factorizing the Categorical Features

When building a model (logistic regression, for instance), the input features must all be numeric,

so we first discretize/factorize the categorical features.

Take Cabin as an example: originally a single attribute with values ['Yes', 'No'], it is expanded into two attributes, 'Cabin_Yes' and 'Cabin_No'.

A passenger whose Cabin was Yes gets 1 under 'Cabin_Yes' and 0 under 'Cabin_No';

a passenger whose Cabin was No gets 0 under 'Cabin_Yes' and 1 under 'Cabin_No'.

We use pandas get_dummies to do this and concatenate the result onto the original data_train, as shown below.

Categorical features are generally handled with dummy variables, also called one-hot encoding, via pandas.get_dummies() or sklearn's preprocessing.OneHotEncoder(); this example uses pandas get_dummies().

dummies_Cabin = pd.get_dummies(data_train['Cabin'], prefix= 'Cabin')
dummies_Embarked = pd.get_dummies(data_train['Embarked'], prefix= 'Embarked')
dummies_Sex = pd.get_dummies(data_train['Sex'], prefix= 'Sex')
dummies_Pclass = pd.get_dummies(data_train['Pclass'], prefix= 'Pclass')
label_train = pd.get_dummies(data_train['Survived'], prefix= 'Survived')
df = pd.concat([data_train, dummies_Cabin, dummies_Embarked, dummies_Sex, dummies_Pclass], axis=1)
df.drop(['Survived','Pclass','Sex','Cabin', 'Embarked'], axis= 1, inplace= True)
label_train.head()
   Survived_0  Survived_1
0           1           0
1           0           1
2           0           1
3           0           1
4           1           0
data_train_fixed = df
data_train_fixed.head()
    Age  SibSp  Parch     Fare  Cabin_No  Cabin_Yes  Embarked_C  Embarked_Q  Embarked_S  Sex_female  Sex_male  Pclass_1  Pclass_2  Pclass_3
0  22.0      1      0   7.2500         1          0           0           0           1           0         1         0         0         1
1  38.0      1      0  71.2833         0          1           1           0           0           1         0         1         0         0
2  26.0      0      0   7.9250         1          0           0           0           1           1         0         0         0         1
3  35.0      1      0  53.1000         0          1           0           0           1           1         0         1         0         0
4  35.0      0      0   8.0500         1          0           0           0           1           0         1         0         0         1
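For comparison, the OneHotEncoder route mentioned above might look roughly like this (a sketch; the dense-output keyword is sparse=False in older sklearn and sparse_output=False in newer versions, and missing values are dropped here because older encoders reject NaN):

from sklearn.preprocessing import OneHotEncoder

enc = OneHotEncoder()  # pass sparse_output=False (or sparse=False) for a dense array
encoded = enc.fit_transform(data_train[['Sex', 'Embarked']].dropna())
print(enc.categories_)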

Feature Scaling

data_train_fixed.describe() shows:

Apart from Age, SibSp, Parch and Fare, every feature already takes the values 0 or 1. To put everything on a uniform scale, these four numeric features need to be standardized.

data_train_fixed.describe()
              Age       SibSp       Parch        Fare    Cabin_No   Cabin_Yes  Embarked_C  Embarked_Q  Embarked_S  Sex_female    Sex_male    Pclass_1    Pclass_2    Pclass_3
count  891.000000  891.000000  891.000000  891.000000  891.000000  891.000000  891.000000  891.000000  891.000000  891.000000  891.000000  891.000000  891.000000  891.000000
mean    29.647248    0.523008    0.381594   32.204208    0.771044    0.228956    0.188552    0.086420    0.722783    0.352413    0.647587    0.242424    0.206510    0.551066
std     13.629879    1.102743    0.806057   49.693429    0.420397    0.420397    0.391372    0.281141    0.447876    0.477990    0.477990    0.428790    0.405028    0.497665
min      0.420000    0.000000    0.000000    0.000000    0.000000    0.000000    0.000000    0.000000    0.000000    0.000000    0.000000    0.000000    0.000000    0.000000
25%     21.000000    0.000000    0.000000    7.910400    1.000000    0.000000    0.000000    0.000000    0.000000    0.000000    0.000000    0.000000    0.000000    0.000000
50%     29.000000    0.000000    0.000000   14.454200    1.000000    0.000000    0.000000    0.000000    1.000000    0.000000    1.000000    0.000000    0.000000    1.000000
75%     36.000000    1.000000    0.000000   31.000000    1.000000    0.000000    0.000000    0.000000    1.000000    1.000000    1.000000    0.000000    0.000000    1.000000
max     80.000000    8.000000    6.000000  512.329200    1.000000    1.000000    1.000000    1.000000    1.000000    1.000000    1.000000    1.000000    1.000000    1.000000

Next, let's look at how these four variables are distributed:

y = np.array(df['Age'])
plt.hist(y, bins=30)
plt.xlabel('Age')
plt.ylabel('Count')
plt.title('Age histogram')
plt.show()

[Figure: Age histogram]

y = np.array(df['SibSp'])
plt.hist(y, bins=8)
plt.xlabel('Siblings/spouses aboard')
plt.ylabel('Count')
plt.title('SibSp histogram')
plt.show()

[Figure: SibSp histogram]

y = np.array(df['Parch'])
plt.hist(y, bins=6)
plt.xlabel('Parents/children aboard')
plt.ylabel('Count')
plt.title('Parch histogram')
plt.show()

[Figure: Parch histogram]

y = np.array(df['Fare'])
plt.hist(y, bins=30)
plt.xlabel('Fare')
plt.ylabel('Count')
plt.title('Fare histogram')
plt.show()

[Figure: Fare histogram]

Age looks roughly normally distributed, while the other three are skewed, count-like distributions closer to Poisson.

Following the reference blog, z-score standardization is used here: z = (x − μ) / σ. It is implemented with StandardScaler from sklearn's preprocessing module.

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
scaler.fit(data_train_fixed[['Age','Fare','SibSp','Parch']])
temp = scaler.transform(data_train_fixed[['Age','Fare','SibSp','Parch']])
temp = pd.DataFrame(temp, columns=['Age_scalered','Fare_scalered','SibSp_scalered','Parch_scalered'])
temp.head()
   Age_scalered  Fare_scalered  SibSp_scalered  Parch_scalered
0     -0.561380      -0.502445        0.432793       -0.473674
1      0.613171       0.786845        0.432793       -0.473674
2     -0.267742      -0.488854       -0.474545       -0.473674
3      0.392942       0.420730        0.432793       -0.473674
4      0.392942      -0.486337       -0.474545       -0.473674
data_train_fixed.drop(['Age','Fare','SibSp','Parch'], axis= 1, inplace= True)
data_train_fixed.head()
   Cabin_No  Cabin_Yes  Embarked_C  Embarked_Q  Embarked_S  Sex_female  Sex_male  Pclass_1  Pclass_2  Pclass_3
0         1          0           0           0           1           0         1         0         0         1
1         0          1           1           0           0           1         0         1         0         0
2         1          0           0           0           1           1         0         0         0         1
3         0          1           0           0           1           1         0         1         0         0
4         1          0           0           0           1           0         1         0         0         1
data_train_scalered = pd.concat([data_train_fixed, temp], axis=1)
data_train_scalered.head()
   Cabin_No  Cabin_Yes  Embarked_C  Embarked_Q  Embarked_S  Sex_female  Sex_male  Pclass_1  Pclass_2  Pclass_3  Age_scalered  Fare_scalered  SibSp_scalered  Parch_scalered
0         1          0           0           0           1           0         1         0         0         1     -0.561380      -0.502445        0.432793       -0.473674
1         0          1           1           0           0           1         0         1         0         0      0.613171       0.786845        0.432793       -0.473674
2         1          0           0           0           1           1         0         0         0         1     -0.267742      -0.488854       -0.474545       -0.473674
3         0          1           0           0           1           1         0         1         0         0      0.392942       0.420730        0.432793       -0.473674
4         1          0           0           0           1           0         1         0         0         1      0.392942      -0.486337       -0.474545       -0.473674

Prediction Models

  • KNN

    First, predict with KNN (n_neighbors = 5, with cross-validation)
from sklearn.model_selection import learning_curve

# Use sklearn's learning_curve to get the training and cross-validation scores, then plot the learning curve with matplotlib
def plot_learning_curve(estimator, title, X, y, ylim=None, cv=None, n_jobs=1,
                        train_sizes=np.linspace(.05, 1., 20), verbose=0, plot=True):
    """
    画出data在某模型上的learning curve.
    参数解释
    ----------
    estimator : 你用的分类器。
    title : 表格的标题。
    X : 输入的feature,numpy类型
    y : 输入的target vector
    ylim : tuple格式的(ymin, ymax), 设定图像中纵坐标的最低点和最高点
    cv : 做cross-validation的时候,数据分成的份数,其中一份作为cv集,其余n-1份作为training(默认为3份)
    n_jobs : 并行的的任务数(默认1)
    """
    train_sizes, train_scores, test_scores = learning_curve(
        estimator, X, y, cv=cv, n_jobs=n_jobs, train_sizes=train_sizes, verbose=verbose)

    train_scores_mean = np.mean(train_scores, axis=1)
    train_scores_std = np.std(train_scores, axis=1)
    test_scores_mean = np.mean(test_scores, axis=1)
    test_scores_std = np.std(test_scores, axis=1)
    if plot:
        plt.figure(figsize=(12,5), dpi=600)
        plt.title(title)
        if ylim is not None:
            plt.ylim(*ylim)
        plt.xlabel(u"训练样本数training samples num")
        plt.ylabel(u"得分score")
        plt.gca().invert_yaxis()
        plt.grid()

        plt.fill_between(train_sizes, train_scores_mean - train_scores_std, train_scores_mean + train_scores_std,
                         alpha=0.1, color="b")
        plt.fill_between(train_sizes, test_scores_mean - test_scores_std, test_scores_mean + test_scores_std,
                         alpha=0.1, color="r")
        plt.plot(train_sizes, train_scores_mean, 'o-', color="b", label="score on the training set")
        plt.plot(train_sizes, test_scores_mean, 'o-', color="r", label="score on the validation set")

        plt.legend(loc=0)

        plt.draw()
        plt.gca().invert_yaxis()
        plt.show()

    midpoint = ((train_scores_mean[-1] + train_scores_std[-1]) + (test_scores_mean[-1] - test_scores_std[-1])) / 2
    diff = (train_scores_mean[-1] + train_scores_std[-1]) - (test_scores_mean[-1] - test_scores_std[-1])
    return midpoint, diff
from sklearn.neighbors import KNeighborsClassifier

# convert the data and labels to arrays for computation
X = data_train_scalered.values
y = label_train.values
knn = KNeighborsClassifier(n_neighbors=5)
midpoint, diff = plot_learning_curve(knn, "Learning curve", X, y, cv=5)
print('Mean of final train/validation accuracy: {:.2f}%, max gap: {:.2f}%'.format(midpoint*100, diff*100))

Mean of final train/validation accuracy: 82.37%, max gap: 8.94%

Conclusion: the KNN model reaches an average final accuracy of 82.37% with a maximum train/validation gap of 8.94%, which indicates some overfitting.
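One way to probe this overfitting further (a sketch of a possible next step, not part of the original write-up) is to tune n_neighbors with a grid search and check whether a larger k narrows the gap:

from sklearn.model_selection import GridSearchCV

search = GridSearchCV(KNeighborsClassifier(),
                      param_grid={'n_neighbors': list(range(3, 16))},
                      cv=5, scoring='accuracy')
search.fit(X, y[:, 1])  # use the single Survived_1 column as a 0/1 label
print(search.best_params_, search.best_score_)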

  • Decision tree

    Using DecisionTreeClassifier from sklearn.tree
from sklearn import tree

clf = tree.DecisionTreeClassifier(criterion="entropy", max_depth=4)
midpoint, diff = plot_learning_curve(clf, "Learning curve", X, y, cv=5)
print('Mean of final train/validation accuracy: {:.2f}%, max gap: {:.2f}%'.format(midpoint*100, diff*100))

[Figure: decision-tree learning curve]

Mean of final train/validation accuracy: 80.78%, max gap: 8.52%

Conclusion: compared with KNN, the decision tree does slightly worse, with accuracy about 1.6 percentage points lower, while the gap barely changes, meaning the overfitting persists.

Extension

The data preprocessing above did not consider label balance, so here we balance the labels and test again.

  • First, check whether the labels are balanced
data_train_noNan = pd.read_csv('data_noNan.csv')
print("Class balance:\n", data_train_noNan['Survived'].value_counts())
Class balance:
 0    549
1    342
Name: Survived, dtype: int64
data_train_noNan.drop(['Unnamed: 0'], axis= 1, inplace= True)
data_train_noNan.head()
   Survived  Pclass     Sex   Age  SibSp  Parch     Fare Cabin Embarked
0         0       3    male  22.0      1      0   7.2500    No        S
1         1       1  female  38.0      1      0  71.2833   Yes        C
2         1       3  female  26.0      0      0   7.9250    No        S
3         1       1  female  35.0      1      0  53.1000   Yes        S
4         0       3    male  35.0      0      0   8.0500    No        S
  • The labels are clearly imbalanced; to predict more accurately, they should be made as balanced as possible

    Since the dataset is small, upsampling (or SMOTE) is the right tool to synthesize extra samples resembling the minority class; a SMOTE sketch follows the upsampling output below
from sklearn.utils import resample

df_majority = data_train_noNan[data_train_noNan.Survived==0]
df_minority = data_train_noNan[data_train_noNan.Survived==1]

# Upsample minority class
df_minority_upsampled = resample(df_minority, 
                                 replace=True,     # sample with replacement
                                 n_samples=(len(df_majority) - len(df_minority)),    # to match majority class
                                 random_state=123) # reproducible results

# Combine majority class with upsampled minority class
df_upsampled = pd.concat([df_majority, df_minority_upsampled, df_minority],ignore_index=True)
print(df_upsampled.columns) 
print(df_upsampled['Survived'].value_counts())
Index(['Survived', 'Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare', 'Cabin',
       'Embarked'],
      dtype='object')
1    549
0    549
Name: Survived, dtype: int64
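The bullet above also mentions SMOTE. With the third-party imbalanced-learn package (an assumption: that it is installed; SMOTE also requires numeric input, so it would have to run after the dummy-encoding step), a minimal sketch looks like:

import pandas as pd
from imblearn.over_sampling import SMOTE

# X_num and y_num are hypothetical names for the already-encoded numeric
# feature matrix and the 0/1 Survived labels
X_res, y_res = SMOTE(random_state=123).fit_resample(X_num, y_num)
print(pd.Series(y_res).value_counts())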

After balancing the labels, check the new data for missing values again:

df_upsampled.count()
Survived    1098
Pclass      1098
Sex         1098
Age         1098
SibSp       1098
Parch       1098
Fare        1098
Cabin       1098
Embarked    1095
dtype: int64

Finding: Embarked contains 3 missing values; since there are so few, dropping those rows outright will hardly affect the result.

data_train_balance = df_upsampled.loc[(df_upsampled.Embarked.notnull()),:]
data_train_balance.reset_index(drop=True, inplace=True)
data_train_balance.count()
Survived    1095
Pclass      1095
Sex         1095
Age         1095
SibSp       1095
Parch       1095
Fare        1095
Cabin       1095
Embarked    1095
dtype: int64
  • Discretize/factorize the categorical features
dummies_Cabin = pd.get_dummies(data_train_balance['Cabin'], prefix= 'Cabin')
dummies_Embarked = pd.get_dummies(data_train_balance['Embarked'], prefix= 'Embarked')
dummies_Sex = pd.get_dummies(data_train_balance['Sex'], prefix= 'Sex')
dummies_Pclass = pd.get_dummies(data_train_balance['Pclass'], prefix= 'Pclass')
label_train = pd.get_dummies(data_train_balance['Survived'], prefix= 'Survived')
df = pd.concat([data_train_balance, dummies_Cabin, dummies_Embarked, dummies_Sex, dummies_Pclass], axis=1)
df.drop(['Survived','Pclass','Sex','Cabin', 'Embarked'], axis= 1, inplace= True)
label_train.head()
   Survived_0  Survived_1
0           1           0
1           1           0
2           1           0
3           1           0
4           1           0
  • Rerun the model training from above
# standardize the numeric features
data_train_balance  = df
scaler.fit(data_train_balance[['Age','Fare','SibSp','Parch']])
temp = scaler.transform(data_train_balance[['Age','Fare','SibSp','Parch']])
temp = pd.DataFrame(temp, columns=['Age_scalered','Fare_scalered','SibSp_scalered','Parch_scalered'])
data_train_balance.drop(['Age','Fare','SibSp','Parch'], axis= 1, inplace= True)
data_train_balanced = pd.concat([data_train_balance, temp], axis=1)
  1. KNN
X = data_train_balanced.values
y = label_train.values
knn = KNeighborsClassifier(n_neighbors=5)
midpoint, diff = plot_learning_curve(knn, "Learning curve", X, y, cv=5)
print('Mean of final train/validation accuracy: {:.2f}%, max gap: {:.2f}%'.format(midpoint*100, diff*100))

[Figure: KNN learning curve on the balanced data]

Mean of final train/validation accuracy: 77.86%, max gap: 18.55%
  2. Decision tree
X = data_train_balanced.values
y = label_train.values
clf = tree.DecisionTreeClassifier(criterion="entropy", max_depth=4)
midpoint, diff = plot_learning_curve(clf, "Learning curve", X, y, cv=5)
print('Mean of final train/validation accuracy: {:.2f}%, max gap: {:.2f}%'.format(midpoint*100, diff*100))

[Figure: decision-tree learning curve on the balanced data]

Mean of final train/validation accuracy: 73.46%, max gap: 20.12%

Conclusion: balancing the labels did not pay off. The synthetic samples lowered the accuracy considerably, and the maximum gap suffered the most, showing that the overfitting got worse.


Source: blog.csdn.net/weixin_43870329/article/details/106257841