Chapter 6 Practical Training: Building Models with scikit-learn


Practical Training 1: Using sklearn to Process the wine and wine_quality Datasets

1. Training Objectives

(1) Master the use of sklearn transformers.
(2) Master how to split data into training and test sets.
(3) Master PCA dimensionality reduction with sklearn.

2. Requirements

The wine and winequality datasets are two wine-related datasets. The wine dataset contains 178 records of wines of 3 classes from the same origin. Each feature corresponds to one chemical component of the wine, and all features are continuous. The origin of a wine can be inferred through chemical analysis.
The winequality dataset has 4898 observations, with 11 input features and one label. The classes contain unequal numbers of observations, and all features are continuous. The goal is to predict a wine's rating from its chemical components.

3. Implementation Approach and Steps

(1) Use the pandas library to read the wine and winequality datasets.

import pandas as pd
wine = pd.read_csv('./data/wine.csv')
wine_quality = pd.read_csv('./data/winequality.csv',sep=';')
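
As a quick sanity check against the record counts stated in the requirements (178 wine records; 4898 winequality observations with 11 features plus a label), the shapes of the two DataFrames can be printed. A minimal sketch, assuming the reads above succeeded:

print('wine shape:', wine.shape)                  # 178 rows: the Class label plus the chemical features
print('wine_quality shape:', wine_quality.shape)  # 4898 rows: 11 features plus the quality label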

(2) Separate the data and the labels of the wine and winequality datasets.

wine_data = wine.iloc[:,1:]  # all columns except the first are features
wine_target = wine['Class']  # the first column, Class, is the label
wine_quality_data = wine_quality.iloc[:,:-1]  # all columns except the last are features
wine_quality_target = wine_quality.iloc[:,-1]  # the last column is the label

(3) Split the wine and winequality datasets into training and test sets.

from sklearn.model_selection import train_test_split
wine_data_train,wine_data_test,\
wine_target_train,wine_target_test = \
train_test_split(wine_data,wine_target,test_size = 0.2,random_state=42)

wine_quality_data_train,wine_quality_data_test,\
wine_quality_target_train,wine_quality_target_test = \
train_test_split(wine_quality_data,wine_quality_target,test_size = 0.2,random_state=42)

(4) Standardize the wine and winequality datasets.

from sklearn.preprocessing import MinMaxScaler   # min-max (deviation) standardization
Scaler = MinMaxScaler().fit(wine_data_train)  # fit the scaling rule on the training set
## apply the rule to the training set
wine_trainScaler = Scaler.transform(wine_data_train)
## apply the rule to the test set
wine_testScaler = Scaler.transform(wine_data_test)

Scaler1 = MinMaxScaler().fit(wine_quality_data_train) 
wine_quality_trainScaler = Scaler1.transform(wine_quality_data_train)
wine_quality_testScaler = Scaler1.transform(wine_quality_data_test)
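
If z-score scaling is preferred over the min-max (deviation) standardization used above, sklearn's StandardScaler follows the same fit/transform pattern. A minimal alternative sketch (not part of the original solution), assuming the splits from step (3) are in scope; the names wine_trainStd and wine_testStd are only for illustration:

from sklearn.preprocessing import StandardScaler  # z-score standardization

stdScaler = StandardScaler().fit(wine_data_train)     # fit the scaling rule on the training set only
wine_trainStd = stdScaler.transform(wine_data_train)  # apply the rule to the training set
wine_testStd = stdScaler.transform(wine_data_test)    # apply the rule to the test set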

(5) Apply PCA dimensionality reduction to the wine and winequality datasets.

from sklearn.decomposition import PCA
pca = PCA(n_components=5).fit(wine_trainScaler)  ## fit the PCA rule on the training set
## apply the rule to the training set
wine_trainPca = pca.transform(wine_trainScaler)
## apply the rule to the test set
wine_testPca = pca.transform(wine_testScaler)

pca = PCA(n_components=5).fit(wine_quality_trainScaler)  ## fit the PCA rule on the training set
## apply the rule to the training set
wine_quality_trainPca = pca.transform(wine_quality_trainScaler)
## apply the rule to the test set
wine_quality_testPca = pca.transform(wine_quality_testScaler)
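
The choice of n_components=5 is not explained above; inspecting a fitted PCA object's explained_variance_ratio_ shows how much variance each retained component carries, which can guide that choice. A short sketch, assuming one of the PCA objects fitted above (e.g. the one fitted on wine_quality_trainScaler) is in scope:

import numpy as np

# variance explained by each of the 5 retained components, and the running total
print('Explained variance ratio:', pca.explained_variance_ratio_)
print('Cumulative variance retained:', np.cumsum(pca.explained_variance_ratio_))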

Practical Training 2: Building a K-Means Clustering Model on the wine Dataset

1. Training Objectives

(1) Understand the use of sklearn estimators.
(2) Master how to build a clustering model.
(3) Master how to evaluate a clustering model.

2. Requirements

The wines in the wine dataset fall into 3 classes in total. By clustering the wine data into 3 clusters, the wine classes can be separated.

3. Implementation Approach and Steps

(1) Using the wine data processed in Practical Training 1, build a K-Means model with 3 clusters.

from sklearn.cluster import KMeans  
# build and train the model on the standardized training set
kmeans = KMeans(n_clusters = 3,random_state=123).fit(wine_trainScaler) 
# clustering on the PCA-reduced training set performs poorly, so it is not used here
#kmeans = KMeans(n_clusters = 3,random_state=123).fit(wine_trainPca)
print('Constructed K-Means model:\n',kmeans)
Constructed K-Means model:
 KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=300,
       n_clusters=3, n_init=10, n_jobs=None, precompute_distances='auto',
       random_state=123, tol=0.0001, verbose=0)
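
Besides the training-set labels in kmeans.labels_, the fitted estimator exposes the learned cluster centres and can assign new samples to clusters. A brief sketch, assuming the model above and the standardized test set from Practical Training 1 are in scope:

# coordinates of the 3 learned cluster centres in the standardized feature space
print('Cluster centres:\n', kmeans.cluster_centers_)
# assign each standardized test sample to its nearest cluster
print('Test-set cluster labels:\n', kmeans.predict(wine_testScaler))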

(2) Compare the true labels with the cluster labels and compute the FMI.

from sklearn.metrics import fowlkes_mallows_score
# compare the true labels with the cluster labels
score = fowlkes_mallows_score(wine_target_train,kmeans.labels_)
print('FMI of the wine dataset: %f'%(score))
FMI of the wine dataset: 0.888732

(3) For cluster counts from 2 to 10, determine the optimal number of clusters.

for i in range(2,11):
    kmeans = KMeans(n_clusters = i,random_state = 123).fit(wine_trainScaler)
    score = fowlkes_mallows_score(wine_target_train,kmeans.labels_)
    print('FMI of the wine dataset with %d clusters: %f'%(i,score))
FMI of the wine dataset with 2 clusters: 0.637271
FMI of the wine dataset with 3 clusters: 0.888732
FMI of the wine dataset with 4 clusters: 0.822386
FMI of the wine dataset with 5 clusters: 0.718383
FMI of the wine dataset with 6 clusters: 0.684199
FMI of the wine dataset with 7 clusters: 0.612088
FMI of the wine dataset with 8 clusters: 0.577474
FMI of the wine dataset with 9 clusters: 0.536403
FMI of the wine dataset with 10 clusters: 0.546499

(4) Compute the model's silhouette coefficient, plot the silhouette coefficient curve, and determine the optimal number of clusters.

from sklearn.metrics import silhouette_score
import matplotlib.pyplot as plt
silhouetteScore = []
for i in range(2,11):
    # build and train the model
    kmeans = KMeans(n_clusters = i,random_state = 123).fit(wine)
    score = silhouette_score(wine,kmeans.labels_)
    silhouetteScore.append(score)
plt.figure(figsize=(10,6))
plt.plot(range(2,11),silhouetteScore,linewidth=1.5,linestyle='-')
plt.show()

[Figure: silhouette coefficient line chart for 2 to 10 clusters (output_19_0.png)]

(5) Compute the Calinski-Harabasz index and determine the optimal number of clusters.

from sklearn.metrics import calinski_harabasz_score
for i in range(2,11):
    # build and train the model
    kmeans = KMeans(n_clusters = i, random_state=123).fit(wine)
    score = calinski_harabasz_score(wine,kmeans.labels_)
    print('Calinski-Harabasz index of the wine data with %d clusters: %f'%(i,score))

Calinski-Harabasz index of the wine data with 2 clusters: 505.425689
Calinski-Harabasz index of the wine data with 3 clusters: 561.805171
Calinski-Harabasz index of the wine data with 4 clusters: 707.349460
Calinski-Harabasz index of the wine data with 5 clusters: 787.011163
Calinski-Harabasz index of the wine data with 6 clusters: 878.393807
Calinski-Harabasz index of the wine data with 7 clusters: 1180.244416
Calinski-Harabasz index of the wine data with 8 clusters: 1297.354659
Calinski-Harabasz index of the wine data with 9 clusters: 1349.991148
Calinski-Harabasz index of the wine data with 10 clusters: 1441.838351



Results Analysis and Reflection
Analysis of the FMI scores shows that the FMI is highest when the wine data are clustered into 3 classes, so K-Means clustering of the wine dataset performs best with 3 clusters.

Analysis of the silhouette coefficient line chart shows that the average distortion is largest when the wine data are grouped into 3 clusters, which again indicates that 3 clusters works best.

Analysis of the Calinski-Harabasz index shows that it generally increases with the number of clusters, reaching its maximum at 10 clusters. Since the Calinski-Harabasz index is an evaluation method that does not use the true labels, it is less reliable than the FMI, so there is reason to believe that the Calinski-Harabasz result here is anomalous.

In summary, and taking the description of the actual data into account, K-Means clustering of the wine dataset performs best with 3 clusters.
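
Note that the FMI loop above clusters the standardized training set while the silhouette and Calinski-Harabasz loops cluster the raw wine DataFrame, so the three criteria are not computed on the same data. A sketch of one consistent comparison on the standardized training set (an illustration, not the original code), assuming wine_trainScaler and wine_target_train from Practical Training 1 are in scope and a scikit-learn version that provides calinski_harabasz_score:

from sklearn.cluster import KMeans
from sklearn.metrics import fowlkes_mallows_score, silhouette_score, calinski_harabasz_score

# evaluate all three criteria on the same data for k = 2..10
for k in range(2, 11):
    km = KMeans(n_clusters=k, random_state=123).fit(wine_trainScaler)
    fmi = fowlkes_mallows_score(wine_target_train, km.labels_)
    sil = silhouette_score(wine_trainScaler, km.labels_)
    ch = calinski_harabasz_score(wine_trainScaler, km.labels_)
    print('k=%d  FMI=%.4f  silhouette=%.4f  Calinski-Harabasz=%.4f' % (k, fmi, sil, ch))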

Practical Training 3: Building an SVM Classification Model on the wine Dataset

1. Training Objectives

(1) Master the use of sklearn estimators.
(2) Master how to build a classification model.
(3) Master how to evaluate a classification model.

2. Requirements

The wine dataset contains 3 classes of wine. Split the wine dataset into training and test sets, train an SVM classification model on the training set, and use the trained model to predict the class of each wine in the test set.

3. Implementation Approach and Steps

(1) Read the wine dataset and separate the labels from the data.

import pandas as pd
wine = pd.read_csv('./data/wine.csv')
wine_data=wine.iloc[:,1:]
wine_target=wine['Class']

(2) Split the wine dataset into training and test sets.

from sklearn.model_selection import train_test_split
wine_data_train, wine_data_test, \
wine_target_train, wine_target_test = \
train_test_split(wine_data, wine_target, \
    test_size=0.1, random_state=6)

(3) Standardize the wine dataset using min-max (deviation) standardization.

from sklearn.preprocessing import MinMaxScaler  # min-max (deviation) standardization
stdScale = MinMaxScaler().fit(wine_data_train)  # fit the scaling rule on the training set
wine_trainScaler = stdScale.transform(wine_data_train)  # standardize the training set
wine_testScaler = stdScale.transform(wine_data_test)  # standardize the test set with the rule fitted on the training set

(4) Build the SVM model and predict the test-set results.

from sklearn.svm import SVC 
svm = SVC().fit(wine_trainScaler,wine_target_train)
print('Constructed SVM model:\n',svm)
wine_target_pred = svm.predict(wine_testScaler)
print('First 10 predictions:\n',wine_target_pred[:10])
Constructed SVM model:
 SVC(C=1.0, break_ties=False, cache_size=200, class_weight=None, coef0=0.0,
    decision_function_shape='ovr', degree=3, gamma='scale', kernel='rbf',
    max_iter=-1, probability=False, random_state=None, shrinking=True,
    tol=0.001, verbose=False)
First 10 predictions:
 [1 2 2 2 1 1 2 2 2 1]

(5) Print the classification report and evaluate the model's performance.

from sklearn.metrics import classification_report
print('Classification report for SVM predictions on the wine data:','\n',
      classification_report(wine_target_test,
            wine_target_pred))
Classification report for SVM predictions on the wine data: 
               precision    recall  f1-score   support

           1       1.00      1.00      1.00         9
           2       1.00      1.00      1.00         8
           3       1.00      1.00      1.00         1

    accuracy                           1.00        18
   macro avg       1.00      1.00      1.00        18
weighted avg       1.00      1.00      1.00        18

Results Analysis and Reflection

Here the training and test sets are split 9:1 and the model trains very well: in the classification report, precision, recall, and F1-score are all 1.00, i.e. every prediction is correct. As a small extra experiment, we split the training and test sets 1:1 instead; the result is shown below:

%%html
<img style="float: left;" src="./image/s_3.png" width=400 height=400>


As can be seen, when the model does not get enough training data and the test set is large, the predictions are no longer guaranteed to be all correct, although the accuracy is still quite high.
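
For reference, the 1:1 experiment mentioned above can be reproduced roughly as follows (a sketch rather than the author's original code; only test_size changes from the 9:1 setup), assuming wine_data and wine_target from step (1) are in scope:

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC
from sklearn.metrics import classification_report

# split 1:1 instead of 9:1, then repeat the scale/train/predict steps
X_train, X_test, y_train, y_test = train_test_split(
    wine_data, wine_target, test_size=0.5, random_state=6)
scaler = MinMaxScaler().fit(X_train)
svm_half = SVC().fit(scaler.transform(X_train), y_train)
print(classification_report(y_test, svm_half.predict(scaler.transform(X_test))))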

Practical Training 4: Building Regression Models on the wine_quality Dataset

1. Training Objectives

(1) Become proficient with sklearn estimators.
(2) Master how to build a regression model.
(3) Master how to evaluate a regression model.

2. Requirements

The wine ratings in the winequality dataset range from 1 to 10. Build a linear regression model and a gradient boosting regression model, train them on the winequality training set, and use the trained models to predict the wine ratings of the test set. Using the true ratings, evaluate how well the two regression models perform.

3. Implementation Approach and Steps

(1) Using the processed winequality data, build a linear regression model.

from sklearn.linear_model import LinearRegression
clf = LinearRegression().fit(wine_quality_trainPca,wine_quality_target_train)
y_pred = clf.predict(wine_quality_testPca)
print('First 10 predictions of the linear regression model:','\n',y_pred[:10])
First 10 predictions of the linear regression model: 
 [5.20302279 5.20466945 5.34234945 5.3790242  5.74640832 5.30545288
 5.27205578 5.27721131 5.66550711 5.70050188]

(2) Using the processed winequality data, build a gradient boosting regression model.

from sklearn.ensemble import GradientBoostingRegressor
GBR_wine = GradientBoostingRegressor().\
fit(wine_quality_trainPca,wine_quality_target_train)
wine_target_pred = GBR_wine.predict(wine_quality_testPca)
print('First 10 predictions of the gradient boosting regression model:','\n',wine_target_pred[:10])
print('First 10 true labels:','\n',list(wine_quality_target_test[:10]))
First 10 predictions of the gradient boosting regression model: 
 [5.28629565 5.14521438 5.4020539  5.10652992 6.01754672 5.15338501
 5.13264291 5.37157537 5.78959206 5.89730642]
First 10 true labels: 
 [6, 5, 6, 5, 6, 5, 5, 5, 5, 6]

(3) Using the true and predicted ratings, compute the mean squared error, median absolute error, and explained variance.

from sklearn.metrics import explained_variance_score,\
mean_absolute_error,\
mean_squared_error,\
median_absolute_error,r2_score
print('Evaluation of the linear regression model:')
print('Mean absolute error of the linear regression model on the winequality data:',
     mean_absolute_error(wine_quality_target_test,y_pred))
print('Mean squared error of the linear regression model on the winequality data:',
     mean_squared_error(wine_quality_target_test,y_pred))
print('Median absolute error of the linear regression model on the winequality data:',
     median_absolute_error(wine_quality_target_test,y_pred))
print('Explained variance of the linear regression model on the winequality data:',
     explained_variance_score(wine_quality_target_test,y_pred))
print('R-squared of the linear regression model on the winequality data:',
     r2_score(wine_quality_target_test,y_pred))

print('Evaluation of the gradient boosting regression model:')
print('Mean absolute error of the gradient boosting regression model on the winequality data:',
     mean_absolute_error(wine_quality_target_test,wine_target_pred))
print('Mean squared error of the gradient boosting regression model on the winequality data:',
     mean_squared_error(wine_quality_target_test,wine_target_pred))
print('Median absolute error of the gradient boosting regression model on the winequality data:',
     median_absolute_error(wine_quality_target_test,wine_target_pred))
print('Explained variance of the gradient boosting regression model on the winequality data:',
     explained_variance_score(wine_quality_target_test,wine_target_pred))
print('R-squared of the gradient boosting regression model on the winequality data:',
     r2_score(wine_quality_target_test,wine_target_pred))
Evaluation of the linear regression model:
Mean absolute error of the linear regression model on the winequality data: 0.5398251122317769
Mean squared error of the linear regression model on the winequality data: 0.43298314878877353
Median absolute error of the linear regression model on the winequality data: 0.46814373379480045
Explained variance of the linear regression model on the winequality data: 0.34153599915272226
R-squared of the linear regression model on the winequality data: 0.3374456516688773
Evaluation of the gradient boosting regression model:
Mean absolute error of the gradient boosting regression model on the winequality data: 0.5043859078173963
Mean squared error of the gradient boosting regression model on the winequality data: 0.3866355007418619
Median absolute error of the gradient boosting regression model on the winequality data: 0.41613212262993216
Explained variance of the gradient boosting regression model on the winequality data: 0.41147753430214007
R-squared of the gradient boosting regression model on the winequality data: 0.40836720100469737
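
The two evaluation blocks above repeat the same five metric calls; a small helper removes the duplication. A sketch (the helper name report_regression is hypothetical), assuming the predictions y_pred and wine_target_pred computed above are in scope:

from sklearn.metrics import (mean_absolute_error, mean_squared_error,
                             median_absolute_error, explained_variance_score, r2_score)

def report_regression(name, y_true, y_hat):
    # print the same five metrics used above for one model
    print(name)
    print('  mean absolute error:  ', mean_absolute_error(y_true, y_hat))
    print('  mean squared error:   ', mean_squared_error(y_true, y_hat))
    print('  median absolute error:', median_absolute_error(y_true, y_hat))
    print('  explained variance:   ', explained_variance_score(y_true, y_hat))
    print('  R-squared:            ', r2_score(y_true, y_hat))

report_regression('Linear regression', wine_quality_target_test, y_pred)
report_regression('Gradient boosting regression', wine_quality_target_test, wine_target_pred)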

Study Notes 1: Task 6.1 in Python, processing data with sklearn transformers (train/test split, PCA dimensionality reduction)
Study Notes 2: Task 6.2 in Python, building and evaluating clustering models
Study Notes 3: Task 6.3 in Python, building and evaluating classification models (SVM)
Study Notes 4: Task 6.4 in Python, building and evaluating regression models


Reposted from blog.csdn.net/jcjic/article/details/111869901