Random Forest Algorithm: Northeastern University Big Data Class, Data Mining Practicum 4


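Using the data in train.csv, build a classification model with the random forest algorithm in the H2O framework, then use the model to predict the data in test.csv and compute the classification accuracy to evaluate the model. Tune the parameters and observe how the accuracy changes. Note: accuracy = number of correct predictions / total number of samples. Some feature selection may also be done to improve the accuracy.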

import h2o
from h2o.estimators.random_forest import H2ORandomForestEstimator 
from h2o.grid.grid_search import H2OGridSearch
h2o.init()
Checking whether there is an H2O instance running at http://localhost:54321 . connected.
H2O cluster uptime: 1 min 19 secs
H2O cluster timezone: Asia/Shanghai
H2O data parsing timezone: UTC
H2O cluster version: 3.28.0.1
H2O cluster version age: 16 days
H2O cluster name: H2O_from_python_寮犲織娴4kdmlj
H2O cluster total nodes: 1
H2O cluster free memory: 3.512 Gb
H2O cluster total cores: 4
H2O cluster allowed cores: 4
H2O cluster status: locked, healthy
H2O connection url: http://localhost:54321
H2O connection proxy: {'http': None, 'https': None}
H2O internal security: False
H2O API Extensions: Amazon S3, Algos, AutoML, Core V3, TargetEncoder, Core V4
Python version: 3.7.4 final
train=h2o.import_file(path ="C:\\Users\\zzh\\Desktop\\dataMiningExperment\\data4\\train.csv")
test=h2o.import_file(path = "C:\\Users\\zzh\\Desktop\\dataMiningExperment\\data4\\test.csv")
Parse progress: |█████████████████████████████████████████████████████████| 100%
Parse progress: |█████████████████████████████████████████████████████████| 100%
train.head(5)
driver trip Average_speed Average_ABS_Acceleration Average_RPM Variance_speed Variance_ABS_Acceleration Variance_RPM v_a v_b v_c v_d a_a a_b a_c r_a r_b r_c Catrgory
4.10304e+10 1 6 0.218219 1209.08 33.4659 0.154504 242766 0.564121 0.224947 0.16328 0.047652 0.594954 0.288718 0.116328 0.585144 0.348283 0.066573 cluster2
4.10304e+10 2 3 0.305416 1064.18 24.5744 0.283866 185456 0.575369 0.291626 0.133005 0 0.57734 0.210837 0.211823 0.57734 0.365517 0.057143 cluster2
4.10304e+10 3 5 0.121377 1168.5 24.3105 0.012078 224469 0.574566 0.269364 0.156069 0 0.531792 0.393064 0.075145 0.56763 0.354913 0.077457 cluster2
4.10304e+10 4 7 0.185244 1175.39 41.511 0.323999 260512 0.498039 0.196078 0.214994 0.090888 0.685582 0.236217 0.078201 0.432757 0.505882 0.061361 cluster2
4.10304e+10 5 9 0.255851 1311.18 53.3696 0.440556 309292 0.39738 0.131823 0.318504 0.152293 0.543395 0.299945 0.156659 0.32369 0.60726 0.06905 cluster1

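train.csv is the training data set: the processed output of a driver behavior recognition clustering. The driver and trip columns are not used when building the model.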

train=train[2:]  # drop the two unused columns, driver and trip
test=test[2:]  # drop the two unused columns, driver and trip
train.head(5)
Average_speed Average_ABS_Acceleration Average_RPM Variance_speed Variance_ABS_Acceleration Variance_RPM v_a v_b v_c v_d a_a a_b a_c r_a r_b r_c Catrgory
6 0.218219 1209.08 33.4659 0.154504 242766 0.564121 0.224947 0.16328 0.047652 0.594954 0.288718 0.116328 0.585144 0.348283 0.066573 cluster2
3 0.305416 1064.18 24.5744 0.283866 185456 0.575369 0.291626 0.133005 0 0.57734 0.210837 0.211823 0.57734 0.365517 0.057143 cluster2
5 0.121377 1168.5 24.3105 0.012078 224469 0.574566 0.269364 0.156069 0 0.531792 0.393064 0.075145 0.56763 0.354913 0.077457 cluster2
7 0.185244 1175.39 41.511 0.323999 260512 0.498039 0.196078 0.214994 0.090888 0.685582 0.236217 0.078201 0.432757 0.505882 0.061361 cluster2
9 0.255851 1311.18 53.3696 0.440556 309292 0.39738 0.131823 0.318504 0.152293 0.543395 0.299945 0.156659 0.32369 0.60726 0.06905 cluster1

1. Build the model directly, with all parameters at their defaults

Accuracy: 0.8666666666666667

model1 = H2ORandomForestEstimator()  # initialize the model
model1.train(x = train.names[0:-1],y = 'Catrgory',training_frame = train)  # train the model; train.names[0:-1] excludes the last column (the label)
drf Model Build progress: |███████████████████████████████████████████████| 100%
predict=model1.predict(test[test.names[0:-1]])  # predict on the test set; test[test.names[0:-1]] drops the label column
predict.head(5)
drf prediction progress: |████████████████████████████████████████████████| 100%
predict cluster0 cluster1 cluster2
cluster2 0.0204082 0 0.979592
cluster2 0.12963 0 0.87037
cluster2 0 0 1
cluster2 0 0 1
cluster1 0 1 0

Note: accuracy = number of correct predictions / total number of samples

tmp = predict[predict['predict'] == test['Catrgory']].nrow 
accuracy = tmp/test.nrow
accuracy
0.8666666666666667
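
The accuracy computed above is simply the share of rows where the predicted label equals the true label. As a framework-independent illustration, here is a minimal sketch in plain Python; the helper name `accuracy` and the sample labels are made up for this example and are not part of the original experiment.

```python
def accuracy(predicted, actual):
    """Return the fraction of positions where predicted == actual."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    if not actual:
        raise ValueError("cannot compute accuracy of an empty sample")
    correct = sum(p == a for p, a in zip(predicted, actual))
    return correct / len(actual)


# Hypothetical labels, only to show the call
predicted = ["cluster2", "cluster2", "cluster1", "cluster0"]
actual = ["cluster2", "cluster1", "cluster1", "cluster0"]
print(accuracy(predicted, actual))  # 3 of 4 labels match -> 0.75
```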

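Inspect the model details to find the features that most influence the predictions.

# Note: deepfeatures is a method intended for deep learning models and it is not called here.
# Evaluating the attribute only prints the bound-method repr, which happens to embed the model summary.
# model1.varimp(use_pandas=True) returns the variable importances directly.
model1.deepfeatures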
Model Details
=============
H2ORandomForestEstimator :  Distributed Random Forest
Model Key:  DRF_model_python_1577882615850_1


Model Summary: 
number_of_trees number_of_internal_trees model_size_in_bytes min_depth max_depth mean_depth min_leaves max_leaves mean_leaves
0 50.0 150.0 59341.0 5.0 13.0 8.14 14.0 52.0 26.773333
ModelMetricsMultinomial: drf
** Reported on train data. **

MSE: 0.048564890251647425
RMSE: 0.22037443193720868
LogLoss: 0.16320718635092735
Mean Per-Class Error: 0.07050700819826967

Confusion Matrix: Row labels: Actual class; Column labels: Predicted class
cluster0 cluster1 cluster2 Error Rate
0 138.0 1.0 14.0 0.098039 15 / 153
1 1.0 161.0 11.0 0.069364 12 / 173
2 6.0 6.0 260.0 0.044118 12 / 272
3 145.0 168.0 285.0 0.065217 39 / 598
Top-3 Hit Ratios: 
k hit_ratio
0 1 0.934783
1 2 1.000000
2 3 1.000000
Scoring History: 
timestamp duration number_of_trees training_rmse training_logloss training_classification_error
0 2020-01-01 20:45:33 0.049 sec 0.0 NaN NaN NaN
1 2020-01-01 20:45:34 0.383 sec 1.0 0.359650 3.811475 0.117391
2 2020-01-01 20:45:34 0.483 sec 2.0 0.342797 3.340081 0.105691
3 2020-01-01 20:45:34 0.515 sec 3.0 0.330296 3.012446 0.089862
4 2020-01-01 20:45:34 0.562 sec 4.0 0.320177 2.679887 0.089613
5 2020-01-01 20:45:34 0.587 sec 5.0 0.298609 2.080400 0.087361
6 2020-01-01 20:45:34 0.622 sec 6.0 0.281188 1.640286 0.083929
7 2020-01-01 20:45:34 0.653 sec 7.0 0.278461 1.430675 0.086655
8 2020-01-01 20:45:34 0.682 sec 8.0 0.269822 1.243377 0.090909
9 2020-01-01 20:45:34 0.703 sec 9.0 0.263806 1.178969 0.087179
10 2020-01-01 20:45:34 0.731 sec 10.0 0.250604 0.825163 0.078992
11 2020-01-01 20:45:34 0.753 sec 11.0 0.242310 0.759343 0.068562
12 2020-01-01 20:45:34 0.783 sec 12.0 0.239949 0.702918 0.070234
13 2020-01-01 20:45:34 0.803 sec 13.0 0.233250 0.482001 0.070234
14 2020-01-01 20:45:34 0.833 sec 14.0 0.229632 0.426821 0.061873
15 2020-01-01 20:45:34 0.863 sec 15.0 0.231505 0.429770 0.063545
16 2020-01-01 20:45:34 0.890 sec 16.0 0.229281 0.375294 0.066890
17 2020-01-01 20:45:34 0.919 sec 17.0 0.229443 0.375982 0.068562
18 2020-01-01 20:45:34 0.949 sec 18.0 0.229665 0.377334 0.068562
19 2020-01-01 20:45:34 0.974 sec 19.0 0.230373 0.379523 0.070234
See the whole table with table.as_data_frame()

Variable Importances: 
variable relative_importance scaled_importance percentage
0 Average_speed 3703.256836 1.000000 0.245570
1 r_a 2256.470947 0.609321 0.149631
2 v_a 1821.382812 0.491833 0.120779
3 v_d 1685.737915 0.455204 0.111785
4 r_b 1604.149536 0.433173 0.106374
5 Average_RPM 1018.616333 0.275060 0.067546
6 v_c 668.664001 0.180561 0.044340
7 Variance_speed 553.771790 0.149536 0.036722
8 a_a 523.651306 0.141403 0.034724
9 v_b 439.868347 0.118779 0.029169
10 a_b 200.154129 0.054048 0.013273
11 r_c 155.026993 0.041862 0.010280
12 Variance_RPM 142.054703 0.038359 0.009420
13 a_c 121.158333 0.032717 0.008034
14 Average_ABS_Acceleration 113.996506 0.030783 0.007559
15 Variance_ABS_Acceleration 72.286301 0.019520 0.004793
<bound method ModelBase.deepfeatures of >

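The error figures in the model output above can be checked by hand from the confusion matrix. The sketch below copies the three class rows from that matrix and recomputes the per-class error, the mean per-class error, and the overall error; the variable names are chosen for this example only.

```python
# Rows are actual classes, columns are predicted classes,
# copied from the confusion matrix printed in the model output.
labels = ["cluster0", "cluster1", "cluster2"]
confusion = [
    [138, 1, 14],
    [1, 161, 11],
    [6, 6, 260],
]

per_class_error = []
for i, row in enumerate(confusion):
    wrong = sum(row) - row[i]  # everything off the diagonal is a mistake
    per_class_error.append(wrong / sum(row))

total = sum(sum(row) for row in confusion)
correct = sum(confusion[i][i] for i in range(len(confusion)))

mean_per_class_error = sum(per_class_error) / len(per_class_error)
overall_error = (total - correct) / total

for name, err in zip(labels, per_class_error):
    print(name, round(err, 6))
print("mean per-class error:", round(mean_per_class_error, 6))
print("overall error:", round(overall_error, 6))
```

The results agree with the Error column (15 / 153, 12 / 173, 12 / 272, 39 / 598) and with the reported Mean Per-Class Error; one minus the overall error gives the k = 1 hit ratio.
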
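2. Build the model after feature selection, with all parameters at their defaults

The eight most influential features are selected. The list also keeps the label column Catrgory, and the order is the one used in the code, not strictly by importance:

[['Average_speed','r_a', 'r_b','Average_RPM','v_a','v_d','Variance_speed','v_c','Catrgory']]

Accuracy: 0.8666666666666667 (unchanged)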

train_features= train[['Average_speed','r_a', 'r_b','Average_RPM','v_a','v_d','Variance_speed','v_c','Catrgory']]
test_features= test[['Average_speed','r_a', 'r_b','Average_RPM','v_a','v_d','Variance_speed','v_c','Catrgory']]
### Build the model after feature selection, default parameters
### Accuracy:
model2 = H2ORandomForestEstimator()
model2.train(x = train_features.names[0:-1],y = 'Catrgory',training_frame = train_features)
drf Model Build progress: |███████████████████████████████████████████████| 100%
predict=model2.predict(test_features[test_features.names[0:-1]])
drf prediction progress: |████████████████████████████████████████████████| 100%
tmp = predict[predict['predict'] == test_features['Catrgory']].nrow 
accuracy = tmp/test_features.nrow
accuracy
0.8666666666666667
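
The top-eight selection used in this section can also be derived programmatically from the Variable Importances table instead of being typed by hand. The following is a small sketch in plain Python: the (variable, percentage) pairs are copied from that table, and the helper name `top_k_features` is hypothetical.

```python
# (variable, percentage) pairs copied from the Variable Importances table
importances = [
    ("Average_speed", 0.245570), ("r_a", 0.149631), ("v_a", 0.120779),
    ("v_d", 0.111785), ("r_b", 0.106374), ("Average_RPM", 0.067546),
    ("v_c", 0.044340), ("Variance_speed", 0.036722), ("a_a", 0.034724),
    ("v_b", 0.029169), ("a_b", 0.013273), ("r_c", 0.010280),
    ("Variance_RPM", 0.009420), ("a_c", 0.008034),
    ("Average_ABS_Acceleration", 0.007559),
    ("Variance_ABS_Acceleration", 0.004793),
]


def top_k_features(importances, k):
    """Return the names of the k features with the largest importance."""
    ranked = sorted(importances, key=lambda item: item[1], reverse=True)
    return [name for name, _ in ranked[:k]]


selected = top_k_features(importances, 8)
# Share of the total importance carried by the selected features
covered = sum(pct for name, pct in importances if name in selected)
print(selected)
print(round(covered, 6))
```

The selected set matches the eight features used above and accounts for roughly 88% of the total importance, which is consistent with the accuracy staying the same after the other eight columns are dropped.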

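3. Tune the parameters and observe how the classification accuracy changes.

3.1. Tune ntrees and max_depth with a for loop to find the parameters that give the highest accuracy

Highest accuracy: 0.894

ntrees: 5

max_depth: 9

The loop output is too long to show here; the best parameters (ntrees and max_depth) are taken from it.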

max_accuracy=0
ntrees=0
max_depth=0
# try every (ntrees, max_depth) pair in 1..19 and keep the pair with the highest test accuracy
for i in range(1,20):
    for j in range(1,20):
        model3=H2ORandomForestEstimator(ntrees=i,max_depth =j)
        model3.train(x=train.names[0:-1],y='Catrgory',training_frame=train)
        predict=model3.predict(test[test.names[0:-1]])
        tmp = predict[predict['predict'] == test['Catrgory']].nrow
        accuracy = tmp/test.nrow
        print("now acc is:", accuracy, "--- max acc is :",max_accuracy)
        if max_accuracy<accuracy:
            max_accuracy=accuracy
            ntrees=i
            max_depth=j
print("max acc:",max_accuracy)
print("best ntrees :",ntrees)
print("best max_depth :",max_depth)
model3 = H2ORandomForestEstimator(ntrees=ntrees,max_depth=max_depth)  # retrain with the best parameters found by the loop above
model3.train(x = train.names[0:-1],y = 'Catrgory',training_frame = train)
drf Model Build progress: |███████████████████████████████████████████████| 100%
predict=model3.predict(test[test.names[0:-1]])
drf prediction progress: |████████████████████████████████████████████████| 100%
tmp = predict[predict['predict'] == test['Catrgory']].nrow 
accuracy = tmp/test.nrow
accuracy
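
The nested loop above is an exhaustive search over two parameters. Its control flow can be separated from the model training, as in the sketch below, which is written in plain Python so it runs without an H2O cluster. The names `search_best` and `toy_evaluate` are hypothetical; `toy_evaluate` is only a stand-in for "train a model with these parameters and return its accuracy", and its peak at ntrees=5, max_depth=9 is arbitrary.

```python
from itertools import product


def search_best(evaluate, ntrees_values, max_depth_values):
    """Try every (ntrees, max_depth) pair and return (best_score, best_params)."""
    best_score = float("-inf")
    best_params = None
    for ntrees, max_depth in product(ntrees_values, max_depth_values):
        score = evaluate(ntrees, max_depth)
        if score > best_score:  # strict comparison: ties keep the earlier pair
            best_score = score
            best_params = {"ntrees": ntrees, "max_depth": max_depth}
    return best_score, best_params


def toy_evaluate(ntrees, max_depth):
    # Hypothetical scoring function used only for demonstration;
    # in the experiment this would train a model and score it.
    return 1.0 - 0.01 * (abs(ntrees - 5) + abs(max_depth - 9))


score, params = search_best(toy_evaluate, range(1, 20), range(1, 20))
print(score, params)
```

Returning the parameters together with the score avoids the separate `ntrees` and `max_depth` bookkeeping variables, and the result can be unpacked straight into the estimator constructor.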

The test data merged with the predictions is saved as predict.csv

out = test.concat(predict['predict'])
h2o.download_csv(out,"predict.csv")
'C:\\Users\\zzh\\Desktop\\dataMiningExperment\\exp4\\predict.csv'

3.2. Find the best parameters with Grid Search

Accuracy: 0.8708333333333333

ntrees: 10

max_depth: 10

rf_params = {'ntrees': list(range(30, 60)),
             'max_depth': list(range(10, 20))}

rf_grid = H2OGridSearch(model = H2ORandomForestEstimator,
                        hyper_params=rf_params)
rf_grid.train(x = train.names[0:-1],
              y = 'Catrgory',
              training_frame = train)

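The grid output is too long to show here; the best parameters (ntrees and max_depth) are taken from it.

rf_grid.show()  # the grid object is named rf_grid, not rfm_grid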
model4 = H2ORandomForestEstimator(ntrees=3,max_depth=6)
model4.train(x = train.names[0:-1],y = 'Catrgory',training_frame = train)
predict=model4.predict(test[test.names[0:-1]])
tmp = predict[predict['predict'] == test['Catrgory']].nrow 
accuracy = tmp/test.nrow
accuracy
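
Rather than reading the best parameters off the printed grid by eye, the grid object can be asked for its models in sorted order. The sketch below assumes H2O's `get_grid(sort_by=..., decreasing=...)` method and the `models` attribute of the grid it returns. The helper name `best_model_from_grid` is hypothetical, and `logloss` is chosen only as an example metric. Since no validation frame is passed to the grid in this post, the metric would presumably be the one reported on the training data.

```python
def best_model_from_grid(grid, metric="logloss", decreasing=False):
    """Return the first model of the grid after sorting by `metric`.

    `grid` is expected to behave like an already trained H2OGridSearch:
    it must offer get_grid(sort_by=..., decreasing=...) and the returned
    object must expose a `models` list.
    """
    sorted_grid = grid.get_grid(sort_by=metric, decreasing=decreasing)
    return sorted_grid.models[0]
```

With the grid trained above this could be called as `best = best_model_from_grid(rf_grid)`, and `best.predict(test[test.names[0:-1]])` could then take the place of the manually re-created model.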
Reposted from blog.csdn.net/weixin_43124279/article/details/103796234