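Using the data in train.csv, build a classification model with the random forest algorithm in the H2O framework, then use the model to predict the data in test.csv and compute the classification accuracy to evaluate the model. Tune the parameters and observe how the accuracy changes. Note: accuracy = number of correct predictions / total number of samples. [Note: some feature selection may be done to improve accuracy.]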
import h2o
from h2o.estimators.random_forest import H2ORandomForestEstimator
from h2o.grid.grid_search import H2OGridSearch

# Connect to a running local H2O instance, or start one if none is found
h2o.init()
Checking whether there is an H2O instance running at http://localhost:54321 . connected.
H2O cluster uptime: 1 min 19 secs
H2O cluster timezone: Asia/Shanghai
H2O data parsing timezone: UTC
H2O cluster version: 3.28.0.1
H2O cluster version age: 16 days
H2O cluster name: H2O_from_python_寮犲織娴4kdmlj
H2O cluster total nodes: 1
H2O cluster free memory: 3.512 Gb
H2O cluster total cores: 4
H2O cluster allowed cores: 4
H2O cluster status: locked, healthy
H2O connection url: http://localhost:54321
H2O connection proxy: {'http': None, 'https': None}
H2O internal security: False
H2O API Extensions: Amazon S3, Algos, AutoML, Core V3, TargetEncoder, Core V4
Python version: 3.7.4 final
# Load the training and test sets as H2OFrames
train = h2o.import_file(path="C:\\Users\\zzh\\Desktop\\dataMiningExperment\\data4\\train.csv")
test = h2o.import_file(path="C:\\Users\\zzh\\Desktop\\dataMiningExperment\\data4\\test.csv")
Parse progress: |█████████████████████████████████████████████████████████| 100%
Parse progress: |█████████████████████████████████████████████████████████| 100%
train.head(5)
| driver | trip | Average_speed | Average_ABS_Acceleration | Average_RPM | Variance_speed | Variance_ABS_Acceleration | Variance_RPM | v_a | v_b | v_c | v_d | a_a | a_b | a_c | r_a | r_b | r_c | Catrgory |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 4.10304e+10 | 1 | 6 | 0.218219 | 1209.08 | 33.4659 | 0.154504 | 242766 | 0.564121 | 0.224947 | 0.16328 | 0.047652 | 0.594954 | 0.288718 | 0.116328 | 0.585144 | 0.348283 | 0.066573 | cluster2 |
| 4.10304e+10 | 2 | 3 | 0.305416 | 1064.18 | 24.5744 | 0.283866 | 185456 | 0.575369 | 0.291626 | 0.133005 | 0 | 0.57734 | 0.210837 | 0.211823 | 0.57734 | 0.365517 | 0.057143 | cluster2 |
| 4.10304e+10 | 3 | 5 | 0.121377 | 1168.5 | 24.3105 | 0.012078 | 224469 | 0.574566 | 0.269364 | 0.156069 | 0 | 0.531792 | 0.393064 | 0.075145 | 0.56763 | 0.354913 | 0.077457 | cluster2 |
| 4.10304e+10 | 4 | 7 | 0.185244 | 1175.39 | 41.511 | 0.323999 | 260512 | 0.498039 | 0.196078 | 0.214994 | 0.090888 | 0.685582 | 0.236217 | 0.078201 | 0.432757 | 0.505882 | 0.061361 | cluster2 |
| 4.10304e+10 | 5 | 9 | 0.255851 | 1311.18 | 53.3696 | 0.440556 | 309292 | 0.39738 | 0.131823 | 0.318504 | 0.152293 | 0.543395 | 0.299945 | 0.156659 | 0.32369 | 0.60726 | 0.06905 | cluster1 |
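train.csv is the training dataset: processed data derived from the clustering results of driver behavior recognition. The two columns driver and trip are not used when building the model.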
# Keep the columns from index 2 onward, which drops driver and trip
train = train[2:]
test = test[2:]
train.head(5)
| Average_speed | Average_ABS_Acceleration | Average_RPM | Variance_speed | Variance_ABS_Acceleration | Variance_RPM | v_a | v_b | v_c | v_d | a_a | a_b | a_c | r_a | r_b | r_c | Catrgory |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 6 | 0.218219 | 1209.08 | 33.4659 | 0.154504 | 242766 | 0.564121 | 0.224947 | 0.16328 | 0.047652 | 0.594954 | 0.288718 | 0.116328 | 0.585144 | 0.348283 | 0.066573 | cluster2 |
| 3 | 0.305416 | 1064.18 | 24.5744 | 0.283866 | 185456 | 0.575369 | 0.291626 | 0.133005 | 0 | 0.57734 | 0.210837 | 0.211823 | 0.57734 | 0.365517 | 0.057143 | cluster2 |
| 5 | 0.121377 | 1168.5 | 24.3105 | 0.012078 | 224469 | 0.574566 | 0.269364 | 0.156069 | 0 | 0.531792 | 0.393064 | 0.075145 | 0.56763 | 0.354913 | 0.077457 | cluster2 |
| 7 | 0.185244 | 1175.39 | 41.511 | 0.323999 | 260512 | 0.498039 | 0.196078 | 0.214994 | 0.090888 | 0.685582 | 0.236217 | 0.078201 | 0.432757 | 0.505882 | 0.061361 | cluster2 |
| 9 | 0.255851 | 1311.18 | 53.3696 | 0.440556 | 309292 | 0.39738 | 0.131823 | 0.318504 | 0.152293 | 0.543395 | 0.299945 | 0.156659 | 0.32369 | 0.60726 | 0.06905 | cluster1 |
1. Build the model directly, with all parameters left at their defaults
Accuracy: 0.8666666666666667
model1 = H2ORandomForestEstimator()
# All columns except the last are predictors; the last column, Catrgory, is the target
model1.train(x=train.names[0:-1], y='Catrgory', training_frame=train)
drf Model Build progress: |███████████████████████████████████████████████| 100%
# Predict on the test predictors only (the target column is excluded)
predict = model1.predict(test[test.names[0:-1]])
predict.head(5)
drf prediction progress: |████████████████████████████████████████████████| 100%
| predict | cluster0 | cluster1 | cluster2 |
|---|---|---|---|
| cluster2 | 0.0204082 | 0 | 0.979592 |
| cluster2 | 0.12963 | 0 | 0.87037 |
| cluster2 | 0 | 0 | 1 |
| cluster2 | 0 | 0 | 1 |
| cluster1 | 0 | 1 | 0 |
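Note: accuracy = number of correct predictions / total number of samples
# Count the rows whose predicted label equals the true label
tmp = predict[predict['predict'] == test['Catrgory']].nrow
accuracy = tmp / test.nrow
accuracy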
0.8666666666666667
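The accuracy definition used here does not depend on H2O. As a minimal sketch, assuming the predicted and true labels are available as plain Python lists (the labels below are hypothetical and not taken from test.csv), the same ratio can be computed like this:

```python
def accuracy(predicted, actual):
    """Fraction of positions where the predicted label equals the actual label."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    if not actual:
        raise ValueError("cannot compute accuracy of an empty sample")
    correct = sum(p == a for p, a in zip(predicted, actual))
    return correct / len(actual)


# Hypothetical labels, for illustration only (not taken from test.csv)
predicted = ["cluster2", "cluster2", "cluster1", "cluster0"]
actual = ["cluster2", "cluster1", "cluster1", "cluster0"]
print(accuracy(predicted, actual))  # 3 of 4 labels match
```

With H2OFrames, the row filter in the cell above plays the role of the `sum(...)` here: it keeps only the rows where the two label columns agree, and `nrow` counts them.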
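Inspect the detailed model information to find the features that have the greatest influence on the predictions
# deepfeatures is a method and is not called here: the notebook echoes the bound method,
# whose repr embeds the model summary shown below. model1.show() prints the same details directly.
model1.deepfeatures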
Model Details
=============
H2ORandomForestEstimator : Distributed Random Forest
Model Key: DRF_model_python_1577882615850_1
Model Summary:
| | number_of_trees | number_of_internal_trees | model_size_in_bytes | min_depth | max_depth | mean_depth | min_leaves | max_leaves | mean_leaves |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 50.0 | 150.0 | 59341.0 | 5.0 | 13.0 | 8.14 | 14.0 | 52.0 | 26.773333 |
ModelMetricsMultinomial: drf
** Reported on train data. **
MSE: 0.048564890251647425
RMSE: 0.22037443193720868
LogLoss: 0.16320718635092735
Mean Per-Class Error: 0.07050700819826967
Confusion Matrix: Row labels: Actual class; Column labels: Predicted class
| | cluster0 | cluster1 | cluster2 | Error | Rate |
|---|---|---|---|---|---|
| 0 | 138.0 | 1.0 | 14.0 | 0.098039 | 15 / 153 |
| 1 | 1.0 | 161.0 | 11.0 | 0.069364 | 12 / 173 |
| 2 | 6.0 | 6.0 | 260.0 | 0.044118 | 12 / 272 |
| 3 | 145.0 | 168.0 | 285.0 | 0.065217 | 39 / 598 |
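The Error and Rate columns of the confusion matrix, and the Mean Per-Class Error reported above it, can be reproduced from the counts alone. The following is a small sketch in plain Python; the counts are copied from the matrix, while the function and variable names are illustrative only:

```python
# Rows are actual classes, columns are predicted classes
# (counts copied from the confusion matrix in the model output).
labels = ["cluster0", "cluster1", "cluster2"]
confusion = [
    [138, 1, 14],
    [1, 161, 11],
    [6, 6, 260],
]


def per_class_error(matrix):
    """Share of each actual class (row) that was assigned to a different class."""
    return [(sum(row) - row[i]) / sum(row) for i, row in enumerate(matrix)]


errors = per_class_error(confusion)
total = sum(sum(row) for row in confusion)
misclassified = total - sum(confusion[i][i] for i in range(len(confusion)))
overall_error = misclassified / total
mean_per_class_error = sum(errors) / len(errors)

for label, err in zip(labels, errors):
    print(label, round(err, 6))
print("overall", misclassified, "/", total, round(overall_error, 6))
print("mean per-class error", mean_per_class_error)
```

The overall error weights every sample equally, whereas the mean per-class error weights every class equally, which is why the two values differ when the classes have different sizes.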
Top-3 Hit Ratios:
| | k | hit_ratio |
|---|---|---|
| 0 | 1 | 0.934783 |
| 1 | 2 | 1.000000 |
| 2 | 3 | 1.000000 |
Scoring History:
| | timestamp | duration | number_of_trees | training_rmse | training_logloss | training_classification_error |
|---|---|---|---|---|---|---|
| 0 | 2020-01-01 20:45:33 | 0.049 sec | 0.0 | NaN | NaN | NaN |
| 1 | 2020-01-01 20:45:34 | 0.383 sec | 1.0 | 0.359650 | 3.811475 | 0.117391 |
| 2 | 2020-01-01 20:45:34 | 0.483 sec | 2.0 | 0.342797 | 3.340081 | 0.105691 |
| 3 | 2020-01-01 20:45:34 | 0.515 sec | 3.0 | 0.330296 | 3.012446 | 0.089862 |
| 4 | 2020-01-01 20:45:34 | 0.562 sec | 4.0 | 0.320177 | 2.679887 | 0.089613 |
| 5 | 2020-01-01 20:45:34 | 0.587 sec | 5.0 | 0.298609 | 2.080400 | 0.087361 |
| 6 | 2020-01-01 20:45:34 | 0.622 sec | 6.0 | 0.281188 | 1.640286 | 0.083929 |
| 7 | 2020-01-01 20:45:34 | 0.653 sec | 7.0 | 0.278461 | 1.430675 | 0.086655 |
| 8 | 2020-01-01 20:45:34 | 0.682 sec | 8.0 | 0.269822 | 1.243377 | 0.090909 |
| 9 | 2020-01-01 20:45:34 | 0.703 sec | 9.0 | 0.263806 | 1.178969 | 0.087179 |
| 10 | 2020-01-01 20:45:34 | 0.731 sec | 10.0 | 0.250604 | 0.825163 | 0.078992 |
| 11 | 2020-01-01 20:45:34 | 0.753 sec | 11.0 | 0.242310 | 0.759343 | 0.068562 |
| 12 | 2020-01-01 20:45:34 | 0.783 sec | 12.0 | 0.239949 | 0.702918 | 0.070234 |
| 13 | 2020-01-01 20:45:34 | 0.803 sec | 13.0 | 0.233250 | 0.482001 | 0.070234 |
| 14 | 2020-01-01 20:45:34 | 0.833 sec | 14.0 | 0.229632 | 0.426821 | 0.061873 |
| 15 | 2020-01-01 20:45:34 | 0.863 sec | 15.0 | 0.231505 | 0.429770 | 0.063545 |
| 16 | 2020-01-01 20:45:34 | 0.890 sec | 16.0 | 0.229281 | 0.375294 | 0.066890 |
| 17 | 2020-01-01 20:45:34 | 0.919 sec | 17.0 | 0.229443 | 0.375982 | 0.068562 |
| 18 | 2020-01-01 20:45:34 | 0.949 sec | 18.0 | 0.229665 | 0.377334 | 0.068562 |
| 19 | 2020-01-01 20:45:34 | 0.974 sec | 19.0 | 0.230373 | 0.379523 | 0.070234 |
See the whole table with table.as_data_frame()
Variable Importances:
| | variable | relative_importance | scaled_importance | percentage |
|---|---|---|---|---|
| 0 | Average_speed | 3703.256836 | 1.000000 | 0.245570 |
| 1 | r_a | 2256.470947 | 0.609321 | 0.149631 |
| 2 | v_a | 1821.382812 | 0.491833 | 0.120779 |
| 3 | v_d | 1685.737915 | 0.455204 | 0.111785 |
| 4 | r_b | 1604.149536 | 0.433173 | 0.106374 |
| 5 | Average_RPM | 1018.616333 | 0.275060 | 0.067546 |
| 6 | v_c | 668.664001 | 0.180561 | 0.044340 |
| 7 | Variance_speed | 553.771790 | 0.149536 | 0.036722 |
| 8 | a_a | 523.651306 | 0.141403 | 0.034724 |
| 9 | v_b | 439.868347 | 0.118779 | 0.029169 |
| 10 | a_b | 200.154129 | 0.054048 | 0.013273 |
| 11 | r_c | 155.026993 | 0.041862 | 0.010280 |
| 12 | Variance_RPM | 142.054703 | 0.038359 | 0.009420 |
| 13 | a_c | 121.158333 | 0.032717 | 0.008034 |
| 14 | Average_ABS_Acceleration | 113.996506 | 0.030783 | 0.007559 |
| 15 | Variance_ABS_Acceleration | 72.286301 | 0.019520 | 0.004793 |
<bound method ModelBase.deepfeatures of >
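2. Build the model after feature selection, with all parameters left at their defaults
The eight most influential features are selected, and the data is reduced to these columns plus the target Catrgory. The source lists them as ordered from most to least influential; the set matches the top eight rows of the variable-importance output above, although the order there differs slightly (v_a and v_d rank above r_b and Average_RPM, and v_c above Variance_speed):
[['Average_speed', 'r_a', 'r_b', 'Average_RPM', 'v_a', 'v_d', 'Variance_speed', 'v_c', 'Catrgory']]
Accuracy: 0.8666666666666667 (unchanged)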
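Picking the top features by hand can be cross-checked with a few lines of Python. The sketch below assumes the importance rows are available as (variable, relative_importance, percentage) tuples; the values are copied from the Variable Importances output, and `top_k_features` is a hypothetical helper, not an H2O function:

```python
# (variable, relative_importance, percentage), copied from the Variable Importances output
varimp_rows = [
    ("Average_speed", 3703.256836, 0.245570),
    ("r_a", 2256.470947, 0.149631),
    ("v_a", 1821.382812, 0.120779),
    ("v_d", 1685.737915, 0.111785),
    ("r_b", 1604.149536, 0.106374),
    ("Average_RPM", 1018.616333, 0.067546),
    ("v_c", 668.664001, 0.044340),
    ("Variance_speed", 553.771790, 0.036722),
    ("a_a", 523.651306, 0.034724),
    ("v_b", 439.868347, 0.029169),
    ("a_b", 200.154129, 0.013273),
    ("r_c", 155.026993, 0.010280),
    ("Variance_RPM", 142.054703, 0.009420),
    ("a_c", 121.158333, 0.008034),
    ("Average_ABS_Acceleration", 113.996506, 0.007559),
    ("Variance_ABS_Acceleration", 72.286301, 0.004793),
]


def top_k_features(rows, k):
    """Names of the k variables with the largest relative importance."""
    ranked = sorted(rows, key=lambda row: row[1], reverse=True)
    return [row[0] for row in ranked[:k]]


selected = top_k_features(varimp_rows, 8)
# Share of the total importance carried by the selected variables
coverage = sum(row[2] for row in varimp_rows if row[0] in selected)
print(selected)
print(round(coverage, 4))
```

The eight selected variables carry roughly 88% of the total importance, which may help explain why dropping the remaining eight leaves the accuracy unchanged.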
# Keep the eight selected features plus the target column
train_features = train[['Average_speed', 'r_a', 'r_b', 'Average_RPM', 'v_a', 'v_d', 'Variance_speed', 'v_c', 'Catrgory']]
test_features = test[['Average_speed', 'r_a', 'r_b', 'Average_RPM', 'v_a', 'v_d', 'Variance_speed', 'v_c', 'Catrgory']]
model2 = H2ORandomForestEstimator()
model2.train(x=train_features.names[0:-1], y='Catrgory', training_frame=train_features)
drf Model Build progress: |███████████████████████████████████████████████| 100%
predict = model2.predict(test_features[test_features.names[0:-1]])
drf prediction progress: |████████████████████████████████████████████████| 100%
tmp = predict[predict['predict'] == test_features['Catrgory']].nrow
accuracy = tmp / test_features.nrow
accuracy
0.8666666666666667
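3. Tune the parameters and observe how the classification accuracy changes
3.1 Tune the parameters (ntrees and max_depth) with a for loop, record the maximum accuracy, and find the best parameters
Maximum accuracy: 0.894
ntrees: 5
max_depth: 9
The output of this part is too long to show; the best parameters (ntrees and max_depth) are taken from it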
max_accuracy = 0
ntrees = 0
max_depth = 0
# Try every combination of ntrees and max_depth in 1..19 and keep the most accurate one
for i in range(1, 20):
    for j in range(1, 20):
        model3 = H2ORandomForestEstimator(ntrees=i, max_depth=j)
        model3.train(x=train.names[0:-1], y='Catrgory', training_frame=train)
        predict = model3.predict(test[test.names[0:-1]])
        tmp = predict[predict['predict'] == test['Catrgory']].nrow
        accuracy = tmp / test.nrow
        print("now acc is:", accuracy, "--- max acc is :", max_accuracy)
        if max_accuracy < accuracy:
            max_accuracy = accuracy
            ntrees = i
            max_depth = j
print("max acc:", max_accuracy)
print("best ntrees:", ntrees)
print("best max_depth:", max_depth)
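The search pattern in the loop above can be separated from the model training. The sketch below is one possible restructuring, not the author's code: `best_params` is a hypothetical helper, and `toy_score` is an arbitrary stand-in for "train a forest and measure its accuracy", so that the example runs without an H2O cluster.

```python
from itertools import product


def best_params(evaluate, ntrees_values, max_depth_values):
    """Return (best_score, best_ntrees, best_max_depth) over every combination."""
    best = None
    for ntrees, max_depth in product(ntrees_values, max_depth_values):
        score = evaluate(ntrees, max_depth)
        # Strict '>' keeps the first combination that reaches the best score,
        # like the `if max_accuracy < accuracy` test in the original loop.
        if best is None or score > best[0]:
            best = (score, ntrees, max_depth)
    return best


def toy_score(ntrees, max_depth):
    """Arbitrary stand-in for model accuracy; peaks at ntrees=5, max_depth=9."""
    return 1.0 - 0.01 * abs(ntrees - 5) - 0.01 * abs(max_depth - 9)


print(best_params(toy_score, range(1, 20), range(1, 20)))
```

To use it with H2O, `evaluate` would wrap the train, predict, and accuracy steps from the loop. Because random forests are randomized, repeated runs may report different best values unless a seed is fixed.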
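# Retrain with fixed parameters; note that these hard-coded values differ from
# the best values reported above (ntrees=5, max_depth=9)
model3 = H2ORandomForestEstimator(ntrees=3, max_depth=6)
model3.train(x=train.names[0:-1], y='Catrgory', training_frame=train)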
drf Model Build progress: |███████████████████████████████████████████████| 100%
predict = model3.predict(test[test.names[0:-1]])
drf prediction progress: |████████████████████████████████████████████████| 100%
tmp = predict[predict['predict'] == test['Catrgory']].nrow
accuracy = tmp / test.nrow
accuracy
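Merge the test data with the predictions and save the combined dataset as predict.csv
# Append the predicted label as an extra column to the test frame
out = test.concat(predict['predict'])
h2o.download_csv(out, "predict.csv")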
'C:\\Users\\zzh\\Desktop\\dataMiningExperment\\exp4\\predict.csv'
3.2 Find the best parameters with Grid Search
Accuracy: 0.8708333333333333
ntrees: 10
max_depth: 10
# Hyperparameter grid: ntrees in 30..59 and max_depth in 10..19
rf_params = {'ntrees': list(range(30, 60)),
             'max_depth': list(range(10, 20))}
rf_grid = H2OGridSearch(model=H2ORandomForestEstimator,
                        hyper_params=rf_params)
rf_grid.train(x=train.names[0:-1],
              y='Catrgory',
              training_frame=train)
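A grid search trains one model per combination of the listed values, so the size of the grid determines both the running time and the length of the output. As a rough sketch in plain Python (no H2O required), the combinations implied by `rf_params` can be enumerated like this:

```python
from itertools import product

# Same value ranges as the rf_params dictionary above
rf_params = {
    "ntrees": list(range(30, 60)),
    "max_depth": list(range(10, 20)),
}

# Cartesian product of the two value lists: one entry per model to train
combinations = list(product(rf_params["ntrees"], rf_params["max_depth"]))
print(len(combinations))
print(combinations[0], combinations[-1])
```

With 30 values of ntrees and 10 values of max_depth, the grid contains 300 models, which is comparable to the 361 combinations tried by the for loop in 3.1.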
The output of this part is too long to show; the best parameters (ntrees and max_depth) are taken from it
rf_grid.show()
# Retrain with fixed parameters (hard-coded in the source) and evaluate on the test set
model4 = H2ORandomForestEstimator(ntrees=3, max_depth=6)
model4.train(x=train.names[0:-1], y='Catrgory', training_frame=train)
predict = model4.predict(test[test.names[0:-1]])
tmp = predict[predict['predict'] == test['Catrgory']].nrow
accuracy = tmp / test.nrow
accuracy