In the previous post (https://blog.csdn.net/baymax_007/article/details/82775180), we simply applied logistic regression, decision tree, random forest, multilayer perceptron classifier, xgboost, and naive Bayes to classify news articles. In practice, however, a model's performance depends on its initial parameters, and choosing suitable values can noticeably improve classification results.
Cross-validation is a common method for parameter tuning. CrossValidator splits the dataset into several folds that are used in turn for training and testing. For example, with k = 3, CrossValidator generates 3 (training, test) pairs, each using 2/3 of the data for training and 1/3 for testing. For a given parameter map, CrossValidator averages the evaluation metric of the models trained on those 3 pairs. Once the best parameter map is identified, CrossValidator re-fits the estimator on the full dataset with that map.
Spark ML provides ParamGridBuilder to construct the parameter grid to search; a CrossValidator instance is then configured with that grid and the pipeline to be tuned.
I. Experiment Design
1. Feature-extraction parameters
CountVectorizer converts the segmented keywords into numeric term-frequency features. Its two tunable parameters are vocabSize, the maximum vocabulary size, and minDF, the minimum number of distinct documents a term must appear in to be included in the vocabulary.
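As a minimal sketch of how these two parameters are set (the input and output column names here are placeholders, not necessarily the ones used elsewhere in this post):

```scala
import org.apache.spark.ml.feature.CountVectorizer

// A term enters the vocabulary only if it appears in at least minDF
// documents; of the surviving terms, the vocabSize most frequent are kept.
val vectorizer = new CountVectorizer()
  .setInputCol("words")      // assumed: column of segmented keywords
  .setOutputCol("features")
  .setVocabSize(1 << 18)     // keep at most 262144 vocabulary terms
  .setMinDF(2.0)             // a term must occur in at least 2 documents
```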
2. Classification-model parameters
2.1 Logistic regression
regParam: regularization strength, guards against overfitting
elasticNetParam: ElasticNet mixing ratio between the L1 and L2 penalties
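A minimal sketch of setting these two knobs explicitly (column names and values are illustrative):

```scala
import org.apache.spark.ml.classification.LogisticRegression

// regParam scales the overall penalty; elasticNetParam mixes L1 and L2:
// 0.0 is pure L2 (ridge), 1.0 is pure L1 (lasso), values between blend both.
val lr = new LogisticRegression()
  .setLabelCol("label")
  .setFeaturesCol("features")
  .setRegParam(0.01)
  .setElasticNetParam(0.1)
```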
2.2 Decision tree
impurity: impurity measure, either "entropy" or "gini"
maxDepth: maximum tree depth
maxBins: maximum number of bins for discretizing continuous features
minInfoGain: minimum information gain required for a node to split
minInstancesPerNode: minimum number of samples each node must contain
2.3 Random forest
impurity: impurity measure, either "entropy" or "gini"
maxDepth: maximum tree depth
maxBins: maximum number of bins for discretizing continuous features
minInfoGain: minimum information gain required for a node to split
minInstancesPerNode: minimum number of samples each node must contain
numTrees: number of trees trained on bootstrap samples
subsamplingRate: fraction of the data sampled for training each tree
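The tree and forest parameters above map directly onto setters of the classifier; a sketch with illustrative values:

```scala
import org.apache.spark.ml.classification.RandomForestClassifier

val rf = new RandomForestClassifier()
  .setLabelCol("label")
  .setFeaturesCol("features")
  .setImpurity("gini")        // or "entropy"
  .setMaxDepth(10)
  .setMaxBins(32)
  .setMinInfoGain(0.01)       // a split must gain at least this much
  .setMinInstancesPerNode(5)  // each child must keep >= 5 samples
  .setNumTrees(50)            // trees grown on bootstrap samples
  .setSubsamplingRate(0.2)    // fraction of the data sampled per tree
```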
2.4 XGBoost
eta: learning rate, default 0.3
maxDepth: maximum tree depth
numRound: number of boosting rounds
alpha: L1 regularization weight
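A sketch of these parameters on XGBoostClassifier, assuming the xgboost4j-spark API used in the code later in this post (values are illustrative):

```scala
import ml.dmlc.xgboost4j.scala.spark.XGBoostClassifier

val xgb = new XGBoostClassifier()
  .setEta(0.1)        // learning rate; xgboost's default is 0.3
  .setMaxDepth(6)     // maximum tree depth
  .setNumRound(100)   // number of boosting rounds
  .setAlpha(0.0)      // L1 regularization weight
```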
2.5 Naive Bayes
smoothing: additive smoothing parameter
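smoothing is the additive (Laplace/Lidstone) smoothing term, which keeps unseen term/class pairs from receiving zero probability; a minimal sketch:

```scala
import org.apache.spark.ml.classification.NaiveBayes

val nvb = new NaiveBayes()
  .setSmoothing(0.5)   // additive smoothing; Spark's default is 1.0
```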
3. Cross-validation
The ml package provides a cross-validation API for parameter search: ParamGridBuilder's addGrid calls build the parameter grid over the pipeline, and CrossValidator searches that grid for the best combination. Note that the search is expensive: CrossValidator fits numFolds × (number of grid points) pipelines, e.g. a grid of 4 parameters with 2 candidate values each at numFolds = 2 means 16 × 2 = 32 fits.
4. Overall design
II. Code Implementation
1. Logistic regression tuning
1.1 Code
val lrStartTime = new Date().getTime
val lrPipeline = new Pipeline()
  .setStages(Array(indexer, segmenter, remover, vectorizer, lr, converts))
// Parameter grid and cross-validator setup
val lrParamGrid = new ParamGridBuilder()
  .addGrid(vectorizer.vocabSize, Array(1 << 10, 1 << 18))
  .addGrid(vectorizer.minDF, Array(1.0, 2.0))
  .addGrid(lr.regParam, Array(0.1, 0.01))
  .addGrid(lr.elasticNetParam, Array(0.1, 0.0))
  .build()
val lrCv = new CrossValidator()
  .setEstimator(lrPipeline)
  .setEvaluator(new MulticlassClassificationEvaluator)
  .setEstimatorParamMaps(lrParamGrid)
  .setNumFolds(2)
  .setParallelism(4)
val lrModel = lrCv.fit(train)
val lrValiad = lrModel.transform(train)
val lrPredictions = lrModel.transform(test)
val evaluator = new MulticlassClassificationEvaluator()
  .setLabelCol("label")
  .setPredictionCol("prediction")
  .setMetricName("accuracy")
val accuracyLrt = evaluator.evaluate(lrValiad)
println(s"Logistic regression validation-set accuracy = $accuracyLrt")
val accuracyLrv = evaluator.evaluate(lrPredictions)
println(s"Logistic regression test-set accuracy = $accuracyLrv")
val lrEndTime = new Date().getTime
// Average elapsed time per parameter combination (ms)
val lrCostTime = (lrEndTime - lrStartTime) / lrParamGrid.length
println(s"Logistic regression time cost: $lrCostTime")
// Retrieve the best model and its parameters
val bestLrModel = lrModel.bestModel.asInstanceOf[PipelineModel]
val bestLrVectorizer = bestLrModel.stages(3).asInstanceOf[CountVectorizerModel]
val blvv = bestLrVectorizer.getVocabSize
val blvm = bestLrVectorizer.getMinDF
val bestLr = bestLrModel.stages(4).asInstanceOf[LogisticRegressionModel]
val blr = bestLr.getRegParam
val ble = bestLr.getElasticNetParam
println(s"Best CountVectorizer parameters:\nvocabSize = $blvv, minDF = $blvm\nBest logistic regression parameters:\nregParam = $blr, elasticNetParam = $ble")
1.2 Tuning results
Logistic regression validation-set accuracy = 1.0
Logistic regression test-set accuracy = 0.8187134502923976
Logistic regression time cost: 17404
Best CountVectorizer parameters:
vocabSize = 262144, minDF = 2.0
Best logistic regression parameters:
regParam = 0.01, elasticNetParam = 0.1
2. Decision tree tuning
2.1 Code
val dtStartTime = new Date().getTime
val dt = new DecisionTreeClassifier()
  .setLabelCol("label")
  .setFeaturesCol("features")
  .setSeed(123456L)
val dtPipeline = new Pipeline()
  .setStages(Array(indexer, segmenter, remover, vectorizer, dt, converts))
// Parameter grid and cross-validator setup
val dtParamGrid = new ParamGridBuilder()
  .addGrid(vectorizer.vocabSize, Array(1 << 10, 1 << 18))
  .addGrid(vectorizer.minDF, Array(1.0, 2.0))
  .addGrid(dt.impurity, Array("entropy", "gini"))
  .addGrid(dt.maxDepth, Array(5, 10))
  .addGrid(dt.maxBins, Array(32, 500))
  .addGrid(dt.minInfoGain, Array(0.1, 0.01))
  .addGrid(dt.minInstancesPerNode, Array(5, 10))
  .build()
val dtCv = new CrossValidator()
  .setEstimator(dtPipeline)
  .setEvaluator(new MulticlassClassificationEvaluator)
  .setEstimatorParamMaps(dtParamGrid)
  .setNumFolds(2)
  .setParallelism(7)
val dtModel = dtCv.fit(train)
val dtValiad = dtModel.transform(train)
val dtPredictions = dtModel.transform(test)
val accuracyDtt = evaluator.evaluate(dtValiad)
println(s"Decision tree validation-set accuracy = $accuracyDtt")
val accuracyDtv = evaluator.evaluate(dtPredictions)
println(s"Decision tree test-set accuracy = $accuracyDtv")
val dtEndTime = new Date().getTime
// Average elapsed time per parameter combination (ms)
val dtCostTime = (dtEndTime - dtStartTime) / dtParamGrid.length
println(s"Decision tree time cost: $dtCostTime")
// Retrieve the best model and its parameters
val bestDtModel = dtModel.bestModel.asInstanceOf[PipelineModel]
val bestDtVectorizer = bestDtModel.stages(3).asInstanceOf[CountVectorizerModel]
val bdvv = bestDtVectorizer.getVocabSize
val bdvm = bestDtVectorizer.getMinDF
val bestDt = bestDtModel.stages(4).asInstanceOf[DecisionTreeClassificationModel]
val bdi = bestDt.getImpurity
val bdmd = bestDt.getMaxDepth
val bdmb = bestDt.getMaxBins
val bdmig = bestDt.getMinInfoGain
val bdmipn = bestDt.getMinInstancesPerNode
println(s"Best CountVectorizer parameters:\nvocabSize = $bdvv, minDF = $bdvm\nBest decision tree parameters:\nimpurity = $bdi, maxDepth = $bdmd, maxBins = $bdmb, minInfoGain = $bdmig, minInstancesPerNode = $bdmipn")
2.2 Tuning results
Decision tree validation-set accuracy = 0.823021582733813
Decision tree test-set accuracy = 0.7894736842105263
Decision tree time cost: 17698
Best CountVectorizer parameters:
vocabSize = 262144, minDF = 1.0
Best decision tree parameters:
impurity = entropy, maxDepth = 10, maxBins = 32, minInfoGain = 0.1, minInstancesPerNode = 5
3. Random forest tuning
3.1 Code
val rfStartTime = new Date().getTime
val rf = new RandomForestClassifier()
  .setLabelCol("label")
  .setFeaturesCol("features")
  .setSeed(123456L)
val rfPipeline = new Pipeline()
  .setStages(Array(indexer, segmenter, remover, vectorizer, rf, converts))
// Parameter grid and cross-validator setup
val rfParamGrid = new ParamGridBuilder()
  .addGrid(vectorizer.vocabSize, Array(1 << 10, 1 << 18))
  .addGrid(vectorizer.minDF, Array(1.0, 2.0))
  .addGrid(rf.impurity, Array("entropy", "gini"))
  .addGrid(rf.maxDepth, Array(5, 10))
  .addGrid(rf.maxBins, Array(32, 500))
  .addGrid(rf.minInfoGain, Array(0.1, 0.01))
  .addGrid(rf.minInstancesPerNode, Array(5, 10))
  .addGrid(rf.numTrees, Array(20, 50))
  .addGrid(rf.subsamplingRate, Array(0.2, 0.1))
  .build()
val rfCv = new CrossValidator()
  .setEstimator(rfPipeline)
  .setEvaluator(new MulticlassClassificationEvaluator)
  .setEstimatorParamMaps(rfParamGrid)
  .setNumFolds(2)
  .setParallelism(9)
val rfModel = rfCv.fit(train)
val rfValiad = rfModel.transform(train)
val rfPredictions = rfModel.transform(test)
val accuracyRft = evaluator.evaluate(rfValiad)
println(s"Random forest validation-set accuracy = $accuracyRft")
val accuracyRfv = evaluator.evaluate(rfPredictions)
println(s"Random forest test-set accuracy = $accuracyRfv")
val rfEndTime = new Date().getTime
// Average elapsed time per parameter combination (ms)
val rfCostTime = (rfEndTime - rfStartTime) / rfParamGrid.length
println(s"Random forest time cost: $rfCostTime")
// Retrieve the best model and its parameters
val bestRfModel = rfModel.bestModel.asInstanceOf[PipelineModel]
val bestRfVectorizer = bestRfModel.stages(3).asInstanceOf[CountVectorizerModel]
val brvv = bestRfVectorizer.getVocabSize
val brvm = bestRfVectorizer.getMinDF
val bestRf = bestRfModel.stages(4).asInstanceOf[RandomForestClassificationModel]
val bri = bestRf.getImpurity
val brmd = bestRf.getMaxDepth
val brmb = bestRf.getMaxBins
val brmig = bestRf.getMinInfoGain
val brmipn = bestRf.getMinInstancesPerNode
val brnt = bestRf.getNumTrees
val brsr = bestRf.getSubsamplingRate
println(s"Best CountVectorizer parameters:\nvocabSize = $brvv, minDF = $brvm\nBest random forest parameters:\nimpurity = $bri, maxDepth = $brmd, maxBins = $brmb, minInfoGain = $brmig, minInstancesPerNode = $brmipn, numTrees = $brnt, subsamplingRate = $brsr")
3.2 Tuning results
Random forest validation-set accuracy = 0.9510791366906475
Random forest test-set accuracy = 0.8140350877192983
Random forest time cost: 15715
Best CountVectorizer parameters:
vocabSize = 1024, minDF = 2.0
Best random forest parameters:
impurity = gini, maxDepth = 10, maxBins = 32, minInfoGain = 0.01, minInstancesPerNode = 5, numTrees = 50, subsamplingRate = 0.2
4. XGBoost tuning
4.1 Code
val xgbStartTime = new Date().getTime
val xgb = new XGBoostClassifier()
  .setObjective("multi:softprob")
  .setNumClass(outputLayers)
  .setFeaturesCol("features")
  .setLabelCol("label")
  .setNumWorkers(1)
val xgbPipeline = new Pipeline()
  .setStages(Array(indexer, segmenter, remover, vectorizer, xgb, converts))
// Parameter grid and cross-validator setup
val xgbParamGrid = new ParamGridBuilder()
  .addGrid(vectorizer.vocabSize, Array(1 << 10, 1 << 18))
  .addGrid(vectorizer.minDF, Array(1.0, 2.0))
  .addGrid(xgb.eta, Array(0.3, 0.1))
  .addGrid(xgb.maxDepth, Array(6, 10))
  .addGrid(xgb.numRound, Array(10, 100))
  .addGrid(xgb.alpha, Array(0.1, 0.0))
  .build()
val xgbCv = new CrossValidator()
  .setEstimator(xgbPipeline)
  .setEvaluator(new MulticlassClassificationEvaluator)
  .setEstimatorParamMaps(xgbParamGrid)
  .setNumFolds(2)
  .setParallelism(6)
val xgbModel = xgbCv.fit(train)
val xgbValiad = xgbModel.transform(train)
val xgbPredictions = xgbModel.transform(test)
val accuracyXgbt = evaluator.evaluate(xgbValiad)
println(s"xgboost validation-set accuracy = $accuracyXgbt")
val accuracyXgbv = evaluator.evaluate(xgbPredictions)
println(s"xgboost test-set accuracy = $accuracyXgbv")
val xgbEndTime = new Date().getTime
// Average elapsed time per parameter combination (ms)
val xgbCostTime = (xgbEndTime - xgbStartTime) / xgbParamGrid.length
println(s"xgboost time cost: $xgbCostTime")
// Retrieve the best model and its parameters
val bestXgbModel = xgbModel.bestModel.asInstanceOf[PipelineModel]
val bestXgbVectorizer = bestXgbModel.stages(3).asInstanceOf[CountVectorizerModel]
val bxvv = bestXgbVectorizer.getVocabSize
val bxvm = bestXgbVectorizer.getMinDF
val bestXgb = bestXgbModel.stages(4).asInstanceOf[XGBoostClassificationModel]
val bxe = bestXgb.getEta
val bxmd = bestXgb.getMaxDepth
val bxnr = bestXgb.getNumRound
val bxa = bestXgb.getAlpha
println(s"Best CountVectorizer parameters:\nvocabSize = $bxvv, minDF = $bxvm\nBest XGBoost parameters:\neta = $bxe, maxDepth = $bxmd, numRound = $bxnr, alpha = $bxa")
4.2 Tuning results
xgboost validation-set accuracy = 1.0
xgboost test-set accuracy = 0.8654970760233918
xgboost time cost: 32023
Best CountVectorizer parameters:
vocabSize = 262144, minDF = 2.0
Best XGBoost parameters:
eta = 0.1, maxDepth = 6, numRound = 100, alpha = 0.0
5. Naive Bayes tuning
5.1 Code
val nvbStartTime = new Date().getTime
val nvb = new NaiveBayes()
val nvbPipeline = new Pipeline()
  .setStages(Array(indexer, segmenter, remover, vectorizer, nvb, converts))
// Parameter grid and cross-validator setup
val nvbParamGrid = new ParamGridBuilder()
  .addGrid(vectorizer.vocabSize, Array(1 << 10, 1 << 18))
  .addGrid(vectorizer.minDF, Array(1.0, 2.0))
  .addGrid(nvb.smoothing, Array(1.0, 0.5))
  .build()
val nvbCv = new CrossValidator()
  .setEstimator(nvbPipeline)
  .setEvaluator(new MulticlassClassificationEvaluator)
  .setEstimatorParamMaps(nvbParamGrid)
  .setNumFolds(2)
  .setParallelism(3)
val nvbModel = nvbCv.fit(train)
val nvbValiad = nvbModel.transform(train)
val nvbPredictions = nvbModel.transform(test)
val accuracyNvbt = evaluator.evaluate(nvbValiad)
println(s"Naive Bayes validation-set accuracy = $accuracyNvbt")
val accuracyNvbv = evaluator.evaluate(nvbPredictions)
println(s"Naive Bayes test-set accuracy = $accuracyNvbv")
val nvbEndTime = new Date().getTime
// Average elapsed time per parameter combination (ms)
val nvbCostTime = (nvbEndTime - nvbStartTime) / nvbParamGrid.length
println(s"Naive Bayes time cost: $nvbCostTime")
// Retrieve the best model and its parameters
val bestNvbModel = nvbModel.bestModel.asInstanceOf[PipelineModel]
val bestNvbVectorizer = bestNvbModel.stages(3).asInstanceOf[CountVectorizerModel]
val bnvv = bestNvbVectorizer.getVocabSize
val bnvm = bestNvbVectorizer.getMinDF
val bestNvb = bestNvbModel.stages(4).asInstanceOf[NaiveBayesModel]
val bns = bestNvb.getSmoothing
println(s"Best CountVectorizer parameters:\nvocabSize = $bnvv, minDF = $bnvm\nBest Naive Bayes parameters:\nsmoothing = $bns")
5.2 Tuning results
Naive Bayes validation-set accuracy = 0.9280575539568345
Naive Bayes test-set accuracy = 0.7192982456140351
Naive Bayes time cost: 10987
Best CountVectorizer parameters:
vocabSize = 262144, minDF = 2.0
Best Naive Bayes parameters:
smoothing = 0.5
III. Performance Comparison of the Tuned Models
Cross-validation improves classification accuracy, as summarized in the table below. After tuning, XGBoost achieves the best accuracy on both the validation and test sets, although its runtime actually increased. Logistic regression, decision tree, random forest, and naive Bayes all gained some validation and test accuracy after tuning, while their runtimes dropped.
| Model | | Validation accuracy | Test accuracy | Time (ms) |
| --- | --- | --- | --- | --- |
| Logistic regression | before tuning | 100% | 79.53% | 44697 |
| | after tuning | 100% | 81.87% | 17404 |
| Decision tree | before tuning | 81.58% | 73.68% | 34597 |
| | after tuning | 82.30% | 78.94% | 17698 |
| Random forest | before tuning | 94.24% | 73.68% | 56608 |
| | after tuning | 95.10% | 81.40% | 15715 |
| XGBoost | before tuning | 99.71% | 79.53% | 31947 |
| | after tuning | 100% | 86.54% | 32023 |
| Naive Bayes | before tuning | 83.74% | 71.34% | 11510 |
| | after tuning | 92.80% | 71.92% | 10987 |
References
1. https://blog.csdn.net/baymax_007/article/details/82775180