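Building a complete machine learning project with sklearn, using Boston house price prediction as the example (Part 3: tuning, finding the best parameters)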

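Preface

I had originally planned to introduce the principles behind GBDT and XGBoost, but as I read on I found that the blog material available online is incomplete and rather confusing.

I suggest reading Friedman's paper and Tianqi Chen's paper directly. Link: https://pan.baidu.com/s/14TmsZTorZmOAHEwZU5fiNA password: g2g1

Now let me start introducing the tuning steps in sklearn.

Suppose you now have a shortlist of a few promising models and need to fine-tune them. Let's look at the two fine-tuning methods that sklearn provides:

grid search and randomized search.

Because of training time, I mainly show tuning for XGBoost and random forest here; GBDT trains a bit too slowly.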

GridSearch

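For grid search, all you need to do is tell GridSearchCV which hyperparameters to experiment with and which values to try. GridSearchCV then uses cross-validation to evaluate every possible combination of those hyperparameter values.

For example, here is a grid search for a random forest: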

from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

param_grid = [
    # try 12 (3*4) combinations of hyperparameters
    {'n_estimators': [3, 10, 30], 'max_features': [2, 4, 6, 8]},
    # then try 6 (2*3) combinations with bootstrap set to False
    {'bootstrap': [False], 'n_estimators': [3, 10], 'max_features': [2, 3, 4]},
]

forest_reg = RandomForestRegressor(random_state=42)
# train across 5 folds, that's a total of (12+6)*5=90 rounds of training
grid_search = GridSearchCV(forest_reg, param_grid, cv=5,
                           scoring='neg_mean_squared_error', return_train_score=True)
grid_search.fit(housing_prepared, housing_labels)

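Let's take a closer look at the parameter list here:

With GridSearchCV, pay close attention to the difference between the list and dictionary formats: each dictionary is a separate grid whose values are fully combined, and a list of dictionaries searches those grids one after another.

With bootstrap at its default of True, the first dictionary gives 3*4 = 12 combinations; with bootstrap set to False, the second gives 2*3 = 6 combinations. With cv=5, that makes 18*5 = 90 training runs in total.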
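To make the list-versus-dictionary point concrete, here is a small sketch that only counts candidates with sklearn's ParameterGrid and trains nothing. The merged `single_dict` is a hypothetical grid added purely for contrast; it is not used anywhere in this post.

```python
from sklearn.model_selection import ParameterGrid

# The same list of two dictionaries used for the random forest above
param_grid = [
    {"n_estimators": [3, 10, 30], "max_features": [2, 4, 6, 8]},
    {"bootstrap": [False], "n_estimators": [3, 10], "max_features": [2, 3, 4]},
]

n_candidates = len(ParameterGrid(param_grid))  # 12 from the first dict + 6 from the second
n_fits = n_candidates * 5                      # one fit per candidate per fold (cv=5)

# Hypothetical contrast: a single dictionary is one full Cartesian product
single_dict = {
    "bootstrap": [True, False],
    "n_estimators": [3, 10, 30],
    "max_features": [2, 4, 6, 8],
}
n_single = len(ParameterGrid(single_dict))     # 2 * 3 * 4

print(n_candidates, n_fits, n_single)
```

A list of dictionaries is handy when some values only make sense together, since it avoids paying for the full product of every key.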

You also get some useful things back, such as the best estimator and the score of each cross-validation run.

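import numpy as np
from sklearn.model_selection import cross_val_score

grid_search.best_estimator_
grid_search.best_params_
# Then get the corresponding scores
housing_predictions = grid_search.predict(housing_prepared)
# Note: passing grid_search itself reruns the whole grid search inside each of the 5 folds
scores = cross_val_score(grid_search, housing_prepared, housing_labels.values,
                         scoring="neg_mean_squared_error", cv=5)
grid_score = np.sqrt(-scores)  # per-fold RMSE
display(grid_score)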
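A lighter-weight way to see how every candidate scored is the `cv_results_` attribute that a fitted GridSearchCV already stores. The sketch below is only an illustration: the synthetic data from `make_regression` and the tiny grid are assumptions, not the housing data used in this post.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data (assumption: not the housing set from this series)
X, y = make_regression(n_samples=200, n_features=8, noise=10.0, random_state=42)

small_grid = {"n_estimators": [5, 10], "max_features": [2, 4]}  # 4 candidates
search = GridSearchCV(RandomForestRegressor(random_state=42), small_grid,
                      cv=3, scoring="neg_mean_squared_error")
search.fit(X, y)

cvres = search.cv_results_
# mean_test_score holds the mean negative MSE per candidate; flip the sign and take the root
rmse_per_candidate = np.sqrt(-cvres["mean_test_score"])
for rmse, params in zip(rmse_per_candidate, cvres["params"]):
    print(f"{rmse:.2f}", params)

# best_score_ is the mean_test_score of the winning candidate
best_rmse = np.sqrt(-search.best_score_)
```

This reads the scores computed during the search itself, so nothing has to be refitted.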

Once the best parameters have been printed, we can plug them back into the original model and evaluate it:

rnd_clffited = RandomForestRegressor(n_estimators=30,max_features=6,random_state=42)

scores = cross_val_score(rnd_clffited,housing_prepared,housing_labels.values,scoring="neg_mean_squared_error", cv=5)
rnd_clfscore= np.sqrt(-scores)
display(rnd_clfscore)

Mean 50650.4403252
Score [ 49212.53293747 51332.95138125 52421.12147664 48815.91684412
51469.67898662]
Std 1393.22210178

Awkwardly, this does not seem much better than before.
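Instead of retyping the best values by hand, the `best_params_` dictionary can be unpacked straight into the constructor. A minimal sketch under the same assumptions as before (synthetic data and a tiny grid, chosen only so that it runs quickly):

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data (assumption: not the housing set from this series)
X, y = make_regression(n_samples=200, n_features=8, noise=10.0, random_state=42)

search = GridSearchCV(RandomForestRegressor(random_state=42),
                      {"n_estimators": [5, 10], "max_features": [2, 4]},
                      cv=3, scoring="neg_mean_squared_error")
search.fit(X, y)

# Unpack the winning hyperparameters instead of copying them manually
rebuilt = RandomForestRegressor(random_state=42, **search.best_params_)

# The rebuilt (unfitted) model has the same settings as the refitted best estimator
same_settings = rebuilt.get_params() == search.best_estimator_.get_params()
print(search.best_params_, same_settings)
```

Since GridSearchCV refits the best candidate on the full training set by default, `best_estimator_` can also be used directly when no fresh model is needed.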

Now here is how to tune a support vector machine:

from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVR

param_grid = [
    {'kernel': ['linear'], 'C': [10., 30., 100., 300., 1000., 3000., 10000., 30000.0]},
    {'kernel': ['rbf'], 'C': [1.0, 3.0, 10., 30., 100., 300., 1000.0],
     'gamma': [0.01, 0.03, 0.1, 0.3, 1.0, 3.0]},
]

svm_reg = SVR()
grid_search = GridSearchCV(svm_reg, param_grid, cv=5, scoring='neg_mean_squared_error', verbose=2, n_jobs=4)
grid_search.fit(housing_prepared, housing_labels)

The scoring argument here takes the name of one of the evaluation metrics from sklearn.metrics.
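The 'neg_mean_squared_error' name deserves a quick check, because sklearn scorers follow a "higher is better" convention and therefore return the negated MSE. A minimal sketch, assuming synthetic data and a plain LinearRegression as a stand-in model:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import get_scorer, mean_squared_error

# Synthetic stand-in data and model (assumptions for illustration only)
X, y = make_regression(n_samples=100, n_features=5, noise=5.0, random_state=0)
model = LinearRegression().fit(X, y)

# Look up the scorer by the same string that is passed to GridSearchCV
scorer = get_scorer("neg_mean_squared_error")
neg_mse = scorer(model, X, y)                  # negated, so always <= 0
mse = mean_squared_error(y, model.predict(X))

rmse = np.sqrt(-neg_mse)                       # why the code in this post uses np.sqrt(-scores)
print(neg_mse, mse, rmse)
```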

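Let's take a closer look at this parameter list:

The search tries the linear kernel and the RBF kernel separately, for example kernel='linear' with C=10, or kernel='rbf' with C=10 and gamma=0.1. Because gamma appears only in the rbf dictionary, the linear kernel contributes 8 combinations and the rbf kernel 7*6 = 42, for a total of (8 + 7*6)*5 = 250 training runs. The factor of 5 comes from cv=5 in grid_search.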

Randomized search

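When the hyperparameter search space is large, it is usually better to use RandomizedSearchCV. This class is used in much the same way as GridSearchCV, but instead of trying every possible combination, it evaluates a given number of random combinations, picking a random value for each hyperparameter at every iteration. This approach has several advantages:

If you let the random search run for, say, 1000 iterations, it explores 1000 different values for each hyperparameter (rather than just a few values per hyperparameter, as grid search does).
You can easily control the computational budget of the search simply by setting the number of iterations.
You can assume that a parameter follows a particular distribution, such as a uniform or normal distribution, and sample the hyperparameter from it.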

Here is an example:

from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import randint

param_distribs = {
    'n_estimators': randint(low=1, high=200),
    'max_features': randint(low=1, high=8),
}

forest_reg = RandomForestRegressor(random_state=42)
rnd_search = RandomizedSearchCV(forest_reg, param_distributions=param_distribs,
                                n_iter=10, cv=5, scoring='neg_mean_squared_error', random_state=42)
rnd_search.fit(housing_prepared, housing_labels)

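This assumes the hyperparameters follow the randint distribution, i.e. a discrete uniform distribution over integers.

Basically any distribution in scipy.stats can be used. Note that in RandomizedSearchCV the distributions are passed through the param_distributions argument (here param_distributions=param_distribs), and n_iter is the number of parameter settings sampled, which defaults to 10.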
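To see what sampling from distributions looks like without training anything, sklearn's ParameterSampler can draw the candidates on its own. This is only a sketch: the SVR-style parameter names and the ranges for C and gamma are assumptions picked for illustration, not tuned values from this post.

```python
from scipy.stats import loguniform, uniform
from sklearn.model_selection import ParameterSampler

# Hypothetical search space: a list is sampled uniformly, distributions via their rvs method
param_distribs = {
    "kernel": ["linear", "rbf"],
    "C": loguniform(1e0, 1e3),             # log-uniform on [1, 1000]
    "gamma": uniform(loc=0.01, scale=2.99),  # uniform on [0.01, 3.0]
}

# n_iter plays the same role as in RandomizedSearchCV: how many settings to draw
samples = list(ParameterSampler(param_distribs, n_iter=5, random_state=42))
for s in samples:
    print(s)
```

A log-uniform distribution is a common choice for scale-like parameters such as C, because it spreads the samples evenly across orders of magnitude.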

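Everything else works the same way as grid search. As for what each parameter means, you still need to understand how each algorithm works and how it is configured in sklearn.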

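TreeBoost parameter-setting guide:

This part mainly follows Han Xiaoyang's blog posts. The main methods are still grid search and randomized search, but the parameters to set differ from model to model.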

https://blog.csdn.net/han_xiaoyang/article/details/52665396

https://blog.csdn.net/han_xiaoyang/article/details/52663170

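These two articles cover essentially all of the tree-boosting parameters. If you understand how GBM works, their meaning becomes clear quickly. Of course, there is no need to memorize this many parameters; just look them up when you need them.

Here is an arbitrary example. Um, it does take a while to compute...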

from sklearn.model_selection import GridSearchCV

# xgb_reg is assumed to be the XGBoost regressor defined earlier in this series
# 5*3*3*3*4*4 = 2160 combinations, times cv=3 gives 6480 training runs
param_grid = {'n_estimators': [300, 400, 500, 600, 700], 'gamma': [0.1, 0.2, 0.3], 'subsample': [0.6, 0.7, 0.8], 'reg_lambda': [0.05, 0.1, 1],
              'colsample_bytree': [0.6, 0.7, 0.8, 0.9], 'learning_rate': [0.01, 0.05, 0.07, 0.1]}
grid_search = GridSearchCV(xgb_reg, param_grid=param_grid, cv=3,
                           scoring='neg_mean_squared_error', return_train_score=True)
grid_search.fit(housing_prepared, housing_labels)
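The grid above multiplies out to 5*3*3*3*4*4 = 2160 combinations, so the long runtime is no surprise. One way to cap the cost is to sample the same kind of space with RandomizedSearchCV. The sketch below is an assumption-heavy illustration: it swaps in sklearn's GradientBoostingRegressor for XGBoost so that it runs without extra dependencies, uses synthetic data, and the ranges are arbitrary.

```python
from scipy.stats import randint, uniform
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in data (assumption: not the housing set from this series)
X, y = make_regression(n_samples=200, n_features=8, noise=10.0, random_state=42)

# Hypothetical ranges, kept small so the sketch finishes quickly
param_distribs = {
    "n_estimators": randint(low=20, high=80),        # integers 20..79
    "learning_rate": uniform(loc=0.01, scale=0.09),  # uniform on [0.01, 0.10]
    "subsample": uniform(loc=0.6, scale=0.2),        # uniform on [0.6, 0.8]
}

gb_search = RandomizedSearchCV(
    GradientBoostingRegressor(random_state=42),  # stand-in for the XGBoost regressor
    param_distributions=param_distribs,
    n_iter=5, cv=3, scoring="neg_mean_squared_error", random_state=42,
)
gb_search.fit(X, y)

# The budget is fixed up front: n_iter candidates times cv folds
n_fits = len(gb_search.cv_results_["params"]) * gb_search.n_splits_
print(n_fits, gb_search.best_params_)
```

The trade-off is that a random search may miss the exact best grid point, but the number of fits no longer grows with every value added to the space.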

It should run without problems, though. And with that, our complete machine learning project is finished!