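The previous two posts covered installing the tpot library under Anaconda and using TPOT on a classification example. This post covers a regression example.
Using TPOT to automatically select scikit-learn machine learning models and parameters: classification example
Environment: Win10 + PyCharm + Anaconda
Dataset: the Boston housing price dataset bundled with sklearn
Code: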
'''
Regression: predicting Boston house prices
'''
from tpot import TPOTRegressor
import pandas as pd
import numpy as np
# Note: load_boston was removed in scikit-learn 1.2, so this import needs an older scikit-learn
from sklearn.datasets import load_boston
from sklearn.model_selection import train_test_split
housing = load_boston()
X_train, X_test, y_train, y_test = train_test_split(housing.data, housing.target)
# ,train_size=0.75, test_size=0.25)
tpot = TPOTRegressor(generations=20, verbosity=2)  # run 20 generations
tpot.fit(X_train, y_train)
print(tpot.score(X_test, y_test))
tpot.export('pipeline.py')
Output:
Best pipeline: XGBRegressor(RidgeCV(input_matrix), learning_rate=0.1, max_depth=5, min_child_weight=2, n_estimators=100, nthread=1, subsample=0.8)
As you can see, the pipeline TPOT chose is built around XGBRegressor, with RidgeCV applied to the input first.
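As far as I understand TPOT's notation, the nested form `XGBRegressor(RidgeCV(input_matrix), ...)` means RidgeCV is fitted first and its predictions are added to the input as an extra feature before XGBRegressor sees the data. The sketch below illustrates that idea using only scikit-learn. Everything in it is an assumption made for illustration: `PredictionAsFeature` is a hypothetical helper (not TPOT's own class), `GradientBoostingRegressor` stands in for `XGBRegressor`, and the data is synthetic. The exported pipeline.py contains the exact construction TPOT used.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin, clone
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.linear_model import RidgeCV
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline


class PredictionAsFeature(TransformerMixin, BaseEstimator):
    """Hypothetical helper: fit an estimator and add its predictions as a feature column."""

    def __init__(self, estimator):
        self.estimator = estimator

    def fit(self, X, y):
        # Fit a fresh copy so the estimator passed in stays untouched
        self.estimator_ = clone(self.estimator).fit(X, y)
        return self

    def transform(self, X):
        # Put the first-stage predictions in front of the original features
        pred = self.estimator_.predict(X).reshape(-1, 1)
        return np.hstack([pred, X])


# Synthetic stand-in data with 13 features (assumption: not the Boston dataset)
X, y = make_regression(n_samples=300, n_features=13, noise=10.0, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Stage 1: RidgeCV predictions become a feature; stage 2: a boosted-tree regressor
stacked = make_pipeline(
    PredictionAsFeature(RidgeCV()),
    GradientBoostingRegressor(learning_rate=0.1, max_depth=5, n_estimators=100,
                              subsample=0.8, random_state=42),
)
stacked.fit(X_train, y_train)
print("R^2 on the held-out split:", stacked.score(X_test, y_test))
```

A model that keeps only the XGBRegressor step and drops the RidgeCV stage is not quite the same pipeline TPOT scored, so its result can differ from the one TPOT reported.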
Prediction code:
import pandas as pd
import numpy as np
from sklearn.datasets import load_boston
from sklearn.model_selection import train_test_split
import xgboost as xgb
housing=load_boston()
#print(housing)
da=pd.DataFrame(housing.data)
da.columns =housing.feature_names
#print(da.head())
ta=pd.DataFrame(housing.target)
ta.columns=['target']
#print(ta.head())
boston=pd.concat([da,ta],axis=1)  # remember: axis=0 operates along the index (rows); axis=1 operates along the columns
#print(boston.head())
featuress=np.array(boston.drop(['target'],axis=1))
target=np.array(boston['target'])
#print(featuress)
train_features,test_feratures,train_target,test_target=train_test_split(featuress,target,random_state=42)
xgbr=xgb.XGBRegressor(learning_rate=0.1, max_depth=5, min_child_weight=2,
                      n_estimators=100, nthread=1, subsample=0.8)
xgbr.fit(train_features,train_target)
result=xgbr.score(test_feratures,test_target)
print("xgbr_result: %s"%result)
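Output:
This result is not very good.
Question:
Earlier I ran TPOT for 5 and for 10 generations and printed the resulting models. I don't understand why the CV score in the output is negative.
5 generations:
10 generations:
Below is the prediction code based on the model from the 10-generation run: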
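from sklearn.ensemble import GradientBoostingRegressor
# reuses train_features, test_feratures, train_target and test_target from the prediction code above
gdbt = GradientBoostingRegressor(alpha=0.9, learning_rate=0.1, loss='huber',
                                 max_depth=7, max_features=0.4,
                                 min_samples_leaf=3, min_samples_split=8,
                                 n_estimators=100, subsample=0.9000000000000001)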
gdbt.fit(train_features,train_target)
#restlt=gdbt.predict(test_feratures)
restlt2=gdbt.score(test_feratures,test_target)
print("gdbt_result: %s"%restlt2)
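Output:
As you can see, the model from the 20-generation run predicts slightly better. Still, 20 generations is far from enough: TPOT's default is 100 generations. In theory, the more generations you run, the better the model, especially on larger datasets, but it also takes a lot of time.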
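To get a feel for why more generations cost so much time, here is a rough back-of-the-envelope sketch. It assumes the rule given in the TPOT documentation (population_size + generations × offspring_size pipelines evaluated in total) together with TPOT's defaults of population_size=100, offspring_size equal to population_size, and 5-fold cross-validation. The function name `pipelines_evaluated` is my own, and the numbers are only estimates.

```python
def pipelines_evaluated(generations, population_size=100, offspring_size=None):
    """Rough count of pipelines TPOT evaluates: initial population plus offspring per generation."""
    if offspring_size is None:
        # Assumed default: as many offspring per generation as the population size
        offspring_size = population_size
    return population_size + generations * offspring_size


cv_folds = 5  # assumed default number of cross-validation folds
for g in (5, 10, 20, 100):
    n = pipelines_evaluated(g)
    print(f"{g:>3} generations: about {n} pipelines, roughly {n * cv_folds} model fits")
```

Under these assumptions, going from 20 to the default 100 generations raises the estimate from about 2,100 to about 10,100 pipelines, which is why a full run can take hours even on a small dataset.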
Question: why is the CV score negative?
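A likely explanation: TPOTRegressor's default scoring is `neg_mean_squared_error`, and scikit-learn scorers follow a "greater is better" convention, so the mean squared error is reported with its sign flipped. The minimal sketch below reproduces the effect with plain scikit-learn. The synthetic data and the `Ridge` model are assumptions chosen only to keep the example self-contained; they are not the data or the models from this post.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data (assumption: not the Boston dataset used in the post)
X, y = make_regression(n_samples=200, n_features=13, noise=10.0, random_state=42)

# With this scorer, each fold's MSE is returned negated so that larger means better
scores = cross_val_score(Ridge(), X, y, cv=5, scoring="neg_mean_squared_error")
print("negated MSE per fold:", scores)
print("mean MSE:", -scores.mean())  # flip the sign to get the usual MSE
```

So a CV score closer to zero means a smaller error. It also suggests that the value printed by tpot.score is on a different scale from the R² returned by xgbr.score or gdbt.score, so the two should not be compared directly.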
References: