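1 Parameters
sklearn's LinearRegression has a normalize parameter that rescales the features before fitting. (The parameter was deprecated in scikit-learn 1.0 and removed in 1.2, so the snippet below only runs on older versions.)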
from sklearn.linear_model import LinearRegression
model = LinearRegression(normalize=True)
From the documentation:
normalize : bool, default=False
This parameter is ignored when fit_intercept is set to False. If True, the regressors X will be normalized before regression by subtracting the mean and dividing by the l2-norm. If you wish to standardize, please use StandardScaler before calling fit on an estimator with normalize=False.
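Note that normalize and standardize are easy to confuse (both are commonly rendered as 标准化 in Chinese), but they are not the same operation. Both subtract the mean; normalize then divides by the l2-norm of the centered column, whereas StandardScaler divides by the standard deviation.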
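The difference between the two scalings can be checked by hand. The sketch below uses only the standard library and a made-up feature column x (an assumption for illustration); it does not call scikit-learn, it only reproduces the two formulas described in the documentation quote.

```python
import math

# Hypothetical feature column, made up for illustration.
x = [1.0, 2.0, 4.0, 7.0, 11.0]
n = len(x)
mean = sum(x) / n
centered = [v - mean for v in x]

# "normalize": subtract the mean, then divide by the l2-norm of the centered column.
l2_norm = math.sqrt(sum(c * c for c in centered))
normalized = [c / l2_norm for c in centered]

# "standardize": subtract the mean, then divide by the population standard
# deviation (the convention StandardScaler follows).
std = math.sqrt(sum(c * c for c in centered) / n)
standardized = [c / std for c in centered]

# The two results differ only by a constant factor of sqrt(n).
ratios = [s / z for s, z in zip(standardized, normalized)]
print(ratios[0], math.sqrt(n))
```

Because the two scalings differ only by the constant factor sqrt(n), they lead to the same fitted predictions in an unregularized linear regression; only the scale of the coefficients changes.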
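2 Coefficients
The coefficients of a trained linear regression model can be read as a rough measure of each feature's importance, provided the features are on comparable scales.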
sorted(dict(zip(continuous_feature_names, model.coef_)).items(), key=lambda x:x[1], reverse=True)
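They can also be plotted:
import seaborn as sns
# train_y_ln is the log-transformed label (see section 3)
model = LinearRegression().fit(train_X, train_y_ln)
print('intercept:'+ str(model.intercept_))
# pass x and y by keyword; newer seaborn versions no longer accept them positionally
sns.barplot(x=abs(model.coef_), y=continuous_feature_names)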
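One caveat when reading coefficients as importance: their size depends on the units of each feature. The following is a minimal sketch with a hand-written one-feature least-squares fit and made-up data (the km and price values and the ols_slope helper are hypothetical, not from the dataset used here).

```python
def ols_slope(xs, ys):
    """Slope of a one-feature least-squares fit with an intercept."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    return sxy / sxx

# Hypothetical data: mileage in kilometres and a price.
km = [10_000.0, 30_000.0, 60_000.0, 90_000.0, 120_000.0]
price = [9.5, 8.9, 7.8, 7.1, 6.0]

slope_km = ols_slope(km, price)
# The same feature expressed in thousands of kilometres.
slope_kkm = ols_slope([k / 1000.0 for k in km], price)

# The coefficient grows by a factor of 1000 although the data is unchanged.
print(slope_km, slope_kkm)
```

This is why comparing raw coefficients across features is only meaningful after the features have been brought to a common scale.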
3 Checking the model
After training, compare the predictions against the true values to decide whether the model is usable.
import numpy as np
import matplotlib.pyplot as plt
# draw 50 random row labels (assumes the default 0..n-1 index)
subsample_index = np.random.randint(low=0, high=len(train_y), size=50)
plt.scatter(train_X['v_9'][subsample_index], train_y[subsample_index], color='black')
plt.scatter(train_X['v_9'][subsample_index], model.predict(train_X.loc[subsample_index]), color='blue')
plt.xlabel('v_9')
plt.ylabel('price')
plt.legend(['True Price','Predicted Price'],loc='upper right')
print('The predicted price is obviously different from the true price')
plt.show()
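Besides the scatter plot, the gap between true and predicted values can be summarized as a number. A minimal sketch follows, with hand-written mean absolute error and R² helpers and made-up values (mae, r2, y_true and y_pred are illustrative names; scikit-learn provides equivalent functions in sklearn.metrics).

```python
def mae(y_true, y_pred):
    """Mean absolute error between two equally long sequences."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_t = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_t) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

# Hypothetical true values and predictions.
y_true = [3.0, 5.0, 7.0, 9.0]
y_pred = [2.5, 5.5, 7.0, 10.0]

print(mae(y_true, y_pred), r2(y_true, y_pred))
```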
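If the deviation is large, something is wrong with the model.
One possible cause is the label: if it follows a long-tailed distribution, it does not fit the model's assumptions
and should be transformed to be closer to a normal distribution: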
train_y_ln = np.log(train_y + 1)
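Retraining on train_y_ln gives a much better fit. Predictions are then in log space, so apply np.exp(pred) - 1 to convert them back to prices.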
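The effect of the log transform on a long-tailed label can be illustrated with a small standard-library sketch. The prices list is made up for illustration and skewness is a hypothetical helper; math.log1p and math.expm1 are the scalar counterparts of np.log1p and np.expm1, and log1p(p) equals log(p + 1).

```python
import math

def skewness(values):
    """Population skewness (third standardized moment)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5

# Hypothetical long-tailed prices: most are small, one is very large.
prices = [500.0, 800.0, 1200.0, 1500.0, 2500.0, 4000.0, 9000.0, 60000.0]

# Same transform as np.log(train_y + 1), applied element by element.
log_prices = [math.log1p(p) for p in prices]

# The log-transformed values are noticeably less right-skewed.
print(skewness(prices), skewness(log_prices))

# Values in log space are mapped back to prices with the inverse transform.
restored = [math.expm1(v) for v in log_prices]
```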