After the previous four tasks, the model's score has become fairly stable.
To push the score further, we can turn to model ensembling.
There are plenty of articles online introducing ensembling, but I was still confused after reading them, so rather than making a clumsy attempt at explaining it here, I'd suggest searching for a good write-up yourself.
1. Building the models
1.1 Build a linear regression model as a baseline
```python
from sklearn import linear_model

def build_model_lr(x_train, y_train):
    reg_model = linear_model.LinearRegression()
    reg_model.fit(x_train, y_train)
    return reg_model
```
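The snippets in this section assume `x_train`, `x_val`, `y_train`, `y_val`, and `X_test` already exist. As a minimal, self-contained sketch of how such a split could be produced (the synthetic data and all variable names here are stand-ins, not the competition's actual dataset):

```python
import numpy as np
from sklearn import linear_model
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the competition features/target.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))
y = X @ np.array([3.0, -2.0, 1.0, 0.5, 4.0]) + rng.normal(scale=0.1, size=500)

# Hold out a validation split, mirroring the x_train / x_val names used above.
x_train, x_val, y_train, y_val = train_test_split(
    X, y, test_size=0.3, random_state=0)

reg_model = linear_model.LinearRegression()
reg_model.fit(x_train, y_train)
print(round(reg_model.score(x_val, y_val), 3))
```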
1.2 Build three models for ensembling
1.2.1 GBDT: the gradient boosting decision tree model
```python
from sklearn.ensemble import GradientBoostingRegressor

def build_model_gbdt(x_train, y_train):
    # 'squared_error' is the current name for the least-squares loss
    # (spelled 'ls' in older scikit-learn releases).
    gbdt = GradientBoostingRegressor(learning_rate=0.1,
                                     loss='squared_error',
                                     subsample=0.85,
                                     max_depth=5,
                                     n_estimators=100)
    gbdt.fit(x_train, y_train)
    return gbdt
```
1.2.2 XGBoost and LightGBM, usually the two most effective methods in competitions
```python
import xgboost as xgb
import lightgbm as lgb

def build_model_xgb(x_train, y_train):
    model = xgb.XGBRegressor(n_estimators=120,
                             learning_rate=0.08,
                             gamma=0,
                             subsample=0.8,
                             colsample_bytree=0.9,
                             max_depth=5)
    model.fit(x_train, y_train)
    return model

def build_model_lgb(x_train, y_train):
    gbm = lgb.LGBMRegressor(learning_rate=0.1, num_leaves=63, n_estimators=100)
    gbm.fit(x_train, y_train)
    return gbm
```
2. Training and prediction
2.1 Train and predict with the linear regression model
```python
from sklearn.metrics import mean_absolute_error

## Train and Predict
print('Predict LR...')
model_lr = build_model_lr(x_train, y_train)
val_lr = model_lr.predict(x_val)
subA_lr = model_lr.predict(X_test)
MAE_lr = mean_absolute_error(y_val, val_lr)
print('MAE of val with LR:', MAE_lr)
```
The LR MAE is 2582, which is quite high.
2.2 Train and predict with GBDT
```python
print('Predict GBDT...')
model_gbdt = build_model_gbdt(x_train, y_train)
val_gbdt = model_gbdt.predict(x_val)
subA_gbdt = model_gbdt.predict(X_test)
MAE_gbdt = mean_absolute_error(y_val, val_gbdt)
print('MAE of val with GBDT:', MAE_gbdt)
```
The GBDT MAE is 770. GBDT also takes the longest to train, about 50 seconds; all the other models finish within 10 seconds.
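Training-time comparisons like the one above are easy to measure directly. A small sketch on synthetic data (the data and model settings here are stand-ins, not the competition setup):

```python
import time
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 10))
y = X.sum(axis=1) + rng.normal(scale=0.1, size=2000)

# Time the fit of each model with a monotonic clock.
timings = {}
for name, model in [('LR', LinearRegression()),
                    ('GBDT', GradientBoostingRegressor(n_estimators=100))]:
    start = time.perf_counter()
    model.fit(X, y)
    timings[name] = time.perf_counter() - start

for name, sec in timings.items():
    print(f'{name}: {sec:.2f}s')
```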
2.3 Train and predict with XGB and LGB
```python
print('predict XGB...')
model_xgb = build_model_xgb(x_train, y_train)
val_xgb = model_xgb.predict(x_val)
subA_xgb = model_xgb.predict(X_test)
MAE_xgb = mean_absolute_error(y_val, val_xgb)
print('MAE of val with xgb:', MAE_xgb)
```
The XGB MAE is 772.
```python
print('predict lgb...')
model_lgb = build_model_lgb(x_train, y_train)
val_lgb = model_lgb.predict(x_val)
subA_lgb = model_lgb.predict(X_test)
MAE_lgb = mean_absolute_error(y_val, val_lgb)
print('MAE of val with lgb:', MAE_lgb)
```
The LGB MAE is 716.
3. Model ensembling
3.1 Weighted averaging
```python
import pandas as pd

def Weighted_method(test_pre1, test_pre2, test_pre3, w=[1/3, 1/3, 1/3]):
    Weighted_result = (w[0] * pd.Series(test_pre1)
                       + w[1] * pd.Series(test_pre2)
                       + w[2] * pd.Series(test_pre3))
    return Weighted_result
```
```python
## Init the Weight
w = [0.3, 0.4, 0.3]

## MAE on the validation set
val_pre = Weighted_method(val_lgb, val_xgb, val_gbdt, w)
MAE_Weighted = mean_absolute_error(y_val, val_pre)
print('MAE of Weighted of val:', MAE_Weighted)

## Predict on the test set
subA = Weighted_method(subA_lgb, subA_xgb, subA_gbdt, w)
print('Sta inf:')
Sta_inf(subA)

## Generate the submission file
sub = pd.DataFrame()
sub['SaleID'] = X_test.index
sub['price'] = subA
sub.to_csv('./sub2/sub_Weighted.csv', index=False)
```
The MAE after weighted averaging is 736.
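Instead of hand-picking weights like `[0.3, 0.4, 0.3]`, the weights could be chosen by minimizing validation MAE with a coarse grid search. A self-contained sketch, where the validation target and the three model predictions are synthetic stand-ins for `y_val`, `val_lgb`, `val_xgb`, and `val_gbdt`:

```python
import itertools
import numpy as np
from sklearn.metrics import mean_absolute_error

# Hypothetical validation target and three models' predictions.
rng = np.random.default_rng(1)
y_val = rng.normal(size=200)
preds = [y_val + rng.normal(scale=s, size=200) for s in (0.3, 0.35, 0.5)]

# Coarse grid search over weights that sum to 1.
best_w, best_mae = None, float('inf')
grid = np.arange(0.0, 1.01, 0.1)
for w1, w2 in itertools.product(grid, grid):
    w3 = 1.0 - w1 - w2
    if w3 < -1e-9:
        continue  # skip weight combinations that exceed a total of 1
    blend = w1 * preds[0] + w2 * preds[1] + w3 * preds[2]
    mae = mean_absolute_error(y_val, blend)
    if mae < best_mae:
        best_w, best_mae = (w1, w2, w3), mae

print(best_w, round(best_mae, 4))
```

Because the grid includes the corner weights (1, 0, 0), the blended MAE can never be worse than the best single model on the validation set; whether that carries over to the test set is a separate question.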
3.2 Stacking
First-level models:
```python
## Stacking
## Level 1
train_lgb_pred = model_lgb.predict(x_train)
train_xgb_pred = model_xgb.predict(x_train)
train_gbdt_pred = model_gbdt.predict(x_train)

Stack_X_train = pd.DataFrame()
Stack_X_train['Method_1'] = train_lgb_pred
Stack_X_train['Method_2'] = train_xgb_pred
Stack_X_train['Method_3'] = train_gbdt_pred

Stack_X_val = pd.DataFrame()
Stack_X_val['Method_1'] = val_lgb
Stack_X_val['Method_2'] = val_xgb
Stack_X_val['Method_3'] = val_gbdt

Stack_X_test = pd.DataFrame()
Stack_X_test['Method_1'] = subA_lgb
Stack_X_test['Method_2'] = subA_xgb
Stack_X_test['Method_3'] = subA_gbdt
```
Second-level model:
```python
## level2-method
model_lr_Stacking = build_model_lr(Stack_X_train, y_train)

## Training set
train_pre_Stacking = model_lr_Stacking.predict(Stack_X_train)
print('MAE of Stacking-LR:', mean_absolute_error(y_train, train_pre_Stacking))

## Validation set
val_pre_Stacking = model_lr_Stacking.predict(Stack_X_val)
print('MAE of Stacking-LR:', mean_absolute_error(y_val, val_pre_Stacking))

## Test set
print('Predict Stacking-LR...')
subA_Stacking = model_lr_Stacking.predict(Stack_X_test)
```
The MAE is 632 on the training set and 719 on the validation set.
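One caveat: the level-2 model above is trained on in-sample level-1 predictions, which can leak training information (the gap between the 632 training MAE and the 719 validation MAE is consistent with that). A common alternative is to build the level-2 features from out-of-fold predictions. A minimal sketch with synthetic data and sklearn-only base models standing in for the LGB/XGB/GBDT trio:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import cross_val_predict

# Synthetic stand-in for the competition's x_train / y_train.
X, y = make_regression(n_samples=400, n_features=8, noise=5.0, random_state=0)

base_models = [
    GradientBoostingRegressor(n_estimators=50, random_state=0),
    RandomForestRegressor(n_estimators=50, random_state=0),
    Ridge(alpha=1.0),
]

# Out-of-fold predictions: each row is predicted by a model that never
# saw it during training, so the level-2 features are leakage-free.
oof = np.column_stack([
    cross_val_predict(m, X, y, cv=5) for m in base_models
])

level2 = LinearRegression().fit(oof, y)
print(oof.shape, level2.coef_.round(2))
```

For final predictions, each base model would then be refit on the full training set before generating its column of the level-2 test features.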
4. Summary
4.1 Weighted averaging
The weighted-average MAE beats GBDT and XGB but is slightly worse than LGB alone.
4.2 Stacking
The Stacking MAE is slightly better than the weighted average.
In summary, this round of model ensembling was not particularly successful; online validation has not been done yet, and there is still room for improvement.