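Copyright notice: this is the blogger's original article and is privately owned; it may not be reposted or used without permission. https://blog.csdn.net/a1066196847/article/details/84105633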
1: Competition link
https://jdder.jd.com/index/jddDetail?matchId=3dca1a91ad2a4a6da201f125ede9601a
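2: Overview
The preliminary round of this competition differed considerably from the semi-final and the final. In the preliminary round, the 15-day span to be predicted contained no obvious anomalies, so modeling with a traditional sliding window was a reasonable approach. In the semi-final and final, however, the organizers split those 15 days into a 5-day segment and a 10-day segment. The 5 days lie inside the training period, as if they had been cut out of the long training span, while the 10 days are the 10 days that follow the whole training period. Some of these 15 days change drastically, so the semi-final approach had to differ from the preliminary one: the 10-day prediction segment needed dedicated handling.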
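As a rough illustration of the sliding-window idea mentioned above, here is a minimal sketch that builds lag and rolling-mean features per district. The column names follow flow_train.csv; the helper name `add_window_features`, the toy data, and the choice of lags and window size are assumptions for illustration, not the settings used in the competition.

```python
import pandas as pd

def add_window_features(df, target="dwell", lags=(1, 7), window=7):
    """Add lag and rolling-mean features per district (hypothetical helper).

    Assumes one row per district per day with no missing days, so that
    shifting by k rows means shifting by k days.
    """
    df = df.sort_values(["city_code", "district_code", "date_dt"]).copy()
    grouped = df.groupby(["city_code", "district_code"])[target]
    for lag in lags:
        df[f"{target}_lag{lag}"] = grouped.shift(lag)
    # shift(1) first so the rolling mean only sees earlier days
    df[f"{target}_roll{window}"] = grouped.transform(
        lambda s: s.shift(1).rolling(window, min_periods=1).mean()
    )
    return df

# Toy data, not competition data
toy = pd.DataFrame({
    "date_dt": [20170523, 20170524, 20170525, 20170526],
    "city_code": ["c1"] * 4,
    "district_code": ["d1"] * 4,
    "dwell": [10.0, 20.0, 30.0, 40.0],
})
features = add_window_features(toy, lags=(1,), window=2)
```

The first row of each district has no history, so its lag and rolling features are NaN and would need to be dropped or filled before training.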
The overall technical approach was still rules plus models, with the weighting coefficients between them adjusted.
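The rules-plus-model weighting can be sketched as a plain weighted average of prediction arrays. The function name `blend`, the weights 0.6 / 0.4, and the numbers are assumptions chosen for illustration; the actual coefficients used in the competition are not given here.

```python
import numpy as np

def blend(predictions, weights):
    """Weighted average of several prediction arrays (hypothetical helper)."""
    weights = np.asarray(weights, dtype=float)
    weights = weights / weights.sum()  # normalise so the weights sum to 1
    stacked = np.vstack([np.asarray(p, dtype=float) for p in predictions])
    return weights @ stacked

# Toy numbers, not competition results
rule_pred = [100.0, 200.0]
model_pred = [120.0, 180.0]
final = blend([rule_pred, model_pred], weights=[0.6, 0.4])
```

Normalising the weights inside the helper keeps the blended values on the same scale as the inputs, even when the coefficients being tried do not sum to 1.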
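3: Code sharing
We blended several modeling methods with weights, and likewise blended several rules, so I won't go through all of them here. Instead, I share the core code of each part.
First, the model part: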
import numpy as np
import pandas as pd
import string
import re
from collections import Counter
import pickle
import datetime
## Step 1: data processing
flow_norate = pd.read_csv('flow_train.csv')
rate = pd.read_csv('flow_rate_train.csv')
# Attach the rate columns to the flow records of the same date and district
flow_mutilply_rate = pd.merge(flow_norate, rate, on=['date_dt', 'city_code', 'district_code'], how='left')
# Divide each raw value by its matching rate column
flow_mutilply_rate['dwell_new'] = flow_mutilply_rate['dwell'] / flow_mutilply_rate['dwell_rate']
flow_mutilply_rate['flow_in_new'] = flow_mutilply_rate['flow_in'] / flow_mutilply_rate['flow_in_rate']
flow_mutilply_rate['flow_out_new'] = flow_mutilply_rate['flow_out'] / flow_mutilply_rate['flow_out_rate']
# Select and rename in one chain so flow_train is a new frame rather than a slice renamed in place
flow_train = flow_mutilply_rate[['date_dt', 'city_code', 'district_code', 'dwell_new', 'flow_in_new', 'flow_out_new']].rename(
    columns={'dwell_new': 'dwell', 'flow_in_new': 'flow_in', 'flow_out_new': 'flow_out'})
## Step 2: feature extraction
# date_dt is an integer of the form YYYYMMDD, so 20170819 + 5 is the integer 20170824
flow_train_s = flow_train.loc[(flow_train.date_dt >= 20170523) & (flow_train.date_dt < 20170819 + 5)]
flow_train_n = flow_train.loc[(flow_train.date_dt >= 20170819) & (flow_train.date_dt <= 20170930)]
class Date_Process:
    def __init__(self):
        # Maps a sortable date key to a 0-based rank; filled by _ranks_fit_transform
        self.rank_dic = {}
    def _dateinfo_trans(self, df):
        # Parse the integer YYYYMMDD column into datetimes, then split out calendar fields
        df['date_dt'] = df['date_dt'].apply(lambda x: datetime.datetime.strptime(str(x), '%Y%m%d'))
        df['year'] = df['date_dt'].map(lambda x: x.year)
        df['month'] = df['date_dt'].map(lambda x: x.month)
        df['day'] = df['date_dt'].map(lambda x: x.day)
        df['day_of_week'] = df['date_dt'].map(lambda x: x.weekday())
        return df
    def _ranks_fit_transform(self, df):
        # This key preserves chronological order only within one calendar year,
        # because month * 40 can exceed 400
        df['ranks'] = df['year'] * 400 + df['month'] * 40 + df['day']
        rank_sort = np.sort(df['ranks'].unique())
        rank_dic = {}
        for i, val in enumerate(rank_sort):
            rank_dic[val] = i
        df['ranks'] = df['ranks'].map(rank_dic)
        self.rank_dic = rank_dic
        return df
    def _ranks_transform(self, df):
        df['ranks'] = df['year'] * 400 + df['month'] * 40 + df['day']
        # Series.map does not raise on unseen keys; it returns NaN, so check for that explicitly
        df['ranks'] = df['ranks'].map(self.rank_dic)
        if df['ranks'].isna().any():
            print('Date not in the same range!')
        return df
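The fit/transform split above follows the usual pattern of learning a mapping on one set of dates and reusing it on another. A minimal, self-contained sketch of that pattern follows. The function names `fit_date_ranks` and `apply_date_ranks` and the toy dates are assumptions, and the sketch ranks the YYYYMMDD integers directly, which stay in chronological order across a year boundary, instead of using the year * 400 + month * 40 + day key.

```python
import pandas as pd

def fit_date_ranks(dates):
    """Map each distinct YYYYMMDD integer to its 0-based chronological rank."""
    unique_sorted = sorted({int(d) for d in dates})
    return {d: i for i, d in enumerate(unique_sorted)}

def apply_date_ranks(dates, rank_dic):
    """Look up ranks for new dates; dates never seen in fitting become NaN."""
    ranks = pd.Series(list(dates)).map(rank_dic)
    if ranks.isna().any():
        print('Date not in the same range!')
    return ranks

# Toy dates spanning a year boundary
rank_dic = fit_date_ranks([20171230, 20171231, 20180101, 20171231])
ranks = apply_date_ranks([20171231, 20180102], rank_dic)
```

Because `Series.map` returns NaN for keys missing from the dictionary instead of raising, the unseen date 20180102 is reported by the explicit `isna()` check.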
Next, the rule part:
# about_nine_days, about_nine_days_other and the wks_1 / wkend / address columns come from code not shown in this excerpt
a = flow_train[(flow_train.date_dt <= about_nine_days_other) & (flow_train.date_dt >= about_nine_days)]
# Per-group mean and median of each target in a single pass.
# Named aggregation replaces agg({'new_name': 'func'}) on a single column, which pandas 1.0 and later reject.
group_keys = ['wks_1', 'wkend', 'address']
a = a.groupby(group_keys, as_index=False).agg(
    a_dwell=('dwell', 'mean'),
    a_flow_in=('flow_in', 'mean'),
    a_flow_out=('flow_out', 'mean'),
    a_dwell_median=('dwell', 'median'),
    a_flow_in_median=('flow_in', 'median'),
    a_flow_out_median=('flow_out', 'median'),
)
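One way such per-group statistics could be turned into a rule-based prediction is to join them onto the rows to be predicted and mix the mean with the median. The sketch below is only an assumption about how the statistics might be used: the grouping keys are reduced to `wkend` and `address`, and the weight `w`, the output column `dwell_rule`, and all numbers are hypothetical.

```python
import pandas as pd

keys = ["wkend", "address"]  # hypothetical, simplified grouping keys

# Toy history, not competition data
history = pd.DataFrame({
    "address": ["a", "a", "a", "a", "b", "b"],
    "wkend":   [0, 0, 0, 1, 0, 0],
    "dwell":   [10.0, 30.0, 80.0, 50.0, 4.0, 8.0],
})
stats = history.groupby(keys, as_index=False).agg(
    a_dwell=("dwell", "mean"),
    a_dwell_median=("dwell", "median"),
)

# Rows to predict: look up the statistics of the matching group
to_predict = pd.DataFrame({"address": ["a", "b"], "wkend": [0, 0]})
rule = to_predict.merge(stats, on=keys, how="left")

w = 0.5  # assumed weight between mean and median
rule["dwell_rule"] = w * rule["a_dwell"] + (1 - w) * rule["a_dwell_median"]
```

Mixing in the median dampens the effect of a few extreme days in the history window, which matters when the prediction period contains days with drastic changes.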
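4: Post-competition summary
The key to this competition was to adjust the approach dynamically and accurately according to the online score feedback, and to design different methods to validate one's own ideas. In practice, this is a common way of working in many time-series competitions.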