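Timestamps on posts scraped from Twitter by parsing the page HTML are sometimes in Chinese: a 上午 (AM) or 下午 (PM) marker followed by a 12-hour time, then the date written with 年, 月 and 日 (for example 2017年4月27日).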
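Time handling has many small details, so here is a short summary. First, be clear about the target: the processed data should use the 24-hour clock, in the following format: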
format = "%Y-%m-%d %H:%M:%S"
In other words, the string has to be converted to a datetime.datetime object.
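As a quick illustration of that target, the sketch below round-trips a string through `datetime.strptime` and `datetime.strftime`. The sample timestamp and the name `TARGET_FORMAT` are made up for the example.

```python
from datetime import datetime

# Hypothetical name for the target format; the sample string is invented.
TARGET_FORMAT = "%Y-%m-%d %H:%M:%S"

# strptime: str -> datetime.datetime
parsed = datetime.strptime("2017-04-27 13:05:00", TARGET_FORMAT)

# strftime: datetime.datetime -> str, in 24-hour form
formatted = parsed.strftime(TARGET_FORMAT)

print(type(parsed).__name__, parsed.hour)
print(formatted)
```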
The code is as follows:
from datetime import datetime

format = "%Y-%m-%d %H:%M:%S"  # target format (24-hour clock)

def chineseTime2National(time):
    # After removing 上午/下午 and splitting on spaces,
    # time[0] is the clock time and time[2] is the date.
    if time[0] == "上":  # 上午 = AM
        time = time.replace('上午', '').split(' ')
        houmin = time[0].split(':')
        if houmin[0] == '12':  # 12 AM (midnight) must become 00
            houmin = "00" + ":" + houmin[1]
        else:
            houmin = time[0]
    elif time[0] == "下":  # 下午 = PM
        time = time.replace('下午', '').split(' ')
        houmin = time[0].split(':')
        if houmin[0] == '12':  # 12 PM (noon) stays 12
            hour = '12'
        else:
            hour = int(houmin[0]) + 12  # convert PM time to the 24-hour clock
        houmin = str(hour) + ":" + houmin[1]
    else:
        # Neither 上午 nor 下午: fail explicitly instead of returning an unset variable
        raise ValueError("unrecognized time string: %r" % (time,))
    time = time[2] + " " + houmin
    # Replace 年 and 月 with '-' and drop 日, leaving a year-month-day string
    time = time.replace('年', '-').replace('月', '-').replace('日', '')
    # strptime converts the string into a datetime.datetime
    restime = datetime.strptime(time, '%Y-%m-%d %H:%M')
    return restime
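The same AM/PM conversion can also be written more compactly with a regular expression. The following is only a sketch and not the author's code: the function name `parse_chinese_time`, the pattern, and the sample strings are assumptions, in particular that the time and the date are separated by a single token such as `-`.

```python
import re
from datetime import datetime

# Assumed layout: "<上午|下午>H:MM <separator> YYYY年M月D日".
# The separator token is a guess; any single non-space token is accepted.
_PATTERN = re.compile(
    r"^(上午|下午)(\d{1,2}):(\d{2})\s+\S+\s+(\d{4})年(\d{1,2})月(\d{1,2})日$"
)

def parse_chinese_time(text):
    """Convert a Chinese 12-hour timestamp into a datetime (hypothetical helper)."""
    match = _PATTERN.match(text.strip())
    if match is None:
        raise ValueError("unrecognized timestamp: %r" % (text,))
    period, hour, minute, year, month, day = match.groups()
    hour = int(hour) % 12      # 12 maps to 0 for both AM and PM
    if period == "下午":
        hour += 12             # PM: 12 -> 12, 1..11 -> 13..23
    return datetime(int(year), int(month), int(day), hour, int(minute))

# Invented sample inputs
print(parse_chinese_time("上午12:05 - 2017年4月27日"))
print(parse_chinese_time("下午3:15 - 2017年4月27日"))
```

Taking the hour modulo 12 before adding 12 for 下午 handles both special cases (midnight and noon) without separate branches.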
Once we have datetime values, we need a few simple methods to count posts by hour of day and by day of week. The code is as follows:
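import json
import numpy as np

# an_traces_df is assumed to be a pandas DataFrame with
# "screen_name" and "created_at" columns.
with open('time_feature_of_user.json', 'w') as f:
    # Group by a plain column name so that name is a string, not a tuple
    for name, group in an_traces_df.groupby('screen_name'):
        dic = {}
        dic["screen_name"] = name
        hours = np.zeros(24)    # post counts per hour of day
        weekdays = np.zeros(7)  # post counts per day of week
        for t in group["created_at"].values:
            t = chineseTime2National(t)  # convert to datetime
            day = t.date()               # date() returns the date part of a datetime
            weekday = day.weekday()      # weekday from the date: 0 is Monday, and so on
            # hour is already 0-23, so use it directly as the index
            # (subtracting 1 would shift every bin and wrap hour 0 to index 23)
            hour = t.time().hour
            hours[hour] += 1
            weekdays[weekday] += 1
        # Normalize the counts to fractions of this user's posts
        dic["hour_feature"] = (hours / len(group["created_at"].values)).tolist()
        dic["weekday_feature"] = (weekdays / len(group["created_at"].values)).tolist()
        f.write(json.dumps(dic) + '\n')  # one JSON object per line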
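To check the counting logic in isolation, here is a minimal standard-library sketch of the same hour and weekday histogram. The function name `time_features` and the sample datetimes are made up for illustration. It uses plain lists instead of NumPy arrays.

```python
from datetime import datetime

def time_features(timestamps):
    """Return (hour_feature, weekday_feature) as lists of fractions (hypothetical helper)."""
    hours = [0] * 24     # index = hour of day, 0..23
    weekdays = [0] * 7   # index = weekday(), 0 = Monday .. 6 = Sunday
    for t in timestamps:
        hours[t.hour] += 1
        weekdays[t.weekday()] += 1
    total = len(timestamps)
    if total == 0:
        # No posts: return all-zero features instead of dividing by zero
        return [0.0] * 24, [0.0] * 7
    return [c / total for c in hours], [c / total for c in weekdays]

# Invented sample: two posts on a Thursday, one on a Sunday
sample = [
    datetime(2017, 4, 27, 0, 5),
    datetime(2017, 4, 27, 13, 0),
    datetime(2017, 4, 30, 13, 30),
]
hour_feature, weekday_feature = time_features(sample)
print(hour_feature[0], hour_feature[13])
print(weekday_feature[3], weekday_feature[6])
```

Each feature list sums to 1 for a user with at least one post, which makes users with different posting volumes comparable.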