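1. Read data from a specified path
Read the JSON data. Note that how the path is written depends on Windows string literals, not on the Python version: in an ordinary string each backslash must be escaped as \\ (as I do below in Python 3), while a raw string (r'...') or forward slashes / also work.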
import json
path='E:\\DataAnalysis\\pydata-book\\pydata-book-1st-edition\\ch02\\usagov_bitly_data2012-03-16-1331923249.txt'
records=[json.loads(line) for line in open(path)]
records[0]
Out[4]:
{'a': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.78 Safari/535.11',
'c': 'US',
'nk': 1,
'tz': 'America/New_York',
'gr': 'MA',
'g': 'A6qOVH',
'h': 'wfLQtf',
'l': 'orofrog',
'al': 'en-US,en;q=0.8',
'hh': '1.usa.gov',
'r': 'http://www.facebook.com/l/7AQEFzjSi/1.usa.gov/wfLQtf',
'u': 'http://www.ncbi.nlm.nih.gov/pubmed/22415991',
't': 1331923247,
'hc': 1331822918,
'cy': 'Danvers',
'll': [42.576698, -70.954903]}
records[0]['tz']
Out[5]: 'America/New_York'
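The path handling and line-by-line parsing above can be sketched as follows. This is a minimal illustration, not the original session: the path E:\data\sample.txt, the helper name read_json_lines, the utf-8 encoding, and the three demo records are all assumptions made for the example.

```python
import json
import os
import tempfile
from pathlib import PureWindowsPath

# Three spellings of the same (hypothetical) Windows path.
escaped = 'E:\\data\\sample.txt'   # ordinary string, backslashes escaped
raw = r'E:\data\sample.txt'        # raw string, no escaping needed
forward = 'E:/data/sample.txt'     # forward slashes
same_path = (PureWindowsPath(escaped) == PureWindowsPath(raw)
             == PureWindowsPath(forward))

def read_json_lines(path):
    """Parse a file holding one JSON object per line, skipping blank lines."""
    # utf-8 is an assumption; adjust if the file uses another encoding.
    with open(path, encoding='utf-8') as f:
        return [json.loads(line) for line in f if line.strip()]

# Demo on a small temporary file with made-up records.
with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False,
                                 encoding='utf-8') as tmp:
    tmp.write('{"tz": "America/New_York"}\n{"tz": ""}\n{"a": "x"}\n')
demo_records = read_json_lines(tmp.name)
os.remove(tmp.name)
```

Using a with block closes the file as soon as parsing finishes, which the one-line open(path) form leaves to the garbage collector.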
2. Read one field from each dictionary
time_zones=[rec('tz') for rec in records]
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-6-0672c6c590cc> in <module>()
----> 1 time_zones=[rec('tz') for rec in records]
<ipython-input-6-0672c6c590cc> in <listcomp>(.0)
----> 1 time_zones=[rec('tz') for rec in records]
TypeError: 'dict' object is not callable
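A field must be accessed with [], but () was used here by mistake. The corrected line, which also skips records that have no 'tz' key, is:
time_zones=[rec['tz'] for rec in records if 'tz' in rec]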
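Some records appear to have no 'tz' key at all (the Missing row in the counts of section 5 suggests this), so plain rec['tz'] on every record would likely raise a KeyError. Two ways to handle that, sketched on three made-up records; the name time_zones_default and the 'Missing' placeholder are choices made for this example:

```python
# Made-up records: one normal, one with an empty tz, one without the key.
records = [
    {'tz': 'America/New_York', 'a': 'Mozilla/5.0'},
    {'tz': '', 'a': 'GoogleMaps/RochesterNY'},
    {'a': 'Mozilla/4.0'},
]

# Option 1: keep only the records that actually have the field.
time_zones = [rec['tz'] for rec in records if 'tz' in rec]

# Option 2: keep every record and substitute a placeholder.
time_zones_default = [rec.get('tz', 'Missing') for rec in records]
```

Option 1 shortens the list; option 2 keeps it aligned with records, which matters if the values are later matched back to other fields.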
3. Count how many times each value of the field occurs
Method 1:
def get_counts(sequence):
    counts={}
    for x in sequence:
        if x in counts:
            counts[x]+=1
        else:
            counts[x]=1
    return counts
Method 2:
from collections import defaultdict
def get_counts2(sequence):
    counts=defaultdict(int)  # missing keys start at 0
    for x in sequence:
        counts[x]+=1
    return counts
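As a quick check that the two counting functions agree, here is a small sketch that runs both on a made-up list of time zones (the sample list is an assumption, not taken from the data file):

```python
from collections import defaultdict

def get_counts(sequence):
    """Count occurrences with a plain dict."""
    counts = {}
    for x in sequence:
        if x in counts:
            counts[x] += 1
        else:
            counts[x] = 1
    return counts

def get_counts2(sequence):
    """Count occurrences with a defaultdict; missing keys start at 0."""
    counts = defaultdict(int)
    for x in sequence:
        counts[x] += 1
    return counts

sample = ['America/New_York', '', 'America/Chicago', 'America/New_York']
counts = get_counts(sample)
counts2 = get_counts2(sample)
```

Both return the same mapping; the defaultdict version only drops the explicit membership test.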
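4. Get the top 10 values and their counts
Method 1: write a function
def top_counts(count_dict,n=10):
    value_key_pairs=[(count,tz) for tz, count in count_dict.items()]
    value_key_pairs.sort()
    return value_key_pairs[-n:]
counts=get_counts(time_zones)
top_counts(counts)
Method 2: use Counter from the standard library's collections module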
from collections import Counter
counts=Counter(time_zones)
counts.most_common(10)
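The two approaches return the same information in different shapes, which is easy to miss. A minimal sketch on a made-up sample (the list below is an assumption for illustration):

```python
from collections import Counter

def top_counts(count_dict, n=10):
    """Return the n largest (count, key) pairs in ascending order."""
    pairs = [(count, tz) for tz, count in count_dict.items()]
    pairs.sort()
    return pairs[-n:]

sample = ['America/New_York', 'America/Chicago', 'America/New_York',
          '', 'America/New_York', 'America/Chicago']
counts = Counter(sample)

top_two_sorted = top_counts(counts, n=2)   # (count, key), smallest first
top_two_common = counts.most_common(2)     # (key, count), largest first
```

top_counts yields (count, key) tuples ending with the most frequent value, while most_common yields (key, count) tuples starting with it.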
Counting time zones with pandas
from pandas import DataFrame,Series
import pandas as pd; import numpy as np
frame=DataFrame(records)
The built-in counting method:
tz_counts=frame['tz'].value_counts()
tz_counts[:10]
5. Handle empty and missing values
clean_tz=frame['tz'].fillna('Missing')
clean_tz[clean_tz=='']='Unknown'
tz_counts=clean_tz.value_counts()
tz_counts[:10]
Out[36]:
America/New_York       1251
Unknown                 521
America/Chicago         400
America/Los_Angeles     382
America/Denver          191
Missing                 120
Europe/London            74
Asia/Tokyo               37
Pacific/Honolulu         36
Europe/Madrid            35
Name: tz, dtype: int64
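To make the difference between the two labels concrete, here is a small sketch on four made-up records (an assumption for illustration, not the real data): a record with no 'tz' key becomes NaN in the DataFrame and is labelled Missing, while a record whose 'tz' is an empty string is labelled Unknown.

```python
import pandas as pd

# Made-up records: two normal, one with an empty tz, one without the key.
records = [
    {'tz': 'America/New_York'},
    {'tz': ''},
    {'a': 'no tz key here'},
    {'tz': 'America/New_York'},
]
frame = pd.DataFrame(records)

clean_tz = frame['tz'].fillna('Missing')   # absent key -> NaN -> 'Missing'
clean_tz[clean_tz == ''] = 'Unknown'       # empty string -> 'Unknown'
tz_counts = clean_tz.value_counts()
```

Keeping the two cases under separate labels preserves the distinction between a field that was never recorded and one that was recorded as empty.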
tz_counts[:10].plot(kind='barh',rot=0)
Out[38]: <matplotlib.axes._subplots.AxesSubplot at 0x29412298780>

frame['a'][1]
Out[39]: 'GoogleMaps/RochesterNY'
frame['a'][50]
Out[40]: 'Mozilla/5.0 (Windows NT 5.1; rv:10.0.2) Gecko/20100101 Firefox/10.0.2'
frame['a'][51]
Out[41]: 'Mozilla/5.0 (Linux; U; Android 2.2.2; en-us; LG-P925/V10e Build/FRG83G) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1'
results=Series([x.split()[0] for x in frame.a.dropna()])
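What the split in the last line extracts can be seen without pandas. The sketch below reuses the three agent strings printed above plus a None that stands in for a record with no 'a' field; the Counter step and the names first_tokens and token_counts are additions for illustration.

```python
from collections import Counter

agents = [
    'GoogleMaps/RochesterNY',
    'Mozilla/5.0 (Windows NT 5.1; rv:10.0.2) Gecko/20100101 Firefox/10.0.2',
    'Mozilla/5.0 (Linux; U; Android 2.2.2; en-us; LG-P925/V10e Build/FRG83G) '
    'AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1',
    None,  # stands in for a record without an 'a' field
]

# split() with no argument breaks on whitespace; [0] keeps the first token,
# which is roughly the browser or client name and version.
first_tokens = [a.split()[0] for a in agents if a is not None]
token_counts = Counter(first_tokens)
```

The None filter plays the role that dropna() plays in the pandas version.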