Handling missing data
Operating on missing values in a time series (assumes `import pandas as pd` and `import numpy as np`):
In [79]: df = pd.DataFrame(np.random.randn(6, 1), index=pd.date_range('2013-08-01', periods=6, freq='B'), columns=['A'])
In [80]: df.loc[df.index[3], 'A'] = np.nan
In [81]: df
Out[81]:
A
2013-08-01 -1.054874
2013-08-02 -0.179642
2013-08-05 0.639589
2013-08-06 NaN
2013-08-07 1.906684
2013-08-08 0.104050
# reverse the index, then forward-fill the gap
In [82]: df.reindex(df.index[::-1]).ffill()
Out[82]:
A
2013-08-08 0.104050
2013-08-07 1.906684
2013-08-06 1.906684
2013-08-05 0.639589
2013-08-02 -0.179642
2013-08-01 -1.054874
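Reversing the index and forward-filling is equivalent to back-filling in the original order: the gap is filled with the next valid observation. A minimal sketch using `bfill()` directly, which avoids the reindex:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randn(6, 1),
                  index=pd.date_range('2013-08-01', periods=6, freq='B'),
                  columns=['A'])
df.loc[df.index[3], 'A'] = np.nan

# bfill() propagates the next valid value backward,
# so 2013-08-06 receives the 2013-08-07 value.
filled = df.bfill()
```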
Grouping
Using apply:
In [83]: df = pd.DataFrame({'animal': 'cat dog cat fish dog cat cat'.split(),
....: 'size': list('SSMMMLL'),
....: 'weight': [8, 10, 11, 1, 20, 12, 12],
....: 'adult' : [False] * 5 + [True] * 2}); df
....:
Out[83]:
animal size weight adult
0 cat S 8 False
1 dog S 10 False
2 cat M 11 False
3 fish M 1 False
4 dog M 20 False
5 cat L 12 True
6 cat L 12 True
# the size of the heaviest animal in each group
In [84]: df.groupby('animal').apply(lambda subf: subf['size'][subf['weight'].idxmax()])
Out[84]:
animal
cat L
dog M
fish M
dtype: object
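The same result can be obtained without `apply`, by computing each group's `idxmax` and indexing the frame with those row labels, a sketch:

```python
import pandas as pd

df = pd.DataFrame({'animal': 'cat dog cat fish dog cat cat'.split(),
                   'size': list('SSMMMLL'),
                   'weight': [8, 10, 11, 1, 20, 12, 12],
                   'adult': [False] * 5 + [True] * 2})

# Row label of the heaviest animal per group, then select those rows directly.
heaviest = df.loc[df.groupby('animal')['weight'].idxmax()]
result = heaviest.set_index('animal')['size']
```

This form stays vectorized and is usually faster than a per-group lambda.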
Using get_group:
In [85]: gb = df.groupby(['animal'])
# get the rows belonging to the 'cat' group
In [86]: gb.get_group('cat')
Out[86]:
animal size weight adult
0 cat S 8 False
2 cat M 11 False
5 cat L 12 True
6 cat L 12 True
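Beyond fetching a single group with `get_group`, iterating over the GroupBy object yields each `(key, sub-DataFrame)` pair in turn, a sketch:

```python
import pandas as pd

df = pd.DataFrame({'animal': 'cat dog cat fish dog cat cat'.split(),
                   'weight': [8, 10, 11, 1, 20, 12, 12]})

sizes = {}
for name, group in df.groupby('animal'):  # name: group key, group: its rows
    sizes[name] = len(group)
```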
# apply different multipliers to different items within each group
In [87]: def GrowUp(x):
....: avg_weight = sum(x[x['size'] == 'S'].weight * 1.5)
....: avg_weight += sum(x[x['size'] == 'M'].weight * 1.25)
....: avg_weight += sum(x[x['size'] == 'L'].weight)
....: avg_weight /= len(x)
....: return pd.Series(['L',avg_weight,True], index=['size', 'weight', 'adult'])
....:
In [88]: expected_df = gb.apply(GrowUp)
In [89]: expected_df
Out[89]:
size weight adult
animal
cat L 12.4375 True
dog L 20.0000 True
fish L 1.2500 True
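The weight column of `GrowUp` can also be computed without a per-group function: map the size-dependent multipliers (the same ones hard-coded in `GrowUp`) onto each row, then take a group mean. A sketch:

```python
import pandas as pd

df = pd.DataFrame({'animal': 'cat dog cat fish dog cat cat'.split(),
                   'size': list('SSMMMLL'),
                   'weight': [8, 10, 11, 1, 20, 12, 12]})

# Growth multipliers per size class, as in GrowUp above.
factor = df['size'].map({'S': 1.5, 'M': 1.25, 'L': 1.0})
grown = (df['weight'] * factor).groupby(df['animal']).mean()
```

This reproduces the `weight` column of `expected_df` (12.4375 for cat, 20.0 for dog, 1.25 for fish) with vectorized operations.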
Expanding apply (assumes `import functools`):
In [90]: S = pd.Series([i / 100.0 for i in range(1,11)])
In [91]: def CumRet(x,y):
....: return x * (1 + y)
....:
In [92]: def Red(x):
....: return functools.reduce(CumRet,x,1.0)
....:
In [93]: S.expanding().apply(Red, raw=True)
Out[93]:
0 1.010000
1 1.030200
2 1.061106
3 1.103550
4 1.158728
5 1.228251
6 1.314229
7 1.419367
8 1.547110
9 1.701821
dtype: float64
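Since the expanding reduce above is just a running product of `(1 + y)`, the same series can be computed directly with `cumprod`, a sketch:

```python
import pandas as pd

S = pd.Series([i / 100.0 for i in range(1, 11)])

# Cumulative product of (1 + return): identical to the expanding reduce.
cum = (1 + S).cumprod()
```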