I. The GroupBy object
The data object:
>>>planets.head()
            method  number  orbital_period   mass  distance  year
0  Radial Velocity       1         269.300   7.10     77.40  2006
1  Radial Velocity       1         874.774   2.21     56.95  2008
2  Radial Velocity       1         763.000   2.60     19.84  2011
3  Radial Velocity       1         326.030  19.40    110.62  2007
4  Radial Velocity       1         516.220  10.50    119.47  2009
1. Column indexing:
Note how the type of the object changes after each selection:
>>>planets.groupby('method')
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x000001C4DFFB3940>
>>>planets.groupby('method')['orbital_period']
<pandas.core.groupby.generic.SeriesGroupBy object at 0x000001C4D903E908>
>>>planets.groupby('method')['orbital_period'].median() # median of this column within each group
method
Astrometry 631.180000
Eclipse Timing Variations 4343.500000
Imaging 27500.000000
Microlensing 3300.000000
Orbital Brightness Modulation 0.342887
Pulsar Timing 66.541900
Pulsation Timing Variations 1170.000000
Radial Velocity 360.200000
Transit 5.714932
Transit Timing Variations 57.011000
Name: orbital_period, dtype: float64
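The progression of types shown above can be reproduced on any small frame. The following is a minimal sketch, assuming a hypothetical three-row frame `toy` with made-up values in place of `planets`:

```python
import pandas as pd

# Hypothetical stand-in for the planets data; the values are made up.
toy = pd.DataFrame({
    'method': ['Transit', 'Transit', 'Imaging'],
    'orbital_period': [3.0, 5.0, 100.0],
})

grouped = toy.groupby('method')        # DataFrameGroupBy: nothing is computed yet
column = grouped['orbital_period']     # SeriesGroupBy: still no computation
medians = column.median()              # Series indexed by the group labels

print(type(grouped).__name__)
print(type(column).__name__)
print(medians)
```

No computation happens until an aggregation such as median() is called, which is why the first two expressions only display an object address.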
2. Iterating over groups:
>>>for (method, group) in planets.groupby('method'):
...    print("{0:30s} shape={1}".format(method, group.shape))
Astrometry shape=(2, 6)
Eclipse Timing Variations shape=(9, 6)
Imaging shape=(38, 6)
Microlensing shape=(23, 6)
Orbital Brightness Modulation shape=(3, 6)
Pulsar Timing shape=(5, 6)
Pulsation Timing Variations shape=(1, 6)
Radial Velocity shape=(553, 6)
Transit shape=(397, 6)
Transit Timing Variations shape=(4, 6)
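Each iteration step yields a pair of the group label and the sub-DataFrame for that group. A small sketch, again assuming a hypothetical frame `toy` with made-up values, that also uses get_group() to fetch a single group without looping:

```python
import pandas as pd

# Hypothetical data; the values are made up for illustration.
toy = pd.DataFrame({
    'method': ['Transit', 'Transit', 'Imaging'],
    'year': [2010, 2012, 2008],
})

# Iterating yields (group label, sub-DataFrame) pairs.
shapes = {method: group.shape for method, group in toy.groupby('method')}
print(shapes)

# get_group() returns the sub-DataFrame of one group directly.
transit = toy.groupby('method').get_group('Transit')
print(transit)
```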
3. Dispatching methods:
>>>planets.groupby('method')['year'].describe().unstack()
method
count Astrometry 2.000000
Eclipse Timing Variations 9.000000
Imaging 38.000000
Microlensing 23.000000
Orbital Brightness Modulation 3.000000
Pulsar Timing 5.000000
Pulsation Timing Variations 1.000000
Radial Velocity 553.000000
Transit 397.000000
Transit Timing Variations 4.000000
mean Astrometry 2011.500000
Eclipse Timing Variations 2010.000000
Imaging 2009.131579
Microlensing 2009.782609
Orbital Brightness Modulation 2011.666667
Pulsar Timing 1998.400000
Pulsation Timing Variations 2007.000000
Radial Velocity 2007.518987
Transit 2011.236776
Transit Timing Variations 2012.500000
std Astrometry 2.121320
Eclipse Timing Variations 1.414214
Imaging 2.781901
Microlensing 2.859697
Orbital Brightness Modulation 1.154701
Pulsar Timing 8.384510
Pulsation Timing Variations NaN
Radial Velocity 4.249052
Transit 2.077867
Transit Timing Variations 1.290994
...
50% Astrometry 2011.500000
Eclipse Timing Variations 2010.000000
Imaging 2009.000000
Microlensing 2010.000000
Orbital Brightness Modulation 2011.000000
Pulsar Timing 1994.000000
Pulsation Timing Variations 2007.000000
Radial Velocity 2009.000000
Transit 2012.000000
Transit Timing Variations 2012.500000
75% Astrometry 2012.250000
Eclipse Timing Variations 2011.000000
Imaging 2011.000000
Microlensing 2012.000000
Orbital Brightness Modulation 2012.000000
Pulsar Timing 2003.000000
Pulsation Timing Variations 2007.000000
Radial Velocity 2011.000000
Transit 2013.000000
Transit Timing Variations 2013.250000
max Astrometry 2013.000000
Eclipse Timing Variations 2012.000000
Imaging 2013.000000
Microlensing 2013.000000
Orbital Brightness Modulation 2013.000000
Pulsar Timing 2011.000000
Pulsation Timing Variations 2007.000000
Radial Velocity 2014.000000
Transit 2014.000000
Transit Timing Variations 2014.000000
Length: 80, dtype: float64
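The shape of this result can be understood on a smaller example. In the sketch below (a hypothetical frame `toy` with made-up values), describe() returns one row per group and one column per statistic, and unstack() turns that table into a Series indexed by (statistic, group). With 8 statistics and 10 groups this gives the 80 entries shown above; with 2 groups it gives 16.

```python
import pandas as pd

# Hypothetical data; the values are made up for illustration.
toy = pd.DataFrame({
    'method': ['Transit', 'Transit', 'Imaging'],
    'year': [2010, 2012, 2008],
})

summary = toy.groupby('method')['year'].describe()  # rows: groups, columns: statistics
stacked = summary.unstack()                         # Series keyed by (statistic, group)

print(summary)
print(stacked)
```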
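II. Aggregate, filter, transform, and apply
Create a new data object:
>>>import numpy as np
>>>import pandas as pd
>>>rng = np.random.RandomState(0)
>>>df = pd.DataFrame({'key': ['A', 'B', 'C', 'A', 'B', 'C'],
...                   'data1': range(6),
...                   'data2': rng.randint(1, 10, 6)},
...                   columns=['key', 'data1', 'data2'])
>>>df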
key data1 data2
0 A 0 6
1 B 1 1
2 C 2 4
3 A 3 4
4 B 4 8
5 C 5 4
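1. Aggregation
The code below groups the rows by key, computes several aggregates for each group, and lays the results out in a new table.
>>>df.groupby('key').aggregate(['min', 'median', 'max'])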
    data1            data2
      min median max   min median max
key
A       0    1.5   3     4    5.0   6
B       1    2.5   4     1    4.5   8
C       2    3.5   5     4    4.0   4
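aggregate() also accepts a dictionary that maps column names to the operation to apply to that column. A minimal sketch, with `df` rebuilt from the values printed above so that the block is self-contained:

```python
import pandas as pd

# Same values as the df shown above, written out explicitly.
df = pd.DataFrame({'key': ['A', 'B', 'C', 'A', 'B', 'C'],
                   'data1': [0, 1, 2, 3, 4, 5],
                   'data2': [6, 1, 4, 4, 8, 4]})

# One aggregation per column: minimum of data1, maximum of data2.
per_column = df.groupby('key').aggregate({'data1': 'min', 'data2': 'max'})
print(per_column)
```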
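2. Filtering
Data can be dropped based on properties of the group it belongs to.
>>>def filter_func(x):
...    # keep a group only if the standard deviation of its data2 column exceeds 4
...    return x['data2'].std() > 4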
>>>df
key data1 data2
0 A 0 6
1 B 1 1
2 C 2 4
3 A 3 4
4 B 4 8
5 C 5 4
>>>df.groupby('key').std()
data1 data2
key
A 2.12132 1.414214
B 2.12132 4.949747
C 2.12132 0.000000
>>>df.groupby('key').filter(filter_func)
key data1 data2
1 B 1 1
4 B 4 8
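The function passed to filter() must return a Boolean value indicating whether a group passes the filter. Because the standard deviation of the "data2" column is not greater than 4 in groups A and C, both groups are dropped and only group B remains.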
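One way to see what filter() does is to build the equivalent row mask by hand. The sketch below is an alternative formulation, not the mechanism pandas uses internally: it broadcasts each group's standard deviation back to its rows with transform('std') and compares the result with filter().

```python
import pandas as pd

# Same values as the df shown above, written out explicitly.
df = pd.DataFrame({'key': ['A', 'B', 'C', 'A', 'B', 'C'],
                   'data1': [0, 1, 2, 3, 4, 5],
                   'data2': [6, 1, 4, 4, 8, 4]})

# Per-row Boolean mask: True where the row's group has std(data2) > 4.
keep = df.groupby('key')['data2'].transform('std') > 4
masked = df[keep]

# The same selection expressed with filter().
filtered = df.groupby('key').filter(lambda g: g['data2'].std() > 4)

print(masked)
print(filtered)
```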
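3. Transformation
Unlike aggregation, a transformation returns output with the same shape as the input, so the result lines up with the original rows. The example below centers the data by subtracting each group's mean. Usage of transform: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.transform.html?highlight=transform#pandas.DataFrame.transform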
>>>df.groupby('key').transform(lambda x: x - x.mean())
data1 data2
0 -1.5 1.0
1 -1.5 -3.5
2 -1.5 0.0
3 1.5 -1.0
4 1.5 3.5
5 1.5 0.0
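Because the transformed output keeps the original index, it can be compared with or attached to the original rows directly. A short sketch under the same data, selecting the numeric columns explicitly and also using the string shortcut 'mean' to broadcast the group means:

```python
import pandas as pd

# Same values as the df shown above, written out explicitly.
df = pd.DataFrame({'key': ['A', 'B', 'C', 'A', 'B', 'C'],
                   'data1': [0, 1, 2, 3, 4, 5],
                   'data2': [6, 1, 4, 4, 8, 4]})

# Center each numeric column on its group mean; the shape matches the input rows.
centered = df.groupby('key')[['data1', 'data2']].transform(lambda x: x - x.mean())

# Broadcast each group's mean of data2 back to every row of that group.
group_means = df.groupby('key')['data2'].transform('mean')

print(centered)
print(group_means)
```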
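4. The apply() method
apply() applies an arbitrary function to each group; the function receives the DataFrame of one group and returns a pandas object or a scalar. See https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.apply.html?highlight=apply#pandas.DataFrame.apply
The example below passes in a function that divides data1 by the sum of data2 within each group:
>>>def norm_by_data2(x):
...    # x is the DataFrame of one group
...    x['data1'] /= x['data2'].sum()
...    return x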
>>>df
key data1 data2
0 A 0 6
1 B 1 1
2 C 2 4
3 A 3 4
4 B 4 8
5 C 5 4
>>>df.groupby('key').apply(norm_by_data2)
key data1 data2
0 A 0.000000 6
1 B 0.111111 1
2 C 0.250000 4
3 A 0.300000 4
4 B 0.444444 8
5 C 0.625000 4
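The same normalization can also be written without modifying the group inside a function. The sketch below is an alternative technique, not the apply() approach used above: it broadcasts the per-group sum of data2 with transform('sum') and divides once. It may be worth considering because the index layout of apply() results can differ between pandas versions.

```python
import pandas as pd

# Same values as the df shown above, written out explicitly.
df = pd.DataFrame({'key': ['A', 'B', 'C', 'A', 'B', 'C'],
                   'data1': [0, 1, 2, 3, 4, 5],
                   'data2': [6, 1, 4, 4, 8, 4]})

# Sum of data2 within each group, repeated on every row of that group.
group_sums = df.groupby('key')['data2'].transform('sum')

# assign() returns a new frame, so df itself is left unchanged.
normalized = df.assign(data1=df['data1'] / group_sums)
print(normalized)
```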
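III. Specifying the split key:
1. A list, array, Series, or index as the grouping key.
The length of the list must match the length of the DataFrame, that is, its number of rows.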
>>>L = [0, 1, 0, 1, 2, 0]
>>>df
key data1 data2
0 A 0 6
1 B 1 1
2 C 2 4
3 A 3 4
4 B 4 8
5 C 5 4
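>>>df.groupby(L)[['data1', 'data2']].sum() # numeric columns only; recent pandas versions would otherwise also concatenate the strings in 'key'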
data1 data2
0 7 14
1 4 5
2 4 8
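In this example the list L supplies the grouping keys: the distinct values in L become the row index of the result, and the value at each position of L assigns the row at the same position in the original object to a group. Here the rows at positions 0, 2, and 5 are summed, the rows at positions 1 and 3 are summed, and the row at position 4 forms a group of its own. The three results correspond to the labels 0, 1, and 2 in L.
2. A dictionary or Series mapping index values to group names
Using a dictionary to map the index values to group names: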
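The row-by-row reading of L can be checked by summing the rows of one label by hand. A minimal sketch, with the numeric columns selected explicitly so that the string column `key` is left out of the sum:

```python
import pandas as pd

# Same values as the df shown above, written out explicitly.
df = pd.DataFrame({'key': ['A', 'B', 'C', 'A', 'B', 'C'],
                   'data1': [0, 1, 2, 3, 4, 5],
                   'data2': [6, 1, 4, 4, 8, 4]})

L = [0, 1, 0, 1, 2, 0]   # one group label per row of df

sums = df.groupby(L)[['data1', 'data2']].sum()

# Label 0 appears at positions 0, 2 and 5, so those rows are added together.
manual = df.loc[[0, 2, 5], ['data1', 'data2']].sum()

print(sums)
print(manual)
```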
>>>df2 = df.set_index('key')
>>>mapping = {'A': 'vowel', 'B': 'consonant', 'C': 'consonant'}
>>>df2
data1 data2
key
A 0 6
B 1 1
C 2 4
A 3 4
B 4 8
C 5 4
>>>df2.groupby(mapping).sum()
data1 data2
consonant 12 17
vowel 3 10
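Passing the dictionary has the same effect as first translating every index value into its group name and then grouping by the resulting list. A small sketch of that equivalence; the name `labels` is introduced here only for illustration:

```python
import pandas as pd

# Same values as the df shown above, written out explicitly.
df = pd.DataFrame({'key': ['A', 'B', 'C', 'A', 'B', 'C'],
                   'data1': [0, 1, 2, 3, 4, 5],
                   'data2': [6, 1, 4, 4, 8, 4]})
df2 = df.set_index('key')
mapping = {'A': 'vowel', 'B': 'consonant', 'C': 'consonant'}

# Group directly by the dictionary.
by_dict = df2.groupby(mapping).sum()

# Equivalent: look up the group name of every index value, then group by that list.
labels = [mapping[k] for k in df2.index]
by_labels = df2.groupby(labels).sum()

print(by_dict)
print(by_labels)
```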
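3. Any Python function:
Any Python function can be passed to groupby; it is called on each index value and its return value is used as the group label.
The example below groups by the lowercased index: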
>>>df2.groupby(str.lower).mean()
data1 data2
a 1.5 5.0
b 2.5 4.5
c 3.5 4.0
4. A list of valid keys:
>>>df2.groupby([str.lower, mapping]).mean()
data1 data2
a vowel 1.5 5.0
b consonant 2.5 4.5
c consonant 3.5 4.0
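When several keys are combined, the result is indexed by a MultiIndex with one level per key. A brief sketch that inspects this index and selects a single entry with a tuple:

```python
import pandas as pd

# Same values as the df shown above, written out explicitly.
df = pd.DataFrame({'key': ['A', 'B', 'C', 'A', 'B', 'C'],
                   'data1': [0, 1, 2, 3, 4, 5],
                   'data2': [6, 1, 4, 4, 8, 4]})
df2 = df.set_index('key')
mapping = {'A': 'vowel', 'B': 'consonant', 'C': 'consonant'}

# Two keys: a function applied to the index and a dictionary lookup on the index.
result = df2.groupby([str.lower, mapping]).mean()

print(result.index.nlevels)                    # number of index levels
print(result.loc[('a', 'vowel'), 'data1'])     # one cell, addressed by a tuple
```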