There are two major differences between the transform and apply groupby methods.
- apply implicitly passes all the columns for each group as a DataFrame to the custom function, while transform passes each column for each group as a Series to the custom function.
- The custom function passed to apply can return a scalar, a Series or a DataFrame (or a NumPy array, or even a list). The custom function passed to transform must return a sequence (a one-dimensional Series, array or list) the same length as the group.
So, transform works on just one Series at a time and apply works on the entire DataFrame at once.
Source: https://stackoverflow.com/questions/27517425/apply-vs-transform-on-a-group-object#
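The transform function:
1. Operates on only one Series at a time; a function that subtracts column 'b' from column 'a' raises an exception.
2. Must return a one-dimensional sequence with the same length (number of rows) as the group.
3. May also return a single scalar, as in .transform(sum); the scalar is then repeated for every row of the group.
The apply function:
1. Unlike transform, which handles one Series at a time, apply acts on the whole DataFrame of each group.
2. apply implicitly passes all the columns of the group to the custom function as a DataFrame.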
Example:
import numpy as np
import pandas as pd

data = pd.DataFrame({'state': ['Florida', 'Florida', 'Texas', 'Texas'],
                     'a': [4, 5, 1, 3],
                     'b': [6, 10, 3, 11]})
print(data)
#    a   b    state
# 0  4   6  Florida
# 1  5  10  Florida
# 2  1   3    Texas
# 3  3  11    Texas
# (the column order shown depends on the pandas version)

def sub_two(X):
    # X is the DataFrame of one group, so both columns are available
    return X['a'] - X['b']

# apply works here; using transform instead raises an error,
# because transform passes one column at a time
data1 = data.groupby(data['state']).apply(sub_two)
print(data1)
# state
# Florida  0   -2
#          1   -5
# Texas    2   -2
#          3   -8
# dtype: int64
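transform can also be used when the function returns a single scalar. The outputs of transform and apply take different forms, though: transform returns as many rows as the original data, while apply aggregates to one row per group. In this case the apply output is the more informative of the two.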
def group_sum(x):
    return x.sum()

# transform returns the same number of rows as the original data
data3 = data.groupby(data['state']).transform(group_sum)
print(data3)
#    a   b
# 0  9  16
# 1  9  16
# 2  4  14
# 3  4  14

# apply, by contrast, returns one row per group
data4 = data.groupby(data['state']).apply(group_sum)
print(data4)
#          a   b           state
# state
# Florida  9  16  FloridaFlorida
# Texas    4  14      TexasTexas
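One practical consequence of the two output shapes: the transform result keeps the original row index, so it can be combined row by row with the source columns. The sketch below assumes the same small table. The names per_row, per_group and share_of_a are illustrative and not from the original.

```python
import pandas as pd

data = pd.DataFrame({'state': ['Florida', 'Florida', 'Texas', 'Texas'],
                     'a': [4, 5, 1, 3],
                     'b': [6, 10, 3, 11]})

grouped = data.groupby('state')[['a', 'b']]

per_row = grouped.transform(lambda x: x.sum())   # one row per original row
per_group = grouped.apply(lambda g: g.sum())     # one row per group

# The transform result lines up with the original rows, so each value of 'a'
# can be divided by the total of its own group.
share_of_a = data['a'] / per_row['a']

print(per_row.shape, per_group.shape)
print(share_of_a.tolist())
```

The apply result, indexed by state, would first need to be merged or mapped back onto the rows to do the same calculation.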
The other difference is that transform must return a one-dimensional sequence the same size as the group. In this particular instance, each group has two rows, so transform must return a sequence of two rows. If it does not, an error is raised.
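The length requirement can be sketched on a single column. The helper return_three is hypothetical and exists only to return the wrong number of values. The exception type is left generic because it is assumed to vary between pandas versions.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame({'state': ['Florida', 'Florida', 'Texas', 'Texas'],
                     'a': [4, 5, 1, 3],
                     'b': [6, 10, 3, 11]})

def return_three(x):
    # hypothetical helper: ignores the group and returns three values,
    # although every group here has only two rows
    return np.array([1, 2, 3])

length_error = None
try:
    data.groupby('state')['a'].transform(return_three)
except Exception as exc:  # exact type assumed to vary by pandas version
    length_error = exc

# A result with exactly one value per row of the group is accepted
doubled = data.groupby('state')['a'].transform(lambda x: x.to_numpy() * 2)

print(type(length_error).__name__)
print(doubled.tolist())
```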
Example 2:
np.random.seed(666)
df = pd.DataFrame({'A': ['foo', 'bar', 'foo', 'bar',
                         'foo', 'bar', 'foo', 'foo'],
                   'B': ['one', 'one', 'two', 'three',
                         'two', 'two', 'one', 'three'],
                   'C': np.random.randn(8),
                   'D': np.random.randn(8)})
print(df)
#      A      B         C         D
# 0  foo    one  0.824188  0.640573
# 1  bar    one  0.479966 -0.786443
# 2  foo    two  1.173468  0.608870
# 3  bar  three  0.909048 -0.931012
# 4  foo    two -0.571721  0.978222
# 5  bar    two -0.109497 -0.736918
# 6  foo    one  0.019028 -0.298733
# 7  foo  three -0.943761 -0.460587

def zscore(x):
    # standardize within the group: subtract the mean, then divide by
    # the standard deviation (not the variance)
    return (x - x.mean()) / x.std()

# Select the numeric columns C and D explicitly. Older pandas versions
# dropped the string column B automatically; newer versions may raise instead.
print(df.groupby('A')[['C', 'D']].transform(zscore))
# With the same column selection, apply produces the same values
print(df.groupby('A')[['C', 'D']].apply(zscore))
# df.groupby('A').apply(zscore) raises an error, because apply passes the
# whole group DataFrame, including the string column B

# Group by A, then repeat the per-group sum of C on every row
df['sum_c'] = df.groupby('A')['C'].transform(sum)
df = df.sort_values('A')
print(df)
#      A      B         C         D     sum_c
# 1  bar    one  0.479966 -0.786443  1.279517
# 3  bar  three  0.909048 -0.931012  1.279517
# 5  bar    two -0.109497 -0.736918  1.279517
# 0  foo    one  0.824188  0.640573  0.501202
# 2  foo    two  1.173468  0.608870  0.501202
# 4  foo    two -0.571721  0.978222  0.501202
# 6  foo    one  0.019028 -0.298733  0.501202
# 7  foo  three -0.943761 -0.460587  0.501202

print(df.groupby('A')['C'].apply(sum))
# A
# bar    1.279517
# foo    0.501202
# Name: C, dtype: float64
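A quick way to check a per-group z-score is to confirm that every group ends up with mean 0 and standard deviation 1. The sketch below uses a small hand-made table (an assumption, chosen so the foo group has mean 3 and standard deviation 2) and the conventional definition that divides by the standard deviation. pandas computes std with ddof=1 by default.

```python
import numpy as np
import pandas as pd

# Illustrative data, not the random table used above
df = pd.DataFrame({'A': ['foo', 'bar', 'foo', 'bar', 'foo', 'bar'],
                   'C': [1.0, 2.0, 3.0, 4.0, 5.0, 9.0]})

def zscore(x):
    # subtract the group mean, divide by the group standard deviation
    return (x - x.mean()) / x.std()

z = df.groupby('A')['C'].transform(zscore)

# Re-group the standardized values to inspect each group's mean and std
check = z.groupby(df['A']).agg(['mean', 'std'])

print(z[df['A'] == 'foo'].tolist())
print(check)
```

Because transform returns one value per original row, z can be stored directly as a new column, in the same way as sum_c.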
The function passed to transform must return a number, a row, or an object with the same shape as the argument. If it returns a number, that number is assigned to every element in the group; if it returns a row, the row is broadcast to all the rows in the group.
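Two of the cases named above, a single number and a result with the same shape as the argument, can be sketched on one column. The table and the names group_mean and centered are illustrative assumptions. The row case is left out of this sketch.

```python
import pandas as pd

data = pd.DataFrame({'state': ['Florida', 'Florida', 'Texas', 'Texas'],
                     'a': [4, 5, 1, 3]})

by_state = data.groupby('state')['a']

# A number per group: the group mean is repeated for every row of the group
group_mean = by_state.transform(lambda x: x.mean())

# Same shape as the argument: each value minus the mean of its own group
centered = by_state.transform(lambda x: x - x.mean())

print(group_mean.tolist())
print(centered.tolist())
```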
Reference: https://stackoverflow.com/questions/27517425/apply-vs-transform-on-a-group-object#