1. Using pivot
In the pandas source code, the docstring for pivot explains it as follows:
    Return reshaped DataFrame organized by given index / column values.

    Reshape data (produce a "pivot" table) based on column values. Uses
    unique values from specified `index` / `columns` to form axes of the
    resulting DataFrame. This function does not support data
    aggregation, multiple values will result in a MultiIndex in the
    columns. See the :ref:`User Guide <reshaping>` for more on reshaping.

    Parameters
    ----------%s
    index : str or object, optional
        Column to use to make new frame's index. If None, uses
        existing index.
    columns : str or object
        Column to use to make new frame's columns.
    values : str, object or a list of the previous, optional
        Column(s) to use for populating new frame's values. If not
        specified, all remaining columns will be used and the result will
        have hierarchically indexed columns.

        .. versionchanged:: 0.23.0
           Also accept list of column names.

    Returns
    -------
    DataFrame
        Returns reshaped DataFrame.
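The docstring above sums up the purpose of pivot very concisely:
it returns a reshaped DataFrame organized by the given index / column values.
If that sentence does not make sense yet, don't worry; the rest of this post works through it.
Let's look at a pivot example.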
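import pandas as pd

def t2():
    data = {
        "a": [1, 2, 3, 1, 2],
        "b": [10, 20, 30, 40, 50],
        "c": ['x', 'y', 'z', 'm', 'n']
    }
    data = pd.DataFrame(data)
    # Rows come from the unique values of 'a', columns from 'b', cells from 'c'
    result = data.pivot(index='a', columns='b', values='c')
    print(result, "\n")
    # (a, b) pairs that never occur are NaN; replace them with a placeholder
    result = result.fillna('un')
    print(result)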
The code prints:
b   10   20   30   40   50
a
1    x  NaN  NaN    m  NaN
2  NaN    y  NaN  NaN    n
3  NaN  NaN    z  NaN  NaN

b  10  20  30  40  50
a
1   x  un  un   m  un
2  un   y  un  un   n
3  un  un   z  un  un
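One way to read the output above is that pivot behaves like indexing the rows by the (a, b) pair and then moving b into the columns. The sketch below checks that reading on the same data; the variable names (pivoted, unstacked, same_cells) are only illustrative, and the set_index/unstack route is offered as an equivalent view, not as how pivot is implemented internally.

```python
import pandas as pd

data = pd.DataFrame({
    "a": [1, 2, 3, 1, 2],
    "b": [10, 20, 30, 40, 50],
    "c": ["x", "y", "z", "m", "n"],
})

pivoted = data.pivot(index="a", columns="b", values="c")

# Equivalent reading: index by (a, b), keep 'c', then move level 'b' into the columns
unstacked = data.set_index(["a", "b"])["c"].unstack("b")

# Compare after filling the gaps, because NaN never compares equal to NaN
same_cells = (
    pivoted.fillna("un").to_numpy() == unstacked.fillna("un").to_numpy()
).all()
print(same_cells)
```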
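2. pivot_table
The difference between pivot_table and pivot is that pivot only reshapes data and cannot aggregate it. In addition, if the columns passed to pivot as index and columns contain a duplicated (index, columns) pair, pivot raises an error, because it cannot decide which value belongs in that cell.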
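The duplicate-entry failure can be reproduced with a small sketch. The DataFrame dup below is hypothetical data invented for illustration, and aggfunc="first" is just one possible way to resolve the conflict.

```python
import pandas as pd

# Hypothetical data: the pair (a=1, b=10) appears twice
dup = pd.DataFrame({
    "a": [1, 1, 2],
    "b": [10, 10, 20],
    "c": ["x", "y", "z"],
})

try:
    dup.pivot(index="a", columns="b", values="c")
    pivot_failed = False
except ValueError as err:
    # pivot cannot decide whether "x" or "y" belongs in cell (1, 10)
    pivot_failed = True
    print("pivot failed:", err)

# pivot_table resolves the conflict by aggregating; "first" keeps the first value seen
table = dup.pivot_table(index="a", columns="b", values="c", aggfunc="first")
print(table)
```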
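pivot_table also reshapes data. Reshaping makes data more intuitive and easier to analyze; this is what is commonly called pivoting, and anyone who uses Excel regularly will be familiar with pivot tables. On top of that, pivot_table can aggregate the data. Let's look at an example.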
import pandas as pd

def t3():
    data = {
        "a": [1, 1, 2, 2, 3, 1, 2, 3],
        "b": [1, 1, 1, 1, 1, 2, 2, 2],
        "c": [1, 2, 3, 4, 5, 6, 7, 8]
    }
    df = pd.DataFrame(data)
    # Sum 'c' for every (a, b) pair; the string 'sum' gives the same result as np.sum
    # <class 'pandas.core.frame.DataFrame'>
    result = pd.pivot_table(df, index=['a'], columns=['b'], values=['c'], aggfunc='sum')
    print(result, "\n")
The code prints:
   c
b  1  2
a
1  3  6
2  7  7
3  5  8
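3. groupby
The pivot_table function described above can group and aggregate data, but in day-to-day work the tool we use most for grouped aggregation is groupby. For example, groupby can reproduce the result from section 2: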
import pandas as pd

def t4():
    data = {
        "a": [1, 1, 2, 2, 3, 1, 2, 3],
        "b": [1, 1, 1, 1, 1, 2, 2, 2],
        "c": [1, 2, 3, 4, 5, 6, 7, 8]
    }
    # Selecting the single column 'c' yields a Series indexed by (a, b)
    # <class 'pandas.core.series.Series'>
    grouped = pd.DataFrame(data).groupby(['a', 'b'])['c'].sum()
    for key, value in grouped.items():
        print(key, value)
The output is:
(1, 1) 3
(1, 2) 6
(2, 1) 7
(2, 2) 7
(3, 1) 5
(3, 2) 8
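As you can see, the aggregated values are exactly the same as those from pivot_table; only the layout differs (a long Series here instead of a wide table).
So what is the difference between pivot_table and groupby?
In one sentence: pivot_table and groupby both aggregate data, and the difference lies only in the shape of the result. pivot and pivot_table rearrange the data into a layout that is easier to read, which is what pivoting means, whereas groupby mainly performs grouped aggregation. That is why we usually reach for groupby directly when aggregation is all we need.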
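The "same values, different shape" point can be checked with a short sketch on the data from the earlier examples. The names long_form, wide_from_groupby and wide_from_pivot are illustrative only; unstack is used here as one way to turn the groupby result into the wide layout.

```python
import pandas as pd

df = pd.DataFrame({
    "a": [1, 1, 2, 2, 3, 1, 2, 3],
    "b": [1, 1, 1, 1, 1, 2, 2, 2],
    "c": [1, 2, 3, 4, 5, 6, 7, 8],
})

# Long shape: one row per (a, b) pair
long_form = df.groupby(["a", "b"])["c"].sum()

# Moving level 'b' into the columns gives the wide, pivot-style shape
wide_from_groupby = long_form.unstack("b")

# pivot_table produces the wide shape directly
wide_from_pivot = df.pivot_table(index="a", columns="b", values="c", aggfunc="sum")

print(wide_from_groupby)
print(wide_from_pivot)
```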
4. crosstab
crosstab is a specialized pivot table for counting how often each combination of groups occurs; it is a special case of pivot_table. Its docstring reads:

    Compute a simple cross tabulation of two (or more) factors. By default
    computes a frequency table of the factors unless an array of values and an
    aggregation function are passed.

    Parameters
    ----------
    index : array-like, Series, or list of arrays/Series
        Values to group by in the rows.
    columns : array-like, Series, or list of arrays/Series
        Values to group by in the columns.
    values : array-like, optional
        Array of values to aggregate according to the factors.
        Requires `aggfunc` be specified.
    rownames : sequence, default None
        If passed, must match number of row arrays passed.
    colnames : sequence, default None
        If passed, must match number of column arrays passed.
    aggfunc : function, optional
        If specified, requires `values` be specified as well.
    margins : bool, default False
        Add row/column margins (subtotals).
    margins_name : str, default 'All'
        Name of the row/column that will contain the totals
        when margins is True.

        .. versionadded:: 0.21.0

    dropna : bool, default True
        Do not include columns whose entries are all NaN.
    normalize : bool, {'all', 'index', 'columns'}, or {0,1}, default False
        Normalize by dividing all values by the sum of values.

        - If passed 'all' or `True`, will normalize over all values.
        - If passed 'index' will normalize over each row.
        - If passed 'columns' will normalize over each column.
        - If margins is `True`, will also normalize margin values.

    Returns
    -------
    DataFrame
        Cross tabulation of the data.

    See Also
    --------
    DataFrame.pivot : Reshape data based on column values.
    pivot_table : Create a pivot table as a DataFrame.
Again, let's look at an example.
import pandas as pd

def t5():
    data = {
        "a": [1, 1, 2, 2, 3, 1, 2, 3],
        "b": [1, 1, 1, 1, 1, 2, 2, 2],
        "c": [1, 2, 3, 4, 5, 6, 7, 8]
    }
    data = pd.DataFrame(data)
    # Frequency table: how many rows fall into each (a, b) pair
    # <class 'pandas.core.frame.DataFrame'>
    df = pd.crosstab(index=data.a, columns=data.b)
    print(df, "\n")
    # The same counts in long form, computed with groupby
    df2 = data.groupby(['a', 'b']).agg({'c': 'count'})
    print(df2, "\n")
    # Running total of the counts down the rows
    df3 = pd.crosstab(index=data['a'], columns=data['b']).cumsum(axis=0)
    print(df3)
The code prints:
b  1  2
a
1  2  1
2  2  1
3  1  1

     c
a b
1 1  2
  2  1
2 1  2
  2  1
3 1  1
  2  1

b  1  2
a
1  2  1
2  4  2
3  5  3
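To make the "special case of pivot_table" claim concrete, the sketch below builds the same frequency table both ways and also tries the normalize option from the docstring. The names freq, via_pivot and row_share are illustrative. The pivot_table route counts non-null values of c, so the two only match here because c has no missing values.

```python
import pandas as pd

data = pd.DataFrame({
    "a": [1, 1, 2, 2, 3, 1, 2, 3],
    "b": [1, 1, 1, 1, 1, 2, 2, 2],
    "c": [1, 2, 3, 4, 5, 6, 7, 8],
})

# Frequency table via crosstab
freq = pd.crosstab(index=data["a"], columns=data["b"])

# Same counts via pivot_table: count the non-null 'c' values per (a, b) pair
via_pivot = data.pivot_table(index="a", columns="b", values="c", aggfunc="count")

# normalize="index" turns each row of counts into shares that sum to 1
row_share = pd.crosstab(index=data["a"], columns=data["b"], normalize="index")

print(freq)
print(via_pivot)
print(row_share)
```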