1. apply
- Series
Series.apply(func, convert_dtype=True, args=(), **kwds)
- func: the aggregation function; it is called automatically on each element of the Series
>>> import pandas as pd
>>> import numpy as np
>>> series = pd.Series([20, 21, 12], index=['London',
... 'New York','Helsinki'])
>>> series
London 20
New York 21
Helsinki 12
dtype: int64
>>> def square(x):
... return x**2
>>> series.apply(square)
London 400
New York 441
Helsinki 144
dtype: int64
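The args and **kwds from the signature are forwarded to func on every call. A minimal sketch (subtract_custom is an illustrative name, not a pandas API):

```python
import pandas as pd

series = pd.Series([20, 21, 12], index=['London', 'New York', 'Helsinki'])

def subtract_custom(x, custom_value=0):
    # each element of the Series is passed in as x
    return x - custom_value

# extra keyword arguments are forwarded to func;
# series.apply(subtract_custom, args=(5,)) is equivalent
result = series.apply(subtract_custom, custom_value=5)
print(result)
# London      15
# New York    16
# Helsinki     7
# dtype: int64
```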
- DataFrame
DataFrame.apply(func, axis=0, broadcast=None, raw=False, reduce=None, result_type=None, args=(), **kwds)
- func: as above
- axis: the axis along which to aggregate
  each call to func receives a Series:
  axis=0: apply walks over each column of the DataFrame and combines the results into a single Series
  axis=1: apply walks over each row of the DataFrame and combines the results into a single Series
>>> df = pd.DataFrame([[4, 9],] * 3, columns=['A', 'B'])
>>> df
A B
0 4 9
1 4 9
2 4 9
>>> df.apply(np.sum, axis=0)
A 12
B 27
dtype: int64
>>> df.apply(np.sum, axis=1)
0 13
1 13
2 13
dtype: int64
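When func itself returns a Series, the per-column (or per-row) results are assembled into a DataFrame rather than a Series. A small sketch with the same df as above:

```python
import pandas as pd

df = pd.DataFrame([[4, 9]] * 3, columns=['A', 'B'])

# func returns a Series per column, so apply builds a DataFrame:
# one result column per input column, indexed by the returned labels
stats = df.apply(lambda col: pd.Series([col.min(), col.max()],
                                       index=['min', 'max']), axis=0)
print(stats)
#      A  B
# min  4  9
# max  4  9
```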
2. iloc, loc
- iloc: index by integer position
- loc: index by label (here, strings)
- Series
>>> s2 = pd.Series(['a', 'b', 'c'], index=['one', 'two', 'three'])
>>> s2
one a
two b
three c
dtype: object
# loc indexes by label
>>> s2.loc['one']
'a'
# iloc indexes by integer position
>>> s2.iloc[0]
'a'
# [] also accepts an integer position here (deprecated for labeled Series in recent pandas)
>>> s2[0]
'a'
- DataFrame
>>> s3 = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]}, index=['one', 'two', 'three'])
>>> s3
a b
one 1 4
two 2 5
three 3 6
# loc and iloc both index rows
>>> s3.iloc[0]
a 1
b 4
Name: one, dtype: int64
>>> s3.loc['one']
a 1
b 4
Name: one, dtype: int64
# [] indexes columns
>>> s3['a']
one 1
two 2
three 3
Name: a, dtype: int64
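Both accessors also take a (row, column) pair for scalar access; a quick sketch with the same s3 frame:

```python
import pandas as pd

s3 = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]},
                  index=['one', 'two', 'three'])

# loc takes labels on both axes; iloc takes integer positions
by_label = s3.loc['two', 'b']    # row 'two', column 'b'
by_position = s3.iloc[1, 1]      # second row, second column
print(by_label, by_position)
# 5 5
```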
3. merge
DataFrame.merge(right, how='inner', on=None, left_on=None, right_on=None, left_index=False, right_index=False, sort=False, suffixes=('_x', '_y'), copy=True, indicator=False, validate=None)
- how: the join type ('inner': inner join, 'outer': full outer join, 'left': left outer join, 'right': right outer join)
- on: the key column(s) to join on (by default, columns with matching names in both tables are used)
- left_on: key column(s) in the left table
- right_on: key column(s) in the right table
- left_index: whether to use the left table's index as its join key
- right_index: whether to use the right table's index as its join key
>>> A >>> B
lkey value rkey value
0 foo 1 0 foo 5
1 bar 2 1 bar 6
2 baz 3 2 qux 7
3 foo 4 3 bar 8
>>> A.merge(B, left_on='lkey', right_on='rkey', how='outer')
lkey value_x rkey value_y
0 foo 1 foo 5
1 foo 4 foo 5
2 bar 2 bar 6
3 bar 2 bar 8
4 baz 3 NaN NaN
5 NaN NaN qux 7
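left_index/right_index from the signature let the index itself act as the join key; suffixes disambiguate the overlapping column name. A sketch (the small frames and the '_l'/'_r' suffixes are illustrative):

```python
import pandas as pd

left = pd.DataFrame({'value': [1, 2]}, index=['foo', 'bar'])
right = pd.DataFrame({'value': [5, 7]}, index=['foo', 'qux'])

# join on both indexes; 'foo' matches in both tables,
# 'bar' and 'qux' get NaN on the side where they are missing
merged = left.merge(right, left_index=True, right_index=True,
                    how='outer', suffixes=('_l', '_r'))
print(merged)
```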
4. sort_values, sort_index
DataFrame.sort_values(by, axis=0, ascending=True, inplace=False, kind='quicksort', na_position='last')
- axis: the axis to sort along; axis=0 sorts rows, axis=1 sorts columns
- by: the sort key(s); column label(s) when axis=0, index label(s) when axis=1
- ascending: True for ascending order, False for descending
- inplace: modify the original DataFrame in place
>>> p1 = pd.DataFrame({'col1': [1, 2, 3],
... 'col2': [4, 5, 6],
... 'col3': [7, 8, 9]})
>>> p1
col1 col2 col3
0 1 4 7
1 2 5 8
2 3 6 9
>>> p1.sort_values(by=0, ascending=False, axis=1)
col3 col2 col1
0 7 4 1
1 8 5 2
2 9 6 3
>>> p1.sort_values(by='col1', ascending=False, axis=0)
col1 col2 col3
2 3 6 9
1 2 5 8
0 1 4 7
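by also accepts a list of keys for a multi-column sort, and na_position decides where missing values land. A sketch (the small frame with a NaN is illustrative):

```python
import pandas as pd
import numpy as np

p = pd.DataFrame({'col1': [2, 1, 2], 'col2': [np.nan, 5, 3]})

# sort by col1 first, then col2; NaNs are placed first instead of last
result = p.sort_values(by=['col1', 'col2'], na_position='first')
print(result)
#    col1  col2
# 1     1   5.0
# 0     2   NaN
# 2     2   3.0
```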
DataFrame.sort_index(axis=0, level=None, ascending=True, inplace=False, kind='quicksort', na_position='last', sort_remaining=True, by=None)
- axis: the axis whose labels to sort; axis=0 sorts the row index, axis=1 sorts the column index
>>> p1
col1 col2 col3
0 1 4 7
1 2 5 8
2 3 6 9
>>> p1.sort_index(axis=0, ascending=False)
col1 col2 col3
2 3 6 9
1 2 5 8
0 1 4 7
>>> p1.sort_index(axis=1, ascending=False)
col3 col2 col1
0 7 4 1
1 8 5 2
2 9 6 3
5. idxmax
- Series
Series.idxmax(axis=0, skipna=True, *args, **kwargs)
- returns the index label of the maximum value (the first occurrence, if there are ties)
>>> s = pd.Series(data=[1, None, 4, 3, 4],
... index=['A', 'B', 'C', 'D', 'E'])
>>> s
A 1.0
B NaN
C 4.0
D 3.0
E 4.0
dtype: float64
>>> s.idxmax()
'C'
- DataFrame
DataFrame.idxmax(axis=0, skipna=True)
- returns a Series of the index labels of each column's (or, with axis=1, each row's) maximum
>>> p2
col1 col2 col3
0 1 6 2
1 2 1 8
2 3 0 1
>>> p2.idxmax()
col1 2
col2 0
col3 1
dtype: int64
>>> p2.idxmax(axis=1)
0 col2
1 col3
2 col1
dtype: object
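idxmin is the mirror of idxmax, returning the label of the first minimum; a minimal sketch:

```python
import pandas as pd

s = pd.Series([1.0, 4.0, 3.0], index=['A', 'C', 'D'])

# idxmin returns the label of the first occurrence of the minimum
lowest = s.idxmin()
print(lowest)
# A
```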
6. Newly constructed features should be converted to a NumPy dtype
result['citable doc per person'] = (result['Citable documents'] / result['Population']).astype(np.float64)
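Self-contained, the line above looks like this (the numbers below are made up; only the column names come from the snippet). astype makes the resulting dtype explicit:

```python
import pandas as pd
import numpy as np

result = pd.DataFrame({'Citable documents': [100, 250],
                       'Population': [40, 50]})

# division already yields float64, but astype pins the dtype explicitly
result['citable doc per person'] = (
    result['Citable documents'] / result['Population']
).astype(np.float64)
print(result['citable doc per person'].dtype)
# float64
```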
7. corr
- Series
Series.corr(other, method='pearson', min_periods=None)
- computes the correlation coefficient between two Series
>>> p2
col1 col2 col3
0 1 6 2
1 2 1 8
2 3 0 1
>>> p2.col2.corr(p2.col3)
-0.23281119015753007
- DataFrame
DataFrame.corr(method='pearson', min_periods=1)
- computes the pairwise correlation coefficients between the columns of a DataFrame
>>> p2
col1 col2 col3
0 1 6 2
1 2 1 8
2 3 0 1
>>> p2.corr()
col1 col2 col3
col1 1.000000 -0.933257 -0.132068
col2 -0.933257 1.000000 -0.232811
col3 -0.132068 -0.232811 1.000000
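method also accepts 'kendall' and 'spearman'. Spearman correlates the ranks, so a perfectly monotonic but non-linear relationship still scores 1.0; a sketch:

```python
import pandas as pd

s1 = pd.Series([1, 2, 3, 4])
s2 = pd.Series([1, 4, 9, 16])   # monotonic in s1, but not linear

# rank correlation: both rank sequences are identical, so rho is 1
rho = s1.corr(s2, method='spearman')
print(rho)
```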
8. map
- Series
Series.map(arg, na_action=None)
- accepts a function or a dict-like object as its argument and returns the values mapped through that function or dict
>>> p2
col1 col2 col3
0 1 6 2
1 2 1 8
2 3 0 1
>>> mapp1 = {1: 10, 2: 12}
>>> p2.col1.map(mapp1)
0 10.0
1 12.0
2 NaN
Name: col1, dtype: float64
>>> p2
col1 col2 col3
0 1 6 2
1 2 1 8
2 3 0 1
>>> p2.col2.map(lambda x: x+1)
0 7
1 2
2 1
Name: col2, dtype: int64
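Series.map also takes na_action='ignore', which propagates NaN without calling the function on it; a sketch:

```python
import pandas as pd
import numpy as np

s = pd.Series([1.0, np.nan, 3.0])

# without na_action='ignore', the lambda would also receive NaN
mapped = s.map(lambda x: x * 10, na_action='ignore')
print(mapped)
# 0    10.0
# 1     NaN
# 2    30.0
# dtype: float64
```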
9. agg
DataFrameGroupBy.agg(arg, *args, **kwargs)
- func: string function name / function / list of functions / dict of column names -> functions (or list of functions)
>>> df
A B C
0 1 1 0.362838
1 1 2 0.227877
2 2 3 1.267767
3 2 4 -0.562860
>>> df.groupby('A').agg(['min', 'max'])
B C
min max min max
A
1 1 2 0.227877 0.362838
2 3 4 -0.562860 1.267767
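The dict form from the signature maps each column to its own aggregation(s); a sketch with made-up data:

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 1, 2, 2],
                   'B': [1, 2, 3, 4],
                   'C': [0.3, 0.2, 1.2, -0.5]})

# per-column aggregations: B gets min and max, C gets only the sum;
# the result columns form a MultiIndex like ('B', 'min')
out = df.groupby('A').agg({'B': ['min', 'max'], 'C': 'sum'})
print(out)
```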
10. cut, qcut
- cut
pandas.cut(x, bins, right=True, labels=None, retbins=False, precision=3, include_lowest=False, duplicates='raise')
- discretizes continuous values into equal-width bins
>>> pd.cut(np.array([1, 7, 5, 4, 6, 3]), 3)
[(0.994, 3.0], (5.0, 7.0], (3.0, 5.0], (3.0, 5.0], (5.0, 7.0], (0.994, 3.0]]
Categories (3, interval[float64]): [(0.994, 3.0] < (3.0, 5.0] < (5.0, 7.0]]
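The labels parameter replaces the interval notation above with custom category names; a sketch using the same data (the label names are illustrative):

```python
import pandas as pd
import numpy as np

# same three equal-width bins as above, but named instead of shown as intervals
binned = pd.cut(np.array([1, 7, 5, 4, 6, 3]), 3,
                labels=['low', 'mid', 'high'])
print(list(binned))
# ['low', 'high', 'mid', 'mid', 'high', 'low']
```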
- qcut
pandas.qcut(x, q, labels=None, retbins=False, precision=3, duplicates='raise')
- discretizes values into bins of (approximately) equal frequency, based on sample quantiles
>>> pd.qcut(range(5), 4)
[(-0.001, 1.0], (-0.001, 1.0], (1.0, 2.0], (2.0, 3.0], (3.0, 4.0]]
Categories (4, interval[float64]): [(-0.001, 1.0] < (1.0, 2.0] < (2.0, 3.0] < (3.0, 4.0]]
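Because qcut splits at sample quantiles, each bin receives roughly the same number of observations; labels works the same way as for cut (the 'q1'..'q4' names are illustrative):

```python
import pandas as pd

# 8 values into 4 quantile bins -> exactly 2 values per bin
quartiles = pd.qcut(range(8), 4, labels=['q1', 'q2', 'q3', 'q4'])
print(list(quartiles))
# ['q1', 'q1', 'q2', 'q2', 'q3', 'q3', 'q4', 'q4']
```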