Chapter 2: Indexing
import numpy as np
import pandas as pd
df = pd.read_csv('data/table.csv', index_col='ID')
df.head()
      School  Class  Gender   Address  Height  Weight  Math  Physics
ID
1101     S_1    C_1       M  street_1     173      63  34.0       A+
1102     S_1    C_1       F  street_2     192      73  32.5       B+
1103     S_1    C_1       M  street_2     186      82  87.2       B+
1104     S_1    C_1       F  street_2     167      81  80.4       B-
1105     S_1    C_1       F  street_4     159      64  84.8       B+
df.describe()
           Height      Weight       Math
count   35.000000   35.000000  35.000000
mean   174.142857   74.657143  61.351429
std     13.541098   12.895377  19.915164
min    155.000000   53.000000  31.500000
25%    161.000000   63.000000  47.400000
50%    173.000000   74.000000  61.700000
75%    187.500000   82.000000  77.100000
max    195.000000  100.000000  97.000000
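I. Single-Level Indexing
1. The loc method, the iloc method, and the [] operator
These three are probably the most commonly used indexing tools: iloc indexes by position, loc indexes by label, and [] is a convenient shorthand. Each has its own characteristics.
(a) The loc method
① Single-row indexing:
(Note: every slice used inside loc includes its right endpoint. A pandas user rarely knows or cares which label comes immediately after the last one wanted; with a half-open convention you would first have to look up the name of the next row or column, which is inconvenient. pandas therefore makes loc slices closed on both ends.)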
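To make the endpoint convention concrete, here is a minimal sketch on a small hypothetical frame (`toy`, with made-up values; it is not the tutorial's table.csv). It relies only on standard pandas behavior: label slices in loc include both endpoints, while position slices in iloc exclude the right one.

```python
import pandas as pd

# Hypothetical toy data, used only to illustrate the slicing convention
toy = pd.DataFrame({"x": [10, 20, 30, 40]}, index=[1101, 1102, 1103, 1104])

by_label = toy.loc[1101:1103]    # label slice: 1103 is included
by_position = toy.iloc[0:2]      # position slice: position 2 is excluded

print(list(by_label.index))
print(list(by_position.index))
```

The label slice returns three rows and the position slice returns two, even though both slices are written with "the same shape".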
df.loc[1103]
School          S_1
Class           C_1
Gender            M
Address    street_2
Height          186
Weight           82
Math           87.2
Physics          B+
Name: 1103, dtype: object
df.loc[1101]
School          S_1
Class           C_1
Gender            M
Address    street_1
Height          173
Weight           63
Math             34
Physics          A+
Name: 1101, dtype: object
② Multi-row indexing:
df.loc[[1102, 2304]]
df.loc[[1102, 2201]]
      School  Class  Gender   Address  Height  Weight  Math  Physics
ID
1102     S_1    C_1       F  street_2     192      73  32.5       B+
2201     S_2    C_2       M  street_5     193     100  39.1        B
df.loc[1304:].head()
      School  Class  Gender   Address  Height  Weight  Math  Physics
ID
1304     S_1    C_3       M  street_2     195      70  85.2        A
1305     S_1    C_3       F  street_5     187      69  61.7       B-
2101     S_2    C_1       M  street_7     174      84  83.3        C
2102     S_2    C_1       F  street_6     161      61  50.6       B+
2103     S_2    C_1       M  street_4     157      61  52.5       B-
df.loc[2402::-1].head()
df.loc[2402:2401:-1].head()
      School  Class  Gender   Address  Height  Weight  Math  Physics
ID
2402     S_2    C_4       M  street_7     166      82  48.7        B
2401     S_2    C_4       F  street_2     192      62  45.3        A
③ Single-column indexing:
df.loc[:, 'Height'].head()
df['Height'].head()
df.loc[1101:1102, 'Height']
ID
1101    173
1102    192
Name: Height, dtype: int64
④ Multi-column indexing:
df.loc[:, ['Height', 'Math']].head()
df[['Height', 'Math']].head()
      Height  Math
ID
1101     173  34.0
1102     192  32.5
1103     186  87.2
1104     167  80.4
1105     159  84.8
df.loc[:, 'Height':'Math'].head()
      Height  Weight  Math
ID
1101     173      63  34.0
1102     192      73  32.5
1103     186      82  87.2
1104     167      81  80.4
1105     159      64  84.8
⑤ Combined row and column indexing:
df.loc[1102:2401:3, 'Height':'Math'].head()
df.loc[1102:2201:3, 'Height':'Math'].head()
df.loc[1102:2201:3, 'Height':'Math':2].head()
      Height  Math
ID
1102     192  32.5
1105     159  84.8
1203     160  58.8
1301     161  31.5
1304     195  85.2
⑥ Callable (function-based) indexing:
df.loc[lambda x: x['Gender'] == 'M'].head()
df[df['Gender'] == 'M'].head()
      School  Class  Gender   Address  Height  Weight  Math  Physics
ID
1101     S_1    C_1       M  street_1     173      63  34.0       A+
1103     S_1    C_1       M  street_2     186      82  87.2       B+
1201     S_1    C_2       M  street_5     188      68  97.0       A-
1203     S_1    C_2       M  street_6     160      53  58.8       A+
1301     S_1    C_3       M  street_4     161      68  31.5       B+
def lam(x):
    return x['Gender'] == 'M'
df.loc[lam].head()
def f(x):
    return [1101, 1103]
df.loc[f]
      School  Class  Gender   Address  Height  Weight  Math  Physics
ID
1101     S_1    C_1       M  street_1     173      63  34.0       A+
1103     S_1    C_1       M  street_2     186      82  87.2       B+
⑦ Boolean indexing (covered in detail in Section 2)
df.loc[df['Address'].isin(['street_7', 'street_4'])].head()
      School  Class  Gender   Address  Height  Weight  Math  Physics
ID
1105     S_1    C_1       F  street_4     159      64  84.8       B+
1202     S_1    C_2       F  street_4     176      94  63.5       B-
1301     S_1    C_3       M  street_4     161      68  31.5       B+
1303     S_1    C_3       M  street_7     188      82  49.7        B
2101     S_2    C_1       M  street_7     174      84  83.3        C
df.loc[[True if i[-1] == '4' or i[-1] == '7' else False for i in df['Address'].values]].head()
      School  Class  Gender   Address  Height  Weight  Math  Physics
ID
1105     S_1    C_1       F  street_4     159      64  84.8       B+
1202     S_1    C_2       F  street_4     176      94  63.5       B-
1301     S_1    C_3       M  street_4     161      68  31.5       B+
1303     S_1    C_3       M  street_7     188      82  49.7        B
2101     S_2    C_1       M  street_7     174      84  83.3        C
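Summary: in essence, loc accepts only two kinds of input, a boolean list or a list made up of a subset of the index labels. Keeping this principle in mind makes all of the operations above easy to understand.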
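The principle can be checked on a small hypothetical frame (`toy`, with made-up values): a boolean mask, a callable that returns that mask, and an explicit list of labels all select the same rows. This is a minimal sketch of the equivalence, not a description of how pandas implements loc internally.

```python
import pandas as pd

# Hypothetical toy data for illustration only
toy = pd.DataFrame(
    {"Gender": ["M", "F", "M"], "Math": [34.0, 32.5, 87.2]},
    index=[1101, 1102, 1103],
)

mask = toy["Gender"] == "M"

via_mask = toy.loc[mask]                               # boolean Series
via_callable = toy.loc[lambda d: d["Gender"] == "M"]   # callable returning the same mask
via_labels = toy.loc[[1101, 1103]]                     # explicit subset of index labels

print(via_mask.equals(via_callable), via_mask.equals(via_labels))
```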
(b) The iloc method (note: unlike loc, slices exclude the right endpoint)
① Single-row indexing:
df.iloc[3]
df.iloc[0]
School          S_1
Class           C_1
Gender            M
Address    street_1
Height          173
Weight           63
Math             34
Physics          A+
Name: 1101, dtype: object
② Multi-row indexing:
df.iloc[[0, 3]]
      School  Class  Gender   Address  Height  Weight  Math  Physics
ID
1101     S_1    C_1       M  street_1     173      63  34.0       A+
1104     S_1    C_1       F  street_2     167      81  80.4       B-
df.iloc[3:5]
      School  Class  Gender   Address  Height  Weight  Math  Physics
ID
1104     S_1    C_1       F  street_2     167      81  80.4       B-
1105     S_1    C_1       F  street_4     159      64  84.8       B+
df.iloc[4:2:-1]
      School  Class  Gender   Address  Height  Weight  Math  Physics
ID
1105     S_1    C_1       F  street_4     159      64  84.8       B+
1104     S_1    C_1       F  street_2     167      81  80.4       B-
③ Single-column indexing:
df.iloc[:, 3].head()
ID
1101    street_1
1102    street_2
1103    street_2
1104    street_2
1105    street_4
Name: Address, dtype: object
④ Multi-column indexing:
df.iloc[:, 7::-2].head()
      Physics  Weight   Address  Class
ID
1101       A+      63  street_1    C_1
1102       B+      73  street_2    C_1
1103       B+      82  street_2    C_1
1104       B-      81  street_2    C_1
1105       B+      64  street_4    C_1
⑤ Mixed row and column indexing:
df.iloc[3::4, 7::-2].head()
df.iloc[3::4, 7::-2]
      Physics  Weight   Address  Class
ID
1104       B-      81  street_2    C_1
1203       A+      53  street_6    C_2
1302       A-      57  street_1    C_3
2101        C      84  street_7    C_1
2105        A      81  street_4    C_1
2204       B-      74  street_1    C_2
2303        C      99  street_7    C_3
2402        B      82  street_7    C_4
⑥ Callable (function-based) indexing:
df.iloc[lambda x: [3]].head()
df.iloc[[i for i in range(5)]]
def f(x):
    temp = []
    index = 0
    for i in x['Physics'].values:
        if i == 'A+':
            temp.append(index)
        index += 1
    return temp
df.iloc[f]
df.iloc[list(df['Physics'] == 'A+')]
      School  Class  Gender   Address  Height  Weight  Math  Physics
ID
1101     S_1    C_1       M  street_1     173      63  34.0       A+
1203     S_1    C_2       M  street_6     160      53  58.8       A+
2203     S_2    C_2       M  street_4     155      91  73.8       A+
Summary: iloc accepts only integers, lists of integers, or boolean lists. A boolean Series cannot be used directly; to use one, extract its values as shown here:
df.iloc[(df['School'] == 'S_1').values].head()
      School  Class  Gender   Address  Height  Weight  Math  Physics
ID
1101     S_1    C_1       M  street_1     173      63  34.0       A+
1102     S_1    C_1       F  street_2     192      73  32.5       B+
1103     S_1    C_1       M  street_2     186      82  87.2       B+
1104     S_1    C_1       F  street_2     167      81  80.4       B-
1105     S_1    C_1       F  street_4     159      64  84.8       B+
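The restriction on boolean Series can be seen on a small hypothetical frame (`toy`, made-up values). The sketch assumes a reasonably recent pandas, where passing a boolean Series to iloc raises an exception (the exact exception type depends on the index type, so both are caught), while the underlying NumPy array is accepted. `to_numpy()` is used here as the modern spelling of `.values`.

```python
import pandas as pd

# Hypothetical toy data for illustration only
toy = pd.DataFrame({"School": ["S_1", "S_2", "S_1"]}, index=[1101, 2101, 1102])
mask = toy["School"] == "S_1"   # boolean Series carrying its own index

try:
    toy.iloc[mask]              # iloc refuses an indexed boolean mask
    raised = False
except (NotImplementedError, ValueError):
    raised = True

rows = toy.iloc[mask.to_numpy()]  # plain boolean array: purely positional
print(raised, list(rows.index))
```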
(c) The [] operator
(c.1) [] on a Series
① Single-element indexing:
s = pd.Series(df['Math'], index=df.index)
s[1101]
34.0
② Multi-row indexing:
s[0:4]
Series([], Name: Math, dtype: float64)
③ Callable (function-based) indexing:
s[lambda x: x.index[16::-6]]
ID
2102    50.6
1301    31.5
1105    84.8
Name: Math, dtype: float64
④ Boolean indexing:
s.head()
ID
1101    34.0
1102    32.5
1103    87.2
1104    80.4
1105    84.8
Name: Math, dtype: float64
s[s > 80]
ID
1103    87.2
1104    80.4
1105    84.8
1201    97.0
1302    87.7
1304    85.2
2101    83.3
2205    85.4
2304    95.5
Name: Math, dtype: float64
[Note] To stay out of trouble, avoid the [] operator when the row index is floating-point: on a Series, slicing a float index with [] compares against the label values rather than positions, which is a very special case.
s_int = pd.Series([1, 2, 3, 4], index=[1, 3, 5, 6])
s_float = pd.Series([1, 2, 3, 4], index=[1., 3., 5., 6.])
s_int
1    1
3    2
5    3
6    4
dtype: int64
s_int[2:]
5    3
6    4
dtype: int64
s_float
1.0    1
3.0    2
5.0    3
6.0    4
dtype: int64
s_float[2:]
3.0    2
5.0    3
6.0    4
dtype: int64
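One way to avoid the ambiguity is to state the intent explicitly with loc (labels) or iloc (positions). The following minimal sketch reuses the same float-indexed Series; the variable names `by_value` and `by_position` are introduced here only for illustration.

```python
import pandas as pd

s_float = pd.Series([1, 2, 3, 4], index=[1.0, 3.0, 5.0, 6.0])

by_value = s_float.loc[2:]      # label-based: keeps labels >= 2
by_position = s_float.iloc[2:]  # position-based: keeps positions 2 and 3

print(list(by_value.index))
print(list(by_position.index))
```

Written this way, the result no longer depends on how [] happens to interpret the slice for a given index type.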
(c.2) [] on a DataFrame
① Single-row indexing:
df[1:3]
      School  Class  Gender   Address  Height  Weight  Math  Physics
ID
1102     S_1    C_1       F  street_2     192      73  32.5       B+
1103     S_1    C_1       M  street_2     186      82  87.2       B+
row = df.index.get_loc(1102)
df[row:row + 1]
      School  Class  Gender   Address  Height  Weight  Math  Physics
ID
1102     S_1    C_1       F  street_2     192      73  32.5       B+
② Multi-row indexing:
df[3:5]
df[2:4]
      School  Class  Gender   Address  Height  Weight  Math  Physics
ID
1103     S_1    C_1       M  street_2     186      82  87.2       B+
1104     S_1    C_1       F  street_2     167      81  80.4       B-
③ Single-column indexing:
df['School'].head()
ID
1101    S_1
1102    S_1
1103    S_1
1104    S_1
1105    S_1
Name: School, dtype: object
④ Multi-column indexing:
df[['School', 'Math']].head()
      School  Math
ID
1101     S_1  34.0
1102     S_1  32.5
1103     S_1  87.2
1104     S_1  80.4
1105     S_1  84.8
⑤ Callable (function-based) indexing:
df[lambda x: ['Math', 'Physics']].head()
      Math  Physics
ID
1101  34.0       A+
1102  32.5       B+
1103  87.2       B+
1104  80.4       B-
1105  84.8       B+
⑥ Boolean indexing:
df[df['Gender'] == 'F'].head()
      School  Class  Gender   Address  Height  Weight  Math  Physics
ID
1102     S_1    C_1       F  street_2     192      73  32.5       B+
1104     S_1    C_1       F  street_2     167      81  80.4       B-
1105     S_1    C_1       F  street_4     159      64  84.8       B+
1202     S_1    C_2       F  street_4     176      94  63.5       B-
1204     S_1    C_2       F  street_5     162      63  33.8        B
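Summary: in general, the [] operator is mostly used for column selection or boolean selection; avoid using it to select rows.
2. Boolean indexing
(a) Boolean operators: '&', '|', '~', meaning and, or, and not respectively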
df[(df['Gender'] == 'F') & (df['Address'] == 'street_2')].head()
      School  Class  Gender   Address  Height  Weight  Math  Physics
ID
1102     S_1    C_1       F  street_2     192      73  32.5       B+
1104     S_1    C_1       F  street_2     167      81  80.4       B-
2401     S_2    C_4       F  street_2     192      62  45.3        A
2404     S_2    C_4       F  street_2     160      84  67.7        B
df[(df['Math'] > 85) | (df['Address'] == 'street_7')].head()
df[(df['Math'] > 90) | (df['Address'] == 'street_7')].head()
      School  Class  Gender   Address  Height  Weight  Math  Physics
ID
1201     S_1    C_2       M  street_5     188      68  97.0       A-
1303     S_1    C_3       M  street_7     188      82  49.7        B
2101     S_2    C_1       M  street_7     174      84  83.3        C
2202     S_2    C_2       F  street_7     194      77  68.5       B+
2205     S_2    C_2       F  street_7     183      76  85.4        B
df[~((df['Math'] > 75) | (df['Address'] == 'street_1'))].head()
      School  Class  Gender   Address  Height  Weight  Math  Physics
ID
1102     S_1    C_1       F  street_2     192      73  32.5       B+
1202     S_1    C_2       F  street_4     176      94  63.5       B-
1203     S_1    C_2       M  street_6     160      53  58.8       A+
1204     S_1    C_2       F  street_5     162      63  33.8        B
1205     S_1    C_2       F  street_6     167      63  68.4       B-
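The operators above work element-wise, and each comparison needs its own parentheses because `&`, `|` and `~` bind more tightly than `>` or `==`. The minimal sketch below, on a hypothetical frame `toy` with made-up values and no missing data, checks that negating an "or" condition matches the De Morgan form written with `&`.

```python
import pandas as pd

# Hypothetical toy data for illustration only
toy = pd.DataFrame({
    "Math": [34.0, 87.2, 60.0],
    "Address": ["street_1", "street_2", "street_7"],
})

cond = (toy["Math"] > 75) | (toy["Address"] == "street_1")

negated = ~cond
# De Morgan: not (A or B) == (not A) and (not B)
de_morgan = (toy["Math"] <= 75) & (toy["Address"] != "street_1")

print(negated.tolist())
print(negated.equals(de_morgan))
```

Python's `and`, `or` and `not` keywords cannot be used in place of these operators on a Series, since they ask for a single truth value rather than an element-wise result.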
Both loc and [] accept a boolean list in the corresponding position:
df.loc[df['Math'] > 60, (df[:8]['Address'] == 'street_6').values].head()
df.loc[df['Math'] > 60, df.columns == 'Physics'].head()
ID
1101    street_1
1102    street_2
1103    street_2
1104    street_2
1105    street_4
1201    street_5
1202    street_4
1203    street_6
Name: Address, dtype: object
df[:8]['Address'] == 'street_6'
ID
1101    False
1102    False
1103    False
1104    False
1105    False
1201    False
1202    False
1203     True
Name: Address, dtype: bool
(b) The isin method
df[df['Address'].isin(['street_1', 'street_4']) & df['Physics'].isin(['A', 'A+'])]
      School  Class  Gender   Address  Height  Weight  Math  Physics
ID
1101     S_1    C_1       M  street_1     173      63  34.0       A+
2105     S_2    C_1       M  street_4     170      81  34.2        A
2203     S_2    C_2       M  street_4     155      91  73.8       A+
df[df[['Address', 'Physics']].isin({'Address': ['street_1', 'street_4'], 'Physics': ['A', 'A+']}).all(1)]
      School  Class  Gender   Address  Height  Weight  Math  Physics
ID
1101     S_1    C_1       M  street_1     173      63  34.0       A+
2105     S_2    C_1       M  street_4     170      81  34.2        A
2203     S_2    C_2       M  street_4     155      91  73.8       A+
3. Fast scalar indexing
When only a single element is needed, the at and iat methods provide a faster implementation:
display(df.at[1101, 'School'])
display(df.loc[1101, 'School'])
display(df.iat[0, 0])
display(df.iloc[0, 0])
%timeit df.at[1101, 'School']
%timeit df.loc[1101, 'School']
%timeit df.iat[0, 0]
%timeit df.iloc[0, 0]
'S_1'
'S_1'
'S_1'
'S_1'
3.9 µs ± 796 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
6.08 µs ± 11 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
4.49 µs ± 881 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
6.73 µs ± 1.33 µs per loop (mean ± std. dev. of 7 runs, 100000 loops each)
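Outside a notebook, where `%timeit` and `display` are not available, the same accessors can be exercised in plain Python. This minimal sketch uses a hypothetical frame `toy` (made-up values) and shows that at/iat return the same scalars as loc/iloc and can also assign a single cell.

```python
import pandas as pd

# Hypothetical toy data for illustration only
toy = pd.DataFrame(
    {"School": ["S_1", "S_2"], "Math": [34.0, 83.3]},
    index=[1101, 2101],
)

label_value = toy.at[1101, "School"]   # label-based scalar access
position_value = toy.iat[1, 1]         # position-based scalar access

toy.at[2101, "Math"] = 90.0            # at also supports scalar assignment
print(label_value, position_value, toy.at[2101, "Math"])
```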
4. Interval indexing
Introducing interval indexes here does not mean they can only be used with single-level indexes; they are simply a special kind of index that is convenient to introduce at this point.
(a) Using the interval_range method
pd.interval_range(start=0, end=5)
pd.interval_range(start=0, end=5, closed='both')
IntervalIndex([[0, 1], [1, 2], [2, 3], [3, 4], [4, 5]],
              closed='both',
              dtype='interval[int64]')
pd.interval_range(start=0, periods=8, freq=5)
pd.interval_range(start=0, periods=9, freq=5, closed='both')
IntervalIndex([[0, 5], [5, 10], [10, 15], [15, 20], [20, 25], [25, 30], [30, 35], [35, 40], [40, 45]],
closed='both',
dtype='interval[int64]')
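The `closed` argument decides which endpoints belong to each interval. As a minimal sketch (the variable names `right` and `both` are introduced here for illustration), the default is `closed='right'`, so the left endpoint is excluded unless `closed='both'` or `closed='left'` is requested.

```python
import pandas as pd

right = pd.interval_range(start=0, end=3)                 # default: closed='right'
both = pd.interval_range(start=0, end=3, closed="both")   # both endpoints included

# The second interval runs from 1 to 2 in both cases;
# only the treatment of the endpoint 1 differs.
print(right.closed, 1 in right[1])
print(both.closed, 1 in both[1])
```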
(b) Using cut to turn a numeric column into a categorical variable whose elements are intervals, for example to bin the Math scores:
math_interval = pd.cut(df['Math'], bins=[0, 40, 60, 80, 100])
math_interval.head()
math_interval = pd.cut(df['Math'], bins=[0, 40, 60, 80, 100]).astype('interval')
math_interval.head()
ID
1101      (0, 40]
1102      (0, 40]
1103    (80, 100]
1104    (80, 100]
1105    (80, 100]
Name: Math, dtype: interval
(c) Selecting with an interval index
df_i = df.join(math_interval, rsuffix='_interval')[['Math', 'Math_interval']]\
         .reset_index().set_index('Math_interval')
df_i.head()
                 ID  Math
Math_interval
(0, 40]        1101  34.0
(0, 40]        1102  32.5
(80, 100]      1103  87.2
(80, 100]      1104  80.4
(80, 100]      1105  84.8
df_i.loc[65].head()
df_i.loc[80].head()
                 ID  Math
Math_interval
(60, 80]       1202  63.5
(60, 80]       1205  68.4
(60, 80]       1305  61.7
(60, 80]       2104  72.2
(60, 80]       2202  68.5
df_i.loc[[65, 90]].head()
                 ID  Math
Math_interval
(60, 80]       1202  63.5
(60, 80]       1205  68.4
(60, 80]       1305  61.7
(60, 80]       2104  72.2
(60, 80]       2202  68.5
To select by an interval rather than a point, first convert the categorical variable to an interval variable, then use the overlaps method:
display(df_i[df_i.index.astype('interval').overlaps(pd.Interval(70, 85))].head())
display(df_i.index.astype('interval').overlaps(pd.Interval(70, 85)))
                 ID  Math
Math_interval
(80, 100]      1103  87.2
(80, 100]      1104  80.4
(80, 100]      1105  84.8
(80, 100]      1201  97.0
(60, 80]       1202  63.5
array([False, False, True, True, True, True, True, False, False,
True, False, True, False, True, True, True, False, False,
True, False, False, True, True, False, True, True, False,
True, True, False, False, False, False, True, False])
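The two kinds of lookup, by a point and by an overlapping interval, can be reproduced on a small hypothetical frame. This is a sketch under stated assumptions: the index is built directly with `pd.IntervalIndex.from_tuples` (right-closed by default) instead of going through `pd.cut`, and the values are made up.

```python
import pandas as pd

# Hypothetical interval-indexed data; (60, 80] appears twice on purpose
interval_index = pd.IntervalIndex.from_tuples([(0, 40), (60, 80), (80, 100), (60, 80)])
toy = pd.DataFrame({"Math": [34.0, 63.5, 87.2, 80.0]}, index=interval_index)

hit = toy.loc[65]                                   # rows whose interval contains 65
mask = toy.index.overlaps(pd.Interval(70, 85))      # intervals sharing points with (70, 85]

print(hit["Math"].tolist())
print(list(mask))
```

A scalar passed to loc matches every interval that contains it, while `overlaps` returns a boolean array that can be used as an ordinary mask.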
II. Multi-Level Indexing
1. Creating a multi-level index
(a) With from_tuples or from_arrays
① Creating the tuples directly
tuples = [('A', 'a'), ('A', 'b'), ('B', 'a'), ('B', 'b')]
mul_index = pd.MultiIndex.from_tuples(tuples, names=('Upper', 'Lower'))
mul_index
tuples = [('A', 'a'), ('A', 'b'), ('B', 'a'), ('B', 'b')]
mul_index = pd.MultiIndex.from_tuples(tuples, names=('Upper', 'lower'))
mul_index
MultiIndex([('A', 'a'),
            ('A', 'b'),
            ('B', 'a'),
            ('B', 'b')],
           names=['Upper', 'lower'])
pd.DataFrame({'Score': ['perfect', 'good', 'fair', 'bad']}, index=mul_index)
               Score
Upper lower
A     a      perfect
      b         good
B     a         fair
      b          bad
② Creating the tuples with zip
L1 = list('AABB')
L2 = list('abab')
tuples = list(zip(L1, L2))
mul_index = pd.MultiIndex.from_tuples(tuples, names=('Upper', 'Lower'))
pd.DataFrame({'Score': ['perfect', 'good', 'fair', 'bad']}, index=mul_index)
L1 = list('AABB')
L2 = list('aabb')
tuples = list(zip(L1, L2))
mul_index = pd.MultiIndex.from_tuples(tuples, names=('Upper', 'Lower'))
pd.DataFrame({'Score': ['perfect', 'good', 'fair', 'bad']}, index=mul_index)
               Score
Upper Lower
A     a      perfect
      a         good
B     b         fair
      b          bad
③ Creating from an array
arrays = [['A', 'a'], ['A', 'b'], ['B', 'a'], ['B', 'b']]
mul_index = pd.MultiIndex.from_tuples(arrays, names=('Upper', 'Lower'))
pd.DataFrame({'Score': ['perfect', 'good', 'fair', 'bad']}, index=mul_index)
               Score
Upper Lower
A     a      perfect
      b         good
B     a         fair
      b          bad
mul_index
MultiIndex([('A', 'a'),
            ('A', 'b'),
            ('B', 'a'),
            ('B', 'b')],
           names=['Upper', 'Lower'])
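The cell above passes a list of [upper, lower] pairs to from_tuples. For comparison, `pd.MultiIndex.from_arrays` expects one array per level instead. The following minimal sketch (variable names `level_arrays`, `from_arrays_index` and `from_tuples_index` are introduced here for illustration) shows that both routes produce the same index.

```python
import pandas as pd

# One list per level: first all upper labels, then all lower labels
level_arrays = [["A", "A", "B", "B"], ["a", "b", "a", "b"]]
from_arrays_index = pd.MultiIndex.from_arrays(level_arrays, names=("Upper", "Lower"))

# One tuple per row, as in the tutorial's cells
from_tuples_index = pd.MultiIndex.from_tuples(
    [("A", "a"), ("A", "b"), ("B", "a"), ("B", "b")], names=("Upper", "Lower")
)

print(from_arrays_index.equals(from_tuples_index))
```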
(b) With from_product
L1 = ['A', 'B']
L2 = ['a', 'b']
pd.MultiIndex.from_product([L1, L2], names=('Upper', 'Lower'))
L1 = ['A', 'B']
L2 = ['a', 'b']
pd.MultiIndex.from_product([L1, L2], names=('Upper', 'lower'))
MultiIndex([('A', 'a'),
            ('A', 'b'),
            ('B', 'a'),
            ('B', 'b')],
           names=['Upper', 'lower'])
(c) From existing columns of df (the set_index method)
df_using_mul = df.set_index(['Class', 'Address'])
df_using_mul.head()
                School  Gender  Height  Weight  Math  Physics
Class Address
C_1   street_1     S_1       M     173      63  34.0       A+
      street_2     S_1       F     192      73  32.5       B+
      street_2     S_1       M     186      82  87.2       B+
      street_2     S_1       F     167      81  80.4       B-
      street_4     S_1       F     159      64  84.8       B+
2. Slicing a multi-level index
df_using_mul.head()
                School  Gender  Height  Weight  Math  Physics
Class Address
C_1   street_1     S_1       M     173      63  34.0       A+
      street_2     S_1       F     192      73  32.5       B+
      street_2     S_1       M     186      82  87.2       B+
      street_2     S_1       F     167      81  80.4       B-
      street_4     S_1       F     159      64  84.8       B+
(a) General slicing
display(df_using_mul.loc['C_2'])
display(df_using_mul.index.is_lexsorted())
df_using_mul.index.is_lexsorted()
display(df_using_mul.sort_index().loc['C_2'])
display(df_using_mul.sort_index().index.is_lexsorted())
          School  Gender  Height  Weight  Math  Physics
Address
street_5     S_1       M     188      68  97.0       A-
street_4     S_1       F     176      94  63.5       B-
street_6     S_1       M     160      53  58.8       A+
street_5     S_1       F     162      63  33.8        B
street_6     S_1       F     167      63  68.4       B-
street_5     S_2       M     193     100  39.1        B
street_7     S_2       F     194      77  68.5       B+
street_4     S_2       M     155      91  73.8       A+
street_1     S_2       M     175      74  47.2       B-
street_7     S_2       F     183      76  85.4        B
False
          School  Gender  Height  Weight  Math  Physics
Address
street_1     S_2       M     175      74  47.2       B-
street_4     S_1       F     176      94  63.5       B-
street_4     S_2       M     155      91  73.8       A+
street_5     S_1       M     188      68  97.0       A-
street_5     S_1       F     162      63  33.8        B
street_5     S_2       M     193     100  39.1        B
street_6     S_1       M     160      53  58.8       A+
street_6     S_1       F     167      63  68.4       B-
street_7     S_2       F     194      77  68.5       B+
street_7     S_2       F     183      76  85.4        B
True
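The sortedness check matters because slicing a MultiIndex by tuples relies on the index being sorted. The sketch below uses a hypothetical three-row frame `toy` and checks sortedness with `is_monotonic_increasing`; this is an assumption about the reader's environment, chosen because `is_lexsorted()` has been deprecated in more recent pandas releases and may not be available there.

```python
import pandas as pd

# Hypothetical toy data, deliberately out of order
index = pd.MultiIndex.from_tuples(
    [("C_2", "street_5"), ("C_1", "street_1"), ("C_2", "street_4")],
    names=("Class", "Address"),
)
toy = pd.DataFrame({"Math": [97.0, 34.0, 63.5]}, index=index)

unsorted_ok = toy.index.is_monotonic_increasing         # sortedness before sorting
sorted_toy = toy.sort_index()
sorted_ok = sorted_toy.index.is_monotonic_increasing    # sortedness after sorting

# Tuple-to-tuple slicing works on the sorted frame and includes both endpoints
window = sorted_toy.loc[("C_1", "street_1"):("C_2", "street_4")]
print(unsorted_ok, sorted_ok, window["Math"].tolist())
```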
display(df_using_mul.sort_index().loc['C_2'])
df_using_mul.sort_index().loc['C_2', 'street_5']
          School  Gender  Height  Weight  Math  Physics
Address
street_1     S_2       M     175      74  47.2       B-
street_4     S_1       F     176      94  63.5       B-
street_4     S_2       M     155      91  73.8       A+
street_5     S_1       M     188      68  97.0       A-
street_5     S_1       F     162      63  33.8        B
street_5     S_2       M     193     100  39.1        B
street_6     S_1       M     160      53  58.8       A+
street_6     S_1       F     167      63  68.4       B-
street_7     S_2       F     194      77  68.5       B+
street_7     S_2       F     183      76  85.4        B
                School  Gender  Height  Weight  Math  Physics
Class Address
C_2   street_5     S_1       M     188      68  97.0       A-
      street_5     S_1       F     162      63  33.8        B
      street_5     S_2       M     193     100  39.1        B
df_using_mul.sort_index().loc[('C_2', 'street_6'):('C_3', 'street_4')]
                School  Gender  Height  Weight  Math  Physics
Class Address
C_2   street_6     S_1       M     160      53  58.8       A+
      street_6     S_1       F     167      63  68.4       B-
      street_7     S_2       F     194      77  68.5       B+
      street_7     S_2       F     183      76  85.4        B
C_3   street_1     S_1       F     175      57  87.7       A-
      street_2     S_1       M     195      70  85.2        A
      street_4     S_1       M     161      68  31.5       B+
      street_4     S_2       F     157      78  72.3       B+
      street_4     S_2       M     187      73  48.9        B
df_using_mul.sort_index().loc[('C_2', 'street_7'):'C_3'].head()
Class  Address
C_2    street_7    S_2
       street_7    S_2
C_3    street_1    S_1
       street_2    S_1
       street_4    S_1
       street_4    S_2
       street_4    S_2
       street_5    S_1
       street_5    S_2
       street_6    S_2
       street_7    S_1
       street_7    S_2
Name: School, dtype: object
(b) First special case: a list of tuples
df_using_mul.sort_index().loc[[('C_2', 'street_7'), ('C_3', 'street_2')]]
                School  Gender  Height  Weight  Math  Physics
Class Address
C_2   street_7     S_2       F     194      77  68.5       B+
      street_7     S_2       F     183      76  85.4        B
C_3   street_2     S_1       M     195      70  85.2        A
(c) Second special case: a tuple of lists
df_using_mul.sort_index().loc[(['C_2', 'C_3'], ['street_4', 'street_7']), :]
                School  Gender  Height  Weight  Math  Physics
Class Address
C_2   street_4     S_1       F     176      94  63.5       B-
      street_4     S_2       M     155      91  73.8       A+
      street_7     S_2       F     194      77  68.5       B+
      street_7     S_2       F     183      76  85.4        B
C_3   street_4     S_1       M     161      68  31.5       B+
      street_4     S_2       F     157      78  72.3       B+
      street_4     S_2       M     187      73  48.9        B
      street_7     S_1       M     188      82  49.7        B
      street_7     S_2       F     190      99  65.9        C
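The difference between the two special cases is easiest to see on an index without duplicates. In this minimal sketch (hypothetical frame `toy`, built with from_product), a list of tuples picks exactly the listed (Class, Address) pairs, while a tuple of lists picks every combination of the listed values on each level.

```python
import pandas as pd

# Hypothetical toy data: 3 classes x 3 streets, values 0..8
index = pd.MultiIndex.from_product(
    [["C_1", "C_2", "C_3"], ["street_2", "street_4", "street_7"]],
    names=("Class", "Address"),
)
toy = pd.DataFrame({"v": range(9)}, index=index)

# List of tuples: exactly these two index entries
exact = toy.loc[[("C_2", "street_7"), ("C_3", "street_4")]]

# Tuple of lists: the cross product of the two lists (2 x 2 = 4 entries)
cross = toy.loc[(["C_2", "C_3"], ["street_4", "street_7"]), :]

print(exact["v"].tolist())
print(cross["v"].tolist())
```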
3. slice objects in multi-level indexes
np.random.rand(9, 9)
array([[0.00157292, 0.3521813 , 0.57414635, 0.7402888 , 0.04642124,
0.66619584, 0.92281977, 0.03996369, 0.65053856],
[0.08477172, 0.88126615, 0.99788337, 0.7568739 , 0.48090064,
0.80362277, 0.3358786 , 0.37949268, 0.01001926],
[0.59348037, 0.69988113, 0.17082218, 0.99141101, 0.12301197,
0.54402624, 0.62715081, 0.31135379, 0.21957649],
[0.66512945, 0.9106095 , 0.25563962, 0.07774486, 0.88233467,
0.96112214, 0.94485486, 0.57277522, 0.54257246],
[0.62443182, 0.20058009, 0.01188025, 0.22773041, 0.00204232,
0.67665924, 0.32854489, 0.35794382, 0.6500615 ],
[0.68680186, 0.97791972, 0.21070428, 0.40517635, 0.96220453,
0.58241295, 0.77434508, 0.01228948, 0.02735097],
[0.75966199, 0.79494387, 0.45020853, 0.93519435, 0.26549524,
0.35259274, 0.04451145, 0.65311556, 0.65076007],
[0.11867561, 0.78091157, 0.82840483, 0.36694628, 0.27188274,
0.86574978, 0.49296668, 0.6349486 , 0.10321662],
[0.31004969, 0.9974477 , 0.72932672, 0.84089302, 0.24069463,
0.88026845, 0.65472614, 0.90057078, 0.94011198]])
L1, L2 = ['A', 'B', 'C'], ['a', 'b', 'c']
mul_index1 = pd.MultiIndex.from_product([L1, L2], names=('Upper', 'Lower'))
L3, L4 = ['D', 'E', 'F'], ['d', 'e', 'f']
mul_index2 = pd.MultiIndex.from_product([L3, L4], names=('Big', 'Small'))
df_s = pd.DataFrame(np.random.rand(9, 9), index=mul_index1, columns=mul_index2)
df_s
Big                 D                             E                             F
Small               d         e         f         d         e         f         d         e         f
Upper Lower
A     a      0.727690  0.349177  0.462134  0.756151  0.595001  0.854182  0.564885  0.032514  0.584503
      b      0.456652  0.064605  0.162614  0.943543  0.358184  0.320619  0.255107  0.848212  0.608838
      c      0.527055  0.771120  0.354109  0.020767  0.305929  0.062331  0.315193  0.953083  0.354281
B     a      0.810611  0.222145  0.167663  0.398043  0.533769  0.798719  0.942496  0.895687  0.200293
      b      0.914262  0.014343  0.725297  0.145157  0.661077  0.998086  0.986214  0.350872  0.799577
      c      0.657973  0.002625  0.593380  0.782111  0.023865  0.848250  0.163945  0.969662  0.919264
C     a      0.378590  0.826283  0.422204  0.554652  0.843198  0.362767  0.726439  0.425965  0.914429
      b      0.413699  0.189651  0.134559  0.165747  0.348509  0.922714  0.973793  0.348467  0.743193
      c      0.552581  0.273073  0.620954  0.650415  0.235731  0.418648  0.170382  0.526475  0.677653
idx = pd.IndexSlice
IndexSlice is essentially a wrapper around multiple slice objects
idx[1:9:2, 'A':'C', 'start':'end':2]
(slice(1, 9, 2), slice('A', 'C', None), slice('start', 'end', 2))
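The wrapping can be verified directly: indexing `pd.IndexSlice` with slice syntax returns ordinary Python slice objects (or a tuple of them), which compare equal to slices built with the `slice` built-in. A minimal sketch, with the variable names `single` and `several` introduced here for illustration:

```python
import pandas as pd

idx = pd.IndexSlice

single = idx["A":"C"]            # one slice object
several = idx[1:9:2, "A":"C"]    # a tuple of slice objects

print(single == slice("A", "C", None))
print(several == (slice(1, 9, 2), slice("A", "C", None)))
```

This is why an idx expression can be dropped into loc wherever a slice or a tuple of slices would otherwise have to be spelled out by hand.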
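An IndexSlice can be combined with loc to perform slicing; there are two main usages
(a) The loc[idx[*,*]] form
The first asterisk stands for rows and the second for columns; when a boolean indexer is used, its index must align with the axis it filters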
df_s.loc[idx['B':, df_s.iloc[0] > 0.6]]
Big                 D         E
Small               d         d         f
Upper Lower
B     a      0.810611  0.398043  0.798719
      b      0.914262  0.145157  0.998086
      c      0.657973  0.782111  0.848250
C     a      0.378590  0.554652  0.362767
      b      0.413699  0.165747  0.922714
      c      0.552581  0.650415  0.418648
df_s.loc[idx[df_s.iloc[:, 0] > 0.6, :('E', 'f')]]
Big                 D                             E
Small               d         e         f         d         e         f
Upper Lower
A     a      0.727690  0.349177  0.462134  0.756151  0.595001  0.854182
B     a      0.810611  0.222145  0.167663  0.398043  0.533769  0.798719
      b      0.914262  0.014343  0.725297  0.145157  0.661077  0.998086
      c      0.657973  0.002625  0.593380  0.782111  0.023865  0.848250
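(b) The loc[idx[*,*],idx[*,*]] form
The difference from (a) is that in (a) loc receives a single idx with no comma after it, whereas in (b) two indexers are separated by a comma: the first idx indexes the rows and the second indexes the columns
This usage is very flexible, so several examples follow to aid understanding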
df_s.loc[idx['A'], idx['D':]]
Big           D                             E                             F
Small         d         e         f         d         e         f         d         e         f
Lower
a      0.727690  0.349177  0.462134  0.756151  0.595001  0.854182  0.564885  0.032514  0.584503
b      0.456652  0.064605  0.162614  0.943543  0.358184  0.320619  0.255107  0.848212  0.608838
c      0.527055  0.771120  0.354109  0.020767  0.305929  0.062331  0.315193  0.953083  0.354281
df_s.loc[idx[:'B', 'b':], :]
Big                 D                             E                             F
Small               d         e         f         d         e         f         d         e         f
Upper Lower
A     b      0.456652  0.064605  0.162614  0.943543  0.358184  0.320619  0.255107  0.848212  0.608838
      c      0.527055  0.771120  0.354109  0.020767  0.305929  0.062331  0.315193  0.953083  0.354281
B     b      0.914262  0.014343  0.725297  0.145157  0.661077  0.998086  0.986214  0.350872  0.799577
      c      0.657973  0.002625  0.593380  0.782111  0.023865  0.848250  0.163945  0.969662  0.919264
df_s.iloc[:, 0] > 0.6
Upper  Lower
A      a         True
       b        False
       c        False
B      a         True
       b         True
       c         True
C      a        False
       b        False
       c        False
Name: (D, d), dtype: bool
df_s.loc[idx[:'B', df_s.iloc[:, 0] > 0.6], :]
Big                 D                             E                             F
Small               d         e         f         d         e         f         d         e         f
Upper Lower
A     a      0.727690  0.349177  0.462134  0.756151  0.595001  0.854182  0.564885  0.032514  0.584503
B     a      0.810611  0.222145  0.167663  0.398043  0.533769  0.798719  0.942496  0.895687  0.200293
      b      0.914262  0.014343  0.725297  0.145157  0.661077  0.998086  0.986214  0.350872  0.799577
      c      0.657973  0.002625  0.593380  0.782111  0.023865  0.848250  0.163945  0.969662  0.919264
L1, L2 = ['A', 'B'], ['a', 'b', 'c']
mul_index1 = pd.MultiIndex.from_product([L1, L2], names=('Upper', 'Lower'))
L3, L4 = ['D', 'E', 'F'], ['d', 'e', 'f']
mul_index2 = pd.MultiIndex.from_product([L3, L4], names=('Big', 'Small'))
df_s = pd.DataFrame(np.random.rand(6, 9), index=mul_index1, columns=mul_index2)
df_s
Big                 D                             E                             F
Small               d         e         f         d         e         f         d         e         f
Upper Lower
A     a      0.582610  0.641514  0.560705  0.910591  0.599434  0.208767  0.515083  0.884898  0.403068
      b      0.294533  0.924421  0.124904  0.880993  0.002462  0.785282  0.519103  0.719144  0.867035
      c      0.904616  0.315742  0.313072  0.376997  0.474177  0.317675  0.591629  0.857103  0.345019
B     a      0.372711  0.389083  0.756115  0.504690  0.380259  0.743078  0.235606  0.477790  0.864240
      b      0.381907  0.088245  0.858773  0.801386  0.140712  0.363459  0.582477  0.592419  0.935077
      c      0.764331  0.021464  0.543272  0.819539  0.334704  0.771924  0.766925  0.054824  0.016175
df_s.loc[idx[:'B', (df_s.iloc[0] > 0.6)[:6]], :]
Big                 D                             E                             F
Small               d         e         f         d         e         f         d         e         f
Upper Lower
A     b      0.294533  0.924421  0.124904  0.880993  0.002462  0.785282  0.519103  0.719144  0.867035
B     a      0.372711  0.389083  0.756115  0.504690  0.380259  0.743078  0.235606  0.477790  0.864240
df_s.loc[idx[:'B', 'c':, (df_s.iloc[:, 0] > 0.6)], :]
Big                 D                             E                             F
Small               d         e         f         d         e         f         d         e         f
Upper Lower
A     c      0.904616  0.315742  0.313072  0.376997  0.474177  0.317675  0.591629  0.857103  0.345019
B     c      0.764331  0.021464  0.543272  0.819539  0.334704  0.771924  0.766925  0.054824  0.016175
df_s.loc[idx[:'B', (df_s.iloc[:, 0] > 0.6), (df_s.iloc[:, 0] > 0.6)], :]
Big                 D                             E                             F
Small               d         e         f         d         e         f         d         e         f
Upper Lower
A     c      0.904616  0.315742  0.313072  0.376997  0.474177  0.317675  0.591629  0.857103  0.345019
B     c      0.764331  0.021464  0.543272  0.819539  0.334704  0.771924  0.766925  0.054824  0.016175
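IndexSlice is very flexible to use:
Note: when loc contains a comma between two indexers, the idx indexers need not be index-aligned; when loc contains no such comma, the idx indexers must be index-aligned.
With the comma, the first idx refers to rows and the second idx refers to columns.
Without the comma, the first entry inside idx refers to rows and the second to columns, and the indexes must align.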
display(df_s.loc[idx['B':, df_s['D']['d'] > 0.3], idx[df_s.sum() > 4]])
display(df_s['D']['d'] > 0.3)
display(df_s.sum() > 4)
a = df_s['D']['d'] > 0.3
df_s.loc[idx['B':, df_s.sum() > 4]]
Big                 D         E                   F
Small               d         d         f         d         e         f
Upper Lower
B     a      0.810611  0.398043  0.798719  0.942496  0.895687  0.200293
      b      0.914262  0.145157  0.998086  0.986214  0.350872  0.799577
      c      0.657973  0.782111  0.848250  0.163945  0.969662  0.919264
C     a      0.378590  0.554652  0.362767  0.726439  0.425965  0.914429
      b      0.413699  0.165747  0.922714  0.973793  0.348467  0.743193
      c      0.552581  0.650415  0.418648  0.170382  0.526475  0.677653
Upper  Lower
A      a        True
       b        True
       c        True
B      a        True
       b        True
       c        True
C      a        True
       b        True
       c        True
Name: d, dtype: bool
Big  Small
D    d         True
     e        False
     f        False
E    d         True
     e        False
     f         True
F    d         True
     e         True
     f         True
dtype: bool
Big                 D         E                   F
Small               d         d         f         d         e         f
Upper Lower
B     a      0.810611  0.398043  0.798719  0.942496  0.895687  0.200293
      b      0.914262  0.145157  0.998086  0.986214  0.350872  0.799577
      c      0.657973  0.782111  0.848250  0.163945  0.969662  0.919264
C     a      0.378590  0.554652  0.362767  0.726439  0.425965  0.914429
      b      0.413699  0.165747  0.922714  0.973793  0.348467  0.743193
      c      0.552581  0.650415  0.418648  0.170382  0.526475  0.677653
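Because the tables above are random, a deterministic version may be easier to follow. This is a minimal sketch on a hypothetical 4x4 frame `toy` filled with 0..15; it shows one example of each form, with `form_a` using a single idx (rows first, then a boolean mask aligned with the columns) and `form_b` using one idx per axis.

```python
import numpy as np
import pandas as pd

idx = pd.IndexSlice

# Hypothetical toy frame with two-level row and column indexes
rows = pd.MultiIndex.from_product([["A", "B"], ["a", "b"]], names=("Upper", "Lower"))
cols = pd.MultiIndex.from_product([["D", "E"], ["d", "e"]], names=("Big", "Small"))
toy = pd.DataFrame(np.arange(16).reshape(4, 4), index=rows, columns=cols)

# Form (a): one idx; rows from 'B' onward, columns where the first row exceeds 1
form_a = toy.loc[idx["B":, toy.iloc[0] > 1]]

# Form (b): two idx objects; rows with Upper == 'B', columns with Small == 'e'
form_b = toy.loc[idx["B", :], idx[:, "e"]]

print(form_a.to_numpy().tolist())
print(form_b.to_numpy().tolist())
```

In form (a) the boolean Series is indexed by the column MultiIndex, which is what lets pandas align it with the columns; in form (b) each entry of an idx addresses one level of its axis.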
4. Swapping index levels
(a) The swaplevel method (swapping two levels)
df_using_mul.head()
                School  Gender  Height  Weight  Math  Physics
Class Address
C_1   street_1     S_1       M     173      63  34.0       A+
      street_2     S_1       F     192      73  32.5       B+
      street_2     S_1       M     186      82  87.2       B+
      street_2     S_1       F     167      81  80.4       B-
      street_4     S_1       F     159      64  84.8       B+
df_using_mul.swaplevel(i=1, j=0, axis=0).sort_index().head()
                School  Gender  Height  Weight  Math  Physics
Address  Class
street_1 C_1       S_1       M     173      63  34.0       A+
         C_2       S_2       M     175      74  47.2       B-
         C_3       S_1       F     175      57  87.7       A-
street_2 C_1       S_1       F     192      73  32.5       B+
         C_1       S_1       M     186      82  87.2       B+
(b) The reorder_levels method (reordering several levels)
df_muls = df.set_index(['School', 'Class', 'Address'])
df_muls.head()
df_mul = df.set_index(['Physics', 'School', 'Class'])
df_mul.head()
                      Gender   Address  Height  Weight  Math
Physics School Class
A+      S_1    C_1         M  street_1     173      63  34.0
B+      S_1    C_1         F  street_2     192      73  32.5
               C_1         M  street_2     186      82  87.2
B-      S_1    C_1         F  street_2     167      81  80.4
B+      S_1    C_1         F  street_4     159      64  84.8
df_muls.reorder_levels([2, 0, 1], axis=0).sort_index().head()
df_mul.reorder_levels([1, 2, 0], axis=0).sort_index().head()
                      Gender   Address  Height  Weight  Math
School Class Physics
S_1    C_1   A+            M  street_1     173      63  34.0
             B+            F  street_2     192      73  32.5
             B+            M  street_2     186      82  87.2
             B+            F  street_4     159      64  84.8
             B-            F  street_2     167      81  80.4
df_muls.reorder_levels(['Address', 'School', 'Class'], axis=0).sort_index().head()
df_mul.reorder_levels(['School', 'Class', 'Physics'], axis=0).sort_index().head()
                      Gender   Address  Height  Weight  Math
School Class Physics
S_1    C_1   A+            M  street_1     173      63  34.0
             B+            F  street_2     192      73  32.5
             B+            M  street_2     186      82  87.2
             B+            F  street_4     159      64  84.8
             B-            F  street_2     167      81  80.4
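Both methods only rearrange the levels of the index and leave the data untouched. The sketch below, on a hypothetical two-row frame `toy`, checks the resulting level order by name: swaplevel exchanges two levels, while reorder_levels takes the full new order.

```python
import pandas as pd

# Hypothetical toy data with a three-level row index
index = pd.MultiIndex.from_tuples(
    [("S_1", "C_1", "A+"), ("S_2", "C_1", "B")],
    names=("School", "Class", "Physics"),
)
toy = pd.DataFrame({"Math": [34.0, 83.3]}, index=index)

swapped = toy.swaplevel("School", "Physics")                   # exchange two levels
reordered = toy.reorder_levels(["Class", "Physics", "School"])  # give the complete order

print(list(swapped.index.names))
print(list(reordered.index.names))
```

Neither call sorts the rows, which is why the tutorial's cells chain `.sort_index()` afterwards.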
III. Setting the Index
1. The index_col parameter
index_col is a parameter of read_csv, not a method:
pd.read_csv('data/table.csv', index_col=['Address', 'School']).head()
                 Class    ID  Gender  Height  Weight  Math  Physics
Address  School
street_1 S_1       C_1  1101       M     173      63  34.0       A+
street_2 S_1       C_1  1102       F     192      73  32.5       B+
         S_1       C_1  1103       M     186      82  87.2       B+
         S_1       C_1  1104       F     167      81  80.4       B-
street_4 S_1       C_1  1105       F     159      64  84.8       B+
2. reindex and reindex_like
reindex means re-indexing. Its key feature is index alignment, and it is often used to reorder rows or columns
df
df.loc[1105:2402]
      School  Class  Gender   Address  Height  Weight  Math  Physics
ID
1105     S_1    C_1       F  street_4     159      64  84.8       B+
1201     S_1    C_2       M  street_5     188      68  97.0       A-
1202     S_1    C_2       F  street_4     176      94  63.5       B-
1203     S_1    C_2       M  street_6     160      53  58.8       A+
1204     S_1    C_2       F  street_5     162      63  33.8        B
1205     S_1    C_2       F  street_6     167      63  68.4       B-
1301     S_1    C_3       M  street_4     161      68  31.5       B+
1302     S_1    C_3       F  street_1     175      57  87.7       A-
1303     S_1    C_3       M  street_7     188      82  49.7        B
1304     S_1    C_3       M  street_2     195      70  85.2        A
1305     S_1    C_3       F  street_5     187      69  61.7       B-
2101     S_2    C_1       M  street_7     174      84  83.3        C
2102     S_2    C_1       F  street_6     161      61  50.6       B+
2103     S_2    C_1       M  street_4     157      61  52.5       B-
2104     S_2    C_1       F  street_5     159      97  72.2       B+
2105     S_2    C_1       M  street_4     170      81  34.2        A
2201     S_2    C_2       M  street_5     193     100  39.1        B
2202     S_2    C_2       F  street_7     194      77  68.5       B+
2203     S_2    C_2       M  street_4     155      91  73.8       A+
2204     S_2    C_2       M  street_1     175      74  47.2       B-
2205     S_2    C_2       F  street_7     183      76  85.4        B
2301     S_2    C_3       F  street_4     157      78  72.3       B+
2302     S_2    C_3       M  street_5     171      88  32.7        A
2303     S_2    C_3       F  street_7     190      99  65.9        C
2304     S_2    C_3       F  street_6     164      81  95.5       A-
2305     S_2    C_3       M  street_4     187      73  48.9        B
2401     S_2    C_4       F  street_2     192      62  45.3        A
2402     S_2    C_4       M  street_7     166      82  48.7        B
df.reindex(index=[1101, 1203, 1206, 2402])
df.reindex(index=[1101, 1203, 1205, 2402])
      School  Class  Gender   Address  Height  Weight  Math  Physics
ID
1101     S_1    C_1       M  street_1     173      63  34.0       A+
1203     S_1    C_2       M  street_6     160      53  58.8       A+
1205     S_1    C_2       F  street_6     167      63  68.4       B-
2402     S_2    C_4       M  street_7     166      82  48.7        B
df.reindex(columns=['Height', 'Gender', 'Average']).head()
      Height  Gender  Average
ID
1101     173       M      NaN
1102     192       F      NaN
1103     186       M      NaN
1104     167       F      NaN
1105     159       F      NaN
A fill strategy for missing values can be chosen with fill_value or method (bfill/ffill/nearest); the method parameter requires the index to be monotonic
a = df.reindex(index=[1101, 1203, 1206, 2402], method='bfill')
display(a.index.is_monotonic)
display(a.sort_index().index.is_monotonic)
a
True
True
      School  Class  Gender   Address  Height  Weight  Math  Physics
ID
1101     S_1    C_1       M  street_1     173      63  34.0       A+
1203     S_1    C_2       M  street_6     160      53  58.8       A+
1206     S_1    C_3       M  street_4     161      68  31.5       B+
2402     S_2    C_4       M  street_7     166      82  48.7        B
df.reindex(index=[1101, 1203, 1206, 2402], method='nearest')
      School  Class  Gender   Address  Height  Weight  Math  Physics
ID
1101     S_1    C_1       M  street_1     173      63  34.0       A+
1203     S_1    C_2       M  street_6     160      53  58.8       A+
1206     S_1    C_2       F  street_6     167      63  68.4       B-
2402     S_2    C_4       M  street_7     166      82  48.7        B
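To compare the filling options side by side, here is a small sketch on a made-up Series with a monotonic index (the labels and values are assumptions chosen so that no label is equidistant from two neighbours):

```python
import pandas as pd

# Hypothetical Series with a monotonically increasing index.
s = pd.Series([10, 20, 30], index=[1, 5, 9])
new_labels = [1, 4, 6, 9]  # 4 and 6 are not present in s

filled_const = s.reindex(new_labels, fill_value=0)     # missing labels get the constant 0
filled_ffill = s.reindex(new_labels, method='ffill')   # take the value of the previous label
filled_bfill = s.reindex(new_labels, method='bfill')   # take the value of the next label
filled_near = s.reindex(new_labels, method='nearest')  # take the value of the closest label

print(filled_const.tolist())  # [10, 0, 0, 30]
print(filled_ffill.tolist())  # [10, 10, 20, 30]
print(filled_bfill.tolist())  # [10, 20, 30, 30]
print(filled_near.tolist())   # [10, 20, 20, 30]
```

fill_value only supplies a constant, so it does not need a monotonic index; the three method options look at neighbouring labels, which is why the ordering requirement applies to them.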
reindex_like returns a DataFrame whose row and column indexes exactly match those of the object passed as the argument, while the data come from the calling table:
df_temp1 = pd.DataFrame({'Weight': np.zeros(5),
                         'Height': np.zeros(5),
                         'ID': [1101, 1104, 1103, 1106, 1102]}).set_index('ID')
display(df_temp1)
display(df[0:5][['Weight', 'Height']])
df_temp1.reindex_like(df[0:5][['Weight', 'Height']])
      Weight  Height
ID
1101     0.0     0.0
1104     0.0     0.0
1103     0.0     0.0
1106     0.0     0.0
1102     0.0     0.0
      Weight  Height
ID
1101      63     173
1102      73     192
1103      82     186
1104      81     167
1105      64     159
      Weight  Height
ID
1101     0.0     0.0
1102     0.0     0.0
1103     0.0     0.0
1104     0.0     0.0
1105     NaN     NaN
If the index of df_temp is monotonic, the method parameter can also be used:
df_temp = pd.DataFrame({'Weight': range(5),
                        'Height': range(5),
                        'ID': [1101, 1104, 1103, 1106, 1102]}).set_index('ID').sort_index()
df_temp.reindex_like(df[0:5][['Weight', 'Height']], method='bfill')
      Weight  Height
ID
1101       0       0
1102       4       4
1103       2       2
1104       1       1
1105       3       3
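One way to read reindex_like is as a shorthand for reindex with both the index and the columns taken from another object. The sketch below checks that reading on two made-up frames (`template` and `source` are hypothetical names, and the numbers are illustrative):

```python
import pandas as pd

# The frame whose shape (row labels and column order) we want to copy.
template = pd.DataFrame({'Weight': [63, 73], 'Height': [173, 192]},
                        index=pd.Index([1101, 1102], name='ID'))

# The frame that supplies the data; it has an extra column and different rows.
source = pd.DataFrame({'Height': [1.0, 2.0], 'Weight': [3.0, 4.0], 'Math': [5.0, 6.0]},
                      index=pd.Index([1102, 1105], name='ID'))

aligned = source.reindex_like(template)
explicit = source.reindex(index=template.index, columns=template.columns)

print(aligned)
print(aligned.equals(explicit))
```

Row 1101 is absent from `source`, so it comes back as NaN; the Math column is dropped because `template` has no such column.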
3. set_index and reset_index
We start with set_index. As the name suggests, it turns one or more columns into the index.
Using a column of the table as the index:
df.head()
      School  Class  Gender   Address  Height  Weight  Math  Physics
ID
1101     S_1    C_1       M  street_1     173      63  34.0       A+
1102     S_1    C_1       F  street_2     192      73  32.5       B+
1103     S_1    C_1       M  street_2     186      82  87.2       B+
1104     S_1    C_1       F  street_2     167      81  80.4       B-
1105     S_1    C_1       F  street_4     159      64  84.8       B+
df.set_index('Class').head()
       School  Gender   Address  Height  Weight  Math  Physics
Class
C_1       S_1       M  street_1     173      63  34.0       A+
C_1       S_1       F  street_2     192      73  32.5       B+
C_1       S_1       M  street_2     186      82  87.2       B+
C_1       S_1       F  street_2     167      81  80.4       B-
C_1       S_1       F  street_4     159      64  84.8       B+
With the append parameter the current index is kept and the new level is added to it:
df.set_index('Class', append=True).head()
             School  Gender   Address  Height  Weight  Math  Physics
ID    Class
1101  C_1       S_1       M  street_1     173      63  34.0       A+
1102  C_1       S_1       F  street_2     192      73  32.5       B+
1103  C_1       S_1       M  street_2     186      82  87.2       B+
1104  C_1       S_1       F  street_2     167      81  80.4       B-
1105  C_1       S_1       F  street_4     159      64  84.8       B+
Using external values of the same length as the table as the index (they must first be wrapped in a Series, otherwise an error is raised):
df.set_index(pd.Series(range(df.shape[0]))).head()
   School  Class  Gender   Address  Height  Weight  Math  Physics
0     S_1    C_1       M  street_1     173      63  34.0       A+
1     S_1    C_1       F  street_2     192      73  32.5       B+
2     S_1    C_1       M  street_2     186      82  87.2       B+
3     S_1    C_1       F  street_2     167      81  80.4       B-
4     S_1    C_1       F  street_4     159      64  84.8       B+
A multi-level index can be built directly by passing several such Series:
df.set_index([pd.Series(range(df.shape[0])), pd.Series(np.ones(df.shape[0]))]).head()
        School  Class  Gender   Address  Height  Weight  Math  Physics
0  1.0     S_1    C_1       M  street_1     173      63  34.0       A+
1  1.0     S_1    C_1       F  street_2     192      73  32.5       B+
2  1.0     S_1    C_1       M  street_2     186      82  87.2       B+
3  1.0     S_1    C_1       F  street_2     167      81  80.4       B-
4  1.0     S_1    C_1       F  street_4     159      64  84.8       B+
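The variants of set_index shown above can be compared on a three-row table. This is only a sketch on made-up data (`toy` is a hypothetical frame); it also passes a NumPy array, which set_index accepts in the same way as a Series.

```python
import numpy as np
import pandas as pd

# Hypothetical toy table indexed by ID.
toy = pd.DataFrame({'Class': ['C_1', 'C_1', 'C_2'], 'Math': [34.0, 32.5, 97.0]},
                   index=pd.Index([1101, 1102, 1201], name='ID'))

by_class = toy.set_index('Class')                   # Class replaces ID and leaves the columns
both = toy.set_index('Class', append=True)          # ID is kept, Class becomes the inner level
by_pos = toy.set_index(pd.Series(range(len(toy))))  # external values, used positionally
by_arr = toy.set_index(np.arange(len(toy)))         # a NumPy array of the same length also works

print(by_class.index.name)       # Class
print(list(both.index.names))    # ['ID', 'Class']
print(list(by_pos.index))        # [0, 1, 2]
```

When external values are supplied, no column is consumed, so Class is still present in `by_pos` and `by_arr`.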
Next comes reset_index, whose main job is to reset the index.
By default it restores the plain integer index 0, 1, 2, ... and moves the old index back into the columns:
df.reset_index().head()
     ID  School  Class  Gender   Address  Height  Weight  Math  Physics
0  1101     S_1    C_1       M  street_1     173      63  34.0       A+
1  1102     S_1    C_1       F  street_2     192      73  32.5       B+
2  1103     S_1    C_1       M  street_2     186      82  87.2       B+
3  1104     S_1    C_1       F  street_2     167      81  80.4       B-
4  1105     S_1    C_1       F  street_4     159      64  84.8       B+
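The level parameter specifies which index level is reset, and col_level specifies which column level the label of the new column is placed in: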
L1, L2 = [ 'A' , 'B' , 'C' ] , [ 'a' , 'b' , 'c' ]
mul_index1 = pd. MultiIndex. from_product( [ L1, L2] , names= ( 'Upper' , 'Lower' ) )
L3, L4 = [ 'D' , 'E' , 'F' ] , [ 'd' , 'e' , 'f' ]
mul_index2 = pd. MultiIndex. from_product( [ L3, L4] , names= ( 'Big' , 'Small' ) )
df_temp = pd. DataFrame( np. random. rand( 9 , 9 ) , index= mul_index1, columns= mul_index2)
df_temp. head( )
Big
A
B
C
Small
a
b
c
a
b
c
a
b
c
Upper
Lower
A
a
0.779649
0.850942
0.498576
0.384237
0.070980
0.154522
0.029674
0.566804
0.607491
b
0.192239
0.861111
0.473844
0.375134
0.071124
0.954574
0.811846
0.130554
0.621900
c
0.241607
0.969242
0.425643
0.745032
0.400758
0.316587
0.510980
0.099285
0.442423
B
a
0.152618
0.079949
0.427708
0.091412
0.050334
0.504813
0.562932
0.033970
0.098759
b
0.722760
0.914694
0.061265
0.946915
0.699833
0.858146
0.235059
0.637032
0.672221
df_temp1 = df_temp. reset_index( level= 1 , col_level= 1 )
display( df_temp1. head( ) )
df_temp2= df_temp. reset_index( level= 1 , col_level= 0 )
display( df_temp2. head( ) )
Big
A
B
C
Small
Lower
a
b
c
a
b
c
a
b
c
Upper
A
a
0.779649
0.850942
0.498576
0.384237
0.070980
0.154522
0.029674
0.566804
0.607491
A
b
0.192239
0.861111
0.473844
0.375134
0.071124
0.954574
0.811846
0.130554
0.621900
A
c
0.241607
0.969242
0.425643
0.745032
0.400758
0.316587
0.510980
0.099285
0.442423
B
a
0.152618
0.079949
0.427708
0.091412
0.050334
0.504813
0.562932
0.033970
0.098759
B
b
0.722760
0.914694
0.061265
0.946915
0.699833
0.858146
0.235059
0.637032
0.672221
Big
Lower
A
B
C
Small
a
b
c
a
b
c
a
b
c
Upper
A
a
0.779649
0.850942
0.498576
0.384237
0.070980
0.154522
0.029674
0.566804
0.607491
A
b
0.192239
0.861111
0.473844
0.375134
0.071124
0.954574
0.811846
0.130554
0.621900
A
c
0.241607
0.969242
0.425643
0.745032
0.400758
0.316587
0.510980
0.099285
0.442423
B
a
0.152618
0.079949
0.427708
0.091412
0.050334
0.504813
0.562932
0.033970
0.098759
B
b
0.722760
0.914694
0.061265
0.946915
0.699833
0.858146
0.235059
0.637032
0.672221
display( df_temp1. columns)
display( df_temp2. columns)
MultiIndex([( '', 'Lower'),
('A', 'a'),
('A', 'b'),
('A', 'c'),
('B', 'a'),
('B', 'b'),
('B', 'c'),
('C', 'a'),
('C', 'b'),
('C', 'c')],
names=['Big', 'Small'])
MultiIndex([('Lower', ''),
( 'A', 'a'),
( 'A', 'b'),
( 'A', 'c'),
( 'B', 'a'),
( 'B', 'b'),
( 'B', 'c'),
( 'C', 'a'),
( 'C', 'b'),
( 'C', 'c')],
names=['Big', 'Small'])
df_temp1.index
Index(['A', 'A', 'A', 'B', 'B', 'B', 'C', 'C', 'C'], dtype='object', name='Upper')
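A smaller version of the same idea may be easier to follow. The sketch below uses made-up frames (`toy` with single-level columns, `wide` with two-level columns; both names are hypothetical) to show level, drop and col_level.

```python
import pandas as pd

idx = pd.MultiIndex.from_product([['A', 'B'], ['a', 'b']], names=('Upper', 'Lower'))
toy = pd.DataFrame({'x': [1, 2, 3, 4]}, index=idx)

all_reset = toy.reset_index()                        # both levels become columns
lower_only = toy.reset_index(level='Lower')          # only Lower becomes a column
dropped = toy.reset_index(level='Lower', drop=True)  # Lower is discarded entirely

# With two-level columns, col_level chooses where the label 'Lower' is written;
# the other level is filled with the empty string.
cols = pd.MultiIndex.from_product([['X'], ['p', 'q']], names=('Big', 'Small'))
wide = pd.DataFrame([[1, 2], [3, 4], [5, 6], [7, 8]], index=idx, columns=cols)
top = wide.reset_index(level='Lower', col_level=0)
bottom = wide.reset_index(level='Lower', col_level=1)

print(top.columns[0])     # ('Lower', '')
print(bottom.columns[0])  # ('', 'Lower')
```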
4. rename_axis and rename
rename_axis is aimed mainly at multi-level indexes: it changes the name of an index level, not the index labels.
df_temp.rename_axis(index={'Lower': 'LowerLower'}, columns={'Big': 'BigBig'})
BigBig
A
B
C
Small
a
b
c
a
b
c
a
b
c
Upper
LowerLower
A
a
0.779649
0.850942
0.498576
0.384237
0.070980
0.154522
0.029674
0.566804
0.607491
b
0.192239
0.861111
0.473844
0.375134
0.071124
0.954574
0.811846
0.130554
0.621900
c
0.241607
0.969242
0.425643
0.745032
0.400758
0.316587
0.510980
0.099285
0.442423
B
a
0.152618
0.079949
0.427708
0.091412
0.050334
0.504813
0.562932
0.033970
0.098759
b
0.722760
0.914694
0.061265
0.946915
0.699833
0.858146
0.235059
0.637032
0.672221
c
0.300491
0.622385
0.441322
0.039829
0.643909
0.055289
0.783965
0.311105
0.030763
C
a
0.203551
0.101941
0.954500
0.845587
0.690844
0.801937
0.290936
0.780660
0.475275
b
0.320449
0.726794
0.613314
0.827977
0.646678
0.059252
0.422571
0.942435
0.692632
c
0.511147
0.142134
0.861073
0.667099
0.307494
0.327941
0.858620
0.464931
0.394346
The rename method changes the row or column index labels, not the index names:
df_temp.rename(index={'A': 'T'}, columns={'e': 'changed_e'}).head()
Big
A
B
C
Small
a
b
c
a
b
c
a
b
c
Upper
Lower
T
a
0.779649
0.850942
0.498576
0.384237
0.070980
0.154522
0.029674
0.566804
0.607491
b
0.192239
0.861111
0.473844
0.375134
0.071124
0.954574
0.811846
0.130554
0.621900
c
0.241607
0.969242
0.425643
0.745032
0.400758
0.316587
0.510980
0.099285
0.442423
B
a
0.152618
0.079949
0.427708
0.091412
0.050334
0.504813
0.562932
0.033970
0.098759
b
0.722760
0.914694
0.061265
0.946915
0.699833
0.858146
0.235059
0.637032
0.672221
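The difference between the two methods is easy to check on a tiny frame. This is a sketch on made-up data (`toy` and the column `x` are hypothetical): rename_axis touches only the level names, rename only the labels.

```python
import pandas as pd

idx = pd.MultiIndex.from_product([['A', 'B'], ['a', 'b']], names=('Upper', 'Lower'))
toy = pd.DataFrame({'x': [1, 2, 3, 4]}, index=idx)

# rename_axis: the level called 'Lower' gets a new name; labels are untouched.
renamed_names = toy.rename_axis(index={'Lower': 'LowerLower'})

# rename: the label 'A' becomes 'T' and the column 'x' becomes 'y'; names are untouched.
renamed_labels = toy.rename(index={'A': 'T'}, columns={'x': 'y'})

print(list(renamed_names.index.names))                          # ['Upper', 'LowerLower']
print(list(renamed_labels.index.get_level_values('Upper')))     # ['T', 'T', 'B', 'B']
```

A key in the rename mapping that matches no existing label is silently ignored by default, so a mistyped label leaves the frame unchanged.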
IV. Common indexing functions
1. The where function
where fills the cells for which the condition is False (with NaN by default):
df.head()
      School  Class  Gender   Address  Height  Weight  Math  Physics
ID
1101     S_1    C_1       M  street_1     173      63  34.0       A+
1102     S_1    C_1       F  street_2     192      73  32.5       B+
1103     S_1    C_1       M  street_2     186      82  87.2       B+
1104     S_1    C_1       F  street_2     167      81  80.4       B-
1105     S_1    C_1       F  street_4     159      64  84.8       B+
df.where(df['Gender'] == 'M').head()
      School  Class  Gender   Address  Height  Weight  Math  Physics
ID
1101     S_1    C_1       M  street_1   173.0    63.0  34.0       A+
1102     NaN    NaN     NaN       NaN     NaN     NaN   NaN      NaN
1103     S_1    C_1       M  street_2   186.0    82.0  87.2       B+
1104     NaN    NaN     NaN       NaN     NaN     NaN   NaN      NaN
1105     NaN    NaN     NaN       NaN     NaN     NaN   NaN      NaN
Filtering this way and then dropping the NaN rows selects the same rows as the [] operator (where upcasts the integer columns to float, whereas [] keeps them as integers):
df.where(df['Gender'] == 'M').dropna().head()
df[df['Gender'] == 'M'].head()
      School  Class  Gender   Address  Height  Weight  Math  Physics
ID
1101     S_1    C_1       M  street_1     173      63  34.0       A+
1103     S_1    C_1       M  street_2     186      82  87.2       B+
1201     S_1    C_2       M  street_5     188      68  97.0       A-
1203     S_1    C_2       M  street_6     160      53  58.8       A+
1301     S_1    C_3       M  street_4     161      68  31.5       B+
The first argument is the boolean condition and the second is the fill value:
a= np. random. rand( df. shape[ 0 ] , df. shape[ 1 ] )
print ( a)
df. where( df[ 'Gender' ] == 'M' , a) . head( )
[[0.7161581 0.10276669 0.40238542 0.08958898 0.81354208 0.93618131
0.07624998 0.76951042]
[0.5989863 0.33505363 0.02580452 0.97710051 0.66610509 0.87578929
0.10579943 0.32587477]
[0.72323026 0.54804755 0.6574625 0.29000164 0.56393993 0.60193394
0.13987508 0.66423202]
[0.18904524 0.72877605 0.6523432 0.36142768 0.69567451 0.39747439
0.30469684 0.37125112]
[0.00551586 0.88153884 0.54566294 0.97107711 0.48850192 0.10998194
0.90682808 0.41378463]
[0.50033165 0.7738362 0.40988636 0.5300674 0.48312333 0.8504542
0.04959469 0.09917488]
[0.02734842 0.96474427 0.2802373 0.51467892 0.28559232 0.31821509
0.66709526 0.27763044]
[0.51644131 0.32083044 0.64969486 0.89021328 0.67345517 0.04336458
0.09367256 0.75744686]
[0.22493445 0.00813298 0.22216915 0.86470618 0.43776919 0.28244148
0.30980727 0.58315116]
[0.48258786 0.84567059 0.69134007 0.34076216 0.62695586 0.63140201
0.84808381 0.37466675]
[0.56076384 0.58164287 0.46504719 0.47365642 0.19272048 0.58133442
0.24137399 0.05412933]
[0.86008694 0.47449294 0.99349784 0.41395056 0.21880338 0.57870774
0.08266912 0.61736266]
[0.92363794 0.65247997 0.23419656 0.49394314 0.39020457 0.9076912
0.1787334 0.44567342]
[0.75460692 0.87818376 0.9506545 0.0726399 0.19348714 0.99663363
0.15222477 0.45247023]
[0.61724088 0.93745891 0.27255907 0.39972334 0.426046 0.32858071
0.76331933 0.89152397]
[0.34580174 0.80287412 0.43827906 0.80077107 0.04982336 0.20480553
0.76536398 0.96840615]
[0.46040455 0.75916605 0.07480359 0.22483045 0.51732291 0.38761611
0.89792331 0.21509696]
[0.06037899 0.54175144 0.69026867 0.37691624 0.63080779 0.47401334
0.29484837 0.60367724]
[0.51877573 0.13529959 0.03847122 0.56332936 0.43954205 0.14320149
0.53932022 0.94058724]
[0.48122222 0.70688747 0.33347702 0.22922328 0.63617473 0.12761284
0.01558933 0.42228307]
[0.30437386 0.97890045 0.87982765 0.43573368 0.96842477 0.7905138
0.78644784 0.65063758]
[0.82491619 0.84071325 0.45118883 0.52083531 0.61488637 0.61067191
0.49898609 0.28876092]
[0.22509524 0.28846881 0.46573631 0.47336345 0.92948157 0.77863736
0.39102284 0.50983845]
[0.52390029 0.09511125 0.37396754 0.90084271 0.68217899 0.64254994
0.74261885 0.58969606]
[0.16566242 0.07908791 0.75588982 0.57909687 0.73887746 0.10960639
0.27713074 0.41753722]
[0.07614926 0.10933826 0.94807557 0.29773441 0.01351866 0.39391314
0.22296291 0.38958623]
[0.01210252 0.1967451 0.80815367 0.17967125 0.54335179 0.8655415
0.07346332 0.69454703]
[0.5345165 0.16660525 0.40307645 0.52590277 0.86212669 0.03425073
0.65680325 0.14333681]
[0.58388521 0.58904123 0.47712613 0.27945041 0.06372401 0.75644174
0.14670172 0.5711495 ]
[0.43603091 0.2733335 0.10495188 0.67239001 0.57908462 0.15874856
0.42475036 0.90589905]
[0.8144674 0.9944558 0.68966555 0.09715326 0.24935761 0.95897933
0.30676529 0.31195192]
[0.64092077 0.27853017 0.76853506 0.37189139 0.57513649 0.7483777
0.66384054 0.19440796]
[0.87374265 0.19244049 0.52948927 0.91302972 0.45222102 0.08765404
0.12991008 0.50151027]
[0.17282369 0.57290435 0.48890098 0.94771301 0.12483773 0.10294902
0.2430428 0.92204093]
[0.47430317 0.49779933 0.68934569 0.7553649 0.82349556 0.96427937
0.17411753 0.8015759 ]]
School
Class
Gender
Address
Height
Weight
Math
Physics
ID
1101
S_1
C_1
M
street_1
173.000000
63.000000
34.000000
A+
1102
0.598986
0.335054
0.0258045
0.977101
0.666105
0.875789
0.105799
0.325875
1103
S_1
C_1
M
street_2
186.000000
82.000000
87.200000
B+
1104
0.189045
0.728776
0.652343
0.361428
0.695675
0.397474
0.304697
0.371251
1105
0.00551586
0.881539
0.545663
0.971077
0.488502
0.109982
0.906828
0.413785
2. The mask function
mask is the opposite of where and otherwise identical: it fills the cells for which the condition is True.
df.mask(df['Gender'] == 'M').dropna().head()
df[df['Gender'] != 'M'].head()
df.where(df['Gender'] != 'M').dropna().head()
      School  Class  Gender   Address  Height  Weight  Math  Physics
ID
1102     S_1    C_1       F  street_2   192.0    73.0  32.5       B+
1104     S_1    C_1       F  street_2   167.0    81.0  80.4       B-
1105     S_1    C_1       F  street_4   159.0    64.0  84.8       B+
1202     S_1    C_2       F  street_4   176.0    94.0  63.5       B-
1204     S_1    C_2       F  street_5   162.0    63.0  33.8        B
df.mask(df['Gender'] == 'M', np.random.rand(df.shape[0], df.shape[1])).head()
df.where(df['Gender'] != 'M', np.random.rand(df.shape[0], df.shape[1])).head()
School
Class
Gender
Address
Height
Weight
Math
Physics
ID
1101
0.0524246
0.778497
0.906421
0.693452
0.196364
0.155836
0.594141
0.337383
1102
S_1
C_1
F
street_2
192.000000
73.000000
32.500000
B+
1103
0.848831
0.444878
0.914591
0.36934
0.332417
0.605245
0.649709
0.613518
1104
S_1
C_1
F
street_2
167.000000
81.000000
80.400000
B-
1105
S_1
C_1
F
street_4
159.000000
64.000000
84.800000
B+
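The relationship between where and mask can be verified on a short Series. This is a minimal sketch with made-up numbers; the fill value -1 is an arbitrary choice.

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4])
cond = s > 2

kept = s.where(cond)          # entries failing cond become NaN (dtype becomes float64)
replaced = s.where(cond, -1)  # entries failing cond become -1
masked = s.mask(cond, -1)     # entries satisfying cond become -1

print(replaced.tolist())  # [-1, -1, 3, 4]
print(masked.tolist())    # [1, 2, -1, -1]
print(masked.equals(s.where(~cond, -1)))  # True: mask(cond) is where(~cond)
```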
3. The query function
df.head()
      School  Class  Gender   Address  Height  Weight  Math  Physics
ID
1101     S_1    C_1       M  street_1     173      63  34.0       A+
1102     S_1    C_1       F  street_2     192      73  32.5       B+
1103     S_1    C_1       M  street_2     186      82  87.2       B+
1104     S_1    C_1       F  street_2     167      81  80.4       B-
1105     S_1    C_1       F  street_4     159      64  84.8       B+
In the boolean expression passed to query, all of the following are valid: row and column index names, strings, and/not/or/&/|/~/not in/in/==/!=, and the four arithmetic operators.
df.query('(Address in ["street_6","street_7"])&(Weight>(70+10))&(ID in [1303,2304,2402])')
      School  Class  Gender   Address  Height  Weight  Math  Physics
ID
1303     S_1    C_3       M  street_7     188      82  49.7        B
2304     S_2    C_3       F  street_6     164      81  95.5       A-
2402     S_2    C_4       M  street_7     166      82  48.7        B
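For comparison, the same kind of filter can be written with ordinary boolean indexing. The sketch below uses a made-up three-row table (`toy` and `limit` are hypothetical names) and additionally relies on the `@` prefix, which lets a query string refer to a local Python variable.

```python
import pandas as pd

# Hypothetical toy table; the index is named ID so that query can refer to it.
toy = pd.DataFrame({'Address': ['street_6', 'street_7', 'street_1'],
                    'Weight': [81, 60, 90]},
                   index=pd.Index([2304, 1303, 1101], name='ID'))

limit = 70  # local variable, referenced as @limit inside the query string

picked = toy.query(
    '(Address in ["street_6", "street_7"]) and (Weight > @limit + 10) and (ID in [2304, 1303])'
)

# The same selection with boolean masks.
same = toy[toy['Address'].isin(['street_6', 'street_7'])
           & (toy['Weight'] > limit + 10)
           & toy.index.isin([2304, 1303])]

print(picked)
print(picked.equals(same))
```

Only row 2304 passes all three conditions: 1303 fails the weight test and 1101 fails the address test.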
V. Handling duplicates
1. The duplicated method
This method returns a boolean Series that indicates whether each row is a duplicate:
df.duplicated('Class').head()
ID
1101 False
1102 True
1103 True
1104 True
1105 True
dtype: bool
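The optional keep parameter defaults to 'first', which marks the first occurrence as not duplicated. With 'last', the last occurrence is marked as not duplicated. With False, every member of a duplicated group is marked True.
df.duplicated('Class', keep='last').tail()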
ID
2401 True
2402 True
2403 True
2404 True
2405 False
dtype: bool
df.duplicated('Class', keep=False).tail()
ID
2401 True
2402 True
2403 True
2404 True
2405 True
dtype: bool
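The three settings of keep are easiest to compare on a handful of rows. A minimal sketch on a made-up column:

```python
import pandas as pd

# Hypothetical toy column: C_1 appears three times, C_2 once.
toy = pd.DataFrame({'Class': ['C_1', 'C_1', 'C_2', 'C_1']})

first = toy.duplicated('Class')               # first occurrence of each value is False
last = toy.duplicated('Class', keep='last')   # last occurrence of each value is False
every = toy.duplicated('Class', keep=False)   # every value that occurs more than once is True

print(first.tolist())  # [False, True, False, True]
print(last.tolist())   # [True, True, False, False]
print(every.tolist())  # [True, True, False, True]
```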
2. The drop_duplicates method
As the name suggests, it removes duplicated rows. This can be useful in the grouping operations of later chapters, for example when the first row of each group should be kept:
df.drop_duplicates('Class')
      School  Class  Gender   Address  Height  Weight  Math  Physics
ID
1101     S_1    C_1       M  street_1     173      63  34.0       A+
1201     S_1    C_2       M  street_5     188      68  97.0       A-
1301     S_1    C_3       M  street_4     161      68  31.5       B+
2401     S_2    C_4       F  street_2     192      62  45.3        A
The parameters are similar to those of duplicated:
df.drop_duplicates('Class', keep='last')
      School  Class  Gender   Address  Height  Weight  Math  Physics
ID
2105     S_2    C_1       M  street_4     170      81  34.2        A
2205     S_2    C_2       F  street_7     183      76  85.4        B
2305     S_2    C_3       M  street_4     187      73  48.9        B
2405     S_2    C_4       F  street_6     193      54  47.6        B
When several columns are passed, they are treated jointly, like a multi-level index, when rows are compared for duplicates:
df.drop_duplicates(['School', 'Class'])
      School  Class  Gender   Address  Height  Weight  Math  Physics
ID
1101     S_1    C_1       M  street_1     173      63  34.0       A+
1201     S_1    C_2       M  street_5     188      68  97.0       A-
1301     S_1    C_3       M  street_4     161      68  31.5       B+
2101     S_2    C_1       M  street_7     174      84  83.3        C
2201     S_2    C_2       M  street_5     193     100  39.1        B
2301     S_2    C_3       F  street_4     157      78  72.3       B+
2401     S_2    C_4       F  street_2     192      62  45.3        A
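The link to duplicated and to grouping can be made explicit. The following sketch, on a made-up four-row table, checks that drop_duplicates keeps exactly the rows that duplicated marks False, and that this matches taking the first row of each group.

```python
import pandas as pd

# Hypothetical toy table; (S_1, C_1) occurs twice.
toy = pd.DataFrame({'School': ['S_1', 'S_1', 'S_2', 'S_2'],
                    'Class': ['C_1', 'C_1', 'C_1', 'C_2'],
                    'Math': [34.0, 32.5, 83.3, 39.1]})

first_rows = toy.drop_duplicates(['School', 'Class'])
via_mask = toy[~toy.duplicated(['School', 'Class'])]      # rows not flagged as duplicates
via_groupby = toy.groupby(['School', 'Class']).head(1)    # first row of each group

print(list(first_rows.index))          # [0, 2, 3]
print(first_rows.equals(via_mask))     # True
print(first_rows.equals(via_groupby))  # True
```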
VI. Sampling
The sampling function discussed here is sample.
(a) n is the sample size
df.sample(n=5)
      School  Class  Gender   Address  Height  Weight  Math  Physics
ID
1102     S_1    C_1       F  street_2     192      73  32.5       B+
1201     S_1    C_2       M  street_5     188      68  97.0       A-
1101     S_1    C_1       M  street_1     173      63  34.0       A+
2204     S_2    C_2       M  street_1     175      74  47.2       B-
2405     S_2    C_4       F  street_6     193      54  47.6        B
(b) frac is the sampling fraction
df.sample(frac=0.05)
      School  Class  Gender   Address  Height  Weight  Math  Physics
ID
1105     S_1    C_1       F  street_4     159      64  84.8       B+
2402     S_2    C_4       M  street_7     166      82  48.7        B
(c) replace controls whether sampling is done with replacement
df.sample(n=df.shape[0], replace=True).head()
      School  Class  Gender   Address  Height  Weight  Math  Physics
ID
2402     S_2    C_4       M  street_7     166      82  48.7        B
2101     S_2    C_1       M  street_7     174      84  83.3        C
2302     S_2    C_3       M  street_5     171      88  32.7        A
2401     S_2    C_4       F  street_2     192      62  45.3        A
2405     S_2    C_4       F  street_6     193      54  47.6        B
df.sample(n=35, replace=True).index.is_unique
False
(d) axis is the sampling dimension; the default 0 samples rows
df.sample(n=3, axis=1).head()
      Math  Weight  School
ID
1101  34.0      63     S_1
1102  32.5      73     S_1
1103  87.2      82     S_1
1104  80.4      81     S_1
1105  84.8      64     S_1
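(e) weights gives the sampling weights, which are normalized automatically
For example, with five samples weighted 1, 2, 3, 4, 5, the probabilities of being drawn are 1/15, 2/15, 3/15, 4/15 and 5/15. The weights parameter of sample is a convenience: pass the raw weights and pandas converts them to normalized probabilities before sampling.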
df. sample( n= 3 , weights= np. random. rand( df. shape[ 0 ] ) ) . head( )
df. sample( n= 3 , weights= list ( range ( df. shape[ 0 ] ) ) ) . head( )
array([0.42302361, 0.48301209, 0.07106683, 0.20566321, 0.04810366,
0.24624529, 0.68941873, 0.3931274 , 0.86344223, 0.5209619 ,
0.70827892, 0.60694938, 0.55769355, 0.8964586 , 0.58146323,
0.32480769, 0.40078874, 0.24710269, 0.13704178, 0.23662958,
0.289439 , 0.08274463, 0.1015726 , 0.3779845 , 0.76932914,
0.69591814, 0.08994818, 0.87528269, 0.24571314, 0.0634573 ,
0.81463873, 0.67125488, 0.65952737, 0.34445748, 0.72742361])
df.sample(n=3, weights=df['Math']).head()
      School  Class  Gender   Address  Height  Weight  Math  Physics
ID
1305     S_1    C_3       F  street_5     187      69  61.7       B-
2103     S_2    C_1       M  street_4     157      61  52.5       B-
2105     S_2    C_1       M  street_4     170      81  34.2        A
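The normalization described in (e) can be written out by hand. The sketch below uses the weights 1 to 5 from the example above on a made-up five-row table; `random_state=0` is an added assumption so that the draw is reproducible.

```python
import pandas as pd

# Hypothetical five-row table with raw (unnormalized) weights 1..5.
toy = pd.DataFrame({'value': list('abcde')})
weights = pd.Series([1, 2, 3, 4, 5], index=toy.index)

# What the raw weights amount to after normalization: 1/15, 2/15, ..., 5/15.
probs = weights / weights.sum()

# Draw with replacement so that relative frequencies can be compared with probs.
draws = toy.sample(n=1000, replace=True, weights=weights, random_state=0)
freq = draws['value'].value_counts(normalize=True).sort_index()

print(probs.round(3).tolist())
print(freq.round(3).to_dict())
```

With 1000 draws the observed frequencies should be close to, but not exactly equal to, the normalized weights.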
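VII. Questions and exercises
1. Questions
[Question 1] How can the order of columns or rows be changed? How can odd and even rows (columns) be swapped?
[Question 2] To select a given subset of a DataFrame, list as many different ways of doing so as you can.
[Question 3] Is query slower than the other indexing methods? Which indexing method is most efficient in which situation?
[Question 4] Can a slice object be used with a single-level index? If so, how? Give an example.
[Question 5] How can the index labels of the missing values in a column be found quickly?
[Question 6] Which situations is each of the index-setting methods suited to? How can the index of a DataFrame be replaced directly by an arbitrary given index of the same length?
[Question 7] In which situations are multi-level indexes appropriate?
[Question 8] When is duplicate handling needed?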
2. Exercises
[Exercise 1] Given a dataset about UFO sightings, answer the following questions:
pd.read_csv('data/UFO.csv').head()
           datetime     shape  duration (seconds)   latitude    longitude
0  10/10/1949 20:30  cylinder              2700.0  29.883056   -97.941111
1  10/10/1949 21:00     light              7200.0  29.384210   -98.581082
2  10/10/1955 17:00    circle                20.0  53.200000    -2.916667
3  10/10/1956 21:00    circle                20.0  28.978333   -96.645833
4  10/10/1960 20:00     light               900.0  21.418056  -157.803611
(a) Among all sightings observed for longer than 60 s, which shape is the most common?
(b) Partition the coordinates: longitude from -180° to 180° in steps of 30°, latitude from -90° to 90° in steps of 18°. Which cell has the most reported UFO events?
(a)
df = pd.read_csv('data/UFO.csv')
df.rename(columns={'duration (seconds)': 'duration'}, inplace=True)
df['duration'] = df['duration'].astype('float')  # astype returns a new Series, so assign it back
df.query('duration > 60')['shape'].value_counts().nlargest(1)
light 10713
Name: shape, dtype: int64
(b)
bins_long= list ( range ( - 180 , 180 , 13 ) )
bins_la= list ( range ( - 90 , 90 , 18 ) )
cuts_long= pd. cut( df[ 'longitude' ] , bins= bins_long)
df[ 'cuts_long' ] = cuts_long
cuts_la= pd. cut( df[ 'latitude' ] , bins= bins_la)
df[ 'cuts_la' ] = cuts_la
df. head( )
           datetime     shape  duration   latitude    longitude     cuts_long   cuts_la
0  10/10/1949 20:30  cylinder    2700.0  29.883056   -97.941111   (-102, -89]  (18, 36]
1  10/10/1949 21:00     light    7200.0  29.384210   -98.581082   (-102, -89]  (18, 36]
2  10/10/1955 17:00    circle      20.0  53.200000    -2.916667      (-11, 2]  (36, 54]
3  10/10/1956 21:00    circle      20.0  28.978333   -96.645833   (-102, -89]  (18, 36]
4  10/10/1960 20:00     light     900.0  21.418056  -157.803611  (-167, -154]  (18, 36]
pd. Series( list ( zip ( df[ 'cuts_long' ] , df[ 'cuts_la' ] ) ) ) . value_counts( ) . head( )
((-89.0, -76.0], (36.0, 54.0]) 16685
((-128.0, -115.0], (36.0, 54.0]) 12109
((-76.0, -63.0], (36.0, 54.0]) 10188
((-89.0, -76.0], (18.0, 36.0]) 9499
((-102.0, -89.0], (36.0, 54.0]) 6663
dtype: int64
df. set_index( [ 'cuts_long' , 'cuts_la' ] ) . index. value_counts( ) . head( )
((-89.0, -76.0], (36.0, 54.0]) 16685
((-128.0, -115.0], (36.0, 54.0]) 12109
((-76.0, -63.0], (36.0, 54.0]) 10188
((-89.0, -76.0], (18.0, 36.0]) 9499
((-102.0, -89.0], (36.0, 54.0]) 6663
dtype: int64
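For reference, a grid with exactly the 30° by 18° cells named in the question could be built as sketched below. This is an illustration only: it runs on five coordinates rounded from the head of the UFO table shown above, not on the full dataset, and the names `pts`, `lon_edges`, `lat_edges` and `counts` are hypothetical.

```python
import numpy as np
import pandas as pd

# Five coordinates rounded from the first rows of the UFO table (illustration only).
pts = pd.DataFrame({'longitude': [-97.9, -98.6, -2.9, -96.6, -157.8],
                    'latitude': [29.9, 29.4, 53.2, 29.0, 21.4]})

# The stop values 181 and 91 make sure the right-most edges 180 and 90 are included.
lon_edges = np.arange(-180, 181, 30)  # 13 edges, 12 longitude bins
lat_edges = np.arange(-90, 91, 18)    # 11 edges, 10 latitude bins

cells = pts.assign(
    lon_bin=pd.cut(pts['longitude'], bins=lon_edges, include_lowest=True),
    lat_bin=pd.cut(pts['latitude'], bins=lat_edges, include_lowest=True),
)

# Count the points per (longitude bin, latitude bin) cell, most populated first.
counts = (cells.groupby(['lon_bin', 'lat_bin'], observed=True)
               .size()
               .sort_values(ascending=False))
print(counts)
```

On these five points the most populated cell is longitude (-120, -90] with latitude (18, 36], holding three of them.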
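[Exercise 2] Given a dataset about Pokemon, answer the following questions:
pd.read_csv('data/Pokemon.csv').head()
   #                   Name  Type 1  Type 2  Total  HP  Attack  Defense  Sp. Atk  Sp. Def  Speed  Generation  Legendary
0  1              Bulbasaur   Grass  Poison    318  45      49       49       65       65     45           1      False
1  2                Ivysaur   Grass  Poison    405  60      62       63       80       80     60           1      False
2  3               Venusaur   Grass  Poison    525  80      82       83      100      100     80           1      False
3  3  VenusaurMega Venusaur   Grass  Poison    625  80     100      123      122      120     80           1      False
4  4             Charmander    Fire     NaN    309  39      52       43       60       50     65           1      False
(a) What fraction of all Pokemon have two types?
(b) Among all Pokemon whose base stat total (Total) is at least 580, what fraction are non-legendary (Legendary=False)?
(c) Among Pokemon whose first type is Fighting, which three have the highest Attack?
(d) For which type is the mean range of the six base stats (HP, Attack, Sp. Atk, Defense, Sp. Def, Speed) the largest? (Consider only the first type; the mean is taken per type.)
(e) Which type (first type only) has the highest proportion of legendary Pokemon? Is the base stat total of that type's legendary Pokemon also the highest?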
(a)
df= pd. read_csv( 'data/Pokemon.csv' )
df[ 'Type 2' ] . count( ) / df. shape[ 0 ]
0.5175
(b)
n1= df. query( '(Total>=580)&(Legendary==False)' ) [ 'Legendary' ] . count( )
n2= df. query( '(Total>=580)&(Legendary==True)' ) [ 'Legendary' ] . count( )
n1/ ( n1+ n2)
0.4247787610619469
df. query( 'Total>=580' ) [ 'Legendary' ] . value_counts( normalize= True )
True 0.575221
False 0.424779
Name: Legendary, dtype: float64
(c)
df[df['Type 1'] == 'Fighting'].sort_values(by='Attack', ascending=False).iloc[:3]
       #                 Name    Type 1  Type 2  Total   HP  Attack  Defense  Sp. Atk  Sp. Def  Speed  Generation  Legendary
498  448  LucarioMega Lucario  Fighting   Steel    625   70     145       88      140       70    112           4      False
594  534           Conkeldurr  Fighting     NaN    505  105     140       95       55       65     45           5      False
74    68              Machamp  Fighting     NaN    505   90     130       80       65       85     55           1      False
(d)
df['range'] = df.iloc[:, 5:11].max(axis=1) - df.iloc[:, 5:11].min(axis=1)
attribute = df[['Type 1', 'range']].set_index('Type 1')
max_range = 0
result = ''
for i in attribute.index.unique():
    # mean stat range over all Pokemon of this first type
    temp = attribute.loc[[i], 'range'].mean()
    if temp > max_range:
        max_range = temp
        result = i
result
'Steel'
(e)
df. query( 'Legendary==True' ) [ 'Type 1' ] . value_counts( ) . index[ 0 ]
'Psychic'
attribute = df.query('Legendary==True')[['Type 1', 'Total']].set_index('Type 1')
max_mean = 0
result = ''
for i in attribute.index.unique():
    # mean Total of the legendary Pokemon of this first type
    temp = attribute.loc[[i], 'Total'].mean()
    if temp > max_mean:
        max_mean = temp
        result = i
result
'Normal'