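Introduction to Pandas
- A NumPy array behaves like a list: values are addressed by position and carry no labels. A Pandas object behaves like a dictionary: values are addressed by label. Pandas is built on top of NumPy and makes NumPy-centered applications simpler.
- Pandas has two main data structures, Series and DataFrame.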
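To make the list-versus-dictionary comparison concrete, here is a minimal sketch. The labels "x", "y" and "z" are invented for illustration and do not come from the original text.

```python
import numpy as np
import pandas as pd

# A NumPy array is addressed by integer position only.
arr = np.array([100, 59, 80])
first_by_position = arr[0]

# A Series attaches a label to every value, much like a dict.
# The labels "x", "y", "z" are hypothetical.
ser = pd.Series(arr, index=["x", "y", "z"])
first_by_label = ser["x"]

# The data behind a Series can still be retrieved as a NumPy array.
underlying = ser.to_numpy()

print(first_by_position, first_by_label, type(underlying).__name__)
```

Both lookups return the same value; only the way of addressing it differs.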
Series
import pandas as pd
import numpy as np
s = pd.Series([1,3,6,np.nan,44,1])
print(s)
print(s[1]) # elements can be accessed directly by index label
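The string representation of a Series shows the index on the left and the values on the right. Because no index was specified, a default integer index from 0 to N-1 is created. Below is a Series with an explicit index: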
grade = pd.Series([100,59,80],index=["李明","李红","王美"])
print(grade.values)
print(grade.index)
print(grade["李明"])
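As a small complement, the sketch below compares a default index with an explicit one and builds a Series directly from a dict, which fits the "dictionary" view of Pandas. The keys "a", "b" and "c" are placeholder labels chosen for this example.

```python
import pandas as pd

# Without an index argument, the labels are 0 .. N-1.
default_indexed = pd.Series([100, 59, 80])

# A dict supplies labels and values at once (keys become the index).
# The keys are hypothetical placeholders.
from_dict = pd.Series({"a": 100, "b": 59, "c": 80})

print(list(default_indexed.index))
print(list(from_dict.index))

# The `in` operator tests membership in the index, as it does for dict keys.
has_label = "a" in from_dict
has_value_as_label = 100 in from_dict
print(has_label, has_value_as_label)
```

Note that `in` checks labels, not values, so `100 in from_dict` is False here.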
DataFrame
dates = pd.date_range("20160101",periods=6)
df = pd.DataFrame(np.random.randn(6,4),index = dates,columns=['a','b','c','d'])
print(df)
a b c d
2016-01-01 -0.378199 -0.300236 -1.207843 -1.658223
2016-01-02 -1.031397 -0.834695 -0.417703 -0.318720
2016-01-03 -2.346667 1.615651 1.726296 1.152253
2016-01-04 1.389872 0.952453 -0.737092 1.555059
2016-01-05 0.735490 0.297005 -0.542341 0.559540
2016-01-06 -1.962791 1.776028 -1.917368 -0.679542
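A DataFrame is a tabular data structure. It holds an ordered collection of columns, and each column can have a different value type (numeric, string, boolean, and so on). A DataFrame has both a row index and a column index, and it can be viewed as a large dictionary of Series that share the same index.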
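The "dictionary of Series" view can be sketched directly by building a DataFrame from a dict whose values are Series. The column names and row labels below are hypothetical and serve only as an illustration.

```python
import pandas as pd

# Hypothetical row labels shared by every column.
idx = ["r1", "r2", "r3"]

frame = pd.DataFrame({
    "num": pd.Series([1, 2, 3], index=idx),           # integer column
    "text": pd.Series(["x", "y", "z"], index=idx),    # string column
    "flag": pd.Series([True, False, True], index=idx) # boolean column
})

# Looking up a key returns the Series stored under that column name.
col = frame["num"]
print(type(col).__name__)
print(frame.dtypes)
```

Each column keeps its own dtype, while all columns share the row index.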
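Next, access the data in the DataFrame. Note that when indexing a single element this way, the column label comes first and the row label second.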
print(df['b'])
2016-01-01 -0.300236
2016-01-02 -0.834695
2016-01-03 1.615651
2016-01-04 0.952453
2016-01-05 0.297005
2016-01-06 1.776028
Freq: D, Name: b, dtype: float64
print(df['b']['2016-01-05'])
0.2970052798746942
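Besides chaining two lookups, a single element can also be reached in one step with `loc` or `at`, which take the row label first and the column label second. The sketch below uses `np.arange` instead of random numbers so that the values are reproducible; that substitution is an assumption made for this example.

```python
import numpy as np
import pandas as pd

dates = pd.date_range("20160101", periods=6)
# Deterministic values (0..23) stand in for the random data above.
df = pd.DataFrame(np.arange(24).reshape((6, 4)),
                  index=dates, columns=["a", "b", "c", "d"])

# Chained lookup: column first, then row.
chained = df["b"]["2016-01-05"]

# One-step lookups: row first, then column.
by_loc = df.loc["2016-01-05", "b"]
by_at = df.at[pd.Timestamp("2016-01-05"), "b"]

print(chained, by_loc, by_at)
```

The one-step forms are generally preferable when assigning values, because the chained form may operate on an intermediate object.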
Create a DataFrame without explicit row or column labels and access it
df = pd.DataFrame(np.arange(12).reshape((3,4)))
print(df)
print(df[1][0])
0 1 2 3
0 0 1 2 3
1 4 5 6 7
2 8 9 10 11
1
Specify the type of each column
df1 = pd.DataFrame({
'A':1,
'B':pd.Timestamp('20180928'),
'C':pd.Series(1,index=list(range(4)),dtype='float32'),
'D':np.array([3]*4,dtype='int32'),
'E':pd.Categorical(["test","train","test","train"]),
'F':'foo'})
print(df1)
print(df1['B'])
print(df1['B'][1])
A B C D E F
0 1 2018-09-28 1.0 3 test foo
1 1 2018-09-28 1.0 3 train foo
2 1 2018-09-28 1.0 3 test foo
3 1 2018-09-28 1.0 3 train foo
0 2018-09-28
1 2018-09-28
2 2018-09-28
3 2018-09-28
Name: B, dtype: datetime64[ns]
2018-09-28 00:00:00
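To confirm which type each column actually received, the `dtypes` attribute can be inspected. The sketch below rebuilds the same frame so that it runs on its own.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({
    'A': 1,                                                      # scalar, broadcast to every row
    'B': pd.Timestamp('20180928'),                               # datetime column
    'C': pd.Series(1, index=list(range(4)), dtype='float32'),    # explicit float32
    'D': np.array([3] * 4, dtype='int32'),                       # explicit int32
    'E': pd.Categorical(["test", "train", "test", "train"]),     # categorical
    'F': 'foo'})                                                 # string scalar, broadcast

# One dtype per column.
print(df1.dtypes)
```

Scalars such as `1` and `'foo'` are broadcast to the length set by the columns that have one (four rows here).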
View the row labels
print(df1.index)
Int64Index([0, 1, 2, 3], dtype='int64')
View the column labels
print(df1.columns)
Index(['A', 'B', 'C', 'D', 'E', 'F'], dtype='object')
View all the values
print(df1.values)
[[1 Timestamp('2018-09-28 00:00:00') 1.0 3 'test' 'foo']
[1 Timestamp('2018-09-28 00:00:00') 1.0 3 'train' 'foo']
[1 Timestamp('2018-09-28 00:00:00') 1.0 3 'test' 'foo']
[1 Timestamp('2018-09-28 00:00:00') 1.0 3 'train' 'foo']]
View summary statistics (only numeric columns are included by default)
print(df1.describe())
A C D
count 4.0 4.0 4.0
mean 1.0 1.0 3.0
std 0.0 0.0 0.0
min 1.0 1.0 3.0
25% 1.0 1.0 3.0
50% 1.0 1.0 3.0
75% 1.0 1.0 3.0
max 1.0 1.0 3.0
Sort by index labels and print the result (axis=1 sorts the column labels; ascending=False gives descending order)
print(df1.sort_index(axis=1,ascending=False))
F E D C B A
0 foo test 3 1.0 2018-09-28 1
1 foo train 3 1.0 2018-09-28 1
2 foo test 3 1.0 2018-09-28 1
3 foo train 3 1.0 2018-09-28 1
Sort by the values of a column and print the result
print(df1.sort_values(by='B'))
A B C D E F
0 1 2018-09-28 1.0 3 test foo
1 1 2018-09-28 1.0 3 train foo
2 1 2018-09-28 1.0 3 test foo
3 1 2018-09-28 1.0 3 train foo
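Because every row of column B holds the same timestamp, the output above is unchanged by the sort. A sketch with distinct values may show the effect more clearly; the frame and its column names below are hypothetical.

```python
import pandas as pd

# Hypothetical data with distinct values to sort on.
scores = pd.DataFrame({"name": ["a", "b", "c"],
                       "score": [80, 100, 59]})

# sort_values returns a new, sorted frame; the original is left untouched.
ascending = scores.sort_values(by="score")
descending = scores.sort_values(by="score", ascending=False)

print(ascending)
print(descending)
```

Each row keeps its original index label after sorting, which makes it easy to trace it back to the unsorted frame.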
Selecting Data with Pandas
Simple selection
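dates = pd.date_range('20130101',periods=6)
df = pd.DataFrame(np.arange(24).reshape((6,4)),index=dates,columns=['A','B','C','D'])
print(df)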
A B C D
2013-01-01 0 1 2 3
2013-01-02 4 5 6 7
2013-01-03 8 9 10 11
2013-01-04 12 13 14 15
2013-01-05 16 17 18 19
2013-01-06 20 21 22 23
print(df['A'])
2013-01-01 0
2013-01-02 4
2013-01-03 8
2013-01-04 12
2013-01-05 16
2013-01-06 20
Freq: D, Name: A, dtype: int64
print(df.A)
2013-01-01 0
2013-01-02 4
2013-01-03 8
2013-01-04 12
2013-01-05 16
2013-01-06 20
Freq: D, Name: A, dtype: int64
print(df[0:3])
A B C D
2013-01-01 0 1 2 3
2013-01-02 4 5 6 7
2013-01-03 8 9 10 11
print(df['20130102':'20130104'])
A B C D
2013-01-02 4 5 6 7
2013-01-03 8 9 10 11
2013-01-04 12 13 14 15
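The two slices above follow different rules: an integer slice excludes its end position, while a label slice includes its end label. A minimal sketch, rebuilding the same frame so that it is self-contained:

```python
import numpy as np
import pandas as pd

dates = pd.date_range('20130101', periods=6)
df = pd.DataFrame(np.arange(24).reshape((6, 4)),
                  index=dates, columns=['A', 'B', 'C', 'D'])

# Positional slice: rows 0, 1, 2 (the end position 3 is excluded).
by_position = df[0:3]

# Label slice: 2013-01-02 through 2013-01-04 (the end label is included).
by_label = df['20130102':'20130104']

print(len(by_position), by_position.index[-1])
print(len(by_label), by_label.index[-1])
```

Both results contain three rows, but for different reasons.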
Selection by label with loc
print(df.loc['20130102'])
A 4
B 5
C 6
D 7
Name: 2013-01-02 00:00:00, dtype: int64
print(df.loc[:,['A','B']])
A B
2013-01-01 0 1
2013-01-02 4 5
2013-01-03 8 9
2013-01-04 12 13
2013-01-05 16 17
2013-01-06 20 21
print(df.loc['20130102',['A','B']])
A 4
B 5
Name: 2013-01-02 00:00:00, dtype: int64
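`loc` also accepts label slices on both axes at once, and both ends of each slice are included. The following sketch assumes the same 6x4 frame as above.

```python
import numpy as np
import pandas as pd

dates = pd.date_range('20130101', periods=6)
df = pd.DataFrame(np.arange(24).reshape((6, 4)),
                  index=dates, columns=['A', 'B', 'C', 'D'])

# Rows 2013-01-02..2013-01-03 and columns B..C, end labels included.
block = df.loc['20130102':'20130103', 'B':'C']
print(block)
```

The result is a 2x2 DataFrame rather than a Series, because both axes were selected with a slice.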
Selection by integer position with iloc
print(df.iloc[3,1])
13
print(df.iloc[3:5,1:3])
B C
2013-01-04 13 14
2013-01-05 17 18
print(df.iloc[[1,3,5],1:3])
B C
2013-01-02 5 6
2013-01-04 13 14
2013-01-06 21 22
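Labels and positions can be translated into each other, which helps when deciding between `loc` and `iloc`. This sketch again assumes the same 6x4 frame.

```python
import numpy as np
import pandas as pd

dates = pd.date_range('20130101', periods=6)
df = pd.DataFrame(np.arange(24).reshape((6, 4)),
                  index=dates, columns=['A', 'B', 'C', 'D'])

# Position -> label: index the Index objects like arrays.
row_label = df.index[3]      # Timestamp for 2013-01-04
col_label = df.columns[1]    # 'B'

# Label -> position: get_loc returns the integer position of a label.
row_pos = df.index.get_loc(pd.Timestamp('2013-01-04'))
col_pos = df.columns.get_loc('B')

by_position = df.iloc[3, 1]
by_label = df.loc[row_label, col_label]
print(by_position, by_label, row_pos, col_pos)
```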
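Mixed label and position selection
The ix indexer was deprecated in pandas 0.20 and removed in 1.0. Slicing df.index by position and passing the result to loc gives the same selection.
print(df.loc[df.index[:3],['A','C']]) # rows by position, columns by label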
A C
2013-01-01 0 2
2013-01-02 4 6
2013-01-03 8 10
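Since `ix` no longer exists in pandas 1.0 and later, a mixed selection has to be written with `loc` or `iloc`. The sketch below shows two ways to do this, assuming the same 6x4 frame; neither is claimed to be the only option.

```python
import numpy as np
import pandas as pd

dates = pd.date_range('20130101', periods=6)
df = pd.DataFrame(np.arange(24).reshape((6, 4)),
                  index=dates, columns=['A', 'B', 'C', 'D'])

# Option 1: turn row positions into labels, then use loc.
via_loc = df.loc[df.index[:3], ['A', 'C']]

# Option 2: turn column labels into positions, then use iloc.
col_positions = df.columns.get_indexer(['A', 'C'])
via_iloc = df.iloc[:3, col_positions]

print(via_loc)
print(via_iloc)
```

Both options select the first three rows of columns A and C.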
Selection by boolean condition
print(df[df.A>8])
A B C D
2013-01-04 12 13 14 15
2013-01-05 16 17 18 19
2013-01-06 20 21 22 23
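Several conditions can be combined with `&` (and) and `|` (or). Each condition needs its own parentheses, because these operators bind more tightly than comparisons. A minimal sketch on the same 6x4 frame, with thresholds chosen arbitrarily for illustration:

```python
import numpy as np
import pandas as pd

dates = pd.date_range('20130101', periods=6)
df = pd.DataFrame(np.arange(24).reshape((6, 4)),
                  index=dates, columns=['A', 'B', 'C', 'D'])

# Rows where both conditions hold.
both = df[(df.A > 8) & (df.B < 20)]

# Rows where at least one condition holds.
either = df[(df.A < 4) | (df.A > 16)]

# Rows whose A value is in a given set.
members = df[df.A.isin([0, 20])]

print(both)
print(either)
print(members)
```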
Setting Values with Pandas
dates = pd.date_range('20180901',periods=6)
df = pd.DataFrame(np.arange(24).reshape((6,4)),index=dates,columns=['A','B','C','D'])
print(df)
A B C D
2018-09-01 0 1 2 3
2018-09-02 4 5 6 7
2018-09-03 8 9 10 11
2018-09-04 12 13 14 15
2018-09-05 16 17 18 19
2018-09-06 20 21 22 23
# Set values by label (loc) or by integer position (iloc)
df.loc['20180903','B'] = 100
df.iloc[5,3] = 200
print(df)
A B C D
2018-09-01 0 1 2 3
2018-09-02 4 5 6 7
2018-09-03 8 100 10 11
2018-09-04 12 13 14 15
2018-09-05 16 17 18 19
2018-09-06 20 21 22 200
# Set values by condition (use loc; chained indexing may assign to a temporary copy)
df.loc[df.A>9,'B'] = 0
print(df)
A B C D
2018-09-01 0 1 2 3
2018-09-02 4 5 6 7
2018-09-03 8 100 10 11
2018-09-04 12 0 14 15
2018-09-05 16 0 18 19
2018-09-06 20 0 22 200
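Conditional assignment is most reliable when the row mask and the column label go into a single `loc` call. Chained forms such as `df.B[mask] = 0` may emit a SettingWithCopyWarning or leave `df` unchanged, depending on the pandas version. The sketch below starts from a fresh copy of the frame, so it does not include the earlier edits to B and D.

```python
import numpy as np
import pandas as pd

dates = pd.date_range('20180901', periods=6)
df = pd.DataFrame(np.arange(24).reshape((6, 4)),
                  index=dates, columns=['A', 'B', 'C', 'D'])

# Boolean mask over the rows.
mask = df['A'] > 9

# Row mask and column label in a single indexing operation.
df.loc[mask, 'B'] = 0

print(df)
```

Only column B changes, and only in the rows where the mask is True.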
# Set a whole row or column (here: a new column F filled with 0)
df['F'] = 0
print(df)
A B C D F
2018-09-01 0 1 2 3 0
2018-09-02 4 5 6 7 0
2018-09-03 8 100 10 11 0
2018-09-04 12 0 14 15 0
2018-09-05 16 0 18 19 0
2018-09-06 20 0 22 200 0
# Add data (a new column E from a Series aligned on the dates index)
df['E'] = pd.Series([1,2,3,4,5,6],index = dates)
print(df)
A B C D F E
2018-09-01 0 1 2 3 0 1
2018-09-02 4 5 6 7 0 2
2018-09-03 8 100 10 11 0 3
2018-09-04 12 0 14 15 0 4
2018-09-05 16 0 18 19 0 5
2018-09-06 20 0 22 200 0 6
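When a Series is assigned as a new column, its values are matched to rows by index label rather than by position. The sketch below shows what happens when the Series covers only some of the rows; the column name G and the partial Series are assumptions made for this example.

```python
import numpy as np
import pandas as pd

dates = pd.date_range('20180901', periods=6)
df = pd.DataFrame(np.arange(24).reshape((6, 4)),
                  index=dates, columns=['A', 'B', 'C', 'D'])

# A Series that covers only the first three dates.
partial = pd.Series([1, 2, 3], index=dates[:3])
df['E'] = partial            # rows without a matching label become NaN

# A plain list has no labels, so it is assigned by position
# and must match the number of rows.
df['G'] = [10, 20, 30, 40, 50, 60]

print(df)
```

Column E becomes a float column because NaN cannot be stored in an integer column.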