Copyright notice: this is the author's original article; please include a link to the original when reposting: https://blog.csdn.net/huwenxing0801/article/details/84986901
Preface
Pandas is an open-source Python library that provides high-performance data-manipulation and analysis tools built on powerful data structures. This article walks through common pandas usage with examples.
Imports
# pandas is usually used together with numpy
import pandas as pd
import numpy as np
DataFrame
A DataFrame is composed of a collection of Series.
# A Series is one column of a DataFrame
series_1 = pd.Series([1, 2, "3", "a"]) # create a Series; each element automatically gets an integer index, starting from 0
series_1.index = ["a","b","c","d"] # change the index labels
series_1.drop("a") # delete: returns a new Series without label "a" (series_1 itself is unchanged)
series_1["a"] = 4 # update
series_1["a"] # read
Inspecting a DataFrame
# a is a DataFrame
a.info() # summary: column names, non-null counts, dtypes, etc.
a.describe() # distribution statistics for the numeric columns
a.head() # first rows (5 by default; pass a number to change how many)
a.tail() # last rows
a.columns # column names
a.shape # dimensions as (rows, columns)
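Since `a` is not defined above, here is a small hypothetical table (the `name`/`score` columns are invented for the demo) on which these inspection calls can be tried:

```python
import pandas as pd
import numpy as np

# A tiny stand-in table; names and values are made up for illustration.
a = pd.DataFrame({"name": ["x", "y", "z"],
                  "score": [1.0, np.nan, 3.0]})

a.info()              # column names, non-null counts, dtypes, memory usage
print(a.describe())   # count/mean/std/min/quartiles/max of numeric columns
print(a.head(2))      # first 2 rows
print(a.tail(1))      # last row
print(a.columns)      # Index(['name', 'score'], dtype='object')
print(a.shape)        # (3, 2)
```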
Creating a column of data
s = pd.Series([1,np.nan,44])
print(s)
0 1.0
1 NaN
2 44.0
dtype: float64
Creating a DataFrame
df = pd.DataFrame(np.random.randn(6,4)) # 有索引
print(df)
0 1 2 3
0 -0.485819 1.465311 -0.874580 -0.801833
1 -1.195040 0.438705 -0.152660 -0.896882
2 0.601379 0.871732 -0.232300 -1.942046
3 -1.467846 0.985194 0.802487 1.073567
4 1.137115 1.414391 -0.194927 0.145966
5 0.403413 1.570771 1.883406 -0.559665
Building a date range: 6 days starting from 2018-12-12
dates = pd.date_range('20181212', periods=6)
print(dates)
DatetimeIndex(['2018-12-12', '2018-12-13', '2018-12-14', '2018-12-15',
'2018-12-16', '2018-12-17'],
dtype='datetime64[ns]', freq='D')
Building a table with row and column labels
# index uses the date range above as the row labels; columns supplies the column labels
df = pd.DataFrame(np.random.randn(6,4), index=dates, columns=['a','b','c','d'])
print(df)
a b c d
2018-12-12 -0.408206 0.690151 -0.255535 -0.825533
2018-12-13 -0.782994 0.846120 -2.369437 -0.563946
2018-12-14 0.592621 0.642034 0.631633 0.470060
2018-12-15 -1.716559 0.687173 -2.644728 0.084093
2018-12-16 0.010821 -0.669383 0.484277 -0.455398
2018-12-17 -0.686960 0.171372 0.501168 0.651696
Printing the table's dtypes
print(df.dtypes)
a float64
b float64
c float64
d float64
dtype: object
Printing the table's row and column labels
print(df.index) # row labels
DatetimeIndex(['2018-12-12', '2018-12-13', '2018-12-14', '2018-12-15',
'2018-12-16', '2018-12-17'],
dtype='datetime64[ns]', freq='D')
print(df.columns) # column labels
Index(['a', 'b', 'c', 'd'], dtype='object')
Sorting
print(df)
# axis=1 sorts along the second axis (column labels); axis=0 sorts along the first axis (row labels).
# ascending=False sorts in descending order; the default True sorts ascending.
print(df.sort_index(axis=1, ascending=False))
a b c d
2018-12-12 -0.806089 0.660987 -0.137833 -0.724158
2018-12-13 -0.375285 -1.071433 -1.046819 0.414112
2018-12-14 0.377175 -0.751585 0.197294 0.048427
2018-12-15 -0.872873 0.154589 -0.225713 0.713596
2018-12-16 -0.028886 1.199271 0.306876 -0.268253
2018-12-17 -1.468384 -0.105490 1.179329 -1.655588
d c b a
2018-12-12 -0.724158 -0.137833 0.660987 -0.806089
2018-12-13 0.414112 -1.046819 -1.071433 -0.375285
2018-12-14 0.048427 0.197294 -0.751585 0.377175
2018-12-15 0.713596 -0.225713 0.154589 -0.872873
2018-12-16 -0.268253 0.306876 1.199271 -0.028886
2018-12-17 -1.655588 1.179329 -0.105490 -1.468384
# sort by the values of a particular column
print(df.sort_values(by='a', ascending=True))
a b c d
2018-12-17 -1.468384 -0.105490 1.179329 -1.655588
2018-12-15 -0.872873 0.154589 -0.225713 0.713596
2018-12-12 -0.806089 0.660987 -0.137833 -0.724158
2018-12-13 -0.375285 -1.071433 -1.046819 0.414112
2018-12-16 -0.028886 1.199271 0.306876 -0.268253
2018-12-14 0.377175 -0.751585 0.197294 0.048427
Indexing 1
dates = pd.date_range('20181212',periods=6)
df = pd.DataFrame(np.arange(24).reshape([6,4]),index=dates, columns=['A','B','C','D'])
print(df)
A B C D
2018-12-12 0 1 2 3
2018-12-13 4 5 6 7
2018-12-14 8 9 10 11
2018-12-15 12 13 14 15
2018-12-16 16 17 18 19
2018-12-17 20 21 22 23
print(df['A']) # same as df.A: select a single column
2018-12-12 0
2018-12-13 4
2018-12-14 8
2018-12-15 12
2018-12-16 16
2018-12-17 20
Freq: D, Name: A, dtype: int32
# Slicing
print(df[0:3]) # same as df['20181212':'20181214']; integer positions or row labels both work
A B C D
2018-12-12 0 1 2 3
2018-12-13 4 5 6 7
2018-12-14 8 9 10 11
Indexing 2
Usage of df.loc and df.iloc:
print(df)
A B C D
2018-12-12 0 1 2 3
2018-12-13 4 5 6 7
2018-12-14 8 9 10 11
2018-12-15 12 13 14 15
2018-12-16 16 17 18 19
2018-12-17 20 21 22 23
# df.loc: select by row and column labels
print(df.loc['20181213']) # select a single row
A 4
B 5
C 6
D 7
Name: 2018-12-13 00:00:00, dtype: int32
print(df.loc[:,['A','B']]) # columns A and B of every row
A B
2018-12-12 0 1
2018-12-13 4 5
2018-12-14 8 9
2018-12-15 12 13
2018-12-16 16 17
2018-12-17 20 21
print(df.loc['20181212',['A','B']]) # columns A and B of row 20181212
A 0
B 1
Name: 2018-12-12 00:00:00, dtype: int32
# df.iloc: select by integer position (counting from 0)
print(df.iloc[3]) # the row at position 3 (the fourth row)
A 12
B 13
C 14
D 15
Name: 2018-12-15 00:00:00, dtype: int32
print(df.iloc[3,1]) # the element at row position 3, column position 1
13
print(df.iloc[3:5,1:3]) # rows 3-4, columns 1-2 (the slice end is exclusive)
B C
2018-12-15 13 14
2018-12-16 17 18
print(df[df.A > 12]) # rows where column A is greater than 12
A B C D
2018-12-16 16 17 18 19
2018-12-17 20 21 22 23
Modifying values
dates = pd.date_range('20181212',periods=6)
df = pd.DataFrame(np.arange(24).reshape([6,4]),index=dates, columns=['A','B','C','D'])
print(df)
A B C D
2018-12-12 0 1 2 3
2018-12-13 4 5 6 7
2018-12-14 8 9 10 11
2018-12-15 12 13 14 15
2018-12-16 16 17 18 19
2018-12-17 20 21 22 23
df.iloc[2,2] = 2222 # set the element at row position 2, column position 2 to 2222
print(df)
A B C D
2018-12-12 0 1 2 3
2018-12-13 4 5 6 7
2018-12-14 8 9 2222 11
2018-12-15 12 13 14 15
2018-12-16 16 17 18 19
2018-12-17 20 21 22 23
df.loc['20181214','B'] = 6666 # modify by row and column label
print(df)
A B C D
2018-12-12 0 1 2 3
2018-12-13 4 5 6 7
2018-12-14 8 6666 2222 11
2018-12-15 12 13 14 15
2018-12-16 16 17 18 19
2018-12-17 20 21 22 23
df[df.A>6] = 222 # set every value in the rows where column A > 6 to 222
print(df)
A B C D
2018-12-12 0 1 2 3
2018-12-13 4 5 6 7
2018-12-14 222 222 222 222
2018-12-15 222 222 222 222
2018-12-16 222 222 222 222
2018-12-17 222 222 222 222
df.loc[df.A>3, 'B'] = 999 # then set B to 999 wherever A > 3 (df.B[df.A>3]=999 also works, but triggers a chained-assignment warning)
print(df)
A B C D
2018-12-12 0 1 2 3
2018-12-13 4 999 6 7
2018-12-14 222 999 222 222
2018-12-15 222 999 222 222
2018-12-16 222 999 222 222
2018-12-17 222 999 222 222
df['F'] = np.nan # add a column filled with NaN
print(df)
A B C D F
2018-12-12 0 1 2 3 NaN
2018-12-13 4 999 6 7 NaN
2018-12-14 222 999 222 222 NaN
2018-12-15 222 999 222 222 NaN
2018-12-16 222 999 222 222 NaN
2018-12-17 222 999 222 222 NaN
df['E'] = [1,2,3,4,5,6] # add another column with explicit values
print(df)
A B C D F E
2018-12-12 0 1 2 3 NaN 1
2018-12-13 4 999 6 7 NaN 2
2018-12-14 222 999 222 222 NaN 3
2018-12-15 222 999 222 222 NaN 4
2018-12-16 222 999 222 222 NaN 5
2018-12-17 222 999 222 222 NaN 6
NaN
pandas represents missing data as NaN.
dates = pd.date_range('20181212',periods=3)
df = pd.DataFrame(np.arange(12).reshape([3,4]),index=dates, columns=['A','B','C','D'])
df.iloc[0,1] = np.nan
df.iloc[1,3] = np.nan
print(df)
A B C D
2018-12-12 0 NaN 2 3.0
2018-12-13 4 5.0 6 NaN
2018-12-14 8 9.0 10 11.0
# how='any' drops a row if any element is NaN; how='all' drops it only if the whole row is NaN
print(df.dropna(axis=0, how='any'))
A B C D
2018-12-14 8 9.0 10 11.0
print(df.fillna(value=666)) # replace every NaN with 666
A B C D
2018-12-12 0 666.0 2 3.0
2018-12-13 4 5.0 6 666.0
2018-12-14 8 9.0 10 11.0
print(df.isnull()) # print whether each element is NaN
A B C D
2018-12-12 False True False False
2018-12-13 False False False True
2018-12-14 False False False False
print(np.any(df.isnull()) == True) # check whether the table contains any NaN
True
Reading and saving files
# pandas can read many file types and save to just as many.
# A quick look:
# Read
data = pd.read_csv('data.txt')
# sep is an optional parameter; here tab-separated text is split into columns.
data = pd.read_csv('data.txt', sep="\t")
# Save: pass the path plus the filename; a bare filename saves next to the script.
data.to_csv("save.csv")
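A quick round trip shows the pair in action; `save.csv` here is just a scratch file for the demo:

```python
import pandas as pd

df = pd.DataFrame({"x": [1, 2], "y": [3, 4]})

# index=False keeps the row index out of the file, so the
# round trip reproduces the original table exactly.
df.to_csv("save.csv", index=False)

loaded = pd.read_csv("save.csv")
print(loaded.equals(df))   # True
```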
Merging tables 1: concat
df1 = pd.DataFrame(np.ones([3,4])*0, columns=['a','b','c','d'])
df2 = pd.DataFrame(np.ones([3,4])*1, columns=['a','b','c','d'])
df3 = pd.DataFrame(np.ones([3,4])*2, columns=['a','b','c','d'])
print(df1)
print(df2)
print(df3)
a b c d
0 0.0 0.0 0.0 0.0
1 0.0 0.0 0.0 0.0
2 0.0 0.0 0.0 0.0
a b c d
0 1.0 1.0 1.0 1.0
1 1.0 1.0 1.0 1.0
2 1.0 1.0 1.0 1.0
a b c d
0 2.0 2.0 2.0 2.0
1 2.0 2.0 2.0 2.0
2 2.0 2.0 2.0 2.0
res = pd.concat([df1,df2,df3], axis=0) # axis defaults to 0
print(res)
a b c d
0 0.0 0.0 0.0 0.0
1 0.0 0.0 0.0 0.0
2 0.0 0.0 0.0 0.0
0 1.0 1.0 1.0 1.0
1 1.0 1.0 1.0 1.0
2 1.0 1.0 1.0 1.0
0 2.0 2.0 2.0 2.0
1 2.0 2.0 2.0 2.0
2 2.0 2.0 2.0 2.0
# ignore_index=True renumbers the row index; the default is False
res_new = pd.concat([df1,df2,df3], axis=0, ignore_index=True)
print(res_new)
a b c d
0 0.0 0.0 0.0 0.0
1 0.0 0.0 0.0 0.0
2 0.0 0.0 0.0 0.0
3 1.0 1.0 1.0 1.0
4 1.0 1.0 1.0 1.0
5 1.0 1.0 1.0 1.0
6 2.0 2.0 2.0 2.0
7 2.0 2.0 2.0 2.0
8 2.0 2.0 2.0 2.0
Merging tables 2: concat with join
df1 = pd.DataFrame(np.ones([3,4])*0, columns=['a','b','c','d'],index=[1,2,3])
df2 = pd.DataFrame(np.ones([3,4])*1, columns=['d','c','b','e'],index=[2,3,4])
print(df1)
print(df2)
a b c d
1 0.0 0.0 0.0 0.0
2 0.0 0.0 0.0 0.0
3 0.0 0.0 0.0 0.0
d c b e
2 1.0 1.0 1.0 1.0
3 1.0 1.0 1.0 1.0
4 1.0 1.0 1.0 1.0
# the optional join parameter defaults to 'outer': all columns are kept, and missing values are filled with NaN
res = pd.concat([df1,df2],join='outer')
print(res)
a b c d e
1 0.0 0.0 0.0 0.0 NaN
2 0.0 0.0 0.0 0.0 NaN
3 0.0 0.0 0.0 0.0 NaN
2 NaN 1.0 1.0 1.0 1.0
3 NaN 1.0 1.0 1.0 1.0
4 NaN 1.0 1.0 1.0 1.0
# join='inner' keeps only the columns both tables have; the rest are dropped
res_new = pd.concat([df1,df2],join='inner')
print(res_new)
b c d
1 0.0 0.0 0.0
2 0.0 0.0 0.0
3 0.0 0.0 0.0
2 1.0 1.0 1.0
3 1.0 1.0 1.0
4 1.0 1.0 1.0
# recall what ignore_index=True does
res_new2 = pd.concat([df1,df2],join='inner', ignore_index=True)
print(res_new2)
b c d
0 0.0 0.0 0.0
1 0.0 0.0 0.0
2 0.0 0.0 0.0
3 1.0 1.0 1.0
4 1.0 1.0 1.0
5 1.0 1.0 1.0
Merging tables 3: merge
df1 = pd.DataFrame({'key':['K0','K1','K2','K3'],
'A':['A0','B0','C0','D0']})
df2 = pd.DataFrame({'key':['K0','K1','K2','K3'],
'B':['A1','B2','C3','D4']})
print(df1)
print(df2)
A key
0 A0 K0
1 B0 K1
2 C0 K2
3 D0 K3
B key
0 A1 K0
1 B2 K1
2 C3 K2
3 D4 K3
# merge with the key column as the reference
res = pd.merge(df1,df2,on='key')
print(res)
A key B
0 A0 K0 A1
1 B0 K1 B2
2 C0 K2 C3
3 D0 K3 D4
# with two keys
df2 = pd.DataFrame({'key1':['K0','K1','K2','K3'],
'key2':['K1','K2','K3','K4'],
'A':['A0','B0','C0','D0']})
df3 = pd.DataFrame({'key1':['K2','K3','K4','K5'],
'key2':['K3','K4','K5','K6'],
'B':['A1','B2','C3','D4']})
print(df2)
print(df3)
A key1 key2
0 A0 K0 K1
1 B0 K1 K2
2 C0 K2 K3
3 D0 K3 K4
B key1 key2
0 A1 K2 K3
1 B2 K3 K4
2 C3 K4 K5
3 D4 K5 K6
# how defaults to 'inner'; indicator=True adds a column showing which table(s) each row's key1/key2 pair appears in
res_new = pd.merge(df2, df3, on=['key1','key2'], how='outer', indicator=True)
print(res_new)
A key1 key2 B _merge
0 A0 K0 K1 NaN left_only
1 B0 K1 K2 NaN left_only
2 C0 K2 K3 A1 both
3 D0 K3 K4 B2 both
4 NaN K4 K5 C3 right_only
5 NaN K5 K6 D4 right_only
# the default how='inner' keeps only the rows whose key1/key2 appear in both tables
# indicator='indicator_column' merely renames that column
res_new1 = pd.merge(df2, df3, on=['key1','key2'],indicator='indicator_column')
print(res_new1)
A key1 key2 B indicator_column
0 C0 K2 K3 A1 both
1 D0 K3 K4 B2 both
suffixes
boys = pd.DataFrame({'k':['K0','K1','K2'],'age':[1,2,3]})
girls = pd.DataFrame({'k':['K1','K0','K0'],'age':[2,3,4]})
print(boys)
print(girls)
age k
0 1 K0
1 2 K1
2 3 K2
age k
0 2 K1
1 3 K0
2 4 K0
# suffixes=['_boy','_girl'] appends a suffix to column names the two tables share
res = pd.merge(boys, girls, on='k', suffixes=['_boy','_girl'], how='inner')
print(res)
age_boy k age_girl
0 1 K0 3
1 1 K0 4
2 2 K1 2
Resetting the index
# a is a DataFrame
new_a = a.sort_values("Age", ascending=False) # sort in descending order
a_reindexed = new_a.reset_index(drop=True) # reset the index; drop=True discards the old index
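With a small hypothetical table standing in for `a` (the Name/Age values are invented), the effect of reset_index is easy to see:

```python
import pandas as pd

# Hypothetical data standing in for `a`.
a = pd.DataFrame({"Name": ["Ann", "Bob", "Cat"], "Age": [25, 40, 31]})

new_a = a.sort_values("Age", ascending=False)   # rows reorder, but the old index labels stick
a_reindexed = new_a.reset_index(drop=True)      # index becomes 0, 1, 2 again
print(list(new_a.index))          # [1, 2, 0]
print(list(a_reindexed.index))    # [0, 1, 2]
```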
Applying a function
def null_count(column): # returns the number of missing values in a column
column_null = pd.isnull(column)
null = column[column_null]
return len(null)
column_null_count = a.apply(null_count) # apply calls the function on each column
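A runnable version with a made-up table for `a`; note that `a.isnull().sum()` is the idiomatic one-liner for the same count:

```python
import pandas as pd
import numpy as np

# Hypothetical table standing in for `a`.
a = pd.DataFrame({"A": [1, np.nan, 3], "B": [np.nan, np.nan, 6]})

def null_count(column):
    # boolean mask of the missing entries, used to select and count them
    column_null = pd.isnull(column)
    return len(column[column_null])

column_null_count = a.apply(null_count)   # one count per column
print(column_null_count)    # A: 1, B: 2
print(a.isnull().sum())     # same result
```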
Plotting basics
import matplotlib.pyplot as plt
data = pd.DataFrame(np.random.randn(1000,4),
index=np.arange(1000),
columns=list("ABCD"))
data = data.cumsum() # cumulative sum
data.plot()
plt.show()
References:
[1] https://morvanzhou.github.io/tutorials/data-manipulation/np-pd/