Copyright notice: this is the author's original article; please include a link to the original when reposting: https://blog.csdn.net/huwenxing0801/article/details/84986901
Preface
Pandas is an open-source Python library that provides high-performance data-manipulation and analysis tools built on powerful data structures. This article walks through common pandas usage with examples.
Imports
# pandas is usually used together with numpy
import pandas as pd
import numpy as np
DataFrame
A DataFrame is composed of a collection of Series.
# A Series is one column of a DataFrame
series_1 = pd.Series([1, 2, "3", "a"]) # create a Series; each element automatically gets an integer index, starting from 0
series_1.index = ["a","b","c","d"] # change the index labels
series_1.drop("a") # delete: returns a new Series without label "a" (series_1 itself is unchanged)
series_1["a"] = 4 # update
series_1["a"] # read
Inspecting a DataFrame
# a is a DataFrame
a.info() # summary: column names, non-null counts, dtypes, etc.
a.describe() # distribution statistics for the numeric columns
a.head() # first rows (5 by default; pass a number to change how many)
a.tail() # last rows
a.columns # column names
a.shape # dimensions as (rows, columns)
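Since `a` is not defined above, here is a small hypothetical table (the `name`/`score` columns are invented for the demo) on which these inspection calls can be tried:

```python
import pandas as pd
import numpy as np

# A tiny stand-in table; names and values are made up for illustration.
a = pd.DataFrame({"name": ["x", "y", "z"],
                  "score": [1.0, np.nan, 3.0]})

a.info()              # column names, non-null counts, dtypes, memory usage
print(a.describe())   # count/mean/std/min/quartiles/max of numeric columns
print(a.head(2))      # first 2 rows
print(a.tail(1))      # last row
print(a.columns)      # Index(['name', 'score'], dtype='object')
print(a.shape)        # (3, 2)
```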
Creating a column of data
s = pd.Series([1,np.nan,44])
print(s)
0 1.0
1 NaN
2 44.0
dtype: float64
Creating a DataFrame
df = pd.DataFrame(np.random.randn(6,4)) # 有索引
print(df)
0 1 2 3
0 -0.485819 1.465311 -0.874580 -0.801833
1 -1.195040 0.438705 -0.152660 -0.896882
2 0.601379 0.871732 -0.232300 -1.942046
3 -1.467846 0.985194 0.802487 1.073567
4 1.137115 1.414391 -0.194927 0.145966
5 0.403413 1.570771 1.883406 -0.559665
Building a date range: 6 days starting from 2018-12-12
dates = pd.date_range('20181212', periods=6)
print(dates)
DatetimeIndex(['2018-12-12', '2018-12-13', '2018-12-14', '2018-12-15',
'2018-12-16', '2018-12-17'],
dtype='datetime64[ns]', freq='D')
Building a table with row and column labels
# index uses the date range above as the row labels; columns supplies the column labels
df = pd.DataFrame(np.random.randn(6,4), index=dates, columns=['a','b','c','d'])
print(df)
a b c d
2018-12-12 -0.408206 0.690151 -0.255535 -0.825533
2018-12-13 -0.782994 0.846120 -2.369437 -0.563946
2018-12-14 0.592621 0.642034 0.631633 0.470060
2018-12-15 -1.716559 0.687173 -2.644728 0.084093
2018-12-16 0.010821 -0.669383 0.484277 -0.455398
2018-12-17 -0.686960 0.171372 0.501168 0.651696
Printing the table's dtypes
print(df.dtypes)
a float64
b float64
c float64
d float64
dtype: object
Printing the table's row and column labels
print(df.index) # row labels
DatetimeIndex(['2018-12-12', '2018-12-13', '2018-12-14', '2018-12-15',
'2018-12-16', '2018-12-17'],
dtype='datetime64[ns]', freq='D')
print(df.columns) # column labels
Index(['a', 'b', 'c', 'd'], dtype='object')
Sorting
print(df)
# axis=1 sorts along the second axis (column labels); axis=0 sorts along the first axis (row labels).
# ascending=False sorts in descending order; the default True sorts ascending.
print(df.sort_index(axis=1, ascending=False))
a b c d
2018-12-12 -0.806089 0.660987 -0.137833 -0.724158
2018-12-13 -0.375285 -1.071433 -1.046819 0.414112
2018-12-14 0.377175 -0.751585 0.197294 0.048427
2018-12-15 -0.872873 0.154589 -0.225713 0.713596
2018-12-16 -0.028886 1.199271 0.306876 -0.268253
2018-12-17 -1.468384 -0.105490 1.179329 -1.655588
d c b a
2018-12-12 -0.724158 -0.137833 0.660987 -0.806089
2018-12-13 0.414112 -1.046819 -1.071433 -0.375285
2018-12-14 0.048427 0.197294 -0.751585 0.377175
2018-12-15 0.713596 -0.225713 0.154589 -0.872873
2018-12-16 -0.268253 0.306876 1.199271 -0.028886
2018-12-17 -1.655588 1.179329 -0.105490 -1.468384
# sort by the values of a particular column
print(df.sort_values(by='a', ascending=True))
a b c d
2018-12-17 -1.468384 -0.105490 1.179329 -1.655588
2018-12-15 -0.872873 0.154589 -0.225713 0.713596
2018-12-12 -0.806089 0.660987 -0.137833 -0.724158
2018-12-13 -0.375285 -1.071433 -1.046819 0.414112
2018-12-16 -0.028886 1.199271 0.306876 -0.268253
2018-12-14 0.377175 -0.751585 0.197294 0.048427
Indexing 1
dates = pd.date_range('20181212',periods=6)
df = pd.DataFrame(np.arange(24).reshape([6,4]),index=dates, columns=['A','B','C','D'])
print(df)
A B C D
2018-12-12 0 1 2 3
2018-12-13 4 5 6 7
2018-12-14 8 9 10 11
2018-12-15 12 13 14 15
2018-12-16 16 17 18 19
2018-12-17 20 21 22 23
print(df['A']) # same as df.A: select a single column
2018-12-12 0
2018-12-13 4
2018-12-14 8
2018-12-15 12
2018-12-16 16
2018-12-17 20
Freq: D, Name: A, dtype: int32
# Slicing
print(df[0:3]) # same as df['20181212':'20181214']; integer positions or row labels both work
A B C D
2018-12-12 0 1 2 3
2018-12-13 4 5 6 7
2018-12-14 8 9 10 11
Indexing 2
Usage of df.loc and df.iloc:
print(df)
A B C D
2018-12-12 0 1 2 3
2018-12-13 4 5 6 7
2018-12-14 8 9 10 11
2018-12-15 12 13 14 15
2018-12-16 16 17 18 19
2018-12-17 20 21 22 23
# df.loc: select by row and column labels
print(df.loc['20181213']) # select a single row
A 4
B 5
C 6
D 7
Name: 2018-12-13 00:00:00, dtype: int32
print(df.loc[:,['A','B']]) # columns A and B of every row
A B
2018-12-12 0 1
2018-12-13 4 5
2018-12-14 8 9
2018-12-15 12 13
2018-12-16 16 17
2018-12-17 20 21
print(df.loc['20181212',['A','B']]) # columns A and B of row 20181212
A 0
B 1
Name: 2018-12-12 00:00:00, dtype: int32
# df.iloc: select by integer position (counting from 0)
print(df.iloc[3]) # the row at position 3 (the fourth row)
A 12
B 13
C 14
D 15
Name: 2018-12-15 00:00:00, dtype: int32
print(df.iloc[3,1]) # the element at row position 3, column position 1
13
print(df.iloc[3:5,1:3]) # rows 3-4, columns 1-2 (the slice end is exclusive)
B C
2018-12-15 13 14
2018-12-16 17 18
print(df[df.A > 12]) # rows where column A is greater than 12
A B C D
2018-12-16 16 17 18 19
2018-12-17 20 21 22 23
Modifying values
dates = pd.date_range('20181212',periods=6)
df = pd.DataFrame(np.arange(24).reshape([6,4]),index=dates, columns=['A','B','C','D'])
print(df)
A B C D
2018-12-12 0 1 2 3
2018-12-13 4 5 6 7
2018-12-14 8 9 10 11
2018-12-15 12 13 14 15
2018-12-16 16 17 18 19
2018-12-17 20 21 22 23
df.iloc[2,2] = 2222 # set the element at row position 2, column position 2 to 2222
print(df)
A B C D
2018-12-12 0 1 2 3
2018-12-13 4 5 6 7
2018-12-14 8 9 2222 11
2018-12-15 12 13 14 15
2018-12-16 16 17 18 19
2018-12-17 20 21 22 23
df.loc['20181214','B'] = 6666 # modify by row and column label
print(df)
A B C D
2018-12-12 0 1 2 3
2018-12-13 4 5 6 7
2018-12-14 8 6666 2222 11
2018-12-15 12 13 14 15
2018-12-16 16 17 18 19
2018-12-17 20 21 22 23
df[df.A>6] = 222 # set every value in the rows where column A > 6 to 222
print(df)
A B C D
2018-12-12 0 1 2 3
2018-12-13 4 5 6 7
2018-12-14 222 222 222 222
2018-12-15 222 222 222 222
2018-12-16 222 222 222 222
2018-12-17 222 222 222 222
df.loc[df.A>3, 'B'] = 999 # then set B to 999 wherever A > 3 (df.B[df.A>3]=999 also works, but triggers a chained-assignment warning)
print(df)
A B C D
2018-12-12 0 1 2 3
2018-12-13 4 999 6 7
2018-12-14 222 999 222 222
2018-12-15 222 999 222 222
2018-12-16 222 999 222 222
2018-12-17 222 999 222 222
df['F'] = np.nan # add a column filled with NaN
print(df)
A B C D F
2018-12-12 0 1 2 3 NaN
2018-12-13 4 999 6 7 NaN
2018-12-14 222 999 222 222 NaN
2018-12-15 222 999 222 222 NaN
2018-12-16 222 999 222 222 NaN
2018-12-17 222 999 222 222 NaN
df['E'] = [1,2,3,4,5,6] # add another column with explicit values
print(df)
A B C D F E
2018-12-12 0 1 2 3 NaN 1
2018-12-13 4 999 6 7 NaN 2
2018-12-14 222 999 222 222 NaN 3
2018-12-15 222 999 222 222 NaN 4
2018-12-16 222 999 222 222 NaN 5
2018-12-17 222 999 222 222 NaN 6
NaN
pandas represents missing data as NaN.
dates = pd.date_range('20181212',periods=3)
df = pd.DataFrame(np.arange(12).reshape([3,4]),index=dates, columns=['A','B','C','D'])
df.iloc[0,1] = np.nan
df.iloc[1,3] = np.nan
print(df)
A B C D
2018-12-12 0 NaN 2 3.0
2018-12-13 4 5.0 6 NaN
2018-12-14 8 9.0 10 11.0
# how='any' drops a row if any element is NaN; how='all' drops it only if the whole row is NaN
print(df.dropna(axis=0, how='any'))
A B C D
2018-12-14 8 9.0 10 11.0
print(df.fillna(value=666)) # replace every NaN with 666
A B C D
2018-12-12 0 666.0 2 3.0
2018-12-13 4 5.0 6 666.0
2018-12-14 8 9.0 10 11.0
print(df.isnull()) # print whether each element is NaN
A B C D
2018-12-12 False True False False
2018-12-13 False False False True
2018-12-14 False False False False
print(np.any(df.isnull()) == True) # check whether the table contains any NaN
True
Reading and saving files
# pandas can read many file types and save to just as many.
# A quick look:
# Read
data = pd.read_csv('data.txt')
# sep is an optional parameter; here tab-separated text is split into columns.
data = pd.read_csv('data.txt', sep="\t")
# Save: pass the path plus the filename; a bare filename saves next to the script.
data.to_csv("save.csv")
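A quick round trip shows the pair in action; `save.csv` here is just a scratch file for the demo:

```python
import pandas as pd

df = pd.DataFrame({"x": [1, 2], "y": [3, 4]})

# index=False keeps the row index out of the file, so the
# round trip reproduces the original table exactly.
df.to_csv("save.csv", index=False)

loaded = pd.read_csv("save.csv")
print(loaded.equals(df))   # True
```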
Merging tables 1: concat
df1 = pd.DataFrame(np.ones([3,4])*0, columns=['a','b','c','d'])
df2 = pd.DataFrame(np.ones([3,4])*1, columns=['a','b','c','d'])
df3 = pd.DataFrame(np.ones([3,4])*2, columns=['a','b','c','d'])
print(df1)
print(df2)
print(df3)
a b c d
0 0.0 0.0 0.0 0.0
1 0.0 0.0 0.0 0.0
2 0.0 0.0 0.0 0.0
a b c d
0 1.0 1.0 1.0 1.0
1 1.0 1.0 1.0 1.0
2 1.0 1.0 1.0 1.0
a b c d
0 2.0 2.0 2.0 2.0
1 2.0 2.0 2.0 2.0
2 2.0 2.0 2.0 2.0
res = pd.concat([df1,df2,df3], axis=0) # axis defaults to 0
print(res)
a b c d
0 0.0 0.0 0.0 0.0
1 0.0 0.0 0.0 0.0
2 0.0 0.0 0.0 0.0
0 1.0 1.0 1.0 1.0
1 1.0 1.0 1.0 1.0
2 1.0 1.0 1.0 1.0
0 2.0 2.0 2.0 2.0
1 2.0 2.0 2.0 2.0
2 2.0 2.0 2.0 2.0
# ignore_index=True renumbers the row index; the default is False
res_new = pd.concat([df1,df2,df3], axis=0, ignore_index=True)
print(res_new)
a b c d
0 0.0 0.0 0.0 0.0
1 0.0 0.0 0.0 0.0
2 0.0 0.0 0.0 0.0
3 1.0 1.0 1.0 1.0
4 1.0 1.0 1.0 1.0
5 1.0 1.0 1.0 1.0
6 2.0 2.0 2.0 2.0
7 2.0 2.0 2.0 2.0
8 2.0 2.0 2.0 2.0
Merging tables 2: concat with join
df1 = pd.DataFrame(np.ones([3,4])*0, columns=['a','b','c','d'],index=[1,2,3])
df2 = pd.DataFrame(np.ones([3,4])*1, columns=['d','c','b','e'],index=[2,3,4])
print(df1)
print(df2)
a b c d
1 0.0 0.0 0.0 0.0
2 0.0 0.0 0.0 0.0
3 0.0 0.0 0.0 0.0
d c b e
2 1.0 1.0 1.0 1.0
3 1.0 1.0 1.0 1.0
4 1.0 1.0 1.0 1.0
# the optional join parameter defaults to 'outer': all columns are kept, and missing values are filled with NaN
res = pd.concat([df1,df2],join='outer')
print(res)
a b c d e
1 0.0 0.0 0.0 0.0 NaN
2 0.0 0.0 0.0 0.0 NaN
3 0.0 0.0 0.0 0.0 NaN
2 NaN 1.0 1.0 1.0 1.0
3 NaN 1.0 1.0 1.0 1.0
4 NaN 1.0 1.0 1.0 1.0
# join='inner' keeps only the columns both tables have; the rest are dropped
res_new = pd.concat([df1,df2],join='inner')
print(res_new)
b c d
1 0.0 0.0 0.0
2 0.0 0.0 0.0
3 0.0 0.0 0.0
2 1.0 1.0 1.0
3 1.0 1.0 1.0
4 1.0 1.0 1.0
# recall what ignore_index=True does
res_new2 = pd.concat([df1,df2],join='inner', ignore_index=True)
print(res_new2)
b c d
0 0.0 0.0 0.0
1 0.0 0.0 0.0
2 0.0 0.0 0.0
3 1.0 1.0 1.0
4 1.0 1.0 1.0
5 1.0 1.0 1.0
Merging tables 3: merge
df1 = pd.DataFrame({'key':['K0','K1','K2','K3'],
'A':['A0','B0','C0','D0']})
df2 = pd.DataFrame({'key':['K0','K1','K2','K3'],
'B':['A1','B2','C3','D4']})
print(df1)
print(df2)
A key
0 A0 K0
1 B0 K1
2 C0 K2
3 D0 K3
B key
0 A1 K0
1 B2 K1
2 C3 K2
3 D4 K3
# merge with the key column as the reference
res = pd.merge(df1,df2,on='key')
print(res)
A key B
0 A0 K0 A1
1 B0 K1 B2
2 C0 K2 C3
3 D0 K3 D4
# with two keys
df2 = pd.DataFrame({'key1':['K0','K1','K2','K3'],
'key2':['K1','K2','K3','K4'],
'A':['A0','B0','C0','D0']})
df3 = pd.DataFrame({'key1':['K2','K3','K4','K5'],
'key2':['K3','K4','K5','K6'],
'B':['A1','B2','C3','D4']})
print(df2)
print(df3)
A key1 key2
0 A0 K0 K1
1 B0 K1 K2
2 C0 K2 K3
3 D0 K3 K4
B key1 key2
0 A1 K2 K3
1 B2 K3 K4
2 C3 K4 K5
3 D4 K5 K6
# how defaults to 'inner'; indicator=True adds a column showing which table(s) each row's key1/key2 pair appears in
res_new = pd.merge(df2, df3, on=['key1','key2'], how='outer', indicator=True)
print(res_new)
A key1 key2 B _merge
0 A0 K0 K1 NaN left_only
1 B0 K1 K2 NaN left_only
2 C0 K2 K3 A1 both
3 D0 K3 K4 B2 both
4 NaN K4 K5 C3 right_only
5 NaN K5 K6 D4 right_only
# the default how='inner' keeps only the rows whose key1/key2 appear in both tables
# indicator='indicator_column' merely renames that column
res_new1 = pd.merge(df2, df3, on=['key1','key2'],indicator='indicator_column')
print(res_new1)
A key1 key2 B indicator_column
0 C0 K2 K3 A1 both
1 D0 K3 K4 B2 both
suffixes
boys = pd.DataFrame({'k':['K0','K1','K2'],'age':[1,2,3]})
girls = pd.DataFrame({'k':['K1','K0','K0'],'age':[2,3,4]})
print(boys)
print(girls)
age k
0 1 K0
1 2 K1
2 3 K2
age k
0 2 K1
1 3 K0
2 4 K0
# suffixes=['_boy','_girl'] appends a suffix to column names the two tables share
res = pd.merge(boys, girls, on='k', suffixes=['_boy','_girl'], how='inner')
print(res)
age_boy k age_girl
0 1 K0 3
1 1 K0 4
2 2 K1 2
Resetting the index
# a is a DataFrame
new_a = a.sort_values("Age", ascending=False) # sort in descending order
a_reindexed = new_a.reset_index(drop=True) # reset the index; drop=True discards the old index
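With a small hypothetical table standing in for `a` (the Name/Age values are invented), the effect of reset_index is easy to see:

```python
import pandas as pd

# Hypothetical data standing in for `a`.
a = pd.DataFrame({"Name": ["Ann", "Bob", "Cat"], "Age": [25, 40, 31]})

new_a = a.sort_values("Age", ascending=False)   # rows reorder, but the old index labels stick
a_reindexed = new_a.reset_index(drop=True)      # index becomes 0, 1, 2 again
print(list(new_a.index))          # [1, 2, 0]
print(list(a_reindexed.index))    # [0, 1, 2]
```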
Applying a function
def null_count(column): # returns the number of missing values in a column
column_null = pd.isnull(column)
null = column[column_null]
return len(null)
column_null_count = a.apply(null_count) # apply calls the function on each column
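A runnable version with a made-up table for `a`; note that `a.isnull().sum()` is the idiomatic one-liner for the same count:

```python
import pandas as pd
import numpy as np

# Hypothetical table standing in for `a`.
a = pd.DataFrame({"A": [1, np.nan, 3], "B": [np.nan, np.nan, 6]})

def null_count(column):
    # boolean mask of the missing entries, used to select and count them
    column_null = pd.isnull(column)
    return len(column[column_null])

column_null_count = a.apply(null_count)   # one count per column
print(column_null_count)    # A: 1, B: 2
print(a.isnull().sum())     # same result
```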
Plotting basics
import matplotlib.pyplot as plt
data = pd.DataFrame(np.random.randn(1000,4),
index=np.arange(1000),
columns=list("ABCD"))
data = data.cumsum() # cumulative sum
data.plot()
plt.show()
References:
[1] https://morvanzhou.github.io/tutorials/data-manipulation/np-pd/