Common Pandas Usage

1. Introduction

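Pandas is a data analysis library built on NumPy. It provides data structures that make large datasets convenient to work with.
Its two core data types are Series (one-dimensional) and DataFrame (two-dimensional).
Both expose the attributes values (the underlying data) and index (row labels); a DataFrame additionally has columns (column labels).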

2. Series

2.1. Creating a Series

2.1.1. Direct creation

import numpy as np
import pandas as pd

data = pd.Series(['a', 'b', 'c'])
print(data)
print(data.values)
print(data.index)

In the output below, 0, 1 and 2 in the left column are the index labels.

0    a
1    b
2    c
dtype: object
['a' 'b' 'c']
RangeIndex(start=0, stop=3, step=1)

2.1.2. Specifying the index

data = pd.Series(['a', 'b', 'c'], index=['A', 'B', 'C'])
print(data)
data = pd.Series(['a', 'b', 'c'], index=list('一二三'))
print(data)

A    a
B    b
C    c
dtype: object
一    a
二    b
三    c
dtype: object

2.1.3. From a dictionary

# By default the dictionary keys are used as the index
phone_price_dict = {'华为': 5000, '小米': 4000, 'OPPO': 3000}
phone_price_series = pd.Series(phone_price_dict)
print(phone_price_series)
# When index is given, any label that is not a key of the dictionary gets NaN
phone_price_series = pd.Series(phone_price_dict, index=['华为', 'xiaomi', 'oppo'])
print(phone_price_series)

华为      5000
小米      4000
OPPO    3000
dtype: int64
华为        5000.0
xiaomi       NaN
oppo         NaN
dtype: float64
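
The change from int64 to float64 in the second output follows from how missing labels are handled. The sketch below reuses the same dictionary and illustrates two points: label matching is exact (so 'oppo' does not match 'OPPO'), and NaN is a float, which forces an upcast.

```python
import pandas as pd

phone_price_dict = {'华为': 5000, '小米': 4000, 'OPPO': 3000}

prices = pd.Series(phone_price_dict)
selected = pd.Series(phone_price_dict, index=['华为', 'xiaomi', 'oppo'])

# Labels are matched exactly (case-sensitive), so only '华为' is found.
print(selected.isna().tolist())      # [False, True, True]

# NaN is a floating-point value, so the integer data is upcast to float64.
print(prices.dtype, selected.dtype)  # int64 float64
```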

2.1.4. From a scalar

data = pd.Series(10, index=[1, 2, 3])
print(data)

1    10
2    10
3    10
dtype: int64

2.2. Selecting values

2.2.1. Basic indexing

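data = pd.Series([1, 2, 3, 4, 5], index=list('ABCDE'))
# Integer lookup by position on a non-integer index is deprecated in recent
# pandas versions; data.iloc[0] is the explicit form.
print(data[0])  # lookup by position
print(data[1:3])  # slice by position (end excluded)
print(data['E'])  # lookup by label
print(data['D':'E'])  # slice by label (both ends included)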

1
B    2
C    3
dtype: int64
5
D    4
E    5
dtype: int64
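
The two slices above differ in how they treat the end point. As a minimal sketch using the same Series, with iloc and loc making the intent explicit:

```python
import pandas as pd

data = pd.Series([1, 2, 3, 4, 5], index=list('ABCDE'))

by_position = data.iloc[1:3]  # positions 1 and 2; position 3 is excluded
by_label = data.loc['B':'D']  # labels B, C and D; the end label is included

print(by_position.tolist())  # [2, 3]
print(by_label.tolist())     # [2, 3, 4]
```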

2.2.2. loc and iloc

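A position and a label can be the same number (for example, both 2), which makes data[2] ambiguous. Use iloc (position-based) and loc (label-based) to make the intent explicit.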

data = pd.Series([1, 2, 3, 4, 5], index=[1, 2, 3, 4, 5])
print(data[2])  # with an integer index, [] looks up by label
print(data.loc[2])  # label-based lookup
print(data.iloc[2])  # position-based lookup
print(data.iloc[0])
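
No output is shown for the snippet above. The following sketch uses the same integer index but different values (10 to 50, chosen here only so that labels and values cannot be confused) to show which element each form returns.

```python
import pandas as pd

# Values differ from the labels on purpose (illustrative data).
data = pd.Series([10, 20, 30, 40, 50], index=[1, 2, 3, 4, 5])

by_label = data.loc[2]      # the element whose label is 2
by_position = data.iloc[2]  # the third element (position 2)
plain = data[2]             # integer index: [] is label-based

print(by_label, by_position, plain)  # 20 30 20
```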

3. DataFrame

A DataFrame can be thought of as several Series that share one index, placed side by side as columns.
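
As a small sketch of this idea: when Series are combined into a DataFrame they are aligned on their index labels, and each column read back is again a Series. The two Series here deliberately have different labels (illustrative data) to make the alignment visible.

```python
import pandas as pd

price = pd.Series({'华为': 5000, '小米': 4000})
sales = pd.Series({'小米': 400, 'OPPO': 300})

# Rows are the union of both indexes; missing combinations become NaN.
phone = pd.DataFrame({'价格': price, '销量': sales})
print(phone)

print(type(phone['价格']).__name__)     # Series
print(int(phone.isna().sum().sum()))    # 2
```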

3.1. Creating a DataFrame

3.1.1. Direct creation

phone_price_dict = {'华为': 5000, '小米': 4000, 'OPPO': 3000}
phone_sales_dict = {'华为': 500, '小米': 400, 'OPPO': 300}
phone_price_series = pd.Series(phone_price_dict)
phone_sales_series = pd.Series(phone_sales_dict)
phone = pd.DataFrame({'销量': phone_sales_series, '价格': phone_price_series})
print(phone)
print(phone.values)
print(phone.index)
print(phone.columns)

       销量    价格
华为    500  5000
小米    400  4000
OPPO  300  3000
[[ 500 5000]
 [ 400 4000]
 [ 300 3000]]
Index(['华为', '小米', 'OPPO'], dtype='object')
Index(['销量', '价格'], dtype='object')

3.1.2. From a list of dictionaries

phone_price_dict = {'华为': 5000, '小米': 4000, 'OPPO': 3000}
phone_sales_dict = {'华为': 500, '小米': 400, 'OPPO': 300}
phone = pd.DataFrame([phone_price_dict, phone_sales_dict])
print(phone)

     华为    小米  OPPO
0  5000  4000  3000
1   500   400   300

3.1.3. From a list comprehension

data = pd.DataFrame([{'x': i, 'y': i + 10} for i in range(5)])
print(data)

3.1.4. From random data

data = pd.DataFrame(np.random.randint(0, 10, (3, 4)), index=list('一二三'), columns=list('1234'))
print(data)

   1  2  3  4
一  1  0  0  6
二  9  0  2  2
三  8  5  5  3
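
Because the values above are random, each run prints different numbers. One way to make such a frame reproducible is to seed the generator. The sketch below uses NumPy's Generator API (np.random.default_rng) rather than the np.random.randint call above; random_frame and the seed 42 are hypothetical choices for illustration.

```python
import numpy as np
import pandas as pd


def random_frame(seed):
    """Build a 3x4 frame of integers in [0, 10) from a seeded generator."""
    rng = np.random.default_rng(seed)
    values = rng.integers(0, 10, size=(3, 4))
    return pd.DataFrame(values, index=list('一二三'), columns=list('1234'))


first = random_frame(42)
second = random_frame(42)

# The same seed yields the same data on every call.
print(first.equals(second))  # True
```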

3.1.5. Specifying the index

phone_price_dict = {'华为': 5000, '小米': 4000, 'OPPO': 3000}
phone_sales_dict = {'华为': 500, '小米': 400, 'OPPO': 300}
phone = pd.DataFrame([phone_price_dict, phone_sales_dict], index=['价格', '销量'])
print(phone)

      华为    小米  OPPO
价格  5000  4000  3000
销量   500   400   300

3.1.6. Selecting columns at creation

phone_price_dict = {'华为': 5000, '小米': 4000, 'OPPO': 3000}
phone_sales_dict = {'华为': 500, '小米': 400, 'OPPO': 300}
phone = pd.DataFrame([phone_price_dict, phone_sales_dict], columns=['小米'])
print(phone)

     小米
0  4000
1   400
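
The columns argument behaves like the index argument of a Series built from a dictionary: it selects keys that exist and fills the rest with NaN. A minimal sketch, where 'vivo' is a hypothetical column name that is absent from both dictionaries:

```python
import pandas as pd

phone_price_dict = {'华为': 5000, '小米': 4000, 'OPPO': 3000}
phone_sales_dict = {'华为': 500, '小米': 400, 'OPPO': 300}

phone = pd.DataFrame(
    [phone_price_dict, phone_sales_dict],
    index=['价格', '销量'],
    columns=['小米', 'vivo'],  # 'vivo' is not a key in either dictionary
)
print(phone)

# The requested but missing column is filled with NaN.
print(bool(phone['vivo'].isna().all()))  # True
```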

3.2. Selecting values

3.2.1. Basic indexing

np.random.seed(1)
data = pd.DataFrame(np.random.randint(0, 10, (3, 4)), index=list('一二三'), columns=list('abcd'))
print(data)
print(data['b'])  # select one column (returns a Series)
print(data[['a', 'b']])  # select several columns (returns a DataFrame)

   a  b  c  d
一  5  8  9  5
二  0  0  1  7
三  6  9  2  4
一    8
二    0
三    9
Name: b, dtype: int32
   a  b
一  5  8
二  0  0
三  6  9

3.2.2. loc and iloc

np.random.seed(1)
data = pd.DataFrame(np.random.randint(0, 10, (3, 4)), index=list('一二三'), columns=list('abcd'))
print(data)
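print(data.iloc[:, 0])  # one column by position
print(data.iloc[:, [0, 2]])  # several columns by position
print(data.iloc[:, 0:3])  # a contiguous range of columns by position (end excluded)
print(data.loc[:, 'b':'d'])  # a contiguous range of columns by label (end included)
print(data.loc['一'])  # one row by label
print(data.loc[['一', '三']])  # several rows by label
print(data.loc['一':'二', 'a':'b'])  # rows and columns together, by label
print(data.iloc[0:2, 0:2])  # rows and columns together, by position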

   a  b  c  d
一  5  8  9  5
二  0  0  1  7
三  6  9  2  4
一    5
二    0
三    6
Name: a, dtype: int32
   a  c
一  5  9
二  0  1
三  6  2
   a  b  c
一  5  8  9
二  0  0  1
三  6  9  2
   b  c  d
一  8  9  5
二  0  1  7
三  9  2  4
a    5
b    8
c    9
d    5
Name: 一, dtype: int32
   a  b  c  d
一  5  8  9  5
三  6  9  2  4
   a  b
一  5  8
二  0  0
   a  b
一  5  8
二  0  0

4. Data processing

4.1. Common functions

# 'Name' is not listed in columns, so it is left out of the resulting frame
data = pd.DataFrame({
    'Name': ['tom', 'jim', 'jack'],
    'ID': ['001', '002', '003'],
    'Sex': ['M', 'W', 'M'],
    'Age': [10, 20, 30]
}, columns=['ID', 'Sex', 'Age'], index=['tom', 'jim', 'jack'])
print(data)
print(data['Age'].mean())  # mean
print(data['Age'].max())  # maximum
print(data['Age'].min())  # minimum
print(data['Age'].std())  # sample standard deviation
print(data['Age'].sort_values(ascending=False))  # sort in descending order

       ID Sex  Age
tom   001   M   10
jim   002   W   20
jack  003   M   30
20.0
30
10
10.0
jack    30
jim     20
tom     10
Name: Age, dtype: int64
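
The four statistics above can also be computed in one call. The sketch below uses Series.agg with a list of function names on the same Age values; the result is a Series indexed by those names.

```python
import pandas as pd

data = pd.DataFrame({'Age': [10, 20, 30]}, index=['tom', 'jim', 'jack'])

# One call returns all requested statistics, indexed by function name.
summary = data['Age'].agg(['mean', 'max', 'min', 'std'])
print(summary)
```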

4.2. Conditional filtering

...
ages = data['Age']
print(data[ages > ages.mean()])  # rows whose age is above the mean
print(data[ages > ages.mean()].loc[:, ['ID']])  # the ID column of those rows

       ID Sex  Age
tom   001   M   10
jim   002   W   20
jack  003   M   30
       ID Sex  Age
jack  003   M   30
       ID
jack  003
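
Several conditions can be combined with & (and) and | (or), with each condition wrapped in parentheses. A minimal sketch on the same data, which also passes the mask and the column to loc in a single step:

```python
import pandas as pd

data = pd.DataFrame(
    {'ID': ['001', '002', '003'], 'Sex': ['M', 'W', 'M'], 'Age': [10, 20, 30]},
    index=['tom', 'jim', 'jack'],
)

# Parentheses are required because & binds tighter than == and >=.
mask = (data['Sex'] == 'M') & (data['Age'] >= 20)

# Filter rows and pick a column in one loc call.
ids = data.loc[mask, 'ID']
print(ids.tolist())  # ['003']
```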

4.3. Handling missing values

data = pd.DataFrame([[1, 2, np.nan], [4, np.nan, 6], [7, 8, 9]])
print(data)
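print(data.info())  # non-null counts per column (info() prints directly and returns None)
print(data.isnull())  # True where a value is missing, False otherwise
print(data.notnull())  # the inverse of isnull()
print(data.dropna())  # drop rows that contain missing values
print(data.dropna(axis='columns'))  # drop columns that contain missing values
print(data.fillna(0))  # fill missing values with 0
# ffill()/bfill() replace fillna(method=...), which is deprecated in recent pandas
print(data.ffill())  # fill with the previous row's value in the same column
print(data.bfill(axis=1))  # fill with the next column's value in the same row
for i in data.columns:  # fill with each column's mean
    data[i] = data[i].fillna(np.nanmean(data[i]))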
print(data)

   0    1    2
0  1  2.0  NaN
1  4  NaN  6.0
2  7  8.0  9.0
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3 entries, 0 to 2
Data columns (total 3 columns):
0    3 non-null int64
1    2 non-null float64
2    2 non-null float64
dtypes: float64(2), int64(1)
memory usage: 200.0 bytes
None
       0      1      2
0  False  False   True
1  False   True  False
2  False  False  False
      0      1      2
0  True   True  False
1  True  False   True
2  True   True   True
   0    1    2
2  7  8.0  9.0
   0
0  1
1  4
2  7
   0    1    2
0  1  2.0  0.0
1  4  0.0  6.0
2  7  8.0  9.0
   0    1    2
0  1  2.0  NaN
1  4  2.0  6.0
2  7  8.0  9.0
     0    1    2
0  1.0  2.0  NaN
1  4.0  6.0  6.0
2  7.0  8.0  9.0
   0    1    2
0  1  2.0  7.5
1  4  5.0  6.0
2  7  8.0  9.0
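
The per-column loop above can be written without a loop. As a sketch: DataFrame.mean() skips NaN by default and returns one mean per column, and fillna accepts that Series and applies each mean to the matching column.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame([[1, 2, np.nan], [4, np.nan, 6], [7, 8, 9]])

# data.mean() is indexed by column label; fillna matches on those labels.
filled = data.fillna(data.mean())
print(filled)
```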

4.4. Concatenating and merging

4.4.1. concat

s1 = pd.Series([1, 2, 3], index=list('abc'))
s2 = pd.Series([4, 5, 6], index=list('efg'))
s3 = pd.concat([s1, s2])
print(s3)

a    1
b    2
c    3
e    4
f    5
g    6
dtype: int64
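
Two concat options are often useful with the same inputs. The sketch below shows ignore_index=True, which discards the original labels and renumbers from 0, and axis=1, which places the Series side by side as columns and aligns them on the index.

```python
import pandas as pd

s1 = pd.Series([1, 2, 3], index=list('abc'))
s2 = pd.Series([4, 5, 6], index=list('efg'))

# Stack vertically and replace the labels with 0..5.
renumbered = pd.concat([s1, s2], ignore_index=True)
print(renumbered.index.tolist())  # [0, 1, 2, 3, 4, 5]

# Place side by side; the indexes do not overlap, so each row has one NaN.
side_by_side = pd.concat([s1, s2], axis=1)
print(side_by_side.shape)  # (6, 2)
```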

4.4.2. merge

# Four common join types: how='left', 'right', 'outer', 'inner'; the default is how='inner'
left = pd.DataFrame({
    'key': ['k1', 'k2', 'k3', 'k4'],
    'a': ['a1', 'a2', 'a3', 'a4'],
    'b': ['b1', 'b2', 'b3', 'b3']
})
right = pd.DataFrame({
    'key': ['k1', 'k2', 'k3', 'k5'],
    'c': ['c1', 'c2', 'c3', 'c5'],
    'd': ['d1', 'd2', 'd3', 'd4']
})
print(pd.merge(left, right))
print(pd.merge(left, right, how='outer'))
print(pd.merge(left, right, how='left'))
print(pd.merge(left, right, how='right'))

  key   a   b   c   d
0  k1  a1  b1  c1  d1
1  k2  a2  b2  c2  d2
2  k3  a3  b3  c3  d3
  key    a    b    c    d
0  k1   a1   b1   c1   d1
1  k2   a2   b2   c2   d2
2  k3   a3   b3   c3   d3
3  k4   a4   b3  NaN  NaN
4  k5  NaN  NaN   c5   d4
  key   a   b    c    d
0  k1  a1  b1   c1   d1
1  k2  a2  b2   c2   d2
2  k3  a3  b3   c3   d3
3  k4  a4  b3  NaN  NaN
  key    a    b   c   d
0  k1   a1   b1  c1  d1
1  k2   a2   b2  c2  d2
2  k3   a3   b3  c3  d3
3  k5  NaN  NaN  c5  d4
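
When reading an outer join it can help to see where each row came from. The sketch below uses smaller illustrative frames, names the join column explicitly with on='key', and sets indicator=True, which adds a _merge column with the values both, left_only or right_only.

```python
import pandas as pd

left = pd.DataFrame({'key': ['k1', 'k2', 'k4'], 'a': ['a1', 'a2', 'a4']})
right = pd.DataFrame({'key': ['k1', 'k2', 'k5'], 'c': ['c1', 'c2', 'c5']})

merged = pd.merge(left, right, on='key', how='outer', indicator=True)
print(merged[['key', '_merge']])

# Map each key to the side(s) it was found on.
origin = dict(zip(merged['key'], merged['_merge'].astype(str)))
print(origin)
```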
