Data Preprocessing: Handling Missing Values

Preface

This article follows the book Python for Data Analysis and introduces several methods for handling missing values in Series and DataFrame objects.

Handling Missing Values

1. The isnull method

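The isnull method identifies which values are missing (NaN) and returns an object of the same shape filled with Boolean values.
For example: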

import pandas as pd
from numpy import nan as NA
data = pd.DataFrame([[1., 6.5, 3.], [1., NA, NA],
                     [NA, None, NA], [NA, 6.5, 3.]])
print(data)
print(data.isnull())

The outputs, in order:

# data
     0    1    2
0  1.0  6.5  3.0
1  1.0  NaN  NaN
2  NaN  NaN  NaN
3  NaN  6.5  3.0
# data.isnull()
       0      1      2
0  False  False  False
1  False   True   True
2   True   True   True
3   True  False  False

Note that Python's built-in None is also treated as NaN (see row 2, column 1, i.e. data.iloc[2, 1]).

The isnull method does not modify the original object. Printing data again after the call confirms this.

There is also a notnull method. It is used the same way as isnull and returns the opposite result.
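As a small sketch of how isnull and notnull relate, the example below applies both to a Series. The sample values are made up for illustration and are not from the book.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data: one NaN and one None among floats
s = pd.Series([1.0, np.nan, None, 4.0])

missing = s.isnull()    # True where the value is missing
present = s.notnull()   # element-wise opposite of isnull

print(missing.tolist())  # [False, True, True, False]
print(present.tolist())  # [True, False, False, True]

# Boolean indexing with notnull keeps only the non-missing values
print(s[present].tolist())  # [1.0, 4.0]

# Neither call changed the original Series: it still has four entries
print(len(s))  # 4
```

Indexing with the notnull mask is one common way to filter a Series by hand; the dropna method in the next section does the same filtering in a single call.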

2. The dropna method

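  • The dropna method filters out missing values. By default it drops every row that contains at least one missing value
  • Like isnull, this method does not modify the original object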

For example:

import pandas as pd
from numpy import nan as NA
data = pd.DataFrame([[1., 6.5, 3.], [1., NA, NA],
                     [NA, None, NA], [NA, 6.5, 3.]])
print(data)
print(data.dropna())
print(data)

The outputs, in order:

# data
     0    1    2
0  1.0  6.5  3.0
1  1.0  NaN  NaN
2  NaN  NaN  NaN
3  NaN  6.5  3.0
# data.dropna()
     0    1    2
0  1.0  6.5  3.0
# data: the original object is unchanged
     0    1    2
0  1.0  6.5  3.0
1  1.0  NaN  NaN
2  NaN  NaN  NaN
3  NaN  6.5  3.0
  • To drop columns instead of rows, pass axis=1. For example:
import pandas as pd
from numpy import nan as NA
data = pd.DataFrame([[1., 6.5, 3.], [1., NA, NA],
                     [NA, None, NA], [NA, 6.5, 3.]])
print(data.dropna(axis=1))

The output is:

# Every column contains at least one NaN, so all of them are dropped
Empty DataFrame
Columns: []
Index: [0, 1, 2, 3]
  • With how='all', only rows whose values are all NA are dropped
  • With inplace=True, the object itself is modified directly

For example:

import pandas as pd
from numpy import nan as NA
data = pd.DataFrame([[1., 6.5, 3.], [1., NA, NA],
                     [NA, None, NA], [NA, 6.5, 3.]])
print(data)
data.dropna(how='all', inplace=True)
print(data)

The outputs, in order:

# data (before): only the third row (index 2) is entirely NaN
     0    1    2
0  1.0  6.5  3.0
1  1.0  NaN  NaN
2  NaN  NaN  NaN
3  NaN  6.5  3.0
# data (after)
     0    1    2
0  1.0  6.5  3.0
1  1.0  NaN  NaN
3  NaN  6.5  3.0
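The axis and how parameters can also be combined. The sketch below first counts the missing values per column, which shows why axis=1 alone removed every column earlier, and then drops only an all-NaN column. The extra column labeled 3 is a hypothetical addition for illustration.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame([[1., 6.5, 3.], [1., np.nan, np.nan],
                     [np.nan, None, np.nan], [np.nan, 6.5, 3.]])

# Count missing values per column: every column has at least one
nan_per_column = data.isnull().sum()
print(nan_per_column.tolist())  # [2, 2, 2]

# Add a hypothetical column (label 3) that is entirely NaN
data[3] = np.nan

# how='all' with axis=1 drops only the columns that are entirely NaN
cleaned = data.dropna(axis=1, how='all')
print(cleaned.columns.tolist())  # [0, 1, 2]

# Without inplace=True the original still has four columns
print(data.columns.tolist())  # [0, 1, 2, 3]
```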

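The thresh parameter filters rows (or columns) containing NaN more selectively.
With thresh=2, for instance, any row with fewer than 2 non-NaN values is dropped, while rows with exactly 2 are kept.
For example: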

import numpy as np
import pandas as pd
from numpy import nan as NA
data = pd.DataFrame(np.random.randn(7, 3))  # 7x3 array of standard normal samples
data.iloc[:4, 1] = NA
data.iloc[:2, 2] = NA
print(data)
data.dropna(thresh=2, inplace=True)
print(data)

The outputs, in order:

# The first two rows (index 0 and 1) have only one non-NaN value (fewer than 2), so they are dropped
          0         1         2
0 -0.224037       NaN       NaN
1  0.874976       NaN       NaN
2 -1.288165       NaN -1.915158
3  0.659451       NaN  0.710363
4 -0.059080  0.637937  1.374358
5  0.035360 -1.562229  0.949080
6 -0.753340  0.222408  0.042445

          0         1         2
2 -1.288165       NaN -1.915158
3  0.659451       NaN  0.710363
4 -0.059080  0.637937  1.374358
5  0.035360 -1.562229  0.949080
6 -0.753340  0.222408  0.042445
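Because the example above uses random numbers, its values differ on every run. The following sketch repeats the idea with fixed, made-up data (the column names a, b, c are hypothetical) and checks that thresh=2 matches a hand-written Boolean mask.

```python
import numpy as np
import pandas as pd

# Hypothetical fixed data so the result is reproducible
data = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0],
    "b": [np.nan, np.nan, 5.0, 6.0],
    "c": [np.nan, 7.0, np.nan, 8.0],
})

# Number of non-NaN values in each row
non_nan = data.notnull().sum(axis=1)
print(non_nan.tolist())  # [1, 2, 2, 3]

# thresh=2 keeps rows with at least 2 non-NaN values
kept = data.dropna(thresh=2)
print(kept.index.tolist())  # [1, 2, 3]

# The same selection written with a Boolean mask
same = data[non_nan >= 2]
print(kept.equals(same))  # True
```

Only row 0 is removed, since it is the only row with fewer than 2 non-NaN values.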

3. The fillna method

The fillna method fills in missing values with replacement values

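  • A constant replaces every NaN value, e.g. fillna(0)
  • A dictionary sets a different replacement value per column, e.g. fillna({1: 0.5, 2: 0})
  • The method parameter fills NaN values forward ('ffill') or backward ('bfill'); pandas 2.1 and later deprecate it in favor of the ffill() and bfill() methods
  • The limit parameter caps how many consecutive NaN values a forward or backward fill may replace
  • The inplace parameter modifies the object directly, used the same way as in dropna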

For example:

import numpy as np
import pandas as pd
from numpy import nan as NA
data = pd.DataFrame(np.random.randn(7, 3))
data.iloc[2:, 1] = NA
data.iloc[4:, 2] = NA
print(data)
print(data.fillna(0))
print(data.fillna({1: 0.5, 2: 1}))
print(data.fillna(method='ffill', limit=2))

The results, in order:

# data
          0         1         2
0  1.458572  1.261047  0.297550
1 -0.314772  1.591051 -0.858676
2 -0.647519       NaN  0.816708
3 -1.675259       NaN -1.416578
4 -1.273126       NaN       NaN
5  0.347896       NaN       NaN
6 -1.774214       NaN       NaN
# data.fillna(0): every NaN value is replaced with 0
          0         1         2
0  1.458572  1.261047  0.297550
1 -0.314772  1.591051 -0.858676
2 -0.647519  0.000000  0.816708
3 -1.675259  0.000000 -1.416578
4 -1.273126  0.000000  0.000000
5  0.347896  0.000000  0.000000
6 -1.774214  0.000000  0.000000
# data.fillna({1: 0.5, 2: 1}): NaN values in the column labeled 1 become 0.5, and those in the column labeled 2 become 1
          0         1         2
0  1.458572  1.261047  0.297550
1 -0.314772  1.591051 -0.858676
2 -0.647519  0.500000  0.816708
3 -1.675259  0.500000 -1.416578
4 -1.273126  0.500000  1.000000
5  0.347896  0.500000  1.000000
6 -1.774214  0.500000  1.000000
# data.fillna(method='ffill', limit=2): forward fill, replacing at most 2 consecutive NaN values
          0         1         2
0  1.458572  1.261047  0.297550
1 -0.314772  1.591051 -0.858676
2 -0.647519  1.591051  0.816708
3 -1.675259  1.591051 -1.416578
4 -1.273126       NaN -1.416578
5  0.347896       NaN -1.416578
6 -1.774214       NaN       NaN
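As a reproducible sketch of directional filling, the example below uses the ffill and bfill methods on a Series with fixed, made-up values. These methods are used here in place of fillna(method=...), which recent pandas releases deprecate; the limit and inplace parameters behave as described above.

```python
import numpy as np
import pandas as pd

# Hypothetical fixed data: three consecutive NaN values
s = pd.Series([1.0, np.nan, np.nan, np.nan, 5.0])

# Forward fill: carry the last valid value forward, at most 2 steps
forward = s.ffill(limit=2)
print(forward.tolist())  # [1.0, 1.0, 1.0, nan, 5.0]

# Backward fill: carry the next valid value backward, at most 1 step
backward = s.bfill(limit=1)
print(backward.tolist())  # [1.0, nan, nan, 5.0, 5.0]

# inplace=True modifies the object it is called on
filled = s.copy()
filled.fillna(0, inplace=True)
print(filled.tolist())  # [1.0, 0.0, 0.0, 0.0, 5.0]

# The original Series still has its three NaN values
print(int(s.isnull().sum()))  # 3
```

The copy is made before the in-place call so that the original data stays available for comparison.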

Reposted from blog.csdn.net/weixin_43901558/article/details/104299026