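Preface
Following the book Python for Data Analysis, this article introduces some methods for handling missing values in Series and DataFrame objects.
Handling missing values
1. The isnull method
The isnull method identifies which values are missing (NaN) and returns an object of the same shape filled with Boolean values.
For example: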
import pandas as pd
from numpy import nan as NA
data = pd.DataFrame([[1., 6.5, 3.], [1., NA, NA],
[NA, None, NA], [NA, 6.5, 3.]])
print(data)
print(data.isnull())
The outputs, in order:
# data
0 1 2
0 1.0 6.5 3.0
1 1.0 NaN NaN
2 NaN NaN NaN
3 NaN 6.5 3.0
# data.isnull()
0 1 2
0 False False False
1 False True True
2 True True True
3 True False False
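Note that Python's built-in None is also treated as NaN (see data.iloc[2, 1], the value at row 2, column 1).
isnull does not modify the original object; printing data again confirms this.
There is also a notnull method, which is used the same way as isnull and returns the opposite result.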
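As a small sketch of how notnull relates to isnull, the snippet below reuses the same sample frame and also counts the missing values per column. The variable names not_missing and missing_per_column are illustrative only and do not come from the book.

```python
import pandas as pd
from numpy import nan as NA

data = pd.DataFrame([[1., 6.5, 3.], [1., NA, NA],
                     [NA, None, NA], [NA, 6.5, 3.]])

# notnull is the element-wise negation of isnull
not_missing = data.notnull()

# True counts as 1 when summed, so this is the number of NaN values per column
missing_per_column = data.isnull().sum()

print(not_missing)
print(missing_per_column)
```

Summing the Boolean mask is a quick way to see how many values are missing before deciding whether to drop or fill them.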
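2. The dropna method
- The dropna method filters out missing values; by default it drops every row that contains at least one missing value
- Like isnull, it does not modify the original object
For example: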
import pandas as pd
from numpy import nan as NA
data = pd.DataFrame([[1., 6.5, 3.], [1., NA, NA],
[NA, None, NA], [NA, 6.5, 3.]])
print(data)
print(data.dropna())
print(data)
The outputs, in order:
# data
0 1 2
0 1.0 6.5 3.0
1 1.0 NaN NaN
2 NaN NaN NaN
3 NaN 6.5 3.0
# data.dropna()
0 1 2
0 1.0 6.5 3.0
# data (the original object is unchanged)
0 1 2
0 1.0 6.5 3.0
1 1.0 NaN NaN
2 NaN NaN NaN
3 NaN 6.5 3.0
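The same method also works on a Series. The following is a minimal sketch with made-up values (the Series s is an assumption for illustration); it also shows that filtering with a notnull mask gives the same result.

```python
import pandas as pd
from numpy import nan as NA

s = pd.Series([1., NA, 3.5, NA, 7.])

# dropna removes the NaN entries and keeps the original index labels
dropped = s.dropna()

# Boolean indexing with notnull selects the same entries
filtered = s[s.notnull()]

print(dropped)
print(len(s))  # the original Series still has 5 entries
```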
- To drop columns instead of rows, pass axis=1. For example:
import pandas as pd
from numpy import nan as NA
data = pd.DataFrame([[1., 6.5, 3.], [1., NA, NA],
[NA, None, NA], [NA, 6.5, 3.]])
print(data.dropna(axis=1))
Output:
# Every column contains at least one NaN, so all columns are dropped
Empty DataFrame
Columns: []
Index: [0, 1, 2, 3]
- With how='all', only rows whose values are all NA are dropped
- With inplace=True, the object is modified directly
For example:
import pandas as pd
from numpy import nan as NA
data = pd.DataFrame([[1., 6.5, 3.], [1., NA, NA],
[NA, None, NA], [NA, 6.5, 3.]])
print(data)
data.dropna(how='all', inplace=True)
print(data)
The outputs, in order:
# data (before): only the third row (index 2) is entirely NaN
0 1 2
0 1.0 6.5 3.0
1 1.0 NaN NaN
2 NaN NaN NaN
3 NaN 6.5 3.0
# data (after)
0 1 2
0 1.0 6.5 3.0
1 1.0 NaN NaN
3 NaN 6.5 3.0
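The two parameters can be combined with axis=1. The sketch below is only an illustration: it adds a hypothetical all-NaN column labeled 4 to the sample frame, drops it with how='all', and shows that a call with inplace=True returns None.

```python
import pandas as pd
from numpy import nan as NA

data = pd.DataFrame([[1., 6.5, 3.], [1., NA, NA],
                     [NA, None, NA], [NA, 6.5, 3.]])
data[4] = NA  # hypothetical extra column that is entirely NaN

# Only columns whose values are all NaN are dropped
cleaned = data.dropna(axis=1, how='all')
print(cleaned.columns.tolist())

# With inplace=True the call returns None and data itself is changed
result = data.dropna(axis=1, how='all', inplace=True)
print(result)
print(data.columns.tolist())
```

Because the in-place call returns None, assigning its result back to data would lose the frame.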
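One more parameter is thresh, which sets the minimum number of non-NaN values a row (or column) must have in order to be kept.
For instance, with thresh=2, every row that has fewer than 2 non-NaN values is dropped.
For example: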
import numpy as np
import pandas as pd
from numpy import nan as NA
data = pd.DataFrame(np.random.randn(7, 3))  # 7x3 frame of standard normal random values
data.iloc[:4, 1] = NA
data.iloc[:2, 2] = NA
print(data)
data.dropna(thresh=2, inplace=True)
print(data)
The outputs, in order:
# The first two rows (index 0 and 1) have only one non-NaN value (fewer than 2), so they are dropped
0 1 2
0 -0.224037 NaN NaN
1 0.874976 NaN NaN
2 -1.288165 NaN -1.915158
3 0.659451 NaN 0.710363
4 -0.059080 0.637937 1.374358
5 0.035360 -1.562229 0.949080
6 -0.753340 0.222408 0.042445
# data after dropna(thresh=2, inplace=True)
0 1 2
2 -1.288165 NaN -1.915158
3 0.659451 NaN 0.710363
4 -0.059080 0.637937 1.374358
5 0.035360 -1.562229 0.949080
6 -0.753340 0.222408 0.042445
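Since the random data above changes on every run, here is a deterministic sketch of the same idea. The small frame and the names non_missing, kept and same are assumptions for illustration. It compares thresh=2 with an explicit mask built from the per-row count of non-NaN values.

```python
import pandas as pd
from numpy import nan as NA

data = pd.DataFrame([[1., NA, NA],
                     [2., NA, 5.],
                     [3., 4., 6.]])

# Number of non-NaN values in each row
non_missing = data.count(axis=1)

# thresh=2 keeps the rows that have at least 2 non-NaN values
kept = data.dropna(thresh=2)

# The same selection written as an explicit Boolean mask
same = data[non_missing >= 2]

print(non_missing)
print(kept)
```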
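3. The fillna method
The fillna method fills in or replaces missing values
- Pass a constant to replace every NaN, e.g. fillna(0)
- Pass a dict to use a different fill value for each column, e.g. fillna({1: 0.5, 2: 0})
- Pass the method parameter ('ffill' or 'bfill') to fill NaN values forward or backward; note that pandas 2.1 deprecated this parameter in favor of the ffill and bfill methods
- Pass the limit parameter to cap the number of consecutive NaN values filled when filling forward or backward
- Pass inplace=True to modify the object directly, as with dropna
For example: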
import numpy as np
import pandas as pd
from numpy import nan as NA
data = pd.DataFrame(np.random.randn(7, 3))
data.iloc[2:, 1] = NA
data.iloc[4:, 2] = NA
print(data)
print(data.fillna(0))
print(data.fillna({1: 0.5, 2: 1}))
print(data.fillna(method='ffill', limit=2))  # on pandas 2.1 or later, prefer data.ffill(limit=2)
The results, in order:
# data
0 1 2
0 1.458572 1.261047 0.297550
1 -0.314772 1.591051 -0.858676
2 -0.647519 NaN 0.816708
3 -1.675259 NaN -1.416578
4 -1.273126 NaN NaN
5 0.347896 NaN NaN
6 -1.774214 NaN NaN
# data.fillna(0): every NaN is replaced with 0
0 1 2
0 1.458572 1.261047 0.297550
1 -0.314772 1.591051 -0.858676
2 -0.647519 0.000000 0.816708
3 -1.675259 0.000000 -1.416578
4 -1.273126 0.000000 0.000000
5 0.347896 0.000000 0.000000
6 -1.774214 0.000000 0.000000
# data.fillna({1: 0.5, 2: 1}): NaN in the column labeled 1 becomes 0.5, NaN in the column labeled 2 becomes 1
0 1 2
0 1.458572 1.261047 0.297550
1 -0.314772 1.591051 -0.858676
2 -0.647519 0.500000 0.816708
3 -1.675259 0.500000 -1.416578
4 -1.273126 0.500000 1.000000
5 0.347896 0.500000 1.000000
6 -1.774214 0.500000 1.000000
# data.fillna(method='ffill', limit=2): forward fill, filling at most 2 consecutive NaN values
0 1 2
0 1.458572 1.261047 0.297550
1 -0.314772 1.591051 -0.858676
2 -0.647519 1.591051 0.816708
3 -1.675259 1.591051 -1.416578
4 -1.273126 NaN -1.416578
5 0.347896 NaN -1.416578
6 -1.774214 NaN NaN
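Backward filling is not shown in the outputs above. The following is a minimal sketch using the dedicated ffill and bfill methods, which recent pandas releases recommend over the method parameter. The frame and its column names 'a' and 'b' are made up for illustration.

```python
import pandas as pd
from numpy import nan as NA

data = pd.DataFrame({'a': [1., NA, NA, 4.],
                     'b': [NA, 2., NA, NA]})

# Forward fill: copy the last valid value downward, at most 1 step per gap
forward = data.ffill(limit=1)

# Backward fill: copy the next valid value upward, with no limit
backward = data.bfill()

print(forward)
print(backward)
```

A NaN with no valid value before it (forward fill) or after it (backward fill) stays NaN, as in column 'b' here.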