-
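Filtering missing values: dropna

The examples in this post share the following imports:

    import numpy as np
    import pandas as pd
    from numpy import nan as NA
    from pandas import Series, DataFrame

1. On a Series, dropna drops the entries whose value is missing.

    data = Series(['a', 'b', NA, 'd'])
    print(data.dropna())

    0    a
    1    b
    3    d
    dtype: object

2. On a DataFrame, dropna by default drops every row that contains at least one missing value.

    data2 = DataFrame([[1, 2, 3], [4, NA, 6], [7, 8, 9], [NA, NA, NA]])
    print(data2)

         0    1    2
    0  1.0  2.0  3.0
    1  4.0  NaN  6.0
    2  7.0  8.0  9.0
    3  NaN  NaN  NaN

    print(data2.dropna())

         0    1    2
    0  1.0  2.0  3.0
    2  7.0  8.0  9.0

3. With how='all', only the rows in which every value is NA are dropped.

    print(data2.dropna(how='all'))

         0    1    2
    0  1.0  2.0  3.0
    1  4.0  NaN  6.0
    2  7.0  8.0  9.0

4. With axis=1, dropna works on columns instead of rows. Every column of data2 contains at least one NA (row 3 is entirely NA), so the default call leaves no columns at all.

    print(data2.dropna(axis=1))

    Empty DataFrame
    Columns: []
    Index: [0, 1, 2, 3]

   Combined with how='all', only columns that are entirely NA are dropped. No column qualifies here, so the frame is unchanged.

    print(data2.dropna(how='all', axis=1))

         0    1    2
    0  1.0  2.0  3.0
    1  4.0  NaN  6.0
    2  7.0  8.0  9.0
    3  NaN  NaN  NaN

5. With thresh=n, dropna keeps only the rows (or columns) that have at least n non-NA values. thresh is short for threshold.

    df = DataFrame(np.random.randn(7, 4))
    df.iloc[:4, :2] = NA
    df.iloc[:2, 2] = NA
    print(df)

              0         1         2         3
    0       NaN       NaN       NaN  0.552708
    1       NaN       NaN       NaN -0.032440
    2       NaN       NaN -0.451361 -1.666976
    3       NaN       NaN  0.289092 -0.750890
    4 -0.508130 -1.409111  0.133071 -0.033718
    5  1.253351 -0.399418 -1.522084  0.264536
    6  0.794228 -0.781878 -0.872452  0.933511

    print(df.dropna(thresh=3))

              0         1         2         3
    4 -0.508130 -1.409111  0.133071 -0.033718
    5  1.253351 -0.399418 -1.522084  0.264536
    6  0.794228 -0.781878 -0.872452  0.933511

   The thresh=2 output below comes from a separate run, so its random values differ from the frame printed above. Rows 2 and 3 are now kept because each has two non-NA values.

    print(df.dropna(thresh=2))

              0         1         2         3
    2       NaN       NaN -0.102649 -1.341281
    3       NaN       NaN  1.699763  0.445703
    4 -0.248189  1.021283 -1.104852 -1.672537
    5  0.519167  1.364827 -0.119368 -0.688406
    6 -1.350202 -1.876677 -0.250996 -0.405626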
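Because the thresh examples rely on random data, the rule can be easier to check on a fixed frame. The sketch below uses a small hand-written DataFrame (an illustration, not data from the post) and compares dropna(thresh=2) with an explicit count of non-NA values per row.

```python
import numpy as np
import pandas as pd

# Small hand-written frame (not from the original post) so the result is deterministic.
df = pd.DataFrame(
    [
        [np.nan, np.nan, np.nan, 1.0],  # 1 non-NA value
        [np.nan, np.nan, 2.0, 3.0],     # 2 non-NA values
        [4.0, 5.0, 6.0, 7.0],           # 4 non-NA values
    ]
)

# Count the non-missing values in each row.
non_na_per_row = df.notna().sum(axis=1)

# thresh=2 keeps rows with at least 2 non-NA values ...
kept_by_thresh = df.dropna(thresh=2)
# ... which should match filtering on the count directly.
kept_by_mask = df[non_na_per_row >= 2]

print(kept_by_thresh)
```

Only the first row, with a single non-NA value, falls below the threshold.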
-
Filling in missing values: fillna

1. Replace every NA with a constant.

    df = DataFrame(np.random.randn(7, 4))
    df.iloc[:4, :2] = NA
    df.iloc[:2, 2] = NA
    print(df)

              0         1         2         3
    0       NaN       NaN       NaN -1.342383
    1       NaN       NaN       NaN  0.457186
    2       NaN       NaN  0.688483 -0.793584
    3       NaN       NaN -1.441301  0.850932
    4  0.517123  1.534299  0.587808 -0.074566
    5 -0.630781  1.533494 -0.682704 -2.778620
    6  0.735230 -0.268278  1.510993 -1.083807

    print(df.fillna(0))

              0         1         2         3
    0  0.000000  0.000000  0.000000 -1.342383
    1  0.000000  0.000000  0.000000  0.457186
    2  0.000000  0.000000  0.688483 -0.793584
    3  0.000000  0.000000 -1.441301  0.850932
    4  0.517123  1.534299  0.587808 -0.074566
    5 -0.630781  1.533494 -0.682704 -2.778620
    6  0.735230 -0.268278  1.510993 -1.083807

2. Pass a dict to use a different fill value for each column. The keys are column labels. The output below comes from a separate run, so the non-NA values differ from the frame above.

    print(df.fillna({0: 1, 1: 2, 2: 3}))

              0         1         2         3
    0  1.000000  2.000000  3.000000  1.790328
    1  1.000000  2.000000  3.000000 -0.080825
    2  1.000000  2.000000 -0.305895 -0.956160
    3  1.000000  2.000000 -0.912206  1.343991
    4  0.634614 -1.312406  1.119154 -1.272266
    5 -0.959586 -1.988487  0.638590 -1.002639
    6 -1.338498  0.657485  0.667352  0.032378
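A deterministic version of the two fillna forms may help, since the outputs above come from random data. The frame and the column names a, b, c below are made up for illustration. The sketch also shows that fillna returns a new object and leaves the original untouched.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with known gaps (column names are illustrative).
df = pd.DataFrame(
    {
        "a": [1.0, np.nan, 3.0],
        "b": [np.nan, 4.0, np.nan],
        "c": [7.0, 8.0, 9.0],
    }
)

# One constant for every missing value.
filled_const = df.fillna(0)

# A separate fill value per column; columns not listed (here "c") are left alone.
filled_by_col = df.fillna({"a": -1, "b": -2})

# The original frame still contains its three NaN values.
remaining_na = int(df.isna().sum().sum())

print(filled_by_col)
```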
-
Removing duplicates: duplicated, drop_duplicates

1. duplicated returns a boolean Series marking each row that repeats an earlier row.

    data = DataFrame({'k1': ['one', 'two'] * 3 + ['two'],
                      'k2': [1, 1, 2, 3, 3, 4, 4]})
    print(data)

        k1  k2
    0  one   1
    1  two   1
    2  one   2
    3  two   3
    4  one   3
    5  two   4
    6  two   4

    print(data.duplicated())

    0    False
    1    False
    2    False
    3    False
    4    False
    5    False
    6     True
    dtype: bool

2. drop_duplicates returns a DataFrame with the duplicate rows removed.

    print(data.drop_duplicates())

        k1  k2
    0  one   1
    1  two   1
    2  one   2
    3  two   3
    4  one   3
    5  two   4

3. To deduplicate on specific columns only, pass the column names.

    data['k3'] = range(7)
    print(data)

        k1  k2  k3
    0  one   1   0
    1  two   1   1
    2  one   2   2
    3  two   3   3
    4  one   3   4
    5  two   4   5
    6  two   4   6

    print(data.drop_duplicates(['k1']))

        k1  k2  k3
    0  one   1   0
    1  two   1   1

    print(data.drop_duplicates(['k2']))

        k1  k2  k3
    0  one   1   0
    2  one   2   2
    3  two   3   3
    5  two   4   5

4. With keep='last', the last occurrence of each duplicated value is kept (the default keeps the first occurrence).

    print(data.drop_duplicates(['k1'], keep='last'))

        k1  k2  k3
    4  one   3   4
    6  two   4   6
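The two methods are closely related: dropping duplicates should give the same rows as selecting those where duplicated is False. A minimal sketch using the same k1/k2 data as above:

```python
import pandas as pd

data = pd.DataFrame(
    {
        "k1": ["one", "two"] * 3 + ["two"],
        "k2": [1, 1, 2, 3, 3, 4, 4],
    }
)

# True for every row that repeats an earlier row in full.
mask = data.duplicated()

# Two routes to the same result.
deduped = data.drop_duplicates()
deduped_via_mask = data[~mask]

# Deduplicate on k1 only, keeping the last occurrence of each value.
last_per_k1 = data.drop_duplicates(subset=["k1"], keep="last")

print(last_per_k1)
```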
-
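Transforming data with a function or mapping: map

The map method of a Series (for example, one column of a DataFrame) accepts either a function or a dict-like object that holds the mapping. In the example below, 'nan' and 'nv' are pinyin for male and female; 'nan' is an ordinary string here, not a missing value.

    data = DataFrame({'name': ['zxw', 'zhj'] * 4,
                      'grade': [71, 0, 75, 0, 100, 0, 124, 0]})
    print(data)

      name  grade
    0  zxw     71
    1  zhj      0
    2  zxw     75
    3  zhj      0
    4  zxw    100
    5  zhj      0
    6  zxw    124
    7  zhj      0

    sex_to_people = {'zxw': 'nan', 'zhj': 'nv'}
    data['sex'] = data['name'].map(sex_to_people)
    print(data)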
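The sketch below contrasts the two argument types that map accepts. The name 'abc' and the English labels are hypothetical additions, included to show what happens to a value the dict does not cover: the dict form yields NaN, while a function can supply its own default.

```python
import pandas as pd

# 'abc' is a made-up name that is deliberately absent from the mapping.
names = pd.Series(["zxw", "zhj", "abc"])
sex_by_name = {"zxw": "male", "zhj": "female"}

# Dict form: values missing from the dict become NaN.
mapped_with_dict = names.map(sex_by_name)

# Function form: the function decides what to return for unknown names.
mapped_with_func = names.map(lambda n: sex_by_name.get(n, "unknown"))

print(mapped_with_dict)
print(mapped_with_func)
```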
-
Replacing values: replace

    data = Series([1, 2, 4, 9, 16])
    print(data)

    0     1
    1     2
    2     4
    3     9
    4    16
    dtype: int64

1. Replace a single value.

    print(data.replace(1, 4))

    0     4
    1     2
    2     4
    3     9
    4    16
    dtype: int64

2. Replace several values with the same value.

    print(data.replace([4, 9], 0))

    0     1
    1     2
    2     0
    3     0
    4    16

3. Replace several values with different values, using two lists of equal length.

    print(data.replace([4, 9], [-4, -9]))

    0     1
    1     2
    2    -4
    3    -9
    4    16

4. Replace several values with different values, using a dict.

    print(data.replace({16: -16, 9: -9}))

    0     1
    1     2
    2     4
    3    -9
    4   -16
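replace with a dict can look like map with a dict, but the two treat unmatched values differently. A short sketch on the same Series: replace leaves values without a match as they are, while map turns them into NaN.

```python
import pandas as pd

data = pd.Series([1, 2, 4, 9, 16])
mapping = {16: -16, 9: -9}

# replace: values not in the dict are left unchanged.
replaced = data.replace(mapping)

# map: values not in the dict become NaN (and the dtype becomes float).
mapped = data.map(mapping)

print(replaced)
print(mapped)
```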
-
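Renaming axis labels: rename

rename takes dicts that map old labels to new ones, for the row index and the columns separately. Labels not listed are kept as they are.

    data = DataFrame(np.arange(16).reshape((4, 4)),
                     index=['one', 'two', 'three', 'four'])
    print(data)

            0   1   2   3
    one     0   1   2   3
    two     4   5   6   7
    three   8   9  10  11
    four   12  13  14  15

    print(data.rename(index={'one': '11'}, columns={0: 'one'}))

           one   1   2   3
    11       0   1   2   3
    two      4   5   6   7
    three    8   9  10  11
    four    12  13  14  15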
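Besides dicts, rename also accepts functions, which are applied to every label on the given axis. The sketch below is one way to use that; the col prefix is an arbitrary choice for illustration. It also shows that rename returns a new frame rather than modifying the original.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(
    np.arange(16).reshape((4, 4)),
    index=["one", "two", "three", "four"],
)

# A function is applied to each label: upper-case the row labels,
# and prefix each integer column label with "col".
renamed = data.rename(index=str.upper, columns=lambda c: f"col{c}")

print(renamed)
print(data.index.tolist())  # the original labels are unchanged
```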
-
Discretization and binning: cut, qcut (grouping values into discrete bins)

    ages = [20, 22, 25, 27, 21, 23, 37, 31, 61, 45, 41, 32]
    bins = [18, 25, 35, 60, 100]
    c = pd.cut(ages, bins)

1. print(c) shows the interval each value falls into.

    [(18, 25], (18, 25], (18, 25], (25, 35], (18, 25], ..., (25, 35], (60, 100], (35, 60], (35, 60], (25, 35]]
    Length: 12
    Categories (4, interval[int64]): [(18, 25] < (25, 35] < (35, 60] < (60, 100]]

2. print(c.codes) shows the index of the bin each value falls into.

    [0 0 0 1 0 0 2 1 3 2 2 1]

3. print(pd.value_counts(c)) counts the values in each bin. In recent pandas releases the top-level pd.value_counts is deprecated; pd.Series(c).value_counts() gives the same counts.

    (18, 25]     5
    (35, 60]     3
    (25, 35]     3
    (60, 100]    1

4. The right parameter controls which side of each interval is closed. The default right=True closes the right side; right=False closes the left side instead.

    c = pd.cut(ages, bins, right=False)
    print(c)

    [[18, 25), [18, 25), [25, 35), [25, 35), [18, 25), ..., [25, 35), [60, 100), [35, 60), [35, 60), [25, 35)]
    Length: 12
    Categories (4, interval[int64]): [[18, 25) < [25, 35) < [35, 60) < [60, 100)]

5. labels supplies custom names for the bins.

    name = ['Youth', 'YoungAdult', 'MiddleAged', 'Senior']
    c = pd.cut(ages, bins, right=False, labels=name)
    print(c)

    [Youth, Youth, YoungAdult, YoungAdult, Youth, ..., YoungAdult, Senior, MiddleAged, MiddleAged, YoungAdult]
    Length: 12
    Categories (4, object): [Youth < YoungAdult < MiddleAged < Senior]

6. qcut bins the data by sample quantiles. Each bin receives roughly the same number of values, so the bins are generally not of equal width.

    data = np.random.randn(20)
    c = pd.qcut(data, 4)
    print(pd.value_counts(c))

    (0.689, 2.849]       5
    (-0.0909, 0.689]     5
    (-1.052, -0.0909]    5
    (-1.757, -1.052]     5
    dtype: int64
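The difference between equal-width and equal-count binning shows up clearly on skewed data. The sketch below uses a hand-picked array with one large value (made up for illustration) and compares pd.cut with an integer number of bins, which splits the value range into equal-width intervals, against pd.qcut with the same number of bins.

```python
import numpy as np
import pandas as pd

# Hand-picked, skewed sample: seven small values and one large one.
values = np.array([1, 2, 3, 4, 5, 6, 7, 100])

# Integer bins with cut: four intervals of equal width over [1, 100].
equal_width = pd.cut(values, 4)

# qcut: four intervals bounded by the sample quartiles.
equal_count = pd.qcut(values, 4)

# Count the members of each bin from the integer bin codes.
width_counts = np.bincount(equal_width.codes, minlength=4)
quantile_counts = np.bincount(equal_count.codes, minlength=4)

print(width_counts)
print(quantile_counts)
```

With cut, almost everything lands in the first interval; with qcut, each interval holds two values.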
-
Detecting and filtering outliers

1. describe() prints summary statistics for each column.

    data2 = DataFrame([[1, 2, 3], [4, NA, 6], [7, 8, 9], [NA, NA, NA]])
    print(data2.describe())

             0         1    2
    count  3.0  2.000000  3.0
    mean   4.0  5.000000  6.0
    std    3.0  4.242641  3.0
    min    1.0  2.000000  3.0
    25%    2.5  3.500000  4.5
    50%    4.0  5.000000  6.0
    75%    5.5  6.500000  7.5
    max    7.0  8.000000  9.0

2. Find the values in one column whose absolute value exceeds 1 (np.abs computes the absolute value).

    data = pd.DataFrame(np.random.randn(10, 4))
    col = data[2]
    print(col[np.abs(col) > 1])

    4    1.079749
    7    1.628768
    8    1.082464
    9    1.110299
    Name: 2, dtype: float64

3. np.sign(data) returns 1 for positive values and -1 for negative values (and 0 for zeros).

    print(np.sign(data))

         0    1    2    3
    0  1.0  1.0  1.0  1.0
    1  1.0  1.0  1.0 -1.0
    2 -1.0 -1.0 -1.0  1.0
    3  1.0 -1.0 -1.0  1.0
    4 -1.0 -1.0 -1.0 -1.0
    5 -1.0  1.0 -1.0 -1.0
    6  1.0 -1.0  1.0  1.0
    7  1.0  1.0  1.0  1.0
    8 -1.0  1.0 -1.0 -1.0
    9 -1.0 -1.0  1.0 -1.0
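One common use of np.abs and np.sign together is to cap outliers at a limit while preserving their sign. The sketch below is an illustration under assumptions: the frame, the column names a and b, and the limit of 3 are all made up. It selects the rows that contain an outlier and then caps the values in two equivalent ways.

```python
import numpy as np
import pandas as pd

# Hypothetical data; the limit of 3 is an arbitrary choice.
data = pd.DataFrame({"a": [0.5, -4.2, 2.0], "b": [3.7, 1.0, -0.1]})
limit = 3

# Rows in which at least one value exceeds the limit in absolute value.
rows_with_outliers = data[(data.abs() > limit).any(axis=1)]

# Cap with clip ...
capped = data.clip(lower=-limit, upper=limit)

# ... or replace each outlier with sign * limit, keeping other values as they are.
capped_via_sign = data.mask(data.abs() > limit, np.sign(data) * limit)

print(rows_with_outliers)
print(capped)
```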
-
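Permutation and random sampling: permutation, take, sample

1. np.random.permutation(n) returns the integers 0 to n-1 in a random order, where n is the length of the axis to reorder. take applies that new order to the rows. In the run shown, ss was [3 4 2 0 1], as the row order of the output confirms.

    df = pd.DataFrame(np.arange(20).reshape((5, 4)))
    ss = np.random.permutation(5)
    print(ss)
    print(df.take(ss))

        0   1   2   3
    3  12  13  14  15
    4  16  17  18  19
    2   8   9  10  11
    0   0   1   2   3
    1   4   5   6   7

2. sample draws a random subset of rows; the rows come back in random order.

    print(df.sample(n=4))

        0   1   2   3
    0   0   1   2   3
    1   4   5   6   7
    2   8   9  10  11
    4  16  17  18  19

   With replace=True the same row may be drawn more than once.

    print(df.sample(n=4, replace=True))

        0   1   2   3
    3  12  13  14  15
    3  12  13  14  15
    3  12  13  14  15
    4  16  17  18  19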
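The outputs above change on every run. For a reproducible variant, one option is to seed the random source. The sketch below uses NumPy's Generator API and the random_state parameter of sample; the seed value 0 is arbitrary.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(20).reshape((5, 4)))

# A seeded generator gives the same permutation on every run.
rng = np.random.default_rng(0)
order = rng.permutation(len(df))

# take reorders rows by position; iloc with the same array is equivalent.
shuffled = df.take(order)

# random_state makes sample reproducible as well.
sampled = df.sample(n=3, random_state=0)

# Drawing more rows than the frame holds requires replace=True,
# and then some row must appear more than once.
resampled = df.sample(n=10, replace=True, random_state=0)

print(shuffled)
print(sampled)
```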
-
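Computing indicator/dummy variables: get_dummies

1. Pick one column of a DataFrame and build its indicator matrix: one column per distinct value, with a 1 where the row holds that value. On pandas 2.0 and later the indicator columns print as True/False unless dtype=int is passed.

    df = pd.DataFrame({'key1': ['a', 'b', 'c', 'c', 'b', 'a'],
                       'data1': range(6)})
    print(pd.get_dummies(df['data1']))

       0  1  2  3  4  5
    0  1  0  0  0  0  0
    1  0  1  0  0  0  0
    2  0  0  1  0  0  0
    3  0  0  0  1  0  0
    4  0  0  0  0  1  0
    5  0  0  0  0  0  1

2. df[['data1']] keeps the column name and is a DataFrame with a single column; df['data1'] is a Series.

3. prefix adds a prefix to the new column names. join is a DataFrame method that merges the result with other data.

    dummy = pd.get_dummies(df['key1'], prefix='key')
    print(dummy)

       key_a  key_b  key_c
    0      1      0      0
    1      0      1      0
    2      0      0      1
    3      0      0      1
    4      0      1      0
    5      1      0      0

    print(df[['data1']].join(dummy))
    print(dummy.join(df['data1']))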
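The post does not show the output of the two join calls. The sketch below reproduces the first one on the same df, passing dtype=int (an addition not in the original code) so that the indicator columns hold 0/1 regardless of pandas version.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "key1": ["a", "b", "c", "c", "b", "a"],
        "data1": range(6),
    }
)

# dtype=int forces 0/1 indicators instead of the boolean default in newer pandas.
dummy = pd.get_dummies(df["key1"], prefix="key", dtype=int)

# df[["data1"]] is a one-column DataFrame, so it has a join method.
combined = df[["data1"]].join(dummy)

print(combined)
```

Each row has exactly one indicator set, since every row holds exactly one key1 value.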
Data cleaning and preparation
Reposted from blog.csdn.net/qq_41458842/article/details/102534095