A Beginner's Tutorial on Mainstream Python Libraries in Jupyter (Part 1)

Author: 二马传奇

0. Basic test

for i in range(5):
    print(i,end=",")

0,1,2,3,4,

1. NumPy tutorial

1.1 Basic NumPy functions

arange creates an evenly spaced array from a start, a stop (excluded), and a step:

import numpy as np
warry=np.arange(0,1,0.2)
print(warry)

[0.  0.2 0.4 0.6 0.8]

linspace creates an evenly spaced array from a start, a stop (included by default), and the number of points:

warry=np.linspace(0,1,5)
print(warry)

[0.   0.25 0.5  0.75 1.  ]

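logspace creates a geometric sequence. Its start and stop arguments are exponents (base 10 by default), so the example below runs from 10^0 = 1 to 10^1 = 10: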

warry=np.logspace(0,1,5)
print(warry)

[ 1.          1.77827941  3.16227766  5.62341325 10.        ]
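The relationship between these three constructors can be checked directly. The following is a small sketch (the variable names a, b, c are only illustrative) showing that arange excludes the stop value, linspace includes it, and logspace is equivalent to raising 10 to the output of linspace:

```python
import numpy as np

# arange takes a step and excludes the stop value
a = np.arange(0, 1, 0.2)
# linspace takes a point count and includes the stop value by default
b = np.linspace(0, 1, 5)
# logspace treats start and stop as base-10 exponents
c = np.logspace(0, 1, 5)

print(len(a), a[-1] < 1)        # 5 points, the last one is below 1
print(b[-1])                    # 1.0
print(np.allclose(c, 10 ** b))  # True
```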

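zeros works like its MATLAB counterpart, with a slightly different call: the shape is passed as a single integer or as one list or tuple.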

print(np.zeros(4))

[0. 0. 0. 0.]

print(np.zeros([4,4]))

[[0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]]

ones is used the same way as zeros and returns an all-ones matrix, again similar to MATLAB:

print(np.ones([4,4]))

[[1. 1. 1. 1.]
 [1. 1. 1. 1.]
 [1. 1. 1. 1.]
 [1. 1. 1. 1.]]

diag builds a diagonal matrix from the given diagonal elements, as in MATLAB:

print(np.diag([1,2,3,4]))

[[1 0 0 0]
 [0 2 0 0]
 [0 0 3 0]
 [0 0 0 4]]

1.2 ndarray attributes and data type conversion

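ndim: number of dimensions (axes), not the matrix rank in the linear-algebra sense; shape: size of the array along each dimension; size: total number of elements; dtype: element data type; itemsize: size of each element in bytes. The default integer dtype depends on the platform and NumPy version, so int64 with an itemsize of 8 is equally common.

warray=np.array([[1,2,3],[4,5,6]])
print('ndim:',warray.ndim)
print('shape:',warray.shape)
print('size:',warray.size)
print('dtype:',warray.dtype)
print('itemsize:',warray.itemsize)

ndim: 2
shape: (2, 3)
size: 6
dtype: int32
itemsize: 4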
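The section title mentions data conversion, which can be done with astype. This is a minimal sketch; the target types float64 and str are chosen only for illustration:

```python
import numpy as np

warray = np.array([[1, 2, 3], [4, 5, 6]])

# astype returns a new array; the original keeps its integer dtype
as_float = warray.astype(np.float64)
as_str = warray.astype(str)

print(as_float.dtype, as_float.itemsize)  # float64 8
print(as_str[0, 0])                       # the string '1'
print(warray.dtype.kind)                  # 'i' (still an integer array)
```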

1.3 Random number generation

randint generates random integers in a given range (the lower bound is included, the upper bound is excluded):

print(np.random.randint(100,200,size=(2,4)))

[[102 144 157 134]
 [122 146 132 108]]

rand generates an array of random floats drawn uniformly from the half-open interval [0, 1):

print(np.random.rand(5))

[0.00671345 0.38544324 0.99597451 0.90532193 0.70969704]
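The outputs above change on every run. If reproducible results are needed, one option is NumPy's Generator interface with a fixed seed. The sketch below assumes NumPy 1.17 or later; the seed value 42 is arbitrary:

```python
import numpy as np

# Two generators built from the same seed produce the same stream
rng1 = np.random.default_rng(42)
rng2 = np.random.default_rng(42)

ints1 = rng1.integers(100, 200, size=(2, 4))  # upper bound excluded
ints2 = rng2.integers(100, 200, size=(2, 4))
floats = rng1.random(5)                        # values in [0, 1)

print(np.array_equal(ints1, ints2))  # True
```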

1.4 Array transformations

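The reshape method changes the shape of an array. Passing -1 for one dimension tells NumPy to infer it from the array size and the other dimensions: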

arr=np.array([[1,2,3,4],[5,6,7,8]])
print(arr)
print(arr.reshape(4,-1))

[[1 2 3 4]
 [5 6 7 8]]
[[1 2]
 [3 4]
 [5 6]
 [7 8]]

hstack merges arrays horizontally (column-wise); its argument is a tuple of the arrays to join:

arr1=np.arange(0,6,1).reshape(2,-1)
arr2=arr1*2
print(np.hstack((arr1,arr2)))

[[ 0  1  2  0  2  4]
 [ 3  4  5  6  8 10]]

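concatenate uses the axis argument to choose the direction: axis=0 stacks vertically (adds rows), axis=1 stacks horizontally (adds columns):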

print(np.concatenate((arr1,arr2),axis=0))

[[ 0  1  2]
 [ 3  4  5]
 [ 0  2  4]
 [ 6  8 10]]

Conversely, an array can be split horizontally or vertically with hsplit, vsplit, or split with an axis argument:

print(np.split(arr1,2,axis=0))

[array([[0, 1, 2]]), array([[3, 4, 5]])]
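Only split is demonstrated above. As a short sketch, hsplit and vsplit are shorthand for split along a fixed axis:

```python
import numpy as np

arr1 = np.arange(6).reshape(2, -1)      # [[0 1 2], [3 4 5]]

# vsplit cuts along rows, same as np.split(arr1, 2, axis=0)
top, bottom = np.vsplit(arr1, 2)
# hsplit cuts along columns, same as np.split(arr1, 3, axis=1)
left, mid, right = np.hsplit(arr1, 3)

print(top, bottom)
print(left.shape)   # (2, 1)
```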

To transpose an array, use the T attribute or the transpose method, much as in MATLAB:

print(arr1.T)

[[0 3]
 [1 4]
 [2 5]]

1.5 Universal functions (ufuncs)

print(arr1<arr2)

[[False  True  True]
 [ True  True  True]]

print(arr1**arr2)

[[      1       1      16]
 [    729   65536 9765625]]

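Broadcasting with ufuncs: when the shapes differ, the smaller array (here y, shape (3,)) is virtually stretched to match the larger one (x, shape (3, 3)), so y is added to every row of x: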

x=np.array([[1,2,3],[2,3,4],[3,4,5]])
y=np.array([1,2,3])
print(x+y)

[[2 4 6]
 [3 5 7]
 [4 6 8]]
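To add y to every column instead of every row, y has to be given a column shape first. A minimal sketch using np.newaxis:

```python
import numpy as np

x = np.array([[1, 2, 3], [2, 3, 4], [3, 4, 5]])
y = np.array([1, 2, 3])

row_wise = x + y                 # y has shape (3,), stretched across rows
col_wise = x + y[:, np.newaxis]  # y has shape (3, 1), stretched across columns

print(col_wise)
# [[2 3 4]
#  [4 5 6]
#  [6 7 8]]
```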

Conditional selection on arrays using plain Python logic (a list comprehension):

arr1=np.array([1,2,3,5])
arr2=np.array([2,4,6,8])
cond=np.array([True,False,True,False])
print([(x if c else y)for x,y,c in zip(arr1,arr2,cond)])

[1, 4, 3, 8]

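For large arrays, use where, which performs the same conditional selection in vectorized form. Called with only a condition, where returns the indices of the elements that satisfy it: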

print(np.where(cond,arr1,arr2))

[1 4 3 8]

print(np.where(arr1>2))

(array([2, 3], dtype=int64),)

1.6 Sorting

The sort method sorts the array in place; axis=1 sorts within each row:

arr=np.array([[4,2,9,5],[6,4,8,3],[1,6,2,4]])
arr.sort(axis=1)
print(arr)

[[2 4 5 9]
 [3 4 6 8]
 [1 2 4 6]]

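argsort returns the indices that would sort the array: each entry of the result is the position, in the original data, of the element that belongs at that place in sorted order. lexsort does the same when sorting by several keys.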

arr=np.array([[4,2,9,5],[6,4,8,3],[1,6,2,4]])
print(arr.argsort())

[[1 0 3 2]
 [3 1 0 2]
 [0 2 3 1]]
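lexsort is mentioned above but not shown. The sketch below uses two made-up key arrays (primary and secondary are illustrative names); note that lexsort expects the keys in reverse priority, with the most important key last:

```python
import numpy as np

arr = np.array([4, 2, 9, 5])
order = np.argsort(arr)
print(order)        # [1 0 3 2]
print(arr[order])   # [2 4 5 9], indexing with the result sorts the array

primary = np.array([2, 1, 2, 1])
secondary = np.array([9, 5, 3, 4])
# Sort by primary first, then by secondary within ties (last key is primary)
lex_order = np.lexsort((secondary, primary))
print(lex_order)    # [3 1 2 0]
```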

1.7 Repeating and deduplicating data

unique finds the distinct values in an array and returns them sorted:

names=np.array([3,1,2,2,3,3,3,4,5,5])
print(np.unique(names))

[1 2 3 4 5]
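If the number of occurrences is also needed, unique accepts return_counts=True. A short sketch on the same data:

```python
import numpy as np

names = np.array([3, 1, 2, 2, 3, 3, 3, 4, 5, 5])
# values are the sorted distinct entries, counts their frequencies
values, counts = np.unique(names, return_counts=True)

print(values)   # [1 2 3 4 5]
print(counts)   # [1 2 4 1 2]
```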

tile repeats the whole array, while repeat repeats each element:

print(np.tile(names,2))
print(np.repeat(names,2))

[3 1 2 2 3 3 3 4 5 5 3 1 2 2 3 3 3 4 5 5]
[3 3 1 1 2 2 2 2 3 3 3 3 3 3 4 4 5 5 5 5]

names=np.array([[1,2,3],[4,5,6]])
print(names.repeat(2,axis=0))
print(names.repeat(2,axis=1))
print(np.tile(names,2))

[[1 2 3]
 [1 2 3]
 [4 5 6]
 [4 5 6]]
[[1 1 2 2 3 3]
 [4 4 5 5 6 6]]
[[1 2 3 1 2 3]
 [4 5 6 4 5 6]]

1.8 Common statistical functions

sum,mean,std,var,min,max

arr=np.arange(20).reshape(4,5)
print('total sum:',np.sum(arr))
print('column sums (axis=0):',np.sum(arr,axis=0))
print('row sums (axis=1):',np.sum(arr,axis=1))
print('overall mean:',np.mean(arr))
print('row means (axis=1):',np.mean(arr,axis=1))
print('standard deviation:',np.std(arr))

total sum: 190
column sums (axis=0): [30 34 38 42 46]
row sums (axis=1): [10 35 60 85]
overall mean: 9.5
row means (axis=1): [ 2.  7. 12. 17.]
standard deviation: 5.766281297335398
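The remaining functions in the list (var, min, max) follow the same pattern. The sketch below also shows the ddof argument: NumPy divides by N by default, whereas ddof=1 gives the sample variance.

```python
import numpy as np

arr = np.arange(20).reshape(4, 5)

pop_var = np.var(arr)             # divides by N (ddof=0)
sample_var = np.var(arr, ddof=1)  # divides by N-1
col_min = np.min(arr, axis=0)     # minimum of each column
row_max = np.max(arr, axis=1)     # maximum of each row

print(pop_var, sample_var)        # 33.25 35.0
print(col_min, row_max)
```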

2. pandas tutorial

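pandas originally offered three data structures: Series, DataFrame, and Panel. Panel was deprecated and has been removed from current pandas releases, so this tutorial covers Series and DataFrame.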

2.1 Series

import pandas as pd
obj=pd.Series([1,-2,3,-4])
print(obj)

0    1
1   -2
2    3
3   -4
dtype: int64

i=["a","c","d","a"]
v=[2,4,5,7]
t=pd.Series(v,index=i,name="col")
print(t)

a    2
c    4
d    5
a    7
Name: col, dtype: int64

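A Series supports two kinds of indexing: by position and by label. Use iloc for positions, since plain t[0] on a Series with a non-integer index is deprecated in recent pandas versions. A duplicated label returns every matching element.

print(t.iloc[0])
print(t["a"])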

2
a    2
a    7
Name: col, dtype: int64
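To keep the two indexing schemes apart, the explicit accessors iloc (position) and loc (label) can be used. A small sketch on the same Series:

```python
import pandas as pd

t = pd.Series([2, 4, 5, 7], index=["a", "c", "d", "a"], name="col")

first = t.iloc[0]              # by position
single = t.loc["c"]            # unique label gives a scalar
repeated = t.loc["a"].tolist() # duplicated label gives all matches

print(first, single, repeated)  # 2 4 [2, 7]
```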

Creating a Series from a dictionary:

sdata={'Ohio':35000,"Texas":71000,"Oregon":16000,"Utah":50000}
obj3=pd.Series(sdata)
print(obj3)

Ohio      35000
Texas     71000
Oregon    16000
Utah      50000
dtype: int64

states=['California','Ohio','Oregon','Texas']
obj4=pd.Series(sdata,index=states)
print(obj3+obj4)

California         NaN
Ohio           70000.0
Oregon         32000.0
Texas         142000.0
Utah               NaN
dtype: float64

Modifying the index of a Series in place:

obj3.index=states
print(obj3)

California    35000
Ohio          71000
Oregon        16000
Texas         50000
dtype: int64

2.2 DataFrame

A DataFrame can be created from a dictionary of equal-length lists:

import numpy as np
data={'name':['张三','李四'],'sex':['female','male'],'city':['北京','上海']}
df=pd.DataFrame(data,index=np.arange(2))
print(df)

  name     sex city
0   张三  female   北京
1   李四    male   上海

df2=pd.DataFrame(data,columns=['name','sex','city','address'],index=['a','b'])
print(df2)

  name     sex city address
a   张三  female   北京     NaN
b   李四    male   上海     NaN

Index objects

print(df2.index)
print(df2.columns)

Index(['a', 'b'], dtype='object')
Index(['name', 'sex', 'city', 'address'], dtype='object')

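Inserting into an index. Index objects are immutable, so insert returns a new Index and leaves df2.index unchanged: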

print(df2.index.insert(1,'w'))

Index(['a', 'w', 'b'], dtype='object')

Basic DataFrame attributes: values, index, columns, dtypes, ndim, shape

print(df2.values)
print(df2.ndim)
print(df2.shape)

[['张三' 'female' '北京' nan]
 ['李四' 'male' '上海' nan]]
2
(2, 4)

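Reindexing: the reindex method conforms the data to a new index. Existing labels are rearranged into the new order, and labels that are not present get missing values (or fill_value, if given):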

obj=pd.Series([7.2,-4.3,4.5,3.6],index=['a','b','d','c'])
print(obj)
print(obj.reindex(['a','b','c','d','e'],fill_value=0))

a    7.2
b   -4.3
d    4.5
c    3.6
dtype: float64
a    7.2
b   -4.3
c    3.6
d    4.5
e    0.0
dtype: float64

Forward and backward filling, which is most useful for ordered data such as time series:

import numpy as np
obj=pd.Series([7.2,-4.3,4.5],index=[0,2,4])
print(obj.reindex(np.arange(6),method='ffill'))
print(obj.reindex(np.arange(6),method='bfill'))

0    7.2
1    7.2
2   -4.3
3   -4.3
4    4.5
5    4.5
dtype: float64
0    7.2
1   -4.3
2   -4.3
3    4.5
4    4.5
5    NaN
dtype: float64

Replacing the index with a column (set_index):

df5=df.set_index('city')
print(df5)

     name     sex
city             
北京     张三  female
上海     李四    male

Querying data

Selecting columns (slices cannot be used here, because df[...] treats a slice as a row selection):

w1=df['name']
w2=df[['name','city']]
print(w1)
print(w2)

0    张三
1    李四
Name: name, dtype: object
  name city
0   张三   北京
1   李四   上海

Selecting rows (slices can be used):

print(df[:2])
print(df[:1])

  name     sex city
0   张三  female   北京
1   李四    male   上海
  name     sex city
0   张三  female   北京

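loc selects by label on both axes: row labels first, then column labels. Here the row labels happen to be the integers 0 and 1: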

print(df.loc[:,['name','city']])
print(df.loc[[0],['name']])

  name city
0   张三   北京
1   李四   上海
  name
0   张三

iloc selects both rows and columns by integer position:

a={'name':['王五','刘九'],'city':['深圳','广州'],'sex':['male','female']}
df1=pd.DataFrame(a)
df2=pd.concat([df,df1],ignore_index=True)
print(df2)
print(df2.iloc[[1,3],[1,2]])

  name     sex city
0   张三  female   北京
1   李四    male   上海
2   王五    male   深圳
3   刘九  female   广州
      sex city
1    male   上海
3  female   广州
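One difference between loc and iloc that is easy to miss is how slices are bounded. The sketch below uses a made-up frame whose row labels (10, 20, 30, 40) differ from the row positions:

```python
import pandas as pd

# Hypothetical data; labels are deliberately different from positions
df = pd.DataFrame({'name': ['A', 'B', 'C', 'D'], 'score': [1, 2, 3, 4]},
                  index=[10, 20, 30, 40])

by_label = df.loc[10:30]  # label slice: both endpoints included
by_pos = df.iloc[0:2]     # position slice: end excluded

print(len(by_label), len(by_pos))  # 3 2
```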

Boolean selection (similar to logical indexing in MATLAB):

print(df2[df2['city']=='上海'])

  name   sex city
1   李四  male   上海

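Adding rows (concat)

DataFrame.append was removed in pandas 2.0. Wrap the new row in a one-row DataFrame and use concat instead, which handles one row or many:

data1={'city':'南京','sex':'male','name':'马六'}
df3=pd.concat([df2,pd.DataFrame([data1])],ignore_index=True)
print(df3)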
print(df3)

  name     sex city
0   张三  female   北京
1   李四    male   上海
2   王五    male   深圳
3   刘九  female   广州
4   马六    male   南京

Adding a row directly with loc (this modifies df2 in place):

df2.loc[5]=['沈七','male','武汉']
print(df2)

  name     sex city
0   张三  female   北京
1   李四    male   上海
2   王五    male   深圳
3   刘九  female   广州
5   沈七    male   武汉

Merging two DataFrames with concat (an example was given above)

Adding a column (assign directly to a new column label)

df2['age']=[19,20,56,28,39]
print(df2)

  name     sex city  age
0   张三  female   北京   19
1   李四    male   上海   20
2   王五    male   深圳   56
3   刘九  female   广州   28
5   沈七    male   武汉   39

Deleting data

Deleting rows

print(df2.drop(5))

  name     sex city  age
0   张三  female   北京   19
1   李四    male   上海   20
2   王五    male   深圳   56
3   刘九  female   广州   28

Deleting columns: axis=1 drops a column, otherwise drop removes rows. inplace=True modifies the original data; otherwise a copy is returned.

df2.drop('age',axis=1,inplace=True)
print(df2)

  name     sex city
0   张三  female   北京
1   李四    male   上海
2   王五    male   深圳
3   刘九  female   广州
5   沈七    male   武汉

2.3 Function application and mapping

map: applies a function to each element of a Series

df2['age']=['19岁','20岁','56岁','28岁','39岁']
print(df2)
def f(x):
    return int(x.split('岁')[0])
df2['age']=df2['age'].map(f)
print(df2)

  name     sex city  age
0   张三  female   北京  19岁
1   李四    male   上海  20岁
2   王五    male   深圳  56岁
3   刘九  female   广州  28岁
5   沈七    male   武汉  39岁
  name     sex city  age
0   张三  female   北京   19
1   李四    male   上海   20
2   王五    male   深圳   56
3   刘九  female   广州   28
5   沈七    male   武汉   39

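apply: applies a function along the rows or columns of a DataFrame, as chosen by the axis argument (axis=1 passes each row to the function)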

df3=pd.DataFrame(np.random.randn(3,3),columns=['a','b','c'],index=['app','win','mac'])
print(df3)
print('\n')
df4=df3.apply(np.mean,axis=1)
print(df4)

            a         b         c
app  1.193534  0.778194  1.952184
win  0.318285  0.213218  0.291229
mac  1.288238 -0.946410  0.726388


app    1.307971
win    0.274244
mac    0.356072
dtype: float64
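Because the example above uses random data, the effect of axis is easier to see on fixed numbers. This sketch uses a made-up 2x3 frame and a range function (max minus min):

```python
import pandas as pd

df = pd.DataFrame([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]],
                  columns=['a', 'b', 'c'], index=['app', 'win'])

# axis=0 (default): the function receives each column
col_range = df.apply(lambda s: s.max() - s.min())
# axis=1: the function receives each row
row_range = df.apply(lambda s: s.max() - s.min(), axis=1)

print(col_range.tolist())  # [3.0, 3.0, 3.0]
print(row_range.tolist())  # [2.0, 2.0]
```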

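applymap: applies a function to every element of a DataFrame. It calls the Python function once per element, so a vectorized method (here, df3.round(3)) is faster on large data. In pandas 2.1 and later, applymap is deprecated in favor of DataFrame.map.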

df4=df3.applymap(lambda x:round(x,3))
print(df4)

         a      b      c
app  1.194  0.778  1.952
win  0.318  0.213  0.291
mac  1.288 -0.946  0.726

2.4 Sorting

For a Series, sort_index sorts by index and sort_values sorts by value:

wy=pd.Series([2,6,5,4],index=['a','b','d','c'])
print(wy)
print(wy.sort_index())
print(wy.sort_values())

a    2
b    6
d    5
c    4
dtype: int64
a    2
b    6
c    4
d    5
dtype: int64
a    2
c    4
d    5
b    6
dtype: int64

For a DataFrame, the by argument of sort_values names the column to sort on:

wdf=pd.DataFrame(wy,columns=['num'])
wdf['year']=['2019','2020','2016','2005']
print(wdf.sort_values(by='year'))

   num  year
c    4  2005
d    5  2016
a    2  2019
b    6  2020

2.5 Summaries and statistics

sum aggregates by column (the default) or by row (axis=1)

print('Column sums:\n',df4.sum())
print('\n')
print('Row sums:\n',df4.sum(axis=1))

Column sums:
 a    2.800
b    0.045
c    2.969
dtype: float64


Row sums:
 app    3.924
win    0.822
mac    1.068
dtype: float64

describe gives a quick statistical summary of the data

print('Per-column summary')
df4.describe().T

Per-column summary

	count	mean	std	min	25%	50%	75%	max
a	3.0	0.933333	0.534963	0.318	0.7560	1.194	1.2410	1.288
b	3.0	0.015000	0.878890	-0.946	-0.3665	0.213	0.4955	0.778
c	3.0	0.989667	0.861319	0.291	0.5085	0.726	1.3390	1.952

print('Per-row summary')
df4.T.describe().T

Per-row summary

	count	mean	std	min	25%	50%	75%	max
app	3.0	1.308	0.595244	0.778	0.986	1.194	1.5730	1.952
win	3.0	0.274	0.054525	0.213	0.252	0.291	0.3045	0.318
mac	3.0	0.356	1.162052	-0.946	-0.110	0.726	1.0070	1.288

value_counts computes frequency counts:

print(df4['a'].value_counts())

1.194    1
0.318    1
1.288    1
Name: a, dtype: int64

print(df4.loc['mac'].value_counts())

-0.946    1
 1.288    1
 0.726    1
Name: mac, dtype: int64

2.6 Grouping and aggregation

The groupby method groups data efficiently:

da=pd.DataFrame({'key1':['a','a','b','b','a'],'key2':['yes','no','yes','yes','no'],'data1':np.random.randn(5),'data2':np.random.randn(5)})
print(da)
grouped=da['data1'].groupby(da['key1'])
print(grouped)
print(grouped.size())
print(grouped.mean())

  key1 key2     data1     data2
0    a  yes  1.525793 -0.275040
1    a   no -0.086230  1.717356
2    b  yes  1.091938 -0.955356
3    b  yes -1.859716  1.926496
4    a   no  1.586969 -0.406096
<pandas.core.groupby.generic.SeriesGroupBy object at 0x0000016C80400C48>
key1
a    3
b    2
Name: data1, dtype: int64
key1
a    1.008844
b   -0.383889
Name: data1, dtype: float64

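Grouping by column name. numeric_only=True restricts the mean to the numeric columns, which pandas 2.0 and later require here because key1 holds strings:

groupk1=da.groupby('key2').mean(numeric_only=True)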
groupk1

	data1	data2
key2
no	0.750369	0.655630
yes	0.252671	0.232033

Data can also be grouped by a list (or array) with one entry per row of the DataFrame:

k=da.groupby([0,0,0,1,1]).sum(numeric_only=True)
k

	data1	data2
0	2.531500	0.48696
1	-0.272747	1.52040

Groups can also be derived from a function, here by mapping it over a column and grouping by the result:

def judge(x):
    return 'a' if x>0 else 'b'
k=da['data1'].groupby(da['data1'].map(judge)).sum()
k

data1
a    4.204700
b   -1.945947
Name: data1, dtype: float64

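agg computes several statistics at once (function names are passed as strings, which avoids the warning recent pandas versions raise for np.sum and np.mean):

da[["data1","data2"]].agg(['sum','mean'])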

	data1	data2
sum	2.258753	2.007360
mean	0.451751	0.401472

Different statistics for different columns:

da.agg({'data1':'mean','data2':['std','mean']})

	data1	data2
mean	0.451751	0.401472
std	NaN	1.323638

Combining grouping and aggregation:

da.groupby(['key1','key2'])['data1'].agg('mean')

key1  key2
a     no      0.750369
      yes     1.525793
b     yes    -0.383889
Name: data1, dtype: float64

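transform broadcasts each group's result back to every row of the original data, so the output has the same length as the input; apply with an aggregating function returns one value per group instead: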

da.groupby(['key1','key2'])['data1'].transform('mean')

0    1.525793
1    0.750369
2   -0.383889
3   -0.383889
4    0.750369
Name: data1, dtype: float64

da.groupby(['key1','key2'])['data1'].apply(np.mean)

key1  key2
a     no      0.750369
      yes     1.525793
b     yes    -0.383889
Name: data1, dtype: float64
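A typical use of transform is centering values within each group, which only works because the result lines up row by row with the original data. A minimal sketch on made-up numbers (key and val are illustrative column names):

```python
import pandas as pd

df = pd.DataFrame({'key': ['a', 'a', 'b', 'b'],
                   'val': [1.0, 3.0, 10.0, 20.0]})

# One value per row: each row receives the mean of its own group
group_mean = df.groupby('key')['val'].transform('mean')
df['centered'] = df['val'] - group_mean

# One value per group, for comparison
per_group = df.groupby('key')['val'].mean()

print(group_mean.tolist())      # [2.0, 2.0, 15.0, 15.0]
print(df['centered'].tolist())  # [-1.0, 1.0, -5.0, 5.0]
print(len(per_group))           # 2
```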

2.7 Pivot tables

pivot_table computes the mean by default:

da.pivot_table(index='key1',columns='key2')

	data1		data2
key2	no	yes	no	yes
key1
a	0.750369	1.525793	0.65563	-0.27504
b	NaN	-0.383889	NaN	0.48557

Switching to a sum:

da.pivot_table(index='key1',columns='key2',aggfunc='sum')

	data1		data2
key2	no	yes	no	yes
key1
a	1.500739	1.525793	1.31126	-0.27504
b	NaN	-0.767779	NaN	0.97114

crosstab builds a cross-tabulation of group frequencies (margins=True adds the totals):

pd.crosstab(da['key1'],da['key2'],margins=True)

key2	no	yes	All
key1
a	2	1	3
b	0	2	2
All	2	3	5

2.8 Visualization with pandas

Plotting with the Series plot method

import matplotlib.pyplot as plt
s=pd.Series(np.random.normal(size=10))
s.plot()

<matplotlib.axes._subplots.AxesSubplot at 0x16c81121248>

[Figure: line plot of the Series s]

Plotting with the DataFrame plot method

ddp=pd.DataFrame({'normal':np.random.normal(size=50),'gamma':np.random.gamma(1,size=50)})
ddp.plot()

<matplotlib.axes._subplots.AxesSubplot at 0x16c81597a88>

[Figure: line plot of the normal and gamma columns of ddp]

Bar chart

print(da['key1'].value_counts())
print(da['key1'].value_counts().plot(kind='bar',rot=30))

a    3
b    2
Name: key1, dtype: int64
AxesSubplot(0.125,0.125;0.775x0.755)

[Figure: bar chart of the key1 value counts]

Histogram and density plot

wy=pd.Series(np.random.normal(size=80))
wy.hist(bins=15,grid=False)

<matplotlib.axes._subplots.AxesSubplot at 0x16c81675dc8>

[Figure: histogram]

wy.plot(kind='kde')  # kernel density estimate; requires SciPy

<matplotlib.axes._subplots.AxesSubplot at 0x16c81657648>

[Figure: kernel density plot]

Scatter plot

da.plot(kind='scatter',x='data1',y='data2')

<matplotlib.axes._subplots.AxesSubplot at 0x16c8343f808>

[Figure: scatter plot of data2 against data1]

2.9 Merging data

merge performs an inner join by default and returns the intersection of the keys; the how argument changes the join type:

price=pd.DataFrame({'fruit':['apple','grape','orange'],'price':[8,7,9]})
amount=pd.DataFrame({'fruit':['apple','grape','orange'],'amount':[5,11,8]})
display(price,amount,pd.merge(price,amount))

	fruit	price
0	apple	8
1	grape	7
2	orange	9

	fruit	amount
0	apple	5
1	grape	11
2	orange	8

	fruit	price	amount
0	apple	8	5
1	grape	7	11
2	orange	9	8

display(pd.merge(price,amount,how='outer'))

	fruit	price	amount
0	apple	8	5
1	grape	7	11
2	orange	9	8
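In the example above both tables contain the same fruits, so the inner and outer joins give identical results. The sketch below changes one key (banana is a made-up entry) so that the join types differ; indicator=True adds a _merge column that records where each row came from:

```python
import pandas as pd

price = pd.DataFrame({'fruit': ['apple', 'grape', 'orange'], 'price': [8, 7, 9]})
amount = pd.DataFrame({'fruit': ['apple', 'grape', 'banana'], 'amount': [5, 11, 8]})

inner = pd.merge(price, amount, on='fruit')                # keys in both tables
left = pd.merge(price, amount, on='fruit', how='left')     # all keys of price
outer = pd.merge(price, amount, on='fruit', how='outer',
                 indicator=True)                           # keys in either table

print(len(inner), len(left), len(outer))  # 2 3 4
print(outer)
```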

The concat method (plain stacking, without matching keys):

display(pd.concat([price,amount],axis=1))

	fruit	price	fruit	amount
0	apple	8	apple	5
1	grape	7	grape	11
2	orange	9	orange	8

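Summary (w1 and w2 are the two DataFrames defined in the example below):

combine_first keeps the values of w1 and fills in from w2 only where w1 has no value.
concat, when joining by columns (axis=1) with join='inner', keeps only the rows whose index appears in both w1 and w2 and drops the rest of w2; the default, join='outer', keeps all rows. When joining by rows, it appends w2 below w1 and the index labels are not merged.
merge with how='inner' (the default) takes the intersection of the rows of w1 and w2 on their shared columns, and how='outer' takes the union. Note that the original index is discarded.

In short, merge suits listing the rows that w1 and w2 share or the distinct rows across both; concat suits appending rows or columns; and combine_first suits combining w1 and w2 when they share index labels, with each filling the other's gaps.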

w1=pd.DataFrame({0:[0,2],1:[0,5]},index=['a','b'])
w2=pd.concat([pd.DataFrame({0:[0,1],1:[0,5]},index=['a','b']),pd.DataFrame({1:[5,6]},index=['f','g'])])
display(w1,w2)
display(pd.concat([w1,w2],axis=1,join='inner'))
display(pd.merge(w1,w2,how='outer'))
display(w1.combine_first(w2))

	0	1
a	0	0
b	2	5

	0	1
a	0.0	0
b	1.0	5
f	NaN	5
g	NaN	6

	0	1	0	1
a	0	0	0.0	0
b	2	5	1.0	5

	0	1
0	0.0	0
1	2.0	5
2	1.0	5
3	NaN	5
4	NaN	6

	0	1
a	0.0	0.0
b	2.0	5.0
f	NaN	5.0
g	NaN	6.0

2.10 Data cleaning

isnull reports which entries are NaN:

w3=w1.combine_first(w2)
w3.isnull()

	0	1
a	False	False
b	False	False
f	True	False
g	True	False

isnull().sum() counts the missing values in each column:

w3.isnull().sum()

0    2
1    0
dtype: int64

The info method shows the non-null counts as well:

w3.info()

<class 'pandas.core.frame.DataFrame'>
Index: 4 entries, a to g
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   0       2 non-null      float64
 1   1       4 non-null      float64
dtypes: float64(2)
memory usage: 96.0+ bytes

dropna removes the rows that contain missing values:

w4=w3.dropna()
display(w4)

	0	1
a	0.0	0.0
b	2.0	5.0

Filtering a Series with a boolean index:

not_null=w3[0].notnull()
print(not_null)
print(w3[0][not_null])

a     True
b     True
f    False
g    False
Name: 0, dtype: bool
a    0.0
b    2.0
Name: 0, dtype: float64

The thresh argument keeps only the rows that have at least N non-NaN values:

display(w3)
w4=w3.dropna(thresh=1)
display(w4)
w4=w3.dropna(thresh=2)
display(w4)

	0	1
a	0.0	0.0
b	2.0	5.0
f	NaN	5.0
g	NaN	6.0

	0	1
a	0.0	0.0
b	2.0	5.0
f	NaN	5.0
g	NaN	6.0

	0	1
a	0.0	0.0
b	2.0	5.0

fillna fills missing values; pass a constant, or a dictionary giving one fill value per column:

w4=w3.fillna(0)
display(w4)

	0	1
a	0.0	0.0
b	2.0	5.0
f	0.0	5.0
g	0.0	6.0
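The dictionary form mentioned above is not shown in the example. The sketch below rebuilds a frame with the same values as w3 and fills column 0 once with a per-column constant (-1.0 is an arbitrary choice) and once with the column mean:

```python
import numpy as np
import pandas as pd

w3 = pd.DataFrame({0: [0.0, 2.0, np.nan, np.nan],
                   1: [0.0, 5.0, 5.0, 6.0]}, index=list('abfg'))

by_col = w3.fillna({0: -1.0})   # dictionary: column label -> fill value
by_mean = w3.fillna(w3.mean())  # column means (mean skips NaN)

print(by_col[0].tolist())   # [0.0, 2.0, -1.0, -1.0]
print(by_mean[0].tolist())  # [0.0, 2.0, 1.0, 1.0]
```

Both calls return new frames; w3 itself still contains the NaN values.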

Detecting and handling duplicate values: the duplicated method flags each row that repeats an earlier row:

data=pd.DataFrame({'k1':['one','two']*3+['two'],'k2':[1,1,2,3,1,4,4],'k3':[1,1,5,2,1,4,4]})
print(data)
data.duplicated()

    k1  k2  k3
0  one   1   1
1  two   1   1
2  one   2   5
3  two   3   2
4  one   1   1
5  two   4   4
6  two   4   4





0    False
1    False
2    False
3    False
4     True
5    False
6     True
dtype: bool

drop_duplicates removes the duplicate rows:

data.drop_duplicates()

	k1	k2	k3
0	one	1	1
1	two	1	1
2	one	2	5
3	two	3	2
5	two	4	4
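By default drop_duplicates compares whole rows and keeps the first occurrence. A short sketch of the subset and keep arguments on the same data:

```python
import pandas as pd

data = pd.DataFrame({'k1': ['one', 'two'] * 3 + ['two'],
                     'k2': [1, 1, 2, 3, 1, 4, 4],
                     'k3': [1, 1, 5, 2, 1, 4, 4]})

# Compare only column k1: the first row for each k1 value survives
by_k1 = data.drop_duplicates(subset='k1')
# Compare whole rows, but keep the last occurrence instead of the first
keep_last = data.drop_duplicates(keep='last')

print(by_k1.index.tolist())      # [0, 1]
print(keep_last.index.tolist())  # [1, 2, 3, 4, 6]
```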

Outliers can be spotted with a scatter plot, a box plot, or the 3 $\sigma$ rule:

wdf=pd.DataFrame(np.arange(20),columns=['W'])
wdf['Y']=wdf['W']*1.5+2
wdf.iloc[3,1]=128
wdf.iloc[18,1]=150
wdf.plot(kind='scatter',x='W',y='Y')

<matplotlib.axes._subplots.AxesSubplot at 0x16c8374e808>

[Figure: scatter plot of Y against W]

plt.boxplot(wdf['Y'].values,notch=True)

{'whiskers': [<matplotlib.lines.Line2D at 0x16c838c5f48>,
  <matplotlib.lines.Line2D at 0x16c838d4e48>],
 'caps': [<matplotlib.lines.Line2D at 0x16c838d9b48>,
  <matplotlib.lines.Line2D at 0x16c838d9c88>],
 'boxes': [<matplotlib.lines.Line2D at 0x16c838d4c08>],
 'medians': [<matplotlib.lines.Line2D at 0x16c838d9d08>],
 'fliers': [<matplotlib.lines.Line2D at 0x16c838e1a08>],
 'means': []}

[Figure: box plot of Y]

k=wdf['Y'][abs(wdf['Y']-wdf['Y'].mean())>3*wdf['Y'].std()]
display(k)

18    150.0
Name: Y, dtype: float64
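The 3 $\sigma$ rule flags only index 18, even though index 3 was also altered: the two large values inflate the standard deviation enough to hide one of them. The box plot criterion (values beyond 1.5 times the interquartile range from the quartiles) is less sensitive to this. The sketch below applies it to the same data; the 1.5 factor is the conventional choice, not something prescribed by the tutorial:

```python
import numpy as np
import pandas as pd

wdf = pd.DataFrame(np.arange(20), columns=['W'])
wdf['Y'] = wdf['W'] * 1.5 + 2
wdf.iloc[3, 1] = 128
wdf.iloc[18, 1] = 150

# Quartiles and interquartile range of Y
q1, q3 = wdf['Y'].quantile([0.25, 0.75])
iqr = q3 - q1

# Values outside [q1 - 1.5*iqr, q3 + 1.5*iqr] are treated as outliers
mask = (wdf['Y'] < q1 - 1.5 * iqr) | (wdf['Y'] > q3 + 1.5 * iqr)
outliers = wdf['Y'][mask]
print(outliers)  # rows 3 and 18
```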

2.11 Data transformation

Replacing values with the replace method

data=pd.DataFrame({'name':['alice','bob','cindy','dick'],'sex':['male','','male','female']})
display(data)
data2=data.replace('','unknown')
display(data2)

	name	sex
0	alice	male
1	bob
2	cindy	male
3	dick	female

	name	sex
0	alice	male
1	bob	unknown
2	cindy	male
3	dick	female

Passing lists to replace

data2=data2.replace(['alice','bob'],['Alice','Bob'])
display(data2)

	name	sex
0	Alice	male
1	Bob	unknown
2	cindy	male
3	dick	female

Passing a dictionary to replace

data2=data2.replace({'male':'M','female':'F','unknown':'U'})
display(data2)

	name	sex
0	Alice	M
1	Bob	U
2	cindy	M
3	dick	F

Replacing values with a mapping function

data2['grades']=[96,75,59,80]
display(data2)
def G(x):
    # 80 < x <= 100, so a full score of 100 counts as excellent
    if 80<x<=100:
        return 'excellent'
    elif 60<=x<=80:
        return 'good'
    else:
        return 'bad'
data2['grades']=data2['grades'].map(G)
display(data2)

	name	sex	grades
0	Alice	M	96
1	Bob	U	75
2	cindy	M	59
3	dick	F	80

	name	sex	grades
0	Alice	M	excellent
1	Bob	U	good
2	cindy	M	bad
3	dick	F	good

2.12 Data standardization

Min-max normalization: $x^{*}=\frac{x-\min(x)}{\max(x)-\min(x)}$

num=pd.DataFrame(np.random.randint(100,200,size=(3,3)),columns=['a','b','c'])
display(num)
print(num.min())
def nor(S):
    S=(S-S.min())/(S.max()-S.min())
    return  S
num2=nor(num)
display(num2)

	a	b	c
0	157	150	101
1	123	164	177
2	199	152	111

a    123
b    150
c    101
dtype: int32

	a	b	c
0	0.447368	0.000000	0.000000
1	0.000000	1.000000	1.000000
2	1.000000	0.142857	0.131579
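Another common choice, not covered above, is z-score standardization, $z=\frac{x-\mu}{\sigma}$, which gives each column zero mean and unit standard deviation. The sketch below reuses the numbers printed above; note that pandas computes std with ddof=1 by default:

```python
import numpy as np
import pandas as pd

num = pd.DataFrame({'a': [157, 123, 199],
                    'b': [150, 164, 152],
                    'c': [101, 177, 111]})

# Column-wise z-score: subtract the column mean, divide by the column std
z = (num - num.mean()) / num.std()

print(z.round(3))
print(z.mean().abs().max() < 1e-9)  # each column has mean ~0
print(np.allclose(z.std(), 1.0))    # each column has std 1
```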

Discretizing (binning) values with the cut method

score=np.random.randint(25,100,size=10)
print(score)
bins=[0,59,70,80,100]
score_cut=pd.cut(score,bins)
print(pd.Series(score_cut).value_counts())

[87 73 87 47 49 86 45 26 63 64]
(0, 59]      4
(80, 100]    3
(59, 70]     2
(70, 80]     1
dtype: int64
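The intervals can be given readable names through the labels argument of cut. In this sketch the scores are the ones printed above, and the label names (fail, pass, good, excellent) are made up for illustration:

```python
import numpy as np
import pandas as pd

score = np.array([87, 73, 87, 47, 49, 86, 45, 26, 63, 64])
bins = [0, 59, 70, 80, 100]
labels = ['fail', 'pass', 'good', 'excellent']  # one label per interval

# Each score is replaced by the label of the interval it falls into
graded = pd.cut(score, bins, labels=labels)
counts = pd.Series(graded).value_counts()

print(list(graded[:2]))  # ['excellent', 'good']
print(counts)
```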

This concludes Part 1 of the tutorial. Matplotlib, Seaborn, and other libraries are covered in Part 2.
