Python third-party modules: data analysis with the Pandas module, Series (part 2)

IV. Methods
2. Arithmetic and comparison

The following methods still align the operands on their index labels before computing or comparing.

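(1) Basic arithmetic:

Addition: <S1>.add(<S2>)
  #<S1>+<S2>
Reversed addition: <S1>.radd(<S2>)
  #<S2>+<S1>
Subtraction: <S1>.sub(<S2>)
  #<S1>-<S2>
Reversed subtraction: <S1>.rsub(<S2>)
  #<S2>-<S1>
Multiplication: <S1>.mul(<S2>)
  #<S1>*<S2>
Reversed multiplication: <S1>.rmul(<S2>)
  #<S2>*<S1>
Division: <S1>.div(<S2>)
  #<S1>/<S2>
Reversed division: <S1>.rdiv(<S2>)
  #<S2>/<S1>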

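#Example (the imports below are used by the examples throughout these notes):
>>> import math
>>> import numpy as np
>>> import pandas as pd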
>>> a=pd.Series([0,1,2,3,4,5])
>>> b=pd.Series([-5,-4,-3,-2,-1,0])
>>> a.add(b)
0   -5
1   -3
2   -1
3    1
4    3
5    5
dtype: int64
>>> a.sub(b)
0    5
1    5
2    5
3    5
4    5
5    5
dtype: int64
>>> a.mul(b)
0    0
1   -4
2   -6
3   -6
4   -4
5    0
dtype: int64
>>> a.div(b)
0   -0.000000
1   -0.250000
2   -0.666667
3   -1.500000
4   -4.000000
5         inf
dtype: float64
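
Because these methods align on index labels first, operands with different labels give NaN wherever a label is missing on one side. The following is a minimal sketch of that behaviour, not taken from the original notes: the data are made up for illustration, and fill_value is a parameter of the flexible arithmetic methods that the notes above do not cover.

```python
import pandas as pd

# Illustrative data: the two Series share only the labels "b" and "c".
s1 = pd.Series([1, 2, 3], index=["a", "b", "c"])
s2 = pd.Series([10, 20, 30], index=["b", "c", "d"])

# After alignment, labels missing on either side yield NaN.
aligned = s1.add(s2)

# fill_value stands in for the missing side before the addition.
filled = s1.add(s2, fill_value=0)

print(aligned)
print(filled)
```

The result index is the union of both indexes, which is why the dtype becomes float64 as soon as NaN appears.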

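(2) Other operations:

Floor division: <S1>.floordiv(<S2>)
  #Division rounded down; ⌊<S1>/<S2>⌋
Reversed floor division: <S1>.rfloordiv(<S2>)
  #Division rounded down; ⌊<S2>/<S1>⌋

#Example (continued):
  #Note: the last element divides by zero (5//0); recent pandas versions return inf there and a float64 result, rather than the 0 and int64 shown here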
>>> a.floordiv(b)
0    0
1   -1
2   -1
3   -2
4   -4
5    0
dtype: int64

#################################################################################################

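Modulo (remainder): <S1>.mod(<S2>)
  #<S1>-⌊<S1>/<S2>⌋*<S2>
Reversed modulo: <S1>.rmod(<S2>)
  #<S2>-⌊<S2>/<S1>⌋*<S1>
  #Parameters:
    S1,S2: the Series objects taking part in the operation

#Example (continued):
  #Note: the last element takes a remainder by zero (5%0); recent pandas versions return NaN there and a float64 result, rather than the 0 and int64 shown here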
>>> a.mod(b)
0    0
1   -3
2   -1
3   -1
4    0
5    0
dtype: int64
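
The formulas for floordiv and mod imply that the quotient times the divisor plus the remainder gives back the dividend. Here is a small sketch checking that identity; it deliberately leaves out the zero divisor from the example above, since division by zero is a special case.

```python
import pandas as pd

# Same data as above, without the final pair (5, 0) to avoid a zero divisor.
a = pd.Series([0, 1, 2, 3, 4])
b = pd.Series([-5, -4, -3, -2, -1])

q = a.floordiv(b)  # quotient, rounded toward negative infinity
r = a.mod(b)       # remainder, takes the sign of the divisor

# q*b + r should reproduce a element by element.
reconstructed = q.mul(b).add(r)
print(reconstructed.equals(a))
```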

#################################################################################################

Power: <S1>.pow(<S2>)
  #<S1>**<S2>
Reversed power: <S1>.rpow(<S2>)
  #<S2>**<S1>

#Example (continued):
>>> b.pow(a)
0    1
1   -4
2    9
3   -8
4    1
5    0
dtype: int64

#################################################################################################

Dot product: <S1>.dot(<S2>)
  #Sum of the element-wise products, Σ<S1>[i]*<S2>[i]

#Example (continued):
>>> a.dot(b)
-20

#################################################################################################

Absolute value of every element: <S>.abs()

#Example (continued):
>>> a.abs()
0    0
1    1
2    2
3    3
4    4
5    5
dtype: int64

#################################################################################################

Round every element: <S>.round([decimals=0,*args,**kwargs])
  #Parameters:
    decimals: number of decimal places to keep; int
      #0 rounds to units, -1 rounds to tens, ...

#Example:
>>> pd.Series([1.11,2.22,5.55,6.66,9.99]).round()
0     1.0
1     2.0
2     6.0
3     7.0
4    10.0
dtype: float64

#################################################################################################

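Difference: <S>.diff([periods=1])
  #Computes <S>[i]-<S>[i-periods]
  #Parameters:
    periods: number of positions to shift when taking the difference (the lag, not the order of the difference); int; may be negative

#Example: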
>>> s=pd.Series([1,3,2,-2,10],index=[0,2,4,6,8])
>>> s.diff()
0     NaN
2     2.0#=3-1
4    -1.0#=2-3
6    -4.0#=(-2)-2
8    12.0#=10-(-2)
dtype: float64
>>> s.diff(2)
0    NaN
2    NaN
4    1.0
6   -5.0
8    8.0
dtype: float64
>>> s.diff(-1)
0    -2.0#=1-3
2     1.0#=3-2
4     4.0#=2-(-2)
6   -12.0#=(-2)-10
8     NaN
dtype: float64
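
To make the role of periods concrete, the sketch below compares diff(2) with an explicit shift and contrasts it with a second-order difference, which is obtained by applying diff twice. This is an illustrative check, not part of the original notes.

```python
import pandas as pd

s = pd.Series([1, 3, 2, -2, 10], index=[0, 2, 4, 6, 8])

# diff(2) subtracts the element two positions earlier.
lag2 = s.diff(2)
via_shift = s - s.shift(2)

# A second-order difference is the difference of the first differences.
second_order = s.diff().diff()

print(lag2.equals(via_shift))
print(second_order)
```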

#################################################################################################

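Percentage change: <S>.pct_change([periods=1,fill_method="pad",limit=None,freq=None,**kwargs])
  #Computes (<S>[i]-<S>[i-periods])/<S>[i-periods]
  #Parameters:
    periods: number of positions to shift when forming the change; int
    fill_method: how NaN values are filled before the change is computed; str
    limit: maximum number of consecutive NaN values to fill; int

#Example (continued):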
>>> s.pct_change()
0         NaN
2    2.000000#=(3-1)/1
4   -0.333333#=(2-3)/3
6   -2.000000#=(-2-2)/2
8   -6.000000#=(10-(-2))/(-2)
dtype: float64
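
The change rate can also be written out by hand as the ratio to the previous element minus one. A short sketch, assuming a Series without missing values so that no filling is involved:

```python
import pandas as pd

s = pd.Series([1, 3, 2, -2, 10], index=[0, 2, 4, 6, 8])

pct = s.pct_change()

# Equivalent manual form: current value divided by the previous one, minus 1.
manual = s / s.shift(1) - 1

# Compare the non-NaN entries within a small tolerance.
max_gap = (pct - manual).abs().dropna().max()
print(max_gap)
```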

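(3) Comparison:

Test whether the values of <S1> are less than those of <S2>: <S1>.lt(<S2>)
Test whether the values of <S1> are greater than those of <S2>: <S1>.gt(<S2>)
Test whether the values of <S1> are less than or equal to those of <S2>: <S1>.le(<S2>)
Test whether the values of <S1> are greater than or equal to those of <S2>: <S1>.ge(<S2>)
Test whether the values of <S1> differ from those of <S2>: <S1>.ne(<S2>)
Test whether the values of <S1> equal those of <S2>: <S1>.eq(<S2>)
  #Each returns a Series whose elements are the comparison results at the corresponding labels

#Example: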
>>> a=pd.Series([0,1,2,3,4,5])
>>> b=pd.Series([-5,-4,-3,-2,-1,0])
>>> a.lt(b)
0    False
1    False
2    False
3    False
4    False
5    False
dtype: bool
>>> a.gt(b)
0    True
1    True
2    True
3    True
4    True
5    True
dtype: bool

#################################################################################################

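Test whether the values of <S> fall within a given range: <S>.between(<min>,<max>)
  #Returns a Series whose elements are the results at the corresponding positions
  #Parameters:
    min,max: the lower and upper bounds of the range
      #The range is the closed interval [<min>,<max>]
      #If min>max, every element of the result is False

#Example (continued):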
>>> a.between(2,4)
0    False
1    False
2     True
3     True
4     True
5    False
dtype: bool
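
Since the range is closed at both ends, between behaves like ge combined with le. The sketch below checks that and uses the boolean result as a filter; it is an illustration added here, not part of the original notes.

```python
import pandas as pd

a = pd.Series([0, 1, 2, 3, 4, 5])

in_range = a.between(2, 4)

# Closed interval: the same mask built from two comparison methods.
manual = a.ge(2) & a.le(4)

# A boolean Series can be used directly to select elements.
selected = a[in_range]
print(selected)
```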

3. Statistics

Basic summary statistics: <S>.describe()
  #Returns a Series

#Example:
>>> s=pd.Series([1,2,3,4,5],index=['a','b','c','d','e'])
>>> s.describe()
count    5.000000#number of non-missing values
mean     3.000000#mean
std      1.581139#sample standard deviation
min      1.000000#minimum
25%      2.000000#first quartile
50%      3.000000#median
75%      4.000000#third quartile
max      5.000000#maximum
dtype: float64

(1) Sums and products:

Sum of all elements: <S>.sum()
Series of cumulative sums: <S>.cumsum()
Product of all elements: <S>.prod()
Series of cumulative products: <S>.cumprod()

#Example (continued):
>>> s.sum()
15
>>> s.cumsum()
a     1#=1
b     3#=1+2
c     6#=1+2+3
d    10#=1+2+3+4
e    15#=1+2+3+4+5
dtype: int64
>>> s[1:].prod()
120
>>> s[1:].cumprod()
b      2#=2
c      6#=2*3
d     24#=2*3*4
e    120#=2*3*4*5
dtype: int64

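(2) Extremes:

Maximum: <S>.max()
Label of the maximum: <S>.idxmax()
Integer position of the maximum: <S>.argmax()
Minimum: <S>.min()
Label of the minimum: <S>.idxmin()
Integer position of the minimum: <S>.argmin()
The n largest values, in descending order: <S>.nlargest([n=5])
The n smallest values, in ascending order: <S>.nsmallest([n=5])
Running maximum of the first 1..n elements: <S>.cummax()
Running minimum of the first 1..n elements: <S>.cummin()
  #Parameters:
    n: number of values to return; defaults to 5

#Example (continued):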
>>> s.max()
5
>>> s.idxmax()
'e'
>>> s.argmax()
4
>>> s.min()
1
>>> s.idxmin()
'a'
>>> s.argmin()
0
>>> s.nlargest()
e    5
d    4
c    3
b    2
a    1
dtype: int64
>>> s.nsmallest(3)
a    1
b    2
c    3
dtype: int64
>>> s=pd.Series([3,1,5,2,11],index=['a','b','c','d','e'])
>>> s.cummax()
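a     3#maximum of the first 1 element
b     3#maximum of the first 2 elements
c     5#maximum of the first 3 elements
d     5#maximum of the first 4 elements
e    11#maximum of the first 5 elements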
dtype: int64
>>> s.cummin()
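a    3#minimum of the first 1 element
b    1#minimum of the first 2 elements
c    1#minimum of the first 3 elements
d    1#minimum of the first 4 elements
e    1#minimum of the first 5 elements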
dtype: int64
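
The difference between idxmax and argmax is easiest to see by feeding each result to the matching accessor: labels go to .loc and integer positions go to .iloc. A minimal sketch using the same data as above:

```python
import pandas as pd

s = pd.Series([3, 1, 5, 2, 11], index=["a", "b", "c", "d", "e"])

label = s.idxmax()     # index label of the maximum
position = s.argmax()  # integer position of the maximum

# Both routes lead to the same element.
by_label = s.loc[label]
by_position = s.iloc[position]
print(label, position, by_label, by_position)
```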

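(3) Measures of a distribution:

Mean: <S>.mean()
Median: <S>.median()
Quantile: <S>.quantile(<q>)
  #With the default linear interpolation: sort the values, take the position p=q*(len(<S>)-1), and interpolate between the sorted values at ⌊p⌋ and ⌈p⌉
  #Parameters:
    q: the quantile level; float in [0,1]
      #0.5 gives the median; 0.25 gives the first quartile

#Example: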
>>> s=pd.Series([1,2,3,4,5],index=['a','b','c','d','e'])
>>> s.mean()
3.0
>>> s.median()
3.0
>>> s.quantile(0.5)
3.0
>>> s.quantile(1)
5.0
>>> s.quantile(0)
1.0
>>> s.quantile(0.4)
2.6#=2+0.6*(3-2); the position 0.4*(5-1)=1.6 lies between the sorted values 2 and 3
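
The interpolation behind quantile can be sketched in plain Python. The helper linear_quantile below is a hypothetical name introduced only for this illustration, and it assumes the default linear interpolation; the data are deliberately unsorted and unevenly spaced.

```python
import math

import pandas as pd


def linear_quantile(values, q):
    """Hypothetical helper: quantile by linear interpolation on sorted values."""
    ordered = sorted(values)
    position = q * (len(ordered) - 1)   # fractional position in the sorted data
    lower = math.floor(position)
    upper = math.ceil(position)
    fraction = position - lower
    return ordered[lower] + fraction * (ordered[upper] - ordered[lower])


s = pd.Series([10, 1, 4, 2, 7])
manual = linear_quantile(s.tolist(), 0.4)
builtin = s.quantile(0.4)
print(manual, builtin)
```

With evenly spaced data the result happens to coincide with first+q*(last-first), but in general the two neighbouring sorted values are what matter.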

#################################################################################################

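Mean absolute deviation: <S>.mad([axis=None,skipna=None,level=None])
  #Mean of |X_i-X̄|, where X̄ is the mean
  #Note: Series.mad was deprecated in pandas 1.5 and removed in pandas 2.0

#Example (continued):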
>>> s.mad()
1.2#=[|1-3|+|2-3|+...+|5-3|]/5
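
On pandas versions where mad is not available, the same quantity can be computed from its definition. A minimal sketch using only methods that exist across versions:

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4, 5], index=["a", "b", "c", "d", "e"])

# Mean absolute deviation: average distance of each value from the mean.
mean_abs_dev = (s - s.mean()).abs().mean()
print(mean_abs_dev)
```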

#################################################################################################

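Sample standard deviation: <S>.std()
  #Note: this is the sample, not the population, standard deviation: S=math.sqrt(Σ(X_i-X̄)^2/(N-1)), where X̄ is the mean
Sample variance: <S>.var()
  #Note: this is the sample, not the population, variance: S^2=Σ(X_i-X̄)^2/(N-1)

#Example (continued):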
>>> s.std()
1.5811388300841898
>>> s.var()
2.5

#################################################################################################

Kurtosis: <S>.kurt()
Skewness: <S>.skew()

#Example (continued):
>>> s.kurt()
-1.2000000000000002
>>> s.skew()
0.0

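(4) Relationships between distributions:

Sample covariance: <S1>.cov(<other>[,min_periods=None,ddof=1])
  #With the default ddof=1: S_{XY}=[Σ(X_i-X̄)(Y_i-Ȳ)]/(N-1)
  #Parameters:
    other: the other Series
    min_periods: minimum number of valid observations required; int
    ddof: delta degrees of freedom, the divisor is N-ddof; int

#Example: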
>>> a=pd.Series([0,1,2,3,4,5])
>>> b=pd.Series([-5,-4,-3,-2,-1,0])
>>> a.cov(b)
3.5#=[(-2.5)^2+(-1.5)^2+(-0.5)^2+2.5^2+1.5^2+0.5^2]/5
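
The divisor used by cov can be confirmed by computing the covariance from the deviations directly. The sketch below assumes the default settings of cov and divides by N-1:

```python
import pandas as pd

a = pd.Series([0, 1, 2, 3, 4, 5])
b = pd.Series([-5, -4, -3, -2, -1, 0])

# Sum of products of deviations from the respective means.
cross = ((a - a.mean()) * (b - b.mean())).sum()

sample_cov = cross / (len(a) - 1)   # divisor N-1 (sample covariance)
population_cov = cross / len(a)     # divisor N (population covariance)

print(sample_cov, population_cov, a.cov(b))
```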

#################################################################################################

Correlation coefficient: <S1>.corr(<other>[,method="pearson",min_periods=None])
  #Parameters:
    method: which correlation coefficient to compute; 'pearson'/'kendall'/'spearman'/callable object

#Example (continued):
>>> a.corr(b)
1.0

(5) Counting:

Count how often each value occurs: <S>.value_counts()

#Example:
>>> s=pd.Series([1,2,"a","a","b"])
>>> s.value_counts()
a    2
b    1
2    1
1    1
dtype: int64

#################################################################################################

Number of non-missing elements: <S>.count()
  #None and NaN are not counted

#Example (continued):
>>> s.count()
5
>>> pd.Series([1,2,3,None,math.nan]).count()
3

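4. Data preprocessing
(1) Missing values:

For numeric data, Pandas uses the floating-point value NaN (Not a Number) to mark missing data; this is called a "sentinel value". np.nan, None and math.nan all become NaN in a numeric Series: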
>>> s=pd.Series([None,np.nan,math.nan])
>>> s
0   NaN
1   NaN
2   NaN
dtype: float64

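Test for missing values: <S>.isna()
  #Returns a Series that is True where <S> is missing and False elsewhere; equivalent to <S>.isnull()
Test for non-missing values: <S>.notna()
  #Returns a Series that is False where <S> is missing and True elsewhere; equivalent to <S>.notnull()

#Example: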
>>> s=pd.Series([1,None,3])
>>> s.isna()
0    False
1     True
2    False
dtype: bool
>>> s.notna()
0     True
1    False
2     True
dtype: bool

#################################################################################################

Drop missing values: <S>.dropna([axis=0,inplace=False,how=None])

#Example (continued):
>>> s.dropna()
0    1.0
2    3.0
dtype: float64

#################################################################################################

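Fill missing values: <S>.fillna([value=None,method=None,axis=None,inplace=False,limit=None,downcast=None])
  #Parameters:
    value: the value used for filling
      #Every missing value is filled with <value>, whether there is one or many
Fill missing values by interpolating from the neighbouring values: <S>.interpolate()

#Example (continued):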
>>> s.fillna(2)
0    1.0
1    2.0
2    3.0
dtype: float64
>>> s.interpolate()
0    1.0
1    2.0
2    3.0
dtype: float64
>>> s=pd.Series([1,None,None,None,44,None])
>>> s.fillna(2)
0     1.0
1     2.0
2     2.0
3     2.0
4    44.0
5     2.0
dtype: float64
>>> s.interpolate()
0     1.00
1    11.75
2    22.50
3    33.25
4    44.00
5    44.00
dtype: float64
>>> s=pd.Series([None,None,None,None])
>>> s.interpolate()
0    None
1    None
2    None
3    None
dtype: object
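
Besides a constant value, gaps can be filled from neighbouring elements. The sketch below contrasts forward filling with linear interpolation on the same data; it uses the dedicated ffill method instead of the method argument of fillna, which is a choice made for this illustration.

```python
import pandas as pd

s = pd.Series([1, None, None, None, 44, None])

# Forward fill: repeat the last valid value.
forward = s.ffill()

# Linear interpolation: spread the gap evenly between the valid neighbours.
linear = s.interpolate()

print(forward)
print(linear)
```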

#################################################################################################

Fill missing values from another Series: <S>.combine_first(<other>)
  #Where <S>[i] is NaN, <other>[i] is used instead
  #Parameters:
    other: the Series that supplies the replacement values

#Example:
>>> s1=pd.Series([1,2,None,4,None,None,7,8,9,None])
>>> s2=pd.Series([1,2,3,None,None,6,None,8,9,10])
>>> s1.combine_first(s2)
0     1.0
1     2.0
2     3.0
3     4.0
4     NaN
5     6.0
6     7.0
7     8.0
8     9.0
9    10.0
dtype: float64
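
When the two Series do not share the same labels, combine_first returns the union of both indexes, whereas filling with fillna keeps only the labels of the caller. A small sketch with made-up data showing the difference:

```python
import pandas as pd

# Illustrative data: s2 has one more label than s1.
s1 = pd.Series([1.0, None], index=[0, 1])
s2 = pd.Series([10.0, 20.0, 30.0], index=[0, 1, 2])

combined = s1.combine_first(s2)  # index is the union of both indexes
filled = s1.fillna(s2)           # index stays that of s1

print(combined)
print(filled)
```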

(2) Duplicates:

Remove duplicates: <S>.drop_duplicates()
  #Returns a Series

#Example:
>>> s=pd.Series([1,2,"a","a","b"])
>>> s.drop_duplicates()
0    1
1    2
2    a
4    b
dtype: object

#################################################################################################

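Test for duplicates: <S>.duplicated()
  #Returns a boolean Series: True for a duplicate, False otherwise
  #For a repeated value, the first occurrence gives False and later occurrences give True (that is, True if the value has appeared before)

#Example (continued):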
>>> s.duplicated()
0    False
1    False
2    False
3     True
4    False
dtype: bool

#################################################################################################

Unique values: <S>.unique()
  #Returns a numpy.ndarray

#Example (continued):
>>> s.unique()
array([1, 2, 'a', 'b'], dtype=object)
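
The three deduplication methods are closely related: dropping the rows flagged by duplicated gives the same result as drop_duplicates. The sketch below also shows the keep parameter, which the notes above do not cover, as a hedged illustration.

```python
import pandas as pd

s = pd.Series([1, 2, "a", "a", "b"])

# Keeping the rows that are not flagged equals drop_duplicates().
manual = s[~s.duplicated()]
same = manual.equals(s.drop_duplicates())

# keep="last" flags earlier occurrences; keep=False flags every occurrence.
flag_last = s.duplicated(keep="last")
flag_all = s.duplicated(keep=False)

print(same)
print(flag_last.tolist(), flag_all.tolist())
```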

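(3) Transformation:

Conform to a new index: <S>.reindex([index=None,method=None,copy=True,level=None,fill_value=np.nan,limit=None,tolerance=None])
  #The elements of <S>, together with their labels, are rearranged to follow the order of the labels in index
  #Parameters:
    index: the new labels and their order; Index object/array-like
    method: fill method; None (no filling)/'backfill' or 'bfill' (use the next value, i.e. backward fill)/'pad' or 'ffill' (use the previous value, i.e. forward fill)/'nearest' (use the nearest value)
    fill_value: fill value; scalar
      #If a label in index does not exist in <S>, the two parameters above determine its value
    copy: whether to return a new object even when the result is identical to the original; bool
    limit: maximum number of consecutive elements to forward/backward fill; int
    tolerance: maximum distance between the label a fill value comes from and the label being filled; scalar/list-like
      #That is, abs(index[indexer]-target)<=tolerance

#Example: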
>>> s=pd.Series([0,1,2,3],index=["a","c","b","e"])
>>> s.reindex()
a    0
c    1
b    2
e    3
dtype: int64
>>> s.reindex(s.index)
a    0
c    1
b    2
e    3
dtype: int64
>>> s.reindex(["a","b","c","d"])
a    0.0
b    2.0
c    1.0
d    NaN
dtype: float64
>>> s.reindex(np.array(["a","b","c","d"]))
a    0.0
b    2.0
c    1.0
d    NaN
dtype: float64
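
The examples above leave new labels as NaN. The following sketch shows the two filling parameters in action on made-up data; note that method requires a monotonic index, which is why integer labels are used here.

```python
import pandas as pd

# Illustrative data with gaps in the integer labels.
s = pd.Series([0, 1, 2], index=[0, 2, 4])
new_index = list(range(5))

# fill_value: a constant for labels that did not exist before.
constant = s.reindex(new_index, fill_value=-1)

# method="ffill": take the value of the closest earlier label.
forward = s.reindex(new_index, method="ffill")

print(constant)
print(forward)
```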

#################################################################################################

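Pivot a level of the row index into columns: <S>.unstack([level=-1,fill_value=None])
  #Parameters:
    level: which index level(s) to unstack; int/str/int list/str list
    fill_value: value that replaces NaN in the result; scalar

#Example: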
>>> s=pd.Series([1,2,3,4,5,6,7,8,9],index=[["a","a","a","b","b","c","c","d","d"],[1,2,3,1,3,1,2,2,3]])
>>> s.unstack()
     1    2    3
a  1.0  2.0  3.0
b  4.0  NaN  5.0
c  6.0  7.0  NaN
d  NaN  8.0  9.0
>>> s.unstack(0)
     a    b    c    d
1  1.0  4.0  6.0  NaN
2  2.0  NaN  7.0  8.0
3  3.0  5.0  NaN  9.0

#################################################################################################

Swap two index levels: <S>.swaplevel([i=-2,j=-1,copy=True])
  #Parameters:
    i,j: the levels to swap; int/str
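
The notes give no example for swaplevel, so here is a minimal sketch on a small two-level Series (made-up data). Only the order of the levels inside each label changes; the values and their order stay the same.

```python
import pandas as pd

# Two-level index: outer level letters, inner level numbers.
s = pd.Series([1, 2, 3], index=[["a", "a", "b"], [1, 2, 1]])

# With the defaults i=-2, j=-1 the two innermost levels are swapped.
swapped = s.swaplevel()

print(s.index.tolist())
print(swapped.index.tolist())
```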

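5. Sorting and grouping
(1) Sorting:

Ascending sort order: <S>.argsort([axis=0,kind="quicksort",order=None])
  #Returns a Series of the integer positions in <S> of the elements in sorted order
  #Parameters: axis and order have no effect; they exist only for NumPy compatibility
    kind: sorting algorithm; 'mergesort'/'quicksort'/'heapsort'

#Example: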
>>> s=pd.Series([3.1,1.1,4.1,2.1,6.1,5.1])
>>> s.argsort()
0    1
1    3
2    0
3    2
4    5
5    4
dtype: int64
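
The positions returned by argsort can be passed to .iloc to obtain the sorted Series, which ties argsort to sort_values. A short sketch, assuming data without ties or missing values:

```python
import pandas as pd

s = pd.Series([3.1, 1.1, 4.1, 2.1, 6.1, 5.1])

# Integer positions that would sort the values.
order = s.argsort().to_numpy()

# Selecting by those positions yields the sorted Series, labels included.
via_argsort = s.iloc[order]

print(via_argsort.equals(s.sort_values()))
```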

#################################################################################################

Rank of each value in ascending order: <S>.rank([axis=0,method="average",numeric_only=None,na_option="keep",ascending=True,pct=False])

#Example (continued):
>>> s.rank()
0    3.0
1    1.0
2    4.0
3    2.0
4    6.0
5    5.0
dtype: float64
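
The method parameter only matters when values are tied, which the example above does not show. A minimal sketch with made-up data containing one tie:

```python
import pandas as pd

# The value 10 appears twice, so ranks 1 and 2 are shared.
s = pd.Series([10, 20, 10, 30])

average = s.rank()               # tied values get the mean of their ranks
minimum = s.rank(method="min")   # tied values get the lowest rank
dense = s.rank(method="dense")   # like "min", but ranks have no gaps

print(average.tolist(), minimum.tolist(), dense.tolist())
```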

#################################################################################################

Sort by value: <S>.sort_values([axis=0,ascending=True,inplace=False,kind="quicksort",na_position="last",ignore_index=False,key=None])
  #Returns the sorted Series (not a Series of the sorted positions)
  #Ascending by default

#Example (continued):
>>> s.sort_values()
1    1.1
3    2.1
0    3.1
2    4.1
5    5.1
4    6.1
dtype: float64

#################################################################################################

Sort by index: <S>.sort_index([axis=0,level=None,ascending=True,inplace=False,kind="quicksort",na_position="last",sort_remaining=True,ignore_index=False,key=None])
  #Returns the sorted Series (not a Series of the sorted positions)
  #Ascending by default

#Example (continued):
>>> s.sort_values().sort_index()
0    3.1
1    1.1
2    4.1
3    2.1
4    6.1
5    5.1
dtype: float64

#################################################################################################

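Rearrange elements: <S>.reindex([index=None,**kwargs])
  #Each element keeps its label; only its position in <S> changes. On a DataFrame, both rows and columns can be rearranged
  #Parameters:
    index: the labels in their new order; array-like

#Example (continued):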
>>> s.reindex([1,0,2,3,4,5])
1    1.1
0    3.1
2    4.1
3    2.1
4    6.1
5    5.1
dtype: float64

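(2) Grouping:

See: https://www.jianshu.com/p/b94deb5c7eb1

Group: <S>.groupby([by=None,axis=0,level=None,as_index=True,sort=True,group_keys=True,squeeze=no_default,observed=False,dropna=True])
  #Returns a pandas.core.groupby.generic.SeriesGroupBy object for a Series, or a DataFrameGroupBy object for a DataFrame (as in the example below)
  #Parameters:
    by: how to group; str (column name)/list-like (group names, or a combination of other grouping methods)/dict (maps labels to group names)/function (its return value is used as the group name)
    level: which level of the row labels to group by; int/str
      #At least one of by and level must be given
    dropna: whether to ignore NaN; bool
      #If True, rows whose group key is NaN are left out of the result

#Example:
  #The column names and values are kept in Chinese: 性别=gender (男=male, 女=female), 成绩=score, 年龄=age
>>> df=pd.DataFrame({'性别':['男','女','男','女','男','女','男','男'],'成绩':['98','93','70','56','67','64','89','87'],'年龄':[15,14,15,12,13,14,15,16]})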
>>> df
  性别  成绩  年龄
0  男  98  15
1  女  93  14
2  男  70  15
3  女  56  12
4  男  67  13
5  女  64  14
6  男  89  15
7  男  87  16
>>> df.groupby('性别')
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x000001A54C8E1AC0>
>>> gb=df.groupby('性别')
>>> for name,group in gb:
...     print(name,group)
...
女   性别  成绩  年龄#'女' is name; what follows is group
1  女  93  14
3  女  56  12
5  女  64  14
男   性别  成绩  年龄#'男' is name; what follows is group
0  男  98  15
2  男  70  15
4  男  67  13
6  男  89  15
7  男  87  16
>>> gb=df.groupby(['性别','年龄'])
>>> for name,group in gb:
...     print(name,group)
...
('女', 12)   性别  成绩  年龄
3  女  56  12
('女', 14)   性别  成绩  年龄
1  女  93  14
5  女  64  14
('男', 13)   性别  成绩  年龄
4  男  67  13
('男', 15)   性别  成绩  年龄
0  男  98  15
2  男  70  15
6  男  89  15
('男', 16)   性别  成绩  年龄
7  男  87  16
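
The example above groups a DataFrame. For a Series, grouping works the same way through level or through a list of keys. The sketch below uses made-up scores and hypothetical group keys ("m"/"f" as index labels, "x"/"y" as an external key list) purely for illustration.

```python
import pandas as pd

# Illustrative data: the index labels act as the group key.
scores = pd.Series([98, 93, 70, 56], index=["m", "f", "m", "f"])

# Group by the index labels and aggregate each group.
grouped = scores.groupby(level=0)
mean_by_label = grouped.mean()

# Group by an external list of keys, one key per element.
sum_by_key = scores.groupby(["x", "x", "y", "y"]).sum()

print(type(grouped).__name__)
print(mean_by_label)
print(sum_by_key)
```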

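6. Miscellaneous:

Test whether all elements are truthy: <S>.all()
  #Returns False if any element is falsy; True if all are truthy
Test whether any element is truthy: <S>.any()
  #Returns True if any element is truthy; False if all are falsy

#Example: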
>>> s=pd.Series([0,1,2,3,4,5])
>>> s.all()
False
>>> s.any()
True

#################################################################################################

Memory usage in bytes: <S>.memory_usage([index=True,deep=False])
  #Parameters:
    index: whether the result includes the memory used by the index; bool

#Example:
>>> pd.Series([-1,3,3,0,2,6,4,-7,3,0]).memory_usage()
208
>>> pd.Series([-1,3,3,0,2,6,4,-7,3,0]).memory_usage(index=False)
80
