import numpy as np
import pandas as pd
from pandas import Series, DataFrame

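Data structures:

pandas has two main data structures: Series and DataFrame.

Series: an object similar to a one-dimensional array, made up of a set of data (of any NumPy data type) and an associated set of data labels (the index). A simple Series can also be created from the data alone, in which case a default integer index is generated. Note: index values in a Series may repeat.

DataFrame: a tabular data structure containing an ordered set of columns, where each column can hold a different value type (numeric, string, boolean, and so on). A DataFrame has both a row index and a column index, and can be thought of as a dict of Series that share the same row index.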
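The two points above (repeated index labels, and a DataFrame being built from Series) can be sketched as follows. The labels and values are made up for illustration only.

```python
import pandas as pd

# A Series pairs each value with an index label; labels need not be unique.
s = pd.Series([10, 20, 30], index=["x", "y", "x"])

# Selecting a repeated label returns every matching row as a Series.
dup = s["x"]
print(dup.tolist())          # [10, 30]
print(s.index.is_unique)     # False

# Each DataFrame column is itself a Series that shares the frame's row index.
frame = pd.DataFrame({"name": ["Tom", "John"], "score": [98, 88]})
col = frame["score"]
print(type(col).__name__)    # Series
```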

Series

Creating a Series from a one-dimensional array, and common attributes:

arr=np.array([1,2,3,4])
series01=pd.Series(arr)
series01
0    1
1    2
2    3
3    4
dtype: int32

series01.dtype  # the default integer dtype depends on the platform and NumPy version; int64 is also common
dtype('int32')

series01.index
RangeIndex(start=0, stop=4, step=1)

series01.values
array([1, 2, 3, 4])

series01.index=['product1','product2','product3','product4']  # reassign the index; the Series now behaves much like a dict
series01
product1    1
product2    2
product3    3
product4    4
dtype: int32

series02=Series(index=['a','b','c'],data=[11,22,33],dtype=np.float64)
series02
a    11.0
b    22.0
c    33.0
dtype: float64

series03=Series([100,98,84],index=['aa','bb','cc'])

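Creating a Series from a dict:

A Series can be viewed as a fixed-length, ordered dict: a mapping from index values to data values. When a Series is built from a dict, the dict keys become the index.

a_dic={'name':'wangmang','age':18,'height':180,'weight':90}
series04=Series(a_dic)
series04
age             18
height         180
name      wangmang
weight          90
dtype: object

(This output comes from an older pandas version, which sorted the dict keys. Recent versions keep the dict's insertion order: name, age, height, weight.)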
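A small sketch of the dict-like behaviour described here, reusing the same dict. The extra key "city" is hypothetical and only shows what happens when a requested label has no data.

```python
import pandas as pd

a_dic = {"name": "wangmang", "age": 18, "height": 180, "weight": 90}
series04 = pd.Series(a_dic)

# Membership tests look at the index labels, as with dict keys.
has_age = "age" in series04

# Passing an explicit index selects and orders the keys;
# a label with no matching key ("city", hypothetical) is filled with NaN.
picked = pd.Series(a_dic, index=["age", "name", "city"])

# Assignment by label works like assignment to a dict entry.
series04["age"] = 19

# A Series converts back to a plain dict.
as_dict = series04.to_dict()
```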

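Getting values from a Series:

Square brackets with an index label return the data for that label; if the label is repeated, several rows may be returned.

Square brackets with an integer position return the data at that position. Valid positions lie in the range [0, len(Series.values)).

Positions may also be negative, in which case they count from the right.

Selecting several values works like slicing a NumPy ndarray: square brackets with positions or index labels and a colon (:) extract part of the Series.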

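series04['age']  # get a value by index label

series04['age':'name']  # label slice; both endpoints are included (result depends on the index order)
age             18
height         180
name      wangmang
dtype: object

series04[1:3]  # positional slice; the end position is excluded
height         180
name      wangmang
dtype: object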
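Plain brackets accept both labels and positions, which can be ambiguous. As a hedged alternative to the bracket syntax used in this post, loc (labels) and iloc (positions) make the intent explicit. The data below is a trimmed, numeric-only version of the example, chosen for illustration.

```python
import pandas as pd

s = pd.Series([18, 180, 90], index=["age", "height", "weight"])

by_label = s.loc["age"]                # lookup by index label
label_slice = s.loc["age":"height"]    # label slice: both endpoints included

by_position = s.iloc[0]                # lookup by integer position
last = s.iloc[-1]                      # negative positions count from the right
pos_slice = s.iloc[1:3]                # positional slice: end excluded
```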

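Array operations

NumPy array operations are all preserved on Series and can be used directly. When an array operation is applied to a Series, the mapping between index and values does not change.

Note: in practice a Series can mostly be treated like a NumPy ndarray; the vast majority of ndarray operations also apply to a Series.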
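A minimal sketch of this, with illustrative values: arithmetic, NumPy functions and boolean filtering all return a Series whose labels still point at the right values.

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, 4.0, 9.0], index=["a", "b", "c"])

doubled = s * 2        # element-wise arithmetic keeps the index
roots = np.sqrt(s)     # NumPy ufuncs return a Series with the same index
large = s[s > 2]       # boolean filtering keeps the labels of the surviving rows
```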

The pandas functions isnull and notnull detect missing values in a Series; each returns a boolean Series.
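For example (values chosen for illustration), the boolean result can be used directly as a filter:

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, 3.0], index=["a", "b", "c"])

missing = pd.isnull(s)     # True where the value is missing
present = pd.notnull(s)    # the element-wise negation of isnull

kept = s[present]          # keep only the rows that have a value
```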

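When several Series are combined in an operation and their indexes differ, pandas automatically aligns the data by index value. If an index value is missing from one of the Series, the result for that index value is NaN.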
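A short sketch of index alignment with two made-up Series. The add method with fill_value is shown as one possible way to avoid the NaN results, not as something the original post uses.

```python
import pandas as pd

left = pd.Series([1, 2, 3], index=["a", "b", "c"])
right = pd.Series([10, 20, 30], index=["b", "c", "d"])

# Values are matched by label, not by position.
# "a" and "d" exist on one side only, so their result is NaN.
total = left + right

# Optionally treat a label missing on one side as 0 instead.
filled = left.add(right, fill_value=0)
```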

Both a Series and its index have a name attribute. It defaults to None and can be assigned as needed.
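A brief illustration with hypothetical names ("score", "student"): the names are carried along when the Series is turned into a DataFrame.

```python
import pandas as pd

s = pd.Series([98, 76], index=["Tom", "Garry"])
default_names = (s.name, s.index.name)   # both are None until assigned

s.name = "score"            # name of the values
s.index.name = "student"    # name of the index

frame = s.to_frame()        # the Series name becomes the column label
```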

DataFrame

Creating a DataFrame from a two-dimensional array:

df01=DataFrame([ ['Tom','Merry','John'],[76,98,100] ])
df01

	0	1	2
0	Tom	Merry	John
1	76	98	100

df02=DataFrame( [['Tom',98],['Garry',76],['John',88]], columns=['name','score'])
print(df02)
print('column index',df02.columns)
print('row index',df02.index)
print('data',df02.values)

    name  score
0    Tom     98
1  Garry     76
2   John     88
column index Index(['name', 'score'], dtype='object')
row index RangeIndex(start=0, stop=3, step=1)
data [['Tom' 98]
 ['Garry' 76]
 ['John' 88]]

arr=np.array([
['Tom',78],
['Gerry',88],
['John',89]
])  # NumPy stores rows that mix str and int as strings, so the scores here are text
df03=DataFrame(arr,index=['one','two','three'],columns=['name','score'])
df03

	name	score
one	Tom	78
two	Gerry	88
three	John	89

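Creating a DataFrame from a dict:

data={
'apart':['101','102','103','101'],
'profits':['587.1','125.2','12.2','23.5'],
'years':[2001,2005,2010,2015],
'month':'8'  # a scalar is broadcast to every row
}
df=DataFrame(data)
df

	apart	month	profits	years
0	101	8	587.1	2001
1	102	8	125.2	2005
2	103	8	12.2	2010
3	101	8	23.5	2015

(The columns appear sorted here because the output comes from an older pandas version. Recent versions keep the dict's insertion order: apart, profits, years, month.)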

Creating a DataFrame from a nested dict:

pop={'necada':{2001:2.4,2002:2.9},'ohnio':{2000:1.5,2001:1.7,2002:3.6}}

df4=DataFrame(pop)  # outer keys become columns, inner keys become the row index
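The post does not show the result for the nested dict, so here is a sketch of what to expect. The row order of the combined inner keys can differ between pandas versions, which is why sort_index is applied before inspecting the frame.

```python
import pandas as pd

pop = {"necada": {2001: 2.4, 2002: 2.9},
       "ohnio": {2000: 1.5, 2001: 1.7, 2002: 3.6}}

# Outer keys -> columns; the union of the inner keys -> row index.
df4 = pd.DataFrame(pop).sort_index()

# "necada" has no entry for 2000, so that cell is NaN.
gap = df4.loc[2000, "necada"]
print(df4)
```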

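Index objects:

Through the index you can read values from a Series or DataFrame, or assign a new value at a given index label.

The automatic alignment of Series and DataFrame is implemented through the index.
Every Series and DataFrame has an index object.
The index object manages the axis labels and other metadata (e.g. the axis name).

A column can be retrieved directly by its column label, e.g. df[column_name].
To retrieve a row, use loc with the row label, e.g. df.loc[index_name], or iloc with the integer position. Older pandas versions used ix for this (df.ix[index_name]), but ix was deprecated in pandas 0.20 and removed in 1.0.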

df['profits']
0    587.1
1    125.2
2     12.2
3     23.5
Name: profits, dtype: object

df.loc[0]  # written as df.ix[0] in old pandas; ix was removed in pandas 1.0
apart        101
month          8
profits    587.1
years       2001
Name: 0, dtype: object

df.loc[[0,1]]  # a list of row labels returns a DataFrame

	apart	month	profits	years
0	101	8	587.1	2001
1	102	8	125.2	2005
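With the default 0..n-1 row index, labels and positions coincide, which hides the difference between the two lookups. The sketch below rebuilds the same frame and assigns hypothetical string row labels ("r1" to "r4") to make the distinction between loc and iloc visible.

```python
import pandas as pd

df = pd.DataFrame({
    "apart": ["101", "102", "103", "101"],
    "profits": ["587.1", "125.2", "12.2", "23.5"],
    "years": [2001, 2005, 2010, 2015],
    "month": "8",
})
df.index = ["r1", "r2", "r3", "r4"]   # hypothetical row labels

row_by_label = df.loc["r1"]           # row selected by its label
row_by_pos = df.iloc[0]               # the same row selected by position
two_rows = df.loc[["r1", "r2"]]       # list of labels -> DataFrame
cell = df.loc["r3", "years"]          # row label plus column label -> scalar
```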

Hierarchical indexing:

series1=pd.Series(np.arange(6),index=[['a','a','b','b','b','c'],[1,2,1,2,3,1]])
series1
a  1    0
   2    1
b  1    2
   2    3
   3    4
c  1    5
dtype: int32

series1[['a','c']]
a  1    0
   2    1
c  1    5
dtype: int32

series1['a':'c']
a  1    0
   2    1
b  1    2
   2    3
   3    4
c  1    5
dtype: int32

series1[:,1]
a    0
b    2
c    5
dtype: int32

series1.unstack()
     1    2    3
a  0.0  1.0  NaN
b  2.0  3.0  4.0
c  5.0  NaN  NaN

series1.unstack().stack()
a  1    0.0
   2    1.0
b  1    2.0
   2    3.0
   3    4.0
c  1    5.0
dtype: float64

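Common methods

Method	Description
dropna	Filters axis labels depending on whether their values contain missing data; a threshold (thresh) sets how much missing data is tolerated
fillna	Fills missing data with a specified value or with a fill method such as ffill or bfill
isnull	Returns an object of boolean values indicating which values are missing (NA)
notnull	The negation of isnull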
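The table can be illustrated with a small made-up frame. This is only a sketch of the most common calls; the column names and values are assumptions for the example.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame({"a": [1.0, np.nan, 3.0],
                      "b": [np.nan, np.nan, 6.0]})

complete = frame.dropna()             # drop every row that contains any NaN
at_least_one = frame.dropna(thresh=1) # keep rows with at least 1 non-NaN value

zero_filled = frame.fillna(0)         # replace NaN with a fixed value
forward = frame.ffill()               # carry the last seen value forward

mask = frame.isnull()                 # boolean frame marking the missing cells
```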

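Common mathematical and statistical methods:

count	Number of non-NA values
describe	Summary statistics for a Series or for each DataFrame column
min/max	Minimum and maximum values
argmin/argmax	Integer index positions of the minimum and maximum values
idxmin/idxmax	Index labels of the minimum and maximum values
quantile	Sample quantile (from 0 to 1)
sum/mean	Sum and mean
median	Arithmetic median (the 50% quantile)
mad	Mean absolute deviation from the mean (removed in pandas 2.0)
var	Sample variance
std	Sample standard deviation
cumsum	Cumulative sum
cummin/cummax	Cumulative minimum and maximum
cumprod	Cumulative product
pct_change	Percent change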
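A few of these methods applied to a made-up score Series. Since mad is no longer available in recent pandas, the mean absolute deviation is computed by hand here as one possible substitute.

```python
import pandas as pd

scores = pd.Series([76, 98, 88, 100], index=["Tom", "Merry", "Garry", "John"])

n = scores.count()                 # 4 non-NA values
avg = scores.mean()                # 90.5
mid = scores.median()              # 93.0
best_label = scores.idxmax()       # 'John'  (index label of the maximum)
best_pos = scores.argmax()         # 3       (integer position of the maximum)
running = scores.cumsum()          # 76, 174, 262, 362
summary = scores.describe()        # count, mean, std, min, quartiles, max
change = scores.pct_change()       # first entry is NaN (nothing to compare to)

# Manual replacement for the removed mad method.
mean_abs_dev = (scores - scores.mean()).abs().mean()   # 8.5
```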

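df.count()  # counts the non-NA values in each column by default, i.e. axis=0
df.count(axis=1)  # counts the non-NA values in each row

df.corr()  # pairwise correlation matrix of the columns (numeric data)
df.corrwith(df2)  # correlation between df and the matching columns of df2

df.cov()  # pairwise covariance matrix of the columns (numeric data)
series1.cov(series2)  # covariance between two Series; DataFrame.cov() does not accept a second frame

series1.unique()  # array of the unique values in a Series; for a DataFrame, call it on a single column
series2.value_counts()  # how many times each value occurs in the Series
isin()  # vectorized set-membership test; useful for selecting a subset of a Series or of a DataFrame column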
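The calls above are shown without data, so here is a self-contained sketch. The Series names (apart, x, y) and their values are assumptions made for this example.

```python
import pandas as pd

apart = pd.Series(["101", "102", "103", "101"])

uniques = list(apart.unique())        # distinct values, in order of appearance
counts = apart.value_counts()         # occurrences of each value
mask = apart.isin(["101", "103"])     # True where the value is in the given set
subset = apart[mask]                  # rows whose value is in the set

# Correlation and covariance between two Series.
x = pd.Series([1, 2, 3, 4])
y = pd.Series([2, 4, 6, 8])
r = x.corr(y)                         # perfectly linear, so r is 1 (up to rounding)
c = x.cov(y)                          # sample covariance
```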
