Pandas

Two core data structures: Series and DataFrame
- Series
  - Similar to a one-dimensional NumPy array
- DataFrame
  - Similar to a two-dimensional NumPy array
Creating a Series
- From a one-dimensional array
- From a dictionary
- From a single row or column of a DataFrame
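
The three ways of creating a Series listed above can be sketched roughly as follows. The variable names and sample values are made up for illustration only.

```python
import numpy as np
import pandas as pd

# 1. From a one-dimensional array (a default integer index 0..n-1 is assigned)
from_array = pd.Series(np.array([10, 20, 30]))

# 2. From a dictionary (the keys become the index labels)
from_dict = pd.Series({"eggs": 30, "apples": 6})

# 3. From one column or one row of a DataFrame (illustrative data)
df = pd.DataFrame({"price": [1.5, 2.0], "qty": [3, 4]}, index=["a", "b"])
from_column = df["price"]  # column -> Series indexed by the row labels
from_row = df.loc["a"]     # row -> Series indexed by the column labels

print(from_array, from_dict, from_column, from_row, sep="\n\n")
```
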
Creating a DataFrame
- From a two-dimensional array
- From a dictionary
- From another DataFrame
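
A minimal sketch of the three DataFrame creation routes listed above. The labels ("r1", "a", and so on) and values are invented for the example.

```python
import numpy as np
import pandas as pd

# 1. From a two-dimensional array, with optional row and column labels
from_array = pd.DataFrame(np.arange(6).reshape(2, 3),
                          index=["r1", "r2"], columns=["a", "b", "c"])

# 2. From a dictionary (each key becomes a column)
from_dict = pd.DataFrame({"Integers": [1, 2, 3], "Floats": [4.5, 8.2, 9.6]})

# 3. From an existing DataFrame (here keeping only one of its columns)
from_frame = pd.DataFrame(from_dict, columns=["Floats"])

print(from_array, from_dict, from_frame, sep="\n\n")
```
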
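Advantages
- Rows and columns can carry labels
- Rolling statistics can be computed on time series data
- NaN values are easy to handle
- Data in different formats can be loaded into a DataFrame
- Different datasets can be joined and merged
- Integrates with NumPy and matplotlib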
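
A few of these points can be illustrated with a short sketch. The price series and the two small tables below are hypothetical data, not taken from these notes.

```python
import numpy as np
import pandas as pd

# Rolling statistics on a (hypothetical) daily time series
dates = pd.date_range("2024-01-01", periods=5, freq="D")
prices = pd.Series([1.0, 2.0, np.nan, 4.0, 5.0], index=dates)
rolling_mean = prices.rolling(window=2).mean()

# NaN handling: count, fill, or drop missing values
n_missing = prices.isnull().sum()
filled = prices.fillna(0.0)
dropped = prices.dropna()

# Combining datasets: merge two tables on a shared key column
left = pd.DataFrame({"key": ["a", "b"], "x": [1, 2]})
right = pd.DataFrame({"key": ["a", "b"], "y": [3, 4]})
merged = pd.merge(left, right, on="key")

print(rolling_mean, n_missing, merged, sep="\n\n")
```

With the default settings a rolling window that contains a NaN yields NaN, so only windows made of two valid values produce a mean here.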
Series
- shape
  - Shape of the data
- ndim
  - Number of dimensions
- size
  - Total number of elements
- index
  - The index labels of the Series
- values
  - The underlying data

import pandas as pd
import numpy as np
groceries = pd.Series(data=[30, 6, 'Yes', 'No'], index=['eggs', 'apples', 'milk', 'bread'])
# The data mixes integers and strings, so the dtype is object
print(groceries)

eggs       30
apples      6
milk      Yes
bread      No
dtype: object
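
The attributes listed earlier can be checked on the same kind of Series. This is a small self-contained sketch that rebuilds the groceries Series first.

```python
import pandas as pd

groceries = pd.Series(data=[30, 6, "Yes", "No"],
                      index=["eggs", "apples", "milk", "bread"])

print(groceries.shape)   # (4,)
print(groceries.ndim)    # 1
print(groceries.size)    # 4
print(groceries.index)   # the index labels
print(groceries.values)  # the underlying data as a NumPy array
```
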

'banana' in groceries

'apples' in groceries

False

True

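An important difference between Pandas and NumPy: each element of a Pandas Series can be given an index label.

This lets us access data in several ways:
- By index label
- By position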

groceries['eggs']
groceries[['eggs','apples']]

30
eggs      30
apples     6
dtype: object

groceries.iloc[-1]  # positional access; plain groceries[-1] on a labeled index is deprecated in recent pandas

No

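Distinguishing label-based from integer-based indexing
- loc (short for location)
  - Label-based indexing
- iloc (integer location)
  - Integer position-based indexing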
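
A short sketch of the difference, using a groceries Series like the one defined earlier. The variable names are only for illustration.

```python
import pandas as pd

groceries = pd.Series([30, 6, "Yes", "No"],
                      index=["eggs", "apples", "milk", "bread"])

# loc: select by index label
by_label = groceries.loc["eggs"]
by_labels = groceries.loc[["eggs", "apples"]]

# iloc: select by integer position
by_position = groceries.iloc[0]
last = groceries.iloc[-1]
by_positions = groceries.iloc[[0, 1]]

print(by_label, by_position, last)
```

Here loc["eggs"] and iloc[0] refer to the same element, once by label and once by position.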
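Deleting elements
- pd_name.drop('idx_name', inplace=True or False)
  - inplace controls whether the original Series is modified (True) or a new Series is returned (False, the default)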
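
The effect of inplace can be seen in a small sketch. The Series is rebuilt here so the example stands alone.

```python
import pandas as pd

groceries = pd.Series([30, 6, "Yes", "No"],
                      index=["eggs", "apples", "milk", "bread"])

# inplace=False (the default): a new Series is returned, the original is untouched
without_apples = groceries.drop("apples")
still_has_apples = "apples" in groceries

# inplace=True: the original Series itself is modified and None is returned
groceries.drop("apples", inplace=True)
has_apples_now = "apples" in groceries

print(still_has_apples, has_apples_now)
```
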
Pandas arithmetic operations
- Element-wise, as in NumPy

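# Series used in the arithmetic examples (values match the output below)
fruits = pd.Series(data=[10, 6, 3], index=['apples', 'oranges', 'bananas'])
print("fruits + 2 = \n{}".format(fruits + 2))
print("fruits - 2 = \n{}".format(fruits - 2))
print("fruits * 2 = \n{}".format(fruits * 2))
print("fruits / 2 = \n{}".format(fruits / 2))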

fruits + 2 = 
apples     12
oranges     8
bananas     5
dtype: int64
fruits - 2 = 
apples     8
oranges    4
bananas    1
dtype: int64
fruits * 2 = 
apples     20
oranges    12
bananas     6
dtype: int64
fruits / 2 = 
apples     5.0
oranges    3.0
bananas    1.5
dtype: float64
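
Since arithmetic works as in NumPy, NumPy functions can also be applied to a Series directly, and operations can target selected elements. A minimal sketch, assuming the same fruits values as above:

```python
import numpy as np
import pandas as pd

fruits = pd.Series([10, 6, 3], index=["apples", "oranges", "bananas"])

# NumPy functions are applied element-wise and keep the index labels
roots = np.sqrt(fruits)
squares = np.power(fruits, 2)

# Arithmetic on a single element or on a selection of elements
bananas_plus_two = fruits["bananas"] + 2
doubled_subset = fruits.loc[["apples", "oranges"]] * 2

print(roots, squares, bananas_plus_two, doubled_subset, sep="\n\n")
```
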

DataFrame

Creating a DataFrame

# Build a DataFrame from a dictionary of labeled Series
items = {"Bob" : pd.Series([245,25,55],['bike','pants','watch']),
        'Alice' : pd.Series([40,110,500,45],['book','glasses','bike','pants'])}


shopping_carts = pd.DataFrame(items)
shopping_carts

	Bob	Alice
bike	245.0	500.0
book	NaN	40.0
glasses	NaN	110.0
pants	25.0	45.0
watch	55.0	NaN

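- If the Series are given no index, a default integer index is used

# Without explicit labels, both Series get the default integer index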
items = {"Bob" : pd.Series([245,25,55]),
        'Alice' : pd.Series([40,110,500,45])}


shopping_carts = pd.DataFrame(items)
shopping_carts

Bob	Alice
0	245.0	40
1	25.0	110
2	55.0	500
3	NaN	45

- df_name.values
- df_name.size
- df_name.ndim
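
These attributes can be inspected on the labeled shopping_carts DataFrame from above. The sketch below rebuilds it so it runs on its own, and also shows shape, which is not in the list above.

```python
import pandas as pd

items = {"Bob": pd.Series([245, 25, 55], index=["bike", "pants", "watch"]),
         "Alice": pd.Series([40, 110, 500, 45],
                            index=["book", "glasses", "bike", "pants"])}
shopping_carts = pd.DataFrame(items)

print(shopping_carts.shape)   # (rows, columns)
print(shopping_carts.ndim)    # number of dimensions
print(shopping_carts.size)    # total number of cells, NaN included
print(shopping_carts.values)  # the data as a 2D NumPy array
```
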
Selecting a subset at construction
- columns
- index

bob_shopping_carts = pd.DataFrame(items,columns=['Bob'])
bob_shopping_carts


Bob
bike	245
pants	25
watch	55

sel_shopping_carts = pd.DataFrame(items,index=['pants','book'])
sel_shopping_carts

	Bob	Alice
pants	25.0	45
book	NaN	40

Creating from a dictionary of lists
Creating from a list of dictionaries

# We create a dictionary of lists (arrays)
data = {'Integers' : [1,2,3],
        'Floats' : [4.5, 8.2, 9.6]}

# We create a DataFrame 
df = pd.DataFrame(data)

# We display the DataFrame
df

 	Floats 	Integers
0 	4.5 	1
1 	8.2 	2
2 	9.6 	3

# We create a list of Python dictionaries
items2 = [{'bikes': 20, 'pants': 30, 'watches': 35}, 
          {'watches': 10, 'glasses': 50, 'bikes': 15, 'pants':5}]

# We create a DataFrame 
store_items = pd.DataFrame(items2)

# We display the DataFrame
store_items

 	bikes 	glasses 	pants 	watches
0 	20 	NaN 	30 	35
1 	15 	50.0 	5 	10

Accessing a DataFrame
- Use df_name['col_key']['row_key']
- The column label comes first, then the row label

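# The row labels 'store 1' and 'store 2' used from here on are assumed to be set like this
store_items = pd.DataFrame(items2, index=['store 1', 'store 2'])

x = list(store_items.columns)
# store_items[x] is still a DataFrame, so ['store 1'] is looked up as a
# column label and raises a KeyError
print(store_items[x]['store 1'])

The above does not work: indexing with [] selects columns, not rows. Select a row by label with loc instead:

store_items.loc['store 1']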

bikes      20.0
glasses     NaN
pants      30.0
watches    35.0
Name: store 1, dtype: float64
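
To reach a single cell, the column-then-row form and loc give the same value. A minimal sketch, assuming store_items carries the row labels 'store 1' and 'store 2':

```python
import pandas as pd

items2 = [{"bikes": 20, "pants": 30, "watches": 35},
          {"watches": 10, "glasses": 50, "bikes": 15, "pants": 5}]
# The row labels are an assumption matching the 'store 1' label used in these notes
store_items = pd.DataFrame(items2, index=["store 1", "store 2"])

# Column label first, then row label
col_then_row = store_items["bikes"]["store 1"]

# With loc the order is row label, then column label
row_then_col = store_items.loc["store 1", "bikes"]

# A missing entry shows up as NaN
missing = store_items.loc["store 1", "glasses"]

print(col_then_row, row_then_col, missing)
```
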

Adding a new column by direct assignment

store_items['shirts']=[15,2]
store_items

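Adding a new row (append)

new_items = [{'bikes':20,'pants':30,'watches':35,'glasses':4}]
new_store = pd.DataFrame(new_items,index = ['store 3'])
# DataFrame.append was removed in pandas 2.0; pd.concat gives the same result
store_items = pd.concat([store_items, new_store])
store_items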

	bikes	glasses	pants	shirts	watches
store 1	20	NaN	30	15.0	35
store 2	15	50.0	5	2.0	10
store 3	20	4.0	30	NaN	35

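Adding a new column from existing data

# The suits column in the output below is consistent with pants + shirts (assumed step)
store_items['suits'] = store_items['pants'] + store_items['shirts']
# Only the stores from position 1 onward get a value; store 1 is filled with NaN
store_items['new watches'] = store_items['watches'][1:]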


	bikes	glasses	pants	shirts	suits	watches	new watches
store 1	20	NaN	30	15.0	45.0	35	NaN
store 2	15	50.0	5	2.0	7.0	10	10.0
store 3	20	4.0	30	NaN	NaN	35	35.0

Adding a new column at a given position (insert)
- df_name.insert(loc, label, data)
  - loc : integer position of the new column
  - label : column label
  - data : column values

# We insert a new column with label shoes right before the column with numerical index 4
store_items.insert(4, 'shoes', [8,5,0])

# we display the modified DataFrame
store_items

	bikes	glasses	pants	shirts	shoes	suits	watches	new watches
store 1	20	NaN	30	15.0	8	45.0	35	NaN
store 2	15	50.0	5	2.0	5	7.0	10	10.0
store 3	20	4.0	30	NaN	0	NaN	35	35.0

Deleting rows or columns
- df_name.pop('col_key')
  - Removes columns only
- df_name.drop(['key1', 'key2'], axis=0 or 1)
  - Removes rows (axis=0) or columns (axis=1)

# We remove the new watches column
store_items.pop('new watches')

# we display the modified DataFrame
store_items


	bikes	glasses	pants	shirts	shoes	suits	watches
store 1	20	NaN	30	15.0	8	45.0	35
store 2	15	50.0	5	2.0	5	7.0	10
store 3	20	4.0	30	NaN	0	NaN	35

# We remove the watches and shoes columns
store_items = store_items.drop(['watches', 'shoes'], axis = 1)

# we display the modified DataFrame
store_items


	bikes	glasses	pants	shirts	suits
store 1	20	NaN	30	15.0	45.0
store 2	15	50.0	5	2.0	7.0
store 3	20	4.0	30	NaN	NaN

# We remove the store 2 and store 1 rows
store_items = store_items.drop(['store 2', 'store 1'], axis = 0)

# we display the modified DataFrame
store_items


	bikes	glasses	pants	shirts	suits
store 3	20	4.0	30	NaN	NaN

Renaming row and column labels
- df_name.rename(columns={'old_name': 'new_name'}) or df_name.rename(index={'old_name': 'new_name'})

# We change the column label bikes to hats
store_items = store_items.rename(columns = {'bikes': 'hats'})

# we display the modified DataFrame
store_items

	hats	glasses	pants	shirts	suits
store 3	20	4.0	30	NaN	NaN

# We change the row label from store 3 to last store
store_items = store_items.rename(index = {'store 3': 'last store'})

# we display the modified DataFrame
store_items

	hats	glasses	pants	shirts	suits
last store	20	4.0	30	NaN	NaN

Using a column as the index

# We change the row index to be the data in the pants column
store_items = store_items.set_index('pants')

# we display the modified DataFrame
store_items

pants	hats	glasses	shirts	suits
30	20	4.0	NaN	NaN
