Pandas
Pandas provides two core data structures: Series and DataFrame.
- Series
  - similar to a one-dimensional NumPy array
- DataFrame
  - similar to a two-dimensional NumPy array
Creating a Series
- from a one-dimensional array
- from a dictionary
- from a single row or column of a DataFrame
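The three ways of creating a Series listed above can be sketched as follows. The variable names and sample values are illustrative assumptions, not taken from the notes.

```python
import numpy as np
import pandas as pd

# 1. From a one-dimensional NumPy array (default integer index 0, 1, 2)
from_array = pd.Series(np.array([10, 20, 30]))

# 2. From a dictionary: the keys become the index labels
from_dict = pd.Series({"eggs": 30, "apples": 6})

# 3. From one column or one row of a DataFrame (hypothetical sample frame)
df = pd.DataFrame({"price": [1.5, 2.0], "qty": [3, 4]}, index=["a", "b"])
from_column = df["price"]  # column -> Series indexed by the row labels
from_row = df.loc["a"]     # row -> Series indexed by the column labels
```

In the third case the resulting Series keeps the labels of the other axis as its index.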
Creating a DataFrame
- from a two-dimensional array
- from a dictionary
- from another DataFrame
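A minimal sketch of the three DataFrame construction routes listed above. The column names `x` and `y` and the values are assumptions chosen only for illustration.

```python
import numpy as np
import pandas as pd

# 1. From a two-dimensional NumPy array, naming the columns explicitly
from_array = pd.DataFrame(np.array([[1, 2], [3, 4]]), columns=["x", "y"])

# 2. From a dictionary: each key becomes a column
from_dict = pd.DataFrame({"x": [1, 3], "y": [2, 4]})

# 3. From an existing DataFrame, here keeping only the "y" column
from_frame = pd.DataFrame(from_dict, columns=["y"])
```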
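Advantages
- supports labels for both rows and columns
- can compute rolling statistics over time-series data
- handles NaN values easily
- can load data from different file formats into a DataFrame
- can join and merge different datasets
- integrates with NumPy and matplotlib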
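Two of the advantages above, NaN handling and rolling statistics on time-series data, can be illustrated with a small sketch. The price values and dates are made up for the example.

```python
import numpy as np
import pandas as pd

# A short daily time series with one missing value (hypothetical data)
prices = pd.Series(
    [10.0, np.nan, 12.0, 13.0],
    index=pd.date_range("2024-01-01", periods=4, freq="D"),
)

# Fill the gap by carrying the last known value forward
filled = prices.ffill()

# Rolling mean over a window of two observations;
# the first entry is NaN because the window is not yet full
rolling_mean = filled.rolling(window=2).mean()
```

Forward filling is only one possible strategy; `fillna` or `dropna` may suit other data better.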
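Series attributes
- shape
  - the shape of the data, as a tuple
- ndim
  - the number of dimensions
- size
  - the total number of elements
- index
  - the index labels of the Series
- values
  - the underlying data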
import pandas as pd
import numpy as np
# A Series may hold mixed types, here both integers and strings
groceries = pd.Series(data=[30, 6, 'Yes', 'No'], index=['eggs', 'apples', 'milk', 'bread'])
print(groceries)
eggs       30
apples      6
milk      Yes
bread      No
dtype: object
# The in operator checks the index labels, not the values
'banana' in groceries
False
'apples' in groceries
True
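The Series attributes listed earlier can be inspected on the same kind of grocery Series. This is a small sketch that rebuilds the Series so it runs on its own.

```python
import pandas as pd

groceries = pd.Series(
    data=[30, 6, "Yes", "No"],
    index=["eggs", "apples", "milk", "bread"],
)

print(groceries.shape)   # a one-element tuple: (4,)
print(groceries.ndim)    # 1, a Series is always one-dimensional
print(groceries.size)    # 4 elements in total
print(groceries.index)   # the index labels
print(groceries.values)  # the data as a NumPy array
```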
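An important difference from NumPy is that each element of a Pandas Series can carry an index label.
- This allows data to be accessed in more than one way
  - by index label
  - by integer position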
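# Access by a single label
groceries['eggs']
30
# Access by a list of labels
groceries[['eggs', 'apples']]
eggs      30
apples     6
dtype: object
# Access the last element by position; iloc is used because positional access through plain [] is deprecated in recent pandas versions
groceries.iloc[-1]
No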
Label indexing versus positional indexing
- loc
  - selects by index label
- iloc
  - selects by integer position
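The difference between `loc` and `iloc` is easiest to see side by side. The Series below is a made-up example; note that label slices include the end label while positional slices exclude the end position.

```python
import pandas as pd

s = pd.Series([30, 6, "Yes"], index=["eggs", "apples", "milk"])

by_label = s.loc["apples"]  # lookup by index label
by_position = s.iloc[1]     # lookup by integer position (same element)

label_slice = s.loc["eggs":"apples"]  # label slice: end label is included
position_slice = s.iloc[0:1]          # positional slice: end is excluded
```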
Deleting elements
- pd_name.drop('idx_name', inplace=True or False)
  - inplace=True modifies the original Series; inplace=False (the default) leaves it unchanged and returns a new Series
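A short sketch of both `drop` modes, using a hypothetical Series `s`:

```python
import pandas as pd

s = pd.Series([30, 6, 4], index=["eggs", "apples", "milk"])

# Default (inplace=False): returns a new Series, s itself is unchanged
without_apples = s.drop("apples")

# inplace=True: modifies s directly and returns None
s.drop("milk", inplace=True)
```

Assigning the returned Series back to a name is often easier to follow than `inplace=True`.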
Pandas arithmetic operations
- element-wise, the same as in NumPy
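# Assumed definition of fruits, inferred from the output shown
fruits = pd.Series(data=[10, 6, 3], index=['apples', 'oranges', 'bananas'])
print("fruits + 2 = \n{}".format(fruits + 2))
print("fruits - 2 = \n{}".format(fruits - 2))
print("fruits * 2 = \n{}".format(fruits * 2))
print("fruits / 2 = \n{}".format(fruits / 2))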
fruits + 2 =
apples 12
oranges 8
bananas 5
dtype: int64
fruits - 2 =
apples 8
oranges 4
bananas 1
dtype: int64
fruits * 2 =
apples 20
oranges 12
bananas 6
dtype: int64
fruits / 2 =
apples 5.0
oranges 3.0
bananas 1.5
dtype: float64
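Because the operations are element-wise, NumPy functions can also be applied to a Series, and arithmetic can be limited to selected elements. In this sketch the definition of `fruits` is an assumption inferred from the printed results above.

```python
import numpy as np
import pandas as pd

# Assumed values, consistent with the outputs shown above
fruits = pd.Series([10, 6, 3], index=["apples", "oranges", "bananas"])

roots = np.sqrt(fruits)        # NumPy function applied element-wise
squares = np.power(fruits, 2)  # index labels are preserved

# Arithmetic on a subset selected by label
doubled_subset = fruits[["apples", "oranges"]] * 2
bananas_plus_two = fruits["bananas"] + 2
```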
DataFrame
- Creating a DataFrame
# Create a DataFrame from a dictionary of labelled Series
items = {"Bob" : pd.Series([245,25,55],['bike','pants','watch']),
'Alice' : pd.Series([40,110,500,45],['book','glasses','bike','pants'])}
shopping_carts = pd.DataFrame(items)
shopping_carts
           Bob  Alice
bike     245.0  500.0
book       NaN   40.0
glasses    NaN  110.0
pants     25.0   45.0
watch     55.0    NaN
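- If the Series are given no index, a default integer index is used
# Create a DataFrame from a dictionary of Series without index labels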
items = {"Bob" : pd.Series([245,25,55]),
'Alice' : pd.Series([40,110,500,45])}
shopping_carts = pd.DataFrame(items)
shopping_carts
     Bob  Alice
0  245.0     40
1   25.0    110
2   55.0    500
3    NaN     45
- df_name.values
- df_name.size
- df_name.ndim
- Selecting a subset of the data at construction time
  - columns
  - index
# items here is the dictionary of labelled Series from the first DataFrame example
bob_shopping_carts = pd.DataFrame(items, columns=['Bob'])
bob_shopping_carts
       Bob
bike   245
pants   25
watch   55
sel_shopping_carts = pd.DataFrame(items, index=['pants', 'book'])
sel_shopping_carts
        Bob  Alice
pants  25.0     45
book    NaN     40
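The `values`, `size` and `ndim` attributes mentioned above can be checked on the shopping cart DataFrame. The sketch rebuilds the labelled `items` dictionary so that it runs on its own; `shape` is added for comparison and is not in the list above.

```python
import pandas as pd

items = {
    "Bob": pd.Series([245, 25, 55], ["bike", "pants", "watch"]),
    "Alice": pd.Series([40, 110, 500, 45], ["book", "glasses", "bike", "pants"]),
}
shopping_carts = pd.DataFrame(items)

print(shopping_carts.shape)   # (rows, columns)
print(shopping_carts.ndim)    # 2, a DataFrame is two-dimensional
print(shopping_carts.size)    # rows * columns, NaN cells included
print(shopping_carts.values)  # the data as a 2-D NumPy array
```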
- Creating from a dictionary
  - from a dictionary of lists
# We create a dictionary of lists (arrays)
data = {'Integers' : [1,2,3],
'Floats' : [4.5, 8.2, 9.6]}
# We create a DataFrame
df = pd.DataFrame(data)
# We display the DataFrame
df
   Floats  Integers
0     4.5         1
1     8.2         2
2     9.6         3
# We create a list of Python dictionaries
items2 = [{'bikes': 20, 'pants': 30, 'watches': 35},
{'watches': 10, 'glasses': 50, 'bikes': 15, 'pants':5}]
# We create a DataFrame
store_items = pd.DataFrame(items2)
# We display the DataFrame
store_items
   bikes  glasses  pants  watches
0     20      NaN     30       35
1     15     50.0      5       10
- Accessing a DataFrame
  - with df_name['col_key']['row_key']
  - the column label comes first, then the row label
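# The rows are assumed to be labelled 'store 1' and 'store 2' from here on
store_items = pd.DataFrame(items2, index=['store 1', 'store 2'])
# Selecting every column and then indexing with a row label does not work:
# store_items[x] is still a DataFrame, so ['store 1'] is looked up as a column and raises KeyError
x = []
for item in store_items.columns:
    x.append(item)
# print(store_items[x]['store 1'])  # KeyError
# Use loc to select a whole row by its label instead
store_items.loc['store 1']
bikes      20.0
glasses     NaN
pants      30.0
watches    35.0
Name: store 1, dtype: float64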
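A single cell can be reached either by chaining column and row labels or with one `loc` call. The sketch below rebuilds `store_items` with the row labels `store 1` and `store 2`, which is an assumption based on the labels used above.

```python
import pandas as pd

items2 = [
    {"bikes": 20, "pants": 30, "watches": 35},
    {"watches": 10, "glasses": 50, "bikes": 15, "pants": 5},
]
store_items = pd.DataFrame(items2, index=["store 1", "store 2"])

# Chained access: column label first, then row label
col_then_row = store_items["bikes"]["store 2"]

# Single loc call: row label first, then column label
via_loc = store_items.loc["store 2", "bikes"]

# Whole row by integer position
first_row = store_items.iloc[0]
```

The single `loc` call is usually preferable when assigning values, since chained indexing may operate on a temporary copy.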
- Adding a new column by assignment
store_items['shirts'] = [15, 2]
store_items
- Adding a new row
# DataFrame.append was removed in pandas 2.0, so the new row is added with pd.concat
new_items = [{'bikes': 20, 'pants': 30, 'watches': 35, 'glasses': 4}]
new_store = pd.DataFrame(new_items, index=['store 3'])
store_items = pd.concat([store_items, new_store])
store_items
         bikes  glasses  pants  shirts  watches
store 1     20      NaN     30    15.0       35
store 2     15     50.0      5     2.0       10
store 3     20      4.0     30     NaN       35
- Adding a new column from a slice of an existing column
# Rows missing from the slice (here store 1) are filled with NaN
store_items['new watches'] = store_items['watches'][1:]
         bikes  glasses  pants  shirts  suits  watches  new watches
store 1     20      NaN     30    15.0   45.0       35          NaN
store 2     15     50.0      5     2.0    7.0       10         10.0
store 3     20      4.0     30     NaN    NaN       35         35.0
- Inserting a new column at a given position
  - df_name.insert(loc, label, data)
    - loc : integer position at which the column is inserted
    - label : name of the new column
    - data : values of the new column
# We insert a new column with label shoes right before the column with numerical index 4
store_items.insert(4, 'shoes', [8,5,0])
# we display the modified DataFrame
store_items
         bikes  glasses  pants  shirts  shoes  suits  watches  new watches
store 1     20      NaN     30    15.0      8   45.0       35          NaN
store 2     15     50.0      5     2.0      5    7.0       10         10.0
store 3     20      4.0     30     NaN      0    NaN       35         35.0
- Deleting rows or columns
  - df_name.pop(attr_key)
    - can only delete a column
  - df_name.drop(['key1', 'key2'], axis=0 or 1)
    - deletes rows (axis=0) or columns (axis=1)
# We remove the new watches column
store_items.pop('new watches')
# we display the modified DataFrame
store_items
         bikes  glasses  pants  shirts  shoes  suits  watches
store 1     20      NaN     30    15.0      8   45.0       35
store 2     15     50.0      5     2.0      5    7.0       10
store 3     20      4.0     30     NaN      0    NaN       35
# We remove the watches and shoes columns
store_items = store_items.drop(['watches', 'shoes'], axis = 1)
# we display the modified DataFrame
store_items
         bikes  glasses  pants  shirts  suits
store 1     20      NaN     30    15.0   45.0
store 2     15     50.0      5     2.0    7.0
store 3     20      4.0     30     NaN    NaN
# We remove the store 2 and store 1 rows
store_items = store_items.drop(['store 2', 'store 1'], axis = 0)
# we display the modified DataFrame
store_items
         bikes  glasses  pants  shirts  suits
store 3     20      4.0     30     NaN    NaN
- Renaming row or column labels
  - df_name.rename(columns={'old_name': 'new_name'}) or df_name.rename(index={'old_name': 'new_name'})
# We change the column label bikes to hats
store_items = store_items.rename(columns = {'bikes': 'hats'})
# we display the modified DataFrame
store_items
         hats  glasses  pants  shirts  suits
store 3    20      4.0     30     NaN    NaN
# We change the row label from store 3 to last store
store_items = store_items.rename(index = {'store 3': 'last store'})
# we display the modified DataFrame
store_items
            hats  glasses  pants  shirts  suits
last store    20      4.0     30     NaN    NaN
- Using a column as the index
# We change the row index to be the data in the pants column
store_items = store_items.set_index('pants')
# we display the modified DataFrame
store_items
       hats  glasses  shirts  suits
pants
30       20      4.0     NaN    NaN