DataFrame
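A DataFrame is the tabular data structure in pandas. It holds an ordered collection of columns, and each column can have a different value type (numeric, string, boolean, and so on). A DataFrame has both a row index and a column index, and it can be viewed as a dictionary of Series.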
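The "dictionary of Series" view can be sketched as follows. The variable names (`names`, `prices`, `frame`) are made up for illustration and are not part of the original example.

```python
import pandas as pd

# Two Series that share the same index labels.
names = pd.Series(['A', 'B', 'C'], index=['one', 'two', 'three'])
prices = pd.Series([0, 1, 2], index=['one', 'two', 'three'])

# A dict of Series becomes a DataFrame: the keys are the column labels,
# and the shared Series index becomes the row index.
frame = pd.DataFrame({'name': names, 'price': prices})

# Selecting a single column hands back a Series again.
price_column = frame['price']
print(type(price_column).__name__)
print(list(frame.index), list(frame.columns))
```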
Series
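A Series is a one-dimensional, array-like object made up of a sequence of values (of any NumPy data type) and an associated sequence of data labels, called its index. A simple Series can also be created from the values alone.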
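A minimal sketch of both ways of building a Series, with and without explicit labels. The names `plain` and `labeled` are illustrative only.

```python
import pandas as pd

# Values only: pandas supplies a default integer index 0..n-1.
plain = pd.Series([10, 20, 30])

# Values plus explicit labels.
labeled = pd.Series([10, 20, 30], index=['a', 'b', 'c'])

print(list(plain.index))
print(labeled['b'])
```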
import pandas as pd
import numpy as np
Creating a DataFrame
- From a dictionary: each key-value pair becomes one column (a Series)
data = {
    'name': ['A', 'B', 'C', 'D', 'D'],
    'year': [2018, None, 2020, 2021, 2022],
    'price': [0, 1, 2, 3, 4]
}
df = pd.DataFrame(data)
df
|  | name | year | price |
|---|---|---|---|
| 0 | A | 2018.0 | 0 |
| 1 | B | NaN | 1 |
| 2 | C | 2020.0 | 2 |
| 3 | D | 2021.0 | 3 |
| 4 | D | 2022.0 | 4 |
df = pd.DataFrame(data, index=['one','two','three','four','five'])
df
|  | name | year | price |
|---|---|---|---|
| one | A | 2018.0 | 0 |
| two | B | NaN | 1 |
| three | C | 2020.0 | 2 |
| four | D | 2021.0 | 3 |
| five | D | 2022.0 | 4 |
- From a NumPy ndarray
df_1 = pd.DataFrame(np.array([['A',2018,0],['B',2019,1],['C',2020,2],['D',2021,3],['E',2022,4]]),columns=['name','year','price'])
df_1
|  | name | year | price |
|---|---|---|---|
| 0 | A | 2018 | 0 |
| 1 | B | 2019 | 1 |
| 2 | C | 2020 | 2 |
| 3 | D | 2021 | 3 |
| 4 | E | 2022 | 4 |
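One thing worth checking with the ndarray route: a NumPy array holds a single dtype, so mixing strings and integers in one array turns every element into a string. The sketch below uses a shortened, illustrative copy of the data (`raw`, `df_str`, `df_typed` are made-up names) and converts the numeric columns back with `astype`.

```python
import numpy as np
import pandas as pd

# A mixed list of str and int is coerced by NumPy to an all-string array.
raw = np.array([['A', 2018, 0], ['B', 2019, 1], ['C', 2020, 2]])
df_str = pd.DataFrame(raw, columns=['name', 'year', 'price'])

# Every cell is a string here, including the values that look like numbers.
print(repr(df_str.loc[0, 'year']))

# Convert the numeric columns explicitly before doing arithmetic on them.
df_typed = df_str.astype({'year': 'int64', 'price': 'int64'})
print(df_typed['year'].sum())
```

Building the frame from a dictionary, as in the first example, avoids this coercion because each column keeps its own type.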
Basic DataFrame attributes
- df.index: returns the index of df, i.e. the row labels
df.index
Index(['one', 'two', 'three', 'four', 'five'], dtype='object')
for i in range(len(df.index)):
    print(df.index[i])
one
two
three
four
five
- df.columns: returns the column names of df, i.e. the column labels
df.columns
Index(['name', 'year', 'price'], dtype='object')
for i in range(len(df.columns)):
    print(df.columns[i])
name
year
price
- df.dtypes: returns the data type of each column of df
df.dtypes
name object
year float64
price int64
dtype: object
- df.values: returns the values of df as a NumPy array
df.values
array([['A', 2018.0, 0],
['B', nan, 1],
['C', 2020.0, 2],
['D', 2021.0, 3],
['D', 2022.0, 4]], dtype=object)
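As a side note, `DataFrame.to_numpy()` does the same job as `df.values` and is the spelling the pandas documentation recommends. The sketch below, on a small illustrative frame (`df_demo` is a made-up name), also shows why the array above has `dtype=object`: mixed column types force a generic dtype, while numeric columns alone give a numeric array.

```python
import pandas as pd

df_demo = pd.DataFrame({'name': ['A', 'B'], 'year': [2018.0, None], 'price': [0, 1]})

# Whole frame: strings and numbers together force a generic object array.
all_values = df_demo.to_numpy()

# Numeric columns only: the result gets a numeric dtype.
numeric_values = df_demo[['year', 'price']].to_numpy()

print(all_values.dtype, numeric_values.dtype)
```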
DataFrame methods
- df.astype: casts columns to the specified data type
df.astype({'price': 'int32'}).dtypes
name object
year float64
price int32
dtype: object
- df.convert_dtypes: automatically converts each column to the best-suited data type (requires pandas 1.0.0 or later)
df.convert_dtypes().dtypes
name string
year Int64
price Int64
dtype: object
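The `Int64` (capital I) shown above is the nullable integer dtype, which is why `year` can become an integer column even though it contains a missing value. A minimal sketch of the difference from plain `int64`, using an illustrative Series named `year`:

```python
import pandas as pd

# Stored as float64, because NaN cannot live in a plain integer column.
year = pd.Series([2018, None, 2020], name='year')

# Casting to plain int64 fails: NaN has no integer representation.
try:
    year.astype('int64')
    plain_int_failed = False
except (ValueError, TypeError):
    plain_int_failed = True

# The nullable extension dtype keeps integers and marks the gap as <NA>.
year_nullable = year.astype('Int64')

print(plain_int_failed, year_nullable.dtype)
```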
- df.isna / df.notna: detect missing and non-missing values
df.isna()
|  | name | year | price |
|---|---|---|---|
| one | False | False | False |
| two | False | True | False |
| three | False | False | False |
| four | False | False | False |
| five | False | False | False |
df.notna()
|  | name | year | price |
|---|---|---|---|
| one | True | True | True |
| two | True | False | True |
| three | True | True | True |
| four | True | True | True |
| five | True | True | True |
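These boolean masks are usually combined with other operations rather than read directly. The sketch below shows three common follow-ups on a small illustrative frame (`df_demo` and the fill value 0 are arbitrary choices for the example).

```python
import pandas as pd

df_demo = pd.DataFrame(
    {'name': ['A', 'B', 'C'], 'year': [2018, None, 2020], 'price': [0, 1, 2]},
    index=['one', 'two', 'three'],
)

# True counts as 1, so summing the mask counts missing values per column.
missing_per_column = df_demo.isna().sum()

# Keep only the rows where 'year' is present.
rows_with_year = df_demo[df_demo['year'].notna()]

# Or fill the gap in 'year' with a placeholder value.
filled = df_demo.fillna({'year': 0})

print(missing_per_column)
```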
- df.head: returns the first few rows of the table
df.head(3)
|  | name | year | price |
|---|---|---|---|
| one | A | 2018.0 | 0 |
| two | B | NaN | 1 |
| three | C | 2020.0 | 2 |
- df.at: gets a single value from the table by row and column label
df.at['two','name']
'B'
- df.iat: gets a single value from the table by integer row and column position
df.iat[1,1]
nan
df.iat[1,1]=None  # iat also supports assignment; None in the float column 'year' is stored as NaN
df
|  | name | year | price |
|---|---|---|---|
| one | A | 2018.0 | 0 |
| two | B | NaN | 1 |
| three | C | 2020.0 | 2 |
| four | D | 2021.0 | 3 |
| five | D | 2022.0 | 4 |
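To make the relationship between the two accessors concrete, the sketch below reads the same cell once by label and once by position, then writes a value back with `at`. The frame `df_demo` and the value 2019.0 are illustrative, not taken from the original data.

```python
import pandas as pd

df_demo = pd.DataFrame(
    {'name': ['A', 'B'], 'year': [2018.0, None], 'price': [0, 1]},
    index=['one', 'two'],
)

# Same cell, addressed by label and by integer position.
by_label = df_demo.at['two', 'price']
by_position = df_demo.iat[1, 2]

# Both accessors also support assignment to a single cell.
df_demo.at['two', 'year'] = 2019.0

print(by_label, by_position, df_demo.iat[1, 1])
```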
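- df.loc: accesses a group of rows and columns by label or by boolean array. It supports many more forms than shown here; see the official documentation.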
df.loc['one']
name A
year 2018
price 0
Name: one, dtype: object
df.loc[['one','four']]
|  | name | year | price |
|---|---|---|---|
| one | A | 2018.0 | 0 |
| four | D | 2021.0 | 3 |
df.loc['one','name']
'A'
df.loc['one':'three','name']
one A
two B
three C
Name: name, dtype: object
df.loc[[True,True,True,False,True]]
|  | name | year | price |
|---|---|---|---|
| one | A | 2018.0 | 0 |
| two | B | NaN | 1 |
| three | C | 2020.0 | 2 |
| five | D | 2022.0 | 4 |
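In practice the boolean array passed to `df.loc` is rarely typed by hand; it usually comes from a condition on a column. A minimal sketch, rebuilding the example frame under the illustrative name `df_demo`:

```python
import pandas as pd

df_demo = pd.DataFrame(
    {
        'name': ['A', 'B', 'C', 'D', 'D'],
        'year': [2018, None, 2020, 2021, 2022],
        'price': [0, 1, 2, 3, 4],
    },
    index=['one', 'two', 'three', 'four', 'five'],
)

# A comparison on a column yields a boolean Series aligned with the rows.
mask = df_demo['price'] >= 2

# Rows are chosen by the mask, columns by a list of labels.
expensive = df_demo.loc[mask, ['name', 'price']]

print(expensive)
```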
- df.iloc: selects data by integer position
df.iloc[2]
name C
year 2020
price 2
Name: three, dtype: object
df.iloc[2:4]
|  | name | year | price |
|---|---|---|---|
| three | C | 2020.0 | 2 |
| four | D | 2021.0 | 3 |
df.iloc[[2,4]]
|  | name | year | price |
|---|---|---|---|
| three | C | 2020.0 | 2 |
| five | D | 2022.0 | 4 |
df.iloc[[2,4],[2]]
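One difference between the two indexers is easy to miss: a label slice with `loc` includes its end label, while a position slice with `iloc` excludes its end position, like ordinary Python slicing. The sketch below rebuilds the example frame as `df_demo` (an illustrative name) and also shows that passing two lists to `iloc` returns a DataFrame rather than a Series.

```python
import pandas as pd

df_demo = pd.DataFrame(
    {
        'name': ['A', 'B', 'C', 'D', 'D'],
        'year': [2018, None, 2020, 2021, 2022],
        'price': [0, 1, 2, 3, 4],
    },
    index=['one', 'two', 'three', 'four', 'five'],
)

# Label slice: the end label 'three' is included.
by_label = df_demo.loc['one':'three']

# Position slice: the end position 2 is excluded.
by_position = df_demo.iloc[0:2]

# Lists for both rows and columns keep the result two-dimensional.
cell_frame = df_demo.iloc[[2, 4], [2]]

print(len(by_label), len(by_position), cell_frame.shape)
```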
- df.isin: checks, element by element, whether each value in the DataFrame is among the given values
df.isin([2,3])
|  | name | year | price |
|---|---|---|---|
| one | False | False | False |
| two | False | False | False |
| three | False | False | True |
| four | False | False | True |
| five | False | False | False |
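A common use of `isin` is filtering rows by membership in one column. The sketch below shows that, plus the dictionary form that restricts the check to named columns. `df_demo`, `selected`, and `per_column` are illustrative names.

```python
import pandas as pd

df_demo = pd.DataFrame(
    {
        'name': ['A', 'B', 'C', 'D', 'D'],
        'year': [2018, None, 2020, 2021, 2022],
        'price': [0, 1, 2, 3, 4],
    },
    index=['one', 'two', 'three', 'four', 'five'],
)

# Series.isin gives a row mask that can be used to filter the frame.
selected = df_demo[df_demo['name'].isin(['A', 'D'])]

# A dict limits the membership test to the listed columns;
# columns that are not keys of the dict come back as all False.
per_column = df_demo.isin({'price': [2, 3]})

print(selected)
```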
- df.groupby: groups the rows of a DataFrame
df.groupby(['name'])
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x000001F2168ACC48>
df.groupby(['name']).mean()
| name | year | price |
|---|---|---|
| A | 2018.0 | 0.0 |
| B | NaN | 1.0 |
| C | 2020.0 | 2.0 |
| D | 2021.5 | 3.5 |
df.groupby(['name']).sum()
| name | year | price |
|---|---|---|
| A | 2018.0 | 0 |
| B | 0.0 | 1 |
| C | 2020.0 | 2 |
| D | 4043.0 | 7 |
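Two details of the outputs above can be sketched further. Several aggregations can be computed in one pass with `agg`, and the `0.0` that `sum()` reports for group B (whose only `year` is missing) can be turned into NaN with `min_count=1`. The frame is rebuilt here as `df_demo`, an illustrative name.

```python
import pandas as pd

df_demo = pd.DataFrame(
    {
        'name': ['A', 'B', 'C', 'D', 'D'],
        'year': [2018, None, 2020, 2021, 2022],
        'price': [0, 1, 2, 3, 4],
    },
    index=['one', 'two', 'three', 'four', 'five'],
)

# Several aggregations of one column at once: one result column per function.
price_stats = df_demo.groupby('name')['price'].agg(['mean', 'sum'])

# By default an all-missing group sums to 0; min_count=1 requires at least
# one valid value and otherwise returns NaN.
year_sum = df_demo.groupby('name')['year'].sum(min_count=1)

print(price_stats)
print(year_sum)
```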
- df.drop: removes the specified labels from rows or columns
df.drop(['name'],axis=1)
|  | year | price |
|---|---|---|
| one | 2018.0 | 0 |
| two | NaN | 1 |
| three | 2020.0 | 2 |
| four | 2021.0 | 3 |
| five | 2022.0 | 4 |
df.drop(columns=['name'])
|  | year | price |
|---|---|---|
| one | 2018.0 | 0 |
| two | NaN | 1 |
| three | 2020.0 | 2 |
| four | 2021.0 | 3 |
| five | 2022.0 | 4 |
df.drop(['one'])
|  | name | year | price |
|---|---|---|---|
| two | B | NaN | 1 |
| three | C | 2020.0 | 2 |
| four | D | 2021.0 | 3 |
| five | D | 2022.0 | 4 |
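Note that none of the `drop` calls above changes `df` itself: each returns a new DataFrame, so the result has to be assigned to a name to be kept. A minimal sketch on a small illustrative frame (`df_demo`, and the non-existent column label `'missing'`, are made up for the example):

```python
import pandas as pd

df_demo = pd.DataFrame(
    {'name': ['A', 'B', 'C'], 'year': [2018, None, 2020], 'price': [0, 1, 2]},
    index=['one', 'two', 'three'],
)

# drop returns a new frame; df_demo keeps all of its rows and columns.
without_name = df_demo.drop(columns=['name'])
without_first = df_demo.drop(index=['one'])

# errors='ignore' skips labels that do not exist instead of raising KeyError.
unchanged = df_demo.drop(columns=['missing'], errors='ignore')

print(list(df_demo.columns), list(without_name.columns))
```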
- Draw a bar chart from the table
ax = df.plot.bar(rot=0)
axes = df.plot.bar(rot=0, subplots=True)
axes[1].legend(loc=2)