Joyful Pandas
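Notes on joining, grouping, and aggregation from the Datawhale community's Joyful Pandas course:
Joining
Relational joins
- Left join: keep every key of the left table and attach the matching rows of the right table
- Right join: keep every key of the right table and attach the matching rows of the left table
- Inner join: keep only the keys present in both tables
- Outer join: keep all keys from both tables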
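The four join types above can be compared on two small tables. This is a minimal sketch; the tables `left` and `right` and their values are made up for illustration and do not come from the course.

```python
import pandas as pd

# Hypothetical toy tables sharing the column "key"
left = pd.DataFrame({"key": ["a", "b", "c"], "x": [1, 2, 3]})
right = pd.DataFrame({"key": ["b", "c", "d"], "y": [20, 30, 40]})

# Run the same merge once per join type and collect the results
joined = {
    how: left.merge(right, on="key", how=how)
    for how in ["left", "right", "inner", "outer"]
}

for how, table in joined.items():
    # Show which keys survive under each join type
    print(how, table["key"].tolist())
```

Only the `how` argument changes between the four calls; keys without a partner in the other table get NaN in the columns that come from that table.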
Column join
df1.merge(df2, left_index = True, right_index = True)
Index join
df1.join(df2)
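Both calls above match rows on the index, but their defaults differ: `join` performs a left join unless told otherwise, while `merge` performs an inner join. A small sketch with invented data shows the difference:

```python
import pandas as pd

# Hypothetical tables indexed by label; "c" exists only on the left
left = pd.DataFrame({"x": [1, 2, 3]}, index=["a", "b", "c"])
right = pd.DataFrame({"y": [10, 20]}, index=["a", "b"])

joined = left.join(right)  # default how="left": keeps row "c", y is NaN there
merged = left.merge(right, left_index=True, right_index=True)  # default how="inner"

print(joined)
print(merged)
```

When both tables carry exactly the same index, as in the tasks later in these notes, the two calls give the same rows.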
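Vertical concatenation
df1.append(df2)
# same as pd.concat([df1, df2], axis = 0); DataFrame.append was removed in pandas 2.0
Directional concatenation
pd.concat([df1, df2], axis = 0)
# axis = 0 stacks rows, axis = 1 places columns side by side
df1.assign(s1 = s1)
# append a Series to the end of the DataFrame as a new column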
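To make the two concatenation directions and `assign` concrete, here is a short sketch on invented data (`df1`, `df2`, and `s1` are placeholders, not course data):

```python
import pandas as pd

df1 = pd.DataFrame({"a": [1, 2]})
df2 = pd.DataFrame({"a": [3, 4]})
s1 = pd.Series([10, 20], name="b")

# axis=0 stacks rows; ignore_index=True rebuilds a clean 0..n-1 index
rows = pd.concat([df1, df2], axis=0, ignore_index=True)

# axis=1 places objects side by side, aligned on the index
cols = pd.concat([df1, s1], axis=1)

# assign returns a new DataFrame with the extra column; df1 itself is unchanged
with_col = df1.assign(b=s1)

print(rows.shape, cols.shape, with_col.columns.tolist())
```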
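Grouping patterns and the groupby object
Grouping pattern
df.groupby(grouping key)[data source].operation
The groupby object
gb = df.groupby([..., ...])
- Attributes
  - ngroups: number of groups
  - groups: dict mapping each group name to the row labels in that group
- Methods
  - size(): number of elements in each group
  - get_group(): rows that belong to the given group
Aggregation functions
gb.agg()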
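The attributes and methods listed above can be tried on a tiny table. The sketch below uses invented rows with Titanic-style column names; it only illustrates the calls and is not course data.

```python
import pandas as pd

# Hypothetical mini table
df = pd.DataFrame({
    "Sex": ["male", "female", "female", "male", "female"],
    "Fare": [7.25, 71.28, 7.92, 8.05, 53.10],
})

gb = df.groupby("Sex")

print(gb.ngroups)                       # number of groups
print(gb.groups)                        # group name -> row labels
print(gb.size())                        # rows per group
print(gb.get_group("female"))           # all rows of one group
print(gb["Fare"].agg(["mean", "max"]))  # several aggregations at once
```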
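Hands-On Data Analysis (动手学数据分析)
Notes on data restructuring from the Datawhale community's Hands-On Data Analysis (动手学数据分析) course:
Before starting, import numpy and pandas and load the data
# Import the basic libraries
import numpy as np
import pandas as pd
# Load train-left-up.csv from the data folder
pd.read_csv("data/train-left-up.csv").head()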
| | PassengerId | Survived | Pclass | Name |
---|---|---|---|---|
0 | 1 | 0 | 3 | Braund, Mr. Owen Harris |
1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... |
2 | 3 | 1 | 3 | Heikkinen, Miss. Laina |
3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) |
4 | 5 | 0 | 3 | Allen, Mr. William Henry |
2 Chapter 2: Data Restructuring
2.4 Merging data
2.4.1 Task 1
Load all the data files in the data folder and observe how they relate to each other
# Write your code here
text_left_up = pd.read_csv("data/train-left-up.csv")
text_left_down = pd.read_csv("data/train-left-down.csv")
text_right_up = pd.read_csv("data/train-right-up.csv")
text_right_down = pd.read_csv("data/train-right-down.csv")
# Write your code here
my_list = [text_left_up, text_left_down, text_right_up, text_right_down]
for i in my_list:
    # info() prints its report and returns None, hence the "None" lines in the output
    print(i.info())
    print('=========')
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 439 entries, 0 to 438
Data columns (total 4 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 PassengerId 439 non-null int64
1 Survived 439 non-null int64
2 Pclass 439 non-null int64
3 Name 439 non-null object
dtypes: int64(3), object(1)
memory usage: 13.8+ KB
None
=========
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 452 entries, 0 to 451
Data columns (total 4 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 PassengerId 452 non-null int64
1 Survived 452 non-null int64
2 Pclass 452 non-null int64
3 Name 452 non-null object
dtypes: int64(3), object(1)
memory usage: 14.2+ KB
None
=========
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 439 entries, 0 to 438
Data columns (total 8 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Sex 439 non-null object
1 Age 352 non-null float64
2 SibSp 439 non-null int64
3 Parch 439 non-null int64
4 Ticket 439 non-null object
5 Fare 439 non-null float64
6 Cabin 97 non-null object
7 Embarked 438 non-null object
dtypes: float64(2), int64(2), object(4)
memory usage: 27.6+ KB
None
=========
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 452 entries, 0 to 451
Data columns (total 8 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Sex 452 non-null object
1 Age 362 non-null float64
2 SibSp 452 non-null int64
3 Parch 452 non-null int64
4 Ticket 452 non-null object
5 Fare 452 non-null float64
6 Cabin 107 non-null object
7 Embarked 451 non-null object
dtypes: float64(2), int64(2), object(4)
memory usage: 28.4+ KB
None
=========
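[Hint] Based on the train.csv data we loaded earlier, make a rough guess about what each of the tables above contains
2.4.2 Task 2
Using the concat method: merge train-left-up.csv and train-right-up.csv horizontally into one table and save it as result_up
# Write your code here
result_up = pd.concat([text_left_up, text_right_up], axis = 1)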
result_up.head()
| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S |
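The horizontal concat works here because both pieces share the same RangeIndex (0 to 438), and `pd.concat(..., axis=1)` aligns rows by index label rather than by position. The sketch below, on invented data, shows what happens when the labels do not line up and one way to handle it, assuming the rows are known to be in the same order:

```python
import pandas as pd

left = pd.DataFrame({"id": [1, 2, 3]})                            # labels 0, 1, 2
right = pd.DataFrame({"sex": ["m", "f", "f"]}, index=[1, 2, 3])   # labels shifted by one

# Alignment on labels: the result covers labels 0..3, with NaN where a side is missing
misaligned = pd.concat([left, right], axis=1)

# If the rows really correspond by position, reset the index before concatenating
aligned = pd.concat([left, right.reset_index(drop=True)], axis=1)

print(misaligned)
print(aligned)
```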
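2.4.3 Task 3
Using the concat method: merge train-left-down and train-right-down horizontally into one table and save it as result_down. Then merge result_up from above and result_down vertically into result.
# Write your code here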
result_down = pd.concat([text_left_down, text_right_down], axis = 1)
result_down.head()
| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 440 | 0 | 2 | Kvillner, Mr. Johan Henrik Johannesson | male | 31.0 | 0 | 0 | C.A. 18723 | 10.500 | NaN | S |
1 | 441 | 1 | 2 | Hart, Mrs. Benjamin (Esther Ada Bloomfield) | female | 45.0 | 1 | 1 | F.C.C. 13529 | 26.250 | NaN | S |
2 | 442 | 0 | 3 | Hampe, Mr. Leon | male | 20.0 | 0 | 0 | 345769 | 9.500 | NaN | S |
3 | 443 | 0 | 3 | Petterson, Mr. Johan Emil | male | 25.0 | 1 | 0 | 347076 | 7.775 | NaN | S |
4 | 444 | 1 | 2 | Reynaldo, Ms. Encarnacion | female | 28.0 | 0 | 0 | 230434 | 13.000 | NaN | S |
result = pd.concat([result_up, result_down], axis = 0).reset_index(drop = True)
result.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 PassengerId 891 non-null int64
1 Survived 891 non-null int64
2 Pclass 891 non-null int64
3 Name 891 non-null object
4 Sex 891 non-null object
5 Age 714 non-null float64
6 SibSp 891 non-null int64
7 Parch 891 non-null int64
8 Ticket 891 non-null object
9 Fare 891 non-null float64
10 Cabin 204 non-null object
11 Embarked 889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB
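2.4.4 Task 4
Using the DataFrame's own join and append methods: complete Task 2 and Task 3
# Write your code here
# Note: DataFrame.append was removed in pandas 2.0; on newer versions use pd.concat([result_up, result_down])
result_up = text_left_up.join(text_right_up)
result_down = text_left_down.join(text_right_down)
result = result_up.append(result_down).reset_index(drop = True)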
result.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 PassengerId 891 non-null int64
1 Survived 891 non-null int64
2 Pclass 891 non-null int64
3 Name 891 non-null object
4 Sex 891 non-null object
5 Age 714 non-null float64
6 SibSp 891 non-null int64
7 Parch 891 non-null int64
8 Ticket 891 non-null object
9 Fare 891 non-null float64
10 Cabin 204 non-null object
11 Embarked 889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB
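2.4.5 Task 5
Using the Pandas merge method and the DataFrame append method: complete Task 2 and Task 3
# Write your code here
# Note: DataFrame.append was removed in pandas 2.0; on newer versions use pd.concat([result_up, result_down])
result_up = text_left_up.merge(text_right_up, left_index = True, right_index = True)
result_down = text_left_down.merge(text_right_down, left_index = True, right_index = True)
result = result_up.append(result_down).reset_index(drop = True)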
result.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 PassengerId 891 non-null int64
1 Survived 891 non-null int64
2 Pclass 891 non-null int64
3 Name 891 non-null object
4 Sex 891 non-null object
5 Age 714 non-null float64
6 SibSp 891 non-null int64
7 Parch 891 non-null int64
8 Ticket 891 non-null object
9 Fare 891 non-null float64
10 Cabin 204 non-null object
11 Embarked 889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB
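[Think] Compare how merge, join, and concat differ and what they have in common. Why do Task 4 and Task 5 both require the DataFrame append method? If only merge or join were allowed, could Task 4 and Task 5 still be completed?
merge and join concatenate columns, while append concatenates rows. Official documentation: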
pandas.DataFrame.join
pandas.DataFrame.merge
pandas.DataFrame.append
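Since `DataFrame.append` is no longer available from pandas 2.0 onward, the row-wise step can be written with `pd.concat`. The sketch below uses two invented pieces in place of result_up and result_down. It also shows one possible answer to the question above: an outer merge on all shared columns can stack rows too, but only under the assumption that no row appears in both pieces, because matching rows would be collapsed into one.

```python
import pandas as pd

# Hypothetical stand-ins for result_up and result_down
up = pd.DataFrame({"PassengerId": [1, 2], "Sex": ["male", "female"]})
down = pd.DataFrame({"PassengerId": [3, 4], "Sex": ["female", "male"]})

# Row-wise concatenation, the replacement for up.append(down)
result = pd.concat([up, down], ignore_index=True)

# Outer merge on every shared column: also yields all rows here,
# but relies on the rows of the two pieces being distinct
merged = up.merge(down, how="outer")

print(result)
print(merged)
```

For plain stacking, `pd.concat` is the more direct choice; the merge variant is shown only to illustrate that it is possible.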
2.5 Looking at the data from another angle
2.5.1 Task 1
Convert our data into Series-type data
# Write your code here
result.stack().shape
(9826,)
# Write your code here
result.stack()[0]
PassengerId 1
Survived 0
Pclass 3
Name Braund, Mr. Owen Harris
Sex male
Age 22.0
SibSp 1
Parch 0
Ticket A/5 21171
Fare 7.25
Embarked S
dtype: object
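The stacked length 9826 is smaller than 891 × 12 = 10692 because the stacked result shown here leaves out missing values: according to the info() output above, 177 Age, 687 Cabin, and 2 Embarked values are missing, 866 in total, and 10692 - 866 = 9826. This is also why Cabin is absent from the first passenger's entries. The sketch below reproduces the effect on invented data; it calls `dropna()` explicitly so that the result does not depend on the default NaN handling of `stack()` in a particular pandas version.

```python
import numpy as np
import pandas as pd

# Hypothetical two-row table with two missing cells
df = pd.DataFrame({
    "Age": [22.0, np.nan],
    "Cabin": [np.nan, "C85"],
    "Fare": [7.25, 71.28],
})

# stack() turns the table into a Series indexed by (row label, column name)
stacked = df.stack().dropna()

n_cells = df.size                         # rows * columns
n_present = int(df.notna().sum().sum())   # non-missing cells

print(n_cells, n_present, len(stacked))
print(stacked.loc[0])                     # entries of the first row
```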
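Review: Earlier we covered the basics of Pandas. In Chapter 2 we move on to the practical side of data analysis. In Section 1 of Chapter 2 we learned data cleaning, which matters a great deal: only when the data is reasonably clean can the later analysis be convincing. In this section we restructure the data, which still belongs to the data understanding (preparation) stage.
2.6 Applying the data
2.6.1 Task 1
Learn about the GroupBy mechanism from the textbook Python for Data Analysis (p. 303), Google, or any other source
# Write your notes here
2.6.2 Task 2
Compute the average fare paid by male and by female passengers on the Titanic
# Write your code here
# df is assumed to be the full Titanic table (for example, the result table built above)
df.groupby('Sex')['Fare'].mean()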
Sex
female 44.479818
male 25.523893
Name: Fare, dtype: float64
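After learning how the GroupBy mechanism works, we apply it in a series of operations to reach our goal.
The following tasks help you get familiar with the GroupBy mechanism.
2.6.3 Task 3
Count the number of male and female survivors on the Titanic
# Write your code here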
df.groupby('Sex')['Survived'].sum()
Sex
female 233
male 109
Name: Survived, dtype: int64
2.6.4 Task 4
Count the number of survivors in each passenger class
# Write your code here
df.groupby('Pclass')['Survived'].sum()
Pclass
1 136
2 87
3 119
Name: Survived, dtype: int64
[Supplement] The difference between count and sum:
count counts the rows in each group; sum adds up the 0/1 labels, which gives the number of 1s
print(df.groupby('Pclass')['Survived'].count())
print('=========')
df.Pclass.value_counts()
Pclass
1 216
2 184
3 491
Name: Survived, dtype: int64
=========
3 491
1 216
2 184
Name: Pclass, dtype: int64
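The difference between count and sum, and the survival rate that follows from them, can be checked on a few invented rows. In this sketch the column names mirror the Titanic table but the values are made up:

```python
import pandas as pd

# Hypothetical passengers: class and 0/1 survival label
df = pd.DataFrame({
    "Pclass": [1, 1, 2, 3, 3, 3],
    "Survived": [1, 0, 1, 0, 0, 1],
})

# count = rows per class, sum = survivors per class, mean = survival rate
stats = df.groupby("Pclass")["Survived"].agg(["count", "sum", "mean"])
print(stats)

# value_counts gives the same per-class row counts as count
print(df["Pclass"].value_counts())
```

Because the label is 0 or 1, the mean of the column equals sum divided by count, that is, the share of survivors in each group.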
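[Hint] In the Survived column, 1 means the passenger survived and 0 means the passenger died
[Think] From a data-analysis perspective, what conclusions can be drawn from the statistics above?
# Notes
- Women paid a higher average fare than men (about 44.48 vs. 25.52)
- Far more women survived than men (233 vs. 109)
- First class has the highest survival rate: 136 of 216 passengers survived, versus 87 of 184 in second class and 119 of 491 in third class
[Think] The calculations in Task 2 and Task 3 can be done together with the agg() function, and the columns can be renamed with the rename function. Can you write this out following the hint?
# Notes
df.groupby('Sex').agg({'Fare': 'mean', 'Survived': 'sum'}).rename(
    columns = {'Fare': 'Fare_mean', 'Survived': 'Survived_sum'})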
Sex | Fare_mean | Survived_sum |
---|---|---|
female | 44.479818 | 233 |
male | 25.523893 | 109 |
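As an alternative to agg() followed by rename, pandas also supports named aggregation, where each output column is given as `name=(column, function)`. The sketch below applies it to invented rows; the output names Fare_mean and Survived_sum follow the ones used above.

```python
import pandas as pd

# Hypothetical passengers
df = pd.DataFrame({
    "Sex": ["male", "female", "female", "male"],
    "Fare": [10.0, 30.0, 50.0, 20.0],
    "Survived": [0, 1, 1, 1],
})

# Aggregate and name the output columns in a single step
summary = df.groupby("Sex").agg(
    Fare_mean=("Fare", "mean"),
    Survived_sum=("Survived", "sum"),
)
print(summary)
```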
2.6.5 Task 5
Compute the average fare for each age group within each ticket class
# Write your code here
df['Age_level'] = pd.cut(df.Age, bins = 4)
df.groupby(['Pclass','Age_level'])['Fare'].mean()
Pclass Age_level
1 (0.34, 20.315] 116.136705
(20.315, 40.21] 97.959878
(40.21, 60.105] 70.386898
(60.105, 80.0] 59.969050
2 (0.34, 20.315] 24.725834
(20.315, 40.21] 21.055769
(40.21, 60.105] 20.254032
(60.105, 80.0] 10.500000
3 (0.34, 20.315] 16.580693
(20.315, 40.21] 11.402144
(40.21, 60.105] 12.248931
(60.105, 80.0] 7.820000
Name: Fare, dtype: float64
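With an integer `bins` argument, `pd.cut` splits the observed age range into equal-width intervals, which is where boundaries such as 20.315 come from. If more readable groups are wanted, explicit edges and labels can be passed instead. The sketch below is only an illustration: the ages, the edges, and the label names are assumptions, not part of the course.

```python
import pandas as pd

ages = pd.Series([5, 18, 25, 40, 62, 80])

# Four equal-width intervals over the observed range
equal_width = pd.cut(ages, bins=4)

# Explicit edges with hypothetical labels; intervals are closed on the right
custom = pd.cut(
    ages,
    bins=[0, 18, 40, 60, 80],
    labels=["child", "adult", "middle", "senior"],
)

print(equal_width.cat.categories)
print(custom.tolist())
```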
2.6.6 Task 6
Merge the results of Task 2 and Task 3 and save them to sex_fare_survived.csv
# Write your code here
pd.concat([df.groupby('Sex')['Fare'].mean(), df.groupby('Sex')['Survived'].sum()], axis = 1)
Sex | Fare | Survived |
---|---|---|
female | 44.479818 | 233 |
male | 25.523893 | 109 |
pd.merge(df.groupby('Sex')['Fare'].mean(), df.groupby('Sex')['Survived'].sum(),on='Sex')
Sex | Fare | Survived |
---|---|---|
female | 44.479818 | 233 |
male | 25.523893 | 109 |
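The task also asks for the merged table to be saved, a step the code above does not show. A minimal sketch of that step follows, using invented rows and writing into a temporary directory (an assumption made so the example leaves no file behind; in the notebook the file name alone would do):

```python
import os
import tempfile

import pandas as pd

# Hypothetical passengers
df = pd.DataFrame({
    "Sex": ["male", "female", "female", "male"],
    "Fare": [10.0, 30.0, 50.0, 20.0],
    "Survived": [0, 1, 1, 1],
})

fare_mean = df.groupby("Sex")["Fare"].mean()
survived_sum = df.groupby("Sex")["Survived"].sum()
sex_fare_survived = pd.concat([fare_mean, survived_sum], axis=1)

# Save with the index (Sex) as the first column, then read it back to check
out_path = os.path.join(tempfile.mkdtemp(), "sex_fare_survived.csv")
sex_fare_survived.to_csv(out_path)
reloaded = pd.read_csv(out_path, index_col="Sex")
print(reloaded)
```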
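2.6.7 Task 7
Compute the total number of survivors at each age, find the age with the most survivors, and finally compute the survival rate for that age (survivors / total number of people)
# Write your code here
survived_age = df.groupby('Age')['Survived'].sum()
# Write your code here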
survived_age.sort_values(ascending= False)
Age
24.0 15
22.0 11
27.0 11
36.0 11
35.0 11
..
20.5 0
23.5 0
24.5 0
28.5 0
40.5 0
Name: Survived, Length: 88, dtype: int64
# Share of all survivors who belong to the age with the most survivors (15 / 342)
survived_age.sort_values(ascending= False).iloc[0] / df.Survived.sum()
0.043859649122807015
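The value above divides by the total number of survivors, so it is the share of all survivors who were 24 years old. Another reading of "survival rate" is the fraction of passengers of that age who survived. The sketch below computes both on invented rows so the two numbers can be told apart; the output column names survivors, passengers, and rate are chosen for illustration.

```python
import pandas as pd

# Hypothetical passengers
df = pd.DataFrame({
    "Age": [24, 24, 24, 24, 30, 30, 40],
    "Survived": [1, 1, 0, 0, 1, 0, 0],
})

# Per age: number of survivors, number of passengers, and survival rate
by_age = df.groupby("Age")["Survived"].agg(
    survivors="sum", passengers="count", rate="mean"
)

top_age = by_age["survivors"].idxmax()

# Reading 1: share of all survivors who have this age
share_of_survivors = by_age.loc[top_age, "survivors"] / df["Survived"].sum()

# Reading 2: survival rate among passengers of this age
rate_within_age = by_age.loc[top_age, "rate"]

print(top_age, share_of_survivors, rate_within_age)
```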