Joyful Pandas

Notes on joining, grouping, and aggregation from the Datawhale community's Joyful Pandas course:

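Joins

Relational joins

Left join: keeps every key from the left table and attaches matching rows from the right table
Right join: keeps every key from the right table and attaches matching rows from the left table
Inner join: keeps only the keys present in both tables
Outer join: keeps all keys from both tables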
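The four join types can be compared on a pair of toy tables. This is a minimal sketch; the tables and the key column name id are made up for illustration and do not come from the course data.

```python
import pandas as pd

# Hypothetical toy tables; the key column "id" is an assumption for illustration.
left = pd.DataFrame({"id": [1, 2, 3], "name": ["a", "b", "c"]})
right = pd.DataFrame({"id": [2, 3, 4], "score": [20, 30, 40]})

# Collect the key values that survive each join type.
keys = {
    how: sorted(left.merge(right, on="id", how=how)["id"].tolist())
    for how in ["left", "right", "inner", "outer"]
}
print(keys)
```

Keys without a match on the other side are kept by left, right, and outer joins, with the missing columns filled with NaN.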

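Column-wise join (aligned on the index)

df1.merge(df2, left_index = True, right_index = True)

Index join

df1.join(df2)

Row-wise (vertical) concatenation

df1.append(df2) # same as pd.concat([df1, df2], axis = 0); append was removed in pandas 2.0

Concatenation along an axis

pd.concat([df1, df2], axis = 0) # axis = 0 stacks rows, axis = 1 places columns side by side
df1.assign(new_col = s1) # returns a copy of df1 with the Series s1 added as the last column; new_col is a placeholder name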
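The effect of the axis argument is easiest to see on small frames. The sketch below uses made-up data (df1, df2, and s1 are placeholders) and also shows that assign needs a keyword naming the new column.

```python
import pandas as pd

# Placeholder data for illustration only.
df1 = pd.DataFrame({"a": [1, 2]})
df2 = pd.DataFrame({"a": [3, 4]})
s1 = pd.Series([10, 20], name="b")

rows = pd.concat([df1, df2], axis=0, ignore_index=True)  # stack rows
cols = pd.concat([df1, s1], axis=1)                       # place side by side
assigned = df1.assign(b=s1)                               # new column via keyword

print(rows.shape, cols.shape, assigned.shape)
```

Both concat with axis = 1 and assign align on the index, so rows are matched by label rather than by position. assign returns a new DataFrame and leaves df1 unchanged.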

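Grouping pattern and the groupby object

Grouping pattern
df.groupby(grouping keys)[source columns].operation

The groupby object
gb = df.groupby([..., ...])

Attributes
- ngroups: number of groups
- groups: dictionary mapping each group name to the row labels in that group
Methods
- size(): number of rows in each group
- get_group(): returns the rows that belong to the given group name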
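The attributes and methods listed above can be tried on a small frame. This is a sketch with invented values; only the column names Sex and Fare are borrowed from the Titanic data used later.

```python
import pandas as pd

# Invented rows for illustration.
df = pd.DataFrame({
    "Sex": ["male", "female", "female", "male", "female"],
    "Fare": [7.25, 71.28, 7.92, 8.05, 53.10],
})
gb = df.groupby("Sex")

n_groups = gb.ngroups                                     # number of groups
group_rows = {k: list(v) for k, v in gb.groups.items()}   # group name -> row labels
sizes = gb.size()                                         # rows per group
females = gb.get_group("female")                          # rows of a single group

print(n_groups, group_rows)
```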

Aggregation functions

gb.agg()
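As a rough illustration of agg on a groupby object, the sketch below passes a list of function names for one column and a dictionary for per-column functions. The data is invented.

```python
import pandas as pd

# Invented rows for illustration.
df = pd.DataFrame({
    "Sex": ["male", "female", "female", "male"],
    "Fare": [10.0, 30.0, 50.0, 20.0],
    "Survived": [0, 1, 1, 1],
})
gb = df.groupby("Sex")

multi = gb["Fare"].agg(["mean", "max"])                # several functions, one column
per_col = gb.agg({"Fare": "mean", "Survived": "sum"})  # one function per column

print(multi)
print(per_col)
```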

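Hands-on Data Analysis (动手学数据分析)

Notes on data restructuring from the Datawhale community's Hands-on Data Analysis course:

Before starting, import the numpy and pandas packages and load the data

# Import the basic libraries
import numpy as np
import pandas as pd

# Load train-left-up.csv from the data folder
pd.read_csv("data/train-left-up.csv").head()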

	PassengerId	Survived	Pclass	Name
0	1	0	3	Braund, Mr. Owen Harris
1	2	1	1	Cumings, Mrs. John Bradley (Florence Briggs Th...
2	3	1	3	Heikkinen, Miss. Laina
3	4	1	1	Futrelle, Mrs. Jacques Heath (Lily May Peel)
4	5	0	3	Allen, Mr. William Henry

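2 Chapter 2: Data Restructuring

2.4 Merging Data

2.4.1 Task 1

Load all the data files in the data folder and observe how they relate to each other

# Write your code here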
text_left_up = pd.read_csv("data/train-left-up.csv")
text_left_down = pd.read_csv("data/train-left-down.csv")
text_right_up = pd.read_csv("data/train-right-up.csv")
text_right_down = pd.read_csv("data/train-right-down.csv")

# Write your code here
my_list = [text_left_up, text_left_down, text_right_up, text_right_down]
for i in my_list:
    print(i.info())
    print('=========')

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 439 entries, 0 to 438
Data columns (total 4 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   PassengerId  439 non-null    int64 
 1   Survived     439 non-null    int64 
 2   Pclass       439 non-null    int64 
 3   Name         439 non-null    object
dtypes: int64(3), object(1)
memory usage: 13.8+ KB
None
=========
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 452 entries, 0 to 451
Data columns (total 4 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   PassengerId  452 non-null    int64 
 1   Survived     452 non-null    int64 
 2   Pclass       452 non-null    int64 
 3   Name         452 non-null    object
dtypes: int64(3), object(1)
memory usage: 14.2+ KB
None
=========
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 439 entries, 0 to 438
Data columns (total 8 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   Sex       439 non-null    object 
 1   Age       352 non-null    float64
 2   SibSp     439 non-null    int64  
 3   Parch     439 non-null    int64  
 4   Ticket    439 non-null    object 
 5   Fare      439 non-null    float64
 6   Cabin     97 non-null     object 
 7   Embarked  438 non-null    object 
dtypes: float64(2), int64(2), object(4)
memory usage: 27.6+ KB
None
=========
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 452 entries, 0 to 451
Data columns (total 8 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   Sex       452 non-null    object 
 1   Age       362 non-null    float64
 2   SibSp     452 non-null    int64  
 3   Parch     452 non-null    int64  
 4   Ticket    452 non-null    object 
 5   Fare      452 non-null    float64
 6   Cabin     107 non-null    object 
 7   Embarked  451 non-null    object 
dtypes: float64(2), int64(2), object(4)
memory usage: 28.4+ KB
None
=========

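[Hint] Recalling the train.csv data we loaded earlier, make a rough guess at what the tables above are. The row counts (439 + 452 = 891) and column counts (4 + 8 = 12) suggest that the four files are the four quadrants of train.csv.

2.4.2 Task 2

Use concat to merge train-left-up.csv and train-right-up.csv horizontally into a single table and save it as result_up

# Write your code here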
result_up = pd.concat([text_left_up, text_right_up], axis = 1)
result_up.head()

	PassengerId	Survived	Pclass	Name	Sex	Age	SibSp	Parch	Ticket	Fare	Cabin	Embarked
0	1	0	3	Braund, Mr. Owen Harris	male	22.0	1	0	A/5 21171	7.2500	NaN	S
1	2	1	1	Cumings, Mrs. John Bradley (Florence Briggs Th...	female	38.0	1	0	PC 17599	71.2833	C85	C
2	3	1	3	Heikkinen, Miss. Laina	female	26.0	0	0	STON/O2. 3101282	7.9250	NaN	S
3	4	1	1	Futrelle, Mrs. Jacques Heath (Lily May Peel)	female	35.0	1	0	113803	53.1000	C123	S
4	5	0	3	Allen, Mr. William Henry	male	35.0	0	0	373450	8.0500	NaN	S

2.4.3 Task 3

Use concat to merge train-left-down and train-right-down horizontally into a single table and save it as result_down. Then stack result_up and result_down vertically into result.

# Write your code here
result_down = pd.concat([text_left_down, text_right_down], axis = 1)
result_down.head()

	PassengerId	Survived	Pclass	Name	Sex	Age	SibSp	Parch	Ticket	Fare	Cabin	Embarked
0	440	0	2	Kvillner, Mr. Johan Henrik Johannesson	male	31.0	0	0	C.A. 18723	10.500	NaN	S
1	441	1	2	Hart, Mrs. Benjamin (Esther Ada Bloomfield)	female	45.0	1	1	F.C.C. 13529	26.250	NaN	S
2	442	0	3	Hampe, Mr. Leon	male	20.0	0	0	345769	9.500	NaN	S
3	443	0	3	Petterson, Mr. Johan Emil	male	25.0	1	0	347076	7.775	NaN	S
4	444	1	2	Reynaldo, Ms. Encarnacion	female	28.0	0	0	230434	13.000	NaN	S

result = pd.concat([result_up, result_down], axis = 0).reset_index(drop = True)
result.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB

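2.4.4 Task 4

Use the DataFrame methods join and append to complete Tasks 2 and 3

# Write your code here
# Note: DataFrame.append was removed in pandas 2.0; on newer versions use pd.concat([result_up, result_down])
result_up = text_left_up.join(text_right_up)
result_down = text_left_down.join(text_right_down)
result = result_up.append(result_down).reset_index(drop = True)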
result.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB

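2.4.5 Task 5

Use the Pandas merge method and the DataFrame append method to complete Tasks 2 and 3

# Write your code here
# Note: DataFrame.append was removed in pandas 2.0; on newer versions use pd.concat([result_up, result_down])
result_up = text_left_up.merge(text_right_up, left_index = True, right_index = True)
result_down = text_left_down.merge(text_right_down, left_index = True, right_index = True)
result = result_up.append(result_down).reset_index(drop = True)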
result.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB

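[Think] Compare how merge, join, and concat differ and what they have in common. Why do Tasks 4 and 5 both require DataFrame's append method, and could they be completed with merge or join alone?

merge and join combine tables column-wise (horizontally), matching rows on keys or on the index, while append stacks rows (vertically). concat covers both directions through its axis argument. Official documentation:
pandas.DataFrame.join
pandas.DataFrame.merge
pandas.DataFrame.append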
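One possible answer to the question above, sketched on invented two-row tables: pd.concat is the portable replacement for append, and an outer merge on all shared columns can also stack rows without append. The merge variant is an illustration, not the course's reference solution.

```python
import pandas as pd

# Invented stand-ins for result_up and result_down.
up = pd.DataFrame({"PassengerId": [1, 2], "Sex": ["male", "female"]})
down = pd.DataFrame({"PassengerId": [3, 4], "Sex": ["female", "male"]})

# Portable replacement for up.append(down).
by_concat = pd.concat([up, down], ignore_index=True)

# Vertical stacking with merge only: with no "on" argument, merge uses all
# shared columns as keys, and how="outer" keeps the rows from both tables.
by_merge = up.merge(down, how="outer")

print(by_concat)
print(by_merge)
```

The merge approach only matches concat when the two tables share no identical rows, because identical rows are treated as matching keys and paired up rather than repeated. An outer merge may also reorder rows by key.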

2.5 Looking at the Data from Another Angle

2.5.1 Task 1

Convert our data into a Series

# Write your code here
result.stack().shape

(9826,)

# Write your code here
result.stack()[0]

PassengerId                          1
Survived                             0
Pclass                               3
Name           Braund, Mr. Owen Harris
Sex                               male
Age                               22.0
SibSp                                1
Parch                                0
Ticket                       A/5 21171
Fare                              7.25
Embarked                             S
dtype: object
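stack moves the column labels into an inner index level, so each cell becomes one entry of a Series keyed by (row, column). In the output above, missing cells were dropped, which is why the shape is 9826 rather than 891 × 12 = 10692: the info() output shows 177 + 687 + 2 = 866 null cells in Age, Cabin, and Embarked. Whether nulls are dropped depends on the pandas version, so the sketch below uses invented data without missing values.

```python
import pandas as pd

# Invented rows with no missing values.
df = pd.DataFrame({"Age": [22.0, 38.0], "Fare": [7.25, 71.28]})

stacked = df.stack()          # Series with a (row label, column name) MultiIndex
row0 = stacked.loc[0]         # all values of row 0, indexed by column name
restored = stacked.unstack()  # back to a DataFrame

print(stacked)
```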

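Review: earlier we learned the basics of Pandas, and in Chapter 2 we move into the business side of data analysis. In the first section of Chapter 2 we studied data cleaning, which is very important: only once the data is reasonably clean can the analysis that follows carry weight. In this section we restructure the data, which still falls under data understanding (preparation).

2.6 Applying the Data

2.6.1 Task 1

Learn about the GroupBy mechanism from the textbook Python for Data Analysis (p. 303), Google, or any other source

# Write your notes here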
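GroupBy is commonly described as split-apply-combine: split the rows by key, apply a function to each piece, and combine the results. The sketch below performs the three steps by hand on invented data and compares the result with the built-in call.

```python
import pandas as pd

# Invented rows for illustration.
df = pd.DataFrame({
    "Sex": ["male", "female", "female", "male"],
    "Fare": [10.0, 30.0, 50.0, 20.0],
})

# Split: iterating over a groupby yields (group name, sub-DataFrame) pairs.
pieces = {name: group for name, group in df.groupby("Sex")}

# Apply: compute one value per piece.
means = {name: group["Fare"].mean() for name, group in pieces.items()}

# Combine: assemble the per-group values into a single Series.
manual = pd.Series(means, name="Fare").rename_axis("Sex")

builtin = df.groupby("Sex")["Fare"].mean()
print(manual.to_dict() == builtin.to_dict())
```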

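2.6.2 Task 2

Compute the average fare paid by men and by women on the Titanic

# Write your code here
# df is assumed to be the full merged Titanic table (for example, df = result from section 2.4)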
df.groupby('Sex')['Fare'].mean()

Sex
female    44.479818
male      25.523893
Name: Fare, dtype: float64

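Once you understand the GroupBy mechanism, you can use it to carry out a series of operations and reach the goal.

The following tasks provide practice with GroupBy.

2.6.3 Task 3

Count the number of male and female survivors on the Titanic

# Write your code here
df.groupby('Sex')['Survived'].sum()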

Sex
female    233
male      109
Name: Survived, dtype: int64

2.6.4 Task 4

Count the number of survivors in each passenger class

# Write your code here
df.groupby('Pclass')['Survived'].sum()

Pclass
1    136
2     87
3    119
Name: Survived, dtype: int64

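[Note] The difference between count and sum:
count returns the number of non-null rows in each group, while sum adds up the values; for 0/1 labels the sum is the number of 1s

print(df.groupby('Pclass')['Survived'].count())
print('=========')
df.Pclass.value_counts()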

Pclass
1    216
2    184
3    491
Name: Survived, dtype: int64
=========
3    491
1    216
2    184
Name: Pclass, dtype: int64
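The relationship between count, sum, and the survival rate can be checked on a small invented table. For a 0/1 column, the mean is the sum divided by the count.

```python
import pandas as pd

# Invented rows for illustration.
df = pd.DataFrame({
    "Pclass": [1, 1, 2, 3, 3, 3],
    "Survived": [1, 0, 1, 0, 0, 1],
})
g = df.groupby("Pclass")["Survived"]

stats = pd.DataFrame({
    "count": g.count(),  # rows per class
    "sum": g.sum(),      # survivors per class (number of 1s)
    "rate": g.mean(),    # survivors / rows
})
print(stats)
```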

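[Hint] In the Survived column, 1 means the passenger survived and 0 means the passenger died

[Think] From a data-analysis perspective, what conclusions can be drawn from the statistics above?

# Notes

Women paid a higher average fare than men (44.48 vs 25.52)
Far more women survived than men (233 vs 109)
First class has the highest survival rate (136 of 216, about 63%, against 47% in second class and 24% in third)

[Think] The computations in Tasks 2 and 3 can be done in a single call with agg(), and the resulting columns can be renamed with rename. Can you write this out following the hint?

# Notes
df.groupby('Sex').agg({'Fare': 'mean', 'Survived': 'sum'}).rename(
    columns = {'Fare': 'Fare_mean', 'Survived': 'Survived_sum'})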

	Fare_mean	Survived_sum
Sex
female	44.479818	233
male	25.523893	109
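An alternative to chaining rename is named aggregation, where each keyword names an output column and maps to a (column, function) pair. The sketch below uses invented data; it is an optional variation, not the course's answer.

```python
import pandas as pd

# Invented rows for illustration.
df = pd.DataFrame({
    "Sex": ["male", "female", "female", "male"],
    "Fare": [10.0, 30.0, 50.0, 20.0],
    "Survived": [0, 1, 1, 1],
})

summary = df.groupby("Sex").agg(
    Fare_mean=("Fare", "mean"),
    Survived_sum=("Survived", "sum"),
)
print(summary)
```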

2.6.5 Task 5

Compute the average fare for each age band within each ticket class

# Write your code here
df['Age_level'] = pd.cut(df.Age, bins = 4)
df.groupby(['Pclass','Age_level'])['Fare'].mean()

Pclass  Age_level      
1       (0.34, 20.315]     116.136705
        (20.315, 40.21]     97.959878
        (40.21, 60.105]     70.386898
        (60.105, 80.0]      59.969050
2       (0.34, 20.315]      24.725834
        (20.315, 40.21]     21.055769
        (40.21, 60.105]     20.254032
        (60.105, 80.0]      10.500000
3       (0.34, 20.315]      16.580693
        (20.315, 40.21]     11.402144
        (40.21, 60.105]     12.248931
        (60.105, 80.0]       7.820000
Name: Fare, dtype: float64
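pd.cut with an integer bins argument splits the observed range into equal-width intervals, which explains boundaries such as 20.315 above. If rounder age bands are preferred, explicit edges and labels can be passed instead. The edges and label names below are arbitrary choices for illustration.

```python
import pandas as pd

# Invented ages for illustration.
ages = pd.Series([2, 15, 25, 38, 59, 80])

# Four equal-width bins spanning the minimum to the maximum age.
equal_width = pd.cut(ages, bins=4)

# Explicit edges; intervals are closed on the right by default, e.g. (0, 18].
custom = pd.cut(
    ages,
    bins=[0, 18, 40, 60, 80],
    labels=["child", "adult", "middle", "senior"],
)
print(custom.tolist())
```

Rows whose age is missing fall into no bin and are left out of a groupby on the binned column.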

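2.6.6 Task 6

Merge the results of Tasks 2 and 3 and save them to sex_fare_survived.csv

# Write your code here
sex_fare_survived = pd.concat([df.groupby('Sex')['Fare'].mean(), df.groupby('Sex')['Survived'].sum()], axis = 1)
sex_fare_survived.to_csv('sex_fare_survived.csv')  # save the merged table
sex_fare_survived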

	Fare	Survived
Sex
female	44.479818	233
male	25.523893	109

pd.merge(df.groupby('Sex')['Fare'].mean(), df.groupby('Sex')['Survived'].sum(),on='Sex')

	Fare	Survived
Sex
female	44.479818	233
male	25.523893	109

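2.6.7 Task 7

Compute the total number of survivors at each age, find the age with the most survivors, and finally compute the survival rate for that age (survivors / total)

# Write your code here
survived_age =  df.groupby('Age')['Survived'].sum()

# Write your code here
survived_age.sort_values(ascending= False)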

Age
24.0    15
22.0    11
27.0    11
36.0    11
35.0    11
        ..
20.5     0
23.5     0
24.5     0
28.5     0
40.5     0
Name: Survived, Length: 88, dtype: int64

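# Share of all survivors who were at the top age: 15 / 342 (the denominator is the total number of survivors, not the number of passengers)
survived_age.sort_values(ascending= False).iloc[0] / df.Survived.sum()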

0.043859649122807015
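The phrase "survivors / total" can be read in more than one way, and the choice of denominator changes the answer. The sketch below computes three candidate ratios on invented data so the differences are visible; which one the task intends is left open.

```python
import pandas as pd

# Invented rows for illustration.
df = pd.DataFrame({
    "Age": [24, 24, 24, 24, 30, 30, 40],
    "Survived": [1, 1, 0, 0, 1, 0, 0],
})

by_age = df.groupby("Age")["Survived"].agg(["sum", "count"])
top_age = by_age["sum"].idxmax()            # age with the most survivors
top_survivors = by_age.loc[top_age, "sum"]

share_of_survivors = top_survivors / df["Survived"].sum()    # as computed above
rate_within_age = top_survivors / by_age.loc[top_age, "count"]
share_of_passengers = top_survivors / len(df)

print(top_age, share_of_survivors, rate_within_age, share_of_passengers)
```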

