Copyright notice: please credit the source when reposting (and give the blogger a pat on the head). https://blog.csdn.net/cris_zz/article/details/84336138
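1. Sorting by a Specified Column in Pandas
import pandas as pd
'''
sort_values sorts by the specified column. With inplace=True the original DataFrame is sorted in place;
otherwise a new, sorted DataFrame is returned. NaN denotes a missing value. The sort is ascending by default
and can be reversed with the ascending parameter. By default NaN values are placed last in either direction.
'''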
data = pd.read_csv('food_info.csv')
print(data['Sodium_(mg)'])
data.sort_values('Sodium_(mg)',inplace=True)
print(data['Sodium_(mg)'])
data.sort_values('Sodium_(mg)',inplace=True,ascending=False)
print(data['Sodium_(mg)'])
0 643.0
1 659.0
2 2.0
3 1146.0
4 560.0
5 629.0
6 842.0
7 690.0
8 644.0
9 700.0
10 604.0
11 364.0
12 344.0
13 372.0
14 308.0
15 406.0
16 365.0
17 812.0
18 917.0
19 800.0
20 600.0
21 819.0
22 714.0
23 800.0
24 600.0
25 627.0
26 710.0
27 619.0
28 682.0
29 628.0
...
8588 2.0
8589 2.0
8590 7.0
8591 564.0
8592 464.0
8593 490.0
8594 1.0
8595 199.0
8596 297.0
8597 16.0
8598 486.0
8599 0.0
8600 2.0
8601 1297.0
8602 1435.0
8603 2838.0
8604 10.0
8605 2.0
8606 12.0
8607 0.0
8608 3326.0
8609 1765.0
8610 3750.0
8611 29.0
8612 58.0
8613 4450.0
8614 667.0
8615 58.0
8616 70.0
8617 68.0
Name: Sodium_(mg), Length: 8618, dtype: float64
760 0.0
758 0.0
405 0.0
761 0.0
2269 0.0
763 0.0
764 0.0
770 0.0
774 0.0
396 0.0
395 0.0
6827 0.0
394 0.0
393 0.0
391 0.0
390 0.0
787 0.0
788 0.0
2270 0.0
2231 0.0
407 0.0
748 0.0
409 0.0
747 0.0
702 0.0
703 0.0
704 0.0
705 0.0
706 0.0
707 0.0
...
8153 NaN
8155 NaN
8156 NaN
8157 NaN
8158 NaN
8159 NaN
8160 NaN
8161 NaN
8163 NaN
8164 NaN
8165 NaN
8167 NaN
8169 NaN
8170 NaN
8172 NaN
8173 NaN
8174 NaN
8175 NaN
8176 NaN
8177 NaN
8178 NaN
8179 NaN
8180 NaN
8181 NaN
8183 NaN
8184 NaN
8185 NaN
8195 NaN
8251 NaN
8267 NaN
Name: Sodium_(mg), Length: 8618, dtype: float64
276 38758.0
5814 27360.0
6192 26050.0
1242 26000.0
1245 24000.0
1243 24000.0
1244 23875.0
292 17000.0
1254 11588.0
5811 10600.0
8575 9690.0
291 8068.0
1249 8031.0
5812 7893.0
1292 7851.0
293 7203.0
4472 7027.0
4836 6820.0
1261 6580.0
3747 6008.0
1266 5730.0
4835 5586.0
4834 5493.0
1263 5356.0
1553 5203.0
1552 5053.0
1251 4957.0
1257 4843.0
294 4616.0
8613 4450.0
...
8153 NaN
8155 NaN
8156 NaN
8157 NaN
8158 NaN
8159 NaN
8160 NaN
8161 NaN
8163 NaN
8164 NaN
8165 NaN
8167 NaN
8169 NaN
8170 NaN
8172 NaN
8173 NaN
8174 NaN
8175 NaN
8176 NaN
8177 NaN
8178 NaN
8179 NaN
8180 NaN
8181 NaN
8183 NaN
8184 NaN
8185 NaN
8195 NaN
8251 NaN
8267 NaN
Name: Sodium_(mg), Length: 8618, dtype: float64
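Both sorted outputs above end in NaN rows. The following is a minimal sketch of why, using a small made-up column in place of food_info.csv (the values and variable names are illustrative assumptions, not taken from the dataset). sort_values returns a new object unless inplace=True is passed, and its na_position parameter, which defaults to 'last', decides where missing values go.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature stand-in for the Sodium_(mg) column.
df = pd.DataFrame({"Sodium_(mg)": [643.0, np.nan, 2.0, 1146.0]})

# Without inplace=True, sort_values returns a new DataFrame and leaves df untouched.
ascending = df.sort_values("Sodium_(mg)")
descending = df.sort_values("Sodium_(mg)", ascending=False)

# na_position="first" moves the missing values to the top instead.
nan_first = df.sort_values("Sodium_(mg)", na_position="first")

print(ascending.index.tolist())   # [2, 0, 3, 1]
print(descending.index.tolist())  # [3, 0, 2, 1]
print(nan_first.index.tolist())   # [1, 2, 0, 3]
print(df.index.tolist())          # [0, 1, 2, 3], the original order is preserved
```

The printed lists are the original row labels, which travel with the rows when they are reordered.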
2. The Classic Titanic Starter Example
import numpy as np
'''
isnull flags the missing values in a column: it returns True for NaN and False for a normal value
'''
titanic_survival = pd.read_csv('titanic_train.csv')
titanic_survival.head()
age = titanic_survival['Age']
age_top_10 = (age[0:10])
age_is_null = pd.isnull(age_top_10)
print(age_is_null)
# Use the boolean mask as an index to select only the entries with missing values
age_null = age_top_10[age_is_null]
print(age_null)
age_null_count = len(age_null)
print(age_null_count)
0 False
1 False
2 False
3 False
4 False
5 True
6 False
7 False
8 False
9 False
Name: Age, dtype: bool
5 NaN
Name: Age, dtype: float64
1
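Because True and False sum as 1 and 0, the number of missing values can also be read straight off the mask, without building the filtered Series first. The sketch below assumes a small hand-made Age column rather than titanic_train.csv.

```python
import numpy as np
import pandas as pd

# Hypothetical ages; two of them are missing.
age = pd.Series([22.0, 38.0, np.nan, 35.0, np.nan], name="Age")

mask = age.isna()                # Series.isna is the method form of pd.isnull
missing_count = int(mask.sum())  # True counts as 1, False as 0
missing_rows = age[mask]         # boolean indexing keeps only the True positions

print(missing_count)                # 2
print(missing_rows.index.tolist())  # [2, 4]
```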
3. Common Pandas Data Preprocessing Functions
3.1 Handling Missing Values
'''
If NaN values are left untreated, the computed result is itself nan
'''
average_age = sum(titanic_survival['Age'])/len(titanic_survival['Age'])
print(average_age)
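'''
A handy way to deal with missing values: index with a boolean expression to keep only the values that are not NaN
'''
# First, isnull returns a boolean Series for the column: True where the value is missing, False otherwise
all_age_null = pd.isnull(titanic_survival['Age'])
print(all_age_null)
# Then use the inverted mask (~ flips True and False) as an index to keep every non-missing value
good_ages = titanic_survival['Age'][~all_age_null]
print(good_ages)
age_average = sum(good_ages)/len(good_ages)
# 29.69911764705882
print(age_average)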
print(age_average)
nan
0 False
1 False
2 False
3 False
4 False
5 True
6 False
7 False
8 False
9 False
10 False
11 False
12 False
13 False
14 False
15 False
16 False
17 True
18 False
19 True
20 False
21 False
22 False
23 False
24 False
25 False
26 True
27 False
28 True
29 True
...
861 False
862 False
863 True
864 False
865 False
866 False
867 False
868 True
869 False
870 False
871 False
872 False
873 False
874 False
875 False
876 False
877 False
878 True
879 False
880 False
881 False
882 False
883 False
884 False
885 False
886 False
887 False
888 True
889 False
890 False
Name: Age, Length: 891, dtype: bool
0 22.0
1 38.0
2 26.0
3 35.0
4 35.0
6 54.0
7 2.0
8 27.0
9 14.0
10 4.0
11 58.0
12 20.0
13 39.0
14 14.0
15 55.0
16 2.0
18 31.0
20 35.0
21 34.0
22 15.0
23 28.0
24 8.0
25 38.0
27 19.0
30 40.0
33 66.0
34 28.0
35 42.0
37 21.0
38 18.0
...
856 45.0
857 51.0
858 24.0
860 41.0
861 21.0
862 48.0
864 24.0
865 42.0
866 27.0
867 31.0
869 4.0
870 26.0
871 47.0
872 33.0
873 47.0
874 28.0
875 15.0
876 20.0
877 19.0
879 56.0
880 25.0
881 33.0
882 22.0
883 28.0
884 25.0
885 39.0
886 27.0
887 19.0
889 26.0
890 32.0
Name: Age, Length: 714, dtype: float64
29.69911764705882
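The nan in the first result comes from Python's built-in sum, which adds every element, and any addition involving NaN yields NaN. Here is a minimal sketch on three made-up ages (an assumption for illustration, not the Titanic data), also showing that dropna selects the same rows as the boolean mask.

```python
import math
import pandas as pd

# Hypothetical ages with one missing value.
age = pd.Series([22.0, float("nan"), 26.0], name="Age")

# Built-in sum propagates NaN, so the naive average is nan.
naive_average = sum(age) / len(age)

# Keep only the non-missing values, either with a mask or with dropna.
good_ages = age[age.notna()]
same_rows = age.dropna()

manual_average = sum(good_ages) / len(good_ages)

print(math.isnan(naive_average))    # True
print(manual_average)               # 24.0
print(good_ages.equals(same_rows))  # True
```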
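3.2 Pandas Functions That Automatically Skip Missing Values
# missing data is so common that many pandas methods automatically filter for it
# Although pandas can skip or drop missing values for us, discarding data casually is not recommended; the usual practice is
# to fill the gaps with a computed default value instead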
mean_age = titanic_survival['Age'].mean()
print(mean_age)
29.69911764705882
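As a sketch of both ideas in this section: Series.mean skips NaN because its skipna parameter defaults to True, and fillna can then substitute a computed default for the gaps. The three ages below are made up for illustration, and filling with the mean is only one possible choice of default.

```python
import numpy as np
import pandas as pd

# Hypothetical ages with one missing value.
age = pd.Series([20.0, np.nan, 40.0], name="Age")

mean_age = age.mean()                  # NaN is skipped: (20 + 40) / 2
strict_mean = age.mean(skipna=False)   # NaN is kept, so the result is nan

# Fill the gap with the computed default instead of dropping the row.
filled = age.fillna(mean_age)

print(mean_age)              # 30.0
print(np.isnan(strict_mean)) # True
print(filled.tolist())       # [20.0, 30.0, 40.0]
```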
3.3 Manually Computing the Average Fare for Each Passenger Class
Pclass = [1,2,3]
Pclass_avg_price = {}
for this_pclass in Pclass:
    # Select the rows (samples) whose Pclass matches, then average the chosen column (feature):
    # the sum of its values divided by the number of selected rows
    prices = titanic_survival[titanic_survival['Pclass'] == this_pclass]
    # Pclass_avg_price[this_pclass] = sum(prices['Fare'])/len(prices)
    # The mean can also be computed with the built-in pandas method shown in section 3.2!
    Pclass_avg_price[this_pclass] = prices['Fare'].mean()
print(Pclass_avg_price)
{1: 84.15468749999992, 2: 20.66218315217391, 3: 13.675550101832997}
3.4 Simplifying the Section 3.3 Calculation with a Built-in Pandas Function
'''
index tells the method which column to group by
values is the column that we want to apply the calculation to
aggfunc specifies the calculation we want to perform
'''
# Note: newer pandas releases may emit a FutureWarning when a NumPy callable such as np.mean is passed; the string 'mean' is the equivalent form
passenger_survival = titanic_survival.pivot_table(index='Pclass', values='Survived', aggfunc=np.mean)
print(passenger_survival)
# Note: if aggfunc is omitted, the mean is computed by default
avg_age = titanic_survival.pivot_table(index='Pclass', values='Age')
print(avg_age)
age = titanic_survival.pivot_table(index='Pclass', values='Age', aggfunc=np.mean)
print(age)
Survived
Pclass
1 0.629630
2 0.472826
3 0.242363
Age
Pclass
1 38.233441
2 29.877630
3 25.140620
Age
Pclass
1 38.233441
2 29.877630
3 25.140620
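For a single index column and a single aggregation, pivot_table gives the same numbers as a groupby followed by the aggregation. The sketch below checks this on a five-row made-up table (the data is an assumption, not the Titanic file) and uses the string 'mean' for aggfunc.

```python
import pandas as pd

# Hypothetical passengers: class and whether they survived.
df = pd.DataFrame({
    "Pclass":   [1, 1, 2, 3, 3],
    "Survived": [1, 0, 1, 0, 0],
})

# Pivot table: group by Pclass, average the Survived column.
by_pivot = df.pivot_table(index="Pclass", values="Survived", aggfunc="mean")

# The same calculation written as an explicit groupby.
by_group = df.groupby("Pclass")[["Survived"]].mean()

print(by_pivot["Survived"].tolist())  # [0.5, 1.0, 0.0]
print(by_group["Survived"].tolist())  # [0.5, 1.0, 0.0]
```

Which form to use is largely a matter of taste; groupby tends to read more naturally once several aggregations or custom functions are involved.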
3.5 Grouped Calculations Across Specified Columns
# Group by port of embarkation, then total the fares and the number of survivors for each group
Fare_survived = titanic_survival.pivot_table(index='Embarked', values=['Fare', 'Survived'], aggfunc=np.sum)
print(Fare_survived)
                Fare  Survived
Embarked
C         10072.2962        93
Q          1022.2543        30
S         17439.3988       217
# specifying axis = 1 or axis = 'columns' will drop any columns that have null values
drop_col = titanic_survival.dropna(axis=1)
print(drop_col.head())
# Drop a row (sample) if its Age or Sex value is missing
new_data = titanic_survival.dropna(axis=0, subset=['Age','Sex'])
print(new_data.head())
# The counterpart, fillna, fills null values instead of dropping them
PassengerId Survived Pclass \
0 1 0 3
1 2 1 1
2 3 1 3
3 4 1 1
4 5 0 3
Name Sex SibSp Parch \
0 Braund, Mr. Owen Harris male 1 0
1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 1 0
2 Heikkinen, Miss. Laina female 0 0
3 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 1 0
4 Allen, Mr. William Henry male 0 0
Ticket Fare
0 A/5 21171 7.2500
1 PC 17599 71.2833
2 STON/O2. 3101282 7.9250
3 113803 53.1000
4 373450 8.0500
PassengerId Survived Pclass \
0 1 0 3
1 2 1 1
2 3 1 3
3 4 1 1
4 5 0 3
Name Sex Age SibSp \
0 Braund, Mr. Owen Harris male 22.0 1
1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1
2 Heikkinen, Miss. Laina female 26.0 0
3 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1
4 Allen, Mr. William Henry male 35.0 0
Parch Ticket Fare Cabin Embarked
0 0 A/5 21171 7.2500 NaN S
1 0 PC 17599 71.2833 C85 C
2 0 STON/O2. 3101282 7.9250 NaN S
3 0 113803 53.1000 C123 S
4 0 373450 8.0500 NaN S
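The section mentions fillna without showing it. Here is a minimal sketch of dropna and fillna side by side on a three-row made-up table. The column values and the 'unknown' placeholder are illustrative assumptions.

```python
import numpy as np
import pandas as pd

# Hypothetical rows: Age is missing in row 1, Sex is missing in row 2.
df = pd.DataFrame({
    "Age":  [22.0, np.nan, 26.0],
    "Sex":  ["male", "female", None],
    "Fare": [7.25, 71.28, 7.92],
})

# axis=1 drops every column that contains at least one missing value.
complete_cols = df.dropna(axis=1)

# axis=0 with subset drops a row if any of the listed columns is missing.
complete_rows = df.dropna(axis=0, subset=["Age", "Sex"])

# fillna with a dict supplies a separate default for each column.
filled = df.fillna({"Age": df["Age"].mean(), "Sex": "unknown"})

print(list(complete_cols.columns))   # ['Fare']
print(complete_rows.index.tolist())  # [0]
print(filled["Age"].tolist())        # [22.0, 24.0, 26.0]
print(filled["Sex"].tolist())        # ['male', 'female', 'unknown']
```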
3.6 Locating Data
# loc locates a single value by row label (here the default integer index) and column name
print(titanic_survival.loc[12,'Age'])
print(titanic_survival.loc[342,'Pclass'])
20.0
2
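With the default integer index, a row's label and its position coincide, so loc looks like a lookup by row number. Once rows are reordered the two diverge. The sketch below uses a made-up three-row table (an assumption for illustration) to contrast loc, which uses labels, with iloc, which uses positions.

```python
import pandas as pd

# Hypothetical passengers with the default index 0, 1, 2.
df = pd.DataFrame({"Age": [22.0, 80.0, 35.0], "Pclass": [3, 1, 2]})

# Sorting reorders the rows but each row keeps its original label.
by_age = df.sort_values("Age", ascending=False)

label_lookup = by_age.loc[0, "Age"]      # row whose label is 0
position_lookup = by_age["Age"].iloc[0]  # row that is now first

print(by_age.index.tolist())  # [1, 2, 0]
print(label_lookup)           # 22.0
print(position_lookup)        # 80.0
```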
3.7 Rebuilding the Index After Sorting
new_data = titanic_survival.sort_values('Age', ascending=False)
# Discard the old index and renumber the sorted rows from 0; inplace=True modifies the data directly
new_data.reset_index(drop=True,inplace=True)
print(new_data.head())
PassengerId Survived Pclass Name Sex \
0 631 1 1 Barkworth, Mr. Algernon Henry Wilson male
1 852 0 3 Svensson, Mr. Johan male
2 494 0 1 Artagaveytia, Mr. Ramon male
3 97 0 1 Goldschmidt, Mr. George B male
4 117 0 3 Connors, Mr. Patrick male
Age SibSp Parch Ticket Fare Cabin Embarked
0 80.0 0 0 27042 30.0000 A23 S
1 74.0 0 0 347060 7.7750 NaN S
2 71.0 0 0 PC 17609 49.5042 NaN C
3 71.0 0 0 PC 17754 34.6542 A5 C
4 70.5 0 0 370369 7.7500 NaN Q
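A short sketch of the index options after a sort, on made-up ages (an assumption, not the Titanic data): reset_index(drop=True) discards the old labels, reset_index() keeps them as a new column named index, and sort_values(ignore_index=True), available in pandas 1.0 and later, renumbers in the same call.

```python
import pandas as pd

# Hypothetical ages with the default index 0, 1, 2.
df = pd.DataFrame({"Age": [22.0, 80.0, 35.0]})

by_age = df.sort_values("Age", ascending=False)

fresh = by_age.reset_index(drop=True)  # old labels are thrown away
kept = by_age.reset_index()            # old labels move into an "index" column
one_step = df.sort_values("Age", ascending=False, ignore_index=True)

print(by_age.index.tolist())    # [1, 2, 0]
print(fresh.index.tolist())     # [0, 1, 2]
print(kept["index"].tolist())   # [1, 2, 0]
print(one_step["Age"].tolist()) # [80.0, 35.0, 22.0]
```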
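3.8 Custom Functions
# Define a function that returns the value in the 100th row (index label 99) of a column
def hundredth_data(column):
    data = column.loc[99]
    return data
# By default, apply calls the function once per column
data = titanic_survival.apply(hundredth_data)
print(data)
# Count the missing values in each column
def null_count(column):
    col_null = pd.isnull(column)
    null = column[col_null]
    return len(null)
count = titanic_survival.apply(null_count)
print('----------')
print(count)
# help() prints the documentation itself and returns None, so wrapping it in print() also prints a trailing "None"
print(help(pd.isnull))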
PassengerId 100
Survived 0
Pclass 2
Name Kantor, Mr. Sinai
Sex male
Age 34
SibSp 1
Parch 0
Ticket 244367
Fare 26
Cabin NaN
Embarked S
dtype: object
----------
PassengerId 0
Survived 0
Pclass 0
Name 0
Sex 0
Age 177
SibSp 0
Parch 0
Ticket 0
Fare 0
Cabin 687
Embarked 2
dtype: int64
Help on function isna in module pandas.core.dtypes.missing:
isna(obj)
Detect missing values for an array-like object.
This function takes a scalar or array-like object and indictates
whether values are missing (``NaN`` in numeric arrays, ``None`` or ``NaN``
in object arrays, ``NaT`` in datetimelike).
Parameters
----------
obj : scalar or array-like
Object to check for null or missing values.
Returns
-------
bool or array-like of bool
For scalar input, returns a scalar boolean.
For array input, returns an array of boolean indicating whether each
corresponding element is missing.
See Also
--------
notna : boolean inverse of pandas.isna.
Series.isna : Detetct missing values in a Series.
DataFrame.isna : Detect missing values in a DataFrame.
Index.isna : Detect missing values in an Index.
Examples
--------
Scalar arguments (including strings) result in a scalar boolean.
>>> pd.isna('dog')
False
>>> pd.isna(np.nan)
True
ndarrays result in an ndarray of booleans.
>>> array = np.array([[1, np.nan, 3], [4, 5, np.nan]])
>>> array
array([[ 1., nan, 3.],
[ 4., 5., nan]])
>>> pd.isna(array)
array([[False, True, False],
[False, False, True]])
For indexes, an ndarray of booleans is returned.
>>> index = pd.DatetimeIndex(["2017-07-05", "2017-07-06", None,
... "2017-07-08"])
>>> index
DatetimeIndex(['2017-07-05', '2017-07-06', 'NaT', '2017-07-08'],
dtype='datetime64[ns]', freq=None)
>>> pd.isna(index)
array([False, False, True, False])
For Series and DataFrame, the same type is returned, containing booleans.
>>> df = pd.DataFrame([['ant', 'bee', 'cat'], ['dog', None, 'fly']])
>>> df
0 1 2
0 ant bee cat
1 dog None fly
>>> pd.isna(df)
0 1 2
0 False False False
1 False True False
>>> pd.isna(df[1])
0 False
1 True
Name: 1, dtype: bool
None
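The per-column missing-value count can also be obtained without a custom function, since DataFrame.isna returns a boolean table and sum adds up each column. The sketch below compares the two approaches on a made-up three-row table (an assumption for illustration).

```python
import numpy as np
import pandas as pd

# Hypothetical table: Age and Cabin each have two gaps, Fare has none.
df = pd.DataFrame({
    "Age":   [22.0, np.nan, np.nan],
    "Cabin": [None, "C85", None],
    "Fare":  [7.25, 71.28, 7.92],
})

# Built-in route: boolean table, then column-wise sum.
counts = df.isna().sum()

# Custom-function route, equivalent to null_count above.
via_apply = df.apply(lambda column: len(column[column.isna()]))

print(counts.to_dict())     # {'Age': 2, 'Cabin': 2, 'Fare': 0}
print(via_apply.to_dict())  # {'Age': 2, 'Cabin': 2, 'Fare': 0}
```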
3.9 Row-wise Iteration and Data Transformation
ages = titanic_survival['Age']
print(ages.head())
def which_class(row):
    pclass = row['Pclass']
    if pd.isnull(pclass):
        return 'Unknown'
    elif pclass == 1:
        return 'First Class'
    elif pclass == 2:
        return 'Second Class'
    else:
        return 'Third Class'
# With axis=1, apply calls the function once per row, i.e. it iterates over the rows
result = titanic_survival.apply(which_class, axis=1)
print(result.head())
def age_class(row):
    age = row['Age']
    if pd.isna(age):
        return 'Unknown'
    elif age < 18:
        return '年轻人'  # young (under 18)
    elif age < 40:
        return '中年人'  # middle-aged (18 to 39)
    else:
        return '老年人'  # older (40 and above)
age_lable = titanic_survival.apply(age_class, axis=1)
print(age_lable.tail())
0 22.0
1 38.0
2 26.0
3 35.0
4 35.0
Name: Age, dtype: float64
0 Third Class
1 First Class
2 Third Class
3 First Class
4 Third Class
dtype: object
886 中年人
887 中年人
888 Unknown
889 中年人
890 中年人
dtype: object
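Row-wise apply runs a Python function once per row. For simple labelling like this, a vectorized alternative is possible. The sketch below is one such variant on made-up data, not the author's method: Series.map translates class numbers through a dict, and pd.cut with right=False reproduces the age < 18 and age < 40 thresholds. One difference to note: with map, any class number missing from the dict becomes 'Unknown' rather than 'Third Class'.

```python
import numpy as np
import pandas as pd

# Hypothetical passengers; the second one has no recorded age.
df = pd.DataFrame({"Pclass": [3, 1, 2], "Age": [22.0, np.nan, 54.0]})

# Translate class numbers through a lookup table.
class_names = {1: "First Class", 2: "Second Class", 3: "Third Class"}
df["class_label"] = df["Pclass"].map(class_names).fillna("Unknown")

# Left-closed bins: [-inf, 18), [18, 40), [40, inf), same labels as age_class.
age_bins = pd.cut(
    df["Age"],
    bins=[-np.inf, 18, 40, np.inf],
    right=False,
    labels=["年轻人", "中年人", "老年人"],
)
# Missing ages stay NaN after cut; convert to plain objects and label them.
df["age_label"] = age_bins.astype(object).fillna("Unknown")

print(df["class_label"].tolist())  # ['Third Class', 'First Class', 'Second Class']
print(df["age_label"].tolist())    # ['中年人', 'Unknown', '老年人']
```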
3.10 Using Grouping to Explore Relationships in the Data
# Add a new column to the DataFrame
titanic_survival['age_label'] = age_lable
result = titanic_survival.pivot_table(index='age_label', values='Survived')
print(result)
Survived
age_label
Unknown 0.293785
中年人 0.383562
年轻人 0.539823
老年人 0.374233
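A group mean is easier to judge next to the size of the group it was computed from. As a minimal sketch, assuming a made-up four-row table that reuses the label values above, groupby with agg returns the survival rate and the row count side by side.

```python
import pandas as pd

# Hypothetical labelled passengers.
df = pd.DataFrame({
    "age_label": ["年轻人", "年轻人", "中年人", "Unknown"],
    "Survived":  [1, 0, 1, 0],
})

# One row per label: mean survival rate plus the number of passengers.
summary = df.groupby("age_label")["Survived"].agg(["mean", "count"])

print(summary.loc["年轻人", "mean"])   # 0.5
print(summary.loc["年轻人", "count"])  # 2
print(summary.loc["Unknown", "count"]) # 1
```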