Pandas分类

版权声明:本文为博主原创文章,未经博主允许不得转载。 https://blog.csdn.net/QianYanDai/article/details/78178637

Pandas分类

  • categorical data是指分类数据:数据类型为:男女、班级(一班、二班)、省份(河北、江苏等),若使用赋值法给变量赋值,例如(男=1,女=0),数字1,0之间没有大小之分,不能认为1是比0大的。
  • numerical data是指数值型数据:收入(1000元,500元),是可以进行比较大小并进行运算的数据。

从0.15版本开始,pandas可以在DataFrame中支持Categorical类型的数据,

Pandas可以在DataFrame中包含分类数据

import pandas as pd
import numpy as np

df = pd.DataFrame({"id":[1,2,3,4,5,6], "raw_grade":['a', 'b', 'b', 'a', 'a', 'e']})
df
id raw_grade
0 1 a
1 2 b
2 3 b
3 4 a
4 5 a
5 6 e
df["raw_grade"]
0    a
1    b
2    b
3    a
4    a
5    e
Name: raw_grade, dtype: object
# 1 将原始grade成绩转换为分类数据
df["grade"] = df["raw_grade"].astype("category")
df["grade"]
0    a
1    b
2    b
3    a
4    a
5    e
Name: grade, dtype: category
Categories (3, object): [a, b, e]
# 2.重命名分类数据为更有意义的名称:
df["grade"].cat.categories = ["very good", "good", "very bad"]
df
id raw_grade grade
0 1 a very good
1 2 b good
2 3 b good
3 4 a very good
4 5 a very good
5 6 e very bad
# 3.对类别进行重新排序,增加缺失的类别:
df["grade"] = df["grade"].cat.set_categories(["very bad", "bad", "medium", "good", "very good"])
df["grade"]
0    very good
1         good
2         good
3    very good
4    very good
5     very bad
Name: grade, dtype: category
Categories (5, object): [very bad, bad, medium, good, very good]
# 4.按整理后的类别排序(并非词汇的顺序)
df.sort_values(by="grade")
id raw_grade grade
5 6 e very bad
1 2 b good
2 3 b good
0 1 a very good
3 4 a very good
4 5 a very good
# 5.按类别分组也包括空类别:
df.groupby("grade").size()
grade
very bad     1
bad          0
medium       0
good         2
very good    3
dtype: int64

猜你喜欢

转载自blog.csdn.net/QianYanDai/article/details/78178637