Assessing data in Python

Data is assessed along two dimensions: quality (problems with the content) and tidiness (problems with the structure).

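(1) Dirty data: data that is inaccurate, corrupted, or duplicated.
(2) Messy data: data whose structure is untidy. In tidy data, each variable is a column and each observation is a row.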
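
A minimal sketch of checking for both kinds of problem. The two small DataFrames, their column names, and their values are made up for illustration and do not come from the original dataset.

```python
import pandas as pd

# Hypothetical toy data with quality problems: one fully duplicated row,
# one missing value, and one implausible age.
df = pd.DataFrame({
    "name": ["Ann", "Bob", "Bob", "Cid"],
    "age": [34, 41, 41, -5],
    "city": ["Oslo", "Rome", "Rome", None],
})

n_duplicates = df.duplicated().sum()   # rows identical to an earlier row
n_missing = df.isnull().sum().sum()    # missing cells across all columns
n_bad_age = (df["age"] < 0).sum()      # values outside a plausible range
print(n_duplicates, n_missing, n_bad_age)

# Hypothetical tidiness problem: the years are spread across column headers.
wide = pd.DataFrame({"country": ["A", "B"], "2019": [1, 2], "2020": [3, 4]})

# melt reshapes it so that each row is one (country, year) observation.
tidy = wide.melt(id_vars="country", var_name="year", value_name="value")
print(tidy)
```

Which checks count as "dirty" depends on the dataset; the negative-age rule here is only one example of a plausibility check.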

These are the programmatic assessment methods you will use most often in pandas:

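.head() returns the first 5 rows by default; it works on a DataFrame or on a single column (Series)
.tail() returns the last 5 rows by default
.sample() returns 1 random row by default
.info() (DataFrame; also available on Series since pandas 1.4)
.describe() (DataFrame and Series)
For numeric data, describe reports the count, mean, standard deviation, minimum, maximum, and the lower, 50th, and upper percentiles. By default the lower percentile is 25 and the upper percentile is 75. The 50th percentile is the same as the median.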
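
The methods listed above can be tried out on a small frame. This is only a sketch: the DataFrame below is hypothetical and stands in for whatever data you are assessing.

```python
import pandas as pd

# Hypothetical data: one numeric column and one text column.
df = pd.DataFrame({"a": range(10), "b": list("abcdefghij")})

first = df.head()          # first 5 rows by default
last = df.tail(3)          # pass n to change the number of rows
one = df.sample()          # 1 randomly chosen row by default
col_head = df["a"].head()  # head also works on a single column (Series)

df.info()                  # prints column names, non-null counts, dtypes

stats = df.describe()      # only the numeric column "a" is summarized
print(stats)
```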

train_df['Parch'].describe(percentiles=[.75, .8])
Out[10]: 
count    891.000000
mean       0.381594
std        0.806057
min        0.000000
50%        0.000000
75%        0.000000
80%        1.000000
max        6.000000
Name: Parch, dtype: float64

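describe takes three main parameters: percentiles, include, and exclude.
percentiles: the percentiles to report for numeric columns. The default is [.25, .5, .75], i.e. the values below which 25%, 50%, and 75% of the data fall. In the output above, the 50% row appears even though only .75 and .8 were requested.
include: the dtypes to summarize. By default only numeric columns are summarized.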

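# 'O' is the capital letter O (object dtype), not the digit 0; this summarizes the non-numeric columns
patients.describe(include='O')
# summarize all columns
patients.describe(include='all')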

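exclude: the dtypes to leave out of the summary; by default nothing is excluded.
.value_counts() (Series; also available on DataFrame since pandas 1.1)
Various methods of indexing and selecting data (.loc and bracket notation with or without boolean indexing, plus .iloc)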
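
As an illustration of value_counts, here is a short sketch on a hypothetical Series. It assumes you want to see how often each value occurs and whether missing values are hiding in the column.

```python
import pandas as pd

# Hypothetical column with a repeated value and one missing entry.
s = pd.Series(["x", "y", "x", None, "x"])

counts = s.value_counts()                # missing values are dropped by default
with_nan = s.value_counts(dropna=False)  # count missing values as well
shares = s.value_counts(normalize=True)  # proportions instead of counts

print(counts)
print(with_nan)
print(shares)
```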

loc selects rows (and optionally columns) by label, while iloc selects them by integer position.
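
The selection styles mentioned above can be compared side by side. This is a sketch on a hypothetical DataFrame; the index labels and column names are made up.

```python
import pandas as pd

# Hypothetical data with string row labels.
df = pd.DataFrame(
    {"age": [25, 40, 31], "city": ["Oslo", "Rome", "Lima"]},
    index=["ann", "bob", "cid"],
)

by_label = df.loc["bob", "age"]     # row label "bob", column "age"
by_position = df.iloc[0, 1]         # first row, second column
label_slice = df.loc["ann":"bob"]   # label slices include both endpoints
pos_slice = df.iloc[0:1]            # position slices exclude the end

over_30 = df[df["age"] > 30]                     # boolean mask in brackets
cities_over_30 = df.loc[df["age"] > 30, "city"]  # mask plus column label

print(by_label, by_position)
print(over_30)
```

Note the difference between the two slices: the label slice returns two rows, the position slice returns one.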

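dataframe.shape
Returns the dimensions as a (rows, columns) tuple, e.g. (2, 3), where shape[0] is the number of rows and shape[1] is the number of columns. shape is an attribute, not a method, so it is written without parentheses.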
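
A small sketch of reading the shape, using a hypothetical frame with 2 rows and 3 columns:

```python
import pandas as pd

# Hypothetical data: 2 rows, 3 columns.
df = pd.DataFrame({"a": [1, 2], "b": [3, 4], "c": [5, 6]})

n_rows, n_cols = df.shape      # attribute access, no parentheses
series_shape = df["a"].shape   # a Series has a one-element shape tuple

print(df.shape, n_rows, n_cols, series_shape)
```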

Use dtypes to inspect the column types of a DataFrame
print(df.dtypes)

Use astype to convert the type of a DataFrame column
df['col2'] = df['col2'].astype('float64')
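
Put together, checking and converting a column type might look like the sketch below. The DataFrame is hypothetical, with col2 stored as strings. The pd.to_numeric call at the end is an alternative that does not appear in the original post; it is shown because astype raises an error when a value cannot be parsed.

```python
import pandas as pd

# Hypothetical data: col2 holds numbers stored as strings.
df = pd.DataFrame({"col1": [1, 2, 3], "col2": ["1.5", "2.0", "3.25"]})
print(df.dtypes)   # col2 is not numeric yet

df["col2"] = df["col2"].astype("float64")
print(df.dtypes)   # col2 is now float64

# Alternative for dirty input: unparseable strings become NaN
# instead of raising an error.
coerced = pd.to_numeric(pd.Series(["1", "oops"]), errors="coerce")
print(coerced)
```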

JackLi_csdn
