Task2–数据分析

EDA目标

EDA的价值在于熟悉数据集，了解数据集
了解数据集之后更好地进行特征工程，使预测的问题更加可靠
完成对于数据的探索性分析，并对于数据进行一些图表或者文字总结并打卡

2.1：载入各种数据科学以及可视化库

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import missingno as msno # 缺失值可视化处理包
# 导入warnings包，利用过滤器来实现忽略警告语句
import warnings
warnings.filterwarnings('ignore')

2.2：载入数据

Train_data = pd.read_csv('/Users/admin/Desktop/used_car_train_20200313.csv', sep=' ')
Test_data = pd.read_csv('/Users/admin/Desktop/used_car_testA_20200313.csv', sep=' ')

# 简略检查数据
Train_data.head().append(Train_data.tail())

	SaleID	name	regDate	model	brand	bodyType	fuelType	gearbox	power	kilometer	...	v_5	v_6	v_7	v_8	v_9	v_10	v_11	v_12	v_13	v_14
0	0	736	20040402	30.0	6	1.0	0.0	0.0	60	12.5	...	0.235676	0.101988	0.129549	0.022816	0.097462	-2.881803	2.804097	-2.420821	0.795292	0.914762
1	1	2262	20030301	40.0	1	2.0	0.0	0.0	0	15.0	...	0.264777	0.121004	0.135731	0.026597	0.020582	-4.900482	2.096338	-1.030483	-1.722674	0.245522
2	2	14874	20040403	115.0	15	1.0	0.0	0.0	163	12.5	...	0.251410	0.114912	0.165147	0.062173	0.027075	-4.846749	1.803559	1.565330	-0.832687	-0.229963
3	3	71865	19960908	109.0	10	0.0	0.0	1.0	193	15.0	...	0.274293	0.110300	0.121964	0.033395	0.000000	-4.509599	1.285940	-0.501868	-2.438353	-0.478699
4	4	111080	20120103	110.0	5	1.0	0.0	0.0	68	5.0	...	0.228036	0.073205	0.091880	0.078819	0.121534	-1.896240	0.910783	0.931110	2.834518	1.923482
149995	149995	163978	20000607	121.0	10	4.0	0.0	1.0	163	15.0	...	0.280264	0.000310	0.048441	0.071158	0.019174	1.988114	-2.983973	0.589167	-1.304370	-0.302592
149996	149996	184535	20091102	116.0	11	0.0	0.0	0.0	125	10.0	...	0.253217	0.000777	0.084079	0.099681	0.079371	1.839166	-2.774615	2.553994	0.924196	-0.272160
149997	149997	147587	20101003	60.0	11	1.0	1.0	0.0	90	6.0	...	0.233353	0.000705	0.118872	0.100118	0.097914	2.439812	-1.630677	2.290197	1.891922	0.414931
149998	149998	45907	20060312	34.0	10	3.0	1.0	0.0	156	15.0	...	0.256369	0.000252	0.081479	0.083558	0.081498	2.075380	-2.633719	1.414937	0.431981	-1.659014
149999	149999	177672	19990204	19.0	28	6.0	0.0	1.0	193	12.5	...	0.284475	0.000000	0.040072	0.062543	0.025819	1.978453	-3.179913	0.031724	-1.483350	-0.342674

10 rows × 31 columns

Train_data.shape

(150000, 31)

Test_data.head().append(Test_data.tail())

	SaleID	name	regDate	model	brand	bodyType	fuelType	gearbox	power	kilometer	...	v_5	v_6	v_7	v_8	v_9	v_10	v_11	v_12	v_13	v_14
0	150000	66932	20111212	222.0	4	5.0	1.0	1.0	313	15.0	...	0.264405	0.121800	0.070899	0.106558	0.078867	-7.050969	-0.854626	4.800151	0.620011	-3.664654
1	150001	174960	19990211	19.0	21	0.0	0.0	0.0	75	12.5	...	0.261745	0.000000	0.096733	0.013705	0.052383	3.679418	-0.729039	-3.796107	-1.541230	-0.757055
2	150002	5356	20090304	82.0	21	0.0	0.0	0.0	109	7.0	...	0.260216	0.112081	0.078082	0.062078	0.050540	-4.926690	1.001106	0.826562	0.138226	0.754033
3	150003	50688	20100405	0.0	0	0.0	0.0	1.0	160	7.0	...	0.260466	0.106727	0.081146	0.075971	0.048268	-4.864637	0.505493	1.870379	0.366038	1.312775
4	150004	161428	19970703	26.0	14	2.0	0.0	0.0	75	15.0	...	0.250999	0.000000	0.077806	0.028600	0.081709	3.616475	-0.673236	-3.197685	-0.025678	-0.101290
49995	199995	20903	19960503	4.0	4	4.0	0.0	0.0	116	15.0	...	0.284664	0.130044	0.049833	0.028807	0.004616	-5.978511	1.303174	-1.207191	-1.981240	-0.357695
49996	199996	708	19991011	0.0	0	0.0	0.0	0.0	75	15.0	...	0.268101	0.108095	0.066039	0.025468	0.025971	-3.913825	1.759524	-2.075658	-1.154847	0.169073
49997	199997	6693	20040412	49.0	1	0.0	1.0	1.0	224	15.0	...	0.269432	0.105724	0.117652	0.057479	0.015669	-4.639065	0.654713	1.137756	-1.390531	0.254420
49998	199998	96900	20020008	27.0	1	0.0	0.0	1.0	334	15.0	...	0.261152	0.000490	0.137366	0.086216	0.051383	1.833504	-2.828687	2.465630	-0.911682	-2.057353
49999	199999	193384	20041109	166.0	6	1.0	NaN	1.0	68	9.0	...	0.228730	0.000300	0.103534	0.080625	0.124264	2.914571	-1.135270	0.547628	2.094057	-1.552150

10 rows × 30 columns

Test_data.shape

(50000, 30)

2.3：数据总览

Train_data.describe()

	SaleID	name	regDate	model	brand	bodyType	fuelType	gearbox	power	kilometer	...	v_5	v_6	v_7	v_8	v_9	v_10	v_11	v_12	v_13	v_14
count	150000.000000	150000.000000	1.500000e+05	149999.000000	150000.000000	145494.000000	141320.000000	144019.000000	150000.000000	150000.000000	...	150000.000000	150000.000000	150000.000000	150000.000000	150000.000000	150000.000000	150000.000000	150000.000000	150000.000000	150000.000000
mean	74999.500000	68349.172873	2.003417e+07	47.129021	8.052733	1.792369	0.375842	0.224943	119.316547	12.597160	...	0.248204	0.044923	0.124692	0.058144	0.061996	-0.001000	0.009035	0.004813	0.000313	-0.000688
std	43301.414527	61103.875095	5.364988e+04	49.536040	7.864956	1.760640	0.548677	0.417546	177.168419	3.919576	...	0.045804	0.051743	0.201410	0.029186	0.035692	3.772386	3.286071	2.517478	1.288988	1.038685
min	0.000000	0.000000	1.991000e+07	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.500000	...	0.000000	0.000000	0.000000	0.000000	0.000000	-9.168192	-5.558207	-9.639552	-4.153899	-6.546556
25%	37499.750000	11156.000000	1.999091e+07	10.000000	1.000000	0.000000	0.000000	0.000000	75.000000	12.500000	...	0.243615	0.000038	0.062474	0.035334	0.033930	-3.722303	-1.951543	-1.871846	-1.057789	-0.437034
50%	74999.500000	51638.000000	2.003091e+07	30.000000	6.000000	1.000000	0.000000	0.000000	110.000000	15.000000	...	0.257798	0.000812	0.095866	0.057014	0.058484	1.624076	-0.358053	-0.130753	-0.036245	0.141246
75%	112499.250000	118841.250000	2.007111e+07	66.000000	13.000000	3.000000	1.000000	0.000000	150.000000	15.000000	...	0.265297	0.102009	0.125243	0.079382	0.087491	2.844357	1.255022	1.776933	0.942813	0.680378
max	149999.000000	196812.000000	2.015121e+07	247.000000	39.000000	7.000000	6.000000	1.000000	19312.000000	15.000000	...	0.291838	0.151420	1.404936	0.160791	0.222787	12.357011	18.819042	13.847792	11.147669	8.658418

8 rows × 30 columns

Test_data.describe()

	SaleID	name	regDate	model	brand	bodyType	fuelType	gearbox	power	kilometer	...	v_5	v_6	v_7	v_8	v_9	v_10	v_11	v_12	v_13	v_14
count	50000.000000	50000.000000	5.000000e+04	50000.000000	50000.000000	48587.000000	47107.000000	48090.000000	50000.000000	50000.000000	...	50000.000000	50000.000000	50000.000000	50000.000000	50000.000000	50000.000000	50000.000000	50000.000000	50000.000000	50000.000000
mean	174999.500000	68542.223280	2.003393e+07	46.844520	8.056240	1.782185	0.373405	0.224350	119.883620	12.595580	...	0.248669	0.045021	0.122744	0.057997	0.062000	-0.017855	-0.013742	-0.013554	-0.003147	0.001516
std	14433.901067	61052.808133	5.368870e+04	49.469548	7.819477	1.760736	0.546442	0.417158	185.097387	3.908979	...	0.044601	0.051766	0.195972	0.029211	0.035653	3.747985	3.231258	2.515962	1.286597	1.027360
min	150000.000000	0.000000	1.991000e+07	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.500000	...	0.000000	0.000000	0.000000	0.000000	0.000000	-9.160049	-5.411964	-8.916949	-4.123333	-6.112667
25%	162499.750000	11203.500000	1.999091e+07	10.000000	1.000000	0.000000	0.000000	0.000000	75.000000	12.500000	...	0.243762	0.000044	0.062644	0.035084	0.033714	-3.700121	-1.971325	-1.876703	-1.060428	-0.437920
50%	174999.500000	52248.500000	2.003091e+07	29.000000	6.000000	1.000000	0.000000	0.000000	109.000000	15.000000	...	0.257877	0.000815	0.095828	0.057084	0.058764	1.613212	-0.355843	-0.142779	-0.035956	0.138799
75%	187499.250000	118856.500000	2.007110e+07	65.000000	13.000000	3.000000	1.000000	0.000000	150.000000	15.000000	...	0.265328	0.102025	0.125438	0.079077	0.087489	2.832708	1.262914	1.764335	0.941469	0.681163
max	199999.000000	196805.000000	2.015121e+07	246.000000	39.000000	7.000000	6.000000	1.000000	20000.000000	15.000000	...	0.291618	0.153265	1.358813	0.156355	0.214775	12.338872	18.856218	12.950498	5.913273	2.624622

8 rows × 29 columns

Train_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150000 entries, 0 to 149999
Data columns (total 31 columns):
 #   Column             Non-Null Count   Dtype  
---  ------             --------------   -----  
 0   SaleID             150000 non-null  int64  
 1   name               150000 non-null  int64  
 2   regDate            150000 non-null  int64  
 3   model              149999 non-null  float64
 4   brand              150000 non-null  int64  
 5   bodyType           145494 non-null  float64
 6   fuelType           141320 non-null  float64
 7   gearbox            144019 non-null  float64
 8   power              150000 non-null  int64  
 9   kilometer          150000 non-null  float64
 10  notRepairedDamage  150000 non-null  object 
 11  regionCode         150000 non-null  int64  
 12  seller             150000 non-null  int64  
 13  offerType          150000 non-null  int64  
 14  creatDate          150000 non-null  int64  
 15  price              150000 non-null  int64  
 16  v_0                150000 non-null  float64
 17  v_1                150000 non-null  float64
 18  v_2                150000 non-null  float64
 19  v_3                150000 non-null  float64
 20  v_4                150000 non-null  float64
 21  v_5                150000 non-null  float64
 22  v_6                150000 non-null  float64
 23  v_7                150000 non-null  float64
 24  v_8                150000 non-null  float64
 25  v_9                150000 non-null  float64
 26  v_10               150000 non-null  float64
 27  v_11               150000 non-null  float64
 28  v_12               150000 non-null  float64
 29  v_13               150000 non-null  float64
 30  v_14               150000 non-null  float64
dtypes: float64(20), int64(10), object(1)
memory usage: 35.5+ MB

Test_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 30 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   SaleID             50000 non-null  int64  
 1   name               50000 non-null  int64  
 2   regDate            50000 non-null  int64  
 3   model              50000 non-null  float64
 4   brand              50000 non-null  int64  
 5   bodyType           48587 non-null  float64
 6   fuelType           47107 non-null  float64
 7   gearbox            48090 non-null  float64
 8   power              50000 non-null  int64  
 9   kilometer          50000 non-null  float64
 10  notRepairedDamage  50000 non-null  object 
 11  regionCode         50000 non-null  int64  
 12  seller             50000 non-null  int64  
 13  offerType          50000 non-null  int64  
 14  creatDate          50000 non-null  int64  
 15  v_0                50000 non-null  float64
 16  v_1                50000 non-null  float64
 17  v_2                50000 non-null  float64
 18  v_3                50000 non-null  float64
 19  v_4                50000 non-null  float64
 20  v_5                50000 non-null  float64
 21  v_6                50000 non-null  float64
 22  v_7                50000 non-null  float64
 23  v_8                50000 non-null  float64
 24  v_9                50000 non-null  float64
 25  v_10               50000 non-null  float64
 26  v_11               50000 non-null  float64
 27  v_12               50000 non-null  float64
 28  v_13               50000 non-null  float64
 29  v_14               50000 non-null  float64
dtypes: float64(20), int64(9), object(1)
memory usage: 11.4+ MB

2.4：判断数据缺失和异常

2.4.1：缺失值

Train_data.isnull().sum()

SaleID                  0
name                    0
regDate                 0
model                   1
brand                   0
bodyType             4506
fuelType             8680
gearbox              5981
power                   0
kilometer               0
notRepairedDamage       0
regionCode              0
seller                  0
offerType               0
creatDate               0
price                   0
v_0                     0
v_1                     0
v_2                     0
v_3                     0
v_4                     0
v_5                     0
v_6                     0
v_7                     0
v_8                     0
v_9                     0
v_10                    0
v_11                    0
v_12                    0
v_13                    0
v_14                    0
dtype: int64

# nan可视化
missing = Train_data.isnull().sum()
missing = missing[missing > 0]
missing.sort_values(inplace=True)
missing.plot.bar()

<matplotlib.axes._subplots.AxesSubplot at 0x245a26c4ef0>

# 可视化看下缺省值
msno.bar(Train_data.sample(1000))

<matplotlib.axes._subplots.AxesSubplot at 0x245a26c4ef0>

Test_data.isnull().sum()

SaleID                  0
name                    0
regDate                 0
model                   0
brand                   0
bodyType             1413
fuelType             2893
gearbox              1910
power                   0
kilometer               0
notRepairedDamage       0
regionCode              0
seller                  0
offerType               0
creatDate               0
v_0                     0
v_1                     0
v_2                     0
v_3                     0
v_4                     0
v_5                     0
v_6                     0
v_7                     0
v_8                     0
v_9                     0
v_10                    0
v_11                    0
v_12                    0
v_13                    0
v_14                    0
dtype: int64

missing = Test_data.isnull().sum()
missing = missing[missing > 0]
missing.sort_values(inplace=True)
missing.plot.bar()

<matplotlib.axes._subplots.AxesSubplot at 0x245a2b10d68>

msno.bar(Test_data.sample(1000))

<matplotlib.axes._subplots.AxesSubplot at 0x245a2b10d68>

2.4.2：异常值

Train_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150000 entries, 0 to 149999
Data columns (total 31 columns):
 #   Column             Non-Null Count   Dtype  
---  ------             --------------   -----  
 0   SaleID             150000 non-null  int64  
 1   name               150000 non-null  int64  
 2   regDate            150000 non-null  int64  
 3   model              149999 non-null  float64
 4   brand              150000 non-null  int64  
 5   bodyType           145494 non-null  float64
 6   fuelType           141320 non-null  float64
 7   gearbox            144019 non-null  float64
 8   power              150000 non-null  int64  
 9   kilometer          150000 non-null  float64
 10  notRepairedDamage  150000 non-null  object 
 11  regionCode         150000 non-null  int64  
 12  seller             150000 non-null  int64  
 13  offerType          150000 non-null  int64  
 14  creatDate          150000 non-null  int64  
 15  price              150000 non-null  int64  
 16  v_0                150000 non-null  float64
 17  v_1                150000 non-null  float64
 18  v_2                150000 non-null  float64
 19  v_3                150000 non-null  float64
 20  v_4                150000 non-null  float64
 21  v_5                150000 non-null  float64
 22  v_6                150000 non-null  float64
 23  v_7                150000 non-null  float64
 24  v_8                150000 non-null  float64
 25  v_9                150000 non-null  float64
 26  v_10               150000 non-null  float64
 27  v_11               150000 non-null  float64
 28  v_12               150000 non-null  float64
 29  v_13               150000 non-null  float64
 30  v_14               150000 non-null  float64
dtypes: float64(20), int64(10), object(1)
memory usage: 35.5+ MB

Train_data['notRepairedDamage'].value_counts()

0.0    111361
-       24324
1.0     14315
Name: notRepairedDamage, dtype: int64

Train_data['notRepairedDamage'].replace('-', np.nan, inplace=True)

Train_data['notRepairedDamage'].replace('-', np.nan, inplace=True)

Train_data['notRepairedDamage'].replace('-', np.nan, inplace=True)

2.5：了解预测值的分布

Train_data['price']

0         1850
1         3600
2         6222
3         2400
4         5200
          ... 
149995    5900
149996    9500
149997    7500
149998    4999
149999    4700
Name: price, Length: 150000, dtype: int64

Train_data['price'].value_counts()

500      2337
1500     2158
1200     1922
1000     1850
2500     1821
         ... 
25321       1
8886        1
8801        1
37920       1
8188        1
Name: price, Length: 3763, dtype: int64

## 1) 总体分布概况（无界约翰逊分布等）
import scipy.stats as st
y = Train_data['price']
plt.figure(1); plt.title('Johnson SU')
sns.distplot(y, kde=False, fit=st.johnsonsu)
plt.figure(2); plt.title('Normal')
sns.distplot(y, kde=False, fit=st.norm)
plt.figure(3); plt.title('Log Normal')
sns.distplot(y, kde=False, fit=st.lognorm)

<matplotlib.axes._subplots.AxesSubplot at 0x245a2d78a90>

## 2) 查看skewness and kurtosis
sns.distplot(Train_data['price']);
print("Skewness: %f" % Train_data['price'].skew())
print("Kurtosis: %f" % Train_data['price'].kurt())

Skewness: 3.346487
Kurtosis: 18.995183

sns.distplot(Train_data.kurt(),color='orange',axlabel ='Kurtness')

<matplotlib.axes._subplots.AxesSubplot at 0x245a2d78a90>

# log变换 z之后的分布较均匀，可以进行log变换进行预测，这也是预测问题常用的trick
plt.hist(np.log(Train_data['price']), orientation = 'vertical',histtype = 'bar', color ='red') 
plt.show()

[外链图片转存失败,源站可能有防盗链机制,建议将图片保存下来直接上传(img-w6mWCoZv-1584979187920)(output_36_0.png)]

[外链图片转存失败,源站可能有防盗链机制,建议将图片保存下来直接上传(img-8GOAmLOZ-1584979187922)(output_36_1.png)]

[外链图片转存失败,源站可能有防盗链机制,建议将图片保存下来直接上传(img-u1bpR7Ce-1584979187923)(output_36_2.png)]

qq_36317238

发布了1 篇原创文章 · 获赞 0 · 访问量 4

私信关注

Task2--数据分析

Task2–数据分析

EDA目标

2.1：载入各种数据科学以及可视化库

2.2：载入数据

2.3：数据总览

2.4：判断数据缺失和异常

2.4.1：缺失值

2.4.2：异常值

2.5：了解预测值的分布

猜你喜欢

Task2--数据分析

Task2–数据分析

EDA目标

2.1： 载入各种数据科学以及可视化库

2.2：载入数据

2.3：数据总览

2.4：判断数据缺失和异常

2.4.1：缺失值

2.4.2：异常值

2.5：了解预测值的分布

猜你喜欢

2.1：载入各种数据科学以及可视化库