As briefly introduced earlier, predicting a client's default risk is a supervised learning task: we classify applicants into those who should be granted a loan and those who should not. For each applicant the model outputs a probability of default between 0 and 1, where 0 means the applicant repays on time and 1 means the applicant is very likely to miss payments and default.
Initial Data Exploration
The data comes from Home Credit.
There are 8 different data files:
- application_train.csv
- application_test.csv
- bureau.csv
- bureau_balance.csv
- previous_application.csv
- POS_CASH_balance.csv
- credit_card_balance.csv
- installments_payments.csv
Python libraries used: numpy, pandas, sklearn, matplotlib, seaborn
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder
import warnings
warnings.filterwarnings('ignore')
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
Training data
train_data = pd.read_csv('data/application_train.csv')
print("样本数量:",train_data.shape[0])
print("特征数量(含TARGET[label]):",train_data.shape[1])
样本数量: 307511
特征数量(含TARGET[label]): 122
train_data.head()
SK_ID_CURR | TARGET | NAME_CONTRACT_TYPE | CODE_GENDER | FLAG_OWN_CAR | FLAG_OWN_REALTY | CNT_CHILDREN | AMT_INCOME_TOTAL | AMT_CREDIT | AMT_ANNUITY | … | FLAG_DOCUMENT_18 | FLAG_DOCUMENT_19 | FLAG_DOCUMENT_20 | FLAG_DOCUMENT_21 | AMT_REQ_CREDIT_BUREAU_HOUR | AMT_REQ_CREDIT_BUREAU_DAY | AMT_REQ_CREDIT_BUREAU_WEEK | AMT_REQ_CREDIT_BUREAU_MON | AMT_REQ_CREDIT_BUREAU_QRT | AMT_REQ_CREDIT_BUREAU_YEAR | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 100002 | 1 | Cash loans | M | N | Y | 0 | 202500.0 | 406597.5 | 24700.5 | … | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
1 | 100003 | 0 | Cash loans | F | N | N | 0 | 270000.0 | 1293502.5 | 35698.5 | … | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
2 | 100004 | 0 | Revolving loans | M | Y | Y | 0 | 67500.0 | 135000.0 | 6750.0 | … | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
3 | 100006 | 0 | Cash loans | F | N | Y | 0 | 135000.0 | 312682.5 | 29686.5 | … | 0 | 0 | 0 | 0 | NaN | NaN | NaN | NaN | NaN | NaN |
4 | 100007 | 0 | Cash loans | M | N | Y | 0 | 121500.0 | 513000.0 | 21865.5 | … | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
5 rows × 122 columns
Test data features (no TARGET)
test_data = pd.read_csv('data/application_test.csv')
print("测试样本数量:",test_data.shape[0])
print("测试样本特征(不含TARGET):",test_data.shape[1])
测试样本数量: 48744 测试样本特征(不含TARGET): 121
test_data.head()
SK_ID_CURR | NAME_CONTRACT_TYPE | CODE_GENDER | FLAG_OWN_CAR | FLAG_OWN_REALTY | CNT_CHILDREN | AMT_INCOME_TOTAL | AMT_CREDIT | AMT_ANNUITY | AMT_GOODS_PRICE | … | FLAG_DOCUMENT_18 | FLAG_DOCUMENT_19 | FLAG_DOCUMENT_20 | FLAG_DOCUMENT_21 | AMT_REQ_CREDIT_BUREAU_HOUR | AMT_REQ_CREDIT_BUREAU_DAY | AMT_REQ_CREDIT_BUREAU_WEEK | AMT_REQ_CREDIT_BUREAU_MON | AMT_REQ_CREDIT_BUREAU_QRT | AMT_REQ_CREDIT_BUREAU_YEAR | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 100001 | Cash loans | F | N | Y | 0 | 135000.0 | 568800.0 | 20560.5 | 450000.0 | … | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
1 | 100005 | Cash loans | M | N | Y | 0 | 99000.0 | 222768.0 | 17370.0 | 180000.0 | … | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 3.0 |
2 | 100013 | Cash loans | M | Y | Y | 0 | 202500.0 | 663264.0 | 69777.0 | 630000.0 | … | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 4.0 |
3 | 100028 | Cash loans | F | N | Y | 2 | 315000.0 | 1575000.0 | 49018.5 | 1575000.0 | … | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 3.0 |
4 | 100038 | Cash loans | M | Y | N | 1 | 180000.0 | 625500.0 | 32067.0 | 625500.0 | … | 0 | 0 | 0 | 0 | NaN | NaN | NaN | NaN | NaN | NaN |
5 rows × 121 columns
Distribution of the TARGET
# TARGET 0: repaid on time (no default); 1: did not repay on time
train_data['TARGET'].value_counts()
0 282686
1 24825
Name: TARGET, dtype: int64
train_data['TARGET'].astype(int).plot.hist()
In the training data, clients who cannot repay on time number roughly one tenth of those who can — a typical class-imbalance problem.
Three strategies are commonly available for class imbalance:
- Undersample the majority class. (Not advisable here: it would discard a large amount of data.)
- Oversample the minority class. (Also not advisable: duplicated samples tend to encourage overfitting.)
- Weight the classes according to their true observed frequencies to counteract the numeric imbalance (see the sketch below).
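A minimal sketch of the third strategy, assuming scikit-learn (the choice of logistic regression and the names X_train/y_train are illustrative, not part of the original analysis):

from sklearn.linear_model import LogisticRegression
# class_weight='balanced' reweights each class by n_samples / (n_classes * class_count),
# so the rare TARGET == 1 class contributes proportionally more to the loss.
clf = LogisticRegression(class_weight='balanced', max_iter=1000)
# clf.fit(X_train, y_train)  # X_train / y_train: preprocessed features and the TARGET column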
Examine Missing Values
count_miss_val = train_data.isnull().sum()
# Missing-value count for each feature
count_miss_val.head(10)
SK_ID_CURR 0
TARGET 0
NAME_CONTRACT_TYPE 0
CODE_GENDER 0
FLAG_OWN_CAR 0
FLAG_OWN_REALTY 0
CNT_CHILDREN 0
AMT_INCOME_TOTAL 0
AMT_CREDIT 0
AMT_ANNUITY 12
dtype: int64
miss_val_percent = 100 * count_miss_val / train_data.shape[0]
# Percentage of missing values per feature
miss_val_percent.head(10)
SK_ID_CURR 0.000000
TARGET 0.000000
NAME_CONTRACT_TYPE 0.000000
CODE_GENDER 0.000000
FLAG_OWN_CAR 0.000000
FLAG_OWN_REALTY 0.000000
CNT_CHILDREN 0.000000
AMT_INCOME_TOTAL 0.000000
AMT_CREDIT 0.000000
AMT_ANNUITY 0.003902
dtype: float64
# Combine the two Series above into one table
miss_val_table = pd.concat([count_miss_val,
miss_val_percent], axis = 1)
# Rename the columns
new_miss_val_table = miss_val_table.rename(columns = {0:"Count",
1:"% Percent"})
new_miss_val_table.head(10)
Count | % Percent | |
---|---|---|
SK_ID_CURR | 0 | 0.000000 |
TARGET | 0 | 0.000000 |
NAME_CONTRACT_TYPE | 0 | 0.000000 |
CODE_GENDER | 0 | 0.000000 |
FLAG_OWN_CAR | 0 | 0.000000 |
FLAG_OWN_REALTY | 0 | 0.000000 |
CNT_CHILDREN | 0 | 0.000000 |
AMT_INCOME_TOTAL | 0 | 0.000000 |
AMT_CREDIT | 0 | 0.000000 |
AMT_ANNUITY | 12 | 0.003902 |
# Keep only features with missing values, sorted by percentage in descending order
new_miss_val_table = new_miss_val_table[new_miss_val_table.iloc[:,1] != 0].sort_values(
"% Percent", ascending = False).round(1)
new_miss_val_table.head(20)
Count | % Percent | |
---|---|---|
COMMONAREA_MEDI | 214865 | 69.9 |
COMMONAREA_AVG | 214865 | 69.9 |
COMMONAREA_MODE | 214865 | 69.9 |
NONLIVINGAPARTMENTS_MEDI | 213514 | 69.4 |
NONLIVINGAPARTMENTS_MODE | 213514 | 69.4 |
NONLIVINGAPARTMENTS_AVG | 213514 | 69.4 |
FONDKAPREMONT_MODE | 210295 | 68.4 |
LIVINGAPARTMENTS_MODE | 210199 | 68.4 |
LIVINGAPARTMENTS_MEDI | 210199 | 68.4 |
LIVINGAPARTMENTS_AVG | 210199 | 68.4 |
FLOORSMIN_MODE | 208642 | 67.8 |
FLOORSMIN_MEDI | 208642 | 67.8 |
FLOORSMIN_AVG | 208642 | 67.8 |
YEARS_BUILD_MODE | 204488 | 66.5 |
YEARS_BUILD_MEDI | 204488 | 66.5 |
YEARS_BUILD_AVG | 204488 | 66.5 |
OWN_CAR_AGE | 202929 | 66.0 |
LANDAREA_AVG | 182590 | 59.4 |
LANDAREA_MEDI | 182590 | 59.4 |
LANDAREA_MODE | 182590 | 59.4 |
We are still exploring the data, so the missing values are left alone for now. When building machine learning models, we may need to impute them, depending on the algorithm; if the algorithm is insensitive to missing values, they can stay as they are. Another option is to simply drop the features with the highest missing ratios, but that risks discarding the most useful features. Since we cannot yet tell which features to keep and which to drop, we keep all of them.
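For later reference, a minimal sketch of one imputation option, assuming a recent scikit-learn (median imputation is just one reasonable default, and numeric_cols is an illustrative name, not defined in the original code):

from sklearn.impute import SimpleImputer
# Median-impute numeric columns: fit on the training data,
# then apply the same statistics to the test data.
imputer = SimpleImputer(strategy='median')
# train_data[numeric_cols] = imputer.fit_transform(train_data[numeric_cols])
# test_data[numeric_cols] = imputer.transform(test_data[numeric_cols])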
Examine Feature Types
Check each feature column's data type — continuous or categorical; object columns hold characters and strings.
train_data.dtypes.value_counts()
float64 65
int64 41
object 16
dtype: int64
# Number of distinct categories in each categorical (object) feature
train_data.select_dtypes(include = ['object']).apply(pd.Series.nunique, axis = 0)
NAME_CONTRACT_TYPE 2
CODE_GENDER 3
FLAG_OWN_CAR 2
FLAG_OWN_REALTY 2
NAME_TYPE_SUITE 7
NAME_INCOME_TYPE 8
NAME_EDUCATION_TYPE 5
NAME_FAMILY_STATUS 6
NAME_HOUSING_TYPE 6
OCCUPATION_TYPE 18
WEEKDAY_APPR_PROCESS_START 7
ORGANIZATION_TYPE 58
FONDKAPREMONT_MODE 4
HOUSETYPE_MODE 3
WALLSMATERIAL_MODE 7
EMERGENCYSTATE_MODE 2
dtype: int64
Many models (LightGBM being an exception) cannot use these string variables directly, so they must be re-encoded. Two methods are common (a toy comparison follows below):
- Label encoding: map each category of a feature to an integer.
- One-hot encoding: represent each category as a binary indicator vector.
A few caveats apply to both. Label encoding assigns each category an essentially arbitrary integer; a model consuming these values can be misled, because the numbers do not reflect any inherent weight of the categories and may sit far apart for no reason. Label encoding is therefore recommended only when a feature has exactly two categories; use one-hot encoding for more than two. One-hot encoding has its own drawback: when a feature has many categories, the dimensionality of the data explodes. In that case PCA or another dimensionality-reduction technique can compress the data.
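A toy comparison of the two encodings, assuming pandas and scikit-learn (the FLAG/COLOR columns are made up for illustration):

import pandas as pd
from sklearn.preprocessing import LabelEncoder

toy = pd.DataFrame({'FLAG': ['Y', 'N', 'Y'], 'COLOR': ['red', 'green', 'blue']})
# Label encoding: two categories collapse to 0/1 in a single column (N -> 0, Y -> 1).
toy['FLAG'] = LabelEncoder().fit_transform(toy['FLAG'])
# One-hot encoding: three categories expand to COLOR_blue / COLOR_green / COLOR_red.
toy = pd.get_dummies(toy, columns=['COLOR'])
print(toy)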
Label and One-Hot Encoding
Features with more than two categories get one-hot encoding; features with exactly two get label encoding.
# scikit-learn: LabelEncoder()
le = LabelEncoder()
count = 0  # number of label-encoded features
for col in train_data:
    if train_data[col].dtype == 'object':
        if len(list(train_data[col].unique())) <= 2:
            # Fit on the training column; this assumes the test set has no unseen categories
            le.fit(train_data[col])
            train_data[col] = le.transform(train_data[col])
            test_data[col] = le.transform(test_data[col])
            count = count + 1
print("%d columns were label encoded" % count)
3 columns were label encoded
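If the test set could contain a category unseen during training, le.transform(test_data[col]) above would raise an error. A slightly more defensive variant (a sketch, not the original code; in this notebook the loop above has already converted these columns) fits the encoder on the union of both sets:

# Variant of the loop above: fit on the combined train + test categories
# so transform() never meets an unseen label.
for col in train_data.columns:
    if train_data[col].dtype == 'object':
        if len(list(train_data[col].unique())) <= 2:
            le.fit(pd.concat([train_data[col], test_data[col]]))
            train_data[col] = le.transform(train_data[col])
            test_data[col] = le.transform(test_data[col])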
# One-hot encode the remaining categorical features
train_data = pd.get_dummies(train_data)
test_data = pd.get_dummies(test_data)
print("训练用例数量:%d,特征数量(含TARGET):%d"%(train_data.shape[0],train_data.shape[1]))
print("测试用例数量:%d,特征数量(不含TARGET):%d"%(test_data.shape[0],test_data.shape[1]))
训练用例数量:307511,特征数量(含TARGET):243
测试用例数量:48744,特征数量(不含TARGET):239
The training and test sets now have different feature columns. This is because some categories never appear in the test data, so one-hot encoding creates columns for them only in the training data. To keep the two datasets aligned, those extra columns must be removed from the training data.
train_labels = train_data['TARGET']
# axis=1: align the two tables on column names; join='inner' keeps the intersection
train_data, test_data = train_data.align(test_data, join = 'inner', axis = 1)
# After aligning, add the TARGET column back
train_data['TARGET'] = train_labels
print("训练用例数量:%d,特征数量(含TARGET):%d"%(train_data.shape[0],train_data.shape[1]))
print("测试用例数量:%d,特征数量(不含TARGET):%d"%(test_data.shape[0],test_data.shape[1]))
训练用例数量:307511,特征数量(含TARGET):240
测试用例数量:48744,特征数量(不含TARGET):239
Now the training and test data share the same features; next we explore the data further.
Anomaly Analysis
One more issue deserves attention during data exploration: anomalies. They may come from human or equipment error, or they may be genuine values that represent rare, extreme cases. Either way they can confuse a model, so they should be handled before training.
# DAYS_BIRTH is recorded as a negative number:
# days relative to the application date; dividing by -365 gives the applicant's age in years
(train_data['DAYS_BIRTH']/ -365).describe()
count 307511.000000
mean 43.936973
std 11.956133
min 20.517808
25% 34.008219
50% 43.150685
75% 53.923288
max 69.120548
Name: DAYS_BIRTH, dtype: float64
# Employment-length statistics; / -365 converts days before the application into years
(train_data['DAYS_EMPLOYED'] / -365).describe()
count 307511.000000
mean -174.835742
std 387.056895
min -1000.665753
25% 0.791781
50% 3.323288
75% 7.561644
max 49.073973
Name: DAYS_EMPLOYED, dtype: float64
Something is off in the employment lengths: how can anyone have been employed for -1000 years?
# Employment-length histogram
(train_data['DAYS_EMPLOYED']/-365).plot.hist(title = 'Years Employed Histogram')
plt.xlabel('Years_Employment')
# Employment-day counts: the anomalies all come from the value 365243
train_data['DAYS_EMPLOYED'].value_counts().head()
365243 55374
-200 156
-224 152
-199 151
-230 151
Name: DAYS_EMPLOYED, dtype: int64
# Split the clients into anomalous and normal groups
anom = train_data[train_data['DAYS_EMPLOYED'] == 365243]
non_anom = train_data[train_data['DAYS_EMPLOYED'] != 365243]
print('Default rate among normal samples: %.2f%%' % (100 * non_anom['TARGET'].mean()))
print('Default rate among anomalous samples: %.2f%%' % (100 * anom['TARGET'].mean()))
print('Number of samples with anomalous employment length: %d' % len(anom))
Default rate among normal samples: 8.66%
Default rate among anomalous samples: 5.40%
Number of samples with anomalous employment length: 55374
The anomalous samples actually default about 3 percentage points less often than the normal ones. So how should the anomalies be handled? The crudest option is to drop them outright, but that would throw away tens of thousands of samples. A safer approach is to replace the anomalous values and add a new column that flags these samples in the training data.
# Anomaly flag column
train_data['DAYS_EMPLOYED_ANOM'] = train_data['DAYS_EMPLOYED'] == 365243
# Replace the anomalous employment lengths (in days) with np.nan
train_data['DAYS_EMPLOYED'].replace({365243: np.nan}, inplace = True)
(train_data['DAYS_EMPLOYED']/-365).plot.hist(title = 'Years Employment Histogram')
plt.xlabel('Years Employment')
The employment-length distribution now looks much closer to what we would expect. The anomalies have been replaced with np.nan, and a new feature flags them. Next, the anomalies in the test data get the same treatment, to keep the two datasets aligned.
test_data['DAYS_EMPLOYED_ANOM'] = test_data['DAYS_EMPLOYED'] == 365243
test_data['DAYS_EMPLOYED'].replace({365243: np.nan}, inplace = True)
print('Number of test samples with anomalous employment length: %d' % test_data['DAYS_EMPLOYED_ANOM'].sum())
Number of test samples with anomalous employment length: 9274
# Employment-length distribution in the test data
(test_data['DAYS_EMPLOYED']/ -365).plot.hist(title = 'Years Employment Histogram')
plt.xlabel('Years Employment')
Correlation Analysis
Correlation analysis looks for relationships between the feature variables and TARGET, helping us discover latent structure in the data. A rough guide to the absolute value of the correlation coefficient (a small helper after the list makes this concrete):
- 0.00-0.19 “very weak”
- 0.20-0.39 “weak”
- 0.40-0.59 “moderate”
- 0.60-0.79 “strong”
- 0.80-1.0 “very strong”
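A small helper that instantiates this guide (an illustrative sketch; the function name and thresholds simply mirror the list above):

def corr_strength(r):
    # Map the absolute correlation coefficient to the verbal scale above
    a = abs(r)
    if a < 0.20: return 'very weak'
    if a < 0.40: return 'weak'
    if a < 0.60: return 'moderate'
    if a < 0.80: return 'strong'
    return 'very strong'

print(corr_strength(-0.178919))  # EXT_SOURCE_3 vs TARGET -> 'very weak'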
# Correlation of each feature column with TARGET, sorted
correlations = train_data.corr()['TARGET'].sort_values()
print('10 most positive correlations:\n', correlations.tail(10))
print('10 most negative correlations:\n', correlations.head(10))
10 most positive correlations:
DAYS_ID_PUBLISH 0.051457
CODE_GENDER_M 0.054713
DAYS_LAST_PHONE_CHANGE 0.055218
NAME_INCOME_TYPE_Working 0.057481
REGION_RATING_CLIENT 0.058899
REGION_RATING_CLIENT_W_CITY 0.060893
DAYS_EMPLOYED 0.074958
DAYS_BIRTH 0.078239
TARGET 1.000000
DAYS_EMPLOYED_ANOM NaN
Name: TARGET, dtype: float64
10 most negative correlations:
EXT_SOURCE_3 -0.178919
EXT_SOURCE_2 -0.160472
EXT_SOURCE_1 -0.155317
NAME_EDUCATION_TYPE_Higher education -0.056593
CODE_GENDER_F -0.054704
NAME_INCOME_TYPE_Pensioner -0.046209
ORGANIZATION_TYPE_XNA -0.045987
FLOORSMAX_AVG -0.044003
FLOORSMAX_MEDI -0.043768
FLOORSMAX_MODE -0.043226
Name: TARGET, dtype: float64
Apart from TARGET itself, the feature most positively correlated with it is DAYS_BIRTH, which records the client's age at application time as a negative number of days: the smaller (more negative) the value, the older the client.
Effect of Age on Repayment
Next we explore further how client age affects repayment.
# Take the absolute value of the age in days
train_data['DAYS_BIRTH'] = abs(train_data['DAYS_BIRTH'])
train_data['DAYS_BIRTH'].corr(train_data['TARGET'])
-0.078239308309827449
(train_data['DAYS_BIRTH']/365).corr(train_data['TARGET'])
-0.078239308309827282
After taking the absolute value, the original positive correlation becomes negative: older clients default less often. Converting the unit from days to years barely changes the coefficient.
Next, a histogram of the age distribution, with days converted to years.
#plt.style.use('fivethirtyeight')
plt.hist(train_data['DAYS_BIRTH'] / 365, edgecolor = 'k', bins = 25)
plt.title('Age of Client')
plt.xlabel('Age (years)')
plt.ylabel('Count')
The age distribution shows no obvious anomalies either. Next we estimate the age distribution separately for each TARGET value with kernel density estimation (KDE). KDE is non-parametric: it injects no prior assumptions and fits the distribution from the characteristics of the data itself.
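Under the hood this is what sns.kdeplot computes; a minimal sketch with scipy.stats.gaussian_kde (scipy is an added assumption here, not imported in the original notebook):

from scipy.stats import gaussian_kde

ages = (train_data.loc[train_data['TARGET'] == 0, 'DAYS_BIRTH'] / 365).dropna()
kde = gaussian_kde(ages)           # bandwidth picked automatically (Scott's rule)
xs = np.linspace(20, 70, 100)      # evaluation grid over the age range
density = kde(xs)                  # estimated probability density at each age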
# Estimate the age densities of on-time and defaulting clients separately
plt.figure(figsize = (10, 6))
# TARGET: 0, repaid on time
sns.kdeplot(train_data.loc[train_data['TARGET'] == 0, 'DAYS_BIRTH']/365, label = 'target = 0')
# TARGET: 1, defaulted
sns.kdeplot(train_data.loc[train_data['TARGET'] == 1, 'DAYS_BIRTH']/365, label = 'target = 1')
plt.xlabel('Age (years)')
plt.ylim(0, 0.04)
plt.ylabel('Density')
plt.title('KDE Distribution of Ages')
The plot now reveals some clear patterns:
- target = 0, the blue curve: clients who repay on time are concentrated between ages 30 and 60, with a higher average age.
- target = 1, the red curve: among defaulting clients, defaults decrease steadily with age after 30.
Next, compute the proportion of defaulters in each age bracket.
age_data = train_data[['TARGET', 'DAYS_BIRTH']]
age_data['YEARS_BIRTH'] = age_data['DAYS_BIRTH'] / 365
# Bin the ages into 5-year brackets
age_data['YEARS_BINNED'] = pd.cut(age_data['YEARS_BIRTH'], bins = np.linspace(20, 70,num = 11))
age_data.head(10)
TARGET | DAYS_BIRTH | YEARS_BIRTH | YEARS_BINNED | |
---|---|---|---|---|
0 | 1 | 9461 | 25.920548 | (25.0, 30.0] |
1 | 0 | 16765 | 45.931507 | (45.0, 50.0] |
2 | 0 | 19046 | 52.180822 | (50.0, 55.0] |
3 | 0 | 19005 | 52.068493 | (50.0, 55.0] |
4 | 0 | 19932 | 54.608219 | (50.0, 55.0] |
5 | 0 | 16941 | 46.413699 | (45.0, 50.0] |
6 | 0 | 13778 | 37.747945 | (35.0, 40.0] |
7 | 0 | 18850 | 51.643836 | (50.0, 55.0] |
8 | 0 | 20099 | 55.065753 | (55.0, 60.0] |
9 | 0 | 14469 | 39.641096 | (35.0, 40.0] |
# Group by age bracket and compute the mean of each column
age_groups = age_data.groupby('YEARS_BINNED').mean()
age_groups
TARGET | DAYS_BIRTH | YEARS_BIRTH | |
---|---|---|---|
YEARS_BINNED | |||
(20.0, 25.0] | 0.123036 | 8532.795625 | 23.377522 |
(25.0, 30.0] | 0.111436 | 10155.219250 | 27.822518 |
(30.0, 35.0] | 0.102814 | 11854.848377 | 32.479037 |
(35.0, 40.0] | 0.089414 | 13707.908253 | 37.555913 |
(40.0, 45.0] | 0.078491 | 15497.661233 | 42.459346 |
(45.0, 50.0] | 0.074171 | 17323.900441 | 47.462741 |
(50.0, 55.0] | 0.066968 | 19196.494791 | 52.593136 |
(55.0, 60.0] | 0.055314 | 20984.262742 | 57.491131 |
(60.0, 65.0] | 0.052737 | 22780.547460 | 62.412459 |
(65.0, 70.0] | 0.037270 | 24292.614340 | 66.555108 |
plt.figure(figsize = (10,4))
plt.bar(age_groups.index.astype(str), 100 * age_groups['TARGET'])
plt.xticks(rotation = 45)
plt.xlabel('Age Group (years)')
plt.ylabel('Failure to Repay(%)')
plt.title('Failure to Repay by Age Group')
Clearly the default rate is highest in the (20, 25] bracket and falls steadily as clients get older — the older the client, the more reliably they repay. This is a very useful piece of information.
Exterior Source
The three features most negatively correlated with TARGET are EXT_SOURCE_1, EXT_SOURCE_2 and EXT_SOURCE_3. They are normalized scores derived from external data sources; Home Credit does not document their exact meaning.
# Correlations between the EXT_SOURCE features and TARGET
ext_data = train_data[['TARGET', 'EXT_SOURCE_1', 'EXT_SOURCE_2', 'EXT_SOURCE_3', 'DAYS_BIRTH']]
ext_data_corrs = ext_data.corr()
print(ext_data_corrs)
TARGET EXT_SOURCE_1 EXT_SOURCE_2 EXT_SOURCE_3 DAYS_BIRTH
TARGET 1.000000 -0.155317 -0.160472 -0.178919 -0.078239
EXT_SOURCE_1 -0.155317 1.000000 0.213982 0.186846 0.600610
EXT_SOURCE_2 -0.160472 0.213982 1.000000 0.109167 0.091996
EXT_SOURCE_3 -0.178919 0.186846 0.109167 1.000000 0.205478
DAYS_BIRTH -0.078239 0.600610 0.091996 0.205478 1.000000
plt.figure(figsize = (7,7))
sns.heatmap(ext_data_corrs, cmap = plt.cm.RdYlBu_r, vmin = 0.25, annot = True, vmax = 0.6)
plt.title('Correlation Heatmap')
KDE of each EXT_SOURCE feature by TARGET
ext_data.head()
TARGET | EXT_SOURCE_1 | EXT_SOURCE_2 | EXT_SOURCE_3 | DAYS_BIRTH | |
---|---|---|---|---|---|
0 | 1 | 0.083037 | 0.262949 | 0.139376 | 9461 |
1 | 0 | 0.311267 | 0.622246 | NaN | 16765 |
2 | 0 | NaN | 0.555912 | 0.729567 | 19046 |
3 | 0 | NaN | 0.650442 | NaN | 19005 |
4 | 0 | NaN | 0.322738 | NaN | 19932 |
plt.figure(figsize = (10, 15))
for i, col_name in enumerate(['EXT_SOURCE_1','EXT_SOURCE_2','EXT_SOURCE_3']):
    plt.subplot(3, 1, i+1)
    # Repaid on time
    sns.kdeplot(ext_data.loc[ext_data['TARGET'] == 0, col_name], label = 'target = 0')
    # Defaulted
    sns.kdeplot(ext_data.loc[ext_data['TARGET'] == 1, col_name], label = 'target = 1')
    plt.title('KDE Distribution of %s by TARGET' % col_name)
    plt.xlabel('%s' % col_name)
    plt.ylabel('Density')
    plt.ylim(0, 3)
# Vertical spacing between subplots
plt.tight_layout(h_pad = 2.5)
The exact meaning of the EXT_SOURCE external normalized scores is unknown, but from the correlation matrix above, EXT_SOURCE_1 and DAYS_BIRTH have a correlation of 0.6, and the KDE curves show a matching trend: defaults decline as EXT_SOURCE_1 increases.
Pairs Plot
To explore the pairwise relationships between the variables further, we create a pairs plot with seaborn's PairGrid.
ext_data.head(3)
TARGET | EXT_SOURCE_1 | EXT_SOURCE_2 | EXT_SOURCE_3 | DAYS_BIRTH | |
---|---|---|---|---|---|
0 | 1 | 0.083037 | 0.262949 | 0.139376 | 9461 |
1 | 0 | 0.311267 | 0.622246 | NaN | 16765 |
2 | 0 | NaN | 0.555912 | 0.729567 | 19046 |
age_data.head(3)
TARGET | DAYS_BIRTH | YEARS_BIRTH | YEARS_BINNED | |
---|---|---|---|---|
0 | 1 | 9461 | 25.920548 | (25.0, 30.0] |
1 | 0 | 16765 | 45.931507 | (45.0, 50.0] |
2 | 0 | 19046 | 52.180822 | (50.0, 55.0] |
plot_data = ext_data.drop(['DAYS_BIRTH'], axis = 1).copy()
plot_data['YEAR_BIRTH'] = age_data['YEARS_BIRTH']
# Drop rows with missing values and keep index labels up to 10000 (a label-based slice)
plot_data = plot_data.dropna().loc[:10000,:]
# Create the PairGrid object (size= was renamed height= in newer seaborn versions)
grid = sns.PairGrid(data = plot_data, size = 3, diag_sharey = False,
                    hue = 'TARGET',
                    vars = [x for x in list(plot_data.columns) if x != 'TARGET'])
# Scatter plots in the upper triangle
grid.map_upper(plt.scatter)
grid.map_diag(sns.kdeplot)
grid.map_lower(sns.kdeplot, cmap = plt.cm.OrRd_r)
plt.suptitle('EXT_SOURCE and Age Pairs Plot', size = 32, y = 1.05)
Red marks the defaulters and blue the on-time payers. YEAR_BIRTH and EXT_SOURCE_1 show a strong positive relationship; this normalized score may well be related to the client's age.
That concludes the initial data exploration; next up is feature engineering.