Import the packages needed for this exercise:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, precision_score, recall_score, confusion_matrix
%matplotlib inline
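Model Building
Downloading the Dataset
The dataset for this exercise can be downloaded from https://pan.baidu.com/s/1dtHJiV6zMbf_fWPi-dZ95g
Note: this is a financial dataset that has already been preprocessed (it is not the raw data). The task is to predict whether a loan customer will become overdue. The "status" column is the label: 0 means not overdue, 1 means overdue.
Loading the Data
Load the data from data_all.csv and take a first look at it: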
data_origin = pd.read_csv('data_all.csv')
data_origin.head()
| | low_volume_percent | middle_volume_percent | take_amount_in_later_12_month_highest | trans_amount_increase_rate_lately | trans_activity_month | trans_activity_day | transd_mcc | trans_days_interval_filter | trans_days_interval | regional_mobility | ... | consfin_product_count | consfin_max_limit | consfin_avg_limit | latest_query_day | loans_latest_day | reg_preference_for_trad | latest_query_time_month | latest_query_time_weekday | loans_latest_time_month | loans_latest_time_weekday |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0.01 | 0.99 | 0 | 0.90 | 0.55 | 0.313 | 17.0 | 27.0 | 26.0 | 3.0 | ... | 2.0 | 1200.0 | 1200.0 | 12.0 | 18.0 | 0 | 4.0 | 2.0 | 4.0 | 3.0 |
1 | 0.02 | 0.94 | 2000 | 1.28 | 1.00 | 0.458 | 19.0 | 30.0 | 14.0 | 4.0 | ... | 6.0 | 22800.0 | 9360.0 | 4.0 | 2.0 | 0 | 5.0 | 3.0 | 5.0 | 5.0 |
2 | 0.04 | 0.96 | 0 | 1.00 | 1.00 | 0.114 | 13.0 | 68.0 | 22.0 | 1.0 | ... | 1.0 | 4200.0 | 4200.0 | 2.0 | 6.0 | 0 | 5.0 | 5.0 | 5.0 | 1.0 |
3 | 0.00 | 0.96 | 2000 | 0.13 | 0.57 | 0.777 | 22.0 | 14.0 | 6.0 | 3.0 | ... | 5.0 | 30000.0 | 12180.0 | 2.0 | 4.0 | 1 | 5.0 | 5.0 | 5.0 | 3.0 |
4 | 0.01 | 0.99 | 0 | 0.46 | 1.00 | 0.175 | 13.0 | 66.0 | 42.0 | 1.0 | ... | 2.0 | 8400.0 | 8250.0 | 22.0 | 120.0 | 0 | 4.0 | 6.0 | 1.0 | 6.0 |
5 rows × 85 columns
View summary statistics for each column:
data_origin.describe()
| | low_volume_percent | middle_volume_percent | take_amount_in_later_12_month_highest | trans_amount_increase_rate_lately | trans_activity_month | trans_activity_day | transd_mcc | trans_days_interval_filter | trans_days_interval | regional_mobility | ... | consfin_product_count | consfin_max_limit | consfin_avg_limit | latest_query_day | loans_latest_day | reg_preference_for_trad | latest_query_time_month | latest_query_time_weekday | loans_latest_time_month | loans_latest_time_weekday |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 4754.000000 | 4754.000000 | 4754.000000 | 4754.000000 | 4754.000000 | 4754.000000 | 4754.000000 | 4754.000000 | 4754.000000 | 4754.000000 | ... | 4754.000000 | 4754.000000 | 4754.000000 | 4754.000000 | 4754.000000 | 4754.000000 | 4754.000000 | 4754.00000 | 4754.000000 | 4754.000000 |
mean | 0.021801 | 0.901332 | 1940.197728 | 14.152318 | 0.804493 | 0.365356 | 17.503155 | 29.004628 | 21.748422 | 2.678797 | ... | 5.088347 | 16418.973496 | 7507.426378 | 24.041649 | 51.984013 | 0.372949 | 4.273875 | 3.42196 | 4.542701 | 3.025873 |
std | 0.041519 | 0.144837 | 3923.971494 | 693.961441 | 0.196920 | 0.170194 | 4.474686 | 22.711659 | 16.472031 | 0.890198 | ... | 3.344794 | 13885.107357 | 5830.674623 | 36.500344 | 53.249364 | 0.687382 | 1.333778 | 1.93213 | 2.987731 | 1.895870 |
min | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.120000 | 0.033000 | 2.000000 | 0.000000 | 4.000000 | 1.000000 | ... | 0.000000 | 0.000000 | 0.000000 | -2.000000 | -2.000000 | 0.000000 | 1.000000 | 0.00000 | 1.000000 | 0.000000 |
25% | 0.010000 | 0.880000 | 0.000000 | 0.620000 | 0.670000 | 0.233000 | 15.000000 | 16.000000 | 12.000000 | 2.000000 | ... | 3.000000 | 7800.000000 | 4200.000000 | 6.000000 | 7.000000 | 0.000000 | 4.000000 | 2.00000 | 3.000000 | 2.000000 |
50% | 0.010000 | 0.960000 | 500.000000 | 0.970000 | 0.860000 | 0.350000 | 17.000000 | 23.000000 | 17.000000 | 3.000000 | ... | 4.000000 | 14400.000000 | 6750.000000 | 16.000000 | 29.000000 | 0.000000 | 4.000000 | 4.00000 | 4.000000 | 3.000000 |
75% | 0.020000 | 0.990000 | 2000.000000 | 1.600000 | 1.000000 | 0.479500 | 20.000000 | 32.000000 | 26.750000 | 3.000000 | ... | 7.000000 | 20400.000000 | 9696.250000 | 23.000000 | 86.000000 | 1.000000 | 5.000000 | 5.00000 | 5.000000 | 5.000000 |
max | 1.000000 | 1.000000 | 68000.000000 | 47596.740000 | 1.000000 | 0.941000 | 42.000000 | 285.000000 | 234.000000 | 5.000000 | ... | 20.000000 | 266400.000000 | 82800.000000 | 360.000000 | 323.000000 | 4.000000 | 12.000000 | 6.00000 | 12.000000 | 6.000000 |
8 rows × 85 columns
Check whether the data contains any missing values:
data_origin.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4754 entries, 0 to 4753
Data columns (total 85 columns):
low_volume_percent 4754 non-null float64
middle_volume_percent 4754 non-null float64
take_amount_in_later_12_month_highest 4754 non-null int64
trans_amount_increase_rate_lately 4754 non-null float64
trans_activity_month 4754 non-null float64
trans_activity_day 4754 non-null float64
transd_mcc 4754 non-null float64
trans_days_interval_filter 4754 non-null float64
trans_days_interval 4754 non-null float64
regional_mobility 4754 non-null float64
repayment_capability 4754 non-null int64
is_high_user 4754 non-null int64
number_of_trans_from_2011 4754 non-null float64
first_transaction_time 4754 non-null float64
historical_trans_amount 4754 non-null int64
historical_trans_day 4754 non-null float64
rank_trad_1_month 4754 non-null float64
trans_amount_3_month 4754 non-null int64
avg_consume_less_12_valid_month 4754 non-null float64
abs 4754 non-null int64
top_trans_count_last_1_month 4754 non-null float64
avg_price_last_12_month 4754 non-null int64
avg_price_top_last_12_valid_month 4754 non-null float64
trans_top_time_last_1_month 4754 non-null float64
trans_top_time_last_6_month 4754 non-null float64
consume_top_time_last_1_month 4754 non-null float64
consume_top_time_last_6_month 4754 non-null float64
cross_consume_count_last_1_month 4754 non-null float64
trans_fail_top_count_enum_last_1_month 4754 non-null float64
trans_fail_top_count_enum_last_6_month 4754 non-null float64
trans_fail_top_count_enum_last_12_month 4754 non-null float64
consume_mini_time_last_1_month 4754 non-null float64
max_cumulative_consume_later_1_month 4754 non-null int64
max_consume_count_later_6_month 4754 non-null float64
railway_consume_count_last_12_month 4754 non-null float64
pawns_auctions_trusts_consume_last_1_month 4754 non-null int64
pawns_auctions_trusts_consume_last_6_month 4754 non-null int64
jewelry_consume_count_last_6_month 4754 non-null float64
status 4754 non-null int64
first_transaction_day 4754 non-null float64
trans_day_last_12_month 4754 non-null float64
apply_score 4754 non-null float64
apply_credibility 4754 non-null float64
query_org_count 4754 non-null float64
query_finance_count 4754 non-null float64
query_cash_count 4754 non-null float64
query_sum_count 4754 non-null float64
latest_one_month_apply 4754 non-null float64
latest_three_month_apply 4754 non-null float64
latest_six_month_apply 4754 non-null float64
loans_score 4754 non-null float64
loans_credibility_behavior 4754 non-null float64
loans_count 4754 non-null float64
loans_settle_count 4754 non-null float64
loans_overdue_count 4754 non-null float64
loans_org_count_behavior 4754 non-null float64
consfin_org_count_behavior 4754 non-null float64
loans_cash_count 4754 non-null float64
latest_one_month_loan 4754 non-null float64
latest_three_month_loan 4754 non-null float64
latest_six_month_loan 4754 non-null float64
history_suc_fee 4754 non-null float64
history_fail_fee 4754 non-null float64
latest_one_month_suc 4754 non-null float64
latest_one_month_fail 4754 non-null float64
loans_long_time 4754 non-null float64
loans_credit_limit 4754 non-null float64
loans_credibility_limit 4754 non-null float64
loans_org_count_current 4754 non-null float64
loans_product_count 4754 non-null float64
loans_max_limit 4754 non-null float64
loans_avg_limit 4754 non-null float64
consfin_credit_limit 4754 non-null float64
consfin_credibility 4754 non-null float64
consfin_org_count_current 4754 non-null float64
consfin_product_count 4754 non-null float64
consfin_max_limit 4754 non-null float64
consfin_avg_limit 4754 non-null float64
latest_query_day 4754 non-null float64
loans_latest_day 4754 non-null float64
reg_preference_for_trad 4754 non-null int64
latest_query_time_month 4754 non-null float64
latest_query_time_weekday 4754 non-null float64
loans_latest_time_month 4754 non-null float64
loans_latest_time_weekday 4754 non-null float64
dtypes: float64(73), int64(12)
memory usage: 3.1 MB
From the output above, the data has 85 columns (84 features plus the label column status) and 4754 samples, and there are no missing values.
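The info() listing is long, so the same check can be condensed into a couple of numbers. The following is only a sketch on a tiny made-up frame (the `demo` DataFrame and its values are assumptions, not rows from data_all.csv); one value is deliberately missing so the check has something to find. It also prints the share of each class in the label column, which is useful context for the accuracy figures later on.

```python
import numpy as np
import pandas as pd

# Small synthetic frame standing in for data_all.csv (values are made up).
demo = pd.DataFrame({
    "loans_score": [552.0, 635.0, np.nan, 542.0, 590.0],
    "apply_score": [583.0, 653.0, 654.0, 595.0, 541.0],
    "status": [1, 0, 0, 1, 0],
})

# Count missing values per column, then in total.
missing_per_column = demo.isnull().sum()
total_missing = int(missing_per_column.sum())

# Share of each class in the label column.
class_share = demo["status"].value_counts(normalize=True)

print(missing_per_column)
print("total missing:", total_missing)
print(class_share)
```

On the real data the same two expressions would be applied to data_origin instead of demo.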
Splitting the Dataset
First take the status column as the label y and the remaining columns as the feature matrix X:
y = data_origin.status
X = data_origin.drop(['status'], axis=1)
Then use scikit-learn to split the data into a training set and a test set at a 7:3 ratio, with random seed 2018:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=2018)
Check the sizes of the resulting training and test sets:
[X_train.shape, y_train.shape, X_test.shape, y_test.shape]
[(3327, 84), (3327,), (1427, 84), (1427,)]
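The 3327/1427 split follows from how train_test_split handles a fractional test_size: the test set receives the ceiling of 0.3 × 4754 rows and the training set gets the rest. A small sketch that reproduces the shapes with placeholder arrays (`X_demo` and `y_demo` are stand-ins filled with zeros, not the real data):

```python
import math
import numpy as np
from sklearn.model_selection import train_test_split

n_samples, n_features = 4754, 84  # sizes reported in the post

# Placeholder arrays with the same shape as X and y (contents are irrelevant here).
X_demo = np.zeros((n_samples, n_features))
y_demo = np.zeros(n_samples, dtype=int)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=2018
)

# With a fractional test_size, the test set gets ceil(0.3 * n) rows
# and the training set gets the remainder.
expected_test = math.ceil(0.3 * n_samples)
print(X_tr.shape, X_te.shape, expected_test)
```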
Building the Models
This part builds three models: logistic regression, SVM, and a decision tree.
Logistic Regression
log_reg = LogisticRegression()
log_reg.fit(X_train, y_train)
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
verbose=0, warm_start=False)
SVM
svc = SVC()
svc.fit(X_train, y_train)
SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
decision_function_shape='ovr', degree=3, gamma='auto', kernel='rbf',
max_iter=-1, probability=False, random_state=None, shrinking=True,
tol=0.001, verbose=False)
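The SVC above is trained on the raw features, whose ranges differ by orders of magnitude (the describe() output shows ratios in [0, 1] next to credit limits in the tens of thousands). RBF-kernel SVMs are generally sensitive to feature scale, so one variation worth trying, which is not part of the original experiment, is to standardize the features first. A minimal sketch on synthetic data; the data, the make_classification settings, and the variable names are all assumptions for illustration, and no claim is made about what this would do on data_all.csv:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in data; the columns are stretched to very different scales
# to mimic features such as ratios sitting next to credit limits.
X_syn, y_syn = make_classification(
    n_samples=400, n_features=5, n_informative=3, n_redundant=0, random_state=0
)
X_syn = X_syn * np.array([1.0, 10.0, 1e3, 1e4, 1e5])

X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.3, random_state=2018
)

# Standardize each feature to zero mean and unit variance before the RBF SVM.
scaled_svc = make_pipeline(StandardScaler(), SVC())
scaled_svc.fit(X_tr, y_tr)

pred = scaled_svc.predict(X_te)
score = scaled_svc.score(X_te, y_te)
print("predicted classes:", np.unique(pred))
print("accuracy: %.2f" % score)
```

Wrapping the scaler and the classifier in one pipeline means the scaling statistics are learned from the training split only and then reused on the test split.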
Decision Tree
dt_clf = DecisionTreeClassifier()
dt_clf.fit(X_train, y_train)
DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
max_features=None, max_leaf_nodes=None,
min_impurity_decrease=0.0, min_impurity_split=None,
min_samples_leaf=1, min_samples_split=2,
min_weight_fraction_leaf=0.0, presort=False, random_state=None,
splitter='best')
Model Evaluation
First generate predictions on the test set:
log_predict = log_reg.predict(X_test)
svc_predict = svc.predict(X_test)
dt_predict = dt_clf.predict(X_test)
Accuracy
print('Logistic regression accuracy: %.2f%%' % (accuracy_score(y_test, log_predict) * 100))
print('Decision tree accuracy: %.2f%%' % (accuracy_score(y_test, dt_predict) * 100))
print('SVM accuracy: %.2f%%' % (accuracy_score(y_test, svc_predict) * 100))
Logistic regression accuracy: 74.84%
Decision tree accuracy: 67.83%
SVM accuracy: 74.84%
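An accuracy figure is easier to judge next to a trivial baseline. The sketch below uses the test-set class counts that appear in the confusion matrices later in this post (1068 non-overdue and 359 overdue samples) and computes the accuracy of always predicting class 0; the variable names are illustrative.

```python
# Test-set class counts taken from the confusion matrices in this post.
n_not_overdue = 1068  # status = 0
n_overdue = 359       # status = 1
n_test = n_not_overdue + n_overdue

# Accuracy of a trivial classifier that always predicts the majority class (0).
baseline_accuracy = n_not_overdue / n_test
print("majority-class baseline: %.2f%%" % (baseline_accuracy * 100))
```

The baseline comes out at 74.84%, the same value reported for logistic regression and SVM, which is a first hint that those two models predict class 0 for every test sample.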
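Confusion Matrix
Each row of a confusion matrix represents an actual class and each column a predicted class. A perfect classifier would produce only true negatives and true positives, so the larger the values on the main diagonal (top-left to bottom-right) and the smaller the off-diagonal values, the better.
# scikit-learn expects the arguments in the order (y_true, y_pred)
log_conf = confusion_matrix(y_test, log_predict)
dt_conf = confusion_matrix(y_test, dt_predict)
svm_conf = confusion_matrix(y_test, svc_predict)
print('Logistic regression confusion matrix:\n%s' % (log_conf))
print('Decision tree confusion matrix:\n%s' % (dt_conf))
print('SVM confusion matrix:\n%s' % (svm_conf))
Logistic regression confusion matrix:
[[1068    0]
 [ 359    0]]
Decision tree confusion matrix:
[[832 236]
 [223 136]]
SVM confusion matrix:
[[1068    0]
 [ 359    0]]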
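The argument order of confusion_matrix matters: the first argument supplies the rows (actual classes) and the second the columns (predicted classes). A toy illustration with made-up labels, showing that swapping the two arguments transposes the matrix and therefore exchanges false positives and false negatives:

```python
from sklearn.metrics import confusion_matrix

y_true = [0, 0, 0, 0, 1, 1]  # actual labels (toy data)
y_pred = [0, 0, 1, 1, 1, 0]  # predicted labels (toy data)

# Rows are actual classes, columns are predicted classes.
cm = confusion_matrix(y_true, y_pred)
tn, fp, fn, tp = cm.ravel()
print(cm)
print("TN=%d FP=%d FN=%d TP=%d" % (tn, fp, fn, tp))

# Swapping the arguments transposes the matrix, which exchanges FP and FN.
cm_swapped = confusion_matrix(y_pred, y_true)
print(cm_swapped)
```

Because precision depends on FP and recall on FN, the same swap in precision_score or recall_score silently exchanges the two metrics.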
Plotting the confusion matrices:
fig = plt.figure(figsize=(8, 6))
fig1 = plt.subplot(131)
fig1.matshow(log_conf, cmap=plt.cm.gray)
fig2 = plt.subplot(132)
fig2.matshow(dt_conf, cmap=plt.cm.gray)
fig3 = plt.subplot(133)
fig3.matshow(svm_conf, cmap=plt.cm.gray)
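Precision
Precision is defined as:
$$\text{Precision} = \frac{TP}{TP + FP}$$
where TP (True Positive) is the number of positive samples correctly identified as positive, and FP (False Positive) is the number of negative samples wrongly predicted as positive.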
print('Logistic regression precision: %.2f%%' % (precision_score(y_test, log_predict) * 100))
print('Decision tree precision: %.2f%%' % (precision_score(y_test, dt_predict) * 100))
print('SVM precision: %.2f%%' % (precision_score(y_test, svc_predict) * 100))
Logistic regression precision: 0.00%
Decision tree precision: 36.56%
SVM precision: 0.00%
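Recall
Recall is defined as:
$$\text{Recall} = \frac{TP}{TP + FN}$$
where TP (True Positive) is the number of positive samples correctly identified as positive, and FN (False Negative) is the number of positive samples wrongly predicted as negative.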
print('Logistic regression recall: %.2f%%' % (recall_score(y_test, log_predict) * 100))
print('Decision tree recall: %.2f%%' % (recall_score(y_test, dt_predict) * 100))
print('SVM recall: %.2f%%' % (recall_score(y_test, svc_predict) * 100))
Logistic regression recall: 0.00%
Decision tree recall: 37.88%
SVM recall: 0.00%
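The three metrics can be cross-checked by hand from the four confusion-matrix counts. The sketch below takes the decision-tree counts on the test set as TN = 832, FP = 236 (non-overdue users flagged as overdue), FN = 223 (overdue users missed), and TP = 136; this mapping of counts to cells is stated here as an assumption so the arithmetic stands on its own.

```python
# Decision-tree counts on the test set (assumed mapping, see the text above).
tn, fp = 832, 236   # actual 0: predicted 0 / predicted 1
fn, tp = 223, 136   # actual 1: predicted 0 / predicted 1

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)

print("accuracy  %.2f%%" % (accuracy * 100))
print("precision %.2f%%" % (precision * 100))
print("recall    %.2f%%" % (recall * 100))
```

Under these counts the tree flags 372 users as overdue and is right about 136 of them, while it finds 136 of the 359 users who actually are overdue.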
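Because the logistic regression and SVM models predict no positive samples (status=1) on the test set, TP is 0 for both. Recall is therefore 0, while precision is the undefined ratio 0/0, which scikit-learn reports as 0.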
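The 0/0 case can be reproduced on toy labels. In the sketch below the labels are made up, and the `zero_division` argument is assumed to be available (it was added to the scikit-learn metrics in version 0.22); it states explicitly what value to report when a model predicts no positives at all.

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score

y_true = np.array([0, 0, 0, 1, 1])   # toy labels with two positives
y_pred = np.zeros_like(y_true)       # a model that never predicts 1

# TP + FP = 0, so precision is 0/0; zero_division sets the value to report.
p = precision_score(y_true, y_pred, zero_division=0)
# TP + FN = 2 but TP = 0, so recall is a genuine 0.
r = recall_score(y_true, y_pred)
print(p, r)
```

Both values come out as zero, but for different reasons: the recall is a real measurement, whereas the precision is a placeholder for an undefined quantity.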