[Data Analysis Practice] Task 1.1: Model Building

Import the packages needed for this exercise:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, precision_score, recall_score, confusion_matrix

%matplotlib inline

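Model Building

Downloading the Dataset

The practice data can be downloaded from https://pan.baidu.com/s/1dtHJiV6zMbf_fWPi-dZ95g

Note: this is a financial dataset (already preprocessed, not the raw data). The task is to predict whether a loan customer will become overdue. The "status" column is the label: 0 means not overdue, 1 means overdue.

Loading the Data

Load the data from data_all.csv and look at the first few rows: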

data_origin = pd.read_csv('data_all.csv')
data_origin.head()
low_volume_percent middle_volume_percent take_amount_in_later_12_month_highest trans_amount_increase_rate_lately trans_activity_month trans_activity_day transd_mcc trans_days_interval_filter trans_days_interval regional_mobility ... consfin_product_count consfin_max_limit consfin_avg_limit latest_query_day loans_latest_day reg_preference_for_trad latest_query_time_month latest_query_time_weekday loans_latest_time_month loans_latest_time_weekday
0 0.01 0.99 0 0.90 0.55 0.313 17.0 27.0 26.0 3.0 ... 2.0 1200.0 1200.0 12.0 18.0 0 4.0 2.0 4.0 3.0
1 0.02 0.94 2000 1.28 1.00 0.458 19.0 30.0 14.0 4.0 ... 6.0 22800.0 9360.0 4.0 2.0 0 5.0 3.0 5.0 5.0
2 0.04 0.96 0 1.00 1.00 0.114 13.0 68.0 22.0 1.0 ... 1.0 4200.0 4200.0 2.0 6.0 0 5.0 5.0 5.0 1.0
3 0.00 0.96 2000 0.13 0.57 0.777 22.0 14.0 6.0 3.0 ... 5.0 30000.0 12180.0 2.0 4.0 1 5.0 5.0 5.0 3.0
4 0.01 0.99 0 0.46 1.00 0.175 13.0 66.0 42.0 1.0 ... 2.0 8400.0 8250.0 22.0 120.0 0 4.0 6.0 1.0 6.0

5 rows × 85 columns

View summary statistics for each column:

data_origin.describe()
low_volume_percent middle_volume_percent take_amount_in_later_12_month_highest trans_amount_increase_rate_lately trans_activity_month trans_activity_day transd_mcc trans_days_interval_filter trans_days_interval regional_mobility ... consfin_product_count consfin_max_limit consfin_avg_limit latest_query_day loans_latest_day reg_preference_for_trad latest_query_time_month latest_query_time_weekday loans_latest_time_month loans_latest_time_weekday
count 4754.000000 4754.000000 4754.000000 4754.000000 4754.000000 4754.000000 4754.000000 4754.000000 4754.000000 4754.000000 ... 4754.000000 4754.000000 4754.000000 4754.000000 4754.000000 4754.000000 4754.000000 4754.00000 4754.000000 4754.000000
mean 0.021801 0.901332 1940.197728 14.152318 0.804493 0.365356 17.503155 29.004628 21.748422 2.678797 ... 5.088347 16418.973496 7507.426378 24.041649 51.984013 0.372949 4.273875 3.42196 4.542701 3.025873
std 0.041519 0.144837 3923.971494 693.961441 0.196920 0.170194 4.474686 22.711659 16.472031 0.890198 ... 3.344794 13885.107357 5830.674623 36.500344 53.249364 0.687382 1.333778 1.93213 2.987731 1.895870
min 0.000000 0.000000 0.000000 0.000000 0.120000 0.033000 2.000000 0.000000 4.000000 1.000000 ... 0.000000 0.000000 0.000000 -2.000000 -2.000000 0.000000 1.000000 0.00000 1.000000 0.000000
25% 0.010000 0.880000 0.000000 0.620000 0.670000 0.233000 15.000000 16.000000 12.000000 2.000000 ... 3.000000 7800.000000 4200.000000 6.000000 7.000000 0.000000 4.000000 2.00000 3.000000 2.000000
50% 0.010000 0.960000 500.000000 0.970000 0.860000 0.350000 17.000000 23.000000 17.000000 3.000000 ... 4.000000 14400.000000 6750.000000 16.000000 29.000000 0.000000 4.000000 4.00000 4.000000 3.000000
75% 0.020000 0.990000 2000.000000 1.600000 1.000000 0.479500 20.000000 32.000000 26.750000 3.000000 ... 7.000000 20400.000000 9696.250000 23.000000 86.000000 1.000000 5.000000 5.00000 5.000000 5.000000
max 1.000000 1.000000 68000.000000 47596.740000 1.000000 0.941000 42.000000 285.000000 234.000000 5.000000 ... 20.000000 266400.000000 82800.000000 360.000000 323.000000 4.000000 12.000000 6.00000 12.000000 6.000000

8 rows × 85 columns

Check whether the data contains missing values:

data_origin.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4754 entries, 0 to 4753
Data columns (total 85 columns):
low_volume_percent                            4754 non-null float64
middle_volume_percent                         4754 non-null float64
take_amount_in_later_12_month_highest         4754 non-null int64
trans_amount_increase_rate_lately             4754 non-null float64
trans_activity_month                          4754 non-null float64
trans_activity_day                            4754 non-null float64
transd_mcc                                    4754 non-null float64
trans_days_interval_filter                    4754 non-null float64
trans_days_interval                           4754 non-null float64
regional_mobility                             4754 non-null float64
repayment_capability                          4754 non-null int64
is_high_user                                  4754 non-null int64
number_of_trans_from_2011                     4754 non-null float64
first_transaction_time                        4754 non-null float64
historical_trans_amount                       4754 non-null int64
historical_trans_day                          4754 non-null float64
rank_trad_1_month                             4754 non-null float64
trans_amount_3_month                          4754 non-null int64
avg_consume_less_12_valid_month               4754 non-null float64
abs                                           4754 non-null int64
top_trans_count_last_1_month                  4754 non-null float64
avg_price_last_12_month                       4754 non-null int64
avg_price_top_last_12_valid_month             4754 non-null float64
trans_top_time_last_1_month                   4754 non-null float64
trans_top_time_last_6_month                   4754 non-null float64
consume_top_time_last_1_month                 4754 non-null float64
consume_top_time_last_6_month                 4754 non-null float64
cross_consume_count_last_1_month              4754 non-null float64
trans_fail_top_count_enum_last_1_month        4754 non-null float64
trans_fail_top_count_enum_last_6_month        4754 non-null float64
trans_fail_top_count_enum_last_12_month       4754 non-null float64
consume_mini_time_last_1_month                4754 non-null float64
max_cumulative_consume_later_1_month          4754 non-null int64
max_consume_count_later_6_month               4754 non-null float64
railway_consume_count_last_12_month           4754 non-null float64
pawns_auctions_trusts_consume_last_1_month    4754 non-null int64
pawns_auctions_trusts_consume_last_6_month    4754 non-null int64
jewelry_consume_count_last_6_month            4754 non-null float64
status                                        4754 non-null int64
first_transaction_day                         4754 non-null float64
trans_day_last_12_month                       4754 non-null float64
apply_score                                   4754 non-null float64
apply_credibility                             4754 non-null float64
query_org_count                               4754 non-null float64
query_finance_count                           4754 non-null float64
query_cash_count                              4754 non-null float64
query_sum_count                               4754 non-null float64
latest_one_month_apply                        4754 non-null float64
latest_three_month_apply                      4754 non-null float64
latest_six_month_apply                        4754 non-null float64
loans_score                                   4754 non-null float64
loans_credibility_behavior                    4754 non-null float64
loans_count                                   4754 non-null float64
loans_settle_count                            4754 non-null float64
loans_overdue_count                           4754 non-null float64
loans_org_count_behavior                      4754 non-null float64
consfin_org_count_behavior                    4754 non-null float64
loans_cash_count                              4754 non-null float64
latest_one_month_loan                         4754 non-null float64
latest_three_month_loan                       4754 non-null float64
latest_six_month_loan                         4754 non-null float64
history_suc_fee                               4754 non-null float64
history_fail_fee                              4754 non-null float64
latest_one_month_suc                          4754 non-null float64
latest_one_month_fail                         4754 non-null float64
loans_long_time                               4754 non-null float64
loans_credit_limit                            4754 non-null float64
loans_credibility_limit                       4754 non-null float64
loans_org_count_current                       4754 non-null float64
loans_product_count                           4754 non-null float64
loans_max_limit                               4754 non-null float64
loans_avg_limit                               4754 non-null float64
consfin_credit_limit                          4754 non-null float64
consfin_credibility                           4754 non-null float64
consfin_org_count_current                     4754 non-null float64
consfin_product_count                         4754 non-null float64
consfin_max_limit                             4754 non-null float64
consfin_avg_limit                             4754 non-null float64
latest_query_day                              4754 non-null float64
loans_latest_day                              4754 non-null float64
reg_preference_for_trad                       4754 non-null int64
latest_query_time_month                       4754 non-null float64
latest_query_time_weekday                     4754 non-null float64
loans_latest_time_month                       4754 non-null float64
loans_latest_time_weekday                     4754 non-null float64
dtypes: float64(73), int64(12)
memory usage: 3.1 MB

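The output above shows that the dataset has 85 columns (84 features plus the label column status) and 4754 samples, and that it contains no missing values.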
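The same conclusion can be reached by counting missing values directly rather than reading the non-null counts line by line. The sketch below uses a small hypothetical DataFrame (not data_all.csv) with one value deliberately left empty, so that a non-zero count is visible.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for data_all.csv;
# one loans_score value is deliberately missing.
toy = pd.DataFrame({
    'loans_score': [552.0, np.nan, 635.0],
    'status': [0, 1, 0],
})

# Number of NaN values in each column
missing_per_column = toy.isnull().sum()
# Total number of NaN values in the whole frame
total_missing = int(missing_per_column.sum())

print(missing_per_column[missing_per_column > 0])
print('Total missing values:', total_missing)
```

Applied to data_origin, a total of 0 would agree with the info() output above.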

Splitting the Dataset

First, take the status column as the label y and the remaining columns as the feature set X:

y = data_origin.status
X = data_origin.drop(['status'], axis=1)

Then use scikit-learn's train_test_split to split the data into a training set and a test set at a 7:3 ratio, with random seed 2018:

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=2018)

Check the sizes of the resulting training and test sets:

[X_train.shape, y_train.shape, X_test.shape, y_test.shape]
[(3327, 84), (3327,), (1427, 84), (1427,)]
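The split above is purely random, so the share of overdue samples in the two parts can differ slightly. As an optional variation (not what this post does), train_test_split accepts a stratify argument that keeps the label proportions the same in both parts. The sketch below uses hypothetical labels, 25% of them positive, to show the effect.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical labels: 1000 samples, 25% of them positive
y_demo = np.array([0] * 750 + [1] * 250)
# A single dummy feature column, only needed so there is something to split
X_demo = np.arange(len(y_demo)).reshape(-1, 1)

# stratify=y_demo keeps the class proportions equal in both parts
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=2018, stratify=y_demo)

print('positive share in train: %.3f' % y_tr.mean())
print('positive share in test:  %.3f' % y_te.mean())
```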

Building the Models

Three models are built in this section: logistic regression, SVM, and a decision tree.

Logistic Regression

log_reg = LogisticRegression()
log_reg.fit(X_train, y_train)
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

SVM

svc = SVC()
svc.fit(X_train, y_train)
SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape='ovr', degree=3, gamma='auto', kernel='rbf',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False)

Decision Tree

dt_clf = DecisionTreeClassifier()
dt_clf.fit(X_train, y_train)
DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
            max_features=None, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, presort=False, random_state=None,
            splitter='best')

Model Evaluation

First, predict on the test set:

log_predict = log_reg.predict(X_test)
svc_predict = svc.predict(X_test)
dt_predict = dt_clf.predict(X_test)

Accuracy

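# scikit-learn metrics take their arguments in the order (y_true, y_pred)
print('Logistic regression accuracy: %.2f%%' % (accuracy_score(y_test, log_predict) * 100))
print('Decision tree accuracy: %.2f%%' % (accuracy_score(y_test, dt_predict) * 100))
print('SVM accuracy: %.2f%%' % (accuracy_score(y_test, svc_predict) * 100))
Logistic regression accuracy: 74.84%
Decision tree accuracy: 67.83%
SVM accuracy: 74.84%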
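These accuracy figures are easier to judge against a trivial baseline. The confusion matrices in the next section show that the test set holds 1068 non-overdue and 359 overdue samples, so a model that always predicts 0 already reaches the accuracy reported for logistic regression and SVM. The sketch below rebuilds labels from those two counts only. The arrays are reconstructions for illustration, not the real y_test.

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Class counts of the test set, read from the confusion matrices in this post
n_negative, n_positive = 1068, 359

# Reconstructed labels (order does not matter for accuracy)
y_true = np.array([0] * n_negative + [1] * n_positive)
# A "model" that always predicts the majority class 0
y_all_zero = np.zeros_like(y_true)

baseline = accuracy_score(y_true, y_all_zero)
print('Majority-class baseline accuracy: %.2f%%' % (baseline * 100))
```

Accuracy alone therefore says little on this imbalanced dataset, which is why the sections below also look at the confusion matrix, precision, and recall.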

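Confusion Matrix

Each row of the confusion matrix represents an actual class, and each column represents a predicted class. A perfect classifier produces only true negatives and true positives, so the larger the values on the main diagonal (top-left to bottom-right) and the smaller the off-diagonal values, the better.

# confusion_matrix expects (y_true, y_pred); rows are then the actual classes
log_conf = confusion_matrix(y_test, log_predict)
dt_conf = confusion_matrix(y_test, dt_predict)
svm_conf = confusion_matrix(y_test, svc_predict)

print('Logistic regression confusion matrix:\n%s' % (log_conf))
print('Decision tree confusion matrix:\n%s' % (dt_conf))
print('SVM confusion matrix:\n%s' % (svm_conf))
Logistic regression confusion matrix:
[[1068    0]
 [ 359    0]]
Decision tree confusion matrix:
[[832 236]
 [223 136]]
SVM confusion matrix:
[[1068    0]
 [ 359    0]]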
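For a binary problem, the four cells can be unpacked by name, which avoids mixing up rows and columns. The sketch below uses a handful of hypothetical labels. With scikit-learn's (y_true, y_pred) argument order, ravel() returns the cells in the order TN, FP, FN, TP.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical labels, for illustration only
y_true = [0, 0, 0, 0, 1, 1, 1]
y_pred = [0, 0, 1, 1, 0, 1, 1]

# Rows are actual classes, columns are predicted classes
conf = confusion_matrix(y_true, y_pred)
# Flattening a 2x2 matrix row by row gives TN, FP, FN, TP
tn, fp, fn, tp = conf.ravel()

print(conf)
print('TN=%d FP=%d FN=%d TP=%d' % (tn, fp, fn, tp))
```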

Plotting the confusion matrices:

fig = plt.figure(figsize=(8, 6))
fig1 = plt.subplot(131)
fig1.matshow(log_conf, cmap=plt.cm.gray)
fig2 = plt.subplot(132)
fig2.matshow(dt_conf, cmap=plt.cm.gray)
fig3 = plt.subplot(133)
fig3.matshow(svm_conf, cmap=plt.cm.gray)

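[Figure: grayscale plots of the three confusion matrices, left to right: logistic regression, decision tree, SVM]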
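A grayscale plot without labels is hard to read, so one possible refinement is to title each panel, name the axes, and write the counts into the cells. The sketch below hard-codes the counts reported in this post, arranged so that each row is an actual class and each column a predicted class. The layout choices (figure size, text colour) are assumptions, not part of the original notebook.

```python
import numpy as np
import matplotlib.pyplot as plt

# Counts from this post, arranged as rows = actual class, columns = predicted class
matrices = {
    'Logistic regression': np.array([[1068, 0], [359, 0]]),
    'Decision tree': np.array([[832, 236], [223, 136]]),
    'SVM': np.array([[1068, 0], [359, 0]]),
}

fig, axes = plt.subplots(1, 3, figsize=(12, 4))
for ax, (name, conf) in zip(axes, matrices.items()):
    ax.matshow(conf, cmap=plt.cm.gray)
    ax.set_title(name)
    ax.set_xlabel('Predicted class')
    ax.set_ylabel('Actual class')
    ax.set_xticks([0, 1])
    ax.set_yticks([0, 1])
    # Write each count into its cell so the plot can be read without a colour bar
    for (row, col), count in np.ndenumerate(conf):
        ax.text(col, row, str(count), ha='center', va='center', color='red')
fig.tight_layout()
```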

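Precision

Precision is defined as:
$$Precision=\frac{TP}{TP+FP}$$
where TP is the number of positive samples correctly identified (True Positive) and FP is the number of negative samples wrongly predicted as positive (False Positive).

# precision_score expects the arguments in the order (y_true, y_pred)
print('Logistic regression precision: %.2f%%' % (precision_score(y_test, log_predict) * 100))
print('Decision tree precision: %.2f%%' % (precision_score(y_test, dt_predict) * 100))
print('SVM precision: %.2f%%' % (precision_score(y_test, svc_predict) * 100))
Logistic regression precision: 0.00%
Decision tree precision: 36.56%
SVM precision: 0.00%

Recall

Recall is defined as:
$$Recall=\frac{TP}{TP+FN}$$
where TP is the number of positive samples correctly identified (True Positive) and FN is the number of positive samples wrongly predicted as negative (False Negative).

# recall_score expects the arguments in the order (y_true, y_pred)
print('Logistic regression recall: %.2f%%' % (recall_score(y_test, log_predict) * 100))
print('Decision tree recall: %.2f%%' % (recall_score(y_test, dt_predict) * 100))
print('SVM recall: %.2f%%' % (recall_score(y_test, svc_predict) * 100))
Logistic regression recall: 0.00%
Decision tree recall: 37.88%
SVM recall: 0.00%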
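The two definitions can be checked by hand against the decision tree's confusion matrix, which contains 832 true negatives, 236 false positives (predicted overdue, actually not), 223 false negatives (predicted not overdue, actually overdue), and 136 true positives. The sketch below rebuilds label arrays from those four counts. The arrays are reconstructions for illustration, not the real y_test. It also shows that passing the arguments in the wrong order silently swaps the two metrics, since scikit-learn expects (y_true, y_pred).

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score

# Cell counts of the decision tree confusion matrix in this post
tn, fp, fn, tp = 832, 236, 223, 136

# Reconstructed labels that reproduce exactly these four counts
y_true = np.array([0] * tn + [0] * fp + [1] * fn + [1] * tp)
y_pred = np.array([0] * tn + [1] * fp + [0] * fn + [1] * tp)

precision = precision_score(y_true, y_pred)   # TP / (TP + FP) = 136 / 372
recall = recall_score(y_true, y_pred)         # TP / (TP + FN) = 136 / 359

# Swapping the arguments turns "precision" into recall
swapped = precision_score(y_pred, y_true)

print('precision: %.2f%%' % (precision * 100))
print('recall:    %.2f%%' % (recall * 100))
print('precision with swapped arguments: %.2f%%' % (swapped * 100))
```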

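The logistic regression and SVM models predicted no positive samples (status = 1) on the test set, so TP is 0 for both. Recall is therefore 0, and precision is 0/0, which scikit-learn reports as 0 (with a warning).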
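One possible contributing factor, stated here as an assumption that has not been verified on this dataset, is that the features are on very different scales: in the describe() output, low_volume_percent stays within [0, 1] while consfin_max_limit reaches 266400. Both SVC with an RBF kernel and regularised logistic regression are sensitive to feature scale. The sketch below uses hypothetical synthetic data, with one informative small-scale feature and one uninformative large-scale feature, to show how a StandardScaler can be chained in front of the classifier with make_pipeline.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Hypothetical synthetic data: the label depends only on the small-scale feature
rng = np.random.RandomState(2018)
signal = rng.normal(size=600)
large_noise = rng.normal(scale=10000.0, size=600)
X_demo = np.column_stack([signal, large_noise])
y_demo = (signal > 0.5).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=2018)

# The scaler is fitted on the training data only and reused at prediction time
scaled_svc = make_pipeline(StandardScaler(), SVC())
scaled_svc.fit(X_tr, y_tr)
scaled_predict = scaled_svc.predict(X_te)

print('predicted positives:', int(scaled_predict.sum()))
print('accuracy: %.2f' % scaled_svc.score(X_te, y_te))
```

Whether scaling changes the results on data_all.csv would have to be checked by rerunning the evaluation above with such a pipeline.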


Reposted from blog.csdn.net/bear507/article/details/85715485