1. Background Research
Lending is the most basic and most important asset business of a bank, its main source of profit, and also one of its riskier assets. The risk lies in the fact that if a borrower cannot repay, the bank ends up with bad debt and takes a loss. Banks therefore put a great deal of research into whether a loan should be granted. This course project applies the numpy and pandas skills learned in class to clean a dataset collected from the web, then runs logistic regression on the cleaned data to predict whether a loan should be granted.
2. Code Analysis
2.1 Data Preprocessing
import pandas as pd
loans_2007 = pd.read_csv('./LoanStats3a.csv', skiprows=1,low_memory=False)
half_count = len(loans_2007) / 2
loans_2007 = loans_2007.dropna(thresh=half_count, axis=1)
loans_2007 = loans_2007.drop(['desc', 'url'],axis=1)
loans_2007.to_csv('./loans_2007.csv', index=False)
This is the initial cleaning pass: dropna(thresh=half_count, axis=1) keeps only the columns that have non-null values in at least half of the rows, after which we drop the irrelevant desc and url columns and save the result as loans_2007.csv.
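The behavior of dropna with thresh is easy to check on a tiny toy frame (hypothetical data, not from the loan set):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "a": [1, 2, 3, 4],                 # fully populated column
    "b": [1, np.nan, np.nan, np.nan],  # mostly missing column
})
# keep only columns with at least len(df) // 2 non-null values
half_count = len(df) // 2
kept = df.dropna(thresh=half_count, axis=1)
print(kept.columns.tolist())  # ['a'] -- "b" has only 1 non-null value
```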
import pandas as pd
loans_2007 = pd.read_csv("loans_2007.csv")
#loans_2007.drop_duplicates()
print(loans_2007.iloc[0])
print(loans_2007.shape[1])
Print the first row to see which columns are still irrelevant:
id 1077501
member_id 1.2966e+06
loan_amnt 5000
funded_amnt 5000
funded_amnt_inv 4975
term 36 months
int_rate 10.65%
installment 162.87
grade B
sub_grade B2
emp_title NaN
emp_length 10+ years
home_ownership RENT
annual_inc 24000
verification_status Verified
issue_d Dec-2011
loan_status Fully Paid
pymnt_plan n
purpose credit_card
title Computer
zip_code 860xx
addr_state AZ
dti 27.65
delinq_2yrs 0
earliest_cr_line Jan-1985
inq_last_6mths 1
open_acc 3
pub_rec 0
revol_bal 13648
revol_util 83.7%
total_acc 9
initial_list_status f
out_prncp 0
out_prncp_inv 0
total_pymnt 5863.16
total_pymnt_inv 5833.84
total_rec_prncp 5000
total_rec_int 863.16
total_rec_late_fee 0
recoveries 0
collection_recovery_fee 0
last_pymnt_d Jan-2015
last_pymnt_amnt 171.62
last_credit_pull_d Nov-2016
collections_12_mths_ex_med 0
policy_code 1
application_type INDIVIDUAL
acc_now_delinq 0
chargeoff_within_12_mths 0
delinq_amnt 0
pub_rec_bankruptcies 0
tax_liens 0
Name: 0, dtype: object
52
From the printout above we can see there are still 52 features. Training directly on all 52 could easily lead to overfitting, so we need to narrow the feature set further.
loans_2007 = loans_2007.drop(["id", "member_id", "funded_amnt", "funded_amnt_inv",
                              "grade", "sub_grade", "emp_title", "issue_d"], axis=1)
loans_2007 = loans_2007.drop(["zip_code", "out_prncp", "out_prncp_inv", "total_pymnt",
                              "total_pymnt_inv", "total_rec_prncp"], axis=1)
Here we drop the identity-related columns such as id and member_id, along with encoded fields like zip_code; these values obviously contribute nothing to training.
print(loans_2007.iloc[0])
print(loans_2007.shape[1])
Print again to check the result:
loan_amnt 5000
term 36 months
int_rate 10.65%
installment 162.87
emp_length 10+ years
home_ownership RENT
annual_inc 24000
verification_status Verified
loan_status Fully Paid
pymnt_plan n
purpose credit_card
title Computer
addr_state AZ
dti 27.65
delinq_2yrs 0
earliest_cr_line Jan-1985
inq_last_6mths 1
open_acc 3
pub_rec 0
revol_bal 13648
revol_util 83.7%
total_acc 9
initial_list_status f
last_credit_pull_d Nov-2016
collections_12_mths_ex_med 0
policy_code 1
application_type INDIVIDUAL
acc_now_delinq 0
chargeoff_within_12_mths 0
delinq_amnt 0
pub_rec_bankruptcies 0
tax_liens 0
Name: 0, dtype: object
32
The feature count is now down to 32, and there is little room to trim further. At this point we need to decide on the training target. Whether the loan was repaid is the obvious target, so let's print the loan status distribution.
print(loans_2007['loan_status'].value_counts())  # loan status distribution
The result:
Fully Paid 33902
Charged Off 5658
Does not meet the credit policy. Status:Fully Paid 1988
Does not meet the credit policy. Status:Charged Off 761
Current 201
Late (31-120 days) 10
In Grace Period 9
Late (16-30 days) 5
Default 1
Name: loan_status, dtype: int64
Besides fully paid and charged off, loan_status includes several other states, such as payments 31-120 days late or loans still in progress. To reduce the computation we simplify this to a binary classification, repaid or not: Fully Paid becomes 1 and Charged Off becomes 0, as follows:
loans_2007 = loans_2007[(loans_2007['loan_status'] == "Fully Paid") | (loans_2007['loan_status'] == "Charged Off")]
status_replace = {
"loan_status" : {
"Fully Paid": 1,
"Charged Off": 0,
}
}
loans_2007 = loans_2007.replace(status_replace)# 二分类
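As a sanity check, the nested-dict form of replace only touches values in the named column; a minimal toy run (hypothetical values):

```python
import pandas as pd

toy = pd.DataFrame({
    "loan_status": ["Fully Paid", "Charged Off", "Fully Paid"],
    "title": ["Fully Paid", "car", "wedding"],  # same string in another column
})
status_replace = {"loan_status": {"Fully Paid": 1, "Charged Off": 0}}
toy = toy.replace(status_replace)
print(toy["loan_status"].tolist())  # [1, 0, 1]
print(toy["title"].tolist())        # unchanged: ['Fully Paid', 'car', 'wedding']
```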
The data still contains features that carry no information for the model, i.e. columns where every row has the same value. We need to drop those columns as well.
orig_columns = loans_2007.columns
drop_columns = []
for col in orig_columns:
col_series = loans_2007[col].dropna().unique()
if len(col_series) == 1:
drop_columns.append(col)
loans_2007 = loans_2007.drop(drop_columns, axis=1)
print(drop_columns)
print (loans_2007.shape)
loans_2007.to_csv('filtered_loans_2007.csv', index=False)
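The loop above can also be written in one line with nunique, which counts distinct non-null values per column; an equivalent sketch on toy data:

```python
import pandas as pd

df = pd.DataFrame({
    "pymnt_plan": ["n", "n", "n"],   # single unique value: uninformative
    "loan_amnt": [5000, 2400, 10000],
})
# keep only columns with more than one distinct non-null value
filtered = df.loc[:, df.nunique(dropna=True) > 1]
print(filtered.columns.tolist())  # ['loan_amnt']
```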
Now we deal with the features that contain many NaN values. We could either fill or drop them; since plenty of features remain, I simply drop.
import pandas as pd
loans = pd.read_csv('filtered_loans_2007.csv')
null_counts = loans.isnull().sum()
print(null_counts)
The counts look like this:
loan_amnt 0
term 0
int_rate 0
installment 0
emp_length 1073
home_ownership 0
annual_inc 0
verification_status 0
loan_status 0
pymnt_plan 0
purpose 0
title 11
addr_state 0
dti 0
delinq_2yrs 0
earliest_cr_line 0
inq_last_6mths 0
open_acc 0
pub_rec 0
revol_bal 0
revol_util 50
total_acc 0
last_credit_pull_d 2
pub_rec_bankruptcies 697
dtype: int64
We can see that emp_length and pub_rec_bankruptcies both contain many NaN values, but we only drop pub_rec_bankruptcies. emp_length is the borrower's length of employment; a NaN there usually just means no recorded history, and the field is still a useful input for the model, so we keep it and map the missing values to 0 during the numeric conversion.
loans = loans.drop("pub_rec_bankruptcies", axis=1)
#loans=loans.drop("emp_length",axis=1)
loans = loans.dropna(axis=0)
print(loans.dtypes.value_counts())
Next we convert the string data to numeric. There are two steps: 1. one-hot encode the purely categorical strings; 2. strip the trailing % from percentage strings and cast the rest to float.
object_columns_df = loans.select_dtypes(include=["object"])
print(object_columns_df.iloc[0])  # inspect the string-typed columns
cols = ['home_ownership', 'verification_status', 'emp_length', 'term', 'addr_state']
for c in cols:
print(loans[c].value_counts())
mapping_dict = {
"emp_length": {
"10+ years": 10,
"9 years": 9,
"8 years": 8,
"7 years": 7,
"6 years": 6,
"5 years": 5,
"4 years": 4,
"3 years": 3,
"2 years": 2,
"1 year": 1,
"< 1 year": 0,
"n/a": 0
}
}  # use a dict to map employment length onto numbers
loans = loans.drop(["last_credit_pull_d", "earliest_cr_line", "addr_state", "title"], axis=1)
loans["int_rate"] = loans["int_rate"].str.rstrip("%").astype("float")
loans["revol_util"] = loans["revol_util"].str.rstrip("%").astype("float")  # strip the % sign, cast to float
loans = loans.replace(mapping_dict)
cat_columns = ["home_ownership", "verification_status", "emp_length", "purpose", "term"]
dummy_df = pd.get_dummies(loans[cat_columns])
loans = pd.concat([loans, dummy_df], axis=1)
loans = loans.drop(cat_columns, axis=1)
loans = loans.drop("pymnt_plan", axis=1)
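pd.get_dummies expands each categorical column into indicator columns named &lt;column&gt;_&lt;value&gt;. A small toy run of the same concat-and-drop pattern (hypothetical values):

```python
import pandas as pd

toy = pd.DataFrame({
    "home_ownership": ["RENT", "OWN", "RENT"],
    "loan_amnt": [5000, 2400, 10000],
})
dummy_df = pd.get_dummies(toy[["home_ownership"]])
toy = pd.concat([toy, dummy_df], axis=1)
toy = toy.drop("home_ownership", axis=1)
print(toy.columns.tolist())
# ['loan_amnt', 'home_ownership_OWN', 'home_ownership_RENT']
```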
After printing purpose and title we find that title is just a finer-grained description of the same loan purpose, so one-hot encoding either one of the two is enough (we kept purpose and dropped title above).
debt_consolidation 18137
credit_card 4970
other 3803
home_improvement 2869
major_purchase 2108
small_business 1771
car 1492
wedding 932
medical 667
moving 557
house 365
vacation 350
educational 312
renewable_energy 95
Name: purpose, dtype: int64
Debt Consolidation 2128
Debt Consolidation Loan 1671
Personal Loan 640
Consolidation 503
debt consolidation 483
Credit Card Consolidation 348
Home Improvement 344
Debt consolidation 323
Small Business Loan 310
Credit Card Loan 302
Personal 296
Consolidation Loan 254
Home Improvement Loan 237
personal loan 224
personal 207
Wedding Loan 207
Loan 206
consolidation 193
Car Loan 193
Other Loan 177
Wedding 151
Credit Card Payoff 149
Credit Card Refinance 140
Major Purchase Loan 136
Consolidate 126
Medical 114
Credit Card 112
home improvement 105
Credit Cards 93
My Loan 92
...
'71 Bobbed Duece 1
Consolidation + Home Improvement 1
Home Improvement/Consolidation 1
Kill My Debt 1
Finishing my debt 1
Q-Disc Loan 1
CC Ref 1
blues in C minor 1
J.L. 1
Steve's Consolidation Loan 1
Paying Off High Interest CC Debt 1
WiseMove4Grandson 1
4x4 truck 1
work 1
Fixing car 1
mybills 1
Pay the piper 1
Loan75 1
ant3300 1
Rolling the high interest into 1 loan 1
Marriage 1
Pay On Time 1
Consolidation Loan for ED 1
louisiana purchase 1
Loan2100 From LendingClub 1
High Credit Score, Never Missed a Payment! 1
Start On The Right Path 1
Eye On the Prize 1
buissness 1
Need capital to fund unique web site 1
Name: title, Length: 19094, dtype: int64
The cleaning is done; save the csv:
loans.to_csv('cleaned_loans2007.csv', index=False)
2.2 Model Building
Here we care about more than plain prediction accuracy; recall also matters. That is, given the predicted labels predictions and the true labels loans["loan_status"], we count the four cells of the confusion matrix:
import pandas as pd
# False positives.
fp_filter = (predictions == 1) & (loans["loan_status"] == 0)
fp = len(predictions[fp_filter])
# True positives.
tp_filter = (predictions == 1) & (loans["loan_status"] == 1)
tp = len(predictions[tp_filter])
# False negatives.
fn_filter = (predictions == 0) & (loans["loan_status"] == 1)
fn = len(predictions[fn_filter])
# True negatives
tn_filter = (predictions == 0) & (loans["loan_status"] == 0)
tn = len(predictions[tn_filter])
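On a hypothetical six-sample toy example, the four filters count the confusion-matrix cells like this:

```python
import pandas as pd

predictions = pd.Series([1, 1, 0, 1, 0, 0])  # model output (toy values)
actual      = pd.Series([1, 0, 0, 1, 1, 0])  # true loan_status (toy values)

tp = ((predictions == 1) & (actual == 1)).sum()  # predicted repay, did repay
fp = ((predictions == 1) & (actual == 0)).sum()  # predicted repay, charged off
fn = ((predictions == 0) & (actual == 1)).sum()  # predicted default, repaid
tn = ((predictions == 0) & (actual == 0)).sum()  # predicted default, charged off
print(tp, fp, fn, tn)  # 2 1 1 2
```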
Here we use the sklearn framework and its linear model module. When applying regression models to real problems, the variables of interest are often not interval variables but ordinal or categorical ones, e.g. binomial outcomes. Consider judging whether a person has diabetes from age, sex, body mass index, mean blood pressure, and a disease index, with Y = 0 for healthy and Y = 1 for ill: the response is a two-point (0-1) variable, so a continuous-valued hypothesis h cannot be used directly to predict Y (which can only take 0 or 1).
In short, linear regression handles problems where the dependent variable is continuous; once the dependent variable is categorical, linear regression no longer applies and we turn to logistic regression. Logistic Regression is the regression method for categorical dependent variables, most commonly binary (binomial) problems, though it also handles multi-class; it is really a classification method. In the binary case, the relationship between the probability and the features follows an S-shaped curve.
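That S-shaped curve is the logistic (sigmoid) function, sigma(z) = 1 / (1 + e^(-z)), which squashes any real-valued score into a probability in (0, 1):

```python
import numpy as np

def sigmoid(z):
    """Logistic function: maps a real-valued score to a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

print(sigmoid(0))               # 0.5 -- the decision boundary
print(sigmoid(6), sigmoid(-6))  # close to 1 and close to 0
```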
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict, KFold  # sklearn.cross_validation was removed in newer sklearn
lr = LogisticRegression()
kf = KFold(n_splits=3, shuffle=True, random_state=1)  # the old KFold(n, ...) signature no longer exists
predictions = cross_val_predict(lr, features, target, cv=kf)
predictions = pd.Series(predictions)
# False positives.
fp_filter = (predictions == 1) & (loans["loan_status"] == 0)
fp = len(predictions[fp_filter])
# True positives.
tp_filter = (predictions == 1) & (loans["loan_status"] == 1)
tp = len(predictions[tp_filter])
# False negatives.
fn_filter = (predictions == 0) & (loans["loan_status"] == 1)
fn = len(predictions[fn_filter])
# True negatives
tn_filter = (predictions == 0) & (loans["loan_status"] == 0)
tn = len(predictions[tn_filter])
# Rates
tpr = tp / float((tp + fn))
fpr = fp / float((fp + tn))
print(tpr)
print(fpr)
print (predictions[:20])
The predictions:
0.9992125268800921
0.9975974866013676
0 1
1 1
2 1
3 1
4 1
5 1
6 1
7 1
8 1
9 1
10 1
11 1
12 1
13 1
14 1
15 1
16 1
17 1
18 1
19 1
dtype: int64
2.3 Model Optimization
There are many ways to optimize a model; here we only do simple parameter tuning. From the printout above, the true positive rate and the false positive rate are almost equally high (both near 1). Since loans predicted positive but charged off cost the bank money, this is not what we want if we aim to maximize profit; we need to drive the false positive rate down.
The method is simple: balance the class weights.
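With class_weight="balanced", sklearn sets each class's weight to n_samples / (n_classes * count(class)), so the rarer class gets the larger weight. A toy computation (hypothetical label counts):

```python
import numpy as np

y = np.array([1] * 6 + [0] * 2)  # imbalanced toy labels: six 1s, two 0s
counts = np.bincount(y)          # [2, 6]
weights = len(y) / (2 * counts)  # n_samples / (n_classes * count)
print(dict(enumerate(weights)))  # {0: 2.0, 1: 0.666...}
```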
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict, KFold
lr = LogisticRegression(class_weight="balanced")  # the key line
kf = KFold(n_splits=3, shuffle=True, random_state=1)
predictions = cross_val_predict(lr, features, target, cv=kf)
predictions = pd.Series(predictions)
# False positives.
fp_filter = (predictions == 1) & (loans["loan_status"] == 0)
fp = len(predictions[fp_filter])
# True positives.
tp_filter = (predictions == 1) & (loans["loan_status"] == 1)
tp = len(predictions[tp_filter])
# False negatives.
fn_filter = (predictions == 0) & (loans["loan_status"] == 1)
fn = len(predictions[fn_filter])
# True negatives
tn_filter = (predictions == 0) & (loans["loan_status"] == 0)
tn = len(predictions[tn_filter])
# Rates
tpr = tp / float((tp + fn))
fpr = fp / float((fp + tn))
print(tpr)
print(fpr)
print(predictions[:20])
The predictions:
0.6686555410848957
0.39641471077434853
0 1
1 0
2 0
3 1
4 1
5 0
6 0
7 0
8 0
9 0
10 1
11 0
12 1
13 1
14 0
15 0
16 1
17 1
18 1
19 0
dtype: int64
Above, the framework balanced the weights on its own; we can also set the weights ourselves:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict, KFold
penalty = {
    0: 4,
    1: 1
}  # penalize mistakes on class 0 four times as heavily
lr = LogisticRegression(class_weight=penalty)
kf = KFold(n_splits=3, shuffle=True, random_state=1)
predictions = cross_val_predict(lr, features, target, cv=kf)
predictions = pd.Series(predictions)
# False positives.
fp_filter = (predictions == 1) & (loans["loan_status"] == 0)
fp = len(predictions[fp_filter])
# True positives.
tp_filter = (predictions == 1) & (loans["loan_status"] == 1)
tp = len(predictions[tp_filter])
# False negatives.
fn_filter = (predictions == 0) & (loans["loan_status"] == 1)
fn = len(predictions[fn_filter])
# True negatives
tn_filter = (predictions == 0) & (loans["loan_status"] == 0)
tn = len(predictions[tn_filter])
# Rates
tpr = tp / float((tp + fn))
fpr = fp / float((fp + tn))
print(tpr)
print(fpr)
The predictions:
0.8168519247660296
0.6451672518942894
3. Extensions
3.1 Python and its third-party libraries in data analysis
Speaking of the libraries I use most often:
numpy and pandas are the standard libraries for matrix computation and are dependencies of many machine learning and deep learning frameworks. They are routinely used for data cleaning: whether the data comes from a crawler or any other source, it is bound to contain irrelevant fields and NaN values, which is where feature selection and feature extraction come in.
Once the data is ready, I use sklearn for machine learning. It is the go-to ML framework, covering the common linear and logistic regression models as well as standard tuning methods such as hyperparameter grid search.
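As an example of that grid search, a minimal GridSearchCV sketch (synthetic data standing in for the loan features; in the project this would be features and target):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# synthetic stand-in data for illustration only
X, y = make_classification(n_samples=200, n_features=10, random_state=1)

param_grid = {"C": [0.01, 0.1, 1.0, 10.0]}  # inverse regularization strength
search = GridSearchCV(LogisticRegression(max_iter=1000), param_grid, cv=5)
search.fit(X, y)
print(search.best_params_)
```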
For deep learning I use tensorflow, the deep learning framework developed by Google, which builds on top of numpy.
3.2 What I learned and suggestions for improvement
Takeaways: I picked up Python syntax fundamentals I had never paid attention to before, and learned object-oriented GUI development, which I found genuinely fun; that led me to teach myself the pyqt5 framework (it has been shelved for a while because of my engineering practicum, but I plan to dig into it over the break). I also learned the basic use of numpy and pandas.
Suggestions: the basic-syntax lectures could be compressed to leave more class time for code walkthroughs.