Principal Component Analysis
%matplotlib inline
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import os
pd.set_option('display.max_columns', None)
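A financial services company wants to understand the creditworthiness of its loan customers and assign each one a credit rating. It uses the 5C method, a common approach in credit rating, to indicate how likely a customer is to default.
Field | Meaning |
---|---|
Character | The customer's reputation |
Capacity | The customer's ability to repay |
Capital | The customer's financial strength and financial condition |
Collateral | The extent to which the requested loan is covered by guarantees |
Condition | The effect of the external economic and policy environment on the customer |
• Each individual item is a score assigned by experts.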
loan = pd.read_csv("Loan_aply.csv")
loan.head()
| | ID | X1 | X2 | X3 | X4 | X5 |
---|---|---|---|---|---|---|
0 | 1 | 76.5 | 81.5 | 76.0 | 75.8 | 71.7 |
1 | 2 | 70.6 | 73.0 | 67.6 | 68.1 | 78.5 |
2 | 3 | 90.7 | 87.3 | 91.0 | 81.5 | 80.0 |
3 | 4 | 77.5 | 73.6 | 70.9 | 69.8 | 74.8 |
4 | 5 | 85.6 | 68.5 | 70.0 | 62.2 | 76.5 |
plt.figure(figsize=(2, 2))
plt.scatter(loan['X1'], loan['X2'])
plt.title('Scatter')
plt.show()
[Figure: scatter plot of X1 against X2 (output_5_0.png)]
import seaborn as sns
sns.pairplot(loan.loc[:, 'X1':])
plt.show()
[Figure: pairwise scatter-plot matrix of X1 through X5 (output_6_0.png)]
Compute the correlation coefficient matrix
loan.loc[:, 'X1':].corr(method='pearson')
| | X1 | X2 | X3 | X4 | X5 |
---|---|---|---|---|---|
X1 | 1.000000 | 0.726655 | 0.825342 | 0.676314 | 0.685563 |
X2 | 0.726655 | 1.000000 | 0.929080 | 0.938382 | 0.841413 |
X3 | 0.825342 | 0.929080 | 1.000000 | 0.883457 | 0.733482 |
X4 | 0.676314 | 0.938382 | 0.883457 | 1.000000 | 0.762563 |
X5 | 0.685563 | 0.841413 | 0.733482 | 0.762563 | 1.000000 |
A first look at the proportion of variance explained by each principal component
from sklearn.decomposition import PCA
pca = PCA()
pca.fit(loan.loc[ :,'X1':])
print(pca.explained_variance_ratio_ )
[0.84585431 0.08914623 0.04259067 0.01663007 0.00577872]
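The ratios above are easier to read as a running total. The following is a minimal sketch that copies the printed values into an array and accumulates them; the 0.80 threshold is an illustrative assumption, not a cut-off taken from the original analysis.

```python
import numpy as np

# Explained-variance ratios copied from the output printed above.
ratios = np.array([0.84585431, 0.08914623, 0.04259067, 0.01663007, 0.00577872])

# Running total: share of the total variance kept by the first k components.
cumulative = np.cumsum(ratios)

# Smallest k whose cumulative share reaches the chosen threshold.
# NOTE: 0.80 is a hypothetical cut-off used only for illustration.
threshold = 0.80
k = int(np.argmax(cumulative >= threshold)) + 1

for i, (r, c) in enumerate(zip(ratios, cumulative), start=1):
    print(f"PC{i}: ratio={r:.4f}  cumulative={c:.4f}")
print("components needed:", k)
```

With these numbers the first component alone already carries about 84.6% of the variance, which is consistent with keeping a single component for the score later on.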
print(pca.components_)
[[ 0.46881402 0.48487556 0.47274449 0.46174663 0.32925948]
[ 0.83061232 -0.32991571 0.02117417 -0.43090441 -0.12293025]
[ 0.0214065 0.0148012 -0.4127194 -0.24084475 0.87805421]
[ 0.25465387 -0.28771993 -0.58858207 0.70628304 -0.0842856 ]
[ 0.15808149 0.75700032 -0.50921327 -0.2104032 -0.31367674]]
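Each row of `pca.components_` is the loading vector of one principal component. As a sanity check, the sketch below copies the printed matrix and verifies two properties one would expect from PCA: the rows are orthonormal, and the first row has the same sign on every variable. The variable names `W`, `gram`, `is_orthonormal` and `first_all_positive` are illustrative, not part of the original notebook.

```python
import numpy as np

# Loadings copied from the pca.components_ output printed above
# (one row per principal component, one column per variable X1..X5).
W = np.array([
    [0.46881402, 0.48487556, 0.47274449, 0.46174663, 0.32925948],
    [0.83061232, -0.32991571, 0.02117417, -0.43090441, -0.12293025],
    [0.0214065, 0.0148012, -0.4127194, -0.24084475, 0.87805421],
    [0.25465387, -0.28771993, -0.58858207, 0.70628304, -0.0842856],
    [0.15808149, 0.75700032, -0.50921327, -0.2104032, -0.31367674],
])

# Orthonormal rows mean W @ W.T is the identity, up to print rounding.
gram = W @ W.T
is_orthonormal = bool(np.allclose(gram, np.eye(5), atol=1e-5))
print("rows orthonormal:", is_orthonormal)

# Check whether the first component loads positively on all five variables.
first_all_positive = bool(np.all(W[0] > 0))
print("first component all positive:", first_all_positive)
```

Because the first component weights all five ratings in the same direction, projecting onto it behaves roughly like an overall rating, which is one plausible reading of why it is used as the score.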
pca1 = PCA(n_components=1, whiten=True)
pca1.fit(loan.loc[:, 'X1':])
PCA(copy=True, iterated_power='auto', n_components=1, random_state=None,
svd_solver='auto', tol=0.0, whiten=True)
Join the scores with the original data
score = pd.DataFrame(pca1.transform(loan.loc[:, 'X1':]),
                     columns=['score'])
loan.join(score).sort_values(by='score', ascending=False)
| | ID | X1 | X2 | X3 | X4 | X5 | score |
---|---|---|---|---|---|---|---|
6 | 7 | 94.0 | 94.0 | 87.5 | 89.5 | 92.0 | 1.770770 |
2 | 3 | 90.7 | 87.3 | 91.0 | 81.5 | 80.0 | 1.238404 |
5 | 6 | 85.0 | 79.2 | 80.3 | 84.4 | 76.5 | 0.672219 |
0 | 1 | 76.5 | 81.5 | 76.0 | 75.8 | 71.7 | 0.156252 |
3 | 4 | 77.5 | 73.6 | 70.9 | 69.8 | 74.8 | -0.215028 |
4 | 5 | 85.6 | 68.5 | 70.0 | 62.2 | 76.5 | -0.316231 |
1 | 2 | 70.6 | 73.0 | 67.6 | 68.1 | 78.5 | -0.444657 |
7 | 8 | 84.6 | 66.9 | 68.8 | 64.8 | 66.4 | -0.510540 |
9 | 10 | 70.0 | 69.2 | 71.7 | 64.9 | 68.9 | -0.682753 |
8 | 9 | 57.7 | 60.4 | 57.4 | 60.8 | 65.0 | -1.668435 |
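To make the `score` column less of a black box, here is a minimal NumPy-only sketch of what a whitened one-component projection amounts to: center the data, project onto the first right singular vector, and divide by that component's standard deviation. It assumes the ten rows shown in the table are the complete dataset and that the variance uses an n - 1 denominator; the sign flip is an illustrative convention, since the sign of a principal component is arbitrary.

```python
import numpy as np

# X1..X5 for customers with ID 1..10, copied from the table above
# (assumed to be the complete dataset).
X = np.array([
    [76.5, 81.5, 76.0, 75.8, 71.7],
    [70.6, 73.0, 67.6, 68.1, 78.5],
    [90.7, 87.3, 91.0, 81.5, 80.0],
    [77.5, 73.6, 70.9, 69.8, 74.8],
    [85.6, 68.5, 70.0, 62.2, 76.5],
    [85.0, 79.2, 80.3, 84.4, 76.5],
    [94.0, 94.0, 87.5, 89.5, 92.0],
    [84.6, 66.9, 68.8, 64.8, 66.4],
    [57.7, 60.4, 57.4, 60.8, 65.0],
    [70.0, 69.2, 71.7, 64.9, 68.9],
])

# Step 1: center each column.
Xc = X - X.mean(axis=0)

# Step 2: the first right singular vector is the first principal direction.
_, S, Vt = np.linalg.svd(Xc, full_matrices=False)
w = Vt[0]
if w.sum() < 0:
    # Orient the direction so that higher ratings give a higher score.
    w = -w

# Step 3: standard deviation of the first component (n - 1 denominator).
std1 = S[0] / np.sqrt(X.shape[0] - 1)

# Step 4: whitened score = projection divided by that standard deviation.
scores = Xc @ w / std1
print(np.round(scores, 6))
```

Under these assumptions the scores have unit sample variance, and the ranking (customer 7 highest, customer 9 lowest) should agree with the sorted table above.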