提出问题
买房时哪些因素最终影响了你的决定?
理解数据
本次竞赛的数据集中,包含了大量的房源相关的字段信息,将字段信息分为离散型和连续型,并将离散型数据做处理。
1)导入数据
# 导入需要的模块
import pandas as pd
import numpy as np
train = pd.read_csv('C:\\Users\\1\\Desktop\\house price prediction\\train.csv')
test = pd.read_csv('C:\\Users\\1\\Desktop\\house price prediction\\test.csv')
full = train.append(test,ignore_index=True)
2)查看数据信息
①查看数据类型
full.info()
数据值较多,此处仅展示部分
②查看缺失值
aa = full.isnull().sum()
aa[aa>0].sort_values(ascending=False)
数据集中缺失数据量较高,"PoolQC"至"LotFrontage"字段的数据缺失率甚至超过15%,一般情况下对于大量数据缺失的情况,会直接将该字段进行删除,此次为练习数据预处理故一并进行处理;
数据清洗
1)数据预处理
cols=["MasVnrArea", "BsmtUnfSF", "TotalBsmtSF", "GarageCars", "BsmtFinSF2", "BsmtFinSF1", "GarageArea"]
for col in cols:
full[col].fillna(0, inplace=True)
cols1 = ["PoolQC" , "MiscFeature", "Alley", "Fence", "FireplaceQu", "GarageQual", "GarageCond", "GarageFinish", "GarageYrBlt", "GarageType", "BsmtExposure", "BsmtCond", "BsmtQual", "BsmtFinType2", "BsmtFinType1", "MasVnrType"]
for col in cols1:
full[col].fillna("None", inplace=True)
cols2 = ["MSZoning", "BsmtFullBath", "BsmtHalfBath", "Utilities", "Functional", "Electrical", "KitchenQual", "SaleType","Exterior1st", "Exterior2nd"]
for col in cols2:
full[col].fillna(full[col].mode()[0], inplace=True)
full['LotFrontage']=full['LotFrontage'].transform(lambda x: x.fillna(x.median()))
#房价缺失值处理:按房屋等级将房价进行切片,并填充中位数
full['SalePrice']=full.groupby(['MSSubClass'])['SalePrice'].transform(lambda x: x.fillna(x.median()))
full['SalePrice']=full['SalePrice'].fillna('0')
full['SalePrice']=full['SalePrice'].astype('int')
2)特征工程
①将离散型数据进行one-hot处理:
BldgTypeDf = pd.get_dummies( full['BldgType'] , prefix='BldgType' )
full = pd.concat([full,BldgTypeDf],axis=1)
full.drop('BldgType',axis=1,inplace=True)
BsmtCondDf = pd.get_dummies( full['BsmtCond'] , prefix='BsmtCond' )
full = pd.concat([full,BsmtCondDf],axis=1)
full.drop('BsmtCond',axis=1,inplace=True)
BsmtExposureDf = pd.get_dummies( full['BsmtExposure'] , prefix='BsmtExposure' )
full = pd.concat([full,BsmtExposureDf],axis=1)
full.drop('BsmtExposure',axis=1,inplace=True)
BsmtFinType1Df = pd.get_dummies( full['BsmtFinType1'] , prefix='BsmtFinType1' )
full = pd.concat([full,BsmtFinType1Df],axis=1)
full.drop('BsmtFinType1',axis=1,inplace=True)
BsmtFinType2Df = pd.get_dummies( full['BsmtFinType2'] , prefix='BsmtFinType2' )
full = pd.concat([full,BsmtFinType2Df],axis=1)
full.drop('BsmtFinType2',axis=1,inplace=True)
②特征选择
查看各特征的相关系数,并排序:
corrDf = full.corr()
corrDf['SalePrice'].sort_values(ascending=False)
正相关:
负相关:
选择以上截图中红框内的特征重新组合:
full_X=pd.concat([full['OverallQual'],
full['GrLivArea'],
full['GarageCars'],
full['YearBuilt'],
full['GarageArea'],
full['FullBath'],
full['YearRemodAdd'],
full['TotRmsAbvGrd'],
full['TotalBsmtSF'],
full['1stFlrSF'],
MSZoningDf,
BsmtExposureDf],axis=1)
构建模型
1)建立训练数据集和测试数据集
sourceRow=1460
data_X=full_X.loc[0:sourceRow-1,:]
data_y=full.loc[0:sourceRow-1,'SalePrice']
pred_X=full_X.loc[sourceRow:,:]
2)选择算法
房价预测属于回归算法的范畴,故选择最常用的逻辑回归进行计算(其优点是无参数):
#从原始数据集(source)中拆分出训练数据集(用于模型训练train),测试数据集(用于模型评估test)
from sklearn.linear_model import LinearRegression
regressor=LinearRegression()
#训练模型
regressor.fit(data_X,data_y)
3)方案评估
model.score(data_X,data_y)
R²值为0.795,模型效果一般,后期再对特征进行优化,并尝试其他可调参的模型进行计算,以求得到更好的拟合结果。