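Finally starting a new series~
Linear regression fits the data to the form $y = wx + b$.
Training the model yields the parameters $w$ and $b$,
so that for a new value of x we can predict y.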
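As a minimal sketch of what "fitting" means here, the snippet below recovers the slope and intercept of a noiseless toy line with ordinary least squares. The data and the names `x`, `w`, `b`, and `x_new` are made up for illustration and have nothing to do with the Melbourne dataset.

```python
import numpy as np

# Toy data generated from y = 3x + 2 (illustrative values, not real prices)
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 3.0 * x + 2.0

# Design matrix with a column of ones so the intercept b is learned too
A = np.column_stack([x, np.ones_like(x)])

# Ordinary least squares: minimise ||A @ [w, b] - y||^2
(w, b), *_ = np.linalg.lstsq(A, y, rcond=None)

# Predict y for a new x value using the fitted parameters
x_new = 5.0
y_new = w * x_new + b
print(w, b, y_new)
```

With several input variables the idea is the same: `w` becomes one weight per column, which is what `LinearRegression` estimates later in this post.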
Let's get started. This time the goal is to predict house prices in Melbourne~
import pandas as pd
import seaborn as sns
import numpy as np
import matplotlib.pyplot as plt
# Linear regression
from sklearn.linear_model import LinearRegression
# Train/test splitting
from sklearn.model_selection import train_test_split
from datetime import date
1. Dataset description
Melbourne Housing Market
Some Key Details
Suburb: Suburb
Address: Address
Rooms: Number of rooms
Price: Price in dollars
Method: S - property sold; SP - property sold prior; PI - property passed in; PN - sold prior not disclosed; SN - sold not disclosed; NB - no bid; VB - vendor bid; W - withdrawn prior to auction; SA - sold after auction; SS - sold after auction price not disclosed. N/A - price or highest bid not available.
Type: br - bedroom(s); h - house,cottage,villa, semi,terrace; u - unit, duplex; t - townhouse; dev site - development site; o res - other residential.
SellerG: Real Estate Agent
Date: Date sold
Distance: Distance from CBD
Regionname: General Region (West, North West, North, North east …etc)
Propertycount: Number of properties that exist in the suburb.
Bedroom2 : Scraped # of Bedrooms (from different source)
Bathroom: Number of Bathrooms
Car: Number of carspots
Landsize: Land Size
BuildingArea: Building Size
YearBuilt: Year the house was built
CouncilArea: Governing council for the area
Lattitude: Self-explanatory
Longtitude: Self-explanatory
import os
print(os.listdir('datasets'))
['BrentOilPrices.csv', '.DS_Store', 'Iris', 'Lending club loan data', 'Adult', 'Melbourne_housing_extra_data.csv']
2. A first look at the data
org_data = pd.read_csv('datasets/Melbourne_housing_extra_data.csv')
org_data.head(10)
| | Suburb | Address | Rooms | Type | Price | Method | SellerG | Date | Distance | Postcode | ... | Bathroom | Car | Landsize | BuildingArea | YearBuilt | CouncilArea | Lattitude | Longtitude | Regionname | Propertycount |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Abbotsford | 68 Studley St | 2 | h | NaN | SS | Jellis | 3/09/2016 | 2.5 | 3067.0 | ... | 1.0 | 1.0 | 126.0 | NaN | NaN | Yarra | -37.8014 | 144.9958 | Northern Metropolitan | 4019.0 |
1 | Abbotsford | 85 Turner St | 2 | h | 1480000.0 | S | Biggin | 3/12/2016 | 2.5 | 3067.0 | ... | 1.0 | 1.0 | 202.0 | NaN | NaN | Yarra | -37.7996 | 144.9984 | Northern Metropolitan | 4019.0 |
2 | Abbotsford | 25 Bloomburg St | 2 | h | 1035000.0 | S | Biggin | 4/02/2016 | 2.5 | 3067.0 | ... | 1.0 | 0.0 | 156.0 | 79.0 | 1900.0 | Yarra | -37.8079 | 144.9934 | Northern Metropolitan | 4019.0 |
3 | Abbotsford | 18/659 Victoria St | 3 | u | NaN | VB | Rounds | 4/02/2016 | 2.5 | 3067.0 | ... | 2.0 | 1.0 | 0.0 | NaN | NaN | Yarra | -37.8114 | 145.0116 | Northern Metropolitan | 4019.0 |
4 | Abbotsford | 5 Charles St | 3 | h | 1465000.0 | SP | Biggin | 4/03/2017 | 2.5 | 3067.0 | ... | 2.0 | 0.0 | 134.0 | 150.0 | 1900.0 | Yarra | -37.8093 | 144.9944 | Northern Metropolitan | 4019.0 |
5 | Abbotsford | 40 Federation La | 3 | h | 850000.0 | PI | Biggin | 4/03/2017 | 2.5 | 3067.0 | ... | 2.0 | 1.0 | 94.0 | NaN | NaN | Yarra | -37.7969 | 144.9969 | Northern Metropolitan | 4019.0 |
6 | Abbotsford | 55a Park St | 4 | h | 1600000.0 | VB | Nelson | 4/06/2016 | 2.5 | 3067.0 | ... | 1.0 | 2.0 | 120.0 | 142.0 | 2014.0 | Yarra | -37.8072 | 144.9941 | Northern Metropolitan | 4019.0 |
7 | Abbotsford | 16 Maugie St | 4 | h | NaN | SN | Nelson | 6/08/2016 | 2.5 | 3067.0 | ... | 2.0 | 2.0 | 400.0 | 220.0 | 2006.0 | Yarra | -37.7965 | 144.9965 | Northern Metropolitan | 4019.0 |
8 | Abbotsford | 53 Turner St | 2 | h | NaN | S | Biggin | 6/08/2016 | 2.5 | 3067.0 | ... | 1.0 | 2.0 | 201.0 | NaN | 1900.0 | Yarra | -37.7995 | 144.9974 | Northern Metropolitan | 4019.0 |
9 | Abbotsford | 99 Turner St | 2 | h | NaN | S | Collins | 6/08/2016 | 2.5 | 3067.0 | ... | 2.0 | 1.0 | 202.0 | NaN | 1900.0 | Yarra | -37.7996 | 144.9989 | Northern Metropolitan | 4019.0 |
10 rows × 21 columns
# Check the variable types
org_data.dtypes
Suburb object
Address object
Rooms int64
Type object
Price float64
Method object
SellerG object
Date object
Distance float64
Postcode float64
Bedroom2 float64
Bathroom float64
Car float64
Landsize float64
BuildingArea float64
YearBuilt float64
CouncilArea object
Lattitude float64
Longtitude float64
Regionname object
Propertycount float64
dtype: object
3. To keep the model simple, we keep only properties whose Type is h (house) and select the variables Rooms, Date, Distance, Landsize, Bedroom2, Bathroom, and YearBuilt, together with the target Price
dataframe = org_data[org_data["Type"]=='h'].loc[:,["Rooms","Date","Distance","Landsize","Bedroom2","Bathroom","YearBuilt","Price"]]
dataframe.head()
| | Rooms | Date | Distance | Landsize | Bedroom2 | Bathroom | YearBuilt | Price |
---|---|---|---|---|---|---|---|---|
0 | 2 | 3/09/2016 | 2.5 | 126.0 | 2.0 | 1.0 | NaN | NaN |
1 | 2 | 3/12/2016 | 2.5 | 202.0 | 2.0 | 1.0 | NaN | 1480000.0 |
2 | 2 | 4/02/2016 | 2.5 | 156.0 | 2.0 | 1.0 | 1900.0 | 1035000.0 |
4 | 3 | 4/03/2017 | 2.5 | 134.0 | 3.0 | 2.0 | 1900.0 | 1465000.0 |
5 | 3 | 4/03/2017 | 2.5 | 94.0 | 3.0 | 2.0 | NaN | 850000.0 |
dataframe.shape
(12992, 8)
4. Drop the rows where Price is null
dataframe = dataframe.dropna(subset=['Price'])
dataframe.head()
| | Rooms | Date | Distance | Landsize | Bedroom2 | Bathroom | YearBuilt | Price |
---|---|---|---|---|---|---|---|---|
1 | 2 | 3/12/2016 | 2.5 | 202.0 | 2.0 | 1.0 | NaN | 1480000.0 |
2 | 2 | 4/02/2016 | 2.5 | 156.0 | 2.0 | 1.0 | 1900.0 | 1035000.0 |
4 | 3 | 4/03/2017 | 2.5 | 134.0 | 3.0 | 2.0 | 1900.0 | 1465000.0 |
5 | 3 | 4/03/2017 | 2.5 | 94.0 | 3.0 | 2.0 | NaN | 850000.0 |
6 | 4 | 4/06/2016 | 2.5 | 120.0 | 3.0 | 1.0 | 2014.0 | 1600000.0 |
dataframe.shape
(9944, 8)
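The row count falls from 12992 to 9944, so 3048 houses had no recorded price. A small sketch of what `dropna(subset=...)` does, on a made-up frame (the column names mirror the post, the values are illustrative):

```python
import numpy as np
import pandas as pd

# Made-up rows: Price is missing in row 0, YearBuilt is missing in rows 0 and 2
df = pd.DataFrame({
    "Rooms": [2, 2, 3],
    "YearBuilt": [np.nan, 1900.0, np.nan],
    "Price": [np.nan, 1035000.0, 850000.0],
})

# Only rows whose Price is missing are removed; NaN in other columns is kept
cleaned = df.dropna(subset=["Price"])
print(cleaned)
```

This is why YearBuilt still contains NaN after this step: the remaining gaps are dealt with separately further down.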
# Count the missing values
dataframe.isnull().describe()
| | Rooms | Date | Distance | Landsize | Bedroom2 | Bathroom | YearBuilt | Price |
---|---|---|---|---|---|---|---|---|
count | 9944 | 9944 | 9944 | 9944 | 9944 | 9944 | 9944 | 9944 |
unique | 1 | 1 | 2 | 2 | 2 | 2 | 2 | 1 |
top | False | False | False | False | False | False | True | False |
freq | 9944 | 9944 | 9939 | 7780 | 7978 | 7978 | 5352 | 9944 |
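Reading this table takes a moment: `top` is the most common value of the null mask and `freq` is how often it occurs, so Landsize has 9944 - 7780 = 2164 missing values, while YearBuilt (where `top` is True) has 5352. A more direct way to get the same information is sketched below on a made-up frame; `missing_count` and `missing_share` are names chosen for this example.

```python
import numpy as np
import pandas as pd

# Small made-up frame standing in for the housing data
df = pd.DataFrame({
    "Rooms": [2, 3, 4, 3],
    "Landsize": [202.0, np.nan, 120.0, np.nan],
    "YearBuilt": [np.nan, 1900.0, np.nan, np.nan],
})

missing_count = df.isnull().sum()    # number of nulls per column
missing_share = df.isnull().mean()   # fraction of nulls per column
print(pd.DataFrame({"missing": missing_count, "share": missing_share}))
```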
5. Convert Date to the number of days since the earliest date
# The dates are written day/month/year, hence dayfirst=True
dataframe["Date"] = pd.to_datetime(dataframe["Date"], dayfirst=True)
# Vectorised difference from the earliest sale date, in whole days
dataframe["Days"] = (dataframe["Date"] - dataframe["Date"].min()).dt.days
dataframe = dataframe.drop(["Date"], axis=1)
dataframe.head()
| | Rooms | Distance | Landsize | Bedroom2 | Bathroom | YearBuilt | Price | Days |
---|---|---|---|---|---|---|---|---|
1 | 2 | 2.5 | 202.0 | 2.0 | 1.0 | NaN | 1480000.0 | 310 |
2 | 2 | 2.5 | 156.0 | 2.0 | 1.0 | 1900.0 | 1035000.0 | 7 |
4 | 3 | 2.5 | 134.0 | 3.0 | 2.0 | 1900.0 | 1465000.0 | 401 |
5 | 3 | 2.5 | 94.0 | 3.0 | 2.0 | NaN | 850000.0 | 401 |
6 | 4 | 2.5 | 120.0 | 3.0 | 1.0 | 2014.0 | 1600000.0 | 128 |
6. Convert YearBuilt to the number of years between the build year and the current year
# 2019 is hard-coded as the "current" year
dataframe["YearBuilt"] = 2019 - dataframe["YearBuilt"]
dataframe.head()
| | Rooms | Distance | Landsize | Bedroom2 | Bathroom | YearBuilt | Price | Days |
---|---|---|---|---|---|---|---|---|
1 | 2 | 2.5 | 202.0 | 2.0 | 1.0 | NaN | 1480000.0 | 310 |
2 | 2 | 2.5 | 156.0 | 2.0 | 1.0 | 119.0 | 1035000.0 | 7 |
4 | 3 | 2.5 | 134.0 | 3.0 | 2.0 | 119.0 | 1465000.0 | 401 |
5 | 3 | 2.5 | 94.0 | 3.0 | 2.0 | NaN | 850000.0 | 401 |
6 | 4 | 2.5 | 120.0 | 3.0 | 1.0 | 5.0 | 1600000.0 | 128 |
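Steps 5 and 6 can be checked on a handful of rows. The sketch below uses three made-up rows in the same day/month/year format; `REFERENCE_YEAR` and the `Age` column are names introduced only for this example.

```python
import pandas as pd

# Made-up rows in the same day/month/year format as the Date column
df = pd.DataFrame({
    "Date": ["3/12/2016", "4/02/2016", "4/03/2017"],
    "YearBuilt": [float("nan"), 1900.0, 2014.0],
})

# dayfirst=True reads "3/12/2016" as 3 December 2016, not 12 March 2016
df["Date"] = pd.to_datetime(df["Date"], dayfirst=True)

# Difference from the earliest date in this toy frame, in whole days
df["Days"] = (df["Date"] - df["Date"].min()).dt.days

# Age relative to a fixed reference year; a missing YearBuilt stays NaN
REFERENCE_YEAR = 2019  # assumption: matches the year hard-coded in the post
df["Age"] = REFERENCE_YEAR - df["YearBuilt"]
print(df)
```

Here the earliest date is 4 February 2016, so the Days values are 7 smaller than in the post, where the earliest sale in the full dataset is a week earlier.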
7. Look at the distribution of the non-null values of each variable
sns.kdeplot(dataframe["Price"])
<matplotlib.axes._subplots.AxesSubplot at 0x1a188c06d8>
sns.kdeplot(dataframe["Distance"].dropna())
<matplotlib.axes._subplots.AxesSubplot at 0x1a189709b0>
sns.kdeplot(dataframe["Landsize"].dropna())
<matplotlib.axes._subplots.AxesSubplot at 0x1a18a85710>
# Check the outliers
dataframe[dataframe["Landsize"]>70000]
| | Rooms | Distance | Landsize | Bedroom2 | Bathroom | YearBuilt | Price | Days |
---|---|---|---|---|---|---|---|---|
1198 | 3 | 9.2 | 75100.0 | 3.0 | 1.0 | NaN | 2000000.0 | 213 |
17293 | 3 | 34.6 | 76000.0 | 3.0 | 2.0 | NaN | 1085000.0 | 485 |
sns.kdeplot(dataframe["Days"].dropna())
<matplotlib.axes._subplots.AxesSubplot at 0x1a18c4b390>
sns.kdeplot(dataframe["YearBuilt"].dropna())
<matplotlib.axes._subplots.AxesSubplot at 0x1a18bfbe10>
YearBuilt has too many missing values (5352 of the 9944 rows) and poor data quality, so we drop this column
8. Handle missing values
Distance = dataframe["Distance"]
Distance.fillna(Distance.mean(),inplace=True)
Distance.isnull().describe()
Bedroom2 = dataframe["Bedroom2"]
Bedroom2.fillna(Bedroom2.mean(), inplace=True)
Bedroom2.isnull().describe()
Bathroom = dataframe["Bathroom"]
Bathroom.fillna(Bathroom.mean(), inplace=True)
Bathroom.isnull().describe()
Landsize = dataframe["Landsize"]
Landsize.fillna(Landsize.mean(), inplace=True)
Landsize.isnull().describe()
dataframe = dataframe.drop(["Distance","Landsize","Bedroom2","Bathroom","YearBuilt"], axis=1)
dataframe = pd.concat([dataframe,Distance,Landsize,Bedroom2,Bathroom],axis=1)
dataframe.head()
| | Rooms | Price | Days | Distance | Landsize | Bedroom2 | Bathroom |
---|---|---|---|---|---|---|---|
1 | 2 | 1480000.0 | 310 | 2.5 | 202.0 | 2.0 | 1.0 |
2 | 2 | 1035000.0 | 7 | 2.5 | 156.0 | 2.0 | 1.0 |
4 | 3 | 1465000.0 | 401 | 2.5 | 134.0 | 3.0 | 2.0 |
5 | 3 | 850000.0 | 401 | 2.5 | 94.0 | 3.0 | 2.0 |
6 | 4 | 1600000.0 | 128 | 2.5 | 120.0 | 3.0 | 1.0 |
dataframe.isnull().describe()
| | Rooms | Price | Days | Distance | Landsize | Bedroom2 | Bathroom |
---|---|---|---|---|---|---|---|
count | 9944 | 9944 | 9944 | 9944 | 9944 | 9944 | 9944 |
unique | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
top | False | False | False | False | False | False | False |
freq | 9944 | 9944 | 9944 | 9944 | 9944 | 9944 | 9944 |
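The same mean imputation can be written more compactly. This is a sketch on made-up data (the column names mirror the post, the values do not), and `fill_cols` is a name chosen for the example.

```python
import numpy as np
import pandas as pd

# Made-up frame with gaps in two numeric columns
df = pd.DataFrame({
    "Distance": [2.5, np.nan, 7.5],
    "Bathroom": [1.0, 2.0, np.nan],
    "Price": [850000.0, 1035000.0, 1480000.0],
})

fill_cols = ["Distance", "Bathroom"]

# df[fill_cols].mean() is a Series indexed by column name, so fillna
# fills each column with its own mean
df[fill_cols] = df[fill_cols].fillna(df[fill_cols].mean())
print(df)
```

One thing to keep in mind: computing the means on the full table, before the train/test split, lets test rows influence the fill values. For a first project that is probably acceptable, but computing them on the training rows only would be the stricter choice.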
9. Draw a scatter-plot matrix to see the relationships between variables
sns.pairplot(dataframe)
<seaborn.axisgrid.PairGrid at 0x1a18ab1128>
10. Draw a heatmap to see the correlations between variables
fig, ax = plt.subplots(figsize=(15,15))
sns.heatmap(dataframe.corr(), annot=True)
<matplotlib.axes._subplots.AxesSubplot at 0x1a1a8d06d8>
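Days and Landsize show little correlation with Price, so they are candidates for removal. Note, however, that the code below only removes Price from X, so both columns are still used by the model (they appear in the coefficient table in step 13).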
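To put numbers on a judgment like this rather than reading it off the heatmap, the correlations with the target can be ranked directly. A minimal sketch on made-up data; `corr_with_price` is a name introduced for the example.

```python
import pandas as pd

# Made-up values; only the column names mirror the post
df = pd.DataFrame({
    "Rooms": [2, 3, 4, 5, 3],
    "Distance": [2.5, 5.0, 7.5, 10.0, 12.5],
    "Price": [800000.0, 1000000.0, 1200000.0, 1500000.0, 900000.0],
})

# Pearson correlation of every feature with the target,
# ordered by absolute strength (strongest first)
corr_with_price = (
    df.corr()["Price"]
    .drop("Price")
    .sort_values(key=abs, ascending=False)
)
print(corr_with_price)
```

Pearson correlation only measures linear association, so a low value does not prove a variable is useless, only that a straight line through it explains little.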
11. Split into training and test sets
X=dataframe.drop(["Price"], axis=1)
y=dataframe["Price"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4)
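Because no `random_state` is passed, the split is different on every run, and so are the coefficients and error metrics further down. A sketch of a reproducible split on toy arrays follows; the seed 42 is an arbitrary choice for the example, not something the original post used.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 10 samples with 2 features each
X = np.arange(20).reshape(10, 2)
y = np.arange(10)

# random_state fixes the shuffle, so the same rows land in each set every run
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, random_state=42
)
print(X_train.shape, X_test.shape)
```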
12. Train the linear regression model
lm = LinearRegression()
lm.fit(X_train, y_train)
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)
13. Inspect the fitted parameters
coeff_df = pd.DataFrame(lm.coef_, X.columns, columns=['Coefficient'])
# Sort from the largest positive coefficient to the most negative
ranked_coeffs = coeff_df.sort_values("Coefficient", ascending=False)
ranked_coeffs
| | Coefficient |
---|---|
Bathroom | 287227.794348 |
Rooms | 256960.542449 |
Days | 208.567181 |
Landsize | 28.455846 |
Bedroom2 | -40640.077882 |
Distance | -48979.633196 |
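Each coefficient is the change in predicted Price for a one-unit increase in that variable with the others held fixed: one more unit of Distance lowers the prediction by about 49,000, one more day adds about 209. Since the variables are on different scales, the magnitudes are not directly comparable. The sketch below, on made-up data, shows that a prediction is just the intercept plus the weighted sum of the inputs; `x_new` and `manual` are names introduced for the example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up data generated exactly from y = 100 + 20*x1 - 5*x2
X = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 5.0], [4.0, 3.0], [5.0, 8.0]])
y = 100.0 + 20.0 * X[:, 0] - 5.0 * X[:, 1]

lm = LinearRegression().fit(X, y)

# Rebuild a prediction by hand: intercept + sum(coefficient * feature)
x_new = np.array([[6.0, 4.0]])
manual = lm.intercept_ + x_new @ lm.coef_
print(lm.coef_, lm.intercept_, manual, lm.predict(x_new))
```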
14. Predict and visualize the predictions
predictions = lm.predict(X_test)
plt.scatter(y_test, predictions)
plt.ylim([200000,1000000])
plt.xlim([200000,1000000])
(200000, 1000000)
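# Look at the residual distribution
# (sns.distplot has been deprecated since seaborn 0.11; histplot with kde=True replaces it)
sns.histplot(y_test - predictions, bins=50, kde=True)
# The result looks decent: the residual distribution is fairly sharply peaked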
15. Compute RMSE (root mean squared error), MSE (mean squared error), and MAE (mean absolute error)
from sklearn import metrics
# Explained variance: 1.0 is the best score, lower is worse
print("score:", metrics.explained_variance_score(y_test, predictions))
print("MAE:", metrics.mean_absolute_error(y_test, predictions))
print('MSE:', metrics.mean_squared_error(y_test, predictions))
print('RMSE:', np.sqrt(metrics.mean_squared_error(y_test, predictions)))
print("r^2:", metrics.r2_score(y_test, predictions))
score: 0.31016009739486505
MAE: 406517.11675773124
MSE: 347933550457.3135
RMSE: 589858.9241990948
r^2: 0.31015583638325983
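For reference, each of these metrics is a one-line formula. The sketch below computes them by hand with NumPy on four made-up values, which also shows why the explained variance score and r^2 are almost identical above: they differ only when the mean of the residuals is not zero.

```python
import numpy as np

# Made-up targets and predictions
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.0, 5.0, 8.0, 11.0])

err = y_true - y_pred                    # residuals
mae = np.mean(np.abs(err))               # mean absolute error
mse = np.mean(err ** 2)                  # mean squared error
rmse = np.sqrt(mse)                      # same units as the target

# R^2: 1 - (residual sum of squares) / (total sum of squares)
r2 = 1.0 - np.sum(err ** 2) / np.sum((y_true - y_true.mean()) ** 2)

# Explained variance: like R^2, but ignores a constant offset in the residuals
explained_var = 1.0 - np.var(err) / np.var(y_true)
print(mae, mse, rmse, r2, explained_var)
```

RMSE is usually the easiest to interpret because it is in the same units as Price: an RMSE of about 590,000 means typical errors on the order of several hundred thousand dollars.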
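This is a basic linear regression model. Along the way we filled in null values and looked at the distributions of some variables. The final result is mediocre (r^2 of about 0.31, RMSE of about 590,000), limited by the simplicity of a linear model and by the fact that no variable transformations were applied. It is meant only as a first data-analysis project, for getting familiar with the workflow.
I hope it is helpful to readers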