[Hands-On Machine Learning, from Getting Started to Giving Up] Starting with Linear Regression

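Finally starting a new series~

Linear regression fits the data to the form $y = a_1x_1 + a_2x_2 + a_3x_3 + \dots + a_nx_n + b + \epsilon$.

Training the model gives us the parameters $a_1, a_2, \dots, a_n, b$,

so that for a new x we can predict y.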
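
To make "fitting" concrete, here is a minimal sketch on synthetic data. The coefficients `true_a`, `true_b` and the noise level are made up for illustration; the parameters are recovered by ordinary least squares via `numpy.linalg.lstsq`, which is one standard way to solve this problem.

```python
import numpy as np

# Synthetic data: y = 3*x1 - 2*x2 + 5 + noise (coefficients chosen arbitrarily)
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
true_a = np.array([3.0, -2.0])
true_b = 5.0
y = X @ true_a + true_b + rng.normal(scale=0.1, size=200)

# Append a column of ones so the intercept b is estimated together with a_1..a_n
X_design = np.column_stack([X, np.ones(len(X))])
params, *_ = np.linalg.lstsq(X_design, y, rcond=None)
a_hat, b_hat = params[:-1], params[-1]

print("estimated a:", a_hat)
print("estimated b:", b_hat)

# Predict y for a new x using the estimated parameters
x_new = np.array([1.0, 1.0])
y_pred = x_new @ a_hat + b_hat
print("prediction for x_new:", y_pred)
```

The estimates should land close to the values used to generate the data, and the prediction for `x_new` close to 3 - 2 + 5 = 6.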

Let's get started for real. This time we are going to predict house prices in Melbourne~

import pandas as pd
import seaborn as sns
import numpy as np
import matplotlib.pyplot as plt
# Linear regression
from sklearn.linear_model import LinearRegression
# Train/test splitting
from sklearn.model_selection import train_test_split
from datetime import date

1. Dataset description

Melbourne Housing Market

Some Key Details

Suburb: Suburb

Address: Address

Rooms: Number of rooms

Price: Price in dollars

Method: S - property sold; SP - property sold prior; PI - property passed in; PN - sold prior not disclosed; SN - sold not disclosed; NB - no bid; VB - vendor bid; W - withdrawn prior to auction; SA - sold after auction; SS - sold after auction price not disclosed. N/A - price or highest bid not available.

Type: br - bedroom(s); h - house, cottage, villa, semi, terrace; u - unit, duplex; t - townhouse; dev site - development site; o res - other residential.

SellerG: Real Estate Agent

Date: Date sold

Distance: Distance from CBD

Regionname: General Region (West, North West, North, North east …etc)

Propertycount: Number of properties that exist in the suburb.

Bedroom2: Scraped # of Bedrooms (from different source)

Bathroom: Number of Bathrooms

Car: Number of carspots

Landsize: Land Size

BuildingArea: Building Size

YearBuilt: Year the house was built

CouncilArea: Governing council for the area

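Lattitude: Self explanatory (latitude; the column name is misspelled in the dataset)

Longtitude: Self explanatory (longitude; the column name is misspelled in the dataset)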

import os
print(os.listdir('datasets'))

['BrentOilPrices.csv', '.DS_Store', 'Iris', 'Lending club loan data', 'Adult', 'Melbourne_housing_extra_data.csv']

2. A first look at the data

org_data = pd.read_csv('datasets/Melbourne_housing_extra_data.csv')
org_data.head(10)

	Suburb	Address	Rooms	Type	Price	Method	SellerG	Date	Distance	Postcode	...	Bathroom	Car	Landsize	BuildingArea	YearBuilt	CouncilArea	Lattitude	Longtitude	Regionname	Propertycount
0	Abbotsford	68 Studley St	2	h	NaN	SS	Jellis	3/09/2016	2.5	3067.0	...	1.0	1.0	126.0	NaN	NaN	Yarra	-37.8014	144.9958	Northern Metropolitan	4019.0
1	Abbotsford	85 Turner St	2	h	1480000.0	S	Biggin	3/12/2016	2.5	3067.0	...	1.0	1.0	202.0	NaN	NaN	Yarra	-37.7996	144.9984	Northern Metropolitan	4019.0
2	Abbotsford	25 Bloomburg St	2	h	1035000.0	S	Biggin	4/02/2016	2.5	3067.0	...	1.0	0.0	156.0	79.0	1900.0	Yarra	-37.8079	144.9934	Northern Metropolitan	4019.0
3	Abbotsford	18/659 Victoria St	3	u	NaN	VB	Rounds	4/02/2016	2.5	3067.0	...	2.0	1.0	0.0	NaN	NaN	Yarra	-37.8114	145.0116	Northern Metropolitan	4019.0
4	Abbotsford	5 Charles St	3	h	1465000.0	SP	Biggin	4/03/2017	2.5	3067.0	...	2.0	0.0	134.0	150.0	1900.0	Yarra	-37.8093	144.9944	Northern Metropolitan	4019.0
5	Abbotsford	40 Federation La	3	h	850000.0	PI	Biggin	4/03/2017	2.5	3067.0	...	2.0	1.0	94.0	NaN	NaN	Yarra	-37.7969	144.9969	Northern Metropolitan	4019.0
6	Abbotsford	55a Park St	4	h	1600000.0	VB	Nelson	4/06/2016	2.5	3067.0	...	1.0	2.0	120.0	142.0	2014.0	Yarra	-37.8072	144.9941	Northern Metropolitan	4019.0
7	Abbotsford	16 Maugie St	4	h	NaN	SN	Nelson	6/08/2016	2.5	3067.0	...	2.0	2.0	400.0	220.0	2006.0	Yarra	-37.7965	144.9965	Northern Metropolitan	4019.0
8	Abbotsford	53 Turner St	2	h	NaN	S	Biggin	6/08/2016	2.5	3067.0	...	1.0	2.0	201.0	NaN	1900.0	Yarra	-37.7995	144.9974	Northern Metropolitan	4019.0
9	Abbotsford	99 Turner St	2	h	NaN	S	Collins	6/08/2016	2.5	3067.0	...	2.0	1.0	202.0	NaN	1900.0	Yarra	-37.7996	144.9989	Northern Metropolitan	4019.0

10 rows × 21 columns

# Check the column dtypes:
org_data.dtypes

Suburb            object
Address           object
Rooms              int64
Type              object
Price            float64
Method            object
SellerG           object
Date              object
Distance         float64
Postcode         float64
Bedroom2         float64
Bathroom         float64
Car              float64
Landsize         float64
BuildingArea     float64
YearBuilt        float64
CouncilArea       object
Lattitude        float64
Longtitude       float64
Regionname        object
Propertycount    float64
dtype: object

3. To keep the model simple, we keep only properties whose Type is h (house) and select the columns Rooms, Date, Distance, Landsize, Bedroom2, Bathroom and YearBuilt, plus the target Price

dataframe = org_data[org_data["Type"]=='h'].loc[:,["Rooms","Date","Distance","Landsize","Bedroom2","Bathroom","YearBuilt","Price"]]

dataframe.head()

	Rooms	Date	Distance	Landsize	Bedroom2	Bathroom	YearBuilt	Price
0	2	3/09/2016	2.5	126.0	2.0	1.0	NaN	NaN
1	2	3/12/2016	2.5	202.0	2.0	1.0	NaN	1480000.0
2	2	4/02/2016	2.5	156.0	2.0	1.0	1900.0	1035000.0
4	3	4/03/2017	2.5	134.0	3.0	2.0	1900.0	1465000.0
5	3	4/03/2017	2.5	94.0	3.0	2.0	NaN	850000.0

dataframe.shape

(12992, 8)

4. Drop the rows where Price is null

dataframe = dataframe.dropna(subset=['Price'])

dataframe.head()

	Rooms	Date	Distance	Landsize	Bedroom2	Bathroom	YearBuilt	Price
1	2	3/12/2016	2.5	202.0	2.0	1.0	NaN	1480000.0
2	2	4/02/2016	2.5	156.0	2.0	1.0	1900.0	1035000.0
4	3	4/03/2017	2.5	134.0	3.0	2.0	1900.0	1465000.0
5	3	4/03/2017	2.5	94.0	3.0	2.0	NaN	850000.0
6	4	4/06/2016	2.5	120.0	3.0	1.0	2014.0	1600000.0

dataframe.shape

(9944, 8)

# Count missing values
dataframe.isnull().describe()

	Rooms	Date	Distance	Landsize	Bedroom2	Bathroom	YearBuilt	Price
count	9944	9944	9944	9944	9944	9944	9944	9944
unique	1	1	2	2	2	2	2	1
top	False	False	False	False	False	False	True	False
freq	9944	9944	9939	7780	7978	7978	5352	9944
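
In the table above, `top` and `freq` give the most common value of the null mask and how often it occurs. So YearBuilt is missing in 5352 of the 9944 rows, and Landsize in 9944 - 7780 = 2164. A more direct way to get these counts is to sum the boolean mask per column. The sketch below uses a small made-up frame (`toy` is hypothetical, not the Melbourne data):

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with a few gaps
toy = pd.DataFrame({
    "Rooms": [2, 3, 4, 3],
    "Landsize": [202.0, np.nan, 134.0, np.nan],
    "YearBuilt": [np.nan, 1900.0, np.nan, np.nan],
})

missing_count = toy.isnull().sum()    # number of missing values per column
missing_ratio = toy.isnull().mean()   # fraction of missing values per column
print(pd.DataFrame({"missing": missing_count, "ratio": missing_ratio}))
```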

5. Convert Date to the number of days since the earliest date

dataframe["Date"] = pd.to_datetime(dataframe["Date"],dayfirst=True)

days_since_start = [(x - dataframe["Date"].min()).days for x in dataframe["Date"]]

dataframe["Days"] = days_since_start

dataframe = dataframe.drop(["Date"], axis=1)
dataframe.head()

	Rooms	Distance	Landsize	Bedroom2	Bathroom	YearBuilt	Price	Days
1	2	2.5	202.0	2.0	1.0	NaN	1480000.0	310
2	2	2.5	156.0	2.0	1.0	1900.0	1035000.0	7
4	3	2.5	134.0	3.0	2.0	1900.0	1465000.0	401
5	3	2.5	94.0	3.0	2.0	NaN	850000.0	401
6	4	2.5	120.0	3.0	1.0	2014.0	1600000.0	128
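
The list comprehension in this step recomputes `dataframe["Date"].min()` for every row. A vectorized version gives the same day offsets in one expression. The sketch below runs on three made-up day-first date strings (`dates` is a hypothetical stand-in for the Date column):

```python
import pandas as pd

# Hypothetical sample of day-first date strings like those in the Date column
dates = pd.Series(["3/12/2016", "4/02/2016", "4/03/2017"])
parsed = pd.to_datetime(dates, dayfirst=True)

# Subtracting the minimum date yields a Timedelta series; .dt.days extracts whole days
days = (parsed - parsed.min()).dt.days
print(days.tolist())  # [303, 0, 394]
```

In this sample the earliest date is 4 Feb 2016. The offsets in the real table are 7 days larger, which suggests the earliest sale in the full dataset falls 7 days earlier.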

6. Convert YearBuilt to the age of the house in years, measured from the current year (2019)

year_from_now = [(2019 - x) for x in dataframe["YearBuilt"]]

dataframe["YearBuilt"]=year_from_now

dataframe.head()

	Rooms	Distance	Landsize	Bedroom2	Bathroom	YearBuilt	Price	Days
1	2	2.5	202.0	2.0	1.0	NaN	1480000.0	310
2	2	2.5	156.0	2.0	1.0	119.0	1035000.0	7
4	3	2.5	134.0	3.0	2.0	119.0	1465000.0	401
5	3	2.5	94.0	3.0	2.0	NaN	850000.0	401
6	4	2.5	120.0	3.0	1.0	5.0	1600000.0	128

7. Look at the distribution of the non-null values of each variable

sns.kdeplot(dataframe["Price"])

<matplotlib.axes._subplots.AxesSubplot at 0x1a188c06d8>

[Figure: kernel density estimate of Price]

sns.kdeplot(dataframe["Distance"].dropna())

<matplotlib.axes._subplots.AxesSubplot at 0x1a189709b0>

[Figure: kernel density estimate of Distance]

sns.kdeplot(dataframe["Landsize"].dropna())

<matplotlib.axes._subplots.AxesSubplot at 0x1a18a85710>

[Figure: kernel density estimate of Landsize]

# Check for outliers
dataframe[dataframe["Landsize"]>70000]

	Rooms	Distance	Landsize	Bedroom2	Bathroom	YearBuilt	Price	Days
1198	3	9.2	75100.0	3.0	1.0	NaN	2000000.0	213
17293	3	34.6	76000.0	3.0	2.0	NaN	1085000.0	485
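
The 70000 cutoff above was picked by eye from the density plot. As a rough sketch of a more systematic check, the snippet below flags values beyond the upper Tukey fence (third quartile plus 1.5 times the interquartile range). The 1.5 multiplier is a common rule of thumb, not something the original analysis used, and `landsize` is made-up data:

```python
import pandas as pd

# Made-up land sizes with one extreme value
landsize = pd.Series([126.0, 202.0, 156.0, 134.0, 94.0, 120.0, 400.0, 75100.0])

q1, q3 = landsize.quantile([0.25, 0.75])
iqr = q3 - q1
upper = q3 + 1.5 * iqr  # conventional Tukey fence

outliers = landsize[landsize > upper]
print("upper fence:", upper)
print(outliers)
```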

sns.kdeplot(dataframe["Days"].dropna())

<matplotlib.axes._subplots.AxesSubplot at 0x1a18c4b390>

[Figure: kernel density estimate of Days]

sns.kdeplot(dataframe["YearBuilt"].dropna())

<matplotlib.axes._subplots.AxesSubplot at 0x1a18bfbe10>

[Figure: kernel density estimate of YearBuilt]

YearBuilt has too many missing values and its data quality is too poor, so we decided to drop this column

8. Handling missing values

# Fill the missing values of each column with that column's mean.
# .copy() makes each Series independent of dataframe, so the in-place fill is unambiguous.
Distance = dataframe["Distance"].copy()
Distance.fillna(Distance.mean(),inplace=True)
Distance.isnull().describe()

Bedroom2 = dataframe["Bedroom2"].copy()
Bedroom2.fillna(Bedroom2.mean(), inplace=True)
Bedroom2.isnull().describe()

Bathroom = dataframe["Bathroom"].copy()
Bathroom.fillna(Bathroom.mean(), inplace=True)
Bathroom.isnull().describe()

Landsize = dataframe["Landsize"].copy()
Landsize.fillna(Landsize.mean(), inplace=True)
Landsize.isnull().describe()

# Drop the unfilled originals (and YearBuilt for good), then re-attach the filled columns
dataframe = dataframe.drop(["Distance","Landsize","Bedroom2","Bathroom","YearBuilt"], axis=1)

dataframe = pd.concat([dataframe,Distance,Landsize,Bedroom2,Bathroom],axis=1)

dataframe.head()

	Rooms	Price	Days	Distance	Landsize	Bedroom2	Bathroom
1	2	1480000.0	310	2.5	202.0	2.0	1.0
2	2	1035000.0	7	2.5	156.0	2.0	1.0
4	3	1465000.0	401	2.5	134.0	3.0	2.0
5	3	850000.0	401	2.5	94.0	3.0	2.0
6	4	1600000.0	128	2.5	120.0	3.0	1.0

dataframe.isnull().describe()

	Rooms	Price	Days	Distance	Landsize	Bedroom2	Bathroom
count	9944	9944	9944	9944	9944	9944	9944
unique	1	1	1	1	1	1	1
top	False	False	False	False	False	False	False
freq	9944	9944	9944	9944	9944	9944	9944
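
The per-column fill, drop and concat sequence can also be written as a single assignment. This is a minimal sketch on a made-up frame (`toy` and `cols` are hypothetical names); note that it leaves the column order unchanged, whereas the concat approach moves the filled columns to the end.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with gaps in two feature columns
toy = pd.DataFrame({
    "Distance": [2.5, np.nan, 3.5],
    "Bathroom": [1.0, 2.0, np.nan],
    "Price": [1480000.0, 1035000.0, 1465000.0],
})

cols = ["Distance", "Bathroom"]
# DataFrame.mean() skips NaN by default; fillna aligns the means by column name
toy[cols] = toy[cols].fillna(toy[cols].mean())
print(toy)
```

Mean imputation produces fractional values for count-like columns such as Bedroom2 and Bathroom. Using the median instead is a common alternative, though the original analysis uses the mean.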

9. Draw a scatter-plot matrix to look at the relationships between variables

sns.pairplot(dataframe)

<seaborn.axisgrid.PairGrid at 0x1a18ab1128>

[Figure: scatter-plot matrix of all variables]

10. Draw a heatmap to look at the correlations between variables

fig, ax = plt.subplots(figsize=(15,15)) 
sns.heatmap(dataframe.corr(), annot=True)

<matplotlib.axes._subplots.AxesSubplot at 0x1a1a8d06d8>

[Figure: correlation heatmap of all variables]

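Days and Landsize have little correlation with Price, so they are candidates for removal. Note that the code in step 11 drops only Price from X, so both columns are still used by the model (they appear in the coefficient table in step 13).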
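
Instead of reading the numbers off the heatmap, the correlations with the target can be ranked directly. The sketch below uses made-up data in which `Days` is unrelated to `Price` by construction; the column names mirror the article but the values are synthetic.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 500
rooms = rng.integers(1, 6, size=n)
days = rng.integers(0, 600, size=n)  # unrelated to price by construction
price = 250_000 * rooms + rng.normal(scale=100_000, size=n)
toy = pd.DataFrame({"Rooms": rooms, "Days": days, "Price": price})

# Correlation of every feature with Price, strongest (by absolute value) first
corr_with_price = (
    toy.corr()["Price"].drop("Price").sort_values(key=abs, ascending=False)
)
print(corr_with_price)
```

Keep in mind that Pearson correlation only measures linear association, so a low value does not by itself prove a feature is useless.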

11. Split into training and test sets

X=dataframe.drop(["Price"], axis=1)
y=dataframe["Price"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4)
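
Because the call above does not set `random_state`, every run produces a different split, so the coefficients and metrics shown later will vary from run to run. A small sketch of a reproducible split follows; the seed 42 and the toy arrays are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 10 samples, 2 features
X = np.arange(20).reshape(10, 2)
y = np.arange(10)

# Fixing random_state makes the shuffle, and therefore the split, repeatable
split_a = train_test_split(X, y, test_size=0.4, random_state=42)
split_b = train_test_split(X, y, test_size=0.4, random_state=42)

X_train, X_test, y_train, y_test = split_a
print(X_train.shape, X_test.shape)
print("same split:", np.array_equal(split_a[3], split_b[3]))
```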

12. Create the linear regression model and train it

lm = LinearRegression()
lm.fit(X_train, y_train)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)

13. Look at the fitted parameters

coeff_df = pd.DataFrame(lm.coef_,X.columns,columns=['Coefficient'])
ranked_suburbs = coeff_df.sort_values("Coefficient", ascending = False)
ranked_suburbs

	Coefficient
Bathroom	287227.794348
Rooms	256960.542449
Days	208.567181
Landsize	28.455846
Bedroom2	-40640.077882
Distance	-48979.633196
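
These coefficients are in dollars per unit of each feature (per bathroom, per day, per square metre), so their raw sizes are not directly comparable. One common way to compare them is to standardize the features before fitting. The sketch below shows the effect on synthetic data; the price formula and all values are made up, and this step is not part of the original analysis.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 1000
toy = pd.DataFrame({
    "Rooms": rng.integers(1, 6, size=n).astype(float),
    "Days": rng.integers(0, 600, size=n).astype(float),
})
# Made-up price model: both features matter, but they live on very different scales
price = 250_000 * toy["Rooms"] + 1_000 * toy["Days"] + rng.normal(scale=50_000, size=n)

raw = LinearRegression().fit(toy, price)
scaled = LinearRegression().fit(StandardScaler().fit_transform(toy), price)

print(pd.DataFrame({"raw": raw.coef_, "standardized": scaled.coef_}, index=toy.columns))
```

In this toy setup the raw coefficient of `Days` looks negligible next to `Rooms`, while the standardized coefficients (dollars per standard deviation of the feature) are of the same order of magnitude.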

14. Predict and visualize the predictions

predictions = lm.predict(X_test)

plt.scatter(y_test, predictions)
plt.ylim([200000,1000000])
plt.xlim([200000,1000000])

(200000, 1000000)

[Figure: scatter plot of predicted vs. actual prices]

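# Look at the distribution of the residuals
# (distplot is deprecated since seaborn 0.11; histplot or displot is the replacement)
sns.distplot((y_test-predictions),bins=50)
# Looks decent: the distribution is fairly sharply peaked

[Figure: distribution of the residuals]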

15. Compute RMSE (root mean squared error), MSE (mean squared error) and MAE (mean absolute error)

from sklearn import metrics

# Explained variance: 1.0 is the best possible value, lower is worse
print("score:", metrics.explained_variance_score(y_test, predictions))
print("MAE:", metrics.mean_absolute_error(y_test, predictions))
print('MSE:', metrics.mean_squared_error(y_test, predictions))
print('RMSE:', np.sqrt(metrics.mean_squared_error(y_test, predictions)))
print("r^2:", metrics.r2_score(y_test, predictions))

score: 0.31016009739486505
MAE: 406517.11675773124
MSE: 347933550457.3135
RMSE: 589858.9241990948
r^2: 0.31015583638325983
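
For reference, each of these metrics is a short formula over the residuals. The sketch below computes them by hand on four made-up values and checks two of them against `sklearn.metrics`. It also shows why `score` and `r^2` above are almost identical: explained variance ignores the mean of the residuals, so the two differ only when the predictions are biased on average.

```python
import numpy as np
from sklearn import metrics

# Made-up targets and predictions
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.0, 5.0, 8.0, 11.0])

err = y_true - y_pred
mae = np.mean(np.abs(err))   # mean absolute error
mse = np.mean(err ** 2)      # mean squared error
rmse = np.sqrt(mse)          # root mean squared error, same unit as y
r2 = 1 - np.sum(err ** 2) / np.sum((y_true - y_true.mean()) ** 2)
explained_var = 1 - np.var(err) / np.var(y_true)

print("MAE:", mae, "MSE:", mse, "RMSE:", rmse)
print("r^2:", r2, "sklearn:", metrics.r2_score(y_true, y_pred))
print("explained variance:", explained_var,
      "sklearn:", metrics.explained_variance_score(y_true, y_pred))
```

Here the residuals average -0.5, so explained variance (0.75) comes out higher than r^2 (0.7).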

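This is a basic linear regression model that covers filling in missing values and inspecting the distributions of a few variables. The final result is mediocre (r^2 of about 0.31): it is limited by the simplicity of a linear model, and no variable transformations were applied. Treat it as a first data analysis project for getting familiar with the workflow.

I hope it is helpful to readers.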

狐狐的鹿鹿
