Gradient Descent and Stochastic Gradient Descent, Implemented by Hand [A Worked Example]

This assignment makes a good hands-on case study, so I am turning it into a blog post as study notes.
If you would like more details about the assignment, or about other aspects of these algorithms, feel free to message the author (WeChat: 1178623893).

COMP9417 - Machine Learning

Homework 1: Gradient Descent & Friends

Introduction

In this homework, you will be required to manually implement (Stochastic) Gradient Descent
in Python to learn the parameters of a linear regression model.

# Import Packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
# Load Data
df = pd.read_csv(r'real_estate(1).csv')
df
transactiondate age nearestMRT nConvenience latitude longitude price
0 2012.917 32.0 84.87882 10.0 24.98298 121.54024 37.9
1 2012.917 19.5 306.59470 9.0 24.98034 121.53951 42.2
2 2013.583 13.3 561.98450 5.0 24.98746 121.54391 47.3
3 2013.500 13.3 561.98450 5.0 24.98746 121.54391 54.8
4 2012.833 5.0 390.56840 5.0 24.97937 121.54245 43.1
... ... ... ... ... ... ... ...
409 2013.000 13.7 4082.01500 0.0 24.94155 121.50381 15.4
410 2012.667 5.6 90.45606 9.0 24.97433 121.54310 50.0
411 2013.250 18.8 390.96960 7.0 24.97923 121.53986 40.6
412 2013.000 8.1 104.81010 5.0 24.96674 121.54067 52.5
413 2013.500 6.5 90.45606 9.0 24.97433 121.54310 63.9

414 rows × 7 columns

Question 1. (Pre-processing)
#  Question 1. (Pre-processing)
# Q1(a)  Remove any rows of the data that contain a missing (‘NA’) value. List the indices of the removed
# data points. Then, delete all features from the dataset apart from: age, nearestMRT and nConvenience.

# indices of the rows removed because they contain an 'NA' value
print(df[df.isnull().any(axis=1)].index.tolist())

df1 = df.dropna(axis=0, how='any')   # df1: drop any row containing an 'NA' value
df2 = df1.reset_index(drop=True)     # give df1 a fresh 0..n-1 index

df3 = df2[['price','age','nearestMRT','nConvenience']]  # keep the target (price) plus the three features: age, nearestMRT and nConvenience

# Q1(b) feature normalisation
df_data = (df3 - df3.min()) / (df3.max() - df3.min())
print('----------------------the cleaned data----------------------')
print(df_data)
print('-----------------the mean value over your dataset.-------------------')
print(df_data.mean())

----------------------the cleaned data----------------------
        price       age  nearestMRT  nConvenience
0    0.275705  0.730594    0.009513           1.0
1    0.314832  0.445205    0.043809           0.9
2    0.361237  0.303653    0.083315           0.5
3    0.429481  0.303653    0.083315           0.5
4    0.323021  0.114155    0.056799           0.5
..        ...       ...         ...           ...
403  0.070974  0.312785    0.627820           0.0
404  0.385805  0.127854    0.010375           0.9
405  0.300273  0.429224    0.056861           0.7
406  0.408553  0.184932    0.012596           0.5
407  0.512284  0.148402    0.010375           0.9

[408 rows x 4 columns]
-----------------the mean value over your dataset.-------------------
price           0.277240
age             0.406079
nearestMRT      0.162643
nConvenience    0.412010
dtype: float64
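
As a quick sanity check (not part of the assignment), the same min-max scaling can be reproduced with scikit-learn's MinMaxScaler; a minimal sketch, assuming scikit-learn is installed:

# Optional cross-check: MinMaxScaler applies the same (x - min) / (max - min) scaling per column
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
df_check = pd.DataFrame(scaler.fit_transform(df3), columns=df3.columns)
print(np.allclose(df_check.values, df_data.values))   # expected: True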

Question 2. (Train and Test sets)

# Question 2. (Train and Test sets)
# Split the normalised data in half: the first 50% of rows for training, the remaining 50% for testing
train_df = df_data.iloc[:int(.5 * len(df_data)),:]
test_df  = df_data.iloc[int(.5 * len(df_data)):,:]
print('----------------------the Train Set----------------------')
print(train_df)
print('----------------------the Test Set----------------------')
print(test_df)
print('---------------------------------------------------------')
print('Print out the first and last rows of both your training and test sets.')
print('---------------------------------------------------------')
print(train_df.iloc[0,:])    # first row of the training set
print(train_df.iloc[-1,:])   # last row of the training set
print(test_df.iloc[0,:])     # first row of the test set
print(test_df.iloc[-1,:])    # last row of the test set

----------------------the Train Set----------------------
        price       age  nearestMRT  nConvenience
0    0.275705  0.730594    0.009513           1.0
1    0.314832  0.445205    0.043809           0.9
2    0.361237  0.303653    0.083315           0.5
3    0.429481  0.303653    0.083315           0.5
4    0.323021  0.114155    0.056799           0.5
..        ...       ...         ...           ...
199  0.350318  0.356164    0.041138           0.5
200  0.172884  0.410959    0.215241           0.1
201  0.125569  0.292237    0.220637           0.3
202  0.331210  0.506849    0.055096           1.0
203  0.242038  0.878995    0.099260           0.3

[204 rows x 4 columns]
----------------------the Test Set----------------------
        price       age  nearestMRT  nConvenience
204  0.169245  0.262557    0.206780           0.1
205  0.303003  0.794521    0.023551           0.8
206  0.405823  0.118721    0.056799           0.5
207  0.326661  0.000000    0.038770           0.1
208  0.213831  0.401826    0.275697           0.2
..        ...       ...         ...           ...
403  0.070974  0.312785    0.627820           0.0
404  0.385805  0.127854    0.010375           0.9
405  0.300273  0.429224    0.056861           0.7
406  0.408553  0.184932    0.012596           0.5
407  0.512284  0.148402    0.010375           0.9

[204 rows x 4 columns]
---------------------------------------------------------
Print out the first and last rows of both your training and test sets.
---------------------------------------------------------
price           0.275705
age             0.730594
nearestMRT      0.009513
nConvenience    1.000000
Name: 0, dtype: float64
price           0.242038
age             0.878995
nearestMRT      0.099260
nConvenience    0.300000
Name: 203, dtype: float64
price           0.169245
age             0.262557
nearestMRT      0.206780
nConvenience    0.100000
Name: 204, dtype: float64
price           0.512284
age             0.148402
nearestMRT      0.010375
nConvenience    0.900000
Name: 407, dtype: float64
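
For reference, the boundary rows of each split can also be printed more compactly with pandas head/tail; a small sketch using the train_df and test_df defined above:

# Equivalent, more compact inspection of the first and last rows of each set
print(train_df.head(1)); print(train_df.tail(1))
print(test_df.head(1));  print(test_df.tail(1))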

Question 3. (Loss Function)

Consider the loss function
$$\mathcal{L}_{c}(x, y)=\sqrt{\frac{1}{c^{2}}(x-y)^{2}+1}-1$$

where $c \in \mathbb{R}$ is a hyper-parameter. Consider the (simple) linear model

$$\hat{y}^{(i)}=w_{0}+w_{1} x_{1}^{(i)}+w_{2} x_{2}^{(i)}+w_{3} x_{3}^{(i)}, \qquad i=1,\dots,n.$$

We can write this more succinctly by letting $w=\left(w_{0}, w_{1}, w_{2}, w_{3}\right)^{T}$ and $X^{(i)}=\left(1, x_{1}^{(i)}, x_{2}^{(i)}, x_{3}^{(i)}\right)^{T}$, so that $\hat{y}^{(i)}=w^{T} X^{(i)}$. The mean loss achieved by our model $w$ on a dataset of $n$ observations is then

$$\mathcal{L}_{c}(y, \hat{y})=\frac{1}{n} \sum_{i=1}^{n} \mathcal{L}_{c}\left(y^{(i)}, \hat{y}^{(i)}\right)=\frac{1}{n} \sum_{i=1}^{n}\left[\sqrt{\frac{1}{c^{2}}\left(y^{(i)}-\left\langle w, X^{(i)}\right\rangle\right)^{2}+1}-1\right]$$

Compute the following derivatives:

$$\frac{\partial \mathcal{L}_{c}\left(y^{(i)}, \hat{y}^{(i)}\right)}{\partial w_{k}}, \quad k=0,1,2,3$$
You must show your working for full marks.

Answer to Question 3.
$$\begin{aligned} \frac{\partial \mathcal{L}_{c}\left(y^{(i)}, \hat{y}^{(i)}\right)}{\partial w_{k}} &= \frac{\partial}{\partial w_k}\left(\sqrt{\frac{1}{c^{2}}\left(y^{(i)}-\left\langle w, X^{(i)}\right\rangle\right)^{2}+1}-1\right) \\ &= \frac{1}{2}\left(\frac{1}{c^{2}}\left(y^{(i)}-\left\langle w, X^{(i)}\right\rangle\right)^{2}+1\right)^{-\frac{1}{2}} \frac{\partial}{\partial w_k}\left(\frac{1}{c^{2}}\left(y^{(i)}-\left\langle w, X^{(i)}\right\rangle\right)^{2}+1\right) \\ &= \frac{1}{2}\left(\frac{1}{c^{2}}\left(y^{(i)}-\left\langle w, X^{(i)}\right\rangle\right)^{2}+1\right)^{-\frac{1}{2}} \cdot \frac{2}{c^{2}}\left(y^{(i)}-\left\langle w, X^{(i)}\right\rangle\right) \cdot \left(-x_{k}^{(i)}\right) \\ &= -\frac{\left(y^{(i)}-\left\langle w, X^{(i)}\right\rangle\right) x_{k}^{(i)}}{\sqrt{c^{2}\left(y^{(i)}-\left\langle w, X^{(i)}\right\rangle\right)^{2}+c^{4}}} \end{aligned}$$

where we used $\partial\left(y^{(i)}-\left\langle w, X^{(i)}\right\rangle\right)/\partial w_{k}=-x_{k}^{(i)}$ (with $x_{0}^{(i)}=1$ for the bias term), which is where the leading minus sign comes from.
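
The closed form can be sanity-checked against a central finite-difference approximation; a minimal sketch, where loss_c, grad_c and the test values of w, X, y and c are all made up for the check rather than taken from the assignment:

# Numerical check of the Question 3 derivative
def loss_c(w, X, y, c):
    return np.sqrt(((y - X @ w) ** 2) / c**2 + 1) - 1

def grad_c(w, X, y, c):
    r = y - X @ w                                   # residual y - <w, X>
    return -r * X / np.sqrt(c**2 * r**2 + c**4)     # the derivative derived above, for k = 0..3

rng = np.random.default_rng(0)
w = rng.normal(size=4)
X = np.r_[1.0, rng.normal(size=3)]                  # X = (1, x1, x2, x3)
y, c, eps = 0.7, 2.0, 1e-6
numeric = np.array([(loss_c(w + eps * e, X, y, c) - loss_c(w - eps * e, X, y, c)) / (2 * eps)
                    for e in np.eye(4)])
print(np.allclose(numeric, grad_c(w, X, y, c)))     # expected: True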

Question 4. (Gradient Descent Pseudocode)

The pseudocode for the gradient descent updates:

for i in range(max_iterations):
    params_grad = evaluate_gradient(loss_function, training_data, params)
    params = params - learning_rate * params_grad

The pseudocode for the stochastic gradient descent updates:

for i in range(nb_epochs):                                   # loop over epochs
    np.random.shuffle(training_data)                         # shuffle the data each epoch
    for batch in get_batches(training_data, batch_size=50):  # mini-batches of 50
        params_grad = evaluate_gradient(loss_function, batch, params)
        params = params - learning_rate * params_grad
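
The get_batches helper in the pseudocode is not spelled out; a minimal NumPy sketch of one way it could look (the function name, batch layout and dummy data below are assumptions, not part of the assignment):

def get_batches(training_data, batch_size=50):
    # Yield consecutive mini-batches of rows from an already-shuffled array
    for start in range(0, len(training_data), batch_size):
        yield training_data[start:start + batch_size]

# Usage: shuffle once per epoch, then loop over mini-batches
data = np.random.rand(204, 5)            # dummy rows, e.g. [y | 1 | features]
np.random.shuffle(data)
for batch in get_batches(data, batch_size=50):
    pass                                 # evaluate_gradient(loss_function, batch, params) would go here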

Question 5. (Gradient Descent Implementation)

# Question 5. (Gradient Descent Implementation)
# Q5. (a)
#  -------------------- Gradient_Descent --------------------
def loss(w,x,y):
    # Mean of the loss L_c from Question 3 over all observations (c hard-coded here)
    c = 0.5
    temp = (y - x@w)**2
    return np.mean(np.sqrt(temp/c/c + 1) - 1)

def Gradient_Descent(w,x,y,eta):
    max_iter = 400
    Loss = []
    w_all = w                                # record the weight vector at every iteration
    for i in range(max_iter):
        resid = x @ w - y                    # <w, X^(i)> - y^(i) for every observation (uses the current w)
        # update each weight with the hand-derived gradient, averaged over all observations
        w0 = w[0] - eta * np.mean( 0.5 * (x[:,0].reshape([-1,1])) * resid / np.sqrt(resid**2 + 4))
        w1 = w[1] - eta * np.mean( 0.5 * (x[:,1].reshape([-1,1])) * resid / np.sqrt(resid**2 + 4))
        w2 = w[2] - eta * np.mean( 0.5 * (x[:,2].reshape([-1,1])) * resid / np.sqrt(resid**2 + 4))
        w3 = w[3] - eta * np.mean( 0.5 * (x[:,3].reshape([-1,1])) * resid / np.sqrt(resid**2 + 4))
        w = np.array([w0,w1,w2,w3]).reshape([-1,1])
        w_all = np.hstack([w_all,w])
        Loss.append(loss(w,x,y))             # loss after this iteration's update
    return Loss,w_all

#  -------------------- Main Function --------------------
df_x = train_df[['age','nearestMRT','nConvenience']]
df_y = train_df['price']
x = np.hstack([np.ones([len(train_df),1]),df_x.values])   # prepend a column of ones for the bias term w0
y = df_y.values.reshape([-1,1])
ww = np.array([1,1,1,1]).reshape([-1,1])                   # initial weight vector


##  run gradient descent once for each candidate step size and collect the loss curves
alphas = [10,5,2, 1,0.5, 0.25,0.1, 0.05, 0.01]
num = 0
for eta in alphas:
    Loss_temp,w_all  = Gradient_Descent(ww,x,y,eta)
    Loss_temp = np.array(Loss_temp).reshape([-1,1])
    if num == 0:
        Loss_all = Loss_temp
        num  = 1
    else:
        Loss_all = np.hstack([Loss_all,Loss_temp])
##  plot: one subplot of the loss curve per step size
fig, axes = plt.subplots(3,3, figsize=(10,10))
for i, ax in enumerate(axes.flat):
    # Loss_all has one column per step size; each column holds the loss at every GD iteration
    ax.plot(Loss_all[:,i])
    ax.set_title(f"step size: {alphas[i]}")  # plot titles
plt.tight_layout()  # plot formatting

[Figure: GD training loss versus iteration for each of the nine step sizes, in a 3×3 grid]
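
As an aside, the four per-weight updates inside Gradient_Descent can be collapsed into a single vectorized step; the sketch below keeps exactly the same constants as the hand-written version and is meant only as an illustration, not as the code used to produce the results in this post:

def gradient_step(w, x, y, eta):
    resid = x @ w - y                                                 # shape (n, 1)
    grad = np.mean(0.5 * x * resid / np.sqrt(resid**2 + 4), axis=0)   # one averaged partial derivative per weight
    return w - eta * grad.reshape(-1, 1)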

Q5. (b)

From your results in the previous part, choose an appropriate step size (and state your choice), and explain why you made this choice.

Answer to Question 5.(b).

In my opinion, after testing several values, a step size of 1 is an appropriate choice.
If the step size is too large, the updates are too aggressive and the algorithm can oscillate around or even overshoot the optimum; if it is too small, the loss decreases too slowly and the algorithm does not get close to convergence within the 400 iterations. With a step size of 1, the loss settles near its minimum in comparatively few iterations.
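
One way to make this choice less subjective is to compare the loss reached at the final iteration for each candidate; a small sketch reusing Loss_all and alphas from part (a):

# Step size whose loss after the last GD iteration is smallest
best = alphas[int(np.argmin(Loss_all[-1, :]))]
print('step size with the lowest final loss:', best)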

# Q5. (c)
# Use eta = 0.3 and plot how each of the four weights evolves (note: here the model is fitted on the full normalised dataset df_data)
#  -------------------- Main Function --------------------
df_x = df_data[['age','nearestMRT','nConvenience']]
df_y = df_data['price']
x = np.hstack([np.ones([len(df_data),1]),df_x.values])
y = df_y.values.reshape([-1,1])
ww = np.array([1,1,1,1]).reshape([-1,1])
eta = 0.3 
Loss_temp,w_all  = Gradient_Descent(ww,x,y,eta)

plt.plot(w_all.T)
plt.title('The progression of each of the four weights over the iterations.')
plt.ylabel('Weights')
plt.xlabel('Iterations')
plt.legend(['w_0','w_1','w_2','w_3'])
plt.show()
print('The final weight vector is :',w_all[:,-1]) # Print out the final weight vector.

# Finally, run your model on the train and test set, and print the achieved losses.
# Train Set
df_train_x = train_df[['age','nearestMRT','nConvenience']]
df_train_y = train_df['price']
train_x = np.hstack([np.ones([len(df_train_x),1]),df_train_x.values])
train_y = df_train_y.values.reshape([-1,1])
loss_train = loss(w_all[:,-1].reshape([-1,1]),train_x,train_y)
print('Loss for Train Set :',loss_train)

# Test Set
df_test_x = test_df[['age','nearestMRT','nConvenience']]
df_test_y = test_df['price']
test_x = np.hstack([np.ones([len(df_test_x),1]),df_test_x.values])
test_y = df_test_y.values.reshape([-1,1])
loss_test = loss(w_all[:,-1].reshape([-1,1]),test_x,test_y)
print('Loss for Test Set :',loss_test)

[Figure: progression of the four weights over 400 GD iterations with step size 0.3]

The final weight vector is : [ 0.04394919 -0.01285655  0.29836987  0.44517113]
Loss for Train Set : 0.030617652935386656
Loss for Test Set : 0.035002153888240746

Question 6. (Stochastic Gradient Descent Implementation)

# Q6. (a)
#  -------------------- Stochastic_Gradient_Descent --------------------
def Stochastic_Gradient_Descent(w,x,y,eta):
    # Re-uses the loss() helper defined in Question 5
    Loss_all = []
    data = np.hstack([y,x])                  # keep the target and features together while shuffling
    w_all    = w
    n_Epochs = 6
    for i in range(n_Epochs):
        rand_data = np.random.permutation(data)   # shuffle the training rows at the start of each epoch
        yy = rand_data[:,0]
        xx = rand_data[:,1:]
        for j in range(len(yy)):
            resid = xx[j,:]@w - yy[j]             # residual for the single sampled observation
            # update each weight with the gradient of the loss at this one observation
            w0 = w[0] - eta * 0.5 * xx[j,0] * resid / np.sqrt(resid**2 + 4)
            w1 = w[1] - eta * 0.5 * xx[j,1] * resid / np.sqrt(resid**2 + 4)
            w2 = w[2] - eta * 0.5 * xx[j,2] * resid / np.sqrt(resid**2 + 4)
            w3 = w[3] - eta * 0.5 * xx[j,3] * resid / np.sqrt(resid**2 + 4)
            w = np.array([w0,w1,w2,w3]).reshape([-1,1])
            w_all = np.hstack([w_all,w])
            Loss_all.append(loss(w,x,y))          # track the full-training-set loss after every update
    return w_all,Loss_all
#  -------------------- Main Function --------------------
df_x = train_df[['age','nearestMRT','nConvenience']]
df_y = train_df['price']
x = np.hstack([np.ones([len(train_df),1]),df_x.values])
y = df_y.values.reshape([-1,1])
ww = np.array([1,1,1,1]).reshape([-1,1])



alphas = [10,5,2, 1,0.5, 0.25,0.1, 0.05, 0.01]
num = 0
for eta in alphas:
    w_temp, Loss_temp  = Stochastic_Gradient_Descent(ww,x,y,eta)
    Loss_temp1 = np.array(Loss_temp).reshape([-1,1])
    if num == 0:
        Loss_all = Loss_temp1
        num  = 1
    else:
        Loss_all = np.hstack([Loss_all,Loss_temp1])

##  plot: one subplot of the loss curve per step size
fig, axes = plt.subplots(3,3, figsize=(10,10))
for i, ax in enumerate(axes.flat):
    # Loss_all has one column per step size; each column holds the training loss after every
    # single-sample update (6 epochs x 204 updates per epoch)
    ax.plot(Loss_all[:,i])
    ax.set_title(f"step size: {alphas[i]}")  # plot titles
plt.tight_layout()  # plot formatting

[Figure: SGD training loss versus update number for each of the nine step sizes, in a 3×3 grid]
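
The implementation above updates on one observation at a time; below is a sketch of how the same hand-coded gradient could be applied to mini-batches of 50 rows, closer to the Question 4 pseudocode (the function name and batch size are assumptions, and this variant was not used for the results that follow):

def sgd_minibatch_epoch(w, x, y, eta, batch_size=50):
    # One epoch of mini-batch SGD using the same update as Stochastic_Gradient_Descent
    data = np.random.permutation(np.hstack([y, x]))            # shuffle rows, keeping y with its features
    for start in range(0, len(data), batch_size):
        yb = data[start:start + batch_size, :1]
        xb = data[start:start + batch_size, 1:]
        resid = xb @ w - yb
        grad = np.mean(0.5 * xb * resid / np.sqrt(resid**2 + 4), axis=0).reshape(-1, 1)
        w = w - eta * grad
    return w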

Q6. (b)

From your results in the previous part, choose an appropriate step size (and state your choice), and explain why you made this choice.

Answer to Question 6.(b).

In my opinion, after testing several values, a step size of 0.5 is an appropriate choice.
If the step size is too large, the single-sample updates are too aggressive and the loss oscillates strongly or even moves away from the optimum; if it is too small, the loss decreases too slowly over the six epochs. With a step size of 0.5, the loss settles near its minimum within comparatively few updates.

# Q6. (c)
# Use eta = 0.4 and plot how each of the four weights evolves over the SGD updates
#  -------------------- Main Function --------------------
df_x = train_df[['age','nearestMRT','nConvenience']]
df_y = train_df['price']
x = np.hstack([np.ones([len(train_df),1]),df_x.values])
y = df_y.values.reshape([-1,1])
ww = np.array([1,1,1,1]).reshape([-1,1])

eta = 0.4
w_all,Loss_temp  = Stochastic_Gradient_Descent(ww,x,y,eta)

plt.plot(w_all.T)
plt.title('The progression of each of the four weights over the iterations.')
plt.ylabel('Weights')
plt.xlabel('Iterations')
plt.legend(['w_0','w_1','w_2','w_3'])
plt.show()
print('The final weight vector is :',w_all[:,-1]) # Print out the final weight vector.

# Finally, run your model on the train and test set, and print the achieved losses.
# Train Set
df_train_x = train_df[['age','nearestMRT','nConvenience']]
df_train_y = train_df['price']
train_x = np.hstack([np.ones([len(df_train_x),1]),df_train_x.values])
train_y = df_train_y.values.reshape([-1,1])
loss_train = loss(w_all[:,-1].reshape([-1,1]),train_x,train_y)
print('Loss for Train Set :',loss_train)

# Test Set
df_test_x = test_df[['age','nearestMRT','nConvenience']]
df_test_y = test_df['price']
test_x = np.hstack([np.ones([len(df_test_x),1]),df_test_x.values])
test_y = df_test_y.values.reshape([-1,1])
loss_test = loss(w_all[:,-1].reshape([-1,1]),test_x,test_y)
print('Loss for Test Set :',loss_test)

[Figure: progression of the four weights over the SGD updates with step size 0.4]

The final weight vector is : [ 0.27166775 -0.12645567 -0.17158891  0.20079968]
Loss for Train Set : 0.012240492047852287
Loss for Test Set : 0.01614156189305504

Question 7. Results Analysis

In a few lines, comment on your results in Questions 5 and 6.

Answer:

Both gradient descent and stochastic gradient descent performed well in these experiments, in the sense that each reached a low training loss well within its iteration budget. Overall, though, stochastic gradient descent did better here: it reached lower train and test losses (0.0122 and 0.0161) than gradient descent (0.0306 and 0.0350).

Explain the importance of the step-size in both GD and SGD.

Answer:

The step size controls how far each iteration moves along the negative gradient direction, so setting it properly matters a great deal in both GD and SGD. If it is too large, the updates overshoot, the loss oscillates, and the algorithm may never settle near (or may even move away from) the optimum; if it is too small, each iteration makes only tiny progress and the algorithm cannot get close to convergence within a reasonable number of iterations.
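
As a toy illustration (not from the assignment), consider gradient descent on the one-dimensional function f(w) = w^2, whose gradient is 2w; the step size alone decides whether the iterates diverge, crawl, or converge quickly:

def run(eta, steps=20, w=1.0):
    for _ in range(steps):
        w = w - eta * 2 * w        # gradient descent step on f(w) = w^2
    return w

print(run(1.5))    # step too large: |w| doubles every step and diverges
print(run(0.01))   # step too small: w shrinks very slowly
print(run(0.4))    # well-chosen step: w is already ~1e-14 after 20 steps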

Explain why one might have a preference for GD or SGD.

Answer:

Gradient descent and stochastic gradient descent are both important optimisation algorithms, and each has its own advantages and disadvantages. Full-batch GD follows the exact gradient of the mean loss, so each step points in an accurate descent direction, although for general objectives neither method is guaranteed to reach the global optimum. Its main drawback is cost: every single update requires a pass over the whole dataset, which becomes very slow on large datasets. SGD, by contrast, updates on one observation at a time, so each step is cheap and convergence in wall-clock time is usually much faster; the price is that the individual updates are noisy and the loss is not driven down as steadily as in GD.

Explain why the GD paths look much smoother than the SGD paths.

Answer:

Batch GD averages the gradient over all of the data, so every update follows a stable, consistent direction and its loss and weight paths evolve smoothly. SGD computes each update from a single randomly chosen observation, so every step is a noisy estimate of the true gradient and its paths fluctuate from one update to the next.
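
This can be seen directly from the data: the per-observation gradients scatter widely around their mean (the full-batch gradient). A short sketch reusing x, y and ww from the Question 6 code above:

# Spread of single-observation gradients around the full-batch gradient at the initial weights
resid = x @ ww - y
per_sample = 0.5 * x * resid / np.sqrt(resid**2 + 4)    # one gradient row per observation
print('full-batch gradient :', per_sample.mean(axis=0))
print('per-sample std. dev.:', per_sample.std(axis=0))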

Reposted from blog.csdn.net/GODSuner/article/details/114482142