Five Ways to Derive the Normal Equation

There are at least five different approaches to derive the normal equation. This post aims to provide a living document of the normal equation as well as its interpretations.

Five Ways to Derive the Normal Equation

Notations

$\mathrm{RSS}$ stands for Residual Sum Squared error, $\beta$ denotes parameters in a form of column vector, $X$ is an $N\times p$ matrix where each row is an input vector, $p$ is the number of entries/features for each vector, and $y$ denotes labels in a form of column vector. That is,

X = ⎛ ⎝ ⎜ ⎜ ⎜ ⎜ — — — x 1 x 2 ⋮ x N — — — ⎞ ⎠ ⎟ ⎟ ⎟ ⎟, β = ⎛ ⎝ ⎜ ⎜ ⎜ ⎜ ⎜ β 1 β 2 ⋮ β p ⎞ ⎠ ⎟ ⎟ ⎟ ⎟ ⎟, y = ⎛ ⎝ ⎜ ⎜ ⎜ ⎜ y 1 y 2 ⋮ y N ⎞ ⎠ ⎟ ⎟ ⎟ ⎟

$X = \begin{pmatrix} \text{—} & x_{1} & \text{—} \\ \text{—} & x_{2} & \text{—} \\ & \vdots & \\ \text{—} & x_{N} & \text{—} \end{pmatrix}, \beta = \begin{pmatrix} \beta_{1}\\ \beta_{2} \\ \vdots \\ \beta_{p} \end{pmatrix}, y = \begin{pmatrix} y_{1}\\ y_{2} \\ \vdots \\ y_{N} \end{pmatrix}$

Method 1. Vector Projection onto the Column Space

This is the most intuitive way to understand the normal equation. The optimization of linear regression is equivalent to finding the projection of vector $y$ onto the column space of $X$ . This projection can be understood as the following subspace shown below.

X β = ⎛ ⎝ ⎜ ⎜ ⎜ ⎜ — — — x 1 x 2 ⋮ x N — — — ⎞ ⎠ ⎟ ⎟ ⎟ ⎟ ⎛ ⎝ ⎜ ⎜ ⎜ ⎜ ⎜ β 1 β 2 ⋮ β p ⎞ ⎠ ⎟ ⎟ ⎟ ⎟ ⎟ = ⎛ ⎝ ⎜ ⎜ ⎜ ⎜ ⎜ β 1 x 11 + β 1 x 21 + ⋮ β 1 x N 1 + \dots \dots ⋮ \dots + β p x 1 p + β p x 2 p ⋮ + β p x N p ⎞ ⎠ ⎟ ⎟ ⎟ ⎟ ⎟ = β 1 ⎛ ⎝ ⎜ ⎜ ⎜ ⎜ x 11 x 21 ⋮ x N 1 ⎞ ⎠ ⎟ ⎟ ⎟ ⎟ + \dots + β p ⎛ ⎝ ⎜ ⎜ ⎜ ⎜ ⎜ x 1 p x 2 p ⋮ x N p ⎞ ⎠ ⎟ ⎟ ⎟ ⎟ ⎟

$X\beta=\begin{pmatrix} \text{—} & x_{1} & \text{—} \\ \text{—} & x_{2} & \text{—} \\ & \vdots & \\ \text{—} & x_{N} & \text{—} \end{pmatrix} \begin{pmatrix} \beta_{1}\\ \beta_{2} \\ \vdots \\ \beta_{p} \end{pmatrix}= \begin{pmatrix} \beta_{1}x_{11}+ & \cdots &+\beta_{p}x_{1p} \\ \beta_{1}x_{21}+ & \cdots &+\beta_{p}x_{2p} \\ \vdots& \vdots & \vdots \\ \beta_{1}x_{N1}+ & \cdots &+\beta_{p}x_{Np} \end{pmatrix}\\ =\beta_{1}\begin{pmatrix} x_{11}\\ x_{21} \\ \vdots \\ x_{N1} \end{pmatrix} +\cdots+\beta_{p}\begin{pmatrix} x_{1p}\\ x_{2p} \\ \vdots \\ x_{Np} \end{pmatrix}$

As the projection is denoted by $\widehat{y}=X\beta$ , the optimal configuration of $\beta$ is when the error vector $y-X\beta$ is orthogonal to the column space of $X$ , that is

X T (y - X β) = 0. (1)

$X^{T}(y-X\beta)=0.\tag 1$

Solving this gives:

β = (X T X) - 1 X T y .

$\beta=\left ( X^{T}X \right )^{-1}X^{T}y.$

Here $X^{T}X$ is denoted as the Gram Matrix and $X^{T}y$ is denoted as a moment matrix. More intuitively, the Gram matrix captures the correlations among the features and the moment matrix captures the contributions from each feature to the regression outcome.

Method 2. Direct Matrix Differentiation

This is the most straightforward way to solve the equation by rewriting $S(\beta)$ into a simpler form:

S (β) = (y - X β) T (y - X β) = y T y - β T X T y - y T X β + β T X T X β = y T y - 2 β T X T y + β T X T X β

$S(\beta)=(y-X\beta)^{T}(y-X\beta)=y^{T}y-\beta^{T}X^{T}y-y^{T}X\beta+\beta^{T}X^{T}X\beta\\=y^{T}y-2\beta^{T}X^{T}y+\beta^{T}X^{T}X\beta$

Differentiate $S(\beta)$ w.r.t. $\beta$ :

- 2 y T X + β T (X T X + (X T X) T) = - 2 y T X + 2 β T X T X = 0

$-2y^{T}X+\beta^{T}\left ( X^{T}X+\left ( X^{T}X \right )^{T} \right )=-2y^{T}X+2\beta^{T}X^{T}X=0$

Solving $S(\beta)$ gives:

β = (X T X) - 1 X T y

$\beta=\left ( X^{T}X \right )^{-1}X^{T}y$

Method 3. Matrix Differentiation with Chain-rule

This is the simplest method for a lazy person, as it takes very little effort to reach the solution. The key is to apply the chain-rule:

\partial S ( β ) \partial β = \partial ( y - X β ) T ( y - X β ) \partial ( y - X β ) \partial ( y - X β ) \partial β = - 2 (y - X β) T X = 0

$\frac{\partial S(\beta) }{\partial \beta}=\frac{\partial (y-X\beta)^{T}(y-X\beta) }{\partial (y-X\beta)}\frac{\partial (y-X\beta)}{\partial \beta}=-2(y-X\beta)^{T}X=0$

solving $S(\beta)$ gives:

β = (X T X) - 1 X T y

$\beta=\left ( X^{T}X \right )^{-1}X^{T}y$

This method requires an understanding of matrix differentiation of the quadratic form:

\partial x T W x \partial x = x T (W + W T)

$\frac{\partial x^{T}Wx}{\partial x}= x^{T}(W+W^{T})$

Method 4. Without Matrix Differentiation

We can rewrite $S(\beta)$ as following:

S (β) = ⟨ β, β ⟩ - 2 ⟨ β, (X T X) - 1 X T y ⟩ + ⟨ (X T X) - 1 X T y, (X T X) - 1 X T y ⟩ + C

$S(\beta)=\left \langle \beta, \beta \right \rangle-2\left \langle \beta, (X^{T}X)^{-1}X^{T}y \right \rangle+\left \langle (X^{T}X)^{-1}X^{T}y, (X^{T}X)^{-1}X^{T}y \right \rangle+C$

where $\langle \cdot ,\cdot \rangle$ is the inner product defined by

⟨ x, y ⟩ = x T (X T X) y .

$\langle x,y\rangle =x^{\rm {T}}(\mathbf {X} ^{\rm {T}}\mathbf {X} )y.$

The idea is to rewrite $S(\beta)$ into the form of $S(\beta)=(x-a)^2+b$ such that $x$ can be solved exactly.

Method 5. Statistical Learning Theory

An alternative method to derive the normal equation arises from the statistical learning theory. The aim of this task is to minimize the expected prediction error given by:

E P E (β) = \int (y - x T β) P r (d x, d y)

$\mathrm{EPE}(\beta)=\int (y-x^{T}\beta)\mathrm{Pr}(dx, dy)$
where

x $x$ stands for a column vector of random variables,

y $y$ denotes the target random variable, and

β $\beta$ denotes a column vector of parameters (Note the definitions are different from the notations before).
Differentiating

EPE(β) $\mathrm{EPE}(\beta)$ w.r.t.

β $\beta$ gives:

\partial E P E ( β ) \partial β = \int 2 (y - x T β) (- 1) x T P r (d x, d y)

$\frac{\partial \mathrm{EPE}(\beta)}{\partial \beta}=\int 2(y-x^{T}\beta)(-1)x^{T}\mathrm{Pr}(dx, dy)$

Before we proceed, let’s check the dimensions to make sure the partial derivative is correct. $\mathrm{EPE}$ is the expected error: a $1\times1$ vector. $\beta$ is a column vector that is $N\times 1$ . According to the Jacobian in vector calculus, the resulting partial derivative should take the form

\partial E P E \partial β = (\partial E P E \partial β 1, \partial E P E \partial β 2, \dots, \partial E P E \partial β N)

$\frac{\partial EPE}{\partial \mathbf{\beta}}= \left (\frac{\partial EPE}{\partial \beta_{1}}, \frac{\partial EPE}{\partial \beta_{2}}, \dots, \frac{\partial EPE}{\partial \beta_{N}} \right )$

which is a $1\times N$ vector. Looking back at the right-hand side of the equation above, we find $2(y-x^{T}\beta)(-1)$ being a constant while $x^{T}$ being a row vector, resuling the same $1\times N$ dimension. Thus, we conclude the above partial derivative is correct. This derivative mirrors the relationship between the expected error and the way to adjust parameters so as to reduce the error. To understand why, imagine $2(y-x^{T}\beta)(-1)$ being the errors incurred by the current parameter configurations $\beta$ and $x^{T}$ being the values of the input attributes. The resulting derivative equals to the error times the scales of each input attribute. Another way to make this point is: the contribution of error from each parameter $\beta_{i}$ has a monotonic relationship with the error $2(y-x^{T}\beta)(-1)$ as well as the scalar $x^{T}$ that was multiplied to each $\beta_{i}$ .

Now, let’s go back to the derivation. Because $2(y-x^{T}\beta)(-1)$ is $1\times1$ , we can rewrite it with its transpose:

$\frac{\partial \mathrm{EPE}(\beta)}{\partial \beta}=\int 2(y-x^{T}\beta)^{T}(-1)x^{T}\mathrm{Pr}(dx, dy)$ .

Solving $\frac{\partial \mathrm{EPE}(\beta)}{\partial \beta}=0$ gives:

E [y T x T - β T x x T] = 0 E [β T x x T] = E [y T x T] E [x x T β] = E [x y] β = E [x x T] - 1 E [x y] .

$\\E\left [y^{T}x^{T}-\beta^{T}xx^{T} \right ]=0 \\ E\left [\beta^{T}xx^{T} \right ]=E\left [y^{T}x^{T} \right ] \\ E\left [xx^{T}\beta \right ]=E\left [xy \right ] \\\beta=E\left [xx^{T} \right ]^{-1}E\left [ xy \right ].$

References
[1] Wikipedia: Linear Least Squares
[2] The elements of statistical Learning