Machine Learning | Andrew Ng - Formula Summary (1) [Plain-Language Derivation Edition]


Multivariate Linear Regression

[Original course: https://www.coursera.org/learn/machine-learning/home/welcome. This series of notes only covers the algorithm formulas that appear in the course, because for many algorithms the course states the result without deriving it. My own math is weak; calculus and probability and statistics, which were painful enough the first time around, now feel as if I had never learned them at all. So I work through the derivations myself and record them as clearly as I can, in the hope that later readers at a similar math level will find them useful as a reference.]

Notation:
$x_j^{(i)}$ = value of feature $j$ in the $i$-th training example
$x^{(i)}$ = the input (features) of the $i$-th training example
$m$ = the number of training examples
$n$ = the number of features

$x^{(i)} \in \mathbb{R}^{n}$; $1 \le j \le n$; $1 \le i \le m$
Hypothesis:

$h_\theta(x^{(i)}) = \theta_0 + \theta_1 x_1^{(i)} + \theta_2 x_2^{(i)} + \theta_3 x_3^{(i)} + \cdots + \theta_n x_n^{(i)}$

Assume: $x_0^{(i)} = 1$, so that $x^{(i)} \in \mathbb{R}^{n+1}$.
Vector representation of the hypothesis:
$h_\theta(x) = \begin{bmatrix} \theta_0 & \theta_1 & \cdots & \theta_n \end{bmatrix} \begin{bmatrix} x_0 \\ x_1 \\ \vdots \\ x_n \end{bmatrix} = \theta^T x$

Cost function:

$J(\theta_0, \theta_1, \ldots, \theta_n) = \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2$
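To make the notation concrete, here is a minimal NumPy sketch of the hypothesis and cost function above. The names X, y, theta are my own: X is assumed to be an m x (n+1) matrix whose first column is all ones, y a length-m vector, and theta a length-(n+1) vector.

```python
import numpy as np

def hypothesis(theta, X):
    """h_theta(x) = theta^T x for every row of X (X has x_0 = 1 in its first column)."""
    return X @ theta

def cost(theta, X, y):
    """J(theta) = 1/(2m) * sum_i (h_theta(x^(i)) - y^(i))^2."""
    m = len(y)
    errors = hypothesis(theta, X) - y
    return (errors @ errors) / (2 * m)
```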

Gradient Descent:

(with $x_0^{(i)} = 1$)

$\theta_j := \theta_j - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)}$
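Continuing the sketch above, the update rule can be written as a short vectorized loop; alpha and num_iters are hypothetical tuning values, not prescribed by the course.

```python
def gradient_descent(theta, X, y, alpha=0.01, num_iters=1000):
    """Repeat: theta_j := theta_j - alpha * (1/m) * sum_i (h - y) * x_j, for all j at once."""
    m = len(y)
    for _ in range(num_iters):
        gradient = X.T @ (X @ theta - y) / m  # vector of partial derivatives dJ/dtheta_j
        theta = theta - alpha * gradient      # simultaneous update of every theta_j
    return theta
```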

Derivation of the partial derivative:

e.g., $\left[ (ax+b)^n \right]' = n(ax+b)^{n-1} \cdot (ax+b)' = a\,n\,(ax+b)^{n-1}$


$\frac{\partial}{\partial \theta_j} \left[ \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2 \right] = 2 \left( h_\theta(x^{(i)}) - y^{(i)} \right) \cdot \frac{\partial}{\partial \theta_j} \left( h_\theta(x^{(i)}) - y^{(i)} \right)$
$= 2 \left( h_\theta(x^{(i)}) - y^{(i)} \right) \cdot \frac{\partial}{\partial \theta_j} \left( \theta_0 x_0^{(i)} + \theta_1 x_1^{(i)} + \cdots + \theta_j x_j^{(i)} + \cdots + \theta_n x_n^{(i)} - y^{(i)} \right)$
(with respect to $\theta_j$, every term of $\theta_0 x_0^{(i)} + \theta_1 x_1^{(i)} + \cdots + \theta_n x_n^{(i)} - y^{(i)}$ is a constant except $\theta_j x_j^{(i)}$, and the derivative of a constant is zero)
$= 2 \left( h_\theta(x^{(i)}) - y^{(i)} \right) \cdot \frac{\partial}{\partial \theta_j} \left( \theta_j x_j^{(i)} + \mathrm{constant} \right)$
$= 2 \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)}$

$\frac{\partial}{\partial \theta_j} \left[ \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2 \right] = \frac{1}{2m} \cdot 2 \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)} = \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)}$

Normal Equation

(i.e., solving for $\theta$ analytically.) In the "Normal Equation" method, we minimize the cost function $J$ by explicitly taking its partial derivatives with respect to each $\theta_j$ and setting them to zero. This gives the optimal $\theta$ directly, without an iterative training process. The formula is:

$\theta = \left( X^T X \right)^{-1} X^T y$
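A one-line sketch of the normal equation, reusing the NumPy setup above. I use np.linalg.pinv rather than a plain inverse so the formula still produces an answer when $X^T X$ happens to be noninvertible (see the note below); that substitution is my own choice, not part of the formula.

```python
def normal_equation(X, y):
    """theta = (X^T X)^(-1) X^T y, with pinv tolerating a singular X^T X."""
    return np.linalg.pinv(X.T @ X) @ X.T @ y
```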

There is no need to do feature scaling with the normal equation. The following is a comparison of gradient descent and the normal equation:

- Learning rate: gradient descent needs a learning rate $\alpha$ to be chosen; the normal equation needs none.
- Iteration: gradient descent needs many iterations to train; the normal equation has no training loop.
- Complexity: gradient descent is $O(kn^2)$; the normal equation is $O(n^3)$, because it must compute the inverse of $X^T X$.
- Scale: gradient descent works well even when $n$ is large; the normal equation becomes very slow when $n$ is very large.

Note: sometimes $X^T X$ is noninvertible. The common causes are:
- Redundant features, i.e. features that are linearly dependent (e.g. $x_2 = 3x_1 + 2$ or $x_2 = 5.5\,x_1$; but $x_2 = x_1^2$ is nonlinear and does not count).
- Too many features (e.g. $m \le n$: the number of features is at least the number of training examples). In this case, delete some features or use regularization (to be explained in a later lesson).
[Deleting redundant features, i.e. reducing the feature dimensionality, can be done with PCA (principal component analysis), which is covered later in the course.]

Logistic Regression

$h_\theta(x) = g(\theta^T x)$
$z = \theta^T x$
$g(z) = \frac{1}{1 + e^{-z}}$
Combined: $h_\theta(x) = \frac{1}{1 + e^{-\theta^T x}}$

[Figure: plot of the sigmoid function; note its horizontal asymptotes at 0 and 1.]
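A minimal sketch of the sigmoid and the logistic hypothesis, under the same X/theta layout as in the linear case:

```python
def sigmoid(z):
    """g(z) = 1 / (1 + e^(-z)); output lies strictly between 0 and 1."""
    return 1.0 / (1.0 + np.exp(-z))

def logistic_hypothesis(theta, X):
    """h_theta(x) = g(theta^T x): the estimated probability that y = 1."""
    return sigmoid(X @ theta)
```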

Cost Function:

In logistic regression we cannot reuse the cost function from linear regression, because the hypothesis is nonlinear and the resulting cost surface would be wavy, with many local optima; in other words, it would not be a convex function. Instead, the logistic regression cost function is defined as:

$J(\theta) = -\frac{1}{m} \sum_{i=1}^{m} \left[ y^{(i)} \log\left( h_\theta(x^{(i)}) \right) + \left( 1 - y^{(i)} \right) \log\left( 1 - h_\theta(x^{(i)}) \right) \right]$

If $y^{(i)} = 0$: $\quad J(\theta) = -\frac{1}{m} \sum_{i=1}^{m} \left( 1 - y^{(i)} \right) \log\left( 1 - h_\theta(x^{(i)}) \right)$
If $y^{(i)} = 1$: $\quad J(\theta) = -\frac{1}{m} \sum_{i=1}^{m} y^{(i)} \log\left( h_\theta(x^{(i)}) \right)$

Vectorized form:
$h = g(X\theta)$
$J(\theta) = \frac{1}{m} \left( -y^T \log(h) - (1 - y)^T \log(1 - h) \right)$
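The vectorized form translates almost literally into NumPy. This is only a sketch: it reuses the sigmoid helper above and assumes h never reaches exactly 0 or 1 (otherwise the log terms would need clipping).

```python
def logistic_cost(theta, X, y):
    """J(theta) = 1/m * (-y^T log(h) - (1 - y)^T log(1 - h)), with h = g(X theta)."""
    m = len(y)
    h = sigmoid(X @ theta)
    return (-y @ np.log(h) - (1 - y) @ np.log(1 - h)) / m
```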

Gradient Descent:

$\alpha$ = learning rate

Repeat {

$\quad \theta_j := \theta_j - \frac{\alpha}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)}$

}

Derivation of the partial derivative:

Note: every $\log(x)$ in the course formulas means $\ln(x)$, i.e. $\log_e(x)$.
Let $u = f(x)$. By the chain rule for composite functions, the derivations below will use the following rules:
$[g(u)]' = g'(u) \, f'(x)$
$[a f(x)]' = a f'(x)$
$[\log(f(x))]' = \frac{1}{f(x)} f'(x)$
$[f(x)^a]' = a \, f(x)^{a-1} \, f'(x)$
$[e^{f(x)}]' = e^{f(x)} f'(x)$
$[a + b f(x)]' = b f'(x)$

$\frac{\partial}{\partial \theta_j} \left[ h_\theta(x^{(i)}) \right] = \left( \frac{1}{1 + e^{-\theta^T x^{(i)}}} \right)' = \left[ \left( 1 + e^{-\theta^T x^{(i)}} \right)^{-1} \right]' = -\frac{1}{\left( 1 + e^{-\theta^T x^{(i)}} \right)^2} \cdot \left( 1 + e^{-\theta^T x^{(i)}} \right)'$
$= -\frac{1}{\left( 1 + e^{-\theta^T x^{(i)}} \right)^2} \cdot \left( e^{-\theta^T x^{(i)}} \right)' = -\frac{1}{\left( 1 + e^{-\theta^T x^{(i)}} \right)^2} \cdot e^{-\theta^T x^{(i)}} \cdot \left( -\theta^T x^{(i)} \right)'$
(as with the linear regression cost, every term of $\theta_0 x_0^{(i)} + \theta_1 x_1^{(i)} + \cdots + \theta_n x_n^{(i)}$ is a constant with respect to $\theta_j$ except $\theta_j x_j^{(i)}$, and the derivative of a constant is zero, so $\left( -\theta^T x^{(i)} \right)' = -x_j^{(i)}$)
$= \frac{1}{\left( 1 + e^{-\theta^T x^{(i)}} \right)^2} \cdot e^{-\theta^T x^{(i)}} \cdot x_j^{(i)}$
$= \frac{1}{1 + e^{-\theta^T x^{(i)}}} \cdot \frac{e^{-\theta^T x^{(i)}}}{1 + e^{-\theta^T x^{(i)}}} \cdot x_j^{(i)}$
$= \frac{1}{1 + e^{-\theta^T x^{(i)}}} \cdot \frac{\left( e^{-\theta^T x^{(i)}} + 1 \right) - 1}{1 + e^{-\theta^T x^{(i)}}} \cdot x_j^{(i)}$
(recall: $h_\theta(x) = \frac{1}{1 + e^{-\theta^T x}}$)
$= h_\theta(x^{(i)}) \left( 1 - h_\theta(x^{(i)}) \right) x_j^{(i)}$
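The identity just derived, $\frac{\partial h}{\partial \theta_j} = h_\theta(x)\left(1 - h_\theta(x)\right) x_j$, can be spot-checked with a finite difference; the sample values below are arbitrary and only for illustration.

```python
theta = np.array([0.5, -1.2, 0.3])
x_i = np.array([1.0, 2.0, -0.7])  # one training example, with x_0 = 1
j = 1                             # which component theta_j to perturb
eps = 1e-6

h = sigmoid(theta @ x_i)
analytic = h * (1 - h) * x_i[j]   # h_theta(x) * (1 - h_theta(x)) * x_j

theta_plus, theta_minus = theta.copy(), theta.copy()
theta_plus[j] += eps
theta_minus[j] -= eps
numeric = (sigmoid(theta_plus @ x_i) - sigmoid(theta_minus @ x_i)) / (2 * eps)

print(analytic, numeric)          # the two values agree to around 1e-10
```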

Combining:

$\frac{\partial}{\partial \theta_j} \left[ y^{(i)} \log\left( h_\theta(x^{(i)}) \right) + \left( 1 - y^{(i)} \right) \log\left( 1 - h_\theta(x^{(i)}) \right) \right]$

$= y^{(i)} \frac{\left( h_\theta(x^{(i)}) \right)'}{h_\theta(x^{(i)})} + \left( 1 - y^{(i)} \right) \frac{1}{1 - h_\theta(x^{(i)})} \cdot \left( 1 - h_\theta(x^{(i)}) \right)'$
$= y^{(i)} \frac{\left( h_\theta(x^{(i)}) \right)'}{h_\theta(x^{(i)})} - \left( 1 - y^{(i)} \right) \frac{\left( h_\theta(x^{(i)}) \right)'}{1 - h_\theta(x^{(i)})}$
$= \left( h_\theta(x^{(i)}) \right)' \cdot \frac{y^{(i)} - h_\theta(x^{(i)})}{h_\theta(x^{(i)}) \left( 1 - h_\theta(x^{(i)}) \right)}$
$= -\left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)} \qquad$ (substituting $\left( h_\theta(x^{(i)}) \right)' = h_\theta(x^{(i)}) \left( 1 - h_\theta(x^{(i)}) \right) x_j^{(i)}$)

Because of the leading $-\frac{1}{m}$ in $J(\theta)$, the sign flips when we differentiate the full cost:

$\theta_j := \theta_j - \alpha \left[ \frac{\partial}{\partial \theta_j} J(\theta) \right]$
$\theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} \left[ -\frac{1}{m} \sum_{i=1}^{m} \left[ y^{(i)} \log\left( h_\theta(x^{(i)}) \right) + \left( 1 - y^{(i)} \right) \log\left( 1 - h_\theta(x^{(i)}) \right) \right] \right]$
$\theta_j := \theta_j - \frac{\alpha}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)}$
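Putting the pieces together, here is a sketch of gradient descent for logistic regression, reusing the sigmoid helper above. Note the update has exactly the same shape as in linear regression; only $h_\theta$ has changed. As before, alpha and num_iters are hypothetical values.

```python
def logistic_gradient_descent(theta, X, y, alpha=0.1, num_iters=1000):
    """Repeat: theta_j := theta_j - (alpha/m) * sum_i (h - y) * x_j, with h = g(X theta)."""
    m = len(y)
    for _ in range(num_iters):
        h = sigmoid(X @ theta)
        theta = theta - (alpha / m) * (X.T @ (h - y))
    return theta
```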


Glossary (English-Chinese):
Multivariate Linear Regression: 多元线性回归
cost function: 损失函数
Gradient Descent: 梯度下降
Logistic Regression: 逻辑回归


Reposted from blog.csdn.net/weixin_40920228/article/details/80633411