I. Logistic Regression for Binary Classification

1.1 Hypothesis Representation

$\large h_\theta (x) = g ( \theta^T x )$

$\large z = \theta^T x$

$\large g(z) = \dfrac{1}{1 + e^{-z}}$

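The difference from linear regression is that the prediction function $\large h_\theta (x) = \theta^T x$ is replaced by $\large h_\theta (x) = g ( \theta^T x )$, which restricts the predicted value to the interval (0, 1) and makes it suitable for binary classification. The sigmoid function is shown in the figure below: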

Figure: the sigmoid function

The prediction $\large h_\theta (x)$ is the estimated probability that the output is 1. For example, $\large h_\theta (x) = 0.7$ means a 70% probability that the output is 1.

$\large h_\theta(x) = P(y=1 | x ; \theta) = 1 - P(y=0 | x ; \theta)$

$\large P(y = 0 | x;\theta) + P(y = 1 | x ; \theta) = 1$
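The definitions above can be sketched in a few lines of Python. This is a minimal illustration using only the standard library; the function names `sigmoid` and `hypothesis` and the parameter values are assumptions made for the example, not part of the course material.

```python
import math

def sigmoid(z):
    """Logistic function g(z) = 1 / (1 + e^(-z)), written to avoid overflow."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)          # z < 0, so exp(z) cannot overflow
    return ez / (1.0 + ez)

def hypothesis(theta, x):
    """h_theta(x) = g(theta^T x); x must include the bias feature x_0 = 1."""
    z = sum(t * xi for t, xi in zip(theta, x))
    return sigmoid(z)

theta = [-1.0, 2.0]           # hypothetical parameters
x = [1.0, 0.5]                # x_0 = 1, x_1 = 0.5, so z = -1 + 2 * 0.5 = 0
p1 = hypothesis(theta, x)     # estimate of P(y = 1 | x; theta)
p0 = 1.0 - p1                 # estimate of P(y = 0 | x; theta)
```

Since z = 0 in this example, `p1` is 0.5, and the two probabilities sum to 1 by construction.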

1.2 Decision Boundary

To handle 0/1 classification, we turn the output of the hypothesis function into a discrete prediction:

$\large h_\theta(x) \geq 0.5 \rightarrow y = 1$

$\large h_\theta(x) < 0.5 \rightarrow y = 0$

Because $g(z) \geq 0.5$ exactly when $z \geq 0$, this is equivalent to:

$\large \begin{align*} &\theta^T x \geq 0 \Rightarrow y = 1\\ &\theta^T x < 0 \Rightarrow y = 0 \end{align*}$

So the decision boundary is the hyperplane $\large \theta^T x=0$. It is the surface that separates the region predicted as y = 0 from the region predicted as y = 1.

Example:

$\large \begin{align*} & \theta =\begin{bmatrix} 5\\ -1\\ 0 \end{bmatrix} \\ & y = 1 \; if \; 5 + (-1) x_1 + 0 x_2 \geq 0 \\ & 5 - x_1 \geq 0 \\ & - x_1 \geq -5 \\& x_1 \leq 5 \end{align*}$

Here the decision boundary is the vertical line $\large x_1=5$: y = 1 on and to the left of the line ($x_1 \leq 5$), and y = 0 to the right of it.
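The inequality $x_1 \leq 5$ can be checked numerically. The sketch below applies the rule $\theta^T x \geq 0 \Rightarrow y = 1$ with the same $\theta$ as in the example; the function name `predict` and the test points are assumptions chosen for illustration.

```python
def predict(theta, x):
    """Return 1 if theta^T x >= 0 (equivalently h_theta(x) >= 0.5), else 0."""
    z = sum(t * xi for t, xi in zip(theta, x))
    return 1 if z >= 0 else 0

theta = [5.0, -1.0, 0.0]                       # theta from the example above

left = predict(theta, [1.0, 3.0, 7.0])         # x_1 = 3, left of the line
on_line = predict(theta, [1.0, 5.0, -2.0])     # x_1 = 5, exactly on the line
right = predict(theta, [1.0, 6.0, 7.0])        # x_1 = 6, right of the line
```

Points with $x_1 \leq 5$ are classified as 1 and points with $x_1 > 5$ as 0; $x_2$ has no effect because its coefficient is 0.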

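Note: the argument $z$ of $g(z)$ does not have to be linear in the input features. It can include nonlinear terms so that the boundary fits nonlinear shapes such as a circle (e.g. $z=\theta_0+\theta_1x_1^2+\theta_2x_2^2$).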
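As a rough sketch of a circular boundary, the features can be mapped to $(1, x_1^2, x_2^2)$ before the same linear rule is applied. The parameter choice $\theta = (-1, 1, 1)$, which gives the unit circle $x_1^2 + x_2^2 = 1$, and the helper names are assumptions for this example.

```python
def circle_features(x1, x2):
    """Map raw inputs to the features [1, x1^2, x2^2]."""
    return [1.0, x1 * x1, x2 * x2]

def predict(theta, features):
    """Return 1 if theta^T features >= 0, else 0."""
    z = sum(t * f for t, f in zip(theta, features))
    return 1 if z >= 0 else 0

theta = [-1.0, 1.0, 1.0]   # hypothetical: z = -1 + x1^2 + x2^2

inside = predict(theta, circle_features(0.5, 0.5))    # 0.25 + 0.25 < 1
outside = predict(theta, circle_features(1.0, 1.0))   # 1 + 1 > 1
```

The model is still linear in $\theta$; only the features are nonlinear, which is why the rest of the method carries over unchanged.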

1.3 Cost Function

Model:

$\begin{align*}& J(\theta) = \dfrac{1}{m} \sum_{i=1}^m \mathrm{Cost}(h_\theta(x^{(i)}),y^{(i)}) \end{align*}$

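Note: the cost function for logistic regression cannot be written the same way as for linear regression. Because $\large h_\theta (x) = g ( \theta^T x )$ is nonlinear, the squared-error $\large J(\theta)$ becomes wavy, with many local optima. In other words, it is not a convex function, so minimizing it is not guaranteed to yield the optimal parameters $\large \theta$.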

Figure: a non-convex $J(\theta)$ with several local optima

Designing the logistic cost function:

$\large \begin{align*}& J(\theta) = \dfrac{1}{m} \sum_{i=1}^m \mathrm{Cost}(h_\theta(x^{(i)}),y^{(i)}) \\ & \mathrm{Cost}(h_\theta(x),y) = -\log(h_\theta(x)) \; & \text{if y = 1} \\ & \mathrm{Cost}(h_\theta(x),y) = -\log(1-h_\theta(x)) \; & \text{if y = 0}\end{align*}$

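When y = 1, plotting $\large J(\theta) \ vs \ h_\theta(x)$ gives the curve $-\log(h_\theta(x))$: the cost is 0 at $h_\theta(x) = 1$ and grows without bound as $h_\theta(x) \rightarrow 0$.

When y = 0, the curve is $-\log(1-h_\theta(x))$: the cost is 0 at $h_\theta(x) = 0$ and grows without bound as $h_\theta(x) \rightarrow 1$. In summary: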

$\large \begin{align*}& \mathrm{Cost}(h_\theta(x),y) = 0 \text{ if } h_\theta(x) = y \\ & \mathrm{Cost}(h_\theta(x),y) \rightarrow \infty \text{ if } y = 0 \; \mathrm{and} \; h_\theta(x) \rightarrow 1 \\ & \mathrm{Cost}(h_\theta(x),y) \rightarrow \infty \text{ if } y = 1 \; \mathrm{and} \; h_\theta(x) \rightarrow 0 \newline \end{align*}$

Defining the cost this way guarantees that $J(\theta)$ is convex for logistic regression.
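A small numeric sketch makes the penalty structure concrete. The function name `cost` and the sample probabilities are assumptions chosen for illustration; `h` must lie strictly between 0 and 1.

```python
import math

def cost(h, y):
    """Per-example logistic cost: -log(h) if y == 1, -log(1 - h) if y == 0."""
    return -math.log(h) if y == 1 else -math.log(1.0 - h)

confident_right = cost(0.99, 1)   # small cost: prediction agrees with the label
unsure = cost(0.5, 1)             # equals log(2)
confident_wrong = cost(0.01, 1)   # large cost: confident but wrong
```

The cost rises steeply as a confident prediction moves away from the true label, which is the behavior the limits above describe.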

1.3.1 Simplified Cost Function and Gradient Descent

The y = 0 and y = 1 cases can be merged into a single expression:

$\large \mathrm{Cost}(h_\theta(x),y)=-y\log(h_\theta(x))-(1-y)\log(1-h_\theta(x))$

The full cost over all m training examples is:

$\large \begin{align*}& J(\theta)=-\frac{1}{m} \sum_{i=1}^{m}\left[y^{(i)}\log(h_\theta(x^{(i)}))+(1-y^{(i)})\log(1-h_\theta(x^{(i)}))\right]\end{align*}$

Vectorized:

$\large \begin{align*} & h = g(X\theta)\\ & J(\theta) = \frac{1}{m} \cdot \left(-y^{T}\log(h)-(1-y)^{T}\log(1-h)\right) \end{align*}$
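The averaged cost can be sketched as follows. This is a plain-Python illustration under assumed names (`cost_function`, a tiny made-up dataset); each row of `X` carries the bias feature $x_0 = 1$.

```python
import math

def sigmoid(z):
    """Logistic function, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def cost_function(theta, X, y):
    """J(theta) = (1/m) * sum(-y*log(h) - (1-y)*log(1-h)) over all examples."""
    m = len(y)
    total = 0.0
    for xi, yi in zip(X, y):
        h = sigmoid(sum(t * v for t, v in zip(theta, xi)))
        total += -yi * math.log(h) - (1 - yi) * math.log(1.0 - h)
    return total / m

# Made-up data: one feature plus the bias column.
X = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]
y = [0, 0, 1, 1]

J_zero = cost_function([0.0, 0.0], X, y)    # every h is 0.5, so J = log(2)
J_good = cost_function([-3.0, 2.0], X, y)   # boundary at x_1 = 1.5
```

With $\theta = 0$ every prediction is 0.5 and the cost is $\log 2$; parameters that separate the two classes give a lower cost.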

1.4 Gradient Descent

From the general gradient descent update:

$\large \begin{align*}& Repeat \; \lbrace \\ & \; \theta_j := \theta_j - \alpha \dfrac{\partial}{\partial \theta_j}J(\theta) \\ & \rbrace\end{align*}$

we obtain:

$\large \begin{align*} & Repeat \; \lbrace \\ & \; \theta_j := \theta_j - \frac{\alpha}{m} \sum_{i=1}^m (h_\theta(x^{(i)}) - y^{(i)}) x_j^{(i)} \\ & \rbrace \end{align*}$
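The step between the two updates is the partial derivative of $J(\theta)$. A sketch of that derivation for a single example, using the sigmoid identity $g'(z) = g(z)(1-g(z))$ and therefore $\frac{\partial h_\theta(x)}{\partial \theta_j} = h_\theta(x)(1-h_\theta(x))\,x_j$:

$\large \begin{align*} & \frac{\partial}{\partial \theta_j}\mathrm{Cost}(h_\theta(x),y) = -\left(\frac{y}{h_\theta(x)} - \frac{1-y}{1-h_\theta(x)}\right) h_\theta(x)(1-h_\theta(x))\,x_j \\ & = -\left(y(1-h_\theta(x)) - (1-y)h_\theta(x)\right)x_j \\ & = (h_\theta(x) - y)\,x_j \end{align*}$

Averaging over the m examples gives $\frac{\partial}{\partial \theta_j}J(\theta) = \frac{1}{m}\sum_{i=1}^m (h_\theta(x^{(i)}) - y^{(i)})\,x_j^{(i)}$, which is the quantity multiplied by $\alpha$ in the update.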

Vectorized:

$\large \theta:=\theta-\frac{\alpha}{m}X^T(g(X\theta)-\vec{y})$
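The update rule can be sketched as a short training loop. The dataset, learning rate, iteration count, and function names below are assumptions for illustration only; a practical implementation would normally use a numerical library and a convergence check.

```python
import math

def sigmoid(z):
    """Logistic function, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def dot(a, b):
    return sum(p * q for p, q in zip(a, b))

def cost_function(theta, X, y):
    """Average logistic cost J(theta)."""
    return sum(-yi * math.log(sigmoid(dot(theta, xi)))
               - (1 - yi) * math.log(1.0 - sigmoid(dot(theta, xi)))
               for xi, yi in zip(X, y)) / len(y)

def gradient_step(theta, X, y, alpha):
    """One simultaneous update: theta_j -= (alpha/m) * sum((h - y) * x_j)."""
    m = len(y)
    errors = [sigmoid(dot(theta, xi)) - yi for xi, yi in zip(X, y)]
    grad = [sum(e * xi[j] for e, xi in zip(errors, X)) / m
            for j in range(len(theta))]
    return [t - alpha * g for t, g in zip(theta, grad)]

# Made-up data: one feature plus the bias column.
X = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]
y = [0, 0, 1, 1]

theta = [0.0, 0.0]
initial_cost = cost_function(theta, X, y)
for _ in range(5000):                       # assumed iteration budget
    theta = gradient_step(theta, X, y, alpha=0.5)
final_cost = cost_function(theta, X, y)

predictions = [1 if dot(theta, xi) >= 0 else 0 for xi in X]
```

All components of `theta` are updated from the same `errors` list, which matches the simultaneous update implied by the vectorized form.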

II. Multiclass Classification (one-vs-all)

2.1 Model

Now the label can take more than two values, and there is one prediction function per class:

$\large \begin{align*}& y \in \lbrace0, 1 ... n\rbrace \\& h_\theta^{(0)}(x) = P(y = 0 | x ; \theta) \\& h_\theta^{(1)}(x) = P(y = 1 | x ; \theta) \\& \cdots \\& h_\theta^{(n)}(x) = P(y = n | x ; \theta) \\& \mathrm{prediction} = \max_i( h_\theta ^{(i)}(x) )\\\end{align*}$

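For a multiclass problem, we pick one class at a time and lump all the other classes into a single second class, which turns the problem into a binary one. We repeat this for every class, training one classifier per class, and predict the class whose $\large h_\theta ^{(i)}(x)$ is highest.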

Example: with three classes, three binary classifiers are trained, one per class.
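The two halves of one-vs-all, relabeling for training and taking the most probable class at prediction time, can be sketched as below. The parameter vectors in `thetas` are hypothetical values standing in for three already-trained classifiers, and the helper names are assumptions for this example.

```python
import math

def sigmoid(z):
    """Logistic function, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def relabel(y, k):
    """Binary labels for classifier k: 1 for class k, 0 for every other class."""
    return [1 if yi == k else 0 for yi in y]

def predict_class(thetas, x):
    """Return the class i whose h_theta^(i)(x) is highest."""
    scores = {k: sigmoid(sum(t * v for t, v in zip(theta, x)))
              for k, theta in thetas.items()}
    return max(scores, key=scores.get)

# Hypothetical parameters, one vector per class; x = [1, x_1, x_2].
thetas = {
    0: [2.0, -1.0, -1.0],   # high when x_1 + x_2 is small
    1: [-2.0, 1.0, 0.0],    # high when x_1 is large
    2: [-2.0, 0.0, 1.0],    # high when x_2 is large
}

binary_labels = relabel([0, 1, 2, 1], 1)         # training labels for class 1
near_origin = predict_class(thetas, [1.0, 0.0, 0.0])
large_x1 = predict_class(thetas, [1.0, 4.0, 0.0])
large_x2 = predict_class(thetas, [1.0, 0.0, 4.0])
```

Each binary classifier would be trained on its own `relabel(y, k)` labels with the same gradient descent as in the binary case.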

2.2 Overfitting

Remedies:

1) Reduce the number of features:

Manually select which features to keep.
Use a model selection algorithm (studied later in the course).

2) Regularization

Keep all the features, but reduce the magnitude of parameters $\theta_j$.
Regularization works well when we have a lot of slightly useful features.

2.2.1 Regularization

Linear regression:

Cost function with regularization:

$\large \min_\theta\ \frac{1}{2m}\left[\sum _{i=1}^m(h_\theta(x^{(i)})-y^{(i)})^2+\lambda\sum _{j=1}^n \theta_j^2\right]$

Here λ is a hyperparameter that has to be tuned by hand.

Gradient descent with regularization:

$\large \begin{align*} & \text{Repeat}\ \lbrace \\ & \ \ \ \ \theta_0 := \theta_0 - \alpha\ \frac{1}{m}\ \sum_{i=1}^m (h_\theta(x^{(i)}) - y^{(i)})x_0^{(i)} \\ & \ \ \ \ \theta_j := \theta_j - \alpha\ \left[ \left( \frac{1}{m}\ \sum_{i=1}^m (h_\theta(x^{(i)}) - y^{(i)})x_j^{(i)} \right) + \frac{\lambda}{m}\theta_j \right] &\ \ \ \ \ \ \ \ \ \ j \in \lbrace 1,2...n\rbrace\newline & \rbrace \end{align*}$

Here $\large \frac{\lambda}{m}\theta_j$ is the term contributed by regularization. Note that $\theta_0$ is not regularized.
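One regularized update for linear regression can be sketched as follows. The function name `regularized_step` and the numbers are assumptions chosen so the effect of the penalty is easy to see.

```python
def regularized_step(theta, X, y, alpha, lam):
    """One gradient step for linear regression; theta_0 is not penalized."""
    m = len(y)
    # For linear regression, h_theta(x) = theta^T x.
    errors = [sum(t * v for t, v in zip(theta, xi)) - yi
              for xi, yi in zip(X, y)]
    new_theta = []
    for j, t in enumerate(theta):
        grad = sum(e * xi[j] for e, xi in zip(errors, X)) / m
        if j > 0:
            grad += (lam / m) * t      # regularization term, j >= 1 only
        new_theta.append(t - alpha * grad)
    return new_theta

# Made-up data that theta = [0, 1] fits exactly, so all errors are zero.
X = [[1.0, 1.0], [1.0, 2.0]]
y = [1.0, 2.0]

stepped = regularized_step([0.0, 1.0], X, y, alpha=0.1, lam=2.0)
```

Because the data term vanishes here, the step only shrinks $\theta_1$ by the factor $1 - \alpha\frac{\lambda}{m} = 0.9$ and leaves $\theta_0$ unchanged.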

Logistic regression:

Cost function with regularization:

$\large \begin{align*} & J(\theta)=-\frac{1}{m}\sum_{i=1}^{m}\left[y^{(i)}\log(h_\theta(x^{(i)}))+(1-y^{(i)})\log(1-h_\theta(x^{(i)}))\right]+\frac{\lambda}{2m}\sum_{j=1}^{n}\theta_j^2\end{align*}$

where $\large \frac{\lambda}{2m}\sum_{j=1}^{n}\theta_j^2$ is the regularization term. The sum starts at j = 1, so $\theta_0$ is not penalized.
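The regularized logistic cost can be sketched by adding the penalty to the averaged log loss. The function name `regularized_cost` and the dataset are assumptions made for illustration.

```python
import math

def sigmoid(z):
    """Logistic function, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def regularized_cost(theta, X, y, lam):
    """Averaged log loss plus (lam / 2m) * sum of theta_j^2 for j >= 1."""
    m = len(y)
    data_term = 0.0
    for xi, yi in zip(X, y):
        h = sigmoid(sum(t * v for t, v in zip(theta, xi)))
        data_term += -yi * math.log(h) - (1 - yi) * math.log(1.0 - h)
    data_term /= m
    penalty = lam / (2.0 * m) * sum(t * t for t in theta[1:])  # skips theta_0
    return data_term + penalty

# Made-up data: one feature plus the bias column.
X = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]
y = [0, 0, 1, 1]

plain = regularized_cost([0.0, 2.0], X, y, lam=0.0)
penalized = regularized_cost([0.0, 2.0], X, y, lam=1.0)
```

With m = 4, λ = 1 and $\theta_1 = 2$, the penalty adds $\frac{1}{8} \cdot 4 = 0.5$ to the cost, while changing $\theta_0$ alone never changes the penalty.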

Reference: Andrew Ng, Machine Learning

Machine Learning Algorithms: Logistic Regression