Hung-yi Lee's Deep Learning Course
Predicting a Pokémon's Combat Power (CP)
Regression
- Market Forecast —— predict tomorrow's stock price
- Self-driving Car —— predict the steering wheel angle
- Recommendation —— predict purchase probability (recommender systems)
$f(x) = y$, where $x$ is a Pokémon and $y$ is its 'CP after evolution'.
Features of $x$: $x_{cp}$ (CP before evolution), $x_s$ (species), $x_{hp}$ (HP), $x_w$ (weight), $x_h$ (height).
Step 1. Model
A set of functions ——→ Model ($f_1, f_2, f_3, \dots$)
Linear model:
$y = b + w \cdot x_{cp}$
$w$ and $b$ are parameters (they can take any value).
More generally: $y = b + \sum_i w_i x_i$
- $x_i$: an attribute of the input $x$ (a feature)
- $w_i$: weight
- $b$: bias
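The linear model above can be sketched as a small function; the feature values and parameters below are hypothetical, for illustration only.

```python
# A minimal sketch of the linear model y = b + sum_i(w_i * x_i).
# All numeric values here are made up for illustration.

def linear_model(x, w, b):
    """Predict y from feature vector x with weights w and bias b."""
    return b + sum(w_i * x_i for w_i, x_i in zip(w, x))

# Single-feature case y = b + w * x_cp, with x_cp = 100, w = 2.7, b = 10:
y = linear_model([100.0], [2.7], 10.0)
print(y)  # 280.0
```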
Step 2. Goodness of function
Function input: $x^1, x^2, x^3, \dots$
Function output (scalar): $\hat{y}^1, \hat{y}^2, \hat{y}^3, \dots$
Loss function $L$:
- input: a function
- output: how bad it is
$L(f) = L(w, b) = \sum_{n=1}^{10}\left(\hat{y}^n - (b + w \cdot x_{cp}^n)\right)^2$
Each squared term is the estimation error: the difference between the true value $\hat{y}^n$ and the $y$ estimated by the input function. $L$ thus measures how good a particular pair $(w, b)$ is.
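The loss above can be sketched directly; the training pairs below are made up for illustration, not real Pokémon data.

```python
# Sketch of the squared-error loss L(w, b) over training pairs (x_cp^n, y_hat^n).
# The training pairs are hypothetical.

def loss(w, b, data):
    """Sum of squared estimation errors for a candidate (w, b)."""
    return sum((y_hat - (b + w * x_cp)) ** 2 for x_cp, y_hat in data)

data = [(10, 40), (20, 65), (30, 95)]  # hypothetical (x_cp, y_hat) pairs
print(loss(3.0, 10.0, data))          # evaluates one candidate function: 50.0
```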
Step 3. Gradient Descent
Best function (pick the "best" function):
$f^* = \arg\min_f L(f)$
$w^*, b^* = \arg\min_{w,b} L(w, b) = \arg\min_{w,b} \sum_{n=1}^{10}\left(\hat{y}^n - (b + w \cdot x_{cp}^n)\right)^2$
Consider a loss function $L(w)$ with a single parameter $w$:
$w^* = \arg\min_w L(w)$
Because $L$ is differentiable, its gradient can be computed and propagated back, so gradient descent can be applied.
- (Randomly) pick an initial value $w^0$
- Compute $\frac{dL}{dw}\big|_{w=w^0}$ and update: $w^1 = w^0 - \eta \frac{dL}{dw}\big|_{w=w^0}$
- $\eta$ is called the 'learning rate'.
- Repeat for many iterations
This may reach a local optimum, not necessarily the global optimum.
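The one-parameter update rule above can be sketched on a toy loss; $L(w) = (w-3)^2$ and the learning rate below are illustrative choices, so $w$ should approach 3.

```python
# Minimal sketch of gradient descent on a one-parameter loss L(w) = (w - 3)^2.
# This toy loss is convex, so the iterates converge to the minimum at w = 3.

def dL_dw(w):
    return 2 * (w - 3)          # derivative of (w - 3)^2

eta = 0.1                       # learning rate
w = 0.0                         # (randomly) picked initial value w^0
for _ in range(100):            # many iterations
    w = w - eta * dL_dw(w)      # w^{t+1} = w^t - eta * dL/dw

print(round(w, 4))  # 3.0
```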
How about two parameters?
$w^*, b^* = \arg\min_{w,b} L(w, b)$
- (Randomly) pick initial values $w^0, b^0$
- Compute $\frac{\partial L}{\partial w}\big|_{w=w^0, b=b^0}$ and $\frac{\partial L}{\partial b}\big|_{w=w^0, b=b^0}$, then update:
$w^1 = w^0 - \eta \frac{\partial L}{\partial w}\big|_{w=w^0, b=b^0}$
$b^1 = b^0 - \eta \frac{\partial L}{\partial b}\big|_{w=w^0, b=b^0}$
$\nabla L = \begin{bmatrix} \frac{\partial L}{\partial w} \\ \frac{\partial L}{\partial b} \end{bmatrix}$ (the gradient)
In linear regression, the loss function $L$ is convex, so there is no local optimum.
Formulation of $\frac{\partial L}{\partial w}$ and $\frac{\partial L}{\partial b}$:
$L(w, b) = \sum_{n=1}^{10}\left(\hat{y}^n - (b + w \cdot x_{cp}^n)\right)^2$
$\frac{\partial L}{\partial w} = \sum_{n=1}^{10} 2\left(\hat{y}^n - (b + w \cdot x_{cp}^n)\right)(-x_{cp}^n)$
$\frac{\partial L}{\partial b} = \sum_{n=1}^{10} 2\left(\hat{y}^n - (b + w \cdot x_{cp}^n)\right)(-1)$
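Putting the two partial derivatives into the update rule gives two-parameter gradient descent. A minimal sketch, on synthetic data generated from $y = 5 + 2x$ (so the iterates should approach $w = 2$, $b = 5$); the data, learning rate, and iteration count are illustrative choices.

```python
# Sketch of gradient descent on (w, b) using the partial derivatives above.
# Training pairs (x^n, y_hat^n) are synthetic, generated from y = 5 + 2x.

data = [(x, 5 + 2 * x) for x in range(1, 11)]  # 10 hypothetical examples

w, b, eta = 0.0, 0.0, 0.001
for _ in range(20000):
    # dL/dw and dL/db, exactly as derived above:
    dw = sum(2 * (y - (b + w * x)) * (-x) for x, y in data)
    db = sum(2 * (y - (b + w * x)) * (-1) for x, y in data)
    w, b = w - eta * dw, b - eta * db

print(round(w, 2), round(b, 2))  # 2.0 5.0
```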
Model Selection
Model 1: $y = b + w_1 x_{cp}$
Model 2: $y = b + w_1 x_{cp} + w_2 (x_{cp})^2$
Model 3: $y = b + w_1 x_{cp} + w_2 (x_{cp})^2 + w_3 (x_{cp})^3$
...
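These nested polynomial models can be written as one helper; a minimal sketch, with illustrative (not fitted) coefficients. Each simpler model is a special case of the next one (set the extra weight to zero).

```python
# Sketch of the nested polynomial model family: each model adds one higher power.
# Coefficients here are illustrative, not fitted values.

def poly_model(x, b, ws):
    """y = b + w_1*x + w_2*x^2 + ... + w_k*x^k"""
    return b + sum(w * x ** (i + 1) for i, w in enumerate(ws))

print(poly_model(2.0, 1.0, [3.0]))       # Model 1: 1 + 3*2          = 7.0
print(poly_model(2.0, 1.0, [3.0, 0.5]))  # Model 2: 1 + 3*2 + 0.5*4  = 9.0
```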
A more complex model does not always lead to better performance on testing data. This is overfitting.
Let's collect more data. There are hidden factors influencing the previous model: the species of the Pokémon.
Back to Step 1: Redesign the Model
$x_s$ = species of $x$
$x \longrightarrow$
if $x_s$ = Pidgey: $y = b_1 + w_1 \cdot x_{cp}$
if $x_s$ = Weedle: $y = b_2 + w_2 \cdot x_{cp}$
if $x_s$ = Caterpie: $y = b_3 + w_3 \cdot x_{cp}$
if $x_s$ = Eevee: $y = b_4 + w_4 \cdot x_{cp}$
$\longrightarrow y$
This branching model is still a linear model when written with indicator functions:
$y = b_1 \delta(x_s = \text{Pidgey}) + w_1 \cdot \delta(x_s = \text{Pidgey})\, x_{cp} + \dots + b_4 \delta(x_s = \text{Eevee}) + w_4 \cdot \delta(x_s = \text{Eevee})\, x_{cp}$
$\delta(x_s = \text{Pidgey}) = \begin{cases} 1 & \text{if } x_s = \text{Pidgey} \\ 0 & \text{otherwise} \end{cases}$
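The delta-indicator form can be sketched directly; the per-species parameters $(b_i, w_i)$ below are hypothetical, not fitted values.

```python
# Sketch of the species-conditional model written with delta indicators.
# The per-species parameters are hypothetical.

PARAMS = {  # species -> (b, w)
    "Pidgey":   (10.0, 2.0),
    "Weedle":   ( 5.0, 1.5),
    "Caterpie": ( 8.0, 1.2),
    "Eevee":    (20.0, 2.5),
}

def delta(a, b):
    """delta(x_s = s): 1 if the species matches, 0 otherwise."""
    return 1.0 if a == b else 0.0

def predict(x_s, x_cp):
    # y = sum over species s of delta(x_s = s) * (b_s + w_s * x_cp)
    return sum(delta(x_s, s) * (b + w * x_cp) for s, (b, w) in PARAMS.items())

print(predict("Eevee", 100))  # 20 + 2.5*100 = 270.0
```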
Are there any other hidden factors?
Back to Step 2: Regularization
$y = b + \sum_i w_i x_i$
$L = \sum_n \left(\hat{y}^n - \left(b + \sum_i w_i x_i\right)\right)^2 + \lambda \sum_i (w_i)^2$
training error + regularization term
The bias $b$ does not affect how smooth the function is, so it is excluded from the regularization term.
- Functions with smaller $w_i$ are better: smaller weights give a smoother function.
- The larger $\lambda$ is, the less weight the training error carries.
- Larger $\lambda$ gives a smoother function, but the function must not be too smooth.
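The regularized loss above is a small extension of the earlier loss; a minimal sketch with illustrative data and parameters.

```python
# Sketch of the regularized loss: squared error plus lambda * sum of w_i^2.
# Data and parameters are illustrative.

def regularized_loss(w, b, data, lam):
    err = sum((y - (b + w * x)) ** 2 for x, y in data)
    penalty = lam * w ** 2      # b is excluded: it does not affect smoothness
    return err + penalty

data = [(1, 3), (2, 5)]         # (w, b) = (2, 1) fits these points exactly
print(regularized_loss(2.0, 1.0, data, 0.0))   # pure training error: 0.0
print(regularized_loss(2.0, 1.0, data, 10.0))  # + 10 * 2^2 = 40.0
```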
Why are smooth functions preferred? If some noise corrupts the input $x_i$ at testing time, a smooth function is affected less.
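This can be checked numerically: for a linear function, an input perturbation of size $d$ changes the output by exactly $w \cdot d$, so smaller $w$ means less sensitivity to noise. A minimal sketch with made-up values:

```python
# Why smaller weights mean smoother functions: a noise of size d in the input
# changes the output of y = b + w*x by w * d, so smaller w -> smaller effect.

def f(x, w, b):
    return b + w * x

noise = 1.0
for w in (10.0, 0.1):
    change = abs(f(5.0 + noise, w, 0.0) - f(5.0, w, 0.0))
    print(w, change)  # the output change equals w * noise
```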
Where do the errors come from?
- bias
- variance
A simpler model is less influenced by the sampled data.
- simple model → small variance, large bias ( underfitting )
- complex model → large variance, small bias ( overfitting )
A more complex model family contains the simpler ones as special cases.
For bias, redesign your model:
- add more features as input
- a more complex model
What to do with large variance?
- more data (collect real data, or generate synthetic data) —— very effective, but not always practical
- regularization
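The bias/variance contrast above can be made concrete by resampling: fit the same model family on many re-drawn datasets and look at how much its predictions spread. Everything below (the data generator with true function $y = 2x$, the two toy model families, the evaluation point $x = 5$) is an illustrative construction, not from the course.

```python
# Sketch of bias vs variance: resample noise on fixed inputs and compare a
# simple model (a constant) with a more complex one (line through endpoints).
# True function is y = 2x; we examine predictions at x = 5.

import random
random.seed(0)

XS = [1, 2, 3, 4, 5]

def dataset():
    return [(x, 2 * x + random.gauss(0, 2)) for x in XS]

def stats(fit, trials=2000):
    """Mean and variance of a model's prediction at x = 5 over resampled data."""
    preds = [fit(dataset()) for _ in range(trials)]
    m = sum(preds) / len(preds)
    return m, sum((p - m) ** 2 for p in preds) / len(preds)

def simple(d):                    # predicts the mean everywhere, ignoring x
    return sum(y for _, y in d) / len(d)

def complex_fit(d):               # line through the two endpoint samples
    (x0, y0), (x1, y1) = d[0], d[-1]
    return y0 + (y1 - y0) / (x1 - x0) * (5 - x0)

for name, fit in (("simple", simple), ("complex", complex_fit)):
    mean, var = stats(fit)
    print(name, round(mean, 1), round(var, 1))
# simple:  mean far from the true value 10 (large bias), small variance
# complex: mean near 10 (small bias), larger variance
```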