Hung-yi Lee's Deep Learning Course
Predicting a Pokémon's Combat Power (CP)
Regression
- Market Forecast —— predict tomorrow's stock price
- Self-driving Car —— predict the steering wheel angle
- Recommendation —— predict purchase probability (recommender systems)
$f(x) = y$, where $x$ is a Pokémon and $y$ is its 'CP after evolution'.
Features of $x$: $x_{cp}$ (CP before evolution), $x_s$ (species), $x_{hp}$ (HP), $x_w$ (weight), $x_h$ (height).
Step 1. Model
A set of functions ——→ Model ($f_1, f_2, f_3, \dots$)
Linear model:
$y = b + w \cdot x_{cp}$
$w$ and $b$ are parameters (they can take any value).
More generally: $y = b + \sum_i w_i x_i$
- $x_i$: an attribute of the input $x$ (a feature)
- $w_i$: weight
- $b$: bias
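The linear model above can be sketched as a small function; the feature values and parameters below are hypothetical, for illustration only.

```python
# A minimal sketch of the linear model y = b + sum_i(w_i * x_i).
# All numeric values here are made up for illustration.

def linear_model(x, w, b):
    """Predict y from feature vector x with weights w and bias b."""
    return b + sum(w_i * x_i for w_i, x_i in zip(w, x))

# Single-feature case y = b + w * x_cp, with x_cp = 100, w = 2.7, b = 10:
y = linear_model([100.0], [2.7], 10.0)
print(y)  # 280.0
```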
Step 2. Goodness of function
Function input: $x^1, x^2, x^3, \dots$
Function output (scalar): $\hat{y}^1, \hat{y}^2, \hat{y}^3, \dots$
Loss function $L$:
- input: a function
- output: how bad it is
$L(f) = L(w, b) = \sum_{n=1}^{10}\left(\hat{y}^n - (b + w \cdot x_{cp}^n)\right)^2$
Each squared term is the estimation error: the difference between the true value $\hat{y}^n$ and the $y$ estimated by the input function. $L$ thus measures how good a particular pair $(w, b)$ is.
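The loss above can be sketched directly; the training pairs below are made up for illustration, not real Pokémon data.

```python
# Sketch of the squared-error loss L(w, b) over training pairs (x_cp^n, y_hat^n).
# The training pairs are hypothetical.

def loss(w, b, data):
    """Sum of squared estimation errors for a candidate (w, b)."""
    return sum((y_hat - (b + w * x_cp)) ** 2 for x_cp, y_hat in data)

data = [(10, 40), (20, 65), (30, 95)]  # hypothetical (x_cp, y_hat) pairs
print(loss(3.0, 10.0, data))          # evaluates one candidate function: 50.0
```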
Step 3. Gradient Descent
Best function (pick the "best" function):
$f^* = \arg\min_f L(f)$
$w^*, b^* = \arg\min_{w,b} L(w, b) = \arg\min_{w,b} \sum_{n=1}^{10}\left(\hat{y}^n - (b + w \cdot x_{cp}^n)\right)^2$
Consider a loss function $L(w)$ with a single parameter $w$:
$w^* = \arg\min_w L(w)$
Because $L$ is differentiable, its gradient can be computed and propagated back, so gradient descent can be applied.
- (Randomly) pick an initial value $w^0$
- Compute $\frac{dL}{dw}\big|_{w=w^0}$ and update: $w^1 = w^0 - \eta \frac{dL}{dw}\big|_{w=w^0}$
- $\eta$ is called the 'learning rate'.
- Repeat for many iterations
This may reach a local optimum, not necessarily the global optimum.
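The one-parameter update rule above can be sketched on a toy loss; $L(w) = (w-3)^2$ and the learning rate below are illustrative choices, so $w$ should approach 3.

```python
# Minimal sketch of gradient descent on a one-parameter loss L(w) = (w - 3)^2.
# This toy loss is convex, so the iterates converge to the minimum at w = 3.

def dL_dw(w):
    return 2 * (w - 3)          # derivative of (w - 3)^2

eta = 0.1                       # learning rate
w = 0.0                         # (randomly) picked initial value w^0
for _ in range(100):            # many iterations
    w = w - eta * dL_dw(w)      # w^{t+1} = w^t - eta * dL/dw

print(round(w, 4))  # 3.0
```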
How about two parameters?
$w^*, b^* = \arg\min_{w,b} L(w, b)$
- (Randomly) pick initial values $w^0, b^0$
- Compute $\frac{\partial L}{\partial w}\big|_{w=w^0, b=b^0}$ and $\frac{\partial L}{\partial b}\big|_{w=w^0, b=b^0}$, then update:
$w^1 = w^0 - \eta \frac{\partial L}{\partial w}\big|_{w=w^0, b=b^0}$
$b^1 = b^0 - \eta \frac{\partial L}{\partial b}\big|_{w=w^0, b=b^0}$
$\nabla L = \begin{bmatrix} \frac{\partial L}{\partial w} \\ \frac{\partial L}{\partial b} \end{bmatrix}$ (the gradient)
In linear regression, the loss function $L$ is convex, so there is no local optimum.
Formulation of $\frac{\partial L}{\partial w}$ and $\frac{\partial L}{\partial b}$:
$L(w, b) = \sum_{n=1}^{10}\left(\hat{y}^n - (b + w \cdot x_{cp}^n)\right)^2$
$\frac{\partial L}{\partial w} = \sum_{n=1}^{10} 2\left(\hat{y}^n - (b + w \cdot x_{cp}^n)\right)(-x_{cp}^n)$
$\frac{\partial L}{\partial b} = \sum_{n=1}^{10} 2\left(\hat{y}^n - (b + w \cdot x_{cp}^n)\right)(-1)$
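Putting the two partial derivatives into the update rule gives two-parameter gradient descent. A minimal sketch, on synthetic data generated from $y = 5 + 2x$ (so the iterates should approach $w = 2$, $b = 5$); the data, learning rate, and iteration count are illustrative choices.

```python
# Sketch of gradient descent on (w, b) using the partial derivatives above.
# Training pairs (x^n, y_hat^n) are synthetic, generated from y = 5 + 2x.

data = [(x, 5 + 2 * x) for x in range(1, 11)]  # 10 hypothetical examples

w, b, eta = 0.0, 0.0, 0.001
for _ in range(20000):
    # dL/dw and dL/db, exactly as derived above:
    dw = sum(2 * (y - (b + w * x)) * (-x) for x, y in data)
    db = sum(2 * (y - (b + w * x)) * (-1) for x, y in data)
    w, b = w - eta * dw, b - eta * db

print(round(w, 2), round(b, 2))  # 2.0 5.0
```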
Model Selection
Model 1: $y = b + w_1 x_{cp}$
Model 2: $y = b + w_1 x_{cp} + w_2 (x_{cp})^2$
Model 3: $y = b + w_1 x_{cp} + w_2 (x_{cp})^2 + w_3 (x_{cp})^3$
...
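These nested polynomial models can be written as one helper; a minimal sketch, with illustrative (not fitted) coefficients. Each simpler model is a special case of the next one (set the extra weight to zero).

```python
# Sketch of the nested polynomial model family: each model adds one higher power.
# Coefficients here are illustrative, not fitted values.

def poly_model(x, b, ws):
    """y = b + w_1*x + w_2*x^2 + ... + w_k*x^k"""
    return b + sum(w * x ** (i + 1) for i, w in enumerate(ws))

print(poly_model(2.0, 1.0, [3.0]))       # Model 1: 1 + 3*2          = 7.0
print(poly_model(2.0, 1.0, [3.0, 0.5]))  # Model 2: 1 + 3*2 + 0.5*4  = 9.0
```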
A more complex model does not always lead to better performance on testing data. This is overfitting.
Let's collect more data. There are hidden factors influencing the previous model: the species of the Pokémon.
Back to Step 1: Redesign the Model
$x_s$ = species of $x$
$x \longrightarrow$
if $x_s$ = Pidgey: $y = b_1 + w_1 \cdot x_{cp}$
if $x_s$ = Weedle: $y = b_2 + w_2 \cdot x_{cp}$
if $x_s$ = Caterpie: $y = b_3 + w_3 \cdot x_{cp}$
if $x_s$ = Eevee: $y = b_4 + w_4 \cdot x_{cp}$
$\longrightarrow y$
This branching model is still a linear model when written with indicator functions:
$y = b_1 \delta(x_s = \text{Pidgey}) + w_1 \cdot \delta(x_s = \text{Pidgey})\, x_{cp} + \dots + b_4 \delta(x_s = \text{Eevee}) + w_4 \cdot \delta(x_s = \text{Eevee})\, x_{cp}$
$\delta(x_s = \text{Pidgey}) = \begin{cases} 1 & \text{if } x_s = \text{Pidgey} \\ 0 & \text{otherwise} \end{cases}$
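The delta-indicator form can be sketched directly; the per-species parameters $(b_i, w_i)$ below are hypothetical, not fitted values.

```python
# Sketch of the species-conditional model written with delta indicators.
# The per-species parameters are hypothetical.

PARAMS = {  # species -> (b, w)
    "Pidgey":   (10.0, 2.0),
    "Weedle":   ( 5.0, 1.5),
    "Caterpie": ( 8.0, 1.2),
    "Eevee":    (20.0, 2.5),
}

def delta(a, b):
    """delta(x_s = s): 1 if the species matches, 0 otherwise."""
    return 1.0 if a == b else 0.0

def predict(x_s, x_cp):
    # y = sum over species s of delta(x_s = s) * (b_s + w_s * x_cp)
    return sum(delta(x_s, s) * (b + w * x_cp) for s, (b, w) in PARAMS.items())

print(predict("Eevee", 100))  # 20 + 2.5*100 = 270.0
```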
Are there any other hidden factors?
Back to Step 2: Regularization
$y = b + \sum_i w_i x_i$
$L = \sum_n \left(\hat{y}^n - \left(b + \sum_i w_i x_i\right)\right)^2 + \lambda \sum_i (w_i)^2$
training error + regularization term
The bias $b$ does not affect how smooth the function is, so it is excluded from the regularization term.
- Functions with smaller $w_i$ are better: smaller weights give a smoother function.
- The larger $\lambda$ is, the less weight the training error carries.
- Larger $\lambda$ gives a smoother function, but the function must not be too smooth.
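The regularized loss above is a small extension of the earlier loss; a minimal sketch with illustrative data and parameters.

```python
# Sketch of the regularized loss: squared error plus lambda * sum of w_i^2.
# Data and parameters are illustrative.

def regularized_loss(w, b, data, lam):
    err = sum((y - (b + w * x)) ** 2 for x, y in data)
    penalty = lam * w ** 2      # b is excluded: it does not affect smoothness
    return err + penalty

data = [(1, 3), (2, 5)]         # (w, b) = (2, 1) fits these points exactly
print(regularized_loss(2.0, 1.0, data, 0.0))   # pure training error: 0.0
print(regularized_loss(2.0, 1.0, data, 10.0))  # + 10 * 2^2 = 40.0
```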
Why are smooth functions preferred? If some noise corrupts the input $x_i$ at testing time, a smooth function is affected less.
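This can be checked numerically: for a linear function, an input perturbation of size $d$ changes the output by exactly $w \cdot d$, so smaller $w$ means less sensitivity to noise. A minimal sketch with made-up values:

```python
# Why smaller weights mean smoother functions: a noise of size d in the input
# changes the output of y = b + w*x by w * d, so smaller w -> smaller effect.

def f(x, w, b):
    return b + w * x

noise = 1.0
for w in (10.0, 0.1):
    change = abs(f(5.0 + noise, w, 0.0) - f(5.0, w, 0.0))
    print(w, change)  # the output change equals w * noise
```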
Where do the errors come from?
- bias
- variance
A simpler model is less influenced by the sampled data.
- simple model → small variance, large bias ( underfitting )
- complex model → large variance, small bias ( overfitting )
A more complex model family contains the simpler ones as special cases.
For bias, redesign your model:
- add more features as input
- a more complex model
What to do with large variance?
- more data (collect real data, or generate synthetic data) —— very effective, but not always practical
- regularization
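The bias/variance contrast above can be made concrete by resampling: fit the same model family on many re-drawn datasets and look at how much its predictions spread. Everything below (the data generator with true function $y = 2x$, the two toy model families, the evaluation point $x = 5$) is an illustrative construction, not from the course.

```python
# Sketch of bias vs variance: resample noise on fixed inputs and compare a
# simple model (a constant) with a more complex one (line through endpoints).
# True function is y = 2x; we examine predictions at x = 5.

import random
random.seed(0)

XS = [1, 2, 3, 4, 5]

def dataset():
    return [(x, 2 * x + random.gauss(0, 2)) for x in XS]

def stats(fit, trials=2000):
    """Mean and variance of a model's prediction at x = 5 over resampled data."""
    preds = [fit(dataset()) for _ in range(trials)]
    m = sum(preds) / len(preds)
    return m, sum((p - m) ** 2 for p in preds) / len(preds)

def simple(d):                    # predicts the mean everywhere, ignoring x
    return sum(y for _, y in d) / len(d)

def complex_fit(d):               # line through the two endpoint samples
    (x0, y0), (x1, y1) = d[0], d[-1]
    return y0 + (y1 - y0) / (x1 - x0) * (5 - x0)

for name, fit in (("simple", simple), ("complex", complex_fit)):
    mean, var = stats(fit)
    print(name, round(mean, 1), round(var, 1))
# simple:  mean far from the true value 10 (large bias), small variance
# complex: mean near 10 (small bias), larger variance
```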