Target

$$\hat{w} = \arg\min_{w} \sum_{i=1}^{n} L(w, z_i) \quad \text{s.t.} \quad \|w\|_1 \le s$$

$$\hat{w} = \arg\min_{w} \sum_{i=1}^{n} L(w, z_i) + \lambda \|w\|_1$$
Gradient Descent
$$W^{(t+1)} = W^{(t)} - \eta^{(t)} G^{(t)} = W^{(t)} - \eta^{(t)} \nabla_{W}\, \zeta(W^{(t)}, Z)$$
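A minimal one-dimensional sketch of this update, on the toy loss $\zeta(w) = (w-3)^2$ with gradient $2(w-3)$ (loss and step size chosen only for illustration):

```python
# Gradient descent on the toy loss zeta(w) = (w - 3)^2, gradient 2*(w - 3).
def grad(w):
    return 2.0 * (w - 3.0)

w = 0.0        # W^(0)
eta = 0.1      # fixed step size eta^(t)
for t in range(100):
    w -= eta * grad(w)   # W^(t+1) = W^(t) - eta^(t) * G^(t)
# w is now very close to the minimizer w = 3
```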
Stochastic Gradient Descent
$$W^{(t+1)} = W^{(t)} - \eta^{(t)} G_j^{(t)} = W^{(t)} - \eta^{(t)} \nabla_{W}\, \zeta(W^{(t)}, Z_j)$$
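The same sketch with a single randomly drawn example per step, taking $L(w, z) = (w-z)^2$ so the minimizer of the summed loss is the mean of the data; the decaying schedule $\eta^{(t)} = 0.5/t$ is an illustrative choice:

```python
import random

# SGD with single-example gradients on L(w, z) = (w - z)^2.
# The minimizer of the summed loss is the data mean (2.5 here).
random.seed(0)
data = [1.0, 2.0, 3.0, 4.0]
w = 0.0
for t in range(1, 2001):
    z = random.choice(data)    # draw one example z_j
    g = 2.0 * (w - z)          # stochastic gradient G_j^(t)
    w -= (0.5 / t) * g         # decaying step size eta^(t)
```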
Momentum
$$m_t = \mu m_{t-1} + G^{(t)}$$

$$W^{(t+1)} = W^{(t)} - \eta^{(t)} m_t$$
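A sketch of the heavy-ball update on the same toy quadratic:

```python
# Heavy-ball momentum on the toy loss zeta(w) = (w - 3)^2.
def grad(w):
    return 2.0 * (w - 3.0)

w, m = 0.0, 0.0
eta, mu = 0.1, 0.9         # step size and momentum coefficient
for t in range(400):
    m = mu * m + grad(w)   # m_t = mu * m_{t-1} + G^(t)
    w -= eta * m           # W^(t+1) = W^(t) - eta^(t) * m_t
```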
Nesterov
$$m_t = \mu m_{t-1} + G^{(t)}$$

$$W^{(t+1)} = W^{(t)} - \eta \mu m_t - \eta G^{(t)}$$
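A sketch of the gradient-correction form of Nesterov momentum; the extra $\mu m_t$ term (versus plain momentum's $m_t = \mu m_{t-1} + G^{(t)}$ step) anticipates where the momentum will carry the iterate:

```python
# Nesterov momentum (gradient-correction form) on zeta(w) = (w - 3)^2.
def grad(w):
    return 2.0 * (w - 3.0)

w, m = 0.0, 0.0
eta, mu = 0.1, 0.9
for t in range(400):
    g = grad(w)
    m = mu * m + g                # m_t = mu * m_{t-1} + G^(t)
    w -= eta * mu * m + eta * g   # W^(t+1) = W^(t) - eta*mu*m_t - eta*G^(t)
```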
Adagrad
$$n_t = n_{t-1} + (G^{(t)})^2$$

$$W^{(t+1)} = W^{(t)} - \frac{\eta}{\sqrt{n_t + \epsilon}}\, G^{(t)}$$
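A sketch of Adagrad on the toy quadratic; the accumulated squared gradient $n_t$ shrinks the effective step size as training proceeds:

```python
import math

# Adagrad on zeta(w) = (w - 3)^2: n_t accumulates squared gradients,
# giving each parameter its own shrinking step size.
def grad(w):
    return 2.0 * (w - 3.0)

w, n = 0.0, 0.0
eta, eps = 1.0, 1e-8
for t in range(500):
    g = grad(w)
    n += g * g                          # n_t = n_{t-1} + (G^(t))^2
    w -= eta / math.sqrt(n + eps) * g   # adaptive per-parameter step
```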
Adadelta
$$n_t = \nu n_{t-1} + (1 - \nu)(G^{(t)})^2$$

$$W^{(t+1)} = W^{(t)} - \frac{\eta}{\sqrt{n_t + \epsilon}}\, G^{(t)}$$
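A sketch of this update as written. Note that with a fixed $\eta$ this is the RMSProp-style rule; full Adadelta additionally replaces $\eta$ with an RMS of past parameter updates, which is not shown here:

```python
import math

# Exponential moving average of squared gradients (RMSProp-style rule,
# matching the update as written). With a constant eta the iterate
# settles into a small neighborhood of the minimizer rather than
# converging exactly.
def grad(w):
    return 2.0 * (w - 3.0)

w, n = 0.0, 0.0
eta, nu, eps = 0.1, 0.9, 1e-8
for t in range(1000):
    g = grad(w)
    n = nu * n + (1 - nu) * g * g       # n_t = nu*n_{t-1} + (1-nu)*(G^(t))^2
    w -= eta / math.sqrt(n + eps) * g
```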
With L1 Regularization

$$W^{(t+1)} = W^{(t)} - \eta^{(t)} G^{(t)} - \eta^{(t)} \lambda\, \mathrm{sgn}(W^{(t)})$$
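A sketch on the toy loss $\zeta(w) = (w - 0.05)^2$ with $\lambda$ large enough that the penalized optimum is exactly zero. The plain subgradient step oscillates around zero without ever landing on it, which is the motivation for the truncation methods below:

```python
# Subgradient descent with an L1 penalty on zeta(w) = (w - 0.05)^2.
# lambda > |zeta'(0)|, so the penalized optimum is w = 0, but the
# subgradient iterate bounces around zero instead of reaching it exactly.
def sgn(x):
    return (x > 0) - (x < 0)

def grad(w):
    return 2.0 * (w - 0.05)

w = 1.0
eta, lam = 0.01, 0.5
for t in range(2000):
    w = w - eta * grad(w) - eta * lam * sgn(w)
```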
Simple Truncated
$$T_0(v, \theta) = \begin{cases} 0 & \text{if } |v| \le \theta \\ v & \text{otherwise} \end{cases}$$

$$W^{(t+1)} = T_0\!\left(W^{(t)} - \eta^{(t)} G^{(t)},\, \theta\right)$$
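On the same L1 toy problem, hard truncation forces any iterate that falls inside $[-\theta, \theta]$ to exactly zero:

```python
# Simple truncation: gradient step, then zero out small coordinates.
def T0(v, theta):
    return 0.0 if abs(v) <= theta else v

def grad(w):
    return 2.0 * (w - 0.05)

w = 1.0
eta, theta = 0.01, 0.2
for t in range(2000):
    w = T0(w - eta * grad(w), theta)   # w lands on exactly 0.0 and stays
```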
Truncated Gradient
$$T_1(v, \alpha, \theta) = \begin{cases} \max(0, v - \alpha) & \text{if } v \in [0, \theta] \\ \min(0, v + \alpha) & \text{if } v \in [-\theta, 0] \\ v & \text{otherwise} \end{cases}$$

$$W^{(t+1)} = T_1\!\left(W^{(t)} - \eta^{(t)} G^{(t)},\, \eta^{(t)}\lambda^{(t)},\, \theta\right)$$
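The truncated-gradient variant is gentler: inside the window $[-\theta, \theta]$ coordinates are shrunk toward zero by $\alpha = \eta^{(t)}\lambda^{(t)}$ rather than zeroed outright, though on this toy problem it reaches the same sparse endpoint:

```python
# Truncated gradient: shrink by alpha inside the window, clip at zero.
def T1(v, alpha, theta):
    if 0 <= v <= theta:
        return max(0.0, v - alpha)
    if -theta <= v < 0:
        return min(0.0, v + alpha)
    return v

def grad(w):
    return 2.0 * (w - 0.05)

w = 1.0
eta, lam, theta = 0.01, 0.5, 0.2
for t in range(2000):
    w = T1(w - eta * grad(w), eta * lam, theta)
```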
Adam
$$m_t = \mu m_{t-1} + (1 - \mu) G^{(t)}$$

$$n_t = \nu n_{t-1} + (1 - \nu) (G^{(t)})^2$$

$$\hat{m}_t = \frac{m_t}{1 - \mu^t}, \qquad \hat{n}_t = \frac{n_t}{1 - \nu^t}$$

$$W^{(t+1)} = W^{(t)} - \frac{\hat{m}_t}{\sqrt{\hat{n}_t} + \epsilon}\, \eta$$
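A sketch of the full Adam loop on the toy quadratic, with default-style coefficients; note the bias correction depends on the step counter $t$ starting at 1:

```python
import math

# Adam on zeta(w) = (w - 3)^2.
def grad(w):
    return 2.0 * (w - 3.0)

w, m, n = 0.0, 0.0, 0.0
eta, mu, nu, eps = 0.1, 0.9, 0.999, 1e-8
for t in range(1, 1001):
    g = grad(w)
    m = mu * m + (1 - mu) * g       # first-moment EMA
    n = nu * n + (1 - nu) * g * g   # second-moment EMA
    m_hat = m / (1 - mu ** t)       # bias correction (t starts at 1)
    n_hat = n / (1 - nu ** t)
    w -= eta * m_hat / (math.sqrt(n_hat) + eps)
```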
FOBOS
$$W^{(t+0.5)} = W^{(t)} - \eta^{(t)} G^{(t)}$$

$$W^{(t+1)} = \arg\min_{W} \left\{ \frac{1}{2} \left\| W - W^{(t+0.5)} \right\|_2^2 + \eta^{(t+0.5)} \Psi(W) \right\}$$
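Taking $\Psi(W) = \lambda \|W\|_1$, the argmin step has a coordinate-wise closed form: soft-thresholding. Unlike the subgradient step earlier, this produces exact zeros:

```python
# FOBOS with Psi(W) = lambda*||W||_1: the proximal step is
# coordinate-wise soft-thresholding.
def soft_threshold(v, tau):
    # argmin_w { 0.5*(w - v)**2 + tau*abs(w) }
    if v > tau:
        return v - tau
    if v < -tau:
        return v + tau
    return 0.0

def grad(w):
    return 2.0 * (w - 0.05)

w = 1.0
eta, lam = 0.01, 0.5
for t in range(2000):
    v = w - eta * grad(w)              # W^(t+0.5): plain gradient step
    w = soft_threshold(v, eta * lam)   # proximal step, yields exact zeros
```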
RDA
$$W^{(t+1)} = \arg\min_{W} \left\{ \frac{1}{t} \sum_{r=1}^{t} G^{(r)} \cdot W + \Psi(W) + \frac{\beta^{(t)}}{t}\, h(W) \right\}$$
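A sketch of L1-RDA, assuming $\Psi(W) = \lambda\|W\|_1$, $h(w) = \frac{1}{2}\|w\|_2^2$, and $\beta^{(t)} = \gamma\sqrt{t}$ (common choices, not stated in the formula above). The per-coordinate argmin then thresholds the *average* gradient rather than the iterate:

```python
import math

# L1-RDA sketch: threshold the running average gradient.
def grad(w):
    return 2.0 * (w - 0.05)

w, g_sum = 1.0, 0.0
lam, gamma = 0.5, 5.0
for t in range(1, 501):
    g_sum += grad(w)
    g_avg = g_sum / t
    if abs(g_avg) <= lam:
        w = 0.0   # average gradient below the L1 threshold -> exact zero
    else:
        w = -(math.sqrt(t) / gamma) * (g_avg - lam * (1 if g_avg > 0 else -1))
```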
FTRL
$$W^{(t+1)} = \arg\min_{W} \left\{ G^{(1:t)} \cdot W + \lambda_1 \|W\|_1 + \frac{\lambda_2}{2} \|W\|_2^2 + \frac{1}{2} \sum_{s=1}^{t} \sigma^{(s)} \left\| W - W^{(s)} \right\|_2^2 \right\}$$

where $G^{(1:t)} = \sum_{s=1}^{t} G^{(s)}$ and $\sigma^{(1:t)} = 1/\eta^{(t)}$.
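A per-coordinate FTRL-Proximal sketch for a single weight, assuming the usual Adagrad-style schedule $\sigma^{(1:t)} = \sqrt{n_t}/\alpha$ so the proximal terms telescope into two running scalars $z$ and $n$ (the statistics and closed-form update follow McMahan et al.'s formulation; $\alpha$, $\lambda_1$, $\lambda_2$ here are illustrative values):

```python
import math

# Per-coordinate FTRL-Proximal sketch (single weight).
def grad(w):
    return 2.0 * (w - 3.0)

alpha, lam1, lam2 = 0.5, 0.1, 1.0
z, n, w = 0.0, 0.0, 0.0
for t in range(2000):
    g = grad(w)
    sigma = (math.sqrt(n + g * g) - math.sqrt(n)) / alpha
    z += g - sigma * w        # accumulated gradients minus proximal terms
    n += g * g
    if abs(z) <= lam1:
        w = 0.0               # L1 term keeps the weight at exactly zero
    else:
        w = -(z - lam1 * (1 if z > 0 else -1)) / (math.sqrt(n) / alpha + lam2)
```

Because the L1 test is on the accumulated statistic $z$ rather than the current iterate, FTRL produces exact sparsity while still adapting the step size per coordinate.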