1. The Sigmoid Function

In this article, the Sigmoid function is denoted by $S(x)$:

$$S(x)=\dfrac{1}{1+e^{-x}}$$
The Sigmoid function has a notable property:

$$[S(x)]'=S(x)\left[1-S(x)\right]$$
The Sigmoid curve rises quickly near its center $(x=0,\ y=0.5)$ and slowly at both ends. (In the accompanying figure, the dashed line is the step function, shown for comparison.)
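As a quick numerical check, here is a minimal sketch (assuming NumPy) that compares the analytic derivative with a central finite difference:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Compare S'(x) = S(x)[1 - S(x)] with a central-difference approximation.
x = np.linspace(-5.0, 5.0, 11)
h = 1e-6
numeric = (sigmoid(x + h) - sigmoid(x - h)) / (2.0 * h)
analytic = sigmoid(x) * (1.0 - sigmoid(x))
print(np.max(np.abs(numeric - analytic)))  # on the order of 1e-10: the two agree
```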
2. The Logistic Regression Model

If the Sigmoid function $S(x)$ is used as a transformation of the linear model $f(\boldsymbol x)=\boldsymbol{w}^T \boldsymbol x + b$, then:

$$y(\boldsymbol{x})=S[ f(\boldsymbol x) ]=\dfrac{1}{1+e^{-(\boldsymbol{w}^{T}\boldsymbol{x}+b)}}$$
For a given sample $\boldsymbol{x}^{\ast}$ with output value $y=y(\boldsymbol x^{\ast})$, it follows that:

$$\ln\left( \dfrac{y}{1-y}\right)=\boldsymbol{w}^{T}\boldsymbol{x}^{\ast}+b$$
If $y$ is interpreted as the probability that the sample $\boldsymbol{x}^{\ast}$ is a positive example and $1-y$ as the probability that it is a negative example, then the logarithm of their ratio, $\ln\left( \dfrac{y}{1-y}\right)$, describes the "linear classification" of the sample $\boldsymbol{x}^{\ast}$ (see the figure below):
1) If $y=0.5$, then $1-y=0.5$ and $\ln\left( \dfrac{y}{1-y}\right)=0$, so $\boldsymbol{w}^{T}\boldsymbol{x}^{\ast}+b=0$.

From the linear model's point of view, the sample $\boldsymbol{x}^{\ast}$ lies exactly on the decision boundary (the red line in the figure) and is equally likely to be a positive or a negative example.
2) If $y>0.5$, then $1-y<0.5$ and $\ln\left( \dfrac{y}{1-y}\right)>0$, so $\boldsymbol{w}^{T}\boldsymbol{x}^{\ast}+b>0$.

This means the sample $\boldsymbol{x}^{\ast}$ lies on the upper side of the decision boundary.
3) If $y<0.5$, then $1-y>0.5$ and $\ln\left( \dfrac{y}{1-y}\right)<0$, so $\boldsymbol{w}^{T}\boldsymbol{x}^{\ast}+b<0$.

This means the sample $\boldsymbol{x}^{\ast}$ lies on the lower side of the decision boundary.
The Sigmoid function thus maps the output of the linear model $\boldsymbol{w}^{T}\boldsymbol{x}+b$ into the interval $[0,1]$. If an event occurs with probability $p$, its odds are defined as $\dfrac{p}{1-p}$ and its log odds as $\ln\left(\dfrac{p}{1-p}\right)$. For example, an event with probability $p=0.8$ has odds $0.8/0.2=4$ and log odds $\ln 4\approx 1.39$.
If the variable $c=1$ denotes region $\mathcal R_1$ in the figure above and $c=0$ denotes region $\mathcal R_2$, then the value of $y(\boldsymbol x)$ can be viewed as a class posterior probability:

$$p(c=1|\boldsymbol{x})=y(\boldsymbol{x})=\dfrac{1}{1+e^{-(\boldsymbol{w}^{T}\boldsymbol{x}+b)}}$$
The log odds of $p(c=1|\boldsymbol{x})$ is then exactly the linear model:

$$\ln \dfrac{p(c=1|\boldsymbol{x})}{p(c=0|\boldsymbol{x})}=\boldsymbol{w}^T\boldsymbol{x}+b$$
The probability that $\boldsymbol{x}$ is a positive example $(c=1)$:

$$p(c=1|\boldsymbol{x})=\dfrac{1}{1+e^{-(\boldsymbol{w}^{T}\boldsymbol{x}+b)}}$$

The closer the linear function $\boldsymbol{w}^{T}\boldsymbol{x}+b$ is to $+\infty$, the closer this probability is to $1$; the closer it is to $-\infty$, the closer this probability is to $0$.
The probability that $\boldsymbol{x}$ is a negative example $(c=0)$:

$$\begin{aligned} p(c=0|\boldsymbol{x})&=1-p(c=1|\boldsymbol{x})\\ &=\dfrac{e^{-(\boldsymbol{w}^{T}\boldsymbol{x}+b)}}{1+e^{-(\boldsymbol{w}^{T}\boldsymbol{x}+b)}}\\ &=\dfrac{1}{1+e^{\boldsymbol{w}^{T}\boldsymbol{x}+b}}\end{aligned}$$

The closer the linear function $\boldsymbol{w}^{T}\boldsymbol{x}+b$ is to $+\infty$, the closer this probability is to $0$; the closer it is to $-\infty$, the closer this probability is to $1$.
Clearly, a new input sample $\boldsymbol{x}^{\ast}$ is classified by the maximum a posteriori rule: if $p(c=1|\boldsymbol{x}^{\ast})>p(c=0|\boldsymbol{x}^{\ast})$, then $\boldsymbol{x}^{\ast}$ is assigned to $\mathcal R_{1}$; if $p(c=1|\boldsymbol{x}^{\ast})<p(c=0|\boldsymbol{x}^{\ast})$, then $\boldsymbol{x}^{\ast}$ is assigned to $\mathcal R_{2}$.
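Because both posteriors share the same denominator, comparing them reduces to checking the sign of $\boldsymbol{w}^{T}\boldsymbol{x}^{\ast}+b$. A minimal sketch of this decision rule, assuming `w` and `b` are already-estimated parameters:

```python
import numpy as np

def map_decision(x, w, b):
    """Return 1 (region R1) if p(c=1|x) > p(c=0|x), else 0 (region R2).

    p(c=1|x) > p(c=0|x) holds exactly when w^T x + b > 0,
    so the MAP rule is just a sign test on the linear score.
    """
    return 1 if float(np.dot(w, x) + b) > 0.0 else 0
```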
3. Estimating the Model Parameters

Assume the training set is $\{ ( \boldsymbol{x}_{i},c_{i}) \} _{i=1}^N$ with $\boldsymbol{x}_{i}\in \mathbb R^{n}$ and $c_{i}\in \{0,1\}$. Maximum likelihood estimation is used to find the model parameters $(\boldsymbol{w},b)$.
1) Since $y(\boldsymbol{x})=p(c=1|\boldsymbol{x})$, the likelihood of the training set can be written as:

$$L(\boldsymbol{w},b)=\prod_{i=1}^N y(\boldsymbol{x}_{i})^{c_{i}}\left[ 1-y(\boldsymbol{x}_{i})\right] ^{1-c_{i}}$$
2) The log-likelihood is then:

$$\begin{aligned} \ln L(\boldsymbol{w},b)&=\sum_{i=1}^N \left\{ c_{i}\ln\left[ y\left( \boldsymbol{x}_{i}\right) \right] +(1-c_{i})\ln\left[ 1-y\left( \boldsymbol{x}_{i}\right) \right] \right\}\\ &=\sum_{i=1}^N\left\{ c_{i}\ln\dfrac{ y\left( \boldsymbol{x}_{i}\right)}{1-y\left( \boldsymbol{x}_{i}\right)} +\ln\left[ 1-y\left( \boldsymbol{x}_{i}\right) \right] \right\} \\ &=\sum_{i=1}^N\left\{ c_{i}\left(\boldsymbol{w}^{T}\boldsymbol{x}_{i}+b\right)-\ln\left[ 1+e^{\boldsymbol{w}^{T}\boldsymbol{x}_{i}+b} \right] \right\} \end{aligned}$$
3) Let $\boldsymbol{\beta}=[\boldsymbol{w}^T,b]^{T}$ and $\hat{\boldsymbol{x}}_{i}=[\boldsymbol{x}_{i}^T,1]^{T}$, so that the linear model becomes $\boldsymbol{w}^{T}\boldsymbol{x}_{i}+b=\boldsymbol{\beta}^{T}\hat{\boldsymbol{x}}_{i}$. The log-likelihood then simplifies to:

$$\ln L(\boldsymbol{\beta})=\sum_{i=1}^N \left[ c_{i}\boldsymbol{\beta}^{T}\hat{\boldsymbol{x}}_{i}-\ln\left( 1+e^{\boldsymbol{\beta}^{T}\hat{\boldsymbol{x}}_{i}} \right) \right]$$
Maximizing this likelihood yields the parameters $\boldsymbol{\beta}=[\boldsymbol{w}^{T},b]^{T}$ of the Logistic Regression model.
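As a sketch, this log-likelihood takes only a few lines of vectorized NumPy, assuming `xhat` stacks the augmented samples $\hat{\boldsymbol{x}}_{i}$ row-wise and `c` holds the 0/1 labels:

```python
import numpy as np

def log_likelihood(beta, xhat, c):
    """ln L(beta) = sum_i [ c_i * beta^T xhat_i - ln(1 + exp(beta^T xhat_i)) ].

    xhat: (N, n+1) augmented samples [x_i^T, 1]; c: (N,) labels in {0, 1}.
    """
    z = xhat @ beta                              # beta^T xhat_i for every sample
    return np.sum(c * z - np.logaddexp(0.0, z))  # logaddexp evaluates ln(1+e^z) stably
```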
4. Optimization Algorithms for Model Learning

The loss function is usually taken to be the negative log-likelihood, $l(\boldsymbol{\beta})=-\ln L(\boldsymbol{w},b)=-\ln L(\boldsymbol{\beta})$. Maximizing the likelihood is therefore equivalent to minimizing the loss $l(\boldsymbol{\beta})$:

$$l(\boldsymbol{\beta})=-\ln L(\boldsymbol{\beta})=-\sum_{i=1}^N \left[ c_{i}\boldsymbol{\beta}^{T}\hat{\boldsymbol{x}}_{i}-\ln\left( 1+e^{\boldsymbol{\beta}^{T}\hat{\boldsymbol{x}}_{i}} \right) \right]$$
Since $l(\boldsymbol{\beta})$ is a convex, continuously differentiable function of $\boldsymbol{\beta}$ with derivatives of all orders, numerical optimization methods can be used to solve for $\boldsymbol{\beta}$.
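Convexity means any general-purpose optimizer can find the global minimum; for instance, a sketch using `scipy.optimize.minimize` under the same `xhat`/`c` conventions as above:

```python
import numpy as np
from scipy.optimize import minimize

def loss(beta, xhat, c):
    """l(beta) = -sum_i [ c_i * beta^T xhat_i - ln(1 + exp(beta^T xhat_i)) ]."""
    z = xhat @ beta
    return -np.sum(c * z - np.logaddexp(0.0, z))

# Example call, assuming xhat has shape (N, n+1) and c has shape (N,):
# res = minimize(loss, np.zeros(xhat.shape[1]), args=(xhat, c), method='BFGS')
# beta_hat = res.x
```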
4.1 Gradient Descent
When solving with gradient descent, the negative gradient serves as the descent direction:

$$\begin{aligned} \dfrac{\partial l(\boldsymbol{\beta})}{\partial \boldsymbol{\beta}}&=-\sum_{i=1}^N \left( c_{i}\hat{\boldsymbol{x}}_{i}-\dfrac{1}{1+e^{\boldsymbol{\beta}^{T}\hat{\boldsymbol{x}}_{i}}}e^{\boldsymbol{\beta}^{T}\hat{\boldsymbol{x}}_{i}}\hat{\boldsymbol{x}}_{i} \right)\\ &=-\sum_{i=1}^N \left( c_{i}-\dfrac{e^{\boldsymbol{\beta}^{T}\hat{\boldsymbol{x}}_{i}}}{1+e^{\boldsymbol{\beta}^{T}\hat{\boldsymbol{x}}_{i}}} \right)\hat{\boldsymbol{x}}_{i} \\ &=-\sum_{i=1}^N [ c_{i}-y(\boldsymbol{x}_{i}) ]\hat{\boldsymbol{x}}_{i} \qquad (1) \end{aligned}$$
Since the parameter vector is $\boldsymbol{\beta}=[\boldsymbol{w}^T,b]^{T}=[w_1,\cdots,w_n,b]^T$ and $\hat{\boldsymbol{x}}_{i}=[\boldsymbol{x}_{i}^T,1]^T=[x_i^{(1)},\cdots,x_i^{(n)},1]^T$, formula (1) is equivalent to:

$$\begin{cases}\dfrac{\partial l(\boldsymbol{\beta})}{\partial \boldsymbol w}=-\displaystyle\sum_{i=1}^N [ c_{i}-y\left( \boldsymbol{x}_{i}\right) ]\boldsymbol{x}_{i} \qquad &(2) \\[2ex] \dfrac{\partial l(\boldsymbol{\beta})}{\partial b}=-\displaystyle\sum_{i=1}^N\left[ c_{i}-y\left( \boldsymbol{x}_{i}\right) \right] \qquad &(3) \end{cases}$$
Considering each component $x_{i}^{(j)}$ of $\boldsymbol{x}_{i}=[x_i^{(1)},\cdots,x_i^{(n)}]^T$, formula (2) can also be written per weight as:

$$\dfrac{\partial l(\boldsymbol{\beta})}{\partial w_{j}}=-\sum_{i=1}^N [ c_{i}-y\left( \boldsymbol{x}_{i}\right) ]x_{i}^{(j)}$$
The gradient of the loss function can then be assembled as:

$$\dfrac{\partial l(\boldsymbol{\beta})}{\partial \boldsymbol{\beta}}=\left[\dfrac{\partial l(\boldsymbol{\beta})}{\partial w_1},\cdots,\dfrac{\partial l(\boldsymbol{\beta})}{\partial w_n},\dfrac{\partial l(\boldsymbol{\beta})}{\partial b} \right]^{T}$$
The gradient-descent update rule for the parameters is therefore ($\alpha$ is the step size):

$$\boldsymbol{\beta}^{t+1}=\boldsymbol{\beta}^{t}-\alpha\dfrac{\partial l(\boldsymbol{\beta})}{\partial \boldsymbol{\beta}}$$
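In vectorized form this update is one line per step; a minimal sketch, assuming `xhat` is the $N\times(n+1)$ augmented sample matrix and `c` the label vector:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gradient_step(beta, xhat, c, alpha):
    """One step of beta <- beta - alpha * dl/dbeta,
    with dl/dbeta = -sum_i (c_i - y(x_i)) * xhat_i."""
    grad = -xhat.T @ (c - sigmoid(xhat @ beta))
    return beta - alpha * grad
```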
4.2 Newton's Method
Newton's method finds the next search point by setting the derivative of the second-order Taylor approximation at the current point to $0$. Besides the gradient, it therefore requires the inverse of the Hessian matrix.
The gradient has already been derived:

$$\dfrac{\partial l(\boldsymbol{\beta})}{\partial \boldsymbol{\beta}}=-\sum_{i=1}^N [ c_{i}-y(\boldsymbol{x}_{i}) ]\hat{\boldsymbol{x}}_{i}$$
The Hessian matrix is then (the term $\sum_{i} c_{i}\hat{\boldsymbol{x}}_{i}$ does not depend on $\boldsymbol{\beta}$ and drops out in the second step):

$$\begin{aligned} \dfrac{\partial}{\partial \boldsymbol\beta^T}\left(\dfrac{\partial l(\boldsymbol{\beta})}{\partial \boldsymbol\beta}\right) &=\dfrac{\partial}{\partial \boldsymbol\beta^T}\left(-\sum_{i=1}^N [ c_{i}-y(\boldsymbol{x}_{i}) ]\hat{\boldsymbol{x}}_{i}\right)\\ &=\dfrac{\partial}{\partial \boldsymbol\beta^T}\left(\sum_{i=1}^N y(\boldsymbol{x}_{i})\hat{\boldsymbol{x}}_{i}\right)\\ &=\sum_{i=1}^N y(\boldsymbol{x}_{i})[1-y(\boldsymbol{x}_{i})]\hat{\boldsymbol{x}}_{i}\dfrac{\partial}{\partial \boldsymbol\beta^T}\left(\boldsymbol{\beta}^{T}\hat{\boldsymbol{x}}_{i}\right)\\ &=\sum_{i=1}^N y(\boldsymbol{x}_{i})[1-y(\boldsymbol{x}_{i})]\hat{\boldsymbol{x}}_{i}\hat{\boldsymbol{x}}_{i}^T \end{aligned}$$
The Newton update rule for the parameters is therefore:

$$\boldsymbol{\beta}^{t+1}=\boldsymbol{\beta}^{t}-\left(\dfrac{\partial^2 l(\boldsymbol\beta)}{\partial \boldsymbol\beta\,\partial \boldsymbol\beta^T}\right)^{-1}\dfrac{\partial l(\boldsymbol{\beta})}{\partial \boldsymbol{\beta}}$$
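The implementation in section 6 below only uses gradient descent; for completeness, a minimal sketch of one Newton step under the same `xhat`/`c` conventions, solving the linear system rather than forming the inverse explicitly:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def newton_step(beta, xhat, c):
    """One Newton step: beta <- beta - H^{-1} g, where
    g = -sum_i (c_i - y_i) * xhat_i and
    H = sum_i y_i * (1 - y_i) * xhat_i xhat_i^T."""
    y = sigmoid(xhat @ beta)
    grad = -xhat.T @ (c - y)
    hess = (xhat * (y * (1.0 - y))[:, None]).T @ xhat
    return beta - np.linalg.solve(hess, grad)
```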
5. Training Procedure

If gradient descent is used to solve for the model parameters, training proceeds as follows:
1) Randomly choose an initial value $\boldsymbol{\beta}^{0}$ for $\boldsymbol{\beta}=[\boldsymbol{w}^T,b]^{T}$.
2) Choose a step size $\alpha$ and iterate the following two formulas until the termination condition is met:

$$\begin{aligned} \dfrac{\partial l(\boldsymbol{\beta})}{\partial \boldsymbol{\beta}} &=-\sum_{i=1}^N [ c_{i}-y(\boldsymbol{x}_{i}) ]\hat{\boldsymbol{x}}_{i}\\ &=-\sum_{i=1}^N [ c_{i}-y(\boldsymbol{x}_{i}) ]\begin{bmatrix}\boldsymbol{x}_{i}\\ 1\end{bmatrix} \end{aligned}$$

$$\boldsymbol{\beta}^{t+1}=\boldsymbol{\beta}^{t}-\alpha\dfrac{\partial l(\boldsymbol{\beta})}{\partial \boldsymbol{\beta}}$$
6. Implementation (Binary Classification)

1) Define the Sigmoid function:

```python
import numpy as np
import matplotlib.pyplot as plt


def sigmoid(x):
    '''Sigmoid function'''
    return 1.0 / (1 + np.exp(-x))
```
2) Functions to generate and read the training/testing data

Suppose the dataset lives in the two-dimensional plane $\mathbb R^{2}$ and is stored one sample per line in the format $(\boldsymbol{x}_{i},y_{i})=(x_{i}^{(1)},x_{i}^{(2)},y_{i})$, $y_{i}\in\{0,1\}$:

```
3.562302,25.329208,1.000000
-24.268267,1.272092,1.000000
25.405790,8.463017,1.000000
-6.908775,23.298889,1.000000
40.621010,-25.134052,0.000000
-9.305521,14.983097,1.000000
20.041330,-25.381725,0.000000
37.298540,-26.767307,0.000000
35.856177,-31.080316,0.000000
-17.976889,4.244106,1.000000
......
```
Generate a two-dimensional Gaussian dataset with two centers:

```python
def gen_gausssian(mean1, mean2, cov1, cov2, num):
    '''Generate a 2-d Gaussian dataset with 2 clusters.'''
    # Positive cluster (label 1) and negative cluster (label 0)
    data1 = np.random.multivariate_normal(mean1, cov1, num)
    label1 = np.ones((1, num)).T
    data_pos = np.append(data1, label1, axis=1)
    data2 = np.random.multivariate_normal(mean2, cov2, num)
    label2 = np.zeros((1, num)).T
    data_neg = np.append(data2, label2, axis=1)
    # Merge the two clusters and shuffle the rows
    data = np.append(data_pos, data_neg, axis=0)
    shuffle_data = np.random.permutation(data)
    # Scatter plot of both clusters with their centers marked
    x1, y1 = data1.T
    x2, y2 = data2.T
    plt.scatter(x1, y1, c='r', s=3)
    plt.plot(mean1[0], mean1[1], 'ko')
    plt.scatter(x2, y2, c='b', s=3)
    plt.plot(mean2[0], mean2[1], 'ko')
    plt.axis()
    plt.title("2-d gaussian dataset with 2 clusters")
    plt.xlabel("x")
    plt.ylabel("y")
    plt.show()
    np.savetxt('gaussdata.txt', shuffle_data, fmt='%f', delimiter=',')
    return shuffle_data, data_pos, data_neg
```
The generated clusters are shown in the scatter plot above. The following function reads training or testing data saved in the $(x_{i}^{(1)},x_{i}^{(2)},y_{i})$, $y_{i}\in\{0,1\}$ format and returns it as a numpy array:

```python
def load_data(filename):
    '''Load data of the training or testing set.'''
    tdata = []
    with open(filename) as f:
        for line in f:
            # Each line has the form "x1,x2,label"
            tdata.append([float(item) for item in line.split(',')])
    return np.array(tdata)
```
3) Iterate the gradient-descent update built from the gradient in formula (1) (the rows of `xhat` already carry the appended constant 1), displaying the loss after each iteration:

```python
def lr_train(xhat, c, alpha, num):
    '''Gradient-descent training: beta <- beta + alpha * sum_i (c_i - y_i) * xhat_i.'''
    beta = np.random.rand(3, 1)
    for i in range(num):
        yx = sigmoid(np.dot(xhat, beta))
        beta = beta + alpha * np.dot(xhat.T, (c - yx))
        print('#' + str(i) + ',training loss:' + str(train_loss(c, yx)))
    return beta
```
The loss (error value) is computed from the formula

$$-\ln L(\boldsymbol{w},b)=-\sum_{i=1}^N \left\{ c_{i}\ln\left[ y\left( \boldsymbol{x}_{i}\right) \right] +(1-c_{i})\ln\left[ 1-y\left( \boldsymbol{x}_{i}\right) \right] \right\}$$

```python
def train_loss(c, yx):
    '''Negative log-likelihood of the current predictions.'''
    err = 0.0
    for i in range(len(yx)):
        # Skip saturated predictions to avoid log(0)
        if yx[i, 0] > 0 and (1 - yx[i, 0]) > 0:
            err -= c[i, 0] * np.log(yx[i, 0]) + (1 - c[i, 0]) * np.log(1 - yx[i, 0])
    return err
```
Main program:

```python
mean1 = [3, -1]
cov1 = [[5, 0], [0, 10]]
mean2 = [-5, 7]
cov2 = [[10, 0], [0, 5]]
# 1100 samples per cluster: the first 2000 rows are used for training,
# the remaining 200 for testing
data, data_pos, data_neg = gen_gausssian(mean1, mean2, cov1, cov2, 1100)
training_data = data

# Training set, augmented with a constant-1 column
tmp1 = training_data[0:2000, 0:2]
tmp2 = np.ones((2000, 1))
xhat = np.concatenate((tmp1, tmp2), axis=1)
target = training_data[0:2000, 2:]
beta = lr_train(xhat, target, 0.01, 100)
print('beta:\n', beta)

# Test set
tmp1 = training_data[2000:2200, 0:2]
tmp2 = np.ones((200, 1))
testing_data = np.concatenate((tmp1, tmp2), axis=1)
target = training_data[2000:2200, 2:]
y1 = classification(testing_data, beta)
print(np.abs(y1 - target).T)
```
Training on the 2000-sample training set produces output such as:

```
#0,training loss:2767.7605301149197
#1,training loss:28706.32704095256
#2,training loss:24304.21071966826
#3,training loss:20729.807928831706
#4,training loss:18031.980567667095
#5,training loss:15793.907613945637
#6,training loss:13876.408972848896
#7,training loss:12260.25776604957
#8,training loss:10857.914022333194
#9,training loss:9702.173760769088
#10,training loss:8739.995403737194
#11,training loss:7909.116254144592
#12,training loss:7237.015743718265
#13,training loss:6581.515845960798
#14,training loss:6155.195323818418
#15,training loss:5782.624246205244
#16,training loss:5451.120323877512
#17,training loss:5159.985309063984
#18,training loss:4921.653117909279
#19,training loss:4728.055485820308
#20,training loss:4546.101559000789
#21,training loss:4368.003415240011
#22,training loss:4196.188712568878
#23,training loss:4032.1962049440162
#24,training loss:3876.771728177838
#25,training loss:3694.5715060625985
#26,training loss:3554.8126869561006
#27,training loss:3418.8100373192524
#28,training loss:3321.3029188728215
#29,training loss:3189.8131265721095
#30,training loss:3060.4306284382133
#31,training loss:2932.529506577584
#32,training loss:2807.0843716420854
#33,training loss:2684.4698423911955
#34,training loss:2564.789169175422
#35,training loss:2447.909093126709
#36,training loss:2333.712055985516
#37,training loss:2222.305120198585
#38,training loss:2114.0629245811747
#39,training loss:2009.9696271327145
#40,training loss:1911.4641438101942
#41,training loss:1818.4131000336629
#42,training loss:1731.1576524394175
#43,training loss:1648.321160807572
#44,training loss:1568.548376402433
#45,training loss:1491.2975705058457
#46,training loss:1416.3652001741157
#47,training loss:1343.7359069149327
#48,training loss:1273.1915049002964
#49,training loss:1204.2529637870934
#50,training loss:1136.9266223350025
#51,training loss:1071.3457943359633
#52,training loss:1007.323134162851
#53,training loss:944.9219846916478
#54,training loss:885.1816608689702
#55,training loss:899.9599868299116
#56,training loss:845.0057775193546
#57,training loss:793.5441317445959
#58,training loss:745.7136807432933
#59,training loss:701.6553680843865
#60,training loss:696.1322808438866
#61,training loss:689.6549290071879
#62,training loss:648.194522863791
#63,training loss:608.4750616604265
#64,training loss:570.7085584125251
#65,training loss:535.7643996771293
#66,training loss:504.1081114497143
#67,training loss:475.508191984496
#68,training loss:450.66041076529723
#69,training loss:429.67911785665814
#70,training loss:411.2956082830516
#71,training loss:394.87591024838343
#72,training loss:380.24926997738123
#73,training loss:403.36316867372153
#74,training loss:391.38976162245194
#75,training loss:381.3801916299097
#76,training loss:372.99093649597427
#77,training loss:366.0959147066188
#78,training loss:360.56881602093216
#79,training loss:355.8694556375197
#80,training loss:351.9153236373713
#81,training loss:348.4531107835549
#82,training loss:345.4202325927137
#83,training loss:342.7408477041635
#84,training loss:340.38193613578653
#85,training loss:338.2928279212097
#86,training loss:336.440189142936
#87,training loss:334.7845603199353
#88,training loss:333.29353615870866
#89,training loss:331.9350276518051
#90,training loss:330.68392791191314
#91,training loss:329.51840299766616
#92,training loss:328.4199642611809
#93,training loss:327.3721997842567
#94,training loss:326.36511808367675
#95,training loss:325.3868163444934
#96,training loss:324.43080556455044
#97,training loss:323.49135041237633
#98,training loss:322.5630405032738
#99,training loss:321.64316225672036
beta:
 [[ 5.96987205]
 [-6.41668657]
 [30.84845393]]
```
This beta value is the estimate of $\boldsymbol{\beta}=[\boldsymbol{w}^{T},b]^{T}$ obtained by gradient descent.
The test samples are classified by thresholding the predicted probability at $0.5$:

```python
def classification(testing_data, beta):
    '''Predict 0/1 labels by thresholding the predicted probability at 0.5.'''
    y = sigmoid(np.dot(testing_data, beta))
    for i in range(len(y)):
        if y[i, 0] < 0.5:
            y[i, 0] = 0.0
        else:
            y[i, 0] = 1.0
    return y
```
Classification results: of the 200 test samples, 2 are misclassified (the entries equal to 1 mark the errors):

```
[[0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
  0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 1.
  0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
  0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
  0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
  0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
  0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
  0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
  0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
  0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]]
```