A Detailed Discussion of the Entangled Relationship Between Logistic Regression, Naive Bayes, and the Maximum Entropy Principle (Part 4)

Section 4. A Remarkable Coincidence: the Loss Functions of Logistic Regression


1. Logistic Loss: Negative Sum of Log Accuracy

Suppose a correct prediction scores 1 point and an incorrect one scores 0, with the label $\in \{1, -1\}$.

Then, for the $i$-th training example, if the true label $= 1$, the probability of scoring 1 point is $\frac{1}{1+\exp(-\vec{w}^T\vec{x_i})}$

If the true label $= -1$, the probability of scoring 1 point is $\frac{\exp(-\vec{w}^T\vec{x_i})}{1+\exp(-\vec{w}^T\vec{x_i})} = \frac{1}{1+\exp(\vec{w}^T\vec{x_i})}$

Combining the two cases, the probability of scoring 1 point is $P(\text{accurate}) = \frac{1}{1+\exp(-{\color{red}y_i}\,\vec{w}^T\vec{x_i})}$

$$
\begin{aligned}
Loss &= \text{Negative sum of log accuracy} \\
&= -\sum^{n}_{i=1}\log\big(P(\text{accurate})\big) \\
&= -\sum^{n}_{i=1}\log\Big(\frac{1}{1+\exp(-y_i\vec{w}^T\vec{x_i})}\Big) \\
&= \sum^{n}_{i=1}\log\big[\,1+\exp(-y_i\vec{w}^T\vec{x_i})\,\big]
\end{aligned}
$$

$n$ is the batch size.

If we use SGD (stochastic gradient descent), then $n = 1$. Taking the derivative of the Loss with respect to $\vec w$:

$$
\frac{\partial{Loss}}{\partial{\vec w}} = \frac{\exp(-y_i\vec{w}^T\vec{x_i})\,(-y_i\vec{x_i})}{1+\exp(-y_i\vec{w}^T\vec{x_i})} = (-y_i\vec{x_i})\,P(\text{not accurate})
$$

Remember the weight-update rule of gradient descent? $\Rightarrow\ \vec w = \vec w-\alpha d$

where $\alpha$ is the learning rate and $d$ is the gradient, i.e. the $(-y_i\vec{x_i})\,P(\text{not accurate})$ we just computed.

Weight update: $\vec w = \vec w+\alpha\, P(\text{not accurate})\,y_i\vec{x_i}$ (the minus sign inside the gradient cancels the minus sign in the update rule).

In general, the logistic loss is written as ${\color{#FF7256}L(y, f(x))=\log[\,1+\exp(-yf(x))\,]}$. In other words, in logistic regression $f(x)=\vec w^T\vec x$, while in general $f(x)$ can be replaced by a more complex function, or even a neural network.
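To make this concrete, here is a minimal NumPy sketch (the variable names and toy data are my own, not from the original post) that evaluates the logistic loss for labels in $\{+1,-1\}$ and performs the SGD step derived above:

```python
import numpy as np

def logistic_loss(w, x, y):
    # per-sample logistic loss: log(1 + exp(-y * w^T x)), with y in {+1, -1}
    return np.log1p(np.exp(-y * (w @ x)))

def sgd_step(w, x, y, lr=0.1):
    # P(not accurate) = exp(-y w^T x) / (1 + exp(-y w^T x)) = sigma(-y w^T x)
    p_wrong = 1.0 / (1.0 + np.exp(y * (w @ x)))
    # gradient descent: w <- w - lr * (-y x) * P(not accurate) = w + lr * P(not accurate) * y * x
    return w + lr * p_wrong * y * x

# toy example: one sample; repeated SGD steps drive its loss toward 0
w = np.zeros(3)
x, y = np.array([1.0, 2.0, -1.0]), -1
for _ in range(100):
    w = sgd_step(w, x, y)
print(logistic_loss(w, x, y))
```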


2. Maximum Likelihood Loss

We just defined the label $y \in \{1,-1\}$. Now let $t = \frac{y+1}{2}$, so that $t \in \{1,0\}$:

$$
\begin{cases} y=1\ \Rightarrow\ t=1 \\ y=-1\ \Rightarrow\ t=0 \end{cases}
$$

In the model, the conditional probability $P(t_i \mid \vec{x_i})$ follows a $Bernoulli$ distribution:

$$
\begin{cases} P(t_i=1 \mid \vec{x_i}) = \sigma(\vec w^T\vec{x_i}) \\ P(t_i=0 \mid \vec{x_i}) = 1-\sigma(\vec w^T\vec{x_i}) \end{cases}
$$

From the maximum-likelihood point of view, assume this model is correct; then the probability (likelihood) of observing the $n$ samples in our dataset, $(\vec{x_1}, t_1), (\vec{x_2}, t_2), \ldots, (\vec{x_n}, t_n)$, is:

$$Likelihood=\prod^{n}_{i=1}\,[\sigma(\vec w^T\vec{x_i})]^{t_i}\,[1-\sigma(\vec w^T\vec{x_i})]^{1-t_i}$$

$$\log(Likelihood)=\sum^{n}_{i=1}\Big\{\,t_i\log[\sigma(\vec w^T\vec{x_i})]+(1-t_i)\log[1-\sigma(\vec w^T\vec{x_i})]\,\Big\}$$

$$\max_{\vec w}\ \{\log(Likelihood)\}= \min_{\vec w}\ \{-\log(Likelihood)\}$$

$$\Rightarrow {\color{red}Loss}=-\log(Likelihood) = {\color{red}-\sum^{n}_{i=1}\Big\{\,t_i\log[\sigma(\vec w^T\vec{x_i})]+(1-t_i)\log[1-\sigma(\vec w^T\vec{x_i})]\,\Big\}}$$
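As a quick illustration (toy data of my own, not from the original post), this negative log-likelihood for labels $t \in \{0,1\}$ can be computed directly from the formula above:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def neg_log_likelihood(w, X, t):
    # Loss = -sum_i { t_i*log(sigma(w^T x_i)) + (1-t_i)*log(1 - sigma(w^T x_i)) }
    p = sigmoid(X @ w)
    return -np.sum(t * np.log(p) + (1 - t) * np.log(1 - p))

X = np.array([[1.0, 2.0], [0.5, -1.0], [2.0, 0.3]])   # 3 samples, 2 features
t = np.array([1.0, 0.0, 1.0])                          # labels in {1, 0}
print(neg_log_likelihood(np.zeros(2), X, t))           # = 3*log(2) when w = 0
```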


3. Cross Entropy Loss

When you saw the maximum-likelihood loss, did you notice that its formula is exactly the same as that of the binary cross-entropy loss?

$${\color{red}CrossEntropyLoss=-\sum^{n}_{i=1}\Big\{\,t_i\log[\sigma(\vec w^T\vec{x_i})]+(1-t_i)\log[1-\sigma(\vec w^T\vec{x_i})]\,\Big\}}$$

Still, let us examine cross entropy carefully from the perspective of information theory.

Entropy measures the uncertainty of a random variable; it is also, in an average sense, a measure of the code length needed to describe the random variable.

Relative entropy, also called the K-L divergence, describes the distance between two probability distributions. It is defined as follows ($\sum_{x}$ denotes a sum over all possible values of the random variable $x$):

$$KL(P\ \|\ Q)=\sum_{x}P(x)\log\frac{P(x)}{Q(x)}$$

$$\Leftrightarrow KL(P\ \|\ Q)=-\sum_{x}P(x)\log Q(x)\ -\ \Big(-\sum_{x}P(x)\log P(x)\Big)$$

$${\color{#FF7256}\Leftrightarrow KL(P\ \|\ Q)=CrossEntropy(P\ \|\ Q)-H(P)}$$

$\Leftrightarrow KL(P\ \|\ Q)=$ (the cross entropy of distributions $P$ and $Q$) $-$ (the entropy of distribution $P$)

An excerpt from an information theory textbook: relative entropy measures the inefficiency of assuming that the distribution is $Q$ when the true distribution is $P$. For example, knowing that the true distribution of a random variable is $P$, we can construct a code with average length $H(P)$ to describe it. If, however, we use the code designed for distribution $Q$, the average code length becomes $H(P)+KL(P\ \|\ Q) = CrossEntropy(P\ \|\ Q)$

$${\color{#FF7256}CrossEntropy(P\ \|\ Q)=-\sum_{x}P(x)\log Q(x)}$$

In machine learning, the true distribution $P$ is the distribution of the sample data, and the assumed distribution $Q$ is the distribution the model believes in. We want $P$ and $Q$ to be as close as possible. Since $H(P)$ does not depend on the model parameters, minimizing $KL(P\ \|\ Q)$ during training is equivalent to minimizing $CrossEntropy(P\ \|\ Q)$. This is why machine learning is so fond of cross entropy. Its meaning: it describes the difference between two probability distributions; the smaller the difference, the smaller the cross entropy.
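Here is a tiny numeric check of the identity $KL(P\ \|\ Q) = CrossEntropy(P\ \|\ Q) - H(P)$; the two distributions below are made-up numbers for illustration:

```python
import numpy as np

P = np.array([0.6, 0.3, 0.1])      # "true" distribution
Q = np.array([0.5, 0.25, 0.25])    # "model" distribution

H_P = -np.sum(P * np.log(P))       # entropy H(P)
CE  = -np.sum(P * np.log(Q))       # cross entropy CrossEntropy(P || Q)
KL  =  np.sum(P * np.log(P / Q))   # KL(P || Q)

print(np.isclose(KL, CE - H_P))    # True
```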

In a binary classification problem, the random variable $label$ denotes the class and has only two possible values: 1 and 0. We then consider the conditional probability distribution given the feature vector $\vec{x_i}$. Let the true conditional distribution be $R$ (to avoid confusion with the $P_i$ of the $Bernoulli$ distribution below), and let the conditional distribution simulated by the model be $Q$.

$$CrossEntropy_i(R\ \|\ Q)=-\sum_{k=1,\,0}R(label=k \mid \vec{x_i})\,\log Q(label=k \mid \vec{x_i})$$

where $Q(label=1 \mid \vec{x_i})=\sigma(\vec w^T\vec{x_i})=P_i$ and $Q(label=0 \mid \vec{x_i})=1-\sigma(\vec w^T\vec{x_i})=1-P_i$.


Since the sample labels are already determined, the conditional distribution within the sample is no longer random:

① If label = 1, then $R(label=1 \mid \vec{x_i})=1$ and $R(label=0 \mid \vec{x_i})=0$

$\Rightarrow CrossEntropy_i(R\ \|\ Q)=-[1\cdot\log P_i+0\cdot\log(1-P_i)]$

② If label = 0, then $R(label=1 \mid \vec{x_i})=0$ and $R(label=0 \mid \vec{x_i})=1$

$\Rightarrow CrossEntropy_i(R\ \|\ Q)=-[0\cdot\log P_i+1\cdot\log(1-P_i)]$

Combining the two cases: $CrossEntropy_i(R\ \|\ Q)=-[label\cdot\log P_i+(1-label)\cdot\log(1-P_i)]$

Summing the loss over every data point gives the cross-entropy loss formula (already listed above):

$$-\sum^{n}_{i=1}\Big\{t_i\log[\sigma(\vec w^T\vec{x_i})]+(1-t_i)\log[1-\sigma(\vec w^T\vec{x_i})]\Big\}$$


Summary

The Maximum Likelihood Loss and the Cross Entropy Loss are equivalent. That is, maximizing the likelihood is also minimizing the cross entropy, which in turn means pulling the distribution simulated by the model closer and closer to the true distribution of the sample data. This looks like a remarkable coincidence, but in fact there is a chain of theory connecting them. In statistics, relative entropy corresponds to the "expected log-likelihood ratio":

$$KL(P\ \|\ Q)=\sum_{x}P(x)\log\frac{P(x)}{Q(x)}=E_{x\sim P}\Big[\log\frac{P(x)}{Q(x)}\Big]$$

Following this line of reasoning, the cross entropy is essentially the negative "expected log-likelihood" (let $x$ be the random variable):

$\vec w^*=argmax_{\vec w} \displaystyle\prod^{n}_{i=1} Q(x_i \mid \vec w)\ \leftarrow$ maximum-likelihood estimation of the parameters

$\quad\ =argmax_{\vec w} \displaystyle\sum^{n}_{i=1} \log Q(x_i \mid \vec w)\ \leftarrow$ take the log; it is monotone, so the $argmax$ is unchanged

$\quad\ =argmin_{\vec w} \Big\{-\displaystyle\sum^{n}_{i=1} \log Q(x_i \mid \vec w)\Big\}$

$\quad\ =argmin_{\vec w} \Big\{-\displaystyle\sum_{x} Count(X=x)\log Q(x \mid \vec w)\Big\}\ \leftarrow$ switch from iterating over sample indices to iterating over the values of $x$

$\quad\ =argmin_{\vec w} \Big\{-\displaystyle\sum_{x} \frac{Count(X=x)}{n}\log Q(x \mid \vec w)\Big\}\ \leftarrow$ divide the objective by the sample size $n$; rescaling the objective does not change the $argmin$

$\quad\ =argmin_{\vec w} \Big\{-\displaystyle\sum_{x} \hat{R}(X=x)\log Q(x \mid \vec w)\Big\}\ \leftarrow$ ${\color{#8AD597}\hat{R}(X=x)=\frac{Count(X=x)}{n}}$ is the empirical distribution of the sample data

$\quad\ =argmin_{\vec w} \Big\{-E_{x\sim \hat{R}}\big[\log Q(x \mid \vec w)\big]\Big\}$

$\quad\ =argmin_{\vec w} \Big\{-E_{x\sim R}\big[\log Q(x \mid \vec w)\big]\Big\}\ \leftarrow$ when there is enough data and the sampling is good enough, the empirical distribution $\approx$ the true distribution

$\quad\ =argmin_{\vec w} \Big\{-\displaystyle\sum_{x} R(X=x)\log Q(x \mid \vec w)\Big\}\ \leftarrow$ we have now arrived at the form "find the parameters that minimize the cross entropy"
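The key step in this chain, switching from a sum over sample indices to a sum weighted by the empirical distribution, can be checked numerically. A small sketch (the samples and the model distribution $Q$ below are made up for illustration):

```python
import numpy as np

# made-up samples of a discrete random variable taking values {0, 1, 2}
samples = np.array([0, 0, 1, 2, 1, 0, 1, 1])
n = len(samples)
Q = np.array([0.3, 0.5, 0.2])               # model distribution Q(x | w)

nll = -np.sum(np.log(Q[samples]))           # -sum_i log Q(x_i | w)

R_hat = np.bincount(samples, minlength=3) / n   # empirical distribution R_hat(X = x)
cross_entropy = -np.sum(R_hat * np.log(Q))      # CrossEntropy(R_hat || Q)

print(np.isclose(nll, n * cross_entropy))   # True: NLL = n * cross entropy, same argmin
```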

Next, we can also show that the $Logistic\ Loss$ is equivalent to the maximum-likelihood loss:

If the label $\in \{1,-1\}$, then $Q(y_i \mid \vec{x_i}) = \frac{1}{1+\exp(-y_i\vec{w}^T\vec{x_i})}=\sigma(y_i\vec{w}^T\vec{x_i})$

$$Likelihood= \prod^n_{i=1}\sigma(y_i\vec{w}^T\vec{x_i})$$

$$-\log(Likelihood)=-\sum^n_{i=1}\log\big[\sigma(y_i\vec{w}^T\vec{x_i})\big]=\sum^{n}_{i=1}\log\big[\,1+\exp(-y_i\vec{w}^T\vec{x_i})\,\big]$$

If the label (relabelled as $t$) $\in \{1,0\}$, then $Q(t_i \mid \vec{x_i}) =[\sigma(\vec w^T\vec{x_i})]^{t_i}\,[1-\sigma(\vec w^T\vec{x_i})]^{1-t_i}$

$$Likelihood=\prod^n_{i=1}[\sigma(\vec w^T\vec{x_i})]^{t_i}\,[1-\sigma(\vec w^T\vec{x_i})]^{1-t_i}$$

$$-\log(Likelihood)=-\sum^n_{i=1}\Big\{t_i\log[\sigma(\vec w^T\vec{x_i})]+(1-t_i)\log[1-\sigma(\vec w^T\vec{x_i})]\Big\}$$

As we can see, the $Logistic\ Loss$ is equivalent to the maximum-likelihood loss; the only difference is how the labels are defined.
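A small numeric check of this equivalence (random toy data, my own construction): with $t_i = \frac{y_i+1}{2}$, the logistic loss over labels in $\{+1,-1\}$ and the maximum-likelihood / cross-entropy loss over labels in $\{1,0\}$ give exactly the same number.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
w = rng.normal(size=4)
X = rng.normal(size=(5, 4))
y = rng.choice([-1, 1], size=5)          # labels in {+1, -1}
t = (y + 1) / 2                          # relabelled to {1, 0}

logistic_loss = np.sum(np.log1p(np.exp(-y * (X @ w))))

p = sigmoid(X @ w)
cross_entropy_loss = -np.sum(t * np.log(p) + (1 - t) * np.log(1 - p))

print(np.isclose(logistic_loss, cross_entropy_loss))   # True
```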

This is the meaning of the "remarkable coincidence" in this section's title: whether a logistic regression model is trained with the Logistic Loss, the Maximum Likelihood Loss, or the Cross Entropy Loss, the resulting parameters are the same set of parameters! And this is not pure chance; it is the same concept appearing in corresponding forms in information theory and in statistics.


Reprinted from blog.csdn.net/weixin_43928665/article/details/106817285