Summary Notes on LR and Softmax Loss Functions

The Loss Function in LR

In linear regression, the loss function is derived from an assumption: if the errors are independent and identically distributed, then by the central limit theorem they follow a Gaussian distribution with mean 0 and variance $\sigma^{2}$. From this we can write down the log-likelihood; maximizing it is equivalent to minimizing the squared-error term, which gives the loss function:

$$
\begin{aligned}
\ell(\theta) &=\log L(\theta) \\
&=\log \prod_{i=1}^{m} \frac{1}{\sqrt{2 \pi} \sigma} \exp \left(-\frac{\left(y^{(i)}-\theta^{T} x^{(i)}\right)^{2}}{2 \sigma^{2}}\right) \\
&=\sum_{i=1}^{m} \log \frac{1}{\sqrt{2 \pi} \sigma} \exp \left(-\frac{\left(y^{(i)}-\theta^{T} x^{(i)}\right)^{2}}{2 \sigma^{2}}\right) \\
&= m \log \frac{1}{\sqrt{2 \pi} \sigma}-\frac{1}{\sigma^{2}} \cdot \frac{1}{2} \sum_{i=1}^{m}\left(y^{(i)}-\theta^{T} x^{(i)}\right)^{2} \\
J(\theta) &=\frac{1}{2} \sum_{i=1}^{m}\left(h_{\theta}\left(x^{(i)}\right)-y^{(i)}\right)^{2}
\end{aligned}
$$
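As a quick illustration of this squared-error cost, here is a minimal NumPy sketch (the function name `squared_error_cost` and the toy data are assumptions added for illustration, not from the original post):

import numpy as np

def squared_error_cost(theta, X, y):
    # J(theta) = 1/2 * sum_i (theta^T x_i - y_i)^2
    residuals = X @ theta - y
    return 0.5 * np.sum(residuals ** 2)

# toy data: 3 samples, a bias column of ones plus one feature, with y = 1 + 2x exactly
X = np.array([[1.0, 2.0], [1.0, 3.0], [1.0, 4.0]])
y = np.array([5.0, 7.0, 9.0])
theta = np.array([1.0, 2.0])
print(squared_error_cost(theta, X, y))  # 0.0, since theta fits the data perfectly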

For a more detailed derivation and code, see my earlier post:

Detailed Notes: From Linear Regression to Gradient Descent


Cross-Entropy Loss Function

For the linear regression model, the cost function we defined is the sum of squared errors over all samples. In principle, we could reuse this definition for the logistic regression model. Suppose there are $m$ known samples, where $\left(x^{(i)}, y^{(i)}\right)$ denotes the $i$-th sample and its class label, $x^{(i)}=\left(1, x_{1}^{(i)}, x_{2}^{(i)}, \ldots, x_{p}^{(i)}\right)^{T}$ is a $(p+1)$-dimensional vector, and $y^{(i)}$ is a number indicating the class (only binary classification is considered here). The model parameters are $\theta=\left(\theta_{0}, \theta_{1}, \theta_{2}, \ldots, \theta_{p}\right)^{T}$, so that:

$$
\theta^{T} x^{(i)} := \theta_{0}+\theta_{1} x_{1}^{(i)}+\cdots+\theta_{p} x_{p}^{(i)}
$$

The hypothesis function is defined as:

$$
h_{\theta}\left(x^{(i)}\right)=\frac{1}{1+e^{-\theta^{T} x^{(i)}}}
$$

Since this is a 0/1 classification problem, we can write down the loss we want directly from the class probabilities. Taking the logarithm of each probability (which does not change monotonicity) gives:

$$
\log P\left(\hat{y}^{(i)}=1 \mid x^{(i)} ; \theta\right)=\log h_{\theta}\left(x^{(i)}\right)=\log \frac{1}{1+e^{-\theta^{T} x^{(i)}}}
$$

$$
\log P\left(\hat{y}^{(i)}=0 \mid x^{(i)} ; \theta\right)=\log \left(1-h_{\theta}\left(x^{(i)}\right)\right)=\log \frac{e^{-\theta^{T} x^{(i)}}}{1+e^{-\theta^{T} x^{(i)}}}
$$


Then, for all $m$ samples together, we can measure how well the model fits the entire training set:

$$
\sum_{i=1}^{m}\left[y^{(i)}\log\left(h_\theta(x^{(i)})\right)+\left(1-y^{(i)}\right)\log\left(1-h_\theta(x^{(i)})\right)\right]
$$

But here a conflict arises: we want the likelihood to be as large as possible, meaning a higher probability for the observed data, while a loss function should be as small as possible. Simply taking the negative of the log-likelihood above resolves this:

$$
J(\theta)=-\frac{1}{m}\sum_{i=1}^{m}\left[y^{(i)}\log\left(h_\theta(x^{(i)})\right)+\left(1-y^{(i)}\right)\log\left(1-h_\theta(x^{(i)})\right)\right]
$$

We thus redefine the loss function of logistic regression as $J(\theta)=\frac{1}{m} \sum_{i=1}^{m} \operatorname{cost}\left(h_{\theta}\left(x^{(i)}\right), y^{(i)}\right)$, where:

$$
\operatorname{cost}\left(h_{\theta}(x), y\right)=\begin{cases}-\log \left(h_{\theta}(x)\right) & \text{if } y=1 \\ -\log \left(1-h_{\theta}(x)\right) & \text{if } y=0\end{cases}
$$

Viewed in terms of the curve: if we simply reuse the squared-error definition from linear regression and plug the sigmoid into it, the resulting cost function is non-convex. Taking the logarithm of the likelihood and then negating it flips the curve into a differentiable, bowl-shaped convex function, and averaging over the samples puts it on a per-sample scale; this is the final objective.
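To make the piecewise cost concrete, here is a minimal sketch (the probability values are made-up examples) showing how the penalty grows as the predicted probability moves away from the true label:

import numpy as np

def point_cost(h, y):
    # -log(h) when y = 1, -log(1 - h) when y = 0
    return -np.log(h) if y == 1 else -np.log(1 - h)

for h in (0.9, 0.5, 0.1):
    print(f"y=1, h={h}: cost = {point_cost(h, 1):.3f}")  # small when h is close to 1
    print(f"y=0, h={h}: cost = {point_cost(h, 0):.3f}")  # small when h is close to 0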


Vectorizing this over all $m$ samples, the cost function above can be implemented as:

import numpy as np

def sigmoid(z):
    # logistic (sigmoid) function, mapping scores to (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

def cost(theta, X, y):
    # vectorized cross-entropy cost for logistic regression
    theta = np.matrix(theta)
    X = np.matrix(X)
    y = np.matrix(y)
    first = np.multiply(-y, np.log(sigmoid(X * theta.T)))
    second = np.multiply((1 - y), np.log(1 - sigmoid(X * theta.T)))
    return np.sum(first - second) / len(X)
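A quick sanity check on toy data (the arrays below are made-up values, not from the original post): with $\theta = 0$ every prediction is 0.5, so the average cost should be $-\log 0.5 \approx 0.693$.

theta = np.zeros(3)
X = np.array([[1.0, 0.5, 1.5], [1.0, -1.0, 2.0], [1.0, 0.0, -0.5]])
y = np.array([[1.0], [0.0], [1.0]])
print(cost(theta, X, y))  # approximately 0.693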

Derivation of the Cross-Entropy Function

The result of negating the log-likelihood above is exactly the cross-entropy; related quantities include relative entropy (KL divergence) and information entropy. Cross-entropy measures the difference between two probability distributions: the larger it is, the more the two distributions differ and the more "surprising" the outcome; conversely, the smaller it is, the more similar the distributions and the better the result matches expectations.

So for a classification problem, the cross-entropy loss function can be written as:

$$
J(\theta)=-\frac{1}{m} \sum_{i=1}^{m}\left[y^{(i)} \log \left(h_{\theta}\left(x^{(i)}\right)\right)+\left(1-y^{(i)}\right) \log \left(1-h_{\theta}\left(x^{(i)}\right)\right)\right]
$$

where:

$$
\begin{aligned}
\log h_\theta(x^{(i)})&=\log\frac{1}{1+e^{-\theta^T x^{(i)}}}=-\log\left(1+e^{-\theta^T x^{(i)}}\right),\\
\log\left(1-h_\theta(x^{(i)})\right)&=\log\left(1-\frac{1}{1+e^{-\theta^T x^{(i)}}}\right)=\log\left(\frac{e^{-\theta^T x^{(i)}}}{1+e^{-\theta^T x^{(i)}}}\right)\\
&=\log\left(e^{-\theta^T x^{(i)}}\right)-\log\left(1+e^{-\theta^T x^{(i)}}\right)=-\theta^T x^{(i)}-\log\left(1+e^{-\theta^T x^{(i)}}\right).
\end{aligned}
$$

From this, we can use gradient descent to find the parameters that minimize the loss function. Substituting the two expressions above and simplifying gives:

$$
\begin{aligned}
J(\theta) &=-\frac{1}{m}\sum_{i=1}^m\left[-y^{(i)}\log\left(1+e^{-\theta^T x^{(i)}}\right)+\left(1-y^{(i)}\right)\left(-\theta^T x^{(i)}-\log\left(1+e^{-\theta^T x^{(i)}}\right)\right)\right]\\
&=-\frac{1}{m}\sum_{i=1}^m\left[y^{(i)}\theta^T x^{(i)}-\theta^T x^{(i)}-\log\left(1+e^{-\theta^T x^{(i)}}\right)\right]\\
&=-\frac{1}{m}\sum_{i=1}^m\left[y^{(i)}\theta^T x^{(i)}-\log e^{\theta^T x^{(i)}}-\log\left(1+e^{-\theta^T x^{(i)}}\right)\right]\\
&=-\frac{1}{m}\sum_{i=1}^m\left[y^{(i)}\theta^T x^{(i)}-\left(\log e^{\theta^T x^{(i)}}+\log\left(1+e^{-\theta^T x^{(i)}}\right)\right)\right]\\
&=-\frac{1}{m}\sum_{i=1}^m\left[y^{(i)}\theta^T x^{(i)}-\log\left(1+e^{\theta^T x^{(i)}}\right)\right]
\end{aligned}
$$
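As a sanity check on this simplification, a small sketch (with made-up values of $\theta^T x^{(i)}$ and labels) can confirm that the simplified expression equals the original cross-entropy:

import numpy as np

theta_x = np.array([2.0, -1.0, 0.3])   # made-up values of theta^T x^(i)
y = np.array([1.0, 0.0, 1.0])          # made-up 0/1 labels
h = 1.0 / (1.0 + np.exp(-theta_x))     # sigmoid

original = -np.mean(y * np.log(h) + (1 - y) * np.log(1 - h))
simplified = -np.mean(y * theta_x - np.log(1 + np.exp(theta_x)))
print(np.isclose(original, simplified))  # True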

Next, take the partial derivative of $J(\theta)$ with respect to the $j$-th parameter component $\theta_{j}$:

$$
\begin{aligned}
\frac{\partial}{\partial \theta_j}J\left(\theta\right) &=\frac{\partial}{\partial \theta_j}\left(\frac{1}{m}\sum_{i=1}^m\left[\log\left(1+e^{\theta^T x^{(i)}}\right)-y^{(i)}\theta^T x^{(i)}\right]\right)\\
&=\frac{1}{m}\sum_{i=1}^m\left[\frac{\partial}{\partial \theta_j}\log\left(1+e^{\theta^T x^{(i)}}\right)-\frac{\partial}{\partial \theta_j}\left(y^{(i)}\theta^T x^{(i)}\right)\right]\\
&=\frac{1}{m}\sum_{i=1}^m\left(\frac{x_{j}^{(i)} e^{\theta^T x^{(i)}}}{1+e^{\theta^T x^{(i)}}}-y^{(i)} x_{j}^{(i)}\right)\\
&=\frac{1}{m}\sum_{i=1}^m\left(h_{\theta}\left(x^{(i)}\right)-y^{(i)}\right)x_{j}^{(i)}
\end{aligned}
$$

Note: although the resulting gradient descent update looks identical on the surface to the one for linear regression, $h_{\theta}(x)$ here is the sigmoid rather than the linear function, so the two algorithms are actually different. Also, feature scaling before running gradient descent is still very much necessary.
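Putting the cost and this gradient together, here is a minimal batch gradient descent sketch (the learning rate, iteration count, and toy data are assumptions chosen for illustration):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gradient_descent(X, y, alpha=0.1, iters=2000):
    # X: (m, p+1) with a leading column of ones; y: (m,) array of 0/1 labels
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(iters):
        h = sigmoid(X @ theta)
        grad = X.T @ (h - y) / m   # (1/m) * sum_i (h_theta(x_i) - y_i) * x_ij
        theta -= alpha * grad
    return theta

# toy separable data: the label is 1 whenever the single feature is positive
X = np.column_stack([np.ones(6), np.array([-2.0, -1.5, -0.5, 0.5, 1.5, 2.0])])
y = np.array([0.0, 0.0, 0.0, 1.0, 1.0, 1.0])
print(gradient_descent(X, y))  # the slope component should come out positive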

Comparing Python (NumPy) and PyTorch Cross-Entropy Implementations

import torch
import numpy as np


class Entropy:
    # hand-written cross-entropy H(y, x) = -sum(y * log(x)) with a manual backward pass
    def __init__(self):
        self.nx = None
        self.ny = None
        self.dnx = None

    def loss(self, nx, ny):
        # forward pass: nx is the predicted distribution, ny the target
        self.nx = nx
        self.ny = ny
        loss = np.sum(- ny * np.log(nx))
        return loss

    def backward(self):
        # analytic gradient of the loss with respect to nx
        self.dnx = - self.ny / self.nx
        return self.dnx


np.random.seed(123)
np.set_printoptions(precision=3, suppress=True, linewidth=120)

entropy = Entropy()

x = np.random.random([5, 10])
y = np.random.random([5, 10])
x_tensor = torch.tensor(x, requires_grad=True)
y_tensor = torch.tensor(y, requires_grad=True)

# NumPy version: forward loss and manual gradient
loss_numpy = entropy.loss(x, y)
grad_numpy = entropy.backward()

# PyTorch version: same formula, gradient obtained via autograd
loss_tensor = (- y_tensor * torch.log(x_tensor)).sum()
loss_tensor.backward()
grad_tensor = x_tensor.grad

print("Python Loss :", loss_numpy)
print("PyTorch Loss :", loss_tensor.data.numpy())

print("\nPython dx :")
print(grad_numpy)
print("\nPyTorch dx :")
print(grad_tensor.data.numpy())
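Since both versions compute the same loss $-\sum y \log x$, and the manual backward pass returns the same analytic gradient $-y/x$ that autograd derives, the printed NumPy and PyTorch losses and gradients should agree up to floating-point precision.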


