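Preface

This post is the 9th in the Caffe source code analysis series and the 3rd on Caffe's gradient backpropagation code. In parts 7 and 8 of the series, I walked through how backpropagation is implemented in Caffe's ReLU activation layer and pooling layer. Neither layer has trainable parameters or a complicated derivative, so the code is relatively simple. Starting with this post, I will look at more complex backward implementations, such as layers with trainable parameters and layers with involved derivatives.

As the third installment on backpropagation in neural networks, this post looks at the layers in Caffe that compute the loss, which usually contain fairly involved derivative code. I originally planned to analyze softmax_loss_layer, which everyone knows and uses a lot, but found that the blogger 王里扬洛夫 has already covered it clearly and in detail in the original post CAFFE源码学习笔记之softmax_layer, so I will not repeat that analysis. After some thought, I chose the contrastive loss, contrastive_loss_layer, a layer used very widely in face verification and image retrieval. Enough preamble; on to the substance.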

Theory

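Contrastive loss was introduced in the paper Dimensionality Reduction by Learning an Invariant Mapping by Raia Hadsell, Sumit Chopra, and deep learning pioneer Yann LeCun, published at CVPR 2006. Its main purpose is to tell whether a pair of samples is similar. For two input vectors a and b, let d denote the distance between them (the Euclidean distance in most cases). The contrastive loss can be written as: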

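$Loss_{contrastive} = \frac{1}{2N}\sum_{i=1}^{N}(sim \times d^2 + (1-sim) \times \max(margin-d, 0)^2)$
In the formula above, N is the total number of sample pairs and $sim\in\{0,1\}$. If sim is 1, a and b are a same-class or similar pair, and the loss drives their distance down. If sim is 0, a and b are a different-class or dissimilar pair, and the loss pushes them apart until their distance exceeds the hyperparameter margin.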
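The formula can be sketched in plain C++ as follows. This is a minimal illustration that uses std::vector instead of Caffe Blobs; the function names euclidean_distance and contrastive_loss are hypothetical and do not come from Caffe.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Euclidean distance between two feature vectors of equal length.
double euclidean_distance(const std::vector<double>& a,
                          const std::vector<double>& b) {
  assert(a.size() == b.size());
  double sum_sq = 0.0;
  for (std::size_t k = 0; k < a.size(); ++k) {
    const double diff = a[k] - b[k];
    sum_sq += diff * diff;
  }
  return std::sqrt(sum_sq);
}

// Contrastive loss over N pairs (hypothetical helper, not Caffe code).
// sim[i] is 1 for a similar pair and 0 for a dissimilar pair.
double contrastive_loss(const std::vector<std::vector<double>>& a,
                        const std::vector<std::vector<double>>& b,
                        const std::vector<int>& sim, double margin) {
  assert(a.size() == b.size() && a.size() == sim.size() && !a.empty());
  double loss = 0.0;
  for (std::size_t i = 0; i < a.size(); ++i) {
    const double d = euclidean_distance(a[i], b[i]);
    if (sim[i]) {
      loss += d * d;  // similar pair: penalize any distance
    } else {
      const double gap = std::max(margin - d, 0.0);
      loss += gap * gap;  // dissimilar pair: penalize only inside the margin
    }
  }
  return loss / (2.0 * static_cast<double>(a.size()));
}
```

A dissimilar pair whose distance already exceeds margin contributes nothing to the sum, which is why only pairs inside the margin are pushed apart.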

To see how Caffe implements the contrastive loss, we can first look at how the layer parameters are defined in caffe.proto:

message ContrastiveLossParameter {
  // margin for dissimilar pair
  optional float margin = 1 [default = 1.0];
  // The first implementation of this cost did not exactly match the cost of
  // Hadsell et al 2006 -- using (margin - d^2) instead of (margin - d)^2.
  // legacy_version = false (the default) uses (margin - d)^2 as proposed in the
  // Hadsell paper. New models should probably use this version.
  // legacy_version = true uses (margin - d^2). This is kept to support /
  // reproduce existing models and results
  optional bool legacy_version = 2 [default = false];
}

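As you can see, besides the margin from the formula above, the layer parameters also include legacy_version. This parameter slightly changes how the loss is computed when a and b are a different-class or dissimilar pair, as the following formula shows: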

$Loss_{contrastive} = \frac{1}{2N}\sum_{i=1}^{N}\ell_i, \quad \ell_i = \begin{cases} d^2 & sim=1 \\ \max(margin-d, 0)^2 & sim=0 \quad \text{and} \quad legacy\_version=false \\ \max(margin-d^2, 0) & sim=0 \quad \text{and} \quad legacy\_version=true \end{cases}$
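To make the difference between the two variants concrete, here is a small sketch of the dissimilar-pair term alone. The function name dissimilar_term is hypothetical; only the two expressions inside it come from the formula above.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>

// Loss term of one dissimilar pair (sim = 0), before the 1/(2N) scaling.
// dist_sq is the squared Euclidean distance d^2 between a and b.
// Hypothetical helper for illustration, not part of Caffe.
double dissimilar_term(double dist_sq, double margin, bool legacy_version) {
  assert(dist_sq >= 0.0);
  if (legacy_version) {
    return std::max(margin - dist_sq, 0.0);  // max(margin - d^2, 0)
  }
  const double gap = std::max(margin - std::sqrt(dist_sq), 0.0);
  return gap * gap;  // max(margin - d, 0)^2
}
```

With margin = 1 and d = 0.5 (so d^2 = 0.25), the legacy variant yields 0.75 while the non-legacy variant yields 0.25. With margin = 1, both reach zero once d is at least 1.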
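Assume d is the Euclidean distance, $d=\sqrt{\sum_{i=1}^{n}(a_i-b_i)^2}$, where $i$ now indexes the $n$ components of the vectors rather than the pairs. During training, for a batch of N pairs and a gradient top_diff arriving from the top, the chain rule gives the gradient with respect to $a_i$: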

$\frac{\partial Loss}{\partial a_i} = \begin{cases} \frac{d}{N} \times \frac{\partial d}{\partial a_i} \times top\_diff & sim=1 \\ -\frac{margin-d}{N} \times \frac{\partial d}{\partial a_i} \times top\_diff & sim=0 \quad \text{and} \quad legacy\_version=false \quad \text{and} \quad margin-d>0\\ -\frac{d}{N} \times \frac{\partial d}{\partial a_i} \times top\_diff & sim=0 \quad \text{and} \quad legacy\_version=true \quad \text{and} \quad margin-d^2>0\\ 0 & sim=0 \quad \text{and} \quad legacy\_version=false \quad \text{and} \quad margin-d\leq0\\ 0 & sim=0 \quad \text{and} \quad legacy\_version=true \quad \text{and} \quad margin-d^2\leq0 \end{cases}$
The formula above contains the factor $\frac{\partial d}{\partial a_i}$. Since $d=\sqrt{\sum_{i=1}^{n}(a_i-b_i)^2}$, for any $i$ this factor is given by:
$\frac{\partial d}{\partial a_i}=\frac{a_i-b_i}{d}$
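For readers who want the intermediate step, the derivative of the square root can be written out as below. The summation index $k$ is introduced here only to keep it apart from the fixed component $i$, and the step assumes $d>0$.

```latex
\frac{\partial d}{\partial a_i}
  = \frac{\partial}{\partial a_i}\sqrt{\sum_{k=1}^{n}(a_k-b_k)^2}
  = \frac{1}{2\sqrt{\sum_{k=1}^{n}(a_k-b_k)^2}} \cdot 2\,(a_i-b_i)
  = \frac{a_i-b_i}{d}, \qquad d>0
```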
Substituting $\frac{\partial d}{\partial a_i}$, the gradient $\frac{\partial Loss}{\partial a_i}$ becomes:
$\frac{\partial Loss}{\partial a_i} = \begin{cases} \frac{a_i-b_i}{N} \times top\_diff & sim=1 \\ -\frac{margin-d}{N} \times \frac{a_i - b_i}{d} \times top\_diff & sim=0 \quad \text{and} \quad legacy\_version=false \quad \text{and} \quad margin-d>0\\ -\frac{a_i-b_i}{N} \times top\_diff & sim=0 \quad \text{and} \quad legacy\_version=true \quad \text{and} \quad margin-d^2>0\\ 0 & sim=0 \quad \text{and} \quad legacy\_version=false \quad \text{and} \quad margin-d\leq0\\ 0 & sim=0 \quad \text{and} \quad legacy\_version=true \quad \text{and} \quad margin-d^2\leq0 \end{cases}$
With this, the gradient can be propagated to every $a_i$ in bottom[0], that is, a. Likewise, for any $i$, the gradient of $b_i$ is just the negation of the gradient of $a_i$, because
$\frac{\partial d}{\partial b_i}=-\frac{a_i-b_i}{d}$
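One way to gain confidence in the derivation is to compare it with a finite-difference estimate. The sketch below does this for a single pair, assuming N = 1, top_diff = 1 and the non-legacy version. All function names are hypothetical, and treating the gradient as 0 when d = 0 is a choice made for this sketch only.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Loss of a single pair (N = 1): 0.5*d^2 if similar,
// otherwise 0.5*max(margin - d, 0)^2.
double pair_loss(const std::vector<double>& a, const std::vector<double>& b,
                 int sim, double margin) {
  assert(a.size() == b.size());
  double dist_sq = 0.0;
  for (std::size_t k = 0; k < a.size(); ++k) {
    dist_sq += (a[k] - b[k]) * (a[k] - b[k]);
  }
  if (sim) return 0.5 * dist_sq;
  const double gap = std::max(margin - std::sqrt(dist_sq), 0.0);
  return 0.5 * gap * gap;
}

// Analytic gradient w.r.t. a for N = 1 and top_diff = 1 (non-legacy).
// The gradient w.r.t. b is the negation of the returned vector.
std::vector<double> pair_grad_a(const std::vector<double>& a,
                                const std::vector<double>& b,
                                int sim, double margin) {
  assert(a.size() == b.size());
  double dist_sq = 0.0;
  for (std::size_t k = 0; k < a.size(); ++k) {
    dist_sq += (a[k] - b[k]) * (a[k] - b[k]);
  }
  const double d = std::sqrt(dist_sq);
  std::vector<double> grad(a.size(), 0.0);
  for (std::size_t k = 0; k < a.size(); ++k) {
    if (sim) {
      grad[k] = a[k] - b[k];
    } else if (margin - d > 0.0 && d > 0.0) {  // d > 0 guard: sketch-only choice
      grad[k] = -(margin - d) * (a[k] - b[k]) / d;
    }  // otherwise the gradient stays 0
  }
  return grad;
}

// Central finite difference of pair_loss with respect to a[k].
double numeric_grad_a(std::vector<double> a, const std::vector<double>& b,
                      int sim, double margin, std::size_t k) {
  const double h = 1e-6;
  a[k] += h;
  const double plus = pair_loss(a, b, sim, margin);
  a[k] -= 2.0 * h;
  const double minus = pair_loss(a, b, sim, margin);
  return (plus - minus) / (2.0 * h);
}
```

Calling pair_grad_a and numeric_grad_a on the same inputs should give values that agree to within the finite-difference error, for similar and dissimilar pairs alike.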
With the theory covered, here is the source code with comments.

Source code with comments

First, the source of contrastive_loss_layer.hpp:

#ifndef CAFFE_CONTRASTIVE_LOSS_LAYER_HPP_
#define CAFFE_CONTRASTIVE_LOSS_LAYER_HPP_

#include <vector>

#include "caffe/blob.hpp"
#include "caffe/layer.hpp"
#include "caffe/proto/caffe.pb.h"

#include "caffe/layers/loss_layer.hpp"

namespace caffe {

/**
 * @brief Computes the contrastive loss @f$
 *          E = \frac{1}{2N} \sum\limits_{n=1}^N \left(y\right) d^2 +
 *              \left(1-y\right) \max \left(margin-d, 0\right)^2
 *          @f$ where @f$
 *          d = \left| \left| a_n - b_n \right| \right|_2 @f$. This can be
 *          used to train siamese networks.
 *
 * @param bottom input Blob vector (length 3)
 *   -# @f$ (N \times C \times 1 \times 1) @f$
 *      the features @f$ a \in [-\infty, +\infty]@f$
 *   -# @f$ (N \times C \times 1 \times 1) @f$
 *      the features @f$ b \in [-\infty, +\infty]@f$
 *   -# @f$ (N \times 1 \times 1 \times 1) @f$
 *      the binary similarity @f$ s \in [0, 1]@f$
 * @param top output Blob vector (length 1)
 *   -# @f$ (1 \times 1 \times 1 \times 1) @f$
 *      the computed contrastive loss: @f$ E =
 *          \frac{1}{2N} \sum\limits_{n=1}^N \left(y\right) d^2 +
 *          \left(1-y\right) \max \left(margin-d, 0\right)^2
 *          @f$ where @f$
 *          d = \left| \left| a_n - b_n \right| \right|_2 @f$.
 * This can be used to train siamese networks.
 */
template <typename Dtype>
class ContrastiveLossLayer : public LossLayer<Dtype> {
 public:
  explicit ContrastiveLossLayer(const LayerParameter& param)
      : LossLayer<Dtype>(param), diff_() {}  // empty constructor body
  virtual void LayerSetUp(const vector<Blob<Dtype>*>& bottom,
      const vector<Blob<Dtype>*>& top);

  // exactly 3 input Blobs are required: a, b and sim
  virtual inline int ExactNumBottomBlobs() const { return 3; }
  virtual inline const char* type() const { return "ContrastiveLoss"; }
  /**
   * Unlike most loss layers, in the ContrastiveLossLayer we can backpropagate
   * to the first two inputs.
   */
  virtual inline bool AllowForceBackward(const int bottom_index) const {
    return bottom_index != 2;
  }  // forced backpropagation is allowed for input Blobs 0 and 1

 protected:
  /// @copydoc ContrastiveLossLayer
  virtual void Forward_cpu(const vector<Blob<Dtype>*>& bottom,
      const vector<Blob<Dtype>*>& top);  // forward pass on the CPU
  virtual void Forward_gpu(const vector<Blob<Dtype>*>& bottom,
      const vector<Blob<Dtype>*>& top);  // forward pass on the GPU

  /**
   * @brief Computes the Contrastive error gradient w.r.t. the inputs.
   *
   * Computes the gradients with respect to the two input vectors (bottom[0] and
   * bottom[1]), but not the similarity label (bottom[2]).
   *
   * @param top output Blob vector (length 1), providing the error gradient with
   *      respect to the outputs
   *   -# @f$ (1 \times 1 \times 1 \times 1) @f$
   *      This Blob's diff will simply contain the loss_weight* @f$ \lambda @f$,
   *      as @f$ \lambda @f$ is the coefficient of this layer's output
   *      @f$\ell_i@f$ in the overall Net loss
   *      @f$ E = \lambda_i \ell_i + \mbox{other loss terms}@f$; hence
   *      @f$ \frac{\partial E}{\partial \ell_i} = \lambda_i @f$.
   *      (*Assuming that this top Blob is not used as a bottom (input) by any
   *      other layer of the Net.)
   * @param propagate_down see Layer::Backward.
   * @param bottom input Blob vector (length 2)
   *   -# @f$ (N \times C \times 1 \times 1) @f$
   *      the features @f$a@f$; Backward fills their diff with
   *      gradients if propagate_down[0]
   *   -# @f$ (N \times C \times 1 \times 1) @f$
   *      the features @f$b@f$; Backward fills their diff with gradients if
   *      propagate_down[1]
   */
  virtual void Backward_cpu(const vector<Blob<Dtype>*>& top,
      const vector<bool>& propagate_down, const vector<Blob<Dtype>*>& bottom);  // backward pass on the CPU
  virtual void Backward_gpu(const vector<Blob<Dtype>*>& top,
      const vector<bool>& propagate_down, const vector<Blob<Dtype>*>& bottom);  // backward pass on the GPU

  Blob<Dtype> diff_;  // cached for backward pass: holds a - b
  Blob<Dtype> dist_sq_;  // cached for backward pass: holds the squared Euclidean distance of a and b
  Blob<Dtype> diff_sq_;  // tmp storage for gpu forward pass
  Blob<Dtype> summer_vec_;  // tmp storage for gpu forward pass
};

}  // namespace caffe

#endif  // CAFFE_CONTRASTIVE_LOSS_LAYER_HPP_


Next, the source of contrastive_loss_layer.cpp:

#include <algorithm>
#include <vector>

#include "caffe/layers/contrastive_loss_layer.hpp"
#include "caffe/util/math_functions.hpp"

namespace caffe {

template <typename Dtype>
void ContrastiveLossLayer<Dtype>::LayerSetUp(  // performs part of the initialization
  const vector<Blob<Dtype>*>& bottom, const vector<Blob<Dtype>*>& top) {
  LossLayer<Dtype>::LayerSetUp(bottom, top);
  // a and b must have the same number of channels; note that this check does
  // not compare the number of samples (num) of a and b
  CHECK_EQ(bottom[0]->channels(), bottom[1]->channels());
  CHECK_EQ(bottom[0]->height(), 1);  // height == width == 1: each a is a vector
  CHECK_EQ(bottom[0]->width(), 1);
  CHECK_EQ(bottom[1]->height(), 1);  // height == width == 1: each b is a vector
  CHECK_EQ(bottom[1]->width(), 1);
  // channels == height == width == 1: sim is a single value per pair,
  // 0 for a different-class pair and 1 for a same-class pair
  CHECK_EQ(bottom[2]->channels(), 1);
  CHECK_EQ(bottom[2]->height(), 1);
  CHECK_EQ(bottom[2]->width(), 1);
  diff_.Reshape(bottom[0]->num(), bottom[0]->channels(), 1, 1);  // diff_ has shape (n, c, 1, 1)
  diff_sq_.Reshape(bottom[0]->num(), bottom[0]->channels(), 1, 1);  // diff_sq_ also has shape (n, c, 1, 1)
  dist_sq_.Reshape(bottom[0]->num(), 1, 1, 1);  // dist_sq_ has shape (n, 1, 1, 1) and stores the distances
  // vector of ones used to sum along channels
  summer_vec_.Reshape(bottom[0]->channels(), 1, 1, 1);  // summer_vec_ has shape (c, 1, 1, 1)
  for (int i = 0; i < bottom[0]->channels(); ++i)
    summer_vec_.mutable_cpu_data()[i] = Dtype(1);  // fill summer_vec_ with ones
}

template <typename Dtype>
void ContrastiveLossLayer<Dtype>::Forward_cpu(
    const vector<Blob<Dtype>*>& bottom,
    const vector<Blob<Dtype>*>& top) {
  // forward pass of the contrastive loss
  int count = bottom[0]->count();  // number of elements in a (and in b)
  caffe_sub(
      count,
      bottom[0]->cpu_data(),  // a
      bottom[1]->cpu_data(),  // b
      diff_.mutable_cpu_data());  // a_i-b_i: element-wise a - b, stored in diff_
  const int channels = bottom[0]->channels();  // number of channels of a
  Dtype margin = this->layer_param_.contrastive_loss_param().margin();  // margin from the layer definition
  bool legacy_version =
      this->layer_param_.contrastive_loss_param().legacy_version();  // legacy_version from the layer definition
  Dtype loss(0.0);
  for (int i = 0; i < bottom[0]->num(); ++i) {  // one pair of the batch at a time
    // squared L2 norm of a - b, i.e. the squared Euclidean distance d^2
    dist_sq_.mutable_cpu_data()[i] = caffe_cpu_dot(channels,
        diff_.cpu_data() + (i*channels), diff_.cpu_data() + (i*channels));
    if (static_cast<int>(bottom[2]->cpu_data()[i])) {  // similar pairs (sim = 1)
      loss += dist_sq_.cpu_data()[i];  // add d^2
    } else {  // dissimilar pairs (sim = 0)
      if (legacy_version) {
        loss += std::max(margin - dist_sq_.cpu_data()[i], Dtype(0.0));  // add max(margin - d^2, 0)
      } else {
        Dtype dist = std::max<Dtype>(margin - sqrt(dist_sq_.cpu_data()[i]),
          Dtype(0.0));  // max(margin - d, 0)
        loss += dist*dist;  // add its square
      }
    }
  }
  loss = loss / static_cast<Dtype>(bottom[0]->num()) / Dtype(2);  // divide by 2n
  top[0]->mutable_cpu_data()[0] = loss;  // write the loss to top[0]
}

template <typename Dtype>
void ContrastiveLossLayer<Dtype>::Backward_cpu(const vector<Blob<Dtype>*>& top,
    const vector<bool>& propagate_down, const vector<Blob<Dtype>*>& bottom) {
  // backward pass of the contrastive loss
  Dtype margin = this->layer_param_.contrastive_loss_param().margin();  // margin from the layer definition
  bool legacy_version =
      this->layer_param_.contrastive_loss_param().legacy_version();  // legacy_version from the layer definition
  for (int i = 0; i < 2; ++i) {  // backpropagate to a (i = 0) and b (i = 1) in turn
    if (propagate_down[i]) {  // only if this Blob needs a gradient
      const Dtype sign = (i == 0) ? 1 : -1;  // +1 for the gradient of a, -1 for the gradient of b
      const Dtype alpha = sign * top[0]->cpu_diff()[0] /
          static_cast<Dtype>(bottom[i]->num());  // alpha = sign * top_diff / n
      int num = bottom[i]->num();  // n: number of vectors in the batch
      int channels = bottom[i]->channels();  // c: number of channels
      for (int j = 0; j < num; ++j) {  // one pair of the batch at a time
        Dtype* bout = bottom[i]->mutable_cpu_diff();  // bout receives the gradient
        if (static_cast<int>(bottom[2]->cpu_data()[j])) {  // similar pairs
          caffe_cpu_axpby(
              channels,
              alpha,
              diff_.cpu_data() + (j*channels),
              Dtype(0.0),
              bout + (j*channels));  // bout = sign * diff_ * top_diff / n
        } else {  // dissimilar pairs
          Dtype mdist(0.0);
          Dtype beta(0.0);
          if (legacy_version) {
            mdist = margin - dist_sq_.cpu_data()[j];  // mdist = margin - d^2
            beta = -alpha;  // beta = -sign * top_diff / n
          } else {
            Dtype dist = sqrt(dist_sq_.cpu_data()[j]);  // dist = d
            mdist = margin - dist;  // mdist = margin - d
            // beta = -sign * top_diff * (margin - d) / (n * (d + 1e-4));
            // the 1e-4 keeps the denominator nonzero when d is 0
            beta = -alpha * mdist / (dist + Dtype(1e-4));
          }
          if (mdist > Dtype(0.0)) {
            caffe_cpu_axpby(
                channels,
                beta,
                diff_.cpu_data() + (j*channels),
                Dtype(0.0),
                bout + (j*channels));  // bout = beta * diff_
          } else {
            // the pair contributed 0 to the loss in the forward pass, so its gradient is 0
            caffe_set(channels, Dtype(0), bout + (j*channels));
          }
        }
      }
    }
  }
}

#ifdef CPU_ONLY
STUB_GPU(ContrastiveLossLayer);
#endif

INSTANTIATE_CLASS(ContrastiveLossLayer);
REGISTER_LAYER_CLASS(ContrastiveLoss);

}  // namespace caffe
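The header describes diff_sq_ and summer_vec_ as temporary storage for the GPU forward pass, and LayerSetUp fills summer_vec_ with ones "to sum along channels". The GPU source is not shown in this post, so the following is only an assumption about the idea behind that vector: multiplying an (n x c) matrix of squared differences by a length-c vector of ones produces the n row sums, that is, the squared distances. The function name sum_along_channels is hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Row-wise sum of a (num x channels) row-major matrix, written as the
// matrix-vector product M * ones. Illustrative sketch, not Caffe code.
std::vector<double> sum_along_channels(const std::vector<double>& diff_sq,
                                       std::size_t num, std::size_t channels) {
  assert(diff_sq.size() == num * channels);
  const std::vector<double> ones(channels, 1.0);  // plays the role of summer_vec_
  std::vector<double> dist_sq(num, 0.0);
  for (std::size_t i = 0; i < num; ++i) {
    for (std::size_t c = 0; c < channels; ++c) {
      dist_sq[i] += diff_sq[i * channels + c] * ones[c];
    }
  }
  return dist_sq;
}
```

Expressing the reduction as a matrix-vector product lets a BLAS-style routine do the summation for the whole batch in a single call instead of one dot product per pair.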


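Source code analysis

Since the theory behind the forward and backward passes was given before the source code, the analysis here is fairly brief.

First, the forward pass. In the Forward_cpu function of contrastive_loss_layer.cpp, the squared Euclidean distance between a and b is stored in dist_sq_, and sim is read as static_cast<int>(bottom[2]->cpu_data()[i]). The loss is computed according to sim: when sim is nonzero, dist_sq_ is added directly to loss; otherwise the loss term is computed according to legacy_version.

Next, the backward pass. In the Backward_cpu function of contrastive_loss_layer.cpp, sign distinguishes the gradient for a from the gradient for b. alpha = sign * top_diff / N is computed first, because every case of the gradient uses it. The gradient is then computed according to sim. If sim is nonzero, the gradient is obtained directly with caffe_cpu_axpby; note that diff_ is the $a_i-b_i$ of the formulas: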

if (static_cast<int>(bottom[2]->cpu_data()[j])) {  // similar pairs
  caffe_cpu_axpby(
      channels,
      alpha,
      diff_.cpu_data() + (j*channels),
      Dtype(0.0),
      bout + (j*channels));  // bout = sign * diff_ * top_diff / n
}
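The call above relies on the axpby operation, which computes y = alpha * x + beta * y. A plain-loop stand-in makes it clear why passing Dtype(0.0) as the second scalar overwrites bout instead of accumulating into it. The name axpby_sketch is hypothetical and the loop only mimics the arithmetic, not Caffe's BLAS-backed implementation.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Plain-loop stand-in for the axpby operation: y = alpha * x + beta * y.
// Illustrative only; Caffe delegates this to a BLAS routine.
void axpby_sketch(double alpha, const std::vector<double>& x,
                  double beta, std::vector<double>& y) {
  assert(x.size() == y.size());
  for (std::size_t k = 0; k < x.size(); ++k) {
    y[k] = alpha * x[k] + beta * y[k];
  }
}
```

With beta equal to 0, whatever was previously stored in y is discarded, so each pair's slice of the gradient buffer holds exactly alpha times its slice of diff_.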

If sim is 0, the code branches on legacy_version and produces the gradient given by the corresponding formula:

if (legacy_version) {
  mdist = margin - dist_sq_.cpu_data()[j];  // mdist = margin - d^2
  beta = -alpha;  // beta = -sign * top_diff / n
} else {
  Dtype dist = sqrt(dist_sq_.cpu_data()[j]);  // dist = d
  mdist = margin - dist;  // mdist = margin - d
  // beta = -sign * top_diff * (margin - d) / (n * (d + 1e-4))
  beta = -alpha * mdist / (dist + Dtype(1e-4));
}
if (mdist > Dtype(0.0)) {
  caffe_cpu_axpby(
      channels,
      beta,
      diff_.cpu_data() + (j*channels),
      Dtype(0.0),
      bout + (j*channels));  // bout = beta * diff_
}

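Finally, when the pair contributed a loss of 0 in the forward pass (mdist is not greater than 0), the backward gradient is set to 0 as well.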

else {
  // the pair contributed 0 to the loss in the forward pass, so its gradient is 0
  caffe_set(channels, Dtype(0), bout + (j*channels));
}

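Readers can check the code against the formulas: the implementation matches the derivation, except that the non-legacy branch adds a small constant 1e-4 to d in the denominator.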
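The effect of that small constant can be examined with a short sketch. Both function names below are hypothetical. The first expression follows the beta computed in Backward_cpu, and the second is the factor from the derivation, which is undefined at d = 0.

```cpp
#include <cassert>
#include <cmath>

// Scaling factor applied to (a - b) for a dissimilar pair in the non-legacy
// branch, following the expression in Backward_cpu:
//   beta = -alpha * (margin - d) / (d + 1e-4)
double beta_with_epsilon(double alpha, double margin, double dist) {
  assert(dist >= 0.0);
  return -alpha * (margin - dist) / (dist + 1e-4);
}

// The same factor taken directly from the derivation, without the constant.
// Only defined for dist > 0.
double beta_exact(double alpha, double margin, double dist) {
  assert(dist > 0.0);
  return -alpha * (margin - dist) / dist;
}
```

For distances well above 1e-4 the two factors are close, and at d = 0 the version with the constant still returns a finite value where the exact expression would divide by zero.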

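Closing remarks

That brings this post nearly to a close.

This is the first time I have analyzed gradient backpropagation code in deep learning that contains relatively complex derivative logic, and I feel I learned a great deal from working through it. With the rapid development of deep learning, high-level frameworks such as TensorFlow and PyTorch, which are quick to pick up and do not require users to implement backpropagation themselves, can solve many of our research and project problems. As deep learning researchers and engineers, however, we still need to understand the logic and implementation details of backpropagation code, so that we are not confined to what a framework provides. Especially in scientific research and algorithm exploration, a deeper understanding of the low-level implementation and the underlying mathematics is essential.

You are welcome to read my upcoming posts. The support and encouragement of my readers is my greatest motivation!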

written by jiong
Bitter sacrifice strengthens bold resolve, which dares to make sun and moon shine in new skies
