CTPN(3)model

Preface

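Today we look at the main framework of the model, the model part. First, the architecture diagram:

[Figure: CTPN model architecture]

(1) First, VGG16 is used as the base net to extract features. The conv5_3 features are taken as the feature map, of size W×H×C.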

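(2) Next, a 3×3 window slides over this feature map, so each window yields a feature vector of length 3×3×C. This vector is used to predict offsets relative to 10 anchors, which means every window center predicts 10 text proposals.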

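(3) The features from the previous step are fed into a bidirectional LSTM, which produces an output of size W×256. A 512-unit fully connected layer follows and prepares the output.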

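(4) The output layer has three outputs. 2k vertical coordinates: each anchor is described by two values, the y coordinate of its center and the height of its box, so there are 2k outputs in total (note that these are offsets relative to the anchor). 2k scores: k text proposals are predicted, and each has one text score and one non-text score. k side-refinement values: these refine the two endpoints of a text line and give the horizontal offset of each proposal.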

(5) The method produces dense text proposals, so a standard non-maximum suppression (NMS) algorithm is used to filter out redundant boxes.

(6) Finally, a graph-based text-line construction algorithm merges the individual text segments into text lines.
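
To make the relative vertical coordinates of step (4) concrete, here is a minimal sketch of decoding one prediction back to pixels. It assumes the parameterization from the CTPN paper, v_c = (c_y - c_y^a) / h^a and v_h = log(h / h^a), where c_y^a and h^a are the anchor's center y and height. The function name decode_vertical is hypothetical and is not part of the code discussed below.

```python
import math

def decode_vertical(v_c, v_h, anchor_cy, anchor_h):
    """Turn relative outputs (v_c, v_h) into an absolute (center_y, height).

    Assumes the CTPN-paper parameterization:
        v_c = (c_y - anchor_cy) / anchor_h
        v_h = log(h / anchor_h)
    """
    center_y = v_c * anchor_h + anchor_cy
    height = math.exp(v_h) * anchor_h
    return center_y, height

# A zero offset reproduces the anchor itself.
print(decode_vertical(0.0, 0.0, anchor_cy=100.0, anchor_h=16.0))  # (100.0, 16.0)

# v_c = 0.5 moves the center down by half an anchor height;
# v_h = log(2) doubles the height.
print(decode_vertical(0.5, math.log(2.0), anchor_cy=100.0, anchor_h=16.0))
```

Predicting offsets rather than absolute values keeps the regression targets small and roughly independent of where the anchor sits in the image.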

model()

def model(image):
    image = mean_image_subtraction(image)
    with slim.arg_scope(vgg.vgg_arg_scope()):
        conv5_3 = vgg.vgg_16(image)

    rpn_conv = slim.conv2d(conv5_3, 512, 3)

    lstm_output = Bilstm(rpn_conv, 512, 128, 512, scope_name='BiLSTM')

    bbox_pred = lstm_fc(lstm_output, 512, 10 * 4, scope_name="bbox_pred")
    cls_pred = lstm_fc(lstm_output, 512, 10 * 2, scope_name="cls_pred")

    # transpose: (1, H, W, A x d) -> (1, H, WxA, d)
    cls_pred_shape = tf.shape(cls_pred)
    cls_pred_reshape = tf.reshape(cls_pred, [cls_pred_shape[0], cls_pred_shape[1], -1, 2])

    cls_pred_reshape_shape = tf.shape(cls_pred_reshape)
    cls_prob = tf.reshape(tf.nn.softmax(tf.reshape(cls_pred_reshape, [-1, cls_pred_reshape_shape[3]])),
                          [-1, cls_pred_reshape_shape[1], cls_pred_reshape_shape[2], cls_pred_reshape_shape[3]],
                          name="cls_prob")

    return bbox_pred, cls_pred, cls_prob

First line: image = mean_image_subtraction(image)

def mean_image_subtraction(images, means=[123.68, 116.78, 103.94]):
    num_channels = images.get_shape().as_list()[-1]
    if len(means) != num_channels:
        raise ValueError('len(means) must match the number of channels')
    channels = tf.split(axis=3, num_or_size_splits=num_channels, value=images)
    for i in range(num_channels):
        channels[i] -= means[i]
    return tf.concat(axis=3, values=channels)

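This splits the image into its channels, subtracts the corresponding mean from each channel, and concatenates the channels back together. It is the usual preprocessing for a color image; the three default means correspond to its three color channels.

conv5_3 = vgg.vgg_16(image)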

def vgg_16(inputs, scope='vgg_16'):
    with tf.variable_scope(scope, 'vgg_16', [inputs]) as sc:
        with slim.arg_scope([slim.conv2d, slim.fully_connected, slim.max_pool2d]):
            net = slim.repeat(inputs, 2, slim.conv2d, 64, [3, 3], scope='conv1')
            net = slim.max_pool2d(net, [2, 2], scope='pool1')
            net = slim.repeat(net, 2, slim.conv2d, 128, [3, 3], scope='conv2')
            net = slim.max_pool2d(net, [2, 2], scope='pool2')
            net = slim.repeat(net, 3, slim.conv2d, 256, [3, 3], scope='conv3')
            net = slim.max_pool2d(net, [2, 2], scope='pool3')
            net = slim.repeat(net, 3, slim.conv2d, 512, [3, 3], scope='conv4')
            net = slim.max_pool2d(net, [2, 2], scope='pool4')
            net = slim.repeat(net, 3, slim.conv2d, 512, [3, 3], scope='conv5')

    return net

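This is the first part of the model, the VGG16 network. It is defined in vgg.py and called here, and corresponds to step (1). Note that there are four 2×2 max-pooling layers before conv5 and none after it.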
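
As a rough illustration of what the four pooling layers do to the spatial size, the sketch below computes the size of conv5_3 for a given input. It assumes every 3×3 convolution keeps the spatial size and every 2×2 max-pool halves it, rounding down; the actual padding settings come from vgg_arg_scope, which is not shown here. The function name conv5_3_size is made up for this example.

```python
def conv5_3_size(height, width, num_pools=4):
    """Spatial size of conv5_3 for a given input size.

    Assumes each 3x3 convolution preserves the spatial size and each
    2x2 max-pool with stride 2 halves it, rounding down.
    """
    for _ in range(num_pools):
        height //= 2
        width //= 2
    return height, width

print(conv5_3_size(600, 900))  # (37, 56)
```

Under these assumptions one feature-map position corresponds to a stride of 16 pixels in the input image.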

rpn_conv = slim.conv2d(conv5_3, 512, 3)

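This slides a 3×3 window over the feature map. Each window covers 3×3×C input values, which the convolution maps to a 512-dimensional feature vector. That vector is used to predict offsets relative to 10 anchors, so every window center predicts 10 text proposals. This corresponds to step (2).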
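
The 10 anchors themselves are not built in model(). As a hedged sketch of what they might look like, the code below follows the description in the CTPN paper: a fixed width of 16 pixels and 10 heights starting at 11 pixels, each one the previous height divided by 0.7. These numbers and the helper names anchor_heights and anchors_at are assumptions for illustration; the implementation this post walks through may use a different, hard-coded list.

```python
def anchor_heights(k=10, min_height=11.0, ratio=0.7):
    """k anchor heights, each 1/ratio times the previous one (assumed from the paper)."""
    return [min_height / ratio ** i for i in range(k)]

def anchors_at(center_x, center_y, width=16.0):
    """Return the anchors (x1, y1, x2, y2) that share one center and one fixed width."""
    boxes = []
    for h in anchor_heights():
        boxes.append((center_x - width / 2, center_y - h / 2,
                      center_x + width / 2, center_y + h / 2))
    return boxes

heights = anchor_heights()
print([round(h) for h in heights])   # heights grow from about 11 to about 273 pixels
print(len(anchors_at(8.0, 8.0)))     # 10
```

Because the width is fixed, only the vertical position and height of each proposal need to be regressed.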

lstm_output = Bilstm(rpn_conv, 512, 128, 512, scope_name='BiLSTM')

def Bilstm(net, input_channel, hidden_unit_num, output_channel, scope_name):
    # width--->time step
    with tf.variable_scope(scope_name) as scope:
        shape = tf.shape(net)
        N, H, W, C = shape[0], shape[1], shape[2], shape[3]
        net = tf.reshape(net, [N * H, W, C])
        net.set_shape([None, None, input_channel])
        lstm_fw_cell = tf.contrib.rnn.LSTMCell(hidden_unit_num, state_is_tuple=True)
        lstm_bw_cell = tf.contrib.rnn.LSTMCell(hidden_unit_num, state_is_tuple=True)

        lstm_out, last_state = tf.nn.bidirectional_dynamic_rnn(lstm_fw_cell, lstm_bw_cell, net, dtype=tf.float32)
        lstm_out = tf.concat(lstm_out, axis=-1)

        lstm_out = tf.reshape(lstm_out, [N * H * W, 2 * hidden_unit_num])

        init_weights = tf.contrib.layers.variance_scaling_initializer(factor=0.01, mode='FAN_AVG', uniform=False)
        init_biases = tf.constant_initializer(0.0)
        weights = make_var('weights', [2 * hidden_unit_num, output_channel], init_weights)
        biases = make_var('biases', [output_channel], init_biases)

        outputs = tf.matmul(lstm_out, weights) + biases

        outputs = tf.reshape(outputs, [N, H, W, output_channel])
        return outputs

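This defines the bidirectional LSTM. The features from the previous step are reshaped so that each row of the feature map becomes a sequence of W time steps, then fed into a bidirectional LSTM with 128 hidden units per direction, which gives an output of size W×256 per row. A fully connected layer with 512 outputs follows and prepares the output. This is step (3).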
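
Since Bilstm is mostly reshapes, it can help to trace the shapes on paper. The sketch below does that in plain Python for the default arguments used in model(); bilstm_shapes is a hypothetical helper and the stage names are only labels.

```python
def bilstm_shapes(n, h, w, c=512, hidden=128, out=512):
    """Trace the tensor shapes through Bilstm for an (n, h, w, c) input."""
    return [
        # each feature-map row becomes one sequence of w time steps
        ("reshape to sequences", (n * h, w, c)),
        # forward and backward outputs concatenated on the last axis
        ("forward + backward concat", (n * h, w, 2 * hidden)),
        # one row per position, ready for tf.matmul
        ("flatten for matmul", (n * h * w, 2 * hidden)),
        ("fully connected", (n * h * w, out)),
        ("reshape back", (n, h, w, out)),
    ]

for name, shape in bilstm_shapes(1, 37, 56):
    print(name, shape)
```

The batch and height dimensions are merged, so the LSTM only ever sees context along the width of a single row.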

bbox_pred = lstm_fc(lstm_output, 512, 10 * 4, scope_name="bbox_pred")
cls_pred = lstm_fc(lstm_output, 512, 10 * 2, scope_name="cls_pred")

def lstm_fc(net, input_channel, output_channel, scope_name):
    with tf.variable_scope(scope_name) as scope:
        shape = tf.shape(net)
        N, H, W, C = shape[0], shape[1], shape[2], shape[3]
        net = tf.reshape(net, [N * H * W, C])

        init_weights = tf.contrib.layers.variance_scaling_initializer(factor=0.01, mode='FAN_AVG', uniform=False)
        init_biases = tf.constant_initializer(0.0)
        weights = make_var('weights', [input_channel, output_channel], init_weights)
        biases = make_var('biases', [output_channel], init_biases)

        output = tf.matmul(net, weights) + biases
        output = tf.reshape(output, [N, H, W, output_channel])
    return output

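Finally, two fully connected layers output the position predictions (bbox_pred, 10 × 4 values per position) and the score predictions (cls_pred, 10 × 2 values per position). The scores are reshaped into pairs and passed through tf.nn.softmax to give cls_prob, and the function returns the three results bbox_pred, cls_pred and cls_prob.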
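
The reshape-then-softmax at the end of model() amounts to grouping the 20 scores of each position into 10 pairs and normalizing each pair. A minimal pure-Python sketch for a single position follows; softmax and cls_prob_for_position are helper names invented for this example, and which element of a pair means "text" is not specified here.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a short list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cls_prob_for_position(cls_pred_vector, num_classes=2):
    """Split one position's A*2 logits into A pairs and softmax each pair."""
    pairs = [cls_pred_vector[i:i + num_classes]
             for i in range(0, len(cls_pred_vector), num_classes)]
    return [softmax(pair) for pair in pairs]

# 10 anchors x 2 classes = 20 logits for one feature-map position.
logits = [0.0, 0.0] + [2.0, -1.0] * 9
probs = cls_prob_for_position(logits)
print(len(probs))  # 10
print(probs[0])    # [0.5, 0.5]
```

Each pair sums to 1, so the two values can be read as the probabilities of the two classes for that anchor.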
Here are a few notes on the functions used in the network:

net = slim.repeat(inputs, 2, slim.conv2d, 64, [3, 3], scope='conv1')

slim.repeat applies the same layer several times in a row; the second argument, 2, means the layer is applied twice.
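
To show the idea without TensorFlow, here is a toy version of the same pattern. It is only a sketch: repeat and fake_conv are stand-ins written for this example, and the real slim.repeat also gives each repetition its own variable scope, which is ignored here.

```python
def repeat(inputs, repetitions, layer, *args, **kwargs):
    """Apply `layer` to `inputs` `repetitions` times, feeding each output forward."""
    net = inputs
    for _ in range(repetitions):
        net = layer(net, *args, **kwargs)
    return net

def fake_conv(net, channels, kernel):
    """Toy stand-in for slim.conv2d: it just records which layer was applied."""
    return net + ["conv%dx%d-%d" % (kernel[0], kernel[1], channels)]

print(repeat([], 2, fake_conv, 64, [3, 3]))
# ['conv3x3-64', 'conv3x3-64']
```

So the conv1 line in vgg_16 is shorthand for two stacked 3×3 convolutions with 64 channels each.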

lstm_fw_cell = tf.contrib.rnn.LSTMCell(hidden_unit_num, state_is_tuple=True)

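This creates an LSTM cell with hidden_unit_num units. One cell is created for the forward direction and one for the backward direction.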

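bidirectional_dynamic_rnn(
    cell_fw, # forward RNN cell
    cell_bw, # backward RNN cell
    inputs, # input tensor
    sequence_length=None, # actual length of each input sequence (optional; defaults to the maximum sequence length)
    initial_state_fw=None,  # initial state of the forward RNN (optional)
    initial_state_bw=None,  # initial state of the backward RNN (optional)
    dtype=None, # data type of the initial state and the outputs (optional)
    parallel_iterations=None,
    swap_memory=False,
    time_major=False,
    scope=None)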
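
The function returns the forward and backward outputs as a pair, which is why Bilstm joins them with tf.concat on the last axis. The toy sketch below mimics that behavior with a trivial running-sum "cell" instead of an LSTM; run_rnn, bidirectional and add_cell are made-up names, not TensorFlow APIs.

```python
def run_rnn(cell, inputs, initial_state=0):
    """Run `cell` over `inputs` left to right, returning the output at each step."""
    state, outputs = initial_state, []
    for x in inputs:
        state = cell(state, x)
        outputs.append(state)
    return outputs

def bidirectional(cell_fw, cell_bw, inputs):
    """Return (forward outputs, backward outputs), both aligned to input order."""
    out_fw = run_rnn(cell_fw, inputs)
    # The backward pass reads the sequence reversed; its outputs are reversed
    # again so that index t refers to the same time step in both lists.
    out_bw = run_rnn(cell_bw, inputs[::-1])[::-1]
    return out_fw, out_bw

def add_cell(state, x):
    """Toy cell: the new state is a running sum."""
    return state + x

out_fw, out_bw = bidirectional(add_cell, add_cell, [1, 2, 3])
merged = list(zip(out_fw, out_bw))  # plays the role of tf.concat(..., axis=-1)
print(merged)  # [(1, 6), (3, 5), (6, 3)]
```

At every time step the merged output carries information from everything to its left and everything to its right, which is the point of using a bidirectional network along the text row.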

tf.contrib.layers.variance_scaling_initializer creates the initializer used for the weights.
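
For a feel of the scale this produces, the sketch below computes the standard deviation for the arguments used above (factor=0.01, mode='FAN_AVG', uniform=False). It assumes, from my reading of the TF 1.x implementation, that the non-uniform case draws from a truncated normal with stddev = sqrt(1.3 * factor / n), where n is the average of fan-in and fan-out; treat the formula as an assumption and check it against your TensorFlow version. variance_scaling_stddev is a hypothetical helper.

```python
import math

def variance_scaling_stddev(fan_in, fan_out, factor=0.01, mode="FAN_AVG"):
    """Stddev given to the truncated normal when uniform=False (assumed formula)."""
    if mode == "FAN_IN":
        n = fan_in
    elif mode == "FAN_OUT":
        n = fan_out
    else:  # "FAN_AVG"
        n = (fan_in + fan_out) / 2.0
    return math.sqrt(1.3 * factor / n)

# The bbox_pred weights have shape [512, 40].
print(variance_scaling_stddev(512, 40))
```

With factor=0.01 the initial weights are small, so the initial outputs of the prediction layers start close to zero.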

def make_var(name, shape, initializer=None):
    return tf.get_variable(name, shape, initializer=initializer)

This defines a variable through tf.get_variable.

Closing remarks

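This post covers the implementation of the first five steps, which is essentially everything in the diagram except its last part. The code is simple, mostly neural-network layers and reshapes, so I won't explain it in more detail. That's it for the model part; the rest will follow in later posts. Once again, my respects to those who came before.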

葛葛葛立鹏啊
