Preface
Today we walk through the main body of the model, the model part. First, the overall architecture diagram:
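(1) First, VGG16 is used as the base net to extract features; the conv5_3 output serves as the feature map, of size W×H×C.
(2) A 3×3 window then slides over this feature map, so each window yields a feature vector of length 3×3×C. This vector is used to predict the offsets relative to 10 anchors, i.e. each window center predicts 10 text proposals.
(3) The features from the previous step are fed into a bidirectional LSTM, producing an output of size W×256, followed by a 512-dimensional fully connected layer that prepares the output.
(4) The output layer has three outputs, where k is the number of anchors per position (10 here). 2k vertical coordinates: each anchor is described by two values, the y coordinate of its center and the height of the box, so there are 2k outputs in total (note that these are offsets relative to the anchor). 2k scores: k text proposals are predicted, and each has one text score and one non-text score. k side-refinement values: these refine the two end points of a text line and represent the horizontal shift of each proposal.
(5) The method produces densely predicted text proposals, so a standard non-maximum suppression (NMS) step is used to filter out redundant boxes.
(6) Finally, a graph-based text-line construction algorithm merges the individual text segments into text lines.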
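Step (5) relies on standard non-maximum suppression. As a rough illustration only (this is not the project's own code; the function names `iou` and `nms`, the box layout `(x1, y1, x2, y2, score)` and the 0.5 threshold are all assumptions made for this sketch), a pure-Python greedy NMS could look like this:

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2, ...)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, iou_threshold=0.5):
    """Greedy NMS: visit boxes by descending score, keep a box only if it
    does not overlap an already kept box by more than iou_threshold."""
    kept = []
    for box in sorted(boxes, key=lambda b: b[4], reverse=True):
        if all(iou(box, k) <= iou_threshold for k in kept):
            kept.append(box)
    return kept


proposals = [
    (0, 0, 16, 30, 0.9),
    (1, 1, 16, 31, 0.8),   # overlaps the first box heavily
    (32, 0, 48, 30, 0.7),  # a separate proposal further right
]
kept = nms(proposals)
```

In this toy input the second proposal is suppressed by the first, while the third survives because it does not overlap anything that was kept.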
model()
def model(image):
    # TensorFlow 1.x code; slim is tf.contrib.slim and vgg is the module built in vgg.py
    image = mean_image_subtraction(image)
    with slim.arg_scope(vgg.vgg_arg_scope()):
        conv5_3 = vgg.vgg_16(image)
    # 3x3 sliding window over the conv5_3 feature map
    rpn_conv = slim.conv2d(conv5_3, 512, 3)
    lstm_output = Bilstm(rpn_conv, 512, 128, 512, scope_name='BiLSTM')
    bbox_pred = lstm_fc(lstm_output, 512, 10 * 4, scope_name="bbox_pred")
    cls_pred = lstm_fc(lstm_output, 512, 10 * 2, scope_name="cls_pred")
    # reshape: (1, H, W, A x d) -> (1, H, W x A, d)
    cls_pred_shape = tf.shape(cls_pred)
    cls_pred_reshape = tf.reshape(cls_pred, [cls_pred_shape[0], cls_pred_shape[1], -1, 2])
    cls_pred_reshape_shape = tf.shape(cls_pred_reshape)
    # softmax over the two scores of each anchor, then restore the 4-D shape
    cls_prob = tf.reshape(tf.nn.softmax(tf.reshape(cls_pred_reshape, [-1, cls_pred_reshape_shape[3]])),
                          [-1, cls_pred_reshape_shape[1], cls_pred_reshape_shape[2], cls_pred_reshape_shape[3]],
                          name="cls_prob")
    return bbox_pred, cls_pred, cls_prob
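First statement: image = mean_image_subtraction(image)
def mean_image_subtraction(images, means=[123.68, 116.78, 103.94]):
    num_channels = images.get_shape().as_list()[-1]
    if len(means) != num_channels:
        raise ValueError('len(means) must match the number of channels')
    # split into one tensor per channel, subtract that channel's mean, then merge again
    channels = tf.split(axis=3, num_or_size_splits=num_channels, value=images)
    for i in range(num_channels):
        channels[i] -= means[i]
    return tf.concat(axis=3, values=channels)
The image is split into its channels, the corresponding mean is subtracted from each channel, and the channels are concatenated again. This is the usual per-channel mean subtraction for color (RGB) input; the default values are the ImageNet channel means commonly used with VGG.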
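To see what the split / subtract / concat sequence amounts to, here is a small NumPy stand-in. It is only a sketch: `mean_image_subtraction_np` is a name invented for this example, and it assumes an NHWC batch just like the TensorFlow version.

```python
import numpy as np


def mean_image_subtraction_np(images, means=(123.68, 116.78, 103.94)):
    """NumPy stand-in: subtract one mean per channel from an NHWC batch."""
    images = np.asarray(images, dtype=np.float32)
    if images.shape[-1] != len(means):
        raise ValueError('len(means) must match the number of channels')
    # Broadcasting over the last axis does the split / subtract / concat in one step.
    return images - np.asarray(means, dtype=np.float32)


batch = np.full((1, 2, 2, 3), 128.0, dtype=np.float32)  # one tiny gray image
centered = mean_image_subtraction_np(batch)
```

Every pixel of the gray test image ends up with a different value per channel, because each channel has its own mean.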
conv5_3 = vgg.vgg_16(image)
def vgg_16(inputs, scope='vgg_16'):
    with tf.variable_scope(scope, 'vgg_16', [inputs]) as sc:
        with slim.arg_scope([slim.conv2d, slim.fully_connected, slim.max_pool2d]):
            net = slim.repeat(inputs, 2, slim.conv2d, 64, [3, 3], scope='conv1')
            net = slim.max_pool2d(net, [2, 2], scope='pool1')
            net = slim.repeat(net, 2, slim.conv2d, 128, [3, 3], scope='conv2')
            net = slim.max_pool2d(net, [2, 2], scope='pool2')
            net = slim.repeat(net, 3, slim.conv2d, 256, [3, 3], scope='conv3')
            net = slim.max_pool2d(net, [2, 2], scope='pool3')
            net = slim.repeat(net, 3, slim.conv2d, 512, [3, 3], scope='conv4')
            net = slim.max_pool2d(net, [2, 2], scope='pool4')
            net = slim.repeat(net, 3, slim.conv2d, 512, [3, 3], scope='conv5')
    return net
This is the first part of the model, the VGG16 network. It is built in vgg.py and called here, and it corresponds to step (1).
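The listing has four 2×2 poolings before conv5, so the conv5_3 feature map is roughly 1/16 of the input in each spatial dimension. The helper below is a back-of-the-envelope sketch, assuming the 3×3 convolutions keep the spatial size and each pooling halves it with rounding down; `conv5_3_size` and the 608×800 input are made up for the example.

```python
def conv5_3_size(height, width, num_pools=4):
    """Spatial size after the four 2x2, stride-2 poolings that precede conv5."""
    for _ in range(num_pools):
        height, width = height // 2, width // 2
    return height, width


feat_h, feat_w = conv5_3_size(608, 800)
```

Under these assumptions each position of the feature map covers a 16-pixel step in the input image.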
rpn_conv = slim.conv2d(conv5_3, 512, 3)
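A 3×3 window slides over the feature map; slim.conv2d with 512 output channels and kernel size 3 implements this as a 3×3 convolution. Each window therefore yields a feature vector of length 3×3×C, which is used to predict the offsets relative to 10 anchors, i.e. each window center predicts 10 text proposals. This corresponds to step (2).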
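The "one 3×3×C vector per window" view can be made explicit with NumPy. This is an illustrative sketch, not the project's code: `sliding_window_vectors` is a hypothetical helper, zero padding at the border is an assumption, and the bias and activation that slim.conv2d applies by default are left out.

```python
import numpy as np


def sliding_window_vectors(feature_map, k=3):
    """Return, for every position, the flattened k x k x C neighborhood (zero padded)."""
    h, w, c = feature_map.shape
    pad = k // 2
    padded = np.pad(feature_map, ((pad, pad), (pad, pad), (0, 0)), mode='constant')
    out = np.empty((h, w, k * k * c), dtype=feature_map.dtype)
    for y in range(h):
        for x in range(w):
            out[y, x] = padded[y:y + k, x:x + k, :].reshape(-1)
    return out


rng = np.random.default_rng(0)
fmap = rng.standard_normal((4, 6, 8)).astype(np.float32)   # H=4, W=6, C=8
windows = sliding_window_vectors(fmap)                      # one 3*3*8 vector per position
weights = rng.standard_normal((3 * 3 * 8, 512)).astype(np.float32)
rpn_like = windows @ weights                                # linear map to 512 channels
```

Multiplying every window vector by the same weight matrix is what a 3×3 convolution with 512 filters computes, which is why the code can use a single conv layer for the sliding window.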
lstm_output = Bilstm(rpn_conv, 512, 128, 512, scope_name='BiLSTM')
def Bilstm(net, input_channel, hidden_unit_num, output_channel, scope_name):
    # width--->time step
    with tf.variable_scope(scope_name) as scope:
        shape = tf.shape(net)
        N, H, W, C = shape[0], shape[1], shape[2], shape[3]
        net = tf.reshape(net, [N * H, W, C])
        net.set_shape([None, None, input_channel])
        lstm_fw_cell = tf.contrib.rnn.LSTMCell(hidden_unit_num, state_is_tuple=True)
        lstm_bw_cell = tf.contrib.rnn.LSTMCell(hidden_unit_num, state_is_tuple=True)
        lstm_out, last_state = tf.nn.bidirectional_dynamic_rnn(lstm_fw_cell, lstm_bw_cell, net, dtype=tf.float32)
        lstm_out = tf.concat(lstm_out, axis=-1)
        lstm_out = tf.reshape(lstm_out, [N * H * W, 2 * hidden_unit_num])
        init_weights = tf.contrib.layers.variance_scaling_initializer(factor=0.01, mode='FAN_AVG', uniform=False)
        init_biases = tf.constant_initializer(0.0)
        weights = make_var('weights', [2 * hidden_unit_num, output_channel], init_weights)
        biases = make_var('biases', [output_channel], init_biases)
        outputs = tf.matmul(lstm_out, weights) + biases
        outputs = tf.reshape(outputs, [N, H, W, output_channel])
        return outputs
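This defines the bidirectional LSTM. The features from the previous step are fed into it, producing an output of size W×256 (2 × 128 hidden units, forward and backward concatenated), followed by a 512-dimensional fully connected layer that prepares the output. This is step (3).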
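The reshapes in Bilstm are easier to follow with concrete shapes. The NumPy sketch below only traces shapes: the zero arrays stand in for the real LSTM outputs and FC weights, and the sizes N, H, W, C are picked arbitrarily for the example.

```python
import numpy as np

N, H, W, C = 2, 3, 5, 4
hidden = 128
net = np.arange(N * H * W * C, dtype=np.float32).reshape(N, H, W, C)

# Each feature-map row becomes one sequence of W time steps.
sequences = net.reshape(N * H, W, C)

# Placeholders for the BiLSTM: forward and backward outputs of size `hidden` each.
out_fw = np.zeros((N * H, W, hidden), dtype=np.float32)
out_bw = np.zeros((N * H, W, hidden), dtype=np.float32)
lstm_out = np.concatenate([out_fw, out_bw], axis=-1)        # (N*H, W, 256)

# Fully connected layer applied to every position, then back to NHWC.
fc_weights = np.zeros((2 * hidden, 512), dtype=np.float32)
flat = lstm_out.reshape(N * H * W, 2 * hidden)
outputs = (flat @ fc_weights).reshape(N, H, W, 512)
```

The key point is the first reshape: row h of image n becomes sequence n * H + h, so the recurrence runs along the width of the feature map, as the "width--->time step" comment says.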
bbox_pred = lstm_fc(lstm_output, 512, 10 * 4, scope_name="bbox_pred")
cls_pred = lstm_fc(lstm_output, 512, 10 * 2, scope_name="cls_pred")
def lstm_fc(net, input_channel, output_channel, scope_name):
    with tf.variable_scope(scope_name) as scope:
        shape = tf.shape(net)
        N, H, W, C = shape[0], shape[1], shape[2], shape[3]
        net = tf.reshape(net, [N * H * W, C])
        init_weights = tf.contrib.layers.variance_scaling_initializer(factor=0.01, mode='FAN_AVG', uniform=False)
        init_biases = tf.constant_initializer(0.0)
        weights = make_var('weights', [input_channel, output_channel], init_weights)
        biases = make_var('biases', [output_channel], init_biases)
        output = tf.matmul(net, weights) + biases
        output = tf.reshape(output, [N, H, W, output_channel])
        return output
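Finally, fully connected layers output the position predictions and the score predictions; the scores then go through tf.nn.softmax, and the function returns the three results bbox_pred, cls_pred and cls_prob. Note that bbox_pred has 10 × 4 channels in this code, i.e. four regression values per anchor, rather than the two vertical coordinates per anchor described in step (4).
A few notes on the network functions used here: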
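The reshape-then-softmax step at the end of model() can be reproduced in NumPy to check what ends up where. This is a sketch with made-up sizes; the `softmax` helper is written here for the example and is not part of the project.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along one axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=axis, keepdims=True)


H, W, A = 3, 4, 10                       # A anchors per position
rng = np.random.default_rng(0)
cls_pred = rng.standard_normal((1, H, W, A * 2)).astype(np.float32)

# (1, H, W, A*2) -> (1, H, W*A, 2): one pair of scores per anchor.
cls_pred_reshape = cls_pred.reshape(1, H, -1, 2)
cls_prob = softmax(cls_pred_reshape, axis=-1)
```

After the reshape, anchor a at column w sits at index w * A + a of the third axis, and the softmax turns its two raw scores into two probabilities that sum to 1.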
net = slim.repeat(inputs, 2, slim.conv2d, 64, [3, 3], scope='conv1')
This is slim's way of defining the same layer several times in a row: the second argument, 2, means the layer is stacked twice.
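What slim.repeat does can be mimicked in a few lines of plain Python. This is only a toy model of the idea: `repeat` and `fake_conv2d` are stand-ins written for this example, and the scope naming follows the `conv1/conv1_1`, `conv1/conv1_2` pattern described in the slim documentation.

```python
def repeat(inputs, repetitions, layer, *args, scope='repeat', **kwargs):
    """Apply `layer` to its own output `repetitions` times, one sub-scope per call."""
    net = inputs
    for i in range(1, repetitions + 1):
        net = layer(net, *args, scope='%s/%s_%d' % (scope, scope, i), **kwargs)
    return net


def fake_conv2d(inputs, num_outputs, kernel_size, scope):
    """Stand-in layer that only records what would have been built."""
    return inputs + [(scope, num_outputs, tuple(kernel_size))]


layers = repeat([], 2, fake_conv2d, 64, [3, 3], scope='conv1')
```

Here `layers` lists two 64-channel 3×3 layers, the second one taking the output of the first as its input.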
lstm_fw_cell = tf.contrib.rnn.LSTMCell(hidden_unit_num, state_is_tuple=True)
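Builds the LSTM cell.
bidirectional_dynamic_rnn(
    cell_fw,  # forward RNN cell
    cell_bw,  # backward RNN cell
    inputs,  # input
    sequence_length=None,  # actual length of each input sequence (optional; defaults to the maximum sequence length)
    initial_state_fw=None,  # initial state of the forward RNN (optional)
    initial_state_bw=None,  # initial state of the backward RNN (optional)
    dtype=None,  # data type of the initial state and the output (optional)
    parallel_iterations=None,
    swap_memory=False,
    time_major=False,
    scope=None)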
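The first return value of tf.nn.bidirectional_dynamic_rnn is a pair (output_fw, output_bw), which is why Bilstm concatenates it along the last axis. The toy below shows the idea in plain Python; it is a sketch with a running sum standing in for the LSTM cell, and all names in it are invented for the example.

```python
def run_rnn(cell, inputs, state):
    """Unroll `cell` over a sequence and return the output of every time step."""
    outputs = []
    for x in inputs:
        state = cell(x, state)
        outputs.append(state)
    return outputs


def bidirectional(cell_fw, cell_bw, inputs, init_fw=0, init_bw=0):
    """Forward pass over the sequence plus a backward pass, aligned per time step."""
    out_fw = run_rnn(cell_fw, inputs, init_fw)
    out_bw = run_rnn(cell_bw, inputs[::-1], init_bw)[::-1]  # reverse back to input order
    return out_fw, out_bw


def running_sum(x, state):
    """Toy "cell": the new state is the old state plus the input."""
    return state + x


out_fw, out_bw = bidirectional(running_sum, running_sum, [1, 2, 3])
# Counterpart of tf.concat(lstm_out, axis=-1): join both directions per time step.
combined = list(zip(out_fw, out_bw))
```

Each time step ends up with a summary of everything to its left (forward) and everything to its right (backward), which is what lets the network use context from both sides of a position.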
tf.contrib.layers.variance_scaling_initializer creates the initializer for the weights.
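For a feel of the scale this produces, the target standard deviation can be worked out by hand. The sketch assumes the documented behavior of the initializer, namely that with uniform=False the target stddev is sqrt(factor / n), with n the average of fan-in and fan-out in FAN_AVG mode; the helper name is made up, and any internal correction the library applies for truncating the normal distribution is ignored here.

```python
import math


def variance_scaling_stddev(shape, factor=0.01, mode='FAN_AVG'):
    """Target stddev of variance scaling for a 2-D weight matrix [fan_in, fan_out]."""
    fan_in, fan_out = shape[0], shape[1]
    n = {'FAN_IN': fan_in, 'FAN_OUT': fan_out, 'FAN_AVG': (fan_in + fan_out) / 2.0}[mode]
    return math.sqrt(factor / n)


# The FC layer inside Bilstm maps 2 * 128 LSTM outputs to 512 channels.
stddev = variance_scaling_stddev([256, 512])
```

With these settings the weights start out very small, on the order of 0.005.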
def make_var(name, shape, initializer=None):
    return tf.get_variable(name, shape, initializer=initializer)
Defines a variable via tf.get_variable.
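Closing remarks
This post covers the network part of the pipeline, essentially everything in the figure except the final post-processing: the code shown stops at the outputs of step (4), and NMS and text-line construction come afterwards. The code is simple, mostly network layers and reshapes, so I have not explained it in more detail. That is all for the model part; the rest will follow in a later post. Once again, my respects to those who came before.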