Preface
Among the papers from SenseTime accepted at CVPR 2018 was "GeoNet: Unsupervised Learning of Dense Depth, Optical Flow and Camera Pose", which proposes a framework called GeoNet: an unsupervised learning framework that jointly learns monocular depth, optical flow, and camera pose. It outperforms a number of existing unsupervised methods and achieves results comparable to supervised approaches. I am currently studying this topic, and below is a small summary of what I have learned, covering the network architecture and an analysis of the experimental results.
The paper and GitHub links are attached below; I look forward to exchanging ideas with everyone~
paper:https://arxiv.org/abs/1803.02276
github:https://github.com/yzcjtr/GeoNet
Overall Network Architecture
GeoNet perceives the geometry of a 3D scene in an unsupervised fashion. The overall architecture consists of two parts: a rigid structure reconstructor and a non-rigid motion localizer, which learn the rigid flow and object motion respectively. Throughout the whole process, image appearance similarity is used to guide the unsupervised learning.
Stage 1: Rigid structure reasoning
This stage consists of two sub-networks, DepthNet and PoseNet, which regress the depth maps and the camera pose respectively. Fusing the two outputs yields the rigid flow, i.e. the image motion induced purely by camera motion over the static scene (a sketch of this fusion follows).
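To make the fusion concrete, here is a minimal NumPy sketch of how a rigid flow can be derived from a predicted depth map, a relative camera pose, and the camera intrinsics. This is an illustrative reimplementation under my own assumptions, not the repository's TensorFlow code, and all names in it are hypothetical:

import numpy as np

def rigid_flow(depth, pose_mat, K):
    """Rigid flow induced by camera motion over a static scene.
    depth:    (H, W) predicted depth of the target frame
    pose_mat: (4, 4) relative camera pose T_{target->source}
    K:        (3, 3) camera intrinsics
    """
    H, W = depth.shape
    # Homogeneous pixel grid of the target frame
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    pix = np.stack([u, v, np.ones_like(u)], axis=0).reshape(3, -1)  # (3, H*W)
    # Back-project each pixel to 3D camera coordinates: X = D * K^{-1} p
    cam = np.linalg.inv(K) @ pix * depth.reshape(1, -1)
    cam_h = np.vstack([cam, np.ones((1, H * W))])                   # (4, H*W)
    # Transform into the source frame and project back to pixel coordinates
    proj = K @ (pose_mat @ cam_h)[:3]
    src_pix = proj[:2] / proj[2:3]
    # Rigid flow = displacement of each pixel between the two views
    return (src_pix - pix[:2]).reshape(2, H, W)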
Stage 2: Non-rigid motion localizer
This stage is implemented by ResFlowNet and handles dynamic objects. The non-rigid (residual) flow learned by ResFlowNet is then combined with the rigid flow to produce the final flow prediction, as sketched below.
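Put together, the cascaded design reads roughly like this; res_flow_net here is a hypothetical stand-in for the ResFlowNet forward pass, and rigid_flow is the sketch above:

# Hypothetical composition of the two stages; names are illustrative only
rigid = rigid_flow(depth, pose_mat, K)           # Stage 1: flow from static geometry
residual = res_flow_net(target, source, rigid)   # Stage 2: motion of dynamic objects
full_flow = rigid + residual                     # final dense optical-flow prediction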
As can be seen, each of the three sub-networks targets one specific sub-task, so the complex goal of understanding scene geometry is decomposed into simpler ones. At each stage, view synthesis for that stage is used as the fundamental form of supervision.
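As a rough illustration of what "view synthesis as supervision" means, the sketch below warps the source frame towards the target frame with a predicted flow and measures the photometric error. The actual model uses differentiable bilinear sampling plus SSIM and smoothness terms, so treat this NumPy version only as the idea in miniature:

import numpy as np

def photometric_loss(target, source, flow):
    """Synthesize the target view from the source frame and compare.
    target, source: (H, W, 3) images; flow: (H, W, 2) target->source flow."""
    H, W, _ = target.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    # Location in the source image that each target pixel maps to
    xs = np.clip(np.round(u + flow[..., 0]).astype(int), 0, W - 1)
    ys = np.clip(np.round(v + flow[..., 1]).astype(int), 0, H - 1)
    warped = source[ys, xs]   # nearest-neighbour warp for brevity;
                              # the real model uses bilinear sampling
    return float(np.abs(target - warped).mean())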
DepthNet & ResFlowNet
The paper adopts the network from "Unsupervised Monocular Depth Estimation with Left-Right Consistency" as the backbone of both DepthNet and ResFlowNet to learn pixel-level geometric information. The structure has two parts: an encoder and a decoder. The encoder is based on ResNet50; the decoder is built from deconvolution layers and enlarges the feature maps back to full resolution. To preserve both global high-level features and local details, skip connections link the encoder and decoder at different resolutions, and depth is predicted at multiple scales.
The ResNet50 network structure
Source code:
# (slim is TF1.x's tf.contrib.slim; imports added here for completeness)
import numpy as np
import tensorflow as tf
import tensorflow.contrib.slim as slim

# DepthNet can be built with either a VGG or a ResNet50 encoder for comparison;
# ResNet50 is used by default
def disp_net(opt, dispnet_inputs):
    is_training = opt.mode == 'train_rigid'
    if opt.dispnet_encoder == 'vgg':
        return build_vgg(dispnet_inputs, get_disp_vgg, is_training, 'depth_net')
    else:
        return build_resnet50(dispnet_inputs, get_disp_resnet50, is_training, 'depth_net')
# Prediction heads: a sigmoid-activated conv maps features into (0, 1), which is
# scaled into a valid disparity range; the +0.01 offset keeps disparity positive
def get_disp_vgg(x):
    disp = DISP_SCALING_VGG * slim.conv2d(x, 1, 3, 1, activation_fn=tf.nn.sigmoid,
                                          normalizer_fn=None) + 0.01
    return disp

def get_disp_resnet50(x):
    disp = DISP_SCALING_RESNET50 * conv(x, 1, 3, 1, activation_fn=tf.nn.sigmoid,
                                        normalizer_fn=None) + 0.01
    return disp
# Build the VGG-style encoder-decoder
def build_vgg(inputs, get_pred, is_training, var_scope):
    batch_norm_params = {'is_training': is_training}
    H = inputs.get_shape()[1].value
    W = inputs.get_shape()[2].value
    with tf.variable_scope(var_scope) as sc:
        with slim.arg_scope([slim.conv2d, slim.conv2d_transpose],
                            normalizer_fn=slim.batch_norm,
                            normalizer_params=batch_norm_params,
                            weights_regularizer=slim.l2_regularizer(0.0001),
                            activation_fn=tf.nn.relu):
            # ENCODING
            conv1 = slim.conv2d(inputs, 32, 7, 2)
            conv1b = slim.conv2d(conv1, 32, 7, 1)
            conv2 = slim.conv2d(conv1b, 64, 5, 2)
            conv2b = slim.conv2d(conv2, 64, 5, 1)
            conv3 = slim.conv2d(conv2b, 128, 3, 2)
            conv3b = slim.conv2d(conv3, 128, 3, 1)
            conv4 = slim.conv2d(conv3b, 256, 3, 2)
            conv4b = slim.conv2d(conv4, 256, 3, 1)
            conv5 = slim.conv2d(conv4b, 512, 3, 2)
            conv5b = slim.conv2d(conv5, 512, 3, 1)
            conv6 = slim.conv2d(conv5b, 512, 3, 2)
            conv6b = slim.conv2d(conv6, 512, 3, 1)
            conv7 = slim.conv2d(conv6b, 512, 3, 2)
            conv7b = slim.conv2d(conv7, 512, 3, 1)

            # DECODING
            upconv7 = upconv(conv7b, 512, 3, 2)
            # There might be dimension mismatch due to uneven down/up-sampling
            upconv7 = resize_like(upconv7, conv6b)
            i7_in = tf.concat([upconv7, conv6b], axis=3)
            iconv7 = slim.conv2d(i7_in, 512, 3, 1)

            upconv6 = upconv(iconv7, 512, 3, 2)
            upconv6 = resize_like(upconv6, conv5b)
            i6_in = tf.concat([upconv6, conv5b], axis=3)
            iconv6 = slim.conv2d(i6_in, 512, 3, 1)

            upconv5 = upconv(iconv6, 256, 3, 2)
            upconv5 = resize_like(upconv5, conv4b)
            i5_in = tf.concat([upconv5, conv4b], axis=3)
            iconv5 = slim.conv2d(i5_in, 256, 3, 1)

            upconv4 = upconv(iconv5, 128, 3, 2)
            i4_in = tf.concat([upconv4, conv3b], axis=3)
            iconv4 = slim.conv2d(i4_in, 128, 3, 1)
            pred4 = get_pred(iconv4)
            pred4_up = tf.image.resize_bilinear(pred4, [np.int(H/4), np.int(W/4)])

            upconv3 = upconv(iconv4, 64, 3, 2)
            i3_in = tf.concat([upconv3, conv2b, pred4_up], axis=3)
            iconv3 = slim.conv2d(i3_in, 64, 3, 1)
            pred3 = get_pred(iconv3)
            pred3_up = tf.image.resize_bilinear(pred3, [np.int(H/2), np.int(W/2)])

            upconv2 = upconv(iconv3, 32, 3, 2)
            i2_in = tf.concat([upconv2, conv1b, pred3_up], axis=3)
            iconv2 = slim.conv2d(i2_in, 32, 3, 1)
            pred2 = get_pred(iconv2)
            pred2_up = tf.image.resize_bilinear(pred2, [H, W])

            upconv1 = upconv(iconv2, 16, 3, 2)
            i1_in = tf.concat([upconv1, pred2_up], axis=3)
            iconv1 = slim.conv2d(i1_in, 16, 3, 1)
            pred1 = get_pred(iconv1)

            return [pred1, pred2, pred3, pred4]
# Build the ResNet50 encoder-decoder
def build_resnet50(inputs, get_pred, is_training, var_scope):
    batch_norm_params = {'is_training': is_training}
    with tf.variable_scope(var_scope) as sc:
        with slim.arg_scope([slim.conv2d, slim.conv2d_transpose],
                            normalizer_fn=slim.batch_norm,
                            normalizer_params=batch_norm_params,
                            weights_regularizer=slim.l2_regularizer(0.0001),
                            activation_fn=tf.nn.relu):
            # ENCODING
            conv1 = conv(inputs, 64, 7, 2)   # H/2  -   64D
            pool1 = maxpool(conv1, 3)        # H/4  -   64D
            conv2 = resblock(pool1, 64, 3)   # H/8  -  256D
            conv3 = resblock(conv2, 128, 4)  # H/16 -  512D
            conv4 = resblock(conv3, 256, 6)  # H/32 - 1024D
            conv5 = resblock(conv4, 512, 3)  # H/64 - 2048D

            skip1 = conv1
            skip2 = pool1
            skip3 = conv2
            skip4 = conv3
            skip5 = conv4

            # DECODING
            upconv6 = upconv(conv5, 512, 3, 2)  # H/32
            upconv6 = resize_like(upconv6, skip5)
            concat6 = tf.concat([upconv6, skip5], 3)
            iconv6 = conv(concat6, 512, 3, 1)

            upconv5 = upconv(iconv6, 256, 3, 2)  # H/16
            upconv5 = resize_like(upconv5, skip4)
            concat5 = tf.concat([upconv5, skip4], 3)
            iconv5 = conv(concat5, 256, 3, 1)

            upconv4 = upconv(iconv5, 128, 3, 2)  # H/8
            upconv4 = resize_like(upconv4, skip3)
            concat4 = tf.concat([upconv4, skip3], 3)
            iconv4 = conv(concat4, 128, 3, 1)
            pred4 = get_pred(iconv4)
            upred4 = upsample_nn(pred4, 2)

            upconv3 = upconv(iconv4, 64, 3, 2)  # H/4
            concat3 = tf.concat([upconv3, skip2, upred4], 3)
            iconv3 = conv(concat3, 64, 3, 1)
            pred3 = get_pred(iconv3)
            upred3 = upsample_nn(pred3, 2)

            upconv2 = upconv(iconv3, 32, 3, 2)  # H/2
            concat2 = tf.concat([upconv2, skip1, upred3], 3)
            iconv2 = conv(concat2, 32, 3, 1)
            pred2 = get_pred(iconv2)
            upred2 = upsample_nn(pred2, 2)

            upconv1 = upconv(iconv2, 16, 3, 2)  # H
            concat1 = tf.concat([upconv1, upred2], 3)
            iconv1 = conv(concat1, 16, 3, 1)
            pred1 = get_pred(iconv1)

            return [pred1, pred2, pred3, pred4]
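For orientation, a hypothetical call might look as follows. The opt fields are inferred from the call sites above, the batch size and 128×416 resolution are illustrative (128×416 is the KITTI resolution commonly used in this line of work), and the repository's helper functions are assumed to be in scope:

from collections import namedtuple

# Hypothetical option object; field names follow the call sites above
Opt = namedtuple('Opt', ['mode', 'dispnet_encoder'])
opt = Opt(mode='train_rigid', dispnet_encoder='resnet50')

images = tf.placeholder(tf.float32, [4, 128, 416, 3])  # batch of input frames
preds = disp_net(opt, images)  # [pred1, ..., pred4]: full, 1/2, 1/4, 1/8 scale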
PoseNet
The PoseNet is similar to the network in "Unsupervised Learning of Depth and Ego-Motion from Video". It contains 8 convolutional layers, with global average pooling applied to the per-location predictions before the final output. All convolutional layers except the output layer use Batch Normalization and ReLU activations.
Source code:
def pose_net(opt, posenet_inputs):
    is_training = opt.mode == 'train_rigid'
    batch_norm_params = {'is_training': is_training}
    with tf.variable_scope('pose_net') as sc:
        with slim.arg_scope([slim.conv2d],
                            normalizer_fn=slim.batch_norm,
                            normalizer_params=batch_norm_params,
                            weights_regularizer=slim.l2_regularizer(0.0001),
                            activation_fn=tf.nn.relu):
            conv1 = slim.conv2d(posenet_inputs, 16, 7, 2)
            conv2 = slim.conv2d(conv1, 32, 5, 2)
            conv3 = slim.conv2d(conv2, 64, 3, 2)
            conv4 = slim.conv2d(conv3, 128, 3, 2)
            conv5 = slim.conv2d(conv4, 256, 3, 2)
            conv6 = slim.conv2d(conv5, 256, 3, 2)
            conv7 = slim.conv2d(conv6, 256, 3, 2)
            # Output layer: no batch norm or activation; 6 DoF per source view
            pose_pred = slim.conv2d(conv7, 6 * opt.num_source, 1, 1,
                                    normalizer_fn=None, activation_fn=None)
            # Global average pooling over all spatial locations
            pose_avg = tf.reduce_mean(pose_pred, [1, 2])
            # Scale down so training starts from near-identity poses
            pose_final = 0.01 * tf.reshape(pose_avg, [-1, opt.num_source, 6])
            return pose_final
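Each 6-vector that pose_net returns encodes a relative camera transform. A common convention, which is an assumption here since the repository's own utility may order or parameterize the components differently, is three translation plus three Euler-angle components, assembled into a 4×4 matrix like this:

def pose_vec_to_mat(vec):
    """Minimal sketch, assuming vec = [tx, ty, tz, rx, ry, rz] with
    Euler angles in radians; conventions vary between implementations."""
    tx, ty, tz, rx, ry, rz = vec
    cx, sx = np.cos(rx), np.sin(rx)
    cy, sy = np.cos(ry), np.sin(ry)
    cz, sz = np.cos(rz), np.sin(rz)
    Rx = np.array([[1, 0, 0], [0, cx, -sx], [0, sx, cx]])
    Ry = np.array([[cy, 0, sy], [0, 1, 0], [-sy, 0, cy]])
    Rz = np.array([[cz, -sz, 0], [sz, cz, 0], [0, 0, 1]])
    T = np.eye(4)
    T[:3, :3] = Rz @ Ry @ Rx  # composed rotation
    T[:3, 3] = [tx, ty, tz]   # translation
    return T

A matrix built this way is exactly what the rigid-flow sketch in Stage 1 consumes as pose_mat, which closes the loop between PoseNet's output and the rigid flow.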