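I'll start by posting the network architecture from the original PointNet paper and work through it bit by bit.
The figure shows two networks: one for classification, the other for semantic segmentation. Let's take the classification network first.
The segmentation network, as you can see, just takes the global feature at the end, repeats it n times, and concatenates it onto the n×64 layer... oops, I'm getting ahead of myself. I haven't even explained the classification network yet!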
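Since the concat step just came up, here is a quick NumPy sketch of it before backing up to classification. This is not code from the PointNet repository: `concat_global_feature` is a made-up name, the 64 and 1024 sizes are read off the figure, and the data is random.

```python
import numpy as np


def concat_global_feature(point_feats, global_feat):
    """point_feats: [B, N, 64], global_feat: [B, 1024] -> [B, N, 1088]."""
    num_point = point_feats.shape[1]
    # Repeat the global vector once per point: [B, 1024] -> [B, N, 1024]
    tiled = np.repeat(global_feat[:, np.newaxis, :], num_point, axis=1)
    # Every point now carries its own local feature plus the shared global one
    return np.concatenate([point_feats, tiled], axis=-1)


rng = np.random.default_rng(0)
point_feats = rng.standard_normal((2, 5, 64))   # hypothetical per-point features
global_feat = rng.standard_normal((2, 1024))    # hypothetical global feature
seg_input = concat_global_feature(point_feats, global_feat)
print(seg_input.shape)  # (2, 5, 1088)
```

Each point ends up with 64 + 1024 = 1088 channels, so the per-point layers that follow can see both local and global information.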
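OK, starting with the classification network. The input is n×3, where n is the number of points in the input cloud, usually 128/256/512/1024/2048/4096. Naturally, the more points you take, the more information you get and the higher the accuracy tends to be. Feels a bit like playing 2048, haha O(∩_∩)O
The 3 is the (x, y, z) coordinate of each point.
Next comes an input transformation, which embeds a T-Net. The T-Net learns a 3x3 matrix that is meant to make the network invariant to rotations of the input; think of it as playing the role of R. I still have one doubt here, though: nothing in the loss regularizes this matrix, so how can the learned 3x3 matrix be guaranteed to satisfy the properties of a rotation matrix (⊙o⊙)? (For what it's worth, the paper describes this matrix as an affine transformation, so it does not seem to be required to be a pure rotation.)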
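To make that doubt concrete, here is a small NumPy sketch (my own illustration, not code from the repo; `apply_transform` and `is_rotation` are hypothetical helpers). It applies a 3x3 matrix to a point cloud the same way the network does, then checks the two properties a rotation matrix must have: orthogonality and determinant +1.

```python
import numpy as np


def apply_transform(points, transform):
    """points: [B, N, 3], transform: [B, 3, 3] -> [B, N, 3] (batched points @ transform)."""
    return np.matmul(points, transform)


def is_rotation(mat, tol=1e-6):
    """True if mat is orthogonal (mat @ mat.T == I) and has determinant +1."""
    k = mat.shape[0]
    orthogonal = np.allclose(mat @ mat.T, np.eye(k), atol=tol)
    return bool(orthogonal and abs(np.linalg.det(mat) - 1.0) < tol)


theta = np.pi / 6
rot_z = np.array([[np.cos(theta), -np.sin(theta), 0.0],
                  [np.sin(theta),  np.cos(theta), 0.0],
                  [0.0, 0.0, 1.0]])
scaled = 1.5 * rot_z  # the kind of matrix an unconstrained 3x3 output could be

points = np.random.default_rng(0).standard_normal((1, 4, 3))
rotated = apply_transform(points, rot_z[np.newaxis])
print(is_rotation(rot_z), is_rotation(scaled))  # True False
```

A true rotation keeps every point's distance from the origin unchanged; the scaled matrix does not, and nothing in an unconstrained output layer rules it out.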
Here is the T-Net code:
import numpy as np
import tensorflow as tf  # TensorFlow 1.x API
import tf_util  # helper module from the PointNet repository (utils/tf_util.py)


def input_transform_net(point_cloud, is_training, bn_decay=None, K=3):
    """ Input (XYZ) Transform Net, input is BxNx3 gray image
        Return:
            Transformation matrix of size 3xK """
    batch_size = point_cloud.get_shape()[0].value
    num_point = point_cloud.get_shape()[1].value
    # Append a trailing dimension: [B, N, 3] -> [B, N, 3, 1]
    input_image = tf.expand_dims(point_cloud, -1)
    # print(input_image.shape)
    # Convolution over input_image = [B, N, 3, 1] with 64 output channels and
    # kernel_size = [1, 3], so each kernel spans the (x, y, z) of one point
    net = tf_util.conv2d(input_image, 64, [1, 3],
                         padding='VALID', stride=[1, 1],
                         bn=True, is_training=is_training,
                         scope='tconv1', bn_decay=bn_decay)
    # print('tconv1' + '', net.shape)
    net = tf_util.conv2d(net, 128, [1, 1],
                         padding='VALID', stride=[1, 1],
                         bn=True, is_training=is_training,
                         scope='tconv2', bn_decay=bn_decay)
    # print('tconv2'+'',net.shape)
    net = tf_util.conv2d(net, 1024, [1, 1],
                         padding='VALID', stride=[1, 1],
                         bn=True, is_training=is_training,
                         scope='tconv3', bn_decay=bn_decay)
    # print('tconv3'+'',net.shape)
    # Max pool over all N points: [B, N, 1, 1024] -> [B, 1, 1, 1024]
    net = tf_util.max_pool2d(net, [num_point, 1],
                             padding='VALID', scope='tmaxpool')
    # print('tmaxpool'+'',net.shape)
    net = tf.reshape(net, [batch_size, -1])
    # print(net.shape)
    net = tf_util.fully_connected(net, 512, bn=True, is_training=is_training,
                                  scope='tfc1', bn_decay=bn_decay)
    # print('tfc1'+'',net.shape)
    net = tf_util.fully_connected(net, 256, bn=True, is_training=is_training,
                                  scope='tfc2', bn_decay=bn_decay)
    # print('tfc2'+'',net.shape)
    with tf.variable_scope('transform_XYZ') as sc:
        assert (K == 3)
        weights = tf.get_variable('weights', [256, 3 * K],
                                  initializer=tf.constant_initializer(0.0),
                                  dtype=tf.float32)
        biases = tf.get_variable('biases', [3 * K],
                                 initializer=tf.constant_initializer(0.0),
                                 dtype=tf.float32)
        # Add a flattened 3x3 identity to the bias
        biases += tf.constant([1, 0, 0, 0, 1, 0, 0, 0, 1], dtype=tf.float32)
        transform = tf.matmul(net, weights)
        transform = tf.nn.bias_add(transform, biases)
    transform = tf.reshape(transform, [batch_size, 3, K])
    # print(transform.shape)
    return transform
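One detail in the last few lines of `input_transform_net` is easy to miss: the weights and the bias variable are both zero-initialized, and a flattened identity is added to the bias. A minimal NumPy sketch of just that output layer (my own re-creation with a hypothetical `transform_head` helper and random stand-in features, not the repo's code) suggests what this buys:

```python
import numpy as np


def transform_head(features, weights, biases, k):
    """features: [B, 256], weights: [256, k*k], biases: [k*k] -> [B, k, k]."""
    # Linear layer plus a flattened identity added to the bias, then reshape
    flat = features @ weights + biases + np.eye(k).flatten()
    return flat.reshape(features.shape[0], k, k)


rng = np.random.default_rng(0)
features = rng.standard_normal((4, 256))  # stand-in for the tfc2 output
weights = np.zeros((256, 9))              # zero-initialized, as in the listing
biases = np.zeros(9)                      # zero-initialized, as in the listing
transform = transform_head(features, weights, biases, k=3)
print(transform[0])
```

With this initialization every sample gets the identity matrix regardless of the input features, so at the start of training the transform leaves the point cloud untouched and only drifts away from the identity as the weights are updated.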
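As you can see, the T-Net is just a small network: three convolutional layers and two fully connected layers. The resulting [batch_size, 256] tensor is multiplied by a [256, 3x3] weight matrix to give [batch_size, 9], the biases are added, and a reshape produces a [batch_size, 3, 3] tensor. At first I couldn't figure out how the T-Net actually gets trained. Later it clicked: the predicted tensor is multiplied directly onto the original data, so the T-Net is effectively a sub-network embedded in the main one, and it is optimized together with everything else when the loss is minimized.
The feature transform works the same way, except that the rotation happens in the higher-dimensional, 64-d feature space, so the predicted tensor has shape [batch_size, 64, 64]. Here is the feature transform code; as you can see, it is very similar to the input transform: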
def feature_transform_net(inputs, is_training, bn_decay=None, K=64):
    """ Feature Transform Net, input is BxNx1xK
        Return:
            Transformation matrix of size KxK
    """
    batch_size = inputs.get_shape()[0].value
    num_point = inputs.get_shape()[1].value
    net = tf_util.conv2d(inputs, 64, [1, 1],
                         padding='VALID', stride=[1, 1],
                         bn=True, is_training=is_training,
                         scope='tconv1', bn_decay=bn_decay)
    # print('tconv1' + '', net.shape)
    net = tf_util.conv2d(net, 128, [1, 1],
                         padding='VALID', stride=[1, 1],
                         bn=True, is_training=is_training,
                         scope='tconv2', bn_decay=bn_decay)
    # print('tconv2' + '', net.shape)
    net = tf_util.conv2d(net, 1024, [1, 1],
                         padding='VALID', stride=[1, 1],
                         bn=True, is_training=is_training,
                         scope='tconv3', bn_decay=bn_decay)
    # print('tconv3' + '', net.shape)
    net = tf_util.max_pool2d(net, [num_point, 1],
                             padding='VALID', scope='tmaxpool')
    # print('tmaxpool' + '', net.shape)
    net = tf.reshape(net, [batch_size, -1])
    # print('reshape', net.shape)
    net = tf_util.fully_connected(net, 512, bn=True, is_training=is_training,
                                  scope='tfc1', bn_decay=bn_decay)
    # print('tfc1' + '', net.shape)
    net = tf_util.fully_connected(net, 256, bn=True, is_training=is_training,
                                  scope='tfc2', bn_decay=bn_decay)
    # print('tfc2' + '', net.shape)
    with tf.variable_scope('transform_feat') as sc:
        weights = tf.get_variable('weights', [256, K * K],
                                  initializer=tf.constant_initializer(0.0),
                                  dtype=tf.float32)
        biases = tf.get_variable('biases', [K * K],
                                 initializer=tf.constant_initializer(0.0),
                                 dtype=tf.float32)
        biases += tf.constant(np.eye(K).flatten(), dtype=tf.float32)
        transform = tf.matmul(net, weights)
        transform = tf.nn.bias_add(transform, biases)
    # print('transform'+ '', transform.shape)
    transform = tf.reshape(transform, [batch_size, K, K])
    # print('transform_reshape' + '', transform.shape)
    return transform
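The multilayer perceptrons in between (the mlp blocks in the figure) are really just the convolutional layers we're used to; with [1, 1] kernels after the first layer, the same weights are applied to every point. Straight to the code, nice and simple: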
def get_model(point_cloud, is_training, bn_decay=None):
    """ Classification PointNet, input is BxNx3, output Bx40 """
    batch_size = point_cloud.get_shape()[0].value
    num_point = point_cloud.get_shape()[1].value
    end_points = {}
    with tf.variable_scope('transform_net1') as sc:
        transform = input_transform_net(point_cloud, is_training, bn_decay, K=3)
    point_cloud_transformed = tf.matmul(point_cloud, transform)
    # print(point_cloud_transformed.shape)
    input_image = tf.expand_dims(point_cloud_transformed, -1)
    # net = tf_util.conv2d(input_image, 64, [1,3],
    #                      padding='VALID', stride=[1,1],
    #                      bn=True, is_training=is_training,
    #                      scope='conv1', bn_decay=bn_decay)
    net = tf_util.conv2d(input_image, 64, [1, 3],
                         padding='VALID', stride=[1, 1],
                         bn=True, is_training=is_training,
                         scope='conv1', bn_decay=bn_decay)
    # print('conv1'+ '', net.shape)
    net = tf_util.conv2d(net, 64, [1, 1],
                         padding='VALID', stride=[1, 1],
                         bn=True, is_training=is_training,
                         scope='conv2', bn_decay=bn_decay)
    # print('conv2' + '', net.shape)
    with tf.variable_scope('transform_net2') as sc:
        transform = feature_transform_net(net, is_training, bn_decay, K=64)
    end_points['transform'] = transform
    net_squeeze = tf.squeeze(net, axis=[2])
    # print('net_squeeze shape ', net_squeeze.shape)
    net_transformed = tf.matmul(tf.squeeze(net, axis=[2]), transform)
    # print('net_transformed 1:', net_transformed.shape)
    net_transformed = tf.expand_dims(net_transformed, [2])
    # print('net_transformed 2:', net_transformed.shape)
    net = tf_util.conv2d(net_transformed, 64, [1, 1],
                         padding='VALID', stride=[1, 1],
                         bn=True, is_training=is_training,
                         scope='conv3', bn_decay=bn_decay)
    # print('conv3' + '', net.shape)
    net = tf_util.conv2d(net, 128, [1, 1],
                         padding='VALID', stride=[1, 1],
                         bn=True, is_training=is_training,
                         scope='conv4', bn_decay=bn_decay)
    # print('conv4' + '', net.shape)
    net = tf_util.conv2d(net, 1024, [1, 1],
                         padding='VALID', stride=[1, 1],
                         bn=True, is_training=is_training,
                         scope='conv5', bn_decay=bn_decay)
    # print('conv5' + '', net.shape)
    # Symmetric function: max pooling
    net = tf_util.max_pool2d(net, [num_point, 1],
                             padding='VALID', scope='maxpool')
    # print('max_pool' + '', net.shape)
    net = tf.reshape(net, [batch_size, -1])
    # print('reshape ' + '', net.shape)
    net = tf_util.fully_connected(net, 512, bn=True, is_training=is_training,
                                  scope='fc1', bn_decay=bn_decay)
    # print('fc1 ' + '', net.shape)
    net = tf_util.dropout(net, keep_prob=0.7, is_training=is_training,
                          scope='dp1')
    net = tf_util.fully_connected(net, 256, bn=True, is_training=is_training,
                                  scope='fc2', bn_decay=bn_decay)
    # print('fc2 ' + '', net.shape)
    net = tf_util.dropout(net, keep_prob=0.7, is_training=is_training,
                          scope='dp2')
    net = tf_util.fully_connected(net, 40, activation_fn=None, scope='fc3')
    # print('fc3 ' + '', net.shape)
    return net, end_points
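The [1, 3] and [1, 1] convolutions in `get_model` never mix information between points: each one amounts to the same small dense layer applied to every point independently, which is why the figure calls them a shared mlp. A rough NumPy sketch of that equivalence, under my own simplifications (batch norm is left out, ReLU stands in for the activation, and `shared_mlp_layer` is a made-up name):

```python
import numpy as np


def shared_mlp_layer(points, weight, bias):
    """points: [B, N, C_in], weight: [C_in, C_out], bias: [C_out] -> [B, N, C_out]."""
    # One matmul applies the same weights to all B * N points at once
    return np.maximum(points @ weight + bias, 0.0)


rng = np.random.default_rng(0)
points = rng.standard_normal((2, 6, 3))
weight = rng.standard_normal((3, 64))
bias = rng.standard_normal(64)

batched = shared_mlp_layer(points, weight, bias)

# The same computation written out one point at a time
per_point = np.stack([
    np.stack([np.maximum(p @ weight + bias, 0.0) for p in cloud])
    for cloud in points
])
print(batched.shape, np.allclose(batched, per_point))  # (2, 6, 64) True
```

Because no point sees any other point in these layers, the only place where information is shared across the cloud is the max pooling that follows.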
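You can see there are five convolutional layers in total. Their main job is to lift each point's raw 3-d feature (its x, y, z coordinates) to a 1024-d feature, which is the channel count of the last layer. How should we make sense of this? If we applied max pooling directly, it would be like keeping just a single point, and how could one point stand in for the features of all of them? So each point's features are first lifted to a higher dimension, until the single feature vector left after max pooling can summarize the features of all the points. Experiments show that this operation does lose information: the explanation above is a reasonable way to think about it, but in practice max pooling still discards a lot of information about the other points and about global structure. The results are already much better than VoxNet, but there is still room for improvement.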
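A tiny NumPy sketch of the max pooling step may help here (random toy features and a hypothetical `global_feature` helper, not the repo's code). It shows two things: the result does not depend on the order of the points, and different channels of the global feature can be supplied by different points.

```python
import numpy as np


def global_feature(point_feats):
    """point_feats: [N, C] -> [C], channel-wise max over the N points."""
    return point_feats.max(axis=0)


rng = np.random.default_rng(0)
feats = rng.standard_normal((8, 16))   # 8 points, 16 channels (toy sizes)
shuffled = feats[rng.permutation(8)]   # same points in a different order

same = np.array_equal(global_feature(feats), global_feature(shuffled))
winners = feats.argmax(axis=0)         # index of the point that won each channel
print(same)     # True: max pooling ignores point order
print(winners)  # one winning point index per channel
```

So the pooled vector is not really one point; it is stitched together channel by channel from whichever points respond most strongly, and everything that is not a channel-wise maximum is dropped.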
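Finally come the fully connected layers, three in total, with dropout to prevent overfitting, although in practice the network still overfit /(ㄒoㄒ)/~ Since we're using ModelNet40, the last fully connected layer has 40 outputs. This is all basic stuff; if any of it is unclear, I'd suggest reading my cs231n posts, a couple of days should be enough to get the general idea.
Let's look at how the loss is set up: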
def get_loss(pred, label, end_points, reg_weight=0.001):
    """ pred: B*NUM_CLASSES,
        label: B, """
    loss = tf.nn.sparse_softmax_cross_entropy_with_logits(logits=pred, labels=label)
    classify_loss = tf.reduce_mean(loss)
    tf.summary.scalar('classify loss', classify_loss)
    # Enforce the transformation as orthogonal matrix
    transform = end_points['transform']  # BxKxK
    K = transform.get_shape()[1].value
    # transform_trans = tf.transpose(transform, perm=[0, 2, 1])
    # print(transform_trans.shape, transform_trans)
    mat_diff = tf.matmul(transform, tf.transpose(transform, perm=[0, 2, 1]))
    # print('mat_diff: ', mat_diff)
    mat_diff -= tf.constant(np.eye(K), dtype=tf.float32)
    mat_diff_loss = tf.nn.l2_loss(mat_diff)
    tf.summary.scalar('mat loss', mat_diff_loss)
    return classify_loss + mat_diff_loss * reg_weight
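This part is also super easy to follow. For classification we normally just use a softmax loss, and that's what the code does. A regularization term is added here, partly to prevent overfitting and partly to push the 64x64 feature transform A toward the properties of a rotation matrix: it penalizes how far A·Aᵀ is from the identity, i.e. it pushes A toward being orthogonal. As for why the 3-d input transform doesn't get the same treatment, I still need to look into it; maybe it just didn't work very well??? (The paper's stated motivation is that the feature-space matrix has a much higher dimension than the spatial one and is therefore harder to optimize, which is why the regularizer is put on it.)
That's the classification network. Finally, the evaluation metrics: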
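The regularization term from `get_loss` can be reproduced in a few lines of NumPy. This is a sketch of the same arithmetic, not the repo's code; `orthogonality_loss` is a hypothetical name, and the factor 0.5 mirrors `tf.nn.l2_loss`, which computes sum(x ** 2) / 2.

```python
import numpy as np


def orthogonality_loss(transform):
    """transform: [B, K, K] -> scalar, sum((A @ A^T - I) ** 2) / 2 over the batch."""
    k = transform.shape[1]
    diff = transform @ np.transpose(transform, (0, 2, 1)) - np.eye(k)
    return 0.5 * np.sum(diff ** 2)


identity_batch = np.tile(np.eye(4), (2, 1, 1))  # two orthogonal 4x4 matrices
scaled_batch = 2.0 * identity_batch             # A @ A^T = 4I, so not orthogonal

print(orthogonality_loss(identity_batch))  # 0.0
print(orthogonality_loss(scaled_batch))    # 36.0
```

For the scaled batch, A·Aᵀ − I = 3I, so each 4x4 matrix contributes 4 × 9 = 36, and two matrices halved give 36. Note that the term is summed over the batch rather than averaged, so its magnitude grows with the batch size before `reg_weight` is applied.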
# Inside the per-batch evaluation loop:
pred_val = np.argmax(pred_val, 1)
correct = np.sum(pred_val == current_label[start_idx:end_idx])
total_correct += correct
total_seen += BATCH_SIZE
loss_sum += (loss_val * BATCH_SIZE)
for i in range(start_idx, end_idx):
    l = current_label[i]
    total_seen_class[l] += 1
    total_correct_class[l] += (pred_val[i - start_idx] == l)

# After all batches have been evaluated.
# np.float64 replaces the np.float alias, which newer NumPy releases removed.
log_string('eval mean loss: %f' % (loss_sum / float(total_seen)))
log_string('eval accuracy: %f' % (total_correct / float(total_seen)))
log_string('eval avg class acc: %f' % (
    np.mean(np.array(total_correct_class) / np.array(total_seen_class, dtype=np.float64))))
eval_mean_loss = loss_sum / float(total_seen)
eval_accuracy = total_correct / float(total_seen)
eval_avg_class_acc = np.mean(np.array(total_correct_class) / np.array(total_seen_class, dtype=np.float64))
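Simple enough: if the prediction is 1 and the label is also 1, the count goes up by one, and the total count divided by the number of samples is the accuracy we're after. One remaining issue: the paper reports 89.2%, but whatever I try I only get around 88.5% (⊙v⊙) Hmm, probably a local optimum.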
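The evaluation code logs two different numbers, overall accuracy and average class accuracy, and they can differ quite a bit when classes are imbalanced. A small NumPy sketch with made-up labels (`accuracy_metrics` is a hypothetical helper; unlike the original loop, it skips classes that never appear in the labels, which is my own assumption to avoid dividing by zero):

```python
import numpy as np


def accuracy_metrics(pred, label, num_classes):
    """Return (overall accuracy, mean per-class accuracy) for integer labels."""
    pred = np.asarray(pred)
    label = np.asarray(label)
    overall = float(np.mean(pred == label))
    seen = np.bincount(label, minlength=num_classes)
    correct = np.bincount(label[pred == label], minlength=num_classes)
    present = seen > 0  # assumption: ignore classes absent from the labels
    per_class = correct[present] / seen[present]
    return overall, float(per_class.mean())


label = [0, 0, 0, 0, 1, 1]  # toy data: class 0 is over-represented
pred = [0, 0, 0, 0, 1, 0]   # one class-1 sample is misclassified
overall, avg_class = accuracy_metrics(pred, label, num_classes=2)
print(overall, avg_class)   # 5/6 overall, 0.75 averaged over classes
```

Overall accuracy is 5/6 because five of six samples are right, while the class average is (1.0 + 0.5) / 2 = 0.75, since the one mistake costs the small class half of its samples.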