PointNet Reading Notes (CVPR 2017): 3. The Classification Network

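Here is the network architecture from the original PointNet paper; I'll go through it step by step.
[Figure: PointNet network architecture, from the original paper]
The figure shows two networks: a classification network and a semantic segmentation network. Let's start with classification.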

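The segmentation network simply repeats the global feature n times at the end and concatenates it onto the n x 64 per-point features... oops, I'm getting ahead of myself, I haven't even explained the classification network yet.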
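
The repeat-and-concatenate step mentioned above can be sketched in a few lines of NumPy. This is only an illustration under assumed shapes (64-dimensional per-point features and a 1024-dimensional global feature, as in the figure); `concat_global_feature` is a hypothetical helper, not a function from the PointNet code.

```python
import numpy as np

def concat_global_feature(point_feats, global_feat):
    """Hypothetical helper: attach one global feature to every point.

    point_feats: [n, 64] per-point features
    global_feat: [1024] feature left after max pooling
    returns:     [n, 64 + 1024]
    """
    n = point_feats.shape[0]
    # Repeat the global feature once per point: [1024] -> [n, 1024]
    tiled = np.tile(global_feat[np.newaxis, :], (n, 1))
    # Concatenate along the channel axis
    return np.concatenate([point_feats, tiled], axis=1)

rng = np.random.default_rng(0)
point_feats = rng.normal(size=(5, 64))    # stand-in per-point features
global_feat = rng.normal(size=1024)       # stand-in global feature
seg_input = concat_global_feature(point_feats, global_feat)
print(seg_input.shape)
```

Each point then carries both its own local feature and a copy of the global one, which is what lets the segmentation head label points individually.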

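OK, starting with the classification network. The input is n x 3, where n is the number of points in the input cloud, typically 128/256/512/1024/2048/4096. More points carry more information, so accuracy tends to go up with n. It feels a bit like playing 2048, haha O(∩_∩)O
The 3 is the (x, y, z) coordinate of each point.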

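Next comes the input transform, which embeds a T-Net. The T-Net learns a 3x3 matrix meant to make the network invariant to rotations of the input, so it plays the role of a rotation R. One thing still puzzles me, though: nothing in the loss constrains this matrix, so how is the 3x3 output guaranteed to satisfy the properties of a rotation matrix (⊙o⊙)?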

Here is the T-Net code:

def input_transform_net(point_cloud, is_training, bn_decay=None, K=3):
    """ Input (XYZ) Transform Net, input is BxNx3 gray image
        Return:
            Transformation matrix of size 3xK """
    batch_size = point_cloud.get_shape()[0].value
    num_point = point_cloud.get_shape()[1].value

    # Append a trailing dimension: [B, N, 3] -> [B, N, 3, 1]
    input_image = tf.expand_dims(point_cloud, -1)
    # print(input_image.shape)
    # First convolution: input_image is [B, N, 3, 1], 64 output channels,
    # kernel_size [1, 3], so each kernel covers one point's (x, y, z)
    net = tf_util.conv2d(input_image, 64, [1, 3],
                         padding='VALID', stride=[1, 1],
                         bn=True, is_training=is_training,
                         scope='tconv1', bn_decay=bn_decay)
    # print('tconv1' + '', net.shape)
    net = tf_util.conv2d(net, 128, [1, 1],
                         padding='VALID', stride=[1, 1],
                         bn=True, is_training=is_training,
                         scope='tconv2', bn_decay=bn_decay)
    # print('tconv2'+'',net.shape)
    net = tf_util.conv2d(net, 1024, [1, 1],
                         padding='VALID', stride=[1, 1],
                         bn=True, is_training=is_training,
                         scope='tconv3', bn_decay=bn_decay)
    # print('tconv3'+'',net.shape)
    net = tf_util.max_pool2d(net, [num_point, 1],
                             padding='VALID', scope='tmaxpool')
    # print('tmaxpool'+'',net.shape)

    net = tf.reshape(net, [batch_size, -1])
    # print(net.shape)
    net = tf_util.fully_connected(net, 512, bn=True, is_training=is_training,
                                  scope='tfc1', bn_decay=bn_decay)
    # print('tfc1'+'',net.shape)
    net = tf_util.fully_connected(net, 256, bn=True, is_training=is_training,
                                  scope='tfc2', bn_decay=bn_decay)
    # print('tfc2'+'',net.shape)

    with tf.variable_scope('transform_XYZ') as sc:
        assert (K == 3)
        weights = tf.get_variable('weights', [256, 3 * K],
                                  initializer=tf.constant_initializer(0.0),
                                  dtype=tf.float32)
        biases = tf.get_variable('biases', [3 * K],
                                 initializer=tf.constant_initializer(0.0),
                                 dtype=tf.float32)
        biases += tf.constant([1, 0, 0, 0, 1, 0, 0, 0, 1], dtype=tf.float32)
        transform = tf.matmul(net, weights)
        transform = tf.nn.bias_add(transform, biases)

    transform = tf.reshape(transform, [batch_size, 3, K])
    # print(transform.shape)
    return transform

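So the T-Net is just a small network of its own: three convolution layers and two fully connected layers. The resulting [batch_size, 256] tensor is multiplied by a [256, 3x3] weight matrix to give [batch_size, 9], the biases are added, and a reshape yields a [batch_size, 3, 3] tensor. At first I couldn't work out how the T-Net gets trained. It makes sense once you notice that its output is multiplied directly onto the input data: the T-Net is simply a sub-network embedded in the main one, so optimizing the loss optimizes the T-Net at the same time. The feature transform works the same way, except that it rotates the 64-dimensional feature space, so the learned tensor has shape [batch_size, 64, 64]. Here is the feature transform code; it is very similar to the input transform.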

def feature_transform_net(inputs, is_training, bn_decay=None, K=64):
    """ Feature Transform Net, input is BxNx1xK
        Return:
            Transformation matrix of size KxK 
    """
    batch_size = inputs.get_shape()[0].value
    num_point = inputs.get_shape()[1].value

    net = tf_util.conv2d(inputs, 64, [1, 1],
                         padding='VALID', stride=[1, 1],
                         bn=True, is_training=is_training,
                         scope='tconv1', bn_decay=bn_decay)

    # print('tconv1' + '', net.shape)
    net = tf_util.conv2d(net, 128, [1, 1],
                         padding='VALID', stride=[1, 1],
                         bn=True, is_training=is_training,
                         scope='tconv2', bn_decay=bn_decay)

    # print('tconv2' + '', net.shape)
    net = tf_util.conv2d(net, 1024, [1, 1],
                         padding='VALID', stride=[1, 1],
                         bn=True, is_training=is_training,
                         scope='tconv3', bn_decay=bn_decay)

    # print('tconv3' + '', net.shape)
    net = tf_util.max_pool2d(net, [num_point, 1],
                             padding='VALID', scope='tmaxpool')

    # print('tmaxpool' + '', net.shape)

    net = tf.reshape(net, [batch_size, -1])

    # print('reshape', net.shape)
    net = tf_util.fully_connected(net, 512, bn=True, is_training=is_training,
                                  scope='tfc1', bn_decay=bn_decay)
    # print('tfc1' + '', net.shape)

    net = tf_util.fully_connected(net, 256, bn=True, is_training=is_training,
                                  scope='tfc2', bn_decay=bn_decay)
    # print('tfc2' + '', net.shape)

    with tf.variable_scope('transform_feat') as sc:
        weights = tf.get_variable('weights', [256, K * K],
                                  initializer=tf.constant_initializer(0.0),
                                  dtype=tf.float32)
        biases = tf.get_variable('biases', [K * K],
                                 initializer=tf.constant_initializer(0.0),
                                 dtype=tf.float32)
        biases += tf.constant(np.eye(K).flatten(), dtype=tf.float32)
        transform = tf.matmul(net, weights)
        transform = tf.nn.bias_add(transform, biases)

    # print('transform'+ '', transform.shape)
    transform = tf.reshape(transform, [batch_size, K, K])
    # print('transform_reshape' + '', transform.shape)
    return transform
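
One detail of both transform nets is easy to miss: the last layer's weights are initialized to zero and the identity matrix is added to the bias, so the predicted transform starts out as the identity. A minimal NumPy sketch of just that last step, with random stand-ins for the input cloud and the `tfc2` output (the variable names are illustrative, not taken from the original code):

```python
import numpy as np

B, N, K = 2, 4, 3
rng = np.random.default_rng(0)
points = rng.normal(size=(B, N, K))   # stand-in for the input point cloud
net = rng.normal(size=(B, 256))       # stand-in for the tfc2 output

# Same initialization as in the listings: zero weights, zero bias + identity
weights = np.zeros((256, K * K))
biases = np.zeros(K * K) + np.eye(K).flatten()

# [B, 256] x [256, K*K] -> [B, K*K] -> [B, K, K]
transform = (net @ weights + biases).reshape(B, K, K)

# Batched matmul, as in get_model: [B, N, K] x [B, K, K] -> [B, N, K]
transformed = points @ transform

print(np.allclose(transform, np.eye(K)))   # identity for every batch item
print(np.allclose(transformed, points))    # the cloud passes through unchanged
```

At the start of training the main network therefore sees the untransformed data, and the T-Net only moves away from the identity as gradients from the loss flow back through the matrix multiplication.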

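The multilayer perceptron (the "mlp" blocks in the figure) is implemented with what we normally think of as convolution layers; the kernels here are [1, 3] or [1, 1], so the same weights are applied to every point. The code speaks for itself: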


def get_model(point_cloud, is_training, bn_decay=None):
    """ Classification PointNet, input is BxNx3, output Bx40 """
    batch_size = point_cloud.get_shape()[0].value
    num_point = point_cloud.get_shape()[1].value
    end_points = {}

    with tf.variable_scope('transform_net1') as sc:
        transform = input_transform_net(point_cloud, is_training, bn_decay, K=3)
    point_cloud_transformed = tf.matmul(point_cloud, transform)
    # print(point_cloud_transformed.shape)
    input_image = tf.expand_dims(point_cloud_transformed, -1)

    # net = tf_util.conv2d(input_image, 64, [1,3],
    #                      padding='VALID', stride=[1,1],
    #                      bn=True, is_training=is_training,
    #                      scope='conv1', bn_decay=bn_decay)

    net = tf_util.conv2d(input_image, 64, [1, 3],
                         padding='VALID', stride=[1, 1],
                         bn=True, is_training=is_training,
                         scope='conv1', bn_decay=bn_decay)

    # print('conv1'+ '', net.shape)
    net = tf_util.conv2d(net, 64, [1, 1],
                         padding='VALID', stride=[1, 1],
                         bn=True, is_training=is_training,
                         scope='conv2', bn_decay=bn_decay)

    # print('conv2' + '', net.shape)

    with tf.variable_scope('transform_net2') as sc:
        transform = feature_transform_net(net, is_training, bn_decay, K=64)
    end_points['transform'] = transform

    net_squeeze = tf.squeeze(net, axis=[2])
    # print('net_squeeze shape ', net_squeeze.shape)
    net_transformed = tf.matmul(tf.squeeze(net, axis=[2]), transform)
    # print('net_transformed 1:', net_transformed.shape)
    net_transformed = tf.expand_dims(net_transformed, [2])
    # print('net_transformed 2:', net_transformed.shape)

    net = tf_util.conv2d(net_transformed, 64, [1, 1],
                         padding='VALID', stride=[1, 1],
                         bn=True, is_training=is_training,
                         scope='conv3', bn_decay=bn_decay)

    # print('conv3' + '', net.shape)
    net = tf_util.conv2d(net, 128, [1, 1],
                         padding='VALID', stride=[1, 1],
                         bn=True, is_training=is_training,
                         scope='conv4', bn_decay=bn_decay)

    # print('conv4' + '', net.shape)
    net = tf_util.conv2d(net, 1024, [1, 1],
                         padding='VALID', stride=[1, 1],
                         bn=True, is_training=is_training,
                         scope='conv5', bn_decay=bn_decay)

    # print('conv5' + '', net.shape)

    # Symmetric function: max pooling
    net = tf_util.max_pool2d(net, [num_point, 1],
                             padding='VALID', scope='maxpool')

    # print('max_pool' + '', net.shape)

    net = tf.reshape(net, [batch_size, -1])

    # print('reshape ' + '', net.shape)
    net = tf_util.fully_connected(net, 512, bn=True, is_training=is_training,
                                  scope='fc1', bn_decay=bn_decay)

    # print('fc1 ' + '', net.shape)
    net = tf_util.dropout(net, keep_prob=0.7, is_training=is_training,
                          scope='dp1')
    net = tf_util.fully_connected(net, 256, bn=True, is_training=is_training,
                                  scope='fc2', bn_decay=bn_decay)

    # print('fc2 ' + '', net.shape)
    net = tf_util.dropout(net, keep_prob=0.7, is_training=is_training,
                          scope='dp2')
    net = tf_util.fully_connected(net, 40, activation_fn=None, scope='fc3')

    # print('fc3 ' + '', net.shape)

    return net, end_points
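
To see why these convolutions act as a shared per-point MLP, the sketch below compares a naive [1, 3] VALID convolution over a [B, N, 3, 1] input with a plain matrix product applied to each point. It is a simplified illustration: bias, batch norm and ReLU are left out, and `conv_1x3_valid` is a hypothetical helper written for this comparison, not part of `tf_util`.

```python
import numpy as np

def conv_1x3_valid(input_image, kernel):
    """Naive VALID convolution (hypothetical helper).

    input_image: [B, N, 3, 1]
    kernel:      [1, 3, 1, C_out]
    returns:     [B, N, 1, C_out]
    """
    B, N, _, _ = input_image.shape
    c_out = kernel.shape[-1]
    out = np.zeros((B, N, 1, c_out))
    for b in range(B):
        for i in range(N):
            # The kernel fits in exactly one position per point: its xyz row
            patch = input_image[b, i:i + 1, :, :]          # [1, 3, 1]
            out[b, i, 0, :] = np.tensordot(
                patch, kernel, axes=([0, 1, 2], [0, 1, 2]))
    return out

rng = np.random.default_rng(0)
points = rng.normal(size=(2, 5, 3))          # [B, N, 3]
kernel = rng.normal(size=(1, 3, 1, 64))      # [1, 3, 1, 64]

conv_out = conv_1x3_valid(points[..., np.newaxis], kernel)
# The same weights used as a dense layer 3 -> 64, applied to every point
dense_out = points @ kernel[0, :, 0, :]      # [B, N, 64]

print(np.allclose(conv_out[:, :, 0, :], dense_out))
```

Because no kernel ever spans more than one point, the points do not interact until the max pooling step.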

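There are five convolution layers in total. Their main job is to lift each point's raw 3-dimensional feature, its xyz coordinates, to 1024 dimensions, i.e. the channel count of the last layer. Why? If we max-pooled the raw coordinates directly, we would in effect keep a single point, and one point cannot stand in for the whole cloud. So each point's features are first lifted to a higher dimension, until the single feature vector left after max pooling can summarize all the points. That is the intuition, but experiments show the operation is lossy: it still discards a lot of information about the other points and the global structure. The results are already much better than VoxNet, but there is room for improvement.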
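
A small NumPy sketch may make the pooling step more concrete. It uses random stand-in features (8 points, 16 channels instead of 1024) and shows two things: the pooled vector does not depend on the order of the points, and different channels can take their maximum from different points, so at most as many points as there are channels contribute to the result.

```python
import numpy as np

rng = np.random.default_rng(0)
n_points, n_channels = 8, 16
point_feats = rng.normal(size=(n_points, n_channels))   # stand-in features

# Max pooling over the point axis, one maximum per channel
global_feat = point_feats.max(axis=0)                   # [n_channels]

# Shuffling the points leaves the pooled feature unchanged
shuffled = point_feats[rng.permutation(n_points)]
global_feat_shuffled = shuffled.max(axis=0)
print(np.array_equal(global_feat, global_feat_shuffled))

# Which point "wins" each channel, and how many distinct points win at all
winners = point_feats.argmax(axis=0)                    # [n_channels]
num_contributing = np.unique(winners).size
print(num_contributing, "of", n_points, "points contribute")
```

This is one way to read the lifting to 1024 dimensions: more channels leave room for more points to show up in the global feature, while any point that wins no channel is dropped entirely.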

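Finally come three fully connected layers, with dropout to fight overfitting, although in practice the network still overfits /(ㄒoㄒ)/~ Since the dataset is ModelNet40, the last fully connected layer has 40 outputs. This is all basic material; if any of it is unclear, I suggest reading my cs231 notes, which should give you a rough picture in about two days.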

Now let's look at how the loss is set up:


def get_loss(pred, label, end_points, reg_weight=0.001):
    """ pred: B*NUM_CLASSES,
        label: B, """
    loss = tf.nn.sparse_softmax_cross_entropy_with_logits(logits=pred, labels=label)
    classify_loss = tf.reduce_mean(loss)
    tf.summary.scalar('classify loss', classify_loss)

    # Enforce the transformation as orthogonal matrix
    transform = end_points['transform']  # BxKxK
    K = transform.get_shape()[1].value
    # transform_trans = tf.transpose(transform, perm=[0, 2, 1])
    # print(transform_trans.shape, transform_trans)
    mat_diff = tf.matmul(transform, tf.transpose(transform, perm=[0, 2, 1]))
    # print('mat_diff: ', mat_diff)
    mat_diff -= tf.constant(np.eye(K), dtype=tf.float32)
    mat_diff_loss = tf.nn.l2_loss(mat_diff)
    tf.summary.scalar('mat loss', mat_diff_loss)

    return classify_loss + mat_diff_loss * reg_weight

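This part is also easy to follow. For classification we normally just use the softmax loss, and that is what the code does. A regularization term is added on top, partly to curb overfitting and partly to push the 64x64 feature transform toward the properties of a rotation matrix (strictly speaking, the code penalizes deviation from orthogonality). As for why the 3D input transform gets no such term, I still need to check; maybe it didn't help much??? (The reason given in the paper is that the feature-space matrix has a much higher dimension than the spatial one, which makes it harder to optimize.)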
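
The regularization term can be checked by hand with NumPy. The sketch below mirrors the TensorFlow code above, using the fact that `tf.nn.l2_loss(t)` computes `sum(t ** 2) / 2`; the function name `orthogonality_loss` and the two test matrices are illustrative choices, not part of the original code.

```python
import numpy as np

def orthogonality_loss(transform):
    """NumPy version of the regularizer: l2_loss(A A^T - I).

    transform: [B, K, K]
    """
    K = transform.shape[1]
    diff = transform @ np.transpose(transform, (0, 2, 1)) - np.eye(K)
    # tf.nn.l2_loss sums over every element (batch included) and halves
    return np.sum(diff ** 2) / 2.0

theta = np.pi / 6
rotation = np.array([[np.cos(theta), -np.sin(theta), 0.0],
                     [np.sin(theta),  np.cos(theta), 0.0],
                     [0.0,            0.0,           1.0]])
scaled = 2.0 * np.eye(3)   # A A^T = 4I, so the difference is 3I

loss_rotation = orthogonality_loss(rotation[np.newaxis])
loss_scaled = orthogonality_loss(scaled[np.newaxis])
print(loss_rotation)   # zero up to floating-point error
print(loss_scaled)     # 13.5 = (3 * 3**2) / 2
```

Note that the term is zero for any orthogonal matrix, reflections included, so it encourages orthogonality rather than strictly a rotation; with `reg_weight=0.001` it is also a soft penalty, not a hard constraint.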

That covers the classification network. Finally, the evaluation metrics:

            pred_val = np.argmax(pred_val, 1)
            correct = np.sum(pred_val == current_label[start_idx:end_idx])
            total_correct += correct
            total_seen += BATCH_SIZE
            loss_sum += (loss_val * BATCH_SIZE)
            for i in range(start_idx, end_idx):
                l = current_label[i]
                total_seen_class[l] += 1
                total_correct_class[l] += (pred_val[i - start_idx] == l)

    log_string('eval mean loss: %f' % (loss_sum / float(total_seen)))
    log_string('eval accuracy: %f' % (total_correct / float(total_seen)))
    # np.float was removed in NumPy 1.24; the builtin float is equivalent here
    log_string('eval avg class acc: %f' % (
        np.mean(np.array(total_correct_class) / np.array(total_seen_class, dtype=float))))

    eval_mean_loss = loss_sum / float(total_seen)
    eval_accuracy = total_correct / float(total_seen)
    eval_avg_class_acc = np.mean(np.array(total_correct_class) / np.array(total_seen_class, dtype=float))

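Simple enough: whenever the prediction matches the label (say both are 1), the correct count goes up by one, and dividing the correct count by the total gives the accuracy we want. One remaining issue: the paper reports 89.2%, but no matter what I tried I only got about 88.5% (⊙v⊙). Maybe a local optimum.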
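
The evaluation code logs two different numbers, overall accuracy and average per-class accuracy, and they can differ noticeably when classes are unbalanced. A minimal NumPy sketch on made-up labels; `accuracy_metrics` is a hypothetical helper, and unlike the loop above it skips classes that never appear, which is an assumption added here to avoid dividing by zero.

```python
import numpy as np

def accuracy_metrics(pred, label, num_classes):
    """Return (overall accuracy, mean per-class accuracy)."""
    pred = np.asarray(pred)
    label = np.asarray(label)
    overall = float(np.mean(pred == label))

    # Samples seen per class, and correctly predicted samples per class
    seen = np.bincount(label, minlength=num_classes)
    correct = np.bincount(label[pred == label], minlength=num_classes)

    present = seen > 0   # ignore classes with no samples
    mean_class = float(np.mean(correct[present] / seen[present]))
    return overall, mean_class

# Made-up example: class 0 has four samples, class 1 has two
label = np.array([0, 0, 0, 0, 1, 1])
pred = np.array([0, 0, 0, 0, 1, 0])
overall, mean_class = accuracy_metrics(pred, label, num_classes=2)
print(overall)      # 5 of 6 correct
print(mean_class)   # (4/4 + 1/2) / 2 = 0.75
```

The overall figure is the one to compare with the 89.2% quoted from the paper; the class-averaged figure weights rare classes as heavily as common ones.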
