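Profiling TensorFlow

Dataset optimization: https://www.tensorflow.org/performance/datasets_performance

This article is translated from a blog post by Illarion Khlestov: original link [1]
TensorFlow is currently one of the most widely used machine learning libraries. Profiling a TensorFlow graph is sometimes very useful, because it shows which operations take the most time. This can be done with the tensorflow timeline module. Unfortunately, I could not find any clear tutorial on how to use it, so in this post I will try to fill that gap, covering the following topics: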

How to profile TensorFlow code
How to merge timelines from multiple session runs
What problems may come up during profiling and how to solve them

A simple example

First, let's define a simple example, taken from a StackOverflow answer [2]:

import tensorflow as tf
from tensorflow.python.client import timeline

a = tf.random_normal([2000, 5000])
b = tf.random_normal([5000, 1000])
res = tf.matmul(a, b)

with tf.Session() as sess:
    # add additional options to trace the session execution
    options = tf.RunOptions(trace_level=tf.RunOptions.FULL_TRACE)
    run_metadata = tf.RunMetadata()
    sess.run(res, options=options, run_metadata=run_metadata)

    # Create the Timeline object, and write it to a json file
    fetched_timeline = timeline.Timeline(run_metadata.step_stats)
    chrome_trace = fetched_timeline.generate_chrome_trace_format()
    with open('timeline_01.json', 'w') as f:
        f.write(chrome_trace)

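Note the extra options and run_metadata passed to the session run. The script should work on both CPU and GPU. After it runs we get timeline_01.json, which holds the profiling data in Chrome trace format. If the script fails, try the first solution in the Issues during profiling section.

To view the stored data we should use the Chrome browser (unfortunately, as far as I know, only Chrome supports its own tracing format). Go to the chrome://tracing page. In the top-left corner you will see a Load button. Press it and load our JSON file.

At the top you will see the time axis in ms. To get more precise information about an operation, just click on it. On the right there are also a few simple tools: selection, pan, zoom and timing.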
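
When Chrome is not at hand, the same JSON file can be summarized with a few lines of Python. The sketch below is only an illustration: it assumes that timed operations appear as complete events (ph equal to "X") carrying a dur field in microseconds, as described by the Chrome trace event format, and it runs on a hand-written sample rather than on real profiler output. The function name total_duration_by_name is made up for this example.

```python
import json
from collections import defaultdict


def total_duration_by_name(trace_dict):
    """Sum the 'dur' field (microseconds) of complete events, keyed by event name."""
    totals = defaultdict(int)
    for event in trace_dict.get("traceEvents", []):
        # Assumption: timed operations are complete events, ph == "X", with "dur".
        if event.get("ph") == "X" and "dur" in event:
            totals[event["name"]] += event["dur"]
    # Longest total duration first.
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


# Hand-written stand-in for the contents of timeline_01.json (not real output).
sample_trace = json.loads("""
{"traceEvents": [
  {"ph": "M", "name": "process_name", "pid": 0, "args": {"name": "/cpu:0"}},
  {"ph": "X", "name": "RandomStandardNormal", "pid": 0, "tid": 0, "ts": 0, "dur": 300},
  {"ph": "X", "name": "RandomStandardNormal", "pid": 0, "tid": 1, "ts": 0, "dur": 200},
  {"ph": "X", "name": "MatMul", "pid": 0, "tid": 0, "ts": 300, "dur": 900}
]}
""")

ranking = total_duration_by_name(sample_trace)
for name, dur_us in ranking:
    print(f"{name}: {dur_us} us")
```

To use it on a real file, replace the sample with json.load on the opened timeline_01.json.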

A more complex example

Now let's define a more complex example with some placeholders and an optimizer:

import os
import tempfile

import tensorflow as tf
from tensorflow.contrib.layers import fully_connected as fc
from tensorflow.examples.tutorials.mnist import input_data
from tensorflow.python.client import timeline

batch_size = 100

inputs = tf.placeholder(tf.float32, [batch_size, 784])
targets = tf.placeholder(tf.float32, [batch_size, 10])

with tf.variable_scope("layer_1"):
    fc_1_out = fc(inputs, num_outputs=500, activation_fn=tf.nn.sigmoid)
with tf.variable_scope("layer_2"):
    fc_2_out = fc(fc_1_out, num_outputs=784, activation_fn=tf.nn.sigmoid)
with tf.variable_scope("layer_3"):
    logits = fc(fc_2_out, num_outputs=10)

loss = tf.reduce_mean(
    tf.nn.softmax_cross_entropy_with_logits(logits=logits, labels=targets))
train_op = tf.train.GradientDescentOptimizer(0.01).minimize(loss)

if __name__ == '__main__':
    mnist_save_dir = os.path.join(tempfile.gettempdir(), 'MNIST_data')
    mnist = input_data.read_data_sets(mnist_save_dir, one_hot=True)

    config = tf.ConfigProto()
    config.gpu_options.allow_growth = True
    with tf.Session(config=config) as sess:
        sess.run(tf.global_variables_initializer())

        options = tf.RunOptions(trace_level=tf.RunOptions.FULL_TRACE)
        run_metadata = tf.RunMetadata()
        for i in range(3):
            batch_input, batch_target = mnist.train.next_batch(batch_size)
            feed_dict = {inputs: batch_input,
                         targets: batch_target}

            sess.run(train_op,
                     feed_dict=feed_dict,
                     options=options,
                     run_metadata=run_metadata)

            fetched_timeline = timeline.Timeline(run_metadata.step_stats)
            chrome_trace = fetched_timeline.generate_chrome_trace_format()
            with open('timeline_02_step_%d.json' % i, 'w') as f:
                f.write(chrome_trace)

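Our operations now live under variable scopes. With this approach, operation names start with the scope name and are clearly distinguishable on the timeline.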
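
Because node names start with the scope name, the time spent per scope can also be added up offline. The following is a rough sketch under one assumption that should be checked against your own trace file: that each timed event keeps the full node name (for example layer_1/fully_connected/MatMul) under args["name"]. The helper duration_by_scope and the sample events are hypothetical.

```python
from collections import defaultdict


def duration_by_scope(trace_events):
    """Sum event durations (microseconds) grouped by the top-level scope."""
    totals = defaultdict(int)
    for event in trace_events:
        if event.get("ph") != "X":
            continue  # skip metadata and flow events
        # Assumption: the full node name is stored under args["name"].
        node_name = event.get("args", {}).get("name", "")
        if "/" in node_name:
            scope = node_name.split("/", 1)[0]
        else:
            scope = "(no scope)"
        totals[scope] += event.get("dur", 0)
    return dict(totals)


# Hand-written events that mimic the layout assumed above.
sample_events = [
    {"ph": "X", "name": "MatMul", "ts": 0, "dur": 40,
     "args": {"name": "layer_1/fully_connected/MatMul"}},
    {"ph": "X", "name": "Sigmoid", "ts": 40, "dur": 10,
     "args": {"name": "layer_1/fully_connected/Sigmoid"}},
    {"ph": "X", "name": "MatMul", "ts": 50, "dur": 60,
     "args": {"name": "layer_2/fully_connected/MatMul"}},
    {"ph": "X", "name": "Mean", "ts": 110, "dur": 5,
     "args": {"name": "Mean"}},
]

scope_totals = duration_by_scope(sample_events)
print(scope_totals)
```
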
The code also stores traces for three runs. If we execute the script on CPU, we get three fairly similar timelines, like the ones shown below:

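However, if we look at the GPU profiling results, the first run differs from the following ones:

You may notice that the first run on GPU takes longer than the later ones. This is because on the first run tensorflow performs some GPU initialization routines, which are optimized away afterwards. For more accurate results, run the graph several times before profiling.
In addition, all incoming/outgoing flows start with the variable scope name, so we know exactly where in the source code each operation is.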
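
The advice about the slow first GPU run can be expressed as a small driver loop that switches tracing on only after a few warm-up steps. This is a library-independent sketch: run_with_warmup, step_fn and fake_step are hypothetical names, and the placeholder step only records its arguments.

```python
def run_with_warmup(step_fn, num_steps, warmup_steps):
    """Call step_fn(step, trace) num_steps times; trace is True only after warm-up.

    Returns the list of step indices that were traced.
    """
    traced_steps = []
    for step in range(num_steps):
        trace = step >= warmup_steps
        step_fn(step, trace)
        if trace:
            traced_steps.append(step)
    return traced_steps


calls = []


def fake_step(step, trace):
    # Placeholder for a real training step. With trace=True, a real version
    # would pass options= and run_metadata= to sess.run and save the timeline;
    # with trace=False it would call sess.run without them.
    calls.append((step, trace))


traced = run_with_warmup(fake_step, num_steps=5, warmup_steps=3)
print(traced)
```

How many warm-up steps are enough depends on the model and the hardware, so the value 3 here is arbitrary.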

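Storing timelines of multiple runs in one file

What if we want to store several session runs in a single file? Unfortunately, this can only be done manually. The Chrome trace format stores a definition of each event along with its run time. On the first iteration we store all the data, but on subsequent runs we only add the run times, not the definitions themselves. Here is just the class that merges the events; the full example can be found here [3]: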

import json
class TimeLiner:
    _timeline_dict = None

    def update_timeline(self, chrome_trace):
        # convert chrome trace to python dict
        chrome_trace_dict = json.loads(chrome_trace)
        # for first run store full trace
        if self._timeline_dict is None:
            self._timeline_dict = chrome_trace_dict
        # for other - update only time consumption, not definitions
        else:
            for event in chrome_trace_dict['traceEvents']:
                # timed events carry a 'ts' (timestamp) key
                if 'ts' in event:
                    self._timeline_dict['traceEvents'].append(event)

    def save(self, f_name):
        with open(f_name, 'w') as f:
            json.dump(self._timeline_dict, f)
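
To see the effect of this merge rule on a small input, here is the same logic written as a standalone function and applied to two hand-written traces. The traces are made up for illustration and are much smaller than real profiler output; merge_chrome_traces is a hypothetical name.

```python
import json


def merge_chrome_traces(chrome_traces):
    """Merge JSON trace strings: keep the first in full, then add only timed events."""
    merged = None
    for chrome_trace in chrome_traces:
        trace_dict = json.loads(chrome_trace)
        if merged is None:
            # The first run keeps everything, including the definitions.
            merged = trace_dict
        else:
            # Later runs contribute only events that carry a 'ts' key.
            merged["traceEvents"].extend(
                event for event in trace_dict["traceEvents"] if "ts" in event)
    return merged


# Two tiny made-up traces: one definition event and one timed event each.
run_1 = json.dumps({"traceEvents": [
    {"ph": "M", "name": "process_name", "pid": 0, "args": {"name": "/gpu:0"}},
    {"ph": "X", "name": "MatMul", "pid": 0, "tid": 0, "ts": 0, "dur": 50},
]})
run_2 = json.dumps({"traceEvents": [
    {"ph": "M", "name": "process_name", "pid": 0, "args": {"name": "/gpu:0"}},
    {"ph": "X", "name": "MatMul", "pid": 0, "tid": 0, "ts": 1000, "dur": 20},
]})

merged = merge_chrome_traces([run_1, run_2])
print(len(merged["traceEvents"]))
```

With the TimeLiner class, the equivalent in the training loop from the previous section is to call update_timeline(chrome_trace) on every iteration instead of writing a file per step, and to call save once after the loop.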

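We get a really nice merged timeline:

Initialization still seems to take a lot of time, so let's zoom in on the right:

Now we can see some repeated patterns. There are no specific separators between runs, but we can tell them apart visually.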
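
Since the merged file carries no run separators, one way to recover the runs programmatically is to look for idle gaps between events. The sketch below is a heuristic, not part of the original code: it assumes that timestamps of different runs do not overlap and that the pause between runs is longer than any pause inside a run. The function name and the threshold value are arbitrary.

```python
def split_runs_by_gap(trace_events, min_gap_us):
    """Group timed events into runs, starting a new run after a long idle gap."""
    timed = sorted((e for e in trace_events if "ts" in e), key=lambda e: e["ts"])
    runs = []
    current = []
    last_end = float("-inf")  # latest end time seen so far
    for event in timed:
        start = event["ts"]
        if current and start - last_end >= min_gap_us:
            runs.append(current)
            current = []
        current.append(event)
        last_end = max(last_end, start + event.get("dur", 0))
    if current:
        runs.append(current)
    return runs


# Made-up events: three bursts of activity separated by long pauses.
events = [
    {"name": "MatMul", "ts": 0, "dur": 50},
    {"name": "Sigmoid", "ts": 10, "dur": 20},
    {"name": "MatMul", "ts": 1000, "dur": 30},
    {"name": "Sigmoid", "ts": 1040, "dur": 5},
    {"name": "MatMul", "ts": 2500, "dur": 10},
]

runs = split_runs_by_gap(events, min_gap_us=500)
print([len(run) for run in runs])
```
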

Issues during profiling

Some trouble may come up during profiling. First, it may not work at all. If you run into an error like this:

I tensorflow/stream_executor/dso_loader.cc:126] Couldn't open CUDA library libcupti.so.8.0. LD_LIBRARY_PATH:

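and you have made sure that everything works fine without profiling, then, following this GitHub issue [4], install the extra library libcupti-dev. The following command should solve the problem: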

sudo apt-get install libcupti-dev
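
Before or after installing the package, a quick way to check whether a CUPTI library is visible at all is the standard library's ctypes.util.find_library. Treat the result as a hint only: its search rules are not the same as those of the loader TensorFlow uses, so a library installed in a non-standard CUDA directory may be missed.

```python
import ctypes.util


def find_cupti():
    """Return the library name found for 'cupti', or None if nothing was found."""
    return ctypes.util.find_library("cupti")


cupti = find_cupti()
if cupti is None:
    print("libcupti was not found by find_library; check LD_LIBRARY_PATH")
else:
    print("found:", cupti)
```
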

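The second issue is delays between runs. In the last image we saw gaps between runs. For large networks this can take a very long time. This problem cannot be fully solved, but using a custom C++ protobuf library can reduce the delay. The tensorflow documentation clearly describes how to install it.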
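
To check which protobuf implementation the Python package is currently using, one commonly used approach is to query api_implementation.Type(). Note that this lives in an internal module of the protobuf package, so it is an assumption that it stays available in your installed version; the wrapper below simply returns None when protobuf cannot be imported.

```python
def protobuf_backend():
    """Return the active protobuf implementation name, or None if unavailable."""
    try:
        # Internal module of the protobuf package; may change between versions.
        from google.protobuf.internal import api_implementation
    except ImportError:
        return None
    return api_implementation.Type()


backend = protobuf_backend()
print("protobuf implementation:", backend)
```

A value of "python" indicates the pure-Python implementation, which is the slow case that the C++ library is meant to replace.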

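Conclusion

I hope that with this kind of analysis you will gain a deeper understanding of what happens inside the TensorFlow framework and which parts of your computation graph could be optimized. All code examples, together with timelines already generated on CPU and GPU, are stored in this repo [5].
Thanks for reading!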

1. https://medium.com/towards-data-science/howto-profile-tensorflow-1a49fb18073d#.93geshkgu

2. http://stackoverflow.com/questions/34293714/can-i-measure-the-execution-time-of-individual-operations-with-tensorflow/

3. https://github.com/ikhlestov/tensorflow_profiling/blob/master/03_merged_timeline_example.py

4. https://github.com/tensorflow/tensorflow/issues/5282

5. https://github.com/ikhlestov/tensorflow_profiling