TF Acceleration - AOT: Stepping Through tfcompile Pitfalls

  1. This post was written a while ago; it is a collection of miscellaneous notes (miscs).

From the Google Groups forum:

https://groups.google.com/a/tensorflow.org/forum/#!msg/discuss/9LwjZC-yrYs/H7gGieXpBgAJ

It is about building tfcompile_test:

bazel build tensorflow/compiler/aot/tests:tfcompile_test

This command is fine as written. The build takes quite a long time. If it errors out, there is no need to dig further: the bug is in the tfcompile_test module itself.
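If the build does go through, the test can presumably also be run directly with bazel test (a guess based on the target name, not something I verified here):

bazel test tensorflow/compiler/aot/tests:tfcompile_test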


A Japanese site:

https://qiita.com/qiita_kuru/items/71660124b807c00ace31

It covers the use of tfcompile fairly comprehensively, but I only tried one bazel build from it, and it failed with an error anyway:

bazel build --config=opt --config=//tensorflow/compiler/aot:tfcompile

I do not know how to debug this; it may also be that my configuration or version is wrong.
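In hindsight, the failure is most likely just that the build target is being passed as the value of a second --config flag; bazel's --config expects a named configuration from .bazelrc, and the target label should be a separate positional argument. Something like the following (not re-tested here, but it matches the command that does succeed further down) should be closer to what was intended:

bazel build --config=opt //tensorflow/compiler/aot:tfcompile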

Appendix: tfcompile's options:

 $ tfcompile --graph=mygraph.pb --config=myfile.pbtxt --cpp_class="mynamespace::MyComputation"

usage: ./tfcompile
Flags:
        --graph=""                              string  Input GraphDef file.  If the file ends in '.pbtxt' it is expected to be in the human-readable proto text format, otherwise it is expected to be in the proto binary format.
        --config=""                             string  Input file containing Config proto.  If the file ends in '.pbtxt' it is expected to be in the human-readable proto text format, otherwise it is expected to be in the proto binary format.
        --dump_fetch_nodes=false                bool    If set, only flags related to fetches are processed, and the resulting fetch nodes will be dumped to stdout in a comma-separated list.  Typically used to format arguments for other tools, e.g. freeze_graph.
        --target_triple="x86_64-pc-linux"       string  Target platform, similar to the clang -target flag.  The general format is <arch><sub>-<vendor>-<sys>-<abi>.  http://clang.llvm.org/docs/CrossCompilation.html#target-triple.
        --target_cpu=""                         string  Target cpu, similar to the clang -mcpu flag.  http://clang.llvm.org/docs/CrossCompilation.html#cpu-fpu-abi
        --target_features=""                    string  Target features, e.g. +avx2, +neon, etc.
        --entry_point="entry"                   string  Name of the generated function.  If multiple generated object files will be linked into the same binary, each will need a unique entry point.
        --cpp_class=""                          string  Name of the generated C++ class, wrapping the generated function.  The syntax of this flag is [[<optional_namespace>::],...]<class_name>.  This mirrors the C++ syntax for referring to a class, where multiple namespaces may precede the class name, separated by double-colons.  The class will be generated in the given namespace(s), or if no namespaces are given, within the global namespace.
        --out_function_object="out_model.o"     string  Output object file containing the generated function for the TensorFlow model.
        --out_header="out.h"                    string  Output header file name.
        --out_metadata_object="out_helper.o"    string  Output object file name containing optional metadata for the generated function.
        --out_session_module=""                 string  Output session module proto.
        --gen_name_to_index=false               bool    Generate name-to-index data for Lookup{Arg,Result}Index methods.
        --gen_program_shape=false               bool    Generate program shape data for the ProgramShape method.
        --xla_generate_hlo_graph=""             string  HLO modules matching this regex will be dumped to a .dot file throughout various stages in compilation.
        --xla_hlo_graph_addresses=false         bool    With xla_generate_hlo_graph, show addresses of HLO ops in graph dump.
        --xla_hlo_graph_path=""                 string  With xla_generate_hlo_graph, dump the graphs into this path.
        --xla_hlo_dump_as_graphdef=false        bool    Dump HLO graphs as TensorFlow GraphDefs.
        --xla_hlo_graph_sharding_color=false    bool    Assign colors based on sharding assignments when generating the HLO graphs.
        --xla_hlo_tfgraph_device_scopes=false   bool    When generating TensorFlow HLO graphs, if the HLO instructions are assigned to a specific device, prefix the name scope with "devX" with X being the device ordinal.
        --xla_log_hlo_text=""                   string  HLO modules matching this regex will be dumped to LOG(INFO).
        --xla_generate_hlo_text_to=""           string  Dump all HLO modules as text into the provided directory path.
        --xla_enable_fast_math=true             bool    Enable unsafe fast-math optimizations in the compiler; this may produce faster code at the expense of some accuracy.
        --xla_llvm_enable_alias_scope_metadata=true     bool    In LLVM-based backends, enable the emission of !alias.scope metadata in the generated IR.
        --xla_llvm_enable_noalias_metadata=true bool    In LLVM-based backends, enable the emission of !noalias metadata in the generated IR.
        --xla_llvm_enable_invariant_load_metadata=true  bool    In LLVM-based backends, enable the emission of !invariant.load metadata in the generated IR.
        --xla_llvm_disable_expensive_passes=false       bool    In LLVM-based backends, disable a custom set of expensive optimization passes.
        --xla_backend_optimization_level=3      int32   Numerical optimization level for the XLA compiler backend.
        --xla_disable_hlo_passes=""             string  Comma-separated list of hlo passes to be disabled. These names must exactly match the passes' names; no whitespace around commas.
        --xla_embed_ir_in_executable=false      bool    Embed the compiler IR as a string in the executable.
        --xla_dump_ir_to=""                     string  Dump the compiler IR into this directory as individual files.
        --xla_eliminate_hlo_implicit_broadcast=true     bool    Eliminate implicit broadcasts when lowering user computations to HLO instructions; use explicit broadcast instead.
        --xla_cpu_multi_thread_eigen=true       bool    When generating calls to Eigen in the CPU backend, use multi-threaded Eigen mode.
        --xla_gpu_cuda_data_dir="./cuda_sdk_lib"        string  If non-empty, specifies a local directory containing ptxas and nvvm libdevice files; otherwise we use those from runfile directories.
        --xla_gpu_ftz=false                     bool    If true, flush-to-zero semantics are enabled in the code generated for GPUs.
        --xla_gpu_disable_multi_streaming=false bool    If true, multi-streaming in the GPU backend is disabled.
        --xla_gpu_max_kernel_unroll_factor=4    int32   Specify the maximum kernel unroll factor for the GPU backend.
        --xla_dump_optimized_hlo_proto_to=""    string  Dump Hlo after all hlo passes are executed as proto binary into this directory.
        --xla_dump_unoptimized_hlo_proto_to=""  string  Dump HLO before any hlo passes are executed as proto binary into this directory.
        --xla_dump_per_pass_hlo_proto_to=""     string  Dump HLO after each pass as an HloProto in binary file format into this directory.
        --xla_test_all_output_layouts=false     bool    Let ClientLibraryTestBase::ComputeAndCompare* test all permutations of output layouts. For example, with a 3D shape, all permutations of the set {0, 1, 2} are tried.
        --xla_test_all_input_layouts=false      bool    Let ClientLibraryTestBase::ComputeAndCompare* test all permutations of *input* layouts. For example, for 2 input arguments with 2D shape and 4D shape, the computation will run 2! * 4! times for every possible layouts
        --xla_hlo_profile=false                 bool    Instrument the computation to collect per-HLO cycle counts
        --xla_dump_computations_to=""           string  Dump computations that XLA executes into the provided directory path
        --xla_dump_executions_to=""             string  Dump parameters and results of computations that XLA executes into the provided directory path
        --xla_backend_extra_options=""          string  Extra options to pass to a backend; comma-separated list of 'key=val' strings (=val may be omitted); no whitespace around commas.
        --xla_reduce_precision=""               string  Directions for adding reduce-precision operations. Format is 'LOCATION=E,M:OPS;NAMES' where LOCATION is the class of locations in which to insert the operations (e.g., 'OP_OUTPUTS'), E and M are the exponent and mantissa bit counts respectively, and OPS and NAMES are comma-separated (no spaces) lists of the operation types and names to which to attach the reduce-precision operations.  The NAMES string and its preceding ';' may be omitted.  This option may be repeated to define multiple sets of added reduce-precision operations.
        --xla_gpu_use_cudnn_batchnorm=false     bool    Allows the GPU backend to implement batchnorm HLOs using cudnn, rather than expanding them to a soup of HLOs.
        --xla_cpu_use_mkl_dnn=false             bool    Generate calls to MKL-DNN in the CPU backend.

A rough note on what the developers contributing tfcompile seem to have in mind:
apparently the goal is to use tfcompile to help users generate a header file for their own defined operations (ops).
The official example documentation uses Eigen, and the call relationships are rather confusing when read from the example alone.
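For reference, here is roughly how the generated class is wired up, closely following the official tutorial; the header path, the foo::bar::MatMulComp class name and the 2x3 / 3x2 argument shapes are the tutorial's (they come from its tf_library rule and config), not something produced by the steps in this post.

#define EIGEN_USE_THREADS
#define EIGEN_USE_CUSTOM_THREAD_POOL

#include <algorithm>
#include <iostream>
#include "third_party/eigen3/unsupported/Eigen/CXX11/Tensor"
#include "tensorflow/compiler/aot/tests/test_graph_tfmatmul.h"  // header generated by tfcompile

int main(int argc, char** argv) {
  // The generated Run() method executes on an Eigen thread pool device.
  Eigen::ThreadPool tp(2);
  Eigen::ThreadPoolDevice device(&tp, tp.NumThreads());

  foo::bar::MatMulComp matmul;  // class name chosen via --cpp_class / cpp_class
  matmul.set_thread_pool(&device);

  // Fill the two input buffers (arg0 is 2x3, arg1 is 3x2 in the tutorial's config).
  const float args[12] = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12};
  std::copy(args + 0, args + 6, matmul.arg0_data());
  std::copy(args + 6, args + 12, matmul.arg1_data());

  matmul.Run();

  // result0 corresponds to the single fetch declared in the config.
  std::cout << "result(0, 0) = " << matmul.result0(0, 0) << std::endl;
  return 0;
}

So the Eigen pieces only give the generated function a thread pool to run on; the actual model call is just: set the args, Run(), read the results.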

Official documentation


Somebody pushed an issue on the project's GitHub page:

Unable to compile a quantized graph using XLA AOT? #11604

A contributor responded in the second half of 2017, saying that XLA support for quantized models was being added; there has been no news since.


There is also an unlucky guy who posted a question as an issue on the GitHub page; he asked twice and both times was sent off to Stack Overflow by two different people, on the grounds that answering on SO benefits more readers.

He then asked on SO, and nobody answered. Copying his SO question here:

How can I gdb tfcompile by using bazel build?

I will keep following this question for an answer.
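For what it's worth, one plausible way to do this (my own guess, not something confirmed in that thread) is to build tfcompile in bazel's debug compilation mode and then run the binary from bazel-bin under gdb:

bazel build -c dbg //tensorflow/compiler/aot:tfcompile
gdb --args bazel-bin/tensorflow/compiler/aot/tfcompile --graph=mygraph.pb --config=myfile.pbtxt --cpp_class="mynamespace::MyComputation"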


A woman on SO asked how to use tfcompile at all, and there is a fair amount of useful material under the question:

How to exercise Tensorflow XLA AOT support in tensorflow’s distribution

One answerer made a point I had not paid attention to before: if you want to work with XLA you have to look at the JIT side as well, and must not read only the AOT material just because the goal is AOT compilation for ARM.


Under one TF issue I saw someone's command for building tfcompile:

He referenced a merged pull request page.

Copying it here first:

bazel build --config=opt --config=monolithic --copt=/DNOGDI --host_copt=/DNOGDI //tensorflow/compiler/aot:tfcompile
I changed it, because it complained that /DNOGDI could not be found (the /DNOGDI options look like MSVC-style defines for Windows builds, so presumably they simply do not apply on Linux):
bazel build --config=opt --config=monolithic  //tensorflow/compiler/aot:tfcompile

Success. The build takes quite a long time.
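With that build done, the binary sits under bazel's standard output directory, so the invocation from the appendix above can be run directly, for example:

bazel-bin/tensorflow/compiler/aot/tfcompile --graph=mygraph.pb --config=myfile.pbtxt --cpp_class="mynamespace::MyComputation"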


tfcompile.bzl
Official source code
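For reference, tfcompile is normally not invoked by hand but driven through the tf_library macro defined in this tfcompile.bzl; a minimal BUILD snippet, adapted from the official tutorial (the file names and class name are the tutorial's), looks like:

load("//tensorflow/compiler/aot:tfcompile.bzl", "tf_library")

# tf_library runs tfcompile at build time and packages the generated
# header and object file into a cc_library that other targets can depend on.
tf_library(
    name = "test_graph_tfmatmul",
    cpp_class = "foo::bar::MatMulComp",
    graph = "test_graph_tfmatmul.pb",
    config = "test_graph_tfmatmul.config.pbtxt",
)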


A note: when using tfcompile you have to provide a .pb file and a .pbtxt file; for how to convert pb to pbtxt and pbtxt to pb, see

Link

or do it directly from the command line:

pb 2 pbtxt:

unaguo@unaguo: $ python
>>>
>>> import tensorflow as tf
>>> from tensorflow.python.platform import gfile
>>> from google.protobuf import text_format
>>> filename='/home/unaguo/test/test.pb'
>>> with gfile.FastGFile(filename,'rb') as f:
...     graph_def = tf.GraphDef()
...     graph_def.ParseFromString(f.read())
...     tf.import_graph_def(graph_def, name='')
...     tf.train.write_graph(graph_def, '/home/unaguo/test/', 'test.pbtxt', as_text=True)
>>>

pbtxt 2 pb:

unaguo@unaguo: $ python
>>>
>>> import tensorflow as tf
>>> from tensorflow.python.platform import gfile
>>> from google.protobuf import text_format
>>> filename='/home/unaguo/test/test.pbtxt'
>>> with gfile.FastGFile(filename,'rb') as f:
...     graph_def = tf.GraphDef()
...     file_content = f.read()
...     text_format.Merge(file_content, graph_def)
...     tf.train.write_graph( graph_def , './' , 'protobuf.pb' , as_text = False)
>>>

Someone filed an issue on GitHub:

tfcompile with --config=monolithic and -fvisibility=hidden results in undefined reference __xla_cpu_runtime_EigenMatMulF32

He wanted to build tfcompile with --config=monolithic.

I tried it and got an error:

Non-OK-status: status status: Not found: monolithic; No such file or directory

Judging by the message, the error seems to come from the tfcompile binary itself rather than from bazel: tfcompile also has its own --config flag (see the options above), which it treats as the path to a Config proto file, so passing --config=monolithic to the binary just makes it try to open a file named "monolithic". --config=monolithic only makes sense as a bazel flag on the build command line.

There is a question on SO:

tfcompile of tf.cond of constants errors

The asker pasted the tfcompile command he used:

tfcompile --graph=test_graph.pb --config=test_config.pb --entry_point=test_func --cpp_class=test --out_object=test_func.o --out_header=test.hpp

My main problem is that I do not know how to generate the --config=test_config.pb file, which is why, as in item 10 above, I have been looking for a simple way to generate one; a sketch of what the config contains follows below.
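For the record, this config is a tf2xla Config proto in text format: it just lists the graph's feeds (input placeholders, with shapes) and fetches (output nodes). The sketch below is adapted from the official tfcompile tutorial; the node names x_hold, y_hold and x_y_prod are the tutorial's and have to be replaced with the node names of your own graph.

# Each feed is a positional input argument of the generated function,
# identified by the name of a placeholder node in the graph.
feed {
  id { node_name: "x_hold" }
  shape {
    dim { size: 2 }
    dim { size: 3 }
  }
}
feed {
  id { node_name: "y_hold" }
  shape {
    dim { size: 3 }
    dim { size: 2 }
  }
}

# Each fetch is a positional output argument of the generated function,
# identified by the name of an output node in the graph.
fetch {
  id { node_name: "x_y_prod" }
}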


Reposted from blog.csdn.net/weixin_36047799/article/details/89339653