1.3 Building tensorflow-gpu from source on CentOS 7

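Updated: 2019-10-1
(1) NCCL no longer needs to be specified explicitly.
(2) The bazel version must be lower than 0.27.
(3) The gcc version must be higher than 4.8.5 and lower than 7.1.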
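
The version limits above can be checked before starting a long build. This is a minimal sketch, assuming GNU sort (for its -V version ordering) and assuming the bazel limit is 0.27; the function names version_lt and check_toolchain are made up for this illustration.

```shell
# version_lt A B: succeeds when version A is strictly lower than version B.
version_lt() {
    [ "$1" != "$2" ] && [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$1" ]
}

# check_toolchain BAZEL_VERSION GCC_VERSION: applies the three limits and
# prints one line per violated limit; returns non-zero if any is violated.
check_toolchain() {
    local bazel_v="$1" gcc_v="$2" ok=0
    version_lt "$bazel_v" "0.27" || { echo "bazel $bazel_v is not lower than 0.27"; ok=1; }
    version_lt "4.8.5" "$gcc_v"  || { echo "gcc $gcc_v is not higher than 4.8.5"; ok=1; }
    version_lt "$gcc_v" "7.1"    || { echo "gcc $gcc_v is not lower than 7.1"; ok=1; }
    return "$ok"
}

# Example call once both tools are installed:
# check_toolchain "$(bazel version | awk '/Build label/ {print $3}')" \
#                 "$(gcc -dumpfullversion -dumpversion)"
```

Passing the versions in as arguments keeps the check usable on a machine where bazel or gcc has not been installed yet.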

Updated: 2019-6-10

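Contents

**For NCCL, CUDA and cuDNN, use the release just before the latest one; for example, if the latest CUDA is 10.1, use 10.0.**

1. Prepare CUDA
2. Prepare NCCL
3. Install bazel
4. Build and install TensorFlow
5. Troubleshooting a failed build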

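As it happens, building and installing the GPU version of TensorFlow from source worked.
TensorFlow has been updated to 1.13. The official Linux binaries are built against glibc 2.23, while CentOS 7 only ships glibc 2.17, so a package installed with pip install tensorflow-gpu fails on import with: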

ImportError: /lib64/libc.so.6: version `GLIBC_2.23' not found (required by /usr/local/python3.6/lib/python3.6/site-packages/tensorflow/python/_pywrap_tensorflow.so)

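Upgrading glibc is not an option: after the upgrade the system would no longer boot. The only way out is to build TensorFlow from source, which avoids the error.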
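
To see whether a machine is affected, compare its glibc version with the 2.23 named in the error. The sketch below assumes GNU sort and a glibc-based system where getconf GNU_LIBC_VERSION prints something like "glibc 2.17"; version_ge is a hypothetical helper name.

```shell
# version_ge A B: succeeds when version A is greater than or equal to B.
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Second field of "glibc 2.17" is the version number.
glibc_version="$(getconf GNU_LIBC_VERSION 2>/dev/null | awk '{print $2}')"

if version_ge "${glibc_version:-0}" "2.23"; then
    echo "glibc ${glibc_version}: the prebuilt wheel should load"
else
    echo "glibc ${glibc_version:-unknown}: older than 2.23, build from source"
fi
```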
The source build is described below, for the latest version, 1.13:


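1. Prepare CUDA

There is little to add here, since tutorials are easy to find online. I used CUDA 10.0 with cuDNN 7.5.0.
For reference: https://www.jianshu.com/p/a201b91b3d96
Note: record your CUDA and cuDNN versions and the CUDA install location, because they are needed later.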

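2. Prepare NCCL

The GPU build of TensorFlow requires NCCL. The current version is 2.4.2; download it from https://developer.nvidia.com/nccl/nccl-download
The download is an rpm file. Install it with: rpm -ivh nccl-repo-rhel7-2.4.2-ga-cuda10.0-1-1.x86_64.rpm
Oddly, this does not install NCCL itself. It only unpacks three further rpm files. To see where they were placed, run: rpm -qpl nccl-repo-rhel7-2.4.2-ga-cuda10.0-1-1.x86_64.rpm
Go to that directory and install the three rpm files. They should install to /usr/lib64 by default; if unsure, check each one with rpm -qpl xxx.rpm.
Note: record the NCCL version and install location.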
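
The two-stage install can be scripted as a dry run that prints the commands for review instead of running them. This is a sketch only: print_rpm_installs is a hypothetical helper, and the directory passed to it is whatever rpm -qpl reports on your machine (the original screenshot with that path is not available).

```shell
# print_rpm_installs DIR: prints one "rpm -ivh" command per .rpm file in DIR.
print_rpm_installs() {
    local dir="$1" f
    for f in "$dir"/*.rpm; do
        [ -e "$f" ] || continue   # no match: the glob stays literal, skip it
        echo "rpm -ivh $f"
    done
}

# Usage sketch (run as root):
#   rpm -ivh nccl-repo-rhel7-2.4.2-ga-cuda10.0-1-1.x86_64.rpm
#   rpm -qpl nccl-repo-rhel7-2.4.2-ga-cuda10.0-1-1.x86_64.rpm   # shows the unpack directory
#   print_rpm_installs /directory/reported/by/rpm-qpl          # review, then run the output
```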

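3. Install bazel

bazel is Google's build tool. TensorFlow is built with it, so it must be installed.
Download link: https://github.com/bazelbuild/bazel/releases
Download the latest release as the -dist.zip source archive, subject to the bazel version limit in the update notes at the top.
Create a new folder named bazel, put the archive in it, and unpack it: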

unzip bazel-0.24.1-dist.zip

After unpacking, compile it:

./compile.sh

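After a while it reports success. The compiled binary is in the bazel/output directory.
Add that directory to PATH in ~/.bashrc.
The directory must be correct. Then run: source ~/.bashrc
Finally, type bazel on the command line. If the command is found and prints its usage text, the installation succeeded.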
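
The PATH change might look like the following. It is a sketch that assumes the bazel folder from the steps above was created directly under $HOME; adjust BAZEL_OUT if it lives elsewhere.

```shell
# Assumed location of the compiled binary (bazel/output under $HOME).
BAZEL_OUT="$HOME/bazel/output"

# Line to append to ~/.bashrc (printed here rather than written automatically):
echo "export PATH=\"\$PATH:$BAZEL_OUT\""

# For the current shell, add the directory only if it is not already on PATH.
case ":$PATH:" in
    *":$BAZEL_OUT:"*) ;;                    # already present, nothing to do
    *) export PATH="$PATH:$BAZEL_OUT" ;;
esac

# Report whether a bazel binary is now reachable.
command -v bazel >/dev/null 2>&1 || echo "bazel not found on PATH yet"
```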

4. Build and install TensorFlow

git clone https://github.com/tensorflow/tensorflow.git
cd tensorflow

Start the build configuration:

./configure

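Note: answer y to the CUDA- and NCCL-related prompts and n to all other yes/no prompts. Prompts that ask for a version, a path or a list of flags expect a value, or Enter to accept the default; the session below shows the error that results from typing y at the gcc path prompt: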

Found possible Python library paths:
  /DATA/119/gxrao2/usr/local/conda3/lib/python3.7/site-packages
Please input the desired Python library path to use.  Default is [/DATA/119/gxrao2/usr/local/conda3/lib/python3.7/site-packages]

Do you wish to build TensorFlow with XLA JIT support? [Y/n]: n
No XLA JIT support will be enabled for TensorFlow.

Do you wish to build TensorFlow with OpenCL SYCL support? [y/N]: n
No OpenCL SYCL support will be enabled for TensorFlow.

Do you wish to build TensorFlow with ROCm support? [y/N]: n
No ROCm support will be enabled for TensorFlow.

Do you wish to build TensorFlow with CUDA support? [y/N]: y
CUDA support will be enabled for TensorFlow.

Do you wish to build TensorFlow with TensorRT support? [y/N]: n
No TensorRT support will be enabled for TensorFlow.

Could not find any cuda.h matching version '' in any subdirectory:
        ''
        'include'
        'include/cuda'
        'include/*-linux-gnu'
        'extras/CUPTI/include'
        'include/cuda/CUPTI'
of:
        '/lib64'
        '/usr'
        '/usr/lib'
        '/usr/lib64'
        '/usr/lib64/mysql'
        '/usr/lib64/qt-3.3/lib'
        '/usr/lib64/xulrunner'
Asking for detailed CUDA configuration...

Please specify the CUDA SDK version you want to use. [Leave empty to default to CUDA 10]: 10.0


Please specify the cuDNN version you want to use. [Leave empty to default to cuDNN 7]: 7.5


Please specify the locally installed NCCL version you want to use. [Leave empty to use http://github.com/nvidia/nccl]: 2.4


Please specify the comma-separated list of base paths to look for CUDA libraries and headers. [Leave empty to use the default]: /DATA/119/gxrao2/usr/local/cuda-10.0,/usr


Found CUDA 10.0 in:
    /DATA/119/gxrao2/usr/local/cuda-10.0/lib64
    /DATA/119/gxrao2/usr/local/cuda-10.0/include
Found cuDNN 7 in:
    /DATA/119/gxrao2/usr/local/cuda-10.0/lib64
    /DATA/119/gxrao2/usr/local/cuda-10.0/include
Found NCCL 2 in:
    /usr/lib64
    /usr/include


Please specify a list of comma-separated CUDA compute capabilities you want to build with.
You can find the compute capability of your device at: https://developer.nvidia.com/cuda-gpus.
Please note that each additional compute capability significantly increases your build time and binary size, and that TensorFlow only supports compute capabilities >= 3.5 [Default is: 6.1,6.1,6.1,6.1,6.1,6.1,6.1,6.1]: 


Do you want to use clang as CUDA compiler? [y/N]: n
nvcc will be used as CUDA compiler.

Please specify which gcc should be used by nvcc as the host compiler. [Default is /usr/local/bin/gcc]: y


Invalid gcc path. y cannot be found.
Please specify which gcc should be used by nvcc as the host compiler. [Default is /usr/local/bin/gcc]: /usr/bin/gcc


Do you wish to build TensorFlow with MPI support? [y/N]: n
No MPI support will be enabled for TensorFlow.

Please specify optimization flags to use during compilation when bazel option "--config=opt" is specified [Default is -march=native -Wno-sign-compare]: n


Would you like to interactively configure ./WORKSPACE for Android builds? [y/N]: n
Not configuring the WORKSPACE for Android builds.

Preconfigured Bazel build configs. You can use any of the below by adding "--config=<>" to your build command. See .bazelrc for more details.
	--config=mkl         	# Build with MKL support.
	--config=monolithic  	# Config for mostly static monolithic build.
	--config=gdr         	# Build with GDR support.
	--config=verbs       	# Build with libverbs support.
	--config=ngraph      	# Build with Intel nGraph support.
	--config=numa        	# Build with NUMA support.
	--config=dynamic_kernels	# (Experimental) Build kernels into separate shared objects.
Preconfigured Bazel build configs to DISABLE default on features:
	--config=noaws       	# Disable AWS S3 filesystem support.
	--config=nogcp       	# Disable GCP support.
	--config=nohdfs      	# Disable HDFS support.
	--config=noignite    	# Disable Apache Ignite support.
	--config=nokafka     	# Disable Apache Kafka support.
	--config=nonccl      	# Disable NVIDIA NCCL support.
Configuration finished
(base) gxrao2@mic119:/DATA/119/gxrao2/tf_install/tensorflow$ bazel build --config=opt --config=cuda //tensorflow/tools/pip_package:build_pip_package 
WARNING: Output base '/home/gxrao2/.cache/bazel/_bazel_gxrao2/6db07e2ca57d5a7ff7347aece9f0c169' is on NFS. This may lead to surprising failures and undetermined behavior.
Starting local Bazel server and connecting to it...
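
The answers given in the session above can also be supplied up front as environment variables, so that ./configure skips the matching prompts. The variable names below are the ones I recall from TensorFlow's configure.py of this period; treat them as assumptions and confirm them against the configure.py in your checkout. The paths are taken from the session, and CC_OPT_FLAGS is set to the default shown in the prompt.

```shell
# Answers from the session above, expressed as environment variables.
export TF_ENABLE_XLA=0
export TF_NEED_OPENCL_SYCL=0
export TF_NEED_ROCM=0
export TF_NEED_CUDA=1
export TF_NEED_TENSORRT=0
export TF_CUDA_VERSION=10.0
export TF_CUDNN_VERSION=7.5
export TF_NCCL_VERSION=2.4
export TF_CUDA_PATHS=/DATA/119/gxrao2/usr/local/cuda-10.0,/usr
export TF_CUDA_COMPUTE_CAPABILITIES=6.1
export TF_CUDA_CLANG=0
export GCC_HOST_COMPILER_PATH=/usr/bin/gcc
export TF_NEED_MPI=0
export CC_OPT_FLAGS="-march=native -Wno-sign-compare"
export TF_SET_ANDROID_WORKSPACE=0

# With these set, run the configuration from the tensorflow directory:
# ./configure
```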


Build with:

bazel build --config=opt --config=cuda //tensorflow/tools/pip_package:build_pip_package

Wait for it to finish; this takes a while. If it succeeds, the hard part is over.
Package the result as a whl file:

bazel-bin/tensorflow/tools/pip_package/build_pip_package /tmp/tensorflow_pkg

Install the wheel with pip:

pip install /tmp/tensorflow_pkg/*.whl
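
After installing, it is worth confirming that the package imports and sees the GPU. The helpers below are a sketch with made-up names (newest_wheel, verify_tf_gpu); tf.test.is_gpu_available() is the TensorFlow 1.x call for this check. Run the verification from outside the tensorflow source directory, otherwise Python may pick up the source tree instead of the installed package.

```shell
# newest_wheel DIR: prints the most recently modified .whl file in DIR.
newest_wheel() {
    ls -t "$1"/*.whl 2>/dev/null | head -n 1
}

# verify_tf_gpu: imports the installed package, prints its version and
# whether a GPU is visible to it.
verify_tf_gpu() {
    python -c 'import tensorflow as tf; print(tf.__version__); print(tf.test.is_gpu_available())'
}

# Usage sketch:
#   pip install "$(newest_wheel /tmp/tensorflow_pkg)"
#   cd ~ && verify_tf_gpu
```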

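5. Troubleshooting a failed build

bazel version: TensorFlow only builds with certain bazel versions. A recent TensorFlow usually pairs with a recent bazel, but not necessarily the newest one; see the version limit in the update notes at the top.
CUDA, cuDNN and NCCL: the install locations and versions must be right and must be entered correctly during configuration. For NCCL in particular, check where it was installed, otherwise the configuration step will not find it.
Do not enable options you do not need, and make sure every configuration answer is correct.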
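
When configuration cannot find CUDA, cuDNN or NCCL, locating the headers and reading their version macros usually shows what to enter. This is a sketch with hypothetical helper names; it relies on the version macros these headers define (CUDA_VERSION in cuda.h, CUDNN_MAJOR and CUDNN_MINOR in cudnn.h, NCCL_MAJOR and NCCL_MINOR in nccl.h), and the search paths in the usage lines are the ones from the configuration session above.

```shell
# find_header NAME DIR...: prints every file called NAME below the given directories.
find_header() {
    local name="$1"; shift
    find "$@" -name "$name" 2>/dev/null
}

# header_defines FILE MACRO...: prints the "#define MACRO value" line for each macro.
header_defines() {
    local file="$1" macro; shift
    for macro in "$@"; do
        grep -E "^#define[[:space:]]+${macro}[[:space:]]" "$file"
    done
}

# Usage sketch:
#   find_header cuda.h  /DATA/119/gxrao2/usr/local/cuda-10.0
#   find_header cudnn.h /DATA/119/gxrao2/usr/local/cuda-10.0
#   find_header nccl.h  /usr/include
#   header_defines /usr/include/nccl.h NCCL_MAJOR NCCL_MINOR
```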
