Horovod: Deploying Uber's Distributed Deep Learning Framework in Practice
Horovod is another deep learning tool open-sourced by Uber. Its design draws on Facebook's "Training ImageNet in 1 Hour" paper and Baidu's ring-allreduce work, and it helps users set up distributed training.
Horovod is a distributed training framework for TensorFlow, Keras, PyTorch, and MXNet. The goal of Horovod is to make distributed Deep Learning fast and easy to use.
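Horovod's communication layer is built on the ring-allreduce algorithm mentioned above. As a rough illustration of why it scales well, here is a simulation sketch (pure Python, not Horovod's actual implementation): n workers each hold a gradient vector, and after n-1 scatter-reduce steps plus n-1 allgather steps every worker holds the element-wise sum, with each step moving only about 1/n-th of the data between ring neighbours.

```python
# Simulation sketch of ring-allreduce -- illustrative only, NOT Horovod's code.

def ring_allreduce(grads):
    """grads: one gradient list per worker, all of equal length."""
    n = len(grads)
    length = len(grads[0])
    chunk = length // n
    # Chunk k covers bounds[k]; the last chunk absorbs any remainder.
    bounds = [(k * chunk, length if k == n - 1 else (k + 1) * chunk)
              for k in range(n)]
    data = [list(g) for g in grads]  # each worker's local buffer

    # Scatter-reduce: in step t, worker i sends chunk (i - t) % n to its
    # right neighbour, which adds it into its own buffer. Afterwards,
    # worker i owns the fully reduced chunk (i + 1) % n.
    for t in range(n - 1):
        for i in range(n):
            lo, hi = bounds[(i - t) % n]
            dst = (i + 1) % n
            for j in range(lo, hi):
                data[dst][j] += data[i][j]

    # Allgather: circulate the completed chunks around the ring so every
    # worker ends up with the full reduced vector.
    for t in range(n - 1):
        for i in range(n):
            lo, hi = bounds[(i + 1 - t) % n]
            dst = (i + 1) % n
            data[dst][lo:hi] = data[i][lo:hi]

    return data
```

For example, three workers holding [1, 2, 3], [4, 5, 6], [7, 8, 9] all end up with [12, 15, 18]. The bandwidth per worker is independent of the number of workers, which is the property Horovod exploits.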
References:
Introduction to distributed training modes: TensorFlow and Horovod
Horovod explained: Uber's open-source TensorFlow distributed deep learning framework (Chinese write-up)
Meet Horovod: Uber's Open Source Distributed Deep Learning Framework for TensorFlow
Horovod GitHub homepage
Horovod example code
Deployment practice: Horovod in Docker:
Test environment:
# Ubuntu 18.04 LTS
# NVIDIA driver 410.93
# Docker 18.09.2
# CUDA 9.0
# Python 3.6
## Ubuntu installation: BIOS settings:
# Set Secure Boot to disabled
# Set Boot to UEFI only
# Make the USB installer the first boot device
GPU driver:
# Download the matching driver (.run installer) from the NVIDIA website
# Disable the nouveau driver
# Edit /etc/modprobe.d/blacklist-nouveau.conf and add the following:
blacklist nouveau
blacklist lbm-nouveau
options nouveau modeset=0
alias nouveau off
alias lbm-nouveau off
# Save, then disable nouveau:
$ echo options nouveau modeset=0 | sudo tee -a /etc/modprobe.d/nouveau-kms.conf
# Regenerate the initramfs and reboot:
$ sudo update-initramfs -u
$ sudo reboot
# Install the kernel sources (very important):
$ sudo apt-get install linux-source
$ sudo apt-get install linux-headers-$(uname -r)
# Switch to a text console with Ctrl + Alt + F1, then stop the graphical session
$ sudo service lightdm stop
# Install the NVIDIA driver:
$ chmod +x NVIDIA-Linux-x86_64-xxx.xx.run
$ sudo ./NVIDIA-Linux-x86_64-xxx.xx.run
# Then load the NVIDIA kernel module
$ modprobe nvidia
# Check that the driver installed successfully
$ nvidia-smi
System support and base packages:
# Reference: https://nvidia.github.io/nvidia-docker/
$ sudo passwd # set a root password so the following steps can run as root
$ apt-get install -y vim
$ apt-get install -y curl
$ apt-get install -y net-tools
$ apt-get install -y gcc
$ apt-get install -y make
Deploying Docker:
# Docker must be pinned to docker-ce=5:18.09.2~3-0~ubuntu-bionic; stop and remove any existing Docker installation before installing the new one
# Reference: https://blog.csdn.net/bingzhongdehuoyan/article/details/79411479
$ curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo apt-key add -
$ sudo add-apt-repository "deb [arch=amd64] https://download.docker.com/linux/ubuntu $(lsb_release -cs) stable"
$ sudo apt-get update
# List the Docker versions currently available
$ apt-cache madison docker-ce
# Check the current version, then remove the old Docker
$ docker -v
$ sudo apt-get remove docker docker-engine docker-ce docker.io
# Install the pinned version
$ sudo apt-get install docker-ce=5:18.09.2~3-0~ubuntu-bionic
$ docker -v
# Common docker commands
$ docker load < ./image_name.tar.gz # load a saved image
$ docker save image_name > ./image_name.tar.gz # save an image to a tarball
$ docker stop $(docker ps -q) # stop all running containers
$ docker rm $(docker ps -a -q) # remove all containers
$ docker ps -a # list all containers (running and stopped)
$ nvidia-docker run -it <image> # run a container with GPU access
Configure the Dockerfile, then download and build the Horovod environment online:
# Before building, you can adjust the software versions in the Dockerfile: CUDA, TensorFlow, PyTorch, Python
$ mkdir horovod-docker
$ wget -O horovod-docker/Dockerfile https://raw.githubusercontent.com/uber/horovod/master/Dockerfile
$ docker build -t horovod:latest horovod-docker
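For orientation, the upstream Dockerfile follows roughly this shape (a hypothetical, trimmed-down sketch; the version numbers are assumptions, and the real file also builds OpenMPI, NCCL, and an SSH server, so check it before editing):

```dockerfile
# Trimmed sketch of the kind of Dockerfile used here -- illustrative only.
# The ENV version variables are the knobs to adjust before `docker build`.
FROM nvidia/cuda:9.0-devel-ubuntu16.04

ENV TENSORFLOW_VERSION=1.12.0
ENV PYTORCH_VERSION=1.0.0

# ... system packages, OpenMPI build, and SSH setup elided ...

RUN pip install tensorflow-gpu==${TENSORFLOW_VERSION} torch==${PYTORCH_VERSION}
# Build Horovod with NCCL-based GPU allreduce support
RUN HOROVOD_GPU_ALLREDUCE=NCCL pip install --no-cache-dir horovod
```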
Installing nvidia-docker 2.0:
# 参考文档:https://github.com/NVIDIA/nvidia-docker/wiki/Installation-(version-2.0)#prerequisites
# If you have nvidia-docker 1.0 installed: we need to remove it and all existing GPU containers
$ docker volume ls -q -f driver=nvidia-docker | xargs -r -I{} -n1 docker ps -q -a -f volume={} | xargs -r docker rm -f
$ sudo apt-get purge -y nvidia-docker
# Add the package repositories
$ curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | sudo apt-key add -
$ distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
$ curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | sudo tee /etc/apt/sources.list.d/nvidia-docker.list
$ sudo apt-get update
# Install nvidia-docker2 and reload the Docker daemon configuration
$ sudo apt-get install -y nvidia-docker2
$ sudo pkill -SIGHUP dockerd
# Test nvidia-smi with the latest official CUDA image
$ docker run --runtime=nvidia --rm nvidia/cuda:9.0-base nvidia-smi
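Installing nvidia-docker2 registers the `nvidia` runtime with the Docker daemon; after installation, /etc/docker/daemon.json should contain an entry along these lines:

```json
{
    "runtimes": {
        "nvidia": {
            "path": "/usr/bin/nvidia-container-runtime",
            "runtimeArgs": []
        }
    }
}
```

If the `--runtime=nvidia` test above fails, this file is the first thing to check.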
Passwordless root SSH between machines:
# Passwordless root SSH reference: https://www.cnblogs.com/toughlife/p/5633510.html
# Edit the SSH daemon config to allow passwordless root login:
$ sudo vi /etc/ssh/sshd_config
# Set PermitRootLogin to yes
# Uncomment the PermitEmptyPasswords line and set it to yes
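After the edits, the relevant lines in /etc/ssh/sshd_config should read:

```
PermitRootLogin yes
PermitEmptyPasswords yes
```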
$ sudo service ssh restart
# Generate a keypair for passwordless login
# Key setup reference: http://www.linuxproblem.org/art_9.html
# su to root first
a@A:~> ssh-keygen -t rsa
a@A:~> ssh b@B mkdir -p .ssh
a@A:~> cat .ssh/id_rsa.pub | ssh b@B 'cat >> .ssh/authorized_keys'
# Finally, copy id_rsa and authorized_keys to /mnt/share/ssh
Testing Horovod:
Running on a single machine:
# su to root first
$ nvidia-docker run -it horovod:latest # start the container
root@c278c88dd552:/examples# mpirun \ # run mpirun inside the container
-np 1 \ # launch 1 process (one process per GPU)
-H localhost:1 \ # max number of processes allowed on localhost
python keras_mnist_advanced.py # run the training script
# Full command: mpirun -np 1 -H localhost:1 python keras_mnist_advanced.py
Running on multiple machines:
# su to root first
# Primary worker: --------------------------------------------------------------------------------------------------------------------
host1$ nvidia-docker run -it \
--network=host \ # share the host's network stack
-v /mnt/share/ssh:/root/.ssh \ # mount the host's passwordless-SSH keys into the container
horovod:latest # start the container
# Full command: nvidia-docker run -it --network=host -v /mnt/share/ssh:/root/.ssh horovod:latest
root@c278c88dd552:/examples# mpirun \ # run mpirun inside the container
-np 3 \ # launch 3 processes
-H host1:1,host2:1,host3:1 \ # max processes allowed on each host
-mca plm_rsh_args "-p 12345" \ # make the SSH launcher connect on port 12345
python keras_mnist_advanced.py # run the training script
# Full command: mpirun -np 3 -H host1:1,host2:1,host3:1 -mca plm_rsh_args "-p 12345" python keras_mnist_advanced.py
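The interplay between -np and -H above follows Open MPI's default by-slot placement: ranks fill each host's slot count in order, and -np may not exceed the total slot count. A small illustrative sketch of that mapping (a hypothetical helper, not Open MPI code):

```python
# Illustrative sketch of how `mpirun -np N -H host:slots,...` assigns ranks
# to hosts under default by-slot placement. Hypothetical helper, not OpenMPI.

def map_ranks(np, hostlist):
    """hostlist e.g. 'host1:1,host2:1,host3:1' -> [(rank, host), ...]"""
    slots = []
    for entry in hostlist.split(","):
        host, _, n = entry.partition(":")
        # Default to 1 slot here for simplicity when no count is given.
        slots.extend([host] * int(n or 1))
    if np > len(slots):
        raise ValueError("more processes requested than slots available")
    # Ranks fill the slot list in order: all of host1's slots, then host2's...
    return list(enumerate(slots[:np]))
```

With one slot per host, as in the command above, each of the three ranks lands on a different machine, giving one process per GPU per host.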
# Secondary workers: ------------------------------------------------------------------------
host2$ nvidia-docker run -it \ # start the container
--network=host \ # share the host's network stack
-v /mnt/share/ssh:/root/.ssh \ # mount the host's passwordless-SSH keys into the container
horovod:latest bash -c "/usr/sbin/sshd -p 12345; sleep infinity" # run sshd on port 12345 and keep the container alive
# Full command: nvidia-docker run -it --network=host -v /mnt/share/ssh:/root/.ssh horovod:latest bash -c "/usr/sbin/sshd -p 12345; sleep infinity"
host3$ nvidia-docker run -it --network=host -v /mnt/share/ssh:/root/.ssh horovod:latest bash -c "/usr/sbin/sshd -p 12345; sleep infinity" # same setup as host2, on host3