error gathering device information while adding custom device "/dev/nvidia-uvm": no such file or dir

While running deep learning training with Docker + TensorFlow, I hit this error:
error gathering device information while adding custom device "/dev/nvidia0": no such file or directory
I checked the host's /dev directory, and there was indeed no nvidia0 node:

[root@gpu2 dev]# ls
autofs           disk       hwrng         mem                 port    rtc0   sdc       stderr  tty12  tty2   tty27  tty34  tty41  tty49  tty56  tty63  uhid     vcs4   vcsa5        zero
block            dri        initctl       memory_bandwidth    ppp     sda    sdc1      stdin   tty13  tty20  tty28  tty35  tty42  tty5   tty57  tty7   uinput   vcs5   vcsa6
bsg              fb0        input         mqueue              ptmx    sda1   sg0       stdout  tty14  tty21  tty29  tty36  tty43  tty50  tty58  tty8   urandom  vcs6   vfio
char             fd         kmsg          net                 ptp0    sda14  sg1       tty     tty15  tty22  tty3   tty37  tty44  tty51  tty59  tty9   usbmon0  vcsa   vga_arbiter
console          full       log           network_latency     pts     sda15  sg2       tty0    tty16  tty23  tty30  tty38  tty45  tty52  tty6   ttyS0  vcs      vcsa1  vhci
core             fuse       loop-control  network_throughput  random  sda2   shm       tty1    tty17  tty24  tty31  tty39  tty46  tty53  tty60  ttyS1  vcs1     vcsa2  vhost-net
cpu              hpet       mapper        null                raw     sdb    snapshot  tty10   tty18  tty25  tty32  tty4   tty47  tty54  tty61  ttyS2  vcs2     vcsa3  vhost-vsock
cpu_dma_latency  hugepages  mcelog        nvram               rtc     sdb1   snd       tty11   tty19  tty26  tty33  tty40  tty48  tty55  tty62  ttyS3  vcs3     vcsa4  vmbus
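
Scanning a long ls listing by eye is error-prone, so a small helper can report which nodes are absent. This is only a sketch: missing_nodes is a hypothetical name, and the node list passed to it is an assumption about what the container needs, not something taken from the original setup.

```shell
#!/usr/bin/env bash
# Print every expected device node that does not exist in the given directory.
# Usage: missing_nodes DIR NODE...
# (missing_nodes is a hypothetical helper, not part of any NVIDIA tooling.)
missing_nodes() {
    local dev_dir="$1"
    shift
    local node
    for node in "$@"; do
        # -e is true for any existing entry, including character devices
        [ -e "$dev_dir/$node" ] || echo "$node"
    done
}

# Assumed node list for illustration; adjust to the devices you pass to Docker.
missing_nodes /dev nvidia0 nvidiactl nvidia-uvm
```

An empty result means all of the listed nodes are present.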

First, check whether the host can detect the GPUs at all:

[root@gpu2 dev]# nvidia-smi 
Fri Aug  7 19:04:20 2020       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.82       Driver Version: 440.82       CUDA Version: 10.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla V100-PCIE...  Off  | 00000001:00:00.0 Off |                    0 |
| N/A   28C    P0    35W / 250W |      0MiB / 16160MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla V100-PCIE...  Off  | 00000002:00:00.0 Off |                    0 |
| N/A   27C    P0    33W / 250W |      0MiB / 16160MiB |      5%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

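The GPUs are fine, and the corresponding device nodes now show up under /dev as well (running nvidia-smi apparently initialized the driver and created them):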

[root@gpu2 dev]# ls
autofs           disk       hwrng         mem                 nvidia1    random  sda2  shm       tty1   tty17  tty24  tty31  tty39  tty46  tty53  tty60  ttyS1    vcs1   vcsa2        vhost-net
block            dri        initctl       memory_bandwidth    nvidiactl  raw     sdb   snapshot  tty10  tty18  tty25  tty32  tty4   tty47  tty54  tty61  ttyS2    vcs2   vcsa3        vhost-vsock
bsg              fb0        input         mqueue              nvram      rtc     sdb1  snd       tty11  tty19  tty26  tty33  tty40  tty48  tty55  tty62  ttyS3    vcs3   vcsa4        vmbus
char             fd         kmsg          net                 port       rtc0    sdc   stderr    tty12  tty2   tty27  tty34  tty41  tty49  tty56  tty63  uhid     vcs4   vcsa5        zero
console          full       log           network_latency     ppp        sda     sdc1  stdin     tty13  tty20  tty28  tty35  tty42  tty5   tty57  tty7   uinput   vcs5   vcsa6
core             fuse       loop-control  network_throughput  ptmx       sda1    sg0   stdout    tty14  tty21  tty29  tty36  tty43  tty50  tty58  tty8   urandom  vcs6   vfio
cpu              hpet       mapper        null                ptp0       sda14   sg1   tty       tty15  tty22  tty3   tty37  tty44  tty51  tty59  tty9   usbmon0  vcsa   vga_arbiter
cpu_dma_latency  hugepages  mcelog        nvidia0             pts        sda15   sg2   tty0      tty16  tty23  tty30  tty38  tty45  tty52  tty6   ttyS0  vcs      vcsa1  vhci

The error message, meanwhile, changed to:

error gathering device information while adding custom device "/dev/nvidia-uvm": no such file or dir
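
The /dev/nvidia-uvm node belongs to the separate nvidia_uvm kernel module, so it is worth checking whether that module is loaded at all. A minimal sketch, assuming the usual /proc/modules layout where each line begins with the module name; module_loaded is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Succeed if the named kernel module is listed in a /proc/modules-style file.
# Usage: module_loaded NAME [FILE]   (FILE defaults to /proc/modules)
module_loaded() {
    local name="$1"
    local modules_file="${2:-/proc/modules}"
    # Each line starts with the module name followed by a space, so anchoring
    # on "^name " avoids matching nvidia when looking for nvidia_uvm.
    grep -q "^${name} " "$modules_file" 2>/dev/null
}

if module_loaded nvidia_uvm; then
    echo "nvidia_uvm is loaded"
else
    echo "nvidia_uvm is not loaded"
fi
```

Note that a loaded module does not by itself guarantee the device node exists; the node still has to be created.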

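Try loading it manually with nvidia-modprobe (-u selects the Unified Memory kernel module, -c=0 creates the device file for minor number 0):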

[root@gpu2 dev]# nvidia-modprobe -u -c=0
[root@gpu2 dev]# ls | grep nvidia
nvidia0
nvidia1
nvidiactl
nvidia-modeset
nvidia-uvm
nvidia-uvm-tools
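
With all the nodes in place, the container can be started again. The original docker command is not shown here, so the following is only a sketch under the assumption that the devices are passed with Docker's --device option (which is where this error message comes from); nvidia_device_flags is a hypothetical helper and the image name in the usage comment is a placeholder.

```shell
#!/usr/bin/env bash
# Print one --device=HOST:CONTAINER flag per NVIDIA node found in a directory.
# Usage: nvidia_device_flags [DIR]   (DIR defaults to /dev)
nvidia_device_flags() {
    local dev_dir="${1:-/dev}"
    local path
    for path in "$dev_dir"/nvidia*; do
        # If the glob matched nothing it stays literal, so skip it.
        [ -e "$path" ] || continue
        # Map each host node to the same name under /dev in the container.
        printf '%s\n' "--device=${path}:/dev/${path##*/}"
    done
}

# Example usage (left unquoted on purpose so each flag becomes its own word):
#   docker run $(nvidia_device_flags) your-image
```

Generating the flags from what actually exists means Docker is never asked for a node that is missing, which is exactly the situation that triggered the error.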

Official reference:
