[Mask R-CNN] Errors encountered when training on multiple GPUs

1. Loaded runtime CuDNN library: 7103 (compatibility version 7100) but source was compiled with 7005 (compatibility version 7000). If using a binary install, upgrade your CuDNN library to match. If building from sources, make sure the library loaded at runtime matches a compatible version specified during compile configuration.
Analysis: the CUDA and cuDNN versions do not match. Installing a matching pair fixes it:

conda install cudatoolkit=9.0

conda install cudnn=7.1.2

Installation reference: http://www.sohu.com/a/225953058_491081
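
The two numbers in the message are easier to compare once decoded. The sketch below assumes the cuDNN 7 style encoding (major * 1000 + minor * 100 + patch) and that the "compatibility version" is the code with the patch level dropped, which is consistent with the values in the log above. The function names are illustrative and are not part of TensorFlow or cuDNN.

```python
def decode_cudnn_version(code):
    """Split a cuDNN 7-style version code into (major, minor, patch)."""
    major, rest = divmod(code, 1000)
    minor, patch = divmod(rest, 100)
    return major, minor, patch


def compatibility_version(code):
    # Assumption: the compatibility version is the code rounded down
    # to the nearest hundred, i.e. the patch level is ignored.
    return code - code % 100


def same_compatibility_version(loaded, compiled):
    """True when both codes share the same major.minor level."""
    return compatibility_version(loaded) == compatibility_version(compiled)


loaded, compiled = 7103, 7005  # values taken from the error message
print(decode_cudnn_version(loaded))                  # (7, 1, 3)
print(decode_cudnn_version(compiled))                # (7, 0, 5)
print(same_compatibility_version(loaded, compiled))  # False
```

Here the runtime library is 7.1.3 while the binary was built against 7.0.5, so the two sit on different minor levels.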

2. InvalidArgumentError (see above for traceback): Integer division by zero
[[Node: training/SGD/gradients/mrcnn_bbox_loss_1/concat_grad/mod = FloorMod[T=DT_INT32, _class=["loc:@mrcnn_bbox_loss_1/concat"], _device="/job:localhost/replica:0/task:0/cpu:0"](mrcnn_bbox_loss_1/concat/axis, training/SGD/gradients/mrcnn_bbox_loss_1/concat_grad/Rank)]]
Analysis: upgrade tensorflow-gpu to version 1.7.
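
Before retrying, it can help to confirm which version the environment actually imports. This is a minimal sketch: `parse_version` and `meets_minimum` are hypothetical helpers written for this note, and the 1.7 threshold is simply the version mentioned above.

```python
def parse_version(text):
    """Return the leading (major, minor) of a version string such as '1.7.0'."""
    major, minor = text.split(".")[:2]
    return int(major), int(minor)


def meets_minimum(text, minimum=(1, 7)):
    # Compare as integer tuples so that '1.10.0' counts as newer than '1.7.0'.
    return parse_version(text) >= minimum


try:
    import tensorflow as tf
    print(tf.__version__, meets_minimum(tf.__version__))
except ImportError:
    print("tensorflow is not installed in this environment")
```

Comparing integer tuples rather than raw strings avoids treating "1.10" as older than "1.7".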

3. Allocator (GPU_0_bfc) ran out of memory trying to allocate 3.13GiB. The caller indicates that this ...

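Solution: Option 1: use a smaller batch_size.

Option 2 (I have not tried this): let TensorFlow grow its GPU memory allocation on demand instead of reserving it all up front: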

import tensorflow as tf

# Allocate GPU memory as needed rather than claiming it all at start-up.
config = tf.ConfigProto()
config.gpu_options.allow_growth = True
sess = tf.Session(config=config)

Reference: https://stackoverflow.com/questions/36927607/how-can-i-solve-ran-out-of-gpu-memory-in-tensorflow
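
For Option 1 on multiple GPUs, it is worth checking how the batch size is derived. The sketch below assumes a configuration in which the effective batch size is the number of images per GPU multiplied by the number of GPUs. `TrainConfig`, `GPU_COUNT`, and `IMAGES_PER_GPU` are hypothetical names used for illustration only, so check how your own implementation defines them.

```python
class TrainConfig:
    """Hypothetical training config; attribute names are illustrative."""
    GPU_COUNT = 2
    IMAGES_PER_GPU = 2

    @property
    def batch_size(self):
        # Assumption: each GPU processes IMAGES_PER_GPU images per step.
        return self.GPU_COUNT * self.IMAGES_PER_GPU


class SmallerBatchConfig(TrainConfig):
    # Halve the per-GPU load; this is what lowers memory use on each device.
    IMAGES_PER_GPU = 1


print(TrainConfig().batch_size)         # 4
print(SmallerBatchConfig().batch_size)  # 2
```

Under this assumption, adding GPUs raises the total batch size without relieving any single device, so the per-GPU image count is the value to lower when one GPU runs out of memory.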

