Problem: training works fine on a single GPU (non-distributed), but multi-GPU distributed training fails with the error below.
Error message:
RuntimeError: Expected to have finished reduction in the prior iteration before starting a new one. This error indicates that your module has parameters that were not used in producing loss. You can enable unused parameter detection by (1) passing the keyword argument find_unused_parameters=True to torch.nn.parallel.DistributedDataParallel; (2) making sure all forward function outputs participate in calculating loss. If you already have done the above two steps, then the distributed data parallel module wasn't able to locate the output tensors in the return value of your module's forward function. Please include the loss function and the structure of the return value of forward of your module when reporting this issue (e.g. list, dict, iterable).
Cause:
The module's __init__ method defines some submodules that carry parameters, but those submodules are never called in forward. Their parameters therefore never contribute to the loss, and DDP's gradient reducer waits forever for gradients that will never arrive.
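The mismatch can be reproduced in a single process, without launching DDP at all. The sketch below (module and layer names are made up for illustration) defines a submodule that forward never touches; after backward, its parameters have no gradients, which is exactly the condition DDP's reducer complains about:

```python
import torch
import torch.nn as nn

class BuggyNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(4, 4)
        self.fc2 = nn.Linear(4, 2)  # defined but never used in forward

    def forward(self, x):
        return self.fc1(x)          # fc2 never participates in the loss

model = BuggyNet()
loss = model(torch.randn(3, 4)).sum()
loss.backward()

# fc1's parameters receive gradients; fc2's stay None. Under DDP this
# mismatch triggers the "Expected to have finished reduction" error.
unused = [n for n, p in model.named_parameters() if p.grad is None]
print(unused)  # ['fc2.weight', 'fc2.bias']
```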
Solution:
Comment out (or delete) any submodule defined in __init__ but not used in forward. In other words, only define the modules you actually use; do not define modules and then leave them unused. Alternatively, as the error message suggests, you can pass find_unused_parameters=True to DistributedDataParallel, but that adds per-iteration overhead, so removing the unused modules is the cleaner fix.
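Applying the fix to the same illustrative module: comment out the unused layer so that every parameter the model owns participates in the loss, and DDP's reducer has nothing left to wait for.

```python
import torch
import torch.nn as nn

class FixedNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(4, 4)
        # self.fc2 = nn.Linear(4, 2)  # commented out: not used in forward

    def forward(self, x):
        return self.fc1(x)

model = FixedNet()
model(torch.randn(3, 4)).sum().backward()

# Every registered parameter now receives a gradient, so DDP's
# all-reduce can complete on each iteration.
assert all(p.grad is not None for p in model.parameters())

# If unused parameters are genuinely unavoidable (e.g. branches that
# are only active on some iterations), wrap the model with:
#   nn.parallel.DistributedDataParallel(model, find_unused_parameters=True)
```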