RuntimeError: Expected to have finished reduction in the prior iteration before starting a new one.

Problem: training runs fine in non-distributed (single-GPU) mode, but the error above is raised when switching to multi-GPU distributed training.

Error message
RuntimeError: Expected to have finished reduction in the prior iteration before starting a new one. This error indicates that your module has parameters that were not used in producing loss. You can enable unused parameter detection by (1) passing the keyword argument find_unused_parameters=True to torch.nn.parallel.DistributedDataParallel; (2) making sure all forward function outputs participate in calculating loss. If you already have done the above two steps, then the distributed data parallel module wasn't able to locate the output tensors in the return value of your module's forward function. Please include the loss function and the structure of the return value of forward of your module when reporting this issue (e.g. list, dict, iterable).

Cause:

The module's __init__ method defines submodules that carry parameters, but those submodules are never called in the forward function. Their parameters therefore never receive gradients, and DDP's gradient reducer waits for them indefinitely.
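A minimal sketch of this failure mode (the class and layer names here are illustrative, not from the original post). A layer registered in __init__ but never called in forward() produces no gradient; you can spot such parameters even in a single-process run by checking whose .grad is still None after backward():

```python
import torch
import torch.nn as nn

class Net(nn.Module):
    def __init__(self):
        super().__init__()
        self.used = nn.Linear(4, 2)
        # Registered parameters, but never called in forward():
        # under DDP this is exactly what triggers the reduction error.
        self.unused = nn.Linear(4, 2)

    def forward(self, x):
        return self.used(x)

net = Net()
net(torch.randn(3, 4)).sum().backward()
# Parameters whose grad is still None did not participate in the loss.
unused = [name for name, p in net.named_parameters() if p.grad is None]
```

Running this check before wrapping the model in DistributedDataParallel is a quick way to confirm which parameters the reducer would be stuck waiting on.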

Solution

Comment out (or delete) any modules defined in __init__ that the forward function does not use. In other words, only define the modules you actually use; do not register modules and then leave them uncalled.
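A minimal before/after sketch of the fix (names illustrative): with the unused layer commented out, every registered parameter participates in the loss and DDP's reducer has nothing left to wait for:

```python
import torch
import torch.nn as nn

class Net(nn.Module):
    def __init__(self):
        super().__init__()
        self.used = nn.Linear(4, 2)
        # self.unused = nn.Linear(4, 2)  # never called in forward(): commented out

    def forward(self, x):
        return self.used(x)

net = Net()
net(torch.randn(3, 4)).sum().backward()
# With no dead submodules, every parameter now receives a gradient.
all_have_grad = all(p.grad is not None for p in net.parameters())
```

If the extra module genuinely has to stay (for example, it is only exercised on some forward paths), the other option given in the error message is to wrap the model with torch.nn.parallel.DistributedDataParallel(model, find_unused_parameters=True), at the cost of an extra parameter traversal each iteration.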


Reposted from blog.csdn.net/u014295602/article/details/130781392