Event: while running multi-task learning in PyTorch, the machine froze somewhere between epoch 30 and 60 — even on Ubuntu.
Cause: after some unstructured digging, the likely culprit was a memory leak.
Solution:
When writing data to log files for TensorBoard visualization, remember to close the SummaryWriter when you are done:
import os
from tensorboardX import SummaryWriter

writer = SummaryWriter(os.path.join(ckptDir, 'logs'))
for epoch in range(num_epochs):
    ...
    # tensorboardX
    writer.add_scalar('learning rate', lr, epoch + 1)
    writer.add_scalars('loss', {'train loss': train_loss, 'validation loss': val_loss}, epoch + 1)
    writer.add_scalars('accuracy', {'train accuracy': train_acc, 'validation accuracy': val_acc}, epoch + 1)
    writer.add_scalars('balanced accuracy', {'train bacc': train_bacc, 'validation bacc': val_bacc}, epoch + 1)
    ...
# This is the key line: release the writer's file handle and flush its buffers.
writer.close()
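A single `writer.close()` at the end is skipped if training crashes partway through, so the close is safer inside a `try`/`finally` block (or a `with` statement, which SummaryWriter supports as a context manager). A minimal sketch of the pattern, using a hypothetical `DummyWriter` stand-in so it runs without torch or tensorboardX installed:

```python
# DummyWriter mimics the close()/context-manager behavior of SummaryWriter
# so the cleanup pattern can be demonstrated without any dependencies.
class DummyWriter:
    def __init__(self):
        self.closed = False

    def add_scalar(self, tag, value, step):
        pass  # a real SummaryWriter would buffer an event here

    def close(self):
        self.closed = True

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc, tb):
        self.close()
        return False  # do not swallow exceptions

writer = DummyWriter()
try:
    for epoch in range(3):
        writer.add_scalar('loss', 1.0 / (epoch + 1), epoch + 1)
        if epoch == 2:
            raise RuntimeError('simulated crash mid-training')
except RuntimeError:
    pass
finally:
    writer.close()  # runs even when the loop raises

assert writer.closed
```

With the real SummaryWriter, `with SummaryWriter(log_dir) as writer:` achieves the same guarantee with less code.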
Reference: Drux @ https://stackoverflow.com/questions/44831317/tensorboard-unble-to-get-first-event-timestamp-for-run