Notes on training with a third-party Caffe implementation of the combined margin layer (https://github.com/gehaocool/CombinedMargin-caffe). Several problems came up during training and are recorded here.
1. Training with the res36-E network from the open-source project
Training went fairly smoothly. Although the number of epochs was trimmed to finish sooner, the final loss still dropped to around 2, roughly matching the loss reported by the project, and the training classification accuracy held steady between 0.7 and 0.8.
I1222 12:24:23.999287 2126 solver.cpp:228] Iteration 277750, loss = 2.11202
I1222 12:24:23.999418 2126 solver.cpp:244] Train net output #0: accuracy = 0.734375
I1222 12:24:23.999433 2126 solver.cpp:244] Train net output #1: softmax_loss = 1.72783 (* 1 = 1.72783 loss)
I1222 12:24:23.999442 2126 sgd_solver.cpp:106] Iteration 277750, lr = 1e-05
I1222 12:25:15.731434 2126 solver.cpp:228] Iteration 277800, loss = 1.79811
I1222 12:25:15.731550 2126 solver.cpp:244] Train net output #0: accuracy = 0.78125
I1222 12:25:15.731564 2126 solver.cpp:244] Train net output #1: softmax_loss = 1.98719 (* 1 = 1.98719 loss)
I1222 12:25:15.731572 2126 sgd_solver.cpp:106] Iteration 277800, lr = 1e-05
I1222 12:26:07.457556 2126 solver.cpp:228] Iteration 277850, loss = 2.63742
I1222 12:26:07.457665 2126 solver.cpp:244] Train net output #0: accuracy = 0.65625
I1222 12:26:07.457679 2126 solver.cpp:244] Train net output #1: softmax_loss = 2.20569 (* 1 = 2.20569 loss)
I1222 12:26:07.457686 2126 sgd_solver.cpp:106] Iteration 277850, lr = 1e-05
I1222 12:26:59.228655 2126 solver.cpp:228] Iteration 277900, loss = 1.94271
I1222 12:26:59.228780 2126 solver.cpp:244] Train net output #0: accuracy = 0.796875
I1222 12:26:59.228792 2126 solver.cpp:244] Train net output #1: softmax_loss = 1.22154 (* 1 = 1.22154 loss)
I1222 12:26:59.228806 2126 sgd_solver.cpp:106] Iteration 277900, lr = 1e-05
I1222 12:27:51.053869 2126 solver.cpp:228] Iteration 277950, loss = 1.81116
I1222 12:27:51.053997 2126 solver.cpp:244] Train net output #0: accuracy = 0.65625
I1222 12:27:51.054013 2126 solver.cpp:244] Train net output #1: softmax_loss = 1.97967 (* 1 = 1.97967 loss)
I1222 12:27:51.054021 2126 sgd_solver.cpp:106] Iteration 277950, lr = 1e-05
I1222 12:28:42.938175 2126 solver.cpp:228] Iteration 278000, loss = 2.22095
I1222 12:28:42.938287 2126 solver.cpp:244] Train net output #0: accuracy = 0.671875
I1222 12:28:42.938300 2126 solver.cpp:244] Train net output #1: softmax_loss = 2.34338 (* 1 = 2.34338 loss)
I1222 12:28:42.938309 2126 sgd_solver.cpp:106] Iteration 278000, lr = 1e-05
I1222 12:29:34.774622 2126 solver.cpp:228] Iteration 278050, loss = 2.35736
I1222 12:29:34.774727 2126 solver.cpp:244] Train net output #0: accuracy = 0.71875
I1222 12:29:34.774741 2126 solver.cpp:244] Train net output #1: softmax_loss = 2.62041 (* 1 = 2.62041 loss)
I1222 12:29:34.774749 2126 sgd_solver.cpp:106] Iteration 278050, lr = 1e-05
I1222 12:30:26.505686 2126 solver.cpp:228] Iteration 278100, loss = 1.95937
I1222 12:30:26.505806 2126 solver.cpp:244] Train net output #0: accuracy = 0.71875
I1222 12:30:26.505820 2126 solver.cpp:244] Train net output #1: softmax_loss = 2.02977 (* 1 = 2.02977 loss)
I1222 12:30:26.505829 2126 sgd_solver.cpp:106] Iteration 278100, lr = 1e-05
I1222 12:31:18.296882 2126 solver.cpp:228] Iteration 278150, loss = 1.59377
I1222 12:31:18.296995 2126 solver.cpp:244] Train net output #0: accuracy = 0.796875
I1222 12:31:18.297009 2126 solver.cpp:244] Train net output #1: softmax_loss = 1.45988 (* 1 = 1.45988 loss)
I1222 12:31:18.297017 2126 sgd_solver.cpp:106] Iteration 278150, lr = 1e-05
I1222 12:32:10.071180 2126 solver.cpp:228] Iteration 278200, loss = 2.38386
I1222 12:32:10.071321 2126 solver.cpp:244] Train net output #0: accuracy = 0.609375
I1222 12:32:10.071333 2126 solver.cpp:244] Train net output #1: softmax_loss = 3.07407 (* 1 = 3.07407 loss)
I1222 12:32:10.071341 2126 sgd_solver.cpp:106] Iteration 278200, lr = 1e-05
2. Training with an SE-ResXt50-E network
To try to reproduce some of the more complex models from the paper, I designed an SE-ResXt50-E network for the reproduction. The following issues came up during training.
(1) Gradient explosion
Finetuning for the face recognition task from the SE-ResXt50 pretrained model with the combined margin loss, the loss jumped to 87.3365 at iteration 7850, i.e. the gradients exploded.
I1222 16:49:02.153913 7256 solver.cpp:228] Iteration 7650, loss = 24.1841
I1222 16:49:02.153985 7256 solver.cpp:244] Train net output #0: accuracy = 0
I1222 16:49:02.153998 7256 solver.cpp:244] Train net output #1: softmax_loss = 24.1341 (* 1 = 24.1341 loss)
I1222 16:49:02.154006 7256 sgd_solver.cpp:106] Iteration 7650, lr = 0.01
I1222 16:49:45.972348 7256 solver.cpp:228] Iteration 7700, loss = 24.1163
I1222 16:49:45.972483 7256 solver.cpp:244] Train net output #0: accuracy = 0
I1222 16:49:45.972497 7256 solver.cpp:244] Train net output #1: softmax_loss = 24.0554 (* 1 = 24.0554 loss)
I1222 16:49:45.972504 7256 sgd_solver.cpp:106] Iteration 7700, lr = 0.01
I1222 16:50:29.802949 7256 solver.cpp:228] Iteration 7750, loss = 24.0914
I1222 16:50:29.803061 7256 solver.cpp:244] Train net output #0: accuracy = 0
I1222 16:50:29.803073 7256 solver.cpp:244] Train net output #1: softmax_loss = 24.1256 (* 1 = 24.1256 loss)
I1222 16:50:29.803081 7256 sgd_solver.cpp:106] Iteration 7750, lr = 0.01
I1222 16:51:13.615972 7256 solver.cpp:228] Iteration 7800, loss = 24.071
I1222 16:51:13.616046 7256 solver.cpp:244] Train net output #0: accuracy = 0
I1222 16:51:13.616060 7256 solver.cpp:244] Train net output #1: softmax_loss = 24.0491 (* 1 = 24.0491 loss)
I1222 16:51:13.616066 7256 sgd_solver.cpp:106] Iteration 7800, lr = 0.01
I1222 16:51:57.547444 7256 solver.cpp:228] Iteration 7850, loss = 87.3365
I1222 16:51:57.547547 7256 solver.cpp:244] Train net output #0: accuracy = 0
I1222 16:51:57.547559 7256 solver.cpp:244] Train net output #1: softmax_loss = 87.3365 (* 1 = 87.3365 loss)
I1222 16:51:57.547566 7256 sgd_solver.cpp:106] Iteration 7850, lr = 0.01
I1222 16:52:41.812505 7256 solver.cpp:228] Iteration 7900, loss = 87.3365
I1222 16:52:41.812608 7256 solver.cpp:244] Train net output #0: accuracy = 0
I1222 16:52:41.812621 7256 solver.cpp:244] Train net output #1: softmax_loss = 87.3365 (* 1 = 87.3365 loss)
I1222 16:52:41.812628 7256 sgd_solver.cpp:106] Iteration 7900, lr = 0.01
I1222 16:53:26.071609 7256 solver.cpp:228] Iteration 7950, loss = 87.3365
I1222 16:53:26.071732 7256 solver.cpp:244] Train net output #0: accuracy = 0
I1222 16:53:26.071743 7256 solver.cpp:244] Train net output #1: softmax_loss = 87.3365 (* 1 = 87.3365 loss)
I1222 16:53:26.071750 7256 sgd_solver.cpp:106] Iteration 7950, lr = 0.01
I1222 16:54:10.328733 7256 solver.cpp:228] Iteration 8000, loss = 87.3365
I1222 16:54:10.328857 7256 solver.cpp:244] Train net output #0: accuracy = 0
I1222 16:54:10.328871 7256 solver.cpp:244] Train net output #1: softmax_loss = 87.3365 (* 1 = 87.3365 loss)
I1222 16:54:10.328878 7256 sgd_solver.cpp:106] Iteration 8000, lr = 0.01
I1222 16:54:54.585865 7256 solver.cpp:228] Iteration 8050, loss = 87.3365
I1222 16:54:54.585988 7256 solver.cpp:244] Train net output #0: accuracy = 0
I1222 16:54:54.586001 7256 solver.cpp:244] Train net output #1: softmax_loss = 87.3365 (* 1 = 87.3365 loss)
I1222 16:54:54.586009 7256 sgd_solver.cpp:106] Iteration 8050, lr = 0.01
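Incidentally, the plateau value itself is diagnostic: 87.3365 is exactly −ln(FLT_MIN). Caffe's SoftmaxWithLoss clamps each predicted probability at FLT_MIN before taking the log, so once the network assigns numerically zero probability to every correct label, the loss saturates at this constant instead of reporting inf:

```python
import math

# FLT_MIN: smallest positive normalized IEEE-754 single-precision float.
FLT_MIN = 1.17549435e-38

# Caffe's SoftmaxWithLoss computes -log(max(p, FLT_MIN)), so the
# per-sample loss can never exceed -ln(FLT_MIN).
max_loss = -math.log(FLT_MIN)
print(round(max_loss, 4))  # 87.3365 — the exact plateau seen in the log
```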
Cause:
The BN layers in the training prototxt had use_global_stats set to true (the deploy-file setting), so they kept using the previously saved moving statistics instead of computing batch statistics during training, which caused the loss to blow up.
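The fix is to use the training-mode BatchNorm setting in the train prototxt. A sketch (the layer/blob names are illustrative, not from this project):

```
# Training-phase BatchNorm: compute statistics from the current
# mini-batch instead of reusing the saved moving averages.
layer {
  name: "conv1_bn"
  type: "BatchNorm"
  bottom: "conv1"
  top: "conv1"
  batch_norm_param {
    use_global_stats: false   # true belongs only in the deploy prototxt
  }
}
```

In recent Caffe versions you can also omit use_global_stats entirely and let the layer infer it from the phase (false in TRAIN, true in TEST).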
(2) Getting stuck in a local minimum
Training from scratch (lr = 0.1) without loading the SE-ResXt50 pretrained weights, the loss dropped to 15.063 by around iteration 3900 and then stayed at that value, no longer changing.
I1224 10:28:33.599040 2932 solver.cpp:228] Iteration 3550, loss = 15.0634
I1224 10:28:33.599190 2932 solver.cpp:244] Train net output #0: accuracy = 0
I1224 10:28:33.599203 2932 solver.cpp:244] Train net output #1: softmax_loss = 15.0634 (* 1 = 15.0634 loss)
I1224 10:28:33.599211 2932 sgd_solver.cpp:106] Iteration 3550, lr = 0.1
I1224 10:29:27.784029 2932 solver.cpp:228] Iteration 3600, loss = 15.0633
I1224 10:29:27.784162 2932 solver.cpp:244] Train net output #0: accuracy = 0
I1224 10:29:27.784175 2932 solver.cpp:244] Train net output #1: softmax_loss = 15.0633 (* 1 = 15.0633 loss)
I1224 10:29:27.784184 2932 sgd_solver.cpp:106] Iteration 3600, lr = 0.1
I1224 10:30:21.521454 2932 solver.cpp:228] Iteration 3650, loss = 15.0632
I1224 10:30:21.521585 2932 solver.cpp:244] Train net output #0: accuracy = 0
I1224 10:30:21.521598 2932 solver.cpp:244] Train net output #1: softmax_loss = 15.0632 (* 1 = 15.0632 loss)
I1224 10:30:21.521605 2932 sgd_solver.cpp:106] Iteration 3650, lr = 0.1
I1224 10:31:15.220844 2932 solver.cpp:228] Iteration 3700, loss = 15.0632
I1224 10:31:15.220999 2932 solver.cpp:244] Train net output #0: accuracy = 0
I1224 10:31:15.221010 2932 solver.cpp:244] Train net output #1: softmax_loss = 15.0632 (* 1 = 15.0632 loss)
I1224 10:31:15.221019 2932 sgd_solver.cpp:106] Iteration 3700, lr = 0.1
I1224 10:32:09.000339 2932 solver.cpp:228] Iteration 3750, loss = 15.0631
I1224 10:32:09.001657 2932 solver.cpp:244] Train net output #0: accuracy = 0
I1224 10:32:09.001670 2932 solver.cpp:244] Train net output #1: softmax_loss = 15.0631 (* 1 = 15.0631 loss)
I1224 10:32:09.001678 2932 sgd_solver.cpp:106] Iteration 3750, lr = 0.1
I1224 10:33:02.783416 2932 solver.cpp:228] Iteration 3800, loss = 15.0631
I1224 10:33:02.783567 2932 solver.cpp:244] Train net output #0: accuracy = 0
I1224 10:33:02.783579 2932 solver.cpp:244] Train net output #1: softmax_loss = 15.0631 (* 1 = 15.0631 loss)
I1224 10:33:02.783587 2932 sgd_solver.cpp:106] Iteration 3800, lr = 0.1
I1224 10:33:56.534090 2932 solver.cpp:228] Iteration 3850, loss = 15.0631
I1224 10:33:56.534256 2932 solver.cpp:244] Train net output #0: accuracy = 0
I1224 10:33:56.534270 2932 solver.cpp:244] Train net output #1: softmax_loss = 15.0631 (* 1 = 15.0631 loss)
I1224 10:33:56.534277 2932 sgd_solver.cpp:106] Iteration 3850, lr = 0.1
I1224 10:34:50.398485 2932 solver.cpp:228] Iteration 3900, loss = 15.063
I1224 10:34:50.398706 2932 solver.cpp:244] Train net output #0: accuracy = 0
I1224 10:34:50.398721 2932 solver.cpp:244] Train net output #1: softmax_loss = 15.063 (* 1 = 15.063 loss)
I1224 10:34:50.398731 2932 sgd_solver.cpp:106] Iteration 3900, lr = 0.1
I1224 10:35:44.156437 2932 solver.cpp:228] Iteration 3950, loss = 15.063
I1224 10:35:44.156621 2932 solver.cpp:244] Train net output #0: accuracy = 0
I1224 10:35:44.156635 2932 solver.cpp:244] Train net output #1: softmax_loss = 15.063 (* 1 = 15.063 loss)
I1224 10:35:44.156643 2932 sgd_solver.cpp:106] Iteration 3950, lr = 0.1
I1224 10:36:37.523526 2932 solver.cpp:228] Iteration 4000, loss = 15.063
I1224 10:36:37.523660 2932 solver.cpp:244] Train net output #0: accuracy = 0
I1224 10:36:37.523672 2932 solver.cpp:244] Train net output #1: softmax_loss = 15.063 (* 1 = 15.063 loss)
I1224 10:36:37.523680 2932 sgd_solver.cpp:106] Iteration 4000, lr = 0.1
Judging from the next experiment, there are two likely causes:
1) Starting from randomly initialized weights with a fairly large initial lr, training easily falls into a local minimum (i.e. the weight initialization matters a great deal).
2) The learning rate itself may simply be too large and should be reduced.
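One hedged mitigation for cause 2) is to start an order of magnitude lower and decay in steps. An illustrative solver.prototxt fragment (the concrete values are guesses, not this project's settings):

```
# Illustrative solver settings for training from scratch.
base_lr: 0.01            # start well below the 0.1 that got stuck
lr_policy: "multistep"
gamma: 0.1
stepvalue: 100000
stepvalue: 160000
momentum: 0.9
weight_decay: 0.0005
```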
(3) Loss decreasing normally
Loading the SE-ResXt50 pretrained weights again and finetuning, the loss decreases normally (currently down to around 5-6).
3. To be continued
Appendix: other training ideas
It has been reported that, compared with the mxnet version, insightface-caffe converges more slowly and is harder to train, for reasons that remain unclear. One possible workaround:
1) First train a face recognition classifier with a conventional softmax loss;
2) Then, starting from that trained model, swap in the arcface / combined margin loss and finetune; this can improve both training efficiency and the final result.
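The two-stage scheme maps onto the Caffe CLI roughly as follows (solver and snapshot names are placeholders):

```
# Stage 1: train the backbone with a plain softmax classification loss.
caffe train --solver=solver_softmax.prototxt

# Stage 2: switch the network definition to the combined-margin loss
# and finetune from the stage-1 snapshot.
caffe train --solver=solver_margin.prototxt \
            --weights=snapshots/softmax_iter_200000.caffemodel
```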
Alternatively:
Take a publicly available pretrained model and finetune it directly on your own face recognition data.
Another idea:
Reduce the m value (default 30 or 64; with those defaults this is the scale factor applied to the cosine logits in most implementations): start with a very small value, then finetune the trained model with a larger one (this scheme has not been verified).
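As a sanity check of that idea, here is a minimal sketch (plain Python, not the project's code) of the combined-margin target logit s·(cos(m1·θ + m2) − m3) used by insightface-style losses. Shrinking the scale compresses the logit range, which makes the softmax less peaked and the gradients gentler early in training:

```python
import math

def combined_margin_logit(cos_theta, s=64.0, m1=1.0, m2=0.5, m3=0.0):
    """Target-class logit s * (cos(m1*theta + m2) - m3); the default
    margins here correspond to plain ArcFace (illustrative values)."""
    theta = math.acos(max(-1.0, min(1.0, cos_theta)))
    return s * (math.cos(m1 * theta + m2) - m3)

# A smaller scale keeps all logits in a narrow range, so the loss
# surface is smoother at the start of training.
for s in (8.0, 30.0, 64.0):
    print(s, combined_margin_logit(0.8, s=s))
```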