Bonus Chapter: Training Tesseract for Optical Character Recognition

Prerequisites

The figure shows all the files in the jTessBoxEditor working directory:
[Figure: files in the jTessBoxEditor working directory]

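Before creating the training text, decide which typeface needs to be recognized. If the target is SimSun (宋体), for example, set the font family to SimSun when typing the sample text. The figure below shows the font the author needs to recognize with OCR.
[Figure: sample text in the target font]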
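The commands in this post all work on files named zh_CN6.song.exp0.*, which follows the Tesseract 3.x training convention of language, font name and sample number separated by dots. As a small illustration (the variable names are made up for this sketch), the parts can be pulled apart with plain parameter expansion:

```shell
# Sketch: split a training file name of the form
#   <lang>.<fontname>.exp<num>.tif
# into its parts. Variable names are illustrative only.
name="zh_CN6.song.exp0.tif"

base="${name%.tif}"     # strip the extension
lang="${base%%.*}"      # everything before the first dot
rest="${base#*.}"       # everything after the first dot
font="${rest%%.*}"      # font name, here the label chosen for SimSun
exp="${rest#*.exp}"     # sample number

echo "lang=$lang font=$font exp=$exp"
```

The font part ("song" here) is only a label, but it has to be used consistently in every later step.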

Generate the box file for the training image:

> tesseract zh_CN6.song.exp0.tif zh_CN6.song.exp0 batch.nochop makebox
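The makebox step writes zh_CN6.song.exp0.box, a text file with one glyph per line followed by its bounding box and page number. A quick sanity check can catch malformed lines before training. The check_box function below is a hypothetical helper written for this post, not a Tesseract tool, and it assumes awk is available:

```shell
# Sketch: sanity-check a box file. Each line is expected to read
#   <glyph> <left> <bottom> <right> <top> <page>
# with coordinates measured from the bottom-left corner of the image.
# check_box is a hypothetical helper, not part of Tesseract.
check_box() {
    awk '
        NF != 6 {
            printf "line %d: expected 6 fields, got %d\n", NR, NF
            bad = 1
            next
        }
        $2 >= $4 || $3 >= $5 {
            printf "line %d: empty or inverted box\n", NR
            bad = 1
        }
        END { exit bad }
    ' "$1"
}

# Usage (after makebox has produced the file):
#   check_box zh_CN6.song.exp0.box && echo "box file looks sane"
```

The function only checks the shape of each line. Whether a box really surrounds the right character still has to be verified by eye in jTessBoxEditor.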

Run the following command to produce the feature file zh_CN6.song.exp0.tr from the image and its box file.

> tesseract zh_CN6.song.exp0.tif zh_CN6.song.exp0 nobatch box.train

Extract the character set from the box file, which writes a unicharset file:

> unicharset_extractor zh_CN6.song.exp0.box
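The shapeclustering and mftraining commands that follow read a font_properties file, so it has to exist in the working directory before they run. A minimal sketch, assuming the font label is "song" as in the file names and that none of the style flags apply (the all-zero flags are an assumption, adjust them for your font):

```shell
# Sketch: create the font_properties file read via -F font_properties.
# Line format: <fontname> <italic> <bold> <fixed> <serif> <fraktur>
# Each flag is 0 or 1. "song" must match the font part of
# zh_CN6.song.exp0; the all-zero flags are an assumption.
echo "song 0 0 0 0 0" > font_properties
cat font_properties
```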

Cluster the character shapes, which writes a shapetable file:

> shapeclustering -F font_properties -U unicharset -O unicharset zh_CN6.song.exp0.tr

Train the shape features, which writes inttemp and pffmtable:

> mftraining -F font_properties -U unicharset -O unicharset zh_CN6.song.exp0.tr

Train the character normalization data, which writes normproto:

> cntraining zh_CN6.song.exp0.tr

Rename the five files generated above so that they carry the zh_CN6. prefix. Renaming them by hand works too.

mv normproto zh_CN6.normproto
mv inttemp zh_CN6.inttemp
mv pffmtable zh_CN6.pffmtable
mv shapetable zh_CN6.shapetable
mv unicharset zh_CN6.unicharset
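The same five renames can also be written as a loop that reports any file that is missing, which usually means one of the earlier steps failed. The function name rename_outputs is made up for this sketch:

```shell
# Sketch: prefix the five training outputs in one loop.
# rename_outputs is a hypothetical helper; the post uses the prefix zh_CN6.
rename_outputs() {
    prefix="$1"
    status=0
    for f in normproto inttemp pffmtable shapetable unicharset; do
        if [ -f "$f" ]; then
            mv "$f" "$prefix.$f"
        else
            # A missing file points at a failed earlier step.
            echo "missing: $f" >&2
            status=1
        fi
    done
    return "$status"
}

# Usage, from the training directory:
#   rename_outputs zh_CN6
```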

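Finally, combine the renamed files into a single traineddata file (the trailing dot is part of the argument):

> combine_tessdata zh_CN6.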

If every step succeeded, the result is a traineddata file (zh_CN6.traineddata with the names used here).
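To use the trained language, the traineddata file has to sit in the tessdata directory of your Tesseract installation. The sketch below copies it there. The function name install_traineddata is hypothetical, and the example paths and sample.png are placeholders, since the tessdata location differs between installations:

```shell
# Sketch: copy a traineddata file into a tessdata directory.
# install_traineddata is a hypothetical helper; where tessdata lives
# depends on how Tesseract was installed.
install_traineddata() {
    src="$1"     # e.g. zh_CN6.traineddata
    dest="$2"    # the tessdata directory of your installation
    [ -f "$src" ] || { echo "not found: $src" >&2; return 1; }
    [ -d "$dest" ] || { echo "not a directory: $dest" >&2; return 1; }
    cp "$src" "$dest/"
}

# Usage (placeholder paths):
#   install_traineddata zh_CN6.traineddata /path/to/tessdata
#   tesseract sample.png result -l zh_CN6
```

The -l option selects the language by the traineddata file's prefix, so the name given there must match the one used during training.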

https://blog.csdn.net/qq_49710945/article/details/107688109