TextSnake

Scene text detection: some current approaches

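CTPN: horizontal detection boxes only
RRPN: rotated rectangular boxes
EAST: rotated rectangles or arbitrary quadrilaterals
TextBoxes: horizontal rectangular boxes
TextBoxes++: rotated rectangular boxes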
https://zhuanlan.zhihu.com/p/38655369?utm_source=qq&utm_medium=social

Representation

[Figure: TextSnake representation]

Pipeline

[Figure: pipeline]

Network architecture

[Figure: network architecture. Blue blocks are the convolution stages of VGG-16.]
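Inspired by FPN and U-Net, we adopt a scheme that gradually merges features from different levels of the stem network.
The stem network can be any CNN proposed for image classification, such as VGG-16/19 or ResNet. These networks can be divided into 5 convolution stages followed by a few fully connected layers.
The fully connected layers are removed, and the feature map of each stage is fed into the feature merging network.
The feature merging network (the lower half of the architecture figure) consists of several stages stacked sequentially. Each merging unit takes as input the feature map of the previous stage and the corresponding stem-network feature map (as shown in the architecture figure).
The merging unit is defined as:
- $h_1 = f_5$
- $h_i = conv_{3\times 3}(conv_{1\times 1}[f_{6-i}; UpSampling_{\times 2}(h_{i-1})])$, for $i = 2, 3, 4, 5$
$f_i$ denotes the feature map of the $i$-th stage of the stem network, and $h_i$ denotes the feature map of the corresponding merging unit. Upsampling is implemented with a deconvolutional layer [1].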
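To see how the resolutions line up in the merging units, the recursion above can be traced on feature-map sizes alone. This is only a sketch under an assumption that is not stated in the notes: stage $i$ of the stem downsamples the input by $2^i$ (as in a VGG-16 style stem where every stage ends with 2x pooling). `merge_shapes` is a hypothetical helper name, not part of any library.

```python
def merge_shapes(height, width, num_stages=5):
    """Trace (height, width) of f_i and h_i through the merging units."""
    # Assumption: stage i of the stem has an overall stride of 2**i.
    f = {i: (height // 2**i, width // 2**i) for i in range(1, num_stages + 1)}

    # h_1 = f_5
    h = {1: f[num_stages]}
    for i in range(2, num_stages + 1):
        # UpSampling_x2(h_{i-1}) doubles the spatial size ...
        up = (h[i - 1][0] * 2, h[i - 1][1] * 2)
        # ... and must match f_{6-i} so the two maps can be concatenated.
        skip = f[num_stages + 1 - i]
        if up != skip:
            raise ValueError(f"size mismatch at unit {i}: {up} vs {skip}")
        # The 1x1 and 3x3 convolutions are assumed to preserve spatial size.
        h[i] = up
    return f, h


f, h = merge_shapes(512, 512)
print(h[1], h[5])
```

Under this assumption the input size should be divisible by 32, otherwise the upsampled map and the skip connection no longer match and the helper raises an error.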
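After merging, we obtain a feature map half the size of the input image. An additional upsampling layer and 2 convolutional layers produce the predictions:
- $h_{final} = UpSampling_{\times 2}(h_5)$
- $P = conv_{1\times 1}(conv_{3\times 3}(h_{final}))$
$P \in R^{h \times w \times 7}$: 4 channels hold the logits of TR/TCL, and the last 3 channels are $r$, $\cos\theta$, and $\sin\theta$ of the text instance.
The final predictions are obtained by taking the softmax over TR/TCL and normalizing $\cos\theta$ and $\sin\theta$ so that their squares sum to 1.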
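The post-processing of $P$ described above might look roughly like the NumPy sketch below. The channel order (2 TR logits, 2 TCL logits, then $r$, $\cos\theta$, $\sin\theta$) and the function name `postprocess` are assumptions made for illustration; the notes only say that 4 channels are TR/TCL logits and 3 are geometry.

```python
import numpy as np


def softmax(x):
    # Numerically stable softmax over the last axis.
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)


def postprocess(P, eps=1e-8):
    """Split an (h, w, 7) prediction into TR, TCL, r, cos, sin maps."""
    # Assumed layout: [0:2] TR logits, [2:4] TCL logits, 4: r, 5: cos, 6: sin.
    tr = softmax(P[..., 0:2])[..., 1]    # probability of "text region"
    tcl = softmax(P[..., 2:4])[..., 1]   # probability of "text center line"
    r = P[..., 4]
    cos_t, sin_t = P[..., 5], P[..., 6]
    # Rescale so that cos^2 + sin^2 = 1 at every pixel.
    norm = np.sqrt(cos_t**2 + sin_t**2) + eps
    return tr, tcl, r, cos_t / norm, sin_t / norm
```

For example, a pixel with raw outputs $(\cos\theta, \sin\theta) = (3, 4)$ is rescaled to $(0.6, 0.8)$.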

Inference

Assume we already have an FCN that predicts TCL, TR, radii, sin, and cos well. The next step is to turn these maps into the final text prediction.
First, thresholds $T_{tcl}$ and $T_{tr}$ are applied to the TCL and TR maps, respectively.
Then the intersection of TR and TCL gives the final TCL prediction. Using a disjoint-set, the TCL pixels can be efficiently separated into different text instances.
Finally, a striding algorithm is designed to extract an ordered point list that indicates the shape and course of the text instance, and also to reconstruct the text instance areas.
Two simple heuristics are applied to filter out false positive text instances:
- The number of TCL pixels should be at least 0.2 times their average radius
- At least half of the pixels in the reconstructed text area should be classified as TR
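The thresholding, intersection, and disjoint-set grouping steps can be sketched as follows. The threshold defaults of 0.5 are placeholders rather than values from the notes, 4-connectivity is an assumption, and `segment_tcl` / `keep_instance` are hypothetical names.

```python
import numpy as np


def segment_tcl(tr_prob, tcl_prob, t_tr=0.5, t_tcl=0.5):
    """Label TCL pixels by instance using a disjoint-set (union-find)."""
    # Final TCL mask = thresholded TCL intersected with thresholded TR.
    mask = (tr_prob > t_tr) & (tcl_prob > t_tcl)
    h, w = mask.shape
    parent = list(range(h * w))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[rb] = ra

    # Merge each TCL pixel with its right and lower neighbours (4-connectivity).
    for y in range(h):
        for x in range(w):
            if not mask[y, x]:
                continue
            if x + 1 < w and mask[y, x + 1]:
                union(y * w + x, y * w + x + 1)
            if y + 1 < h and mask[y + 1, x]:
                union(y * w + x, (y + 1) * w + x)

    # Give every set a compact label 1..K; background stays 0.
    labels = np.zeros((h, w), dtype=int)
    ids = {}
    for y in range(h):
        for x in range(w):
            if mask[y, x]:
                labels[y, x] = ids.setdefault(find(y * w + x), len(ids) + 1)
    return labels


def keep_instance(num_tcl_pixels, mean_radius, tr_fraction):
    """Apply the two false-positive heuristics to one candidate instance."""
    return num_tcl_pixels >= 0.2 * mean_radius and tr_fraction >= 0.5
```

Here `tr_fraction` stands for the share of pixels in the reconstructed text area that are classified as TR, which would be computed after the striding step.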

The procedure for the striding algorithm

[Figure: procedure of the striding algorithm]

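First, pick an arbitrary point in the TCL region as the starting point and centralize it: using sin and cos at that point, draw the tangent line (dashed) and the normal line (solid); the midpoint of the two points where the normal line intersects the TCL region is the centralized point (Act(a)).
Then search in the two opposite directions, alternating striding (a step along the tangent direction with displacement $(\frac{1}{2} r \cos\theta, \frac{1}{2} r \sin\theta)$, or its negative for the opposite direction) and centralizing (Act(a)), until the ends are reached ("expanding to ends" in the figure above).
Both directions from the starting point are traced to the ends. A circle is drawn at each center point, and everything covered by the circles is output as the predicted text.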
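The procedure above can be sketched for a single text instance as follows. This is a simplified illustration, not the reference implementation: points are `(x, y)` pairs, maps are indexed `[y, x]`, the normal is probed in half-pixel steps, a `max_steps` guard is added to guarantee termination, and all function names are hypothetical.

```python
import numpy as np


def inside(mask, p):
    x, y = int(round(p[0])), int(round(p[1]))
    return 0 <= y < mask.shape[0] and 0 <= x < mask.shape[1] and bool(mask[y, x])


def centralize(mask, p, cos_t, sin_t, step=0.5):
    """Act(a): move p to the midpoint of the normal's span inside the TCL mask."""
    normal = np.array([-sin_t, cos_t])
    ends = []
    for sign in (1.0, -1.0):
        q = np.array(p, dtype=float)
        while inside(mask, q + sign * step * normal):
            q = q + sign * step * normal
        ends.append(q)
    return (ends[0] + ends[1]) / 2


def trace_center_line(mask, r_map, cos_map, sin_map, start, max_steps=1000):
    """Stride and centralize in both directions; return ordered center points."""
    def lookup(p):
        x, y = int(round(p[0])), int(round(p[1]))
        return r_map[y, x], cos_map[y, x], sin_map[y, x]

    _, c, s = lookup(start)
    center = centralize(mask, start, c, s)
    halves = []
    for sign in (1.0, -1.0):
        pts, p = [], center
        for _ in range(max_steps):
            r, c, s = lookup(p)
            # Striding: displacement of +/- (r/2 * cos, r/2 * sin).
            nxt = p + sign * 0.5 * r * np.array([c, s])
            if not inside(mask, nxt):
                break  # reached an end of the TCL
            _, c2, s2 = lookup(nxt)
            p = centralize(mask, nxt, c2, s2)
            pts.append(p)
        halves.append(pts)
    return halves[1][::-1] + [center] + halves[0]


def reconstruct(shape, centers, r_map):
    """Union of disks drawn at every center point, using the predicted radius."""
    yy, xx = np.mgrid[: shape[0], : shape[1]]
    area = np.zeros(shape, dtype=bool)
    for cx, cy in centers:
        r = r_map[int(round(cy)), int(round(cx))]
        area |= (xx - cx) ** 2 + (yy - cy) ** 2 <= r**2
    return area
```

On a straight horizontal TCL strip with constant radius, the traced points stay on the strip's middle row and advance by half a radius per stride.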

Training data generation

[Figure: training data generation]

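Based on the assumption that text instances are snake-shaped, a snake-shaped text instance has the following property: it has two edges called the head and the tail, and the two edges adjacent to the head or the tail (the sidelines) are approximately parallel but point in opposite directions.
In the figure, AH and DE are the head and tail. For AH, the preceding edge GH and the following edge AB form an angle of approximately 180°. This gives a measure for deciding whether an edge is a bottom (head or tail): $M(e_{i,i+1}) = \cos\langle e_{i+1,i+2}, e_{i-1,i} \rangle$. If the measure is close to -1, the edge is a bottom.
Once the bottoms are determined, an equal number of points is sampled on each of the two sidelines, e.g. ABCD and HGFE, and corresponding sample points are connected. The midpoint of each connecting line is a TCL point, the length of the connecting line is the diameter of the disk, and the angle of the line between two neighboring midpoints gives the orientation.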
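As a rough illustration of the label generation just described, the sketch below finds the two bottoms with the measure $M$, resamples the two sidelines, and derives center points, radii, and orientation. It assumes the polygon vertices are given in order around the text instance, samples points uniformly by arc length, and estimates the orientation from finite differences of neighboring center points; the function names are hypothetical.

```python
import numpy as np


def find_bottoms(poly):
    """Return the indices i of the two edges (v_i, v_{i+1}) with M closest to -1."""
    poly = np.asarray(poly, dtype=float)
    n = len(poly)
    edges = np.roll(poly, -1, axis=0) - poly  # edges[i] = v_{i+1} - v_i

    def cosine(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

    # M(e_{i,i+1}) = cos<e_{i+1,i+2}, e_{i-1,i}>
    scores = [cosine(edges[(i + 1) % n], edges[(i - 1) % n]) for i in range(n)]
    b1, b2 = sorted(int(i) for i in np.argsort(scores)[:2])
    return b1, b2


def resample(line, k):
    """Sample k points uniformly by arc length along a polyline."""
    seg = np.linalg.norm(np.diff(line, axis=0), axis=1)
    t = np.concatenate([[0.0], np.cumsum(seg)])
    s = np.linspace(0.0, t[-1], k)
    return np.stack([np.interp(s, t, line[:, 0]), np.interp(s, t, line[:, 1])], axis=1)


def tcl_targets(poly, k=8):
    """Center points, radii, cos and sin for one annotated text polygon."""
    poly = np.asarray(poly, dtype=float)
    n = len(poly)
    b1, b2 = find_bottoms(poly)
    # Sideline 1 runs from the end of bottom b2 to the start of bottom b1.
    idx1 = [(b2 + 1 + j) % n for j in range((b1 - b2 - 1) % n + 1)]
    # Sideline 2 is reversed so that both sidelines run in the same direction.
    idx2 = list(range(b1 + 1, b2 + 1))[::-1]
    side1, side2 = resample(poly[idx1], k), resample(poly[idx2], k)

    centers = (side1 + side2) / 2                       # TCL points
    radii = np.linalg.norm(side1 - side2, axis=1) / 2   # half the connecting line
    tangent = np.gradient(centers, axis=0)              # direction along the TCL
    tangent /= np.linalg.norm(tangent, axis=1, keepdims=True)
    return centers, radii, tangent[:, 0], tangent[:, 1]
```

For a polygon labeled A to H as in the figure, with vertices ordered A, B, ..., H, the two bottoms come out as edge DE (index 3) and edge HA (index 7).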

[1]: Zeiler, M.D., Krishnan, D., Taylor, G.W., Fergus, R.: Deconvolutional networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). pp. 2528–2535 (2010)
[2]: http://blog.prince2015.club/2019/01/06/TextSnake/
