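|Place of Conferral||Institute of Automation, Chinese Academy of Sciences|
|Keyword||natural scene images, text detection, text recognition, end-to-end spotting, deep learning|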
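Existing scene text recognizers are all modeled with recurrent or convolutional neural networks. Although their performance has steadily improved, they remain limited in recognition accuracy and in training and testing efficiency. This paper proposes, for the first time, a recognition model based on the self-attention mechanism. The model follows the encoder-decoder framework, and both the encoder and the decoder are built on self-attention, so training can be parallelized. Because both text and background vary widely in scene images, this paper designs a modality-transform block that uses convolution and concatenation to compress and retain the height information of the image, converting the two-dimensional input image into a one-dimensional feature sequence efficiently. Together with the encoder, it extracts more discriminative text features from the input image. The method achieves state-of-the-art recognition accuracy on both regular and irregular datasets. It also trains faster than recognition models based on recurrent and convolutional neural networks, at least 8 times faster than the best existing model.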
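In existing end-to-end models, the recognition branch does not make full use of how scene text is distributed in two-dimensional space, so slanted and vertical text is extracted relatively poorly. In addition, the text boxes that the recognition branch receives during training differ from those it receives during testing, which limits spotting performance. This paper proposes a more robust end-to-end scene text spotting model with three parts: a feature extraction module, a detection branch, and a recognition branch. To address the first problem, this paper builds, for the first time, a recognition branch based on a two-dimensional attention mechanism. It uses the two-dimensional distribution of scene text to decode the content of text regions of any orientation output by the detection branch, which makes the model more robust to changes in text orientation. To address the second problem, this paper proposes a data augmentation operation called jitter boxes. It gathers statistics on the disturbance between annotated boxes and detected boxes and applies that disturbance while training the recognition branch, so that the branch learns to decode text from perturbed boxes and becomes more robust to perturbations of its input. Experiments show that the two-dimensional recognition branch and jitter boxes significantly improve spotting performance and can be embedded easily into any existing end-to-end model.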
Scene text detection and recognition aim to extract text information from scene images, including the locations and contents of the text. With the rapid development of the Internet and the spread of mobile devices such as smartphones and tablets, images have gradually become the main medium for transmitting information, and their number has exploded. As the form of expression most closely tied to high-level human semantics, text in scene images is closely related to image content. Extracting text effectively from scene images is the basis of image understanding and analysis. It helps people quickly locate the parts of interest in massive amounts of data and plays an important role in a growing number of practical applications, such as image retrieval, intelligent transportation, human-computer interaction, and real-time translation.
Scene text spotting mainly follows two frameworks. One is the cascaded framework, which combines an independent scene text detection model with an independent recognition model to complete the extraction task. The other is the end-to-end framework, in which a single text spotting model detects and recognizes text simultaneously. The way scene images are acquired and the characteristics of scene text create many challenges for scene text spotting. Based on the characteristics of scene images, this paper studies three key technologies across the two frameworks: scene text detection, scene text recognition, and end-to-end scene text spotting. The main contributions and innovations of the paper are summarized as follows:
1. A multi-scale scene text detector with feature pyramids is proposed
Detecting multi-scale scene text remains challenging, and accuracy and speed have not yet reached a reasonable balance. The commonly used image pyramid and feature pyramid are designed to detect text across a wide range of scales, but they trade speed against accuracy. To solve this problem, this paper proposes an efficient model for multi-scale scene text detection. First, the upper layers of the network, which contain rich semantic information, are used as prediction layers. Then, on top of the backbone network, a proposed grouped pyramid module recursively combines the basic layers into a prediction layer via a top-down partition and a bottom-up group. The grouped pyramid module adds more detailed information to the prediction. Experiments demonstrate that the proposed model makes full use of both fine-grained information and high-level semantics. The model achieves higher accuracy for multi-scale text detection while limiting the loss in computing speed.
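The abstract does not specify how the grouped pyramid module partitions or groups layers, so the following is only a minimal sketch of the general idea of folding fine-grained layers into a coarse prediction layer. The 2x2 average pooling, the element-wise addition, and the names `avg_pool2x2` and `group_bottom_up` are all assumptions made for illustration, not the author's design.

```python
def avg_pool2x2(fmap):
    """Halve a 2-D feature map (list of rows) by averaging 2x2 windows."""
    height, width = len(fmap), len(fmap[0])
    return [
        [
            (fmap[i][j] + fmap[i][j + 1] + fmap[i + 1][j] + fmap[i + 1][j + 1]) / 4.0
            for j in range(0, width, 2)
        ]
        for i in range(0, height, 2)
    ]


def group_bottom_up(levels):
    """Fold maps ordered fine to coarse into the coarsest (prediction) map.

    Assumption: each level is exactly half the size of the previous one,
    and levels are merged by element-wise addition.
    """
    merged = levels[0]
    for coarser in levels[1:]:
        pooled = avg_pool2x2(merged)
        merged = [
            [p + c for p, c in zip(pooled_row, coarse_row)]
            for pooled_row, coarse_row in zip(pooled, coarser)
        ]
    return merged


levels = [
    [[1.0] * 4 for _ in range(4)],   # 4x4 map: fine detail
    [[10.0] * 2 for _ in range(2)],  # 2x2 map
    [[100.0]],                       # 1x1 map: strongest semantics
]
prediction = group_bottom_up(levels)
```

In this toy input the single prediction cell ends up carrying a contribution from every level, which is the property the paragraph attributes to the prediction layer.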
2. An oriented scene text detector with learnable anchors is proposed
Existing regression-based scene text detectors mainly use fixed anchors. Once defined, the scales and positions of the anchors cannot be changed during network training. Because scene text varies widely in scale, aspect ratio, and position, fixed anchors are insufficient for detecting complicated and changeable scene text. To solve this problem, this paper proposes a single-shot oriented scene text detector with learnable anchors. The proposed model contains two prediction branches. The first refines the scales and locations of the anchors according to the characteristics of scene text. The second takes the refined anchors as its defaults and regresses their offsets to text regions. The two branches are optimized jointly without sacrificing much speed. Extensive experiments on both oriented and horizontal benchmarks demonstrate that the first branch provides a better initialization of the anchors used in the second branch. With the learnable anchor branch, the proposed model localizes text more accurately and misses fewer detections.
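The two-branch idea can be illustrated as two successive box decodings. The sketch below assumes the center-size offset parameterization commonly used by anchor-based detectors and axis-aligned boxes; the abstract does not state the parameterization used in the thesis, orientation is omitted, and the offset values stand in for hypothetical network outputs.

```python
import math


def decode(anchor, delta):
    """Apply (dx, dy, dw, dh) offsets to a (cx, cy, w, h) box."""
    cx, cy, w, h = anchor
    dx, dy, dw, dh = delta
    return (cx + dx * w, cy + dy * h, w * math.exp(dw), h * math.exp(dh))


fixed_anchor = (50.0, 50.0, 20.0, 20.0)

# Hypothetical output of the first branch: shift right and double the width.
refine_delta = (0.5, 0.0, math.log(2.0), 0.0)
# Hypothetical output of the second branch: a small downward shift.
box_delta = (0.0, 0.25, 0.0, 0.0)

learned_anchor = decode(fixed_anchor, refine_delta)  # refined default box
text_box = decode(learned_anchor, box_delta)         # final regressed box
```

Because the second decoding starts from the refined anchor rather than the fixed one, the offsets it has to regress are smaller, which is one way to read the claim that the first branch gives the second a better initialization.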
3. A scene text recognizer based on the self-attention mechanism is proposed
Existing scene text recognition methods mainly adopt recurrence-based or convolution-based networks. Although they perform well, these methods are still limited in recognition accuracy and in training and testing efficiency. This paper proposes, for the first time, a scene text recognition model based on the self-attention mechanism. The proposed model follows the encoder-decoder framework, and both the encoder and the decoder rely on self-attention, so training can be parallelized. Because text and backgrounds vary widely in scene images, this paper further designs a modality-transform block, which compresses and retains height-dimensional information by using convolution and concatenation operations. This block efficiently transforms two-dimensional input images into one-dimensional sequences and, combined with the encoder, extracts more discriminative features. The proposed model achieves state-of-the-art accuracy on both regular and irregular benchmarks, and it trains faster than recurrence-based and convolution-based models, at least 8 times faster than the best existing model.
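As a rough illustration of the two ingredients named above, the sketch below folds the height axis of a feature map into the channel axis by concatenation and then applies scaled dot-product self-attention to the resulting sequence. It is a simplification under stated assumptions: the convolutional part of the modality-transform block is left out, and the attention uses identity projections with a single head and no positional encoding. The names `fold_height` and `self_attention` are hypothetical.

```python
import math


def fold_height(feature_map):
    """Turn a (C, H, W) nested list into W vectors of length C * H.

    Height is concatenated into the channel axis, so every column of the
    image becomes one step of a 1-D sequence and no row is discarded.
    """
    channels = len(feature_map)
    height = len(feature_map[0])
    width = len(feature_map[0][0])
    return [
        [feature_map[c][r][x] for c in range(channels) for r in range(height)]
        for x in range(width)
    ]


def softmax(scores):
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(sequence):
    """Scaled dot-product self-attention with queries = keys = values."""
    dim = len(sequence[0])
    outputs = []
    for query in sequence:  # iterations are independent of one another
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in sequence
        ]
        weights = softmax(scores)
        outputs.append(
            [sum(w * value[d] for w, value in zip(weights, sequence)) for d in range(dim)]
        )
    return outputs


feature_map = [
    [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]],     # channel 0, two rows
    [[7.0, 8.0, 9.0], [10.0, 11.0, 12.0]],  # channel 1, two rows
]
sequence = fold_height(feature_map)  # 3 steps, each of length 4
attended = self_attention(sequence)
```

Each output position depends only on the full input sequence, not on the output of the previous position, which is why this kind of layer can be computed for all positions at once, unlike a recurrent layer.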
4. A more robust end-to-end scene text spotting model is proposed
In existing end-to-end models, the recognition branches do not make full use of how scene text is distributed in two-dimensional space, which results in poor performance on oriented and vertical text. In addition, the recognition branches receive inconsistent text boxes during the training and testing phases, which limits spotting accuracy. This paper proposes a more robust end-to-end scene text spotting model, which contains three parts: a feature extraction module, a detection branch, and a recognition branch. To solve the first issue, this paper constructs, for the first time, a recognition branch based on a two-dimensional attention mechanism. By leveraging the two-dimensional distribution of scene text, the recognition branch can recognize text of any orientation output by the detection branch, which makes the model more robust to changes in text orientation. To solve the second issue, this paper proposes a data augmentation operation, named jitter boxes, which operates on the recognition branch. It gathers statistics on the disturbance between the labeled boxes and the detected boxes and applies that disturbance during the training phase of the recognition branch. The recognition branch is thereby trained to decode text from perturbed boxes, which makes the model more robust to perturbations of the recognition input. Experiments demonstrate that the two-dimensional recognition branch and jitter boxes significantly improve the performance of the proposed model. Moreover, both can be embedded easily into any existing end-to-end model.
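One possible reading of jitter boxes is sketched below: measure how detected boxes deviate from their labeled boxes, summarize the deviation per coordinate, and sample from that summary to perturb labeled boxes during training. The abstract gives no formula, so the axis-aligned `(x1, y1, x2, y2)` boxes, the normalization by box size, the Gaussian model, and the names `fit_jitter` and `jitter_box` are all assumptions for illustration.

```python
import random
import statistics


def normalized_offsets(gt, det):
    """Per-coordinate deviation of a detected box, relative to box size."""
    width, height = gt[2] - gt[0], gt[3] - gt[1]
    return [
        (det[0] - gt[0]) / width,
        (det[1] - gt[1]) / height,
        (det[2] - gt[2]) / width,
        (det[3] - gt[3]) / height,
    ]


def fit_jitter(pairs):
    """Return a (mean, std) pair per coordinate from matched (gt, det) boxes."""
    columns = zip(*(normalized_offsets(gt, det) for gt, det in pairs))
    return [(statistics.mean(col), statistics.pstdev(col)) for col in columns]


def jitter_box(gt, stats, rng):
    """Perturb a labeled box with noise drawn from the fitted statistics."""
    width, height = gt[2] - gt[0], gt[3] - gt[1]
    scales = (width, height, width, height)
    return tuple(
        coord + rng.gauss(mu, sigma) * scale
        for coord, (mu, sigma), scale in zip(gt, stats, scales)
    )


# Hypothetical matched pairs: (labeled box, detected box).
pairs = [
    ((10, 10, 110, 40), (12, 9, 108, 43)),
    ((20, 50, 220, 90), (16, 52, 226, 88)),
]
stats = fit_jitter(pairs)
rng = random.Random(0)  # seeded so the sketch is repeatable
train_box = jitter_box((0.0, 0.0, 100.0, 20.0), stats, rng)
```

Feeding boxes like `train_box` to the recognition branch during training, instead of the exact labeled boxes, exposes it to the kind of localization error it will see from the detection branch at test time.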
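|盛芬芬. Research on Natural Scene Text Detection and Recognition Techniques (original title: 自然场景文本检测与识别技术研究) [D]. Institute of Automation, Chinese Academy of Sciences. University of Chinese Academy of Sciences, 2020.|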