CASIA OpenIR > Graduates > Doctoral Dissertations
Research on Natural Scene Text Detection and Recognition (自然场景文本检测与识别技术研究)
盛芬芬 (Sheng Fenfen)
2020-06-03
Pages: 138
Degree type: Doctoral
Chinese Abstract

Natural scene text detection and recognition aim to extract text information from scene images, locating texts and recognizing their contents. With the rapid development of the Internet industry and the widespread adoption of mobile devices such as smartphones and tablets, images have gradually become a primary medium of information transmission and have grown explosively in number. Text is the most direct expression of high-level human semantics, and text in an image is closely related to the image's content. Accurate and efficient extraction of text from scene images is therefore a foundation of image content understanding and analysis; it helps people quickly locate the parts of interest in massive data and plays an important role in a growing number of practical applications, such as image retrieval, intelligent transportation, human-computer interaction and real-time translation. Research on scene text detection and recognition thus has significant theoretical and practical value.

Scene text extraction is mainly based on two frameworks. One is the cascaded framework, in which independent detection and recognition models are combined to complete the extraction task; the other is the end-to-end framework, in which a single end-to-end model detects and recognizes scene text simultaneously. The way scene images are captured and the characteristics of scene text leave many open challenges in scene text extraction. Based on the characteristics of scene images, this thesis studies the key technologies of both frameworks in depth, covering scene text detection, scene text recognition and end-to-end scene text extraction. The main contributions of the thesis are summarized as follows:

1. A multi-scale scene text detection model based on feature pyramids is proposed

Efficient detection of multi-scale scene text remains challenging: existing models have not reached a reasonable balance between accuracy and speed. The commonly used image-pyramid and feature-pyramid approaches aim to detect text over a large range of scales, but each sacrifices either speed or accuracy. To address this problem, the thesis proposes an efficient multi-scale scene text detection model. First, the upper layers of the network, which are rich in semantic information, are used as prediction layers; then a fused feature-pyramid module is added on top of the baseline, combining the network's basic feature layers through top-down partition and bottom-up fusion operations. The fused layers are also used for prediction, adding more detailed information to the prediction layers. Experiments show that this method makes fuller use of both low-level and high-level features, achieving multi-scale scene text detection while reducing the loss in computing speed.

2. An oriented scene text detection model with learnable anchors is proposed

Regression-based scene text detectors usually use fixed anchors: once the anchors are set, their scales and positions do not change during training. Text in natural scenes varies greatly in scale, aspect ratio and position, so fixed anchors are insufficient for detecting complex and changeable scene text. To address this problem, the thesis proposes an oriented scene text detection model with learnable anchors. The model contains two prediction branches: the first adjusts the scales and positions of the anchors in a learnable way according to the characteristics of scene text, and the second predicts text locations based on the refined anchors; the two branches are trained jointly. Extensive experiments on horizontal and oriented benchmarks show that the first branch provides better initial anchors for the second branch, so the model outputs more accurate text boxes and significantly reduces the miss rate.

3. A scene text recognition model based on the self-attention mechanism is proposed

Existing scene text recognizers are built on recurrent or convolutional neural networks; although their performance has kept improving, they still have limitations in recognition accuracy and in training and testing efficiency. This thesis proposes, for the first time, a recognition model based on the self-attention mechanism. The model follows the encoder-decoder framework, with both the encoder and the decoder built on self-attention, so training can be parallelized. Considering the large variation of both text and background in scene images, a modality-transform block is designed that uses convolution and concatenation operations to compress the image while retaining height information, efficiently converting the two-dimensional input image into a one-dimensional feature sequence; combined with the encoder, it extracts more discriminative text features from the input image. The method achieves state-of-the-art recognition accuracy on both regular and irregular benchmarks, and its training speed exceeds that of recurrence- and convolution-based recognizers, being at least 8 times faster than the previous best model.

4. A more robust end-to-end scene text extraction model is proposed

In existing end-to-end models, the recognition branch does not make full use of the distribution of scene text in two-dimensional space, so oriented and vertical texts are extracted relatively poorly; in addition, the recognition branch receives inconsistent text boxes in the training and testing phases, which limits extraction performance. This thesis proposes a more robust end-to-end scene text extraction model consisting of a feature-extraction module, a detection branch and a recognition branch. For the first problem, a recognition branch based on a two-dimensional attention mechanism is constructed for the first time; by exploiting the spatial distribution of scene text, it decodes the text content of regions of arbitrary orientation output by the detection branch, improving robustness to orientation changes. For the second problem, a data augmentation called jitter boxes is proposed: the perturbation between labeled boxes and detected boxes is measured statistically and applied during the training of the recognition branch, which thus learns to decode text content from perturbed boxes, improving robustness to perturbed recognition inputs. Experiments show that the two-dimensional recognition branch and jitter boxes significantly improve extraction performance, and both can easily be embedded into any existing end-to-end model.

English Abstract

Scene text detection and recognition aim to extract text information from scene images, including the locations and contents of texts. With the rapid development of the Internet and the popularization of mobile devices such as smartphones and tablets, images have gradually become the main medium of information transmission and have exploded in number. As the most direct expression of human high-level semantic information, texts in scene images are closely related to image contents. Extracting texts accurately and efficiently from scene images is the basis of image understanding and analysis. It helps people quickly locate the parts of interest among massive data and plays an important role in more and more practical applications, such as image retrieval, intelligent transportation, human-computer interaction and real-time translation. Research on scene text detection and recognition therefore has significant theoretical and practical value.

Scene text spotting is mainly based on two frameworks. One is the cascaded framework, which combines independent scene text detection and recognition models to complete the extraction task. The other is the end-to-end framework, in which a single end-to-end model detects and recognizes texts simultaneously. The way scene images are acquired and the characteristics of scene texts lead to many challenges in scene text spotting. According to the characteristics of scene images, this thesis thoroughly studies the key technologies of the two frameworks, including scene text detection, scene text recognition and end-to-end scene text spotting. The main contributions of the thesis are summarized as follows:

1. A multi-scale scene text detector with feature pyramids is proposed

Currently, the efficient detection of multi-scale scene texts remains challenging: existing models have not reached a reasonable balance between accuracy and speed. The commonly used image-pyramid and feature-pyramid approaches are dedicated to detecting texts over a large range of scales, but each trades off speed against accuracy. To solve this problem, this thesis proposes an efficient model for multi-scale scene text detection. First, the upper layers of the network, which contain rich semantic information, are used as prediction layers. Then, on top of the backbone network, a grouped pyramid module recursively combines the basic feature layers into a prediction layer via a top-down partition and a bottom-up grouping, appending more detailed information to the prediction. Experiments demonstrate that the proposed model makes fuller use of both fine-grained information and high-level semantics, achieving higher accuracy for multi-scale text detection while reducing the loss in computing speed.
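The abstract does not spell out the grouped pyramid module in detail; as a rough illustration of the kind of top-down fusion such feature-pyramid modules build on, the following NumPy sketch upsamples deep, semantically rich maps and merges them with shallower, more detailed ones (the shapes, the nearest-neighbor upsampling and the element-wise addition are illustrative assumptions, not the thesis's exact module):

```python
import numpy as np

def upsample2x(x):
    # Nearest-neighbor 2x upsampling of a (C, H, W) feature map.
    return x.repeat(2, axis=1).repeat(2, axis=2)

def top_down_fuse(features):
    # features: list of (C, H, W) maps ordered deep (small) to shallow
    # (large). Each shallow map is enriched with the upsampled fusion of
    # all deeper maps, so every prediction layer carries both semantics
    # and fine-grained detail.
    fused = [features[0]]
    for f in features[1:]:
        fused.append(f + upsample2x(fused[-1]))
    return fused

c5 = np.ones((8, 4, 4))     # deepest, most semantic layer
c4 = np.ones((8, 8, 8))
c3 = np.ones((8, 16, 16))   # shallowest, most detailed layer
p5, p4, p3 = top_down_fuse([c5, c4, c3])
```

Real modules would use learned 1x1 convolutions before the merge; this sketch only shows the pyramid's information flow.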

2. An oriented scene text detector with learnable anchors is proposed

Existing regression-based scene text detectors mainly use fixed anchors: once defined, the scales and positions of the anchors cannot be changed during training. Since scene texts vary greatly in scale, aspect ratio and position, fixed anchors are insufficient for detecting complicated and changeable scene texts. To solve this problem, this thesis proposes a single-shot oriented scene text detector with learnable anchors. The model contains two prediction branches: the first refines the scales and locations of the anchors according to the characteristics of scene texts, and the second takes the refined anchors as defaults and regresses their offsets to text regions. The two branches are optimized jointly without sacrificing much speed. Extensive experiments on both oriented and horizontal benchmarks demonstrate that the first branch provides a better initialization of the anchors used by the second branch, so the model yields more accurate locations and a lower rate of missed detections.
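The two-branch scheme can be sketched with the standard box-regression parameterization (an assumption; the thesis's exact parameterization, and its handling of orientation, are not given in this abstract): the first branch's predicted offsets refine the anchors, and the second branch regresses from the refined anchors to the final boxes.

```python
import numpy as np

def apply_offsets(boxes, deltas):
    # boxes: (N, 4) as (cx, cy, w, h); deltas: (N, 4) as (dx, dy, dw, dh).
    # Centers move proportionally to box size; sizes scale exponentially.
    cx = boxes[:, 0] + deltas[:, 0] * boxes[:, 2]
    cy = boxes[:, 1] + deltas[:, 1] * boxes[:, 3]
    w = boxes[:, 2] * np.exp(deltas[:, 2])
    h = boxes[:, 3] * np.exp(deltas[:, 3])
    return np.stack([cx, cy, w, h], axis=1)

anchors = np.array([[32.0, 32.0, 64.0, 16.0]])   # one fixed default anchor
refine = np.array([[0.1, 0.0, 0.2, 0.0]])        # branch 1: anchor refinement
regress = np.array([[0.0, 0.1, 0.0, 0.3]])       # branch 2: final regression

refined = apply_offsets(anchors, refine)   # learnable anchor for branch 2
final = apply_offsets(refined, regress)    # predicted text box
```

The point of the design is that branch 2 starts from `refined` rather than the hand-set `anchors`, so its regression targets are smaller and easier to fit.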

3. A scene text recognizer based on the self-attention mechanism is proposed

Existing scene text recognition methods mainly adopt recurrence- or convolution-based networks. Although they have achieved good performance, these methods still have limitations in recognition accuracy and in training and testing efficiency. This thesis proposes, for the first time, a scene text recognition model based on the self-attention mechanism. The model follows the encoder-decoder framework, with both the encoder and the decoder built on self-attention, so training can be parallelized to a greater degree. Considering the large variation of both texts and backgrounds in scene images, the thesis further designs a modality-transform block, which compresses the image while retaining height information through convolution and concatenation operations. This block efficiently transforms the two-dimensional input image into a one-dimensional sequence and, combined with the encoder, extracts more discriminative text features. The model achieves state-of-the-art accuracy on both regular and irregular benchmarks, and its training speed exceeds that of recurrence- and convolution-based models, being at least 8 times faster than the previous best model.
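As a hedged sketch of the two components named above, the NumPy code below shows (a) a height-concatenating flattening in the spirit of the modality-transform block (the convolutional compression is omitted) and (b) single-head scaled dot-product self-attention with the learned Q/K/V projections omitted; both are generic illustrations, not the thesis's exact layers:

```python
import numpy as np

def modality_transform(fmap):
    # fmap: (C, H, W) feature map -> sequence of W vectors of size C*H.
    # Concatenating the height dimension into each column's feature vector
    # keeps vertical information after flattening to one dimension.
    C, H, W = fmap.shape
    return fmap.transpose(2, 0, 1).reshape(W, C * H)

def self_attention(x):
    # Single-head scaled dot-product self-attention with Q = K = V = x.
    # x: (T, d) sequence; every position attends to every other, which is
    # what makes training parallelizable, unlike a recurrent encoder.
    d = x.shape[1]
    scores = x @ x.T / np.sqrt(d)
    scores -= scores.max(axis=1, keepdims=True)   # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=1, keepdims=True) # softmax over keys
    return weights @ x

rng = np.random.default_rng(0)
fmap = rng.random((4, 8, 25))        # C=4, H=8, W=25 (illustrative sizes)
seq = modality_transform(fmap)       # (25, 32): width becomes time steps
out = self_attention(seq)            # (25, 32): contextualized sequence
```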

4. A more robust end-to-end scene text spotting model is proposed

In existing end-to-end models, the recognition branch does not make full use of the distribution of scene texts in two-dimensional space, which results in poor performance on oriented and vertical texts. In addition, the recognition branch receives inconsistent text boxes during the training and testing phases, which limits spotting accuracy. This thesis proposes a more robust end-to-end scene text spotting model consisting of three parts: a feature-extraction module, a detection branch and a recognition branch. To solve the first issue, the thesis constructs, for the first time, a recognition branch based on a two-dimensional attention mechanism. By leveraging the spatial distribution of scene texts, the recognition branch can recognize text regions of arbitrary orientation output by the detection branch, improving the model's robustness to changes in text orientation. To solve the second issue, the thesis proposes a data augmentation named jitter boxes, which acts on the recognition branch: the perturbation between labeled boxes and detected boxes is measured statistically and applied during the training of the recognition branch, so the branch learns to decode text contents from perturbed text boxes, improving robustness to perturbations of the recognition input. Experiments demonstrate that the two-dimensional recognition branch and jitter boxes significantly improve spotting performance, and both can easily be embedded into any existing end-to-end model.
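A minimal sketch of the jitter-boxes idea follows, assuming a uniform perturbation with a fixed ratio in place of the statistics the thesis measures between labeled and detected boxes (the ratio, the box format and the clamping are illustrative assumptions):

```python
import random

def jitter_box(box, ratio=0.1, rng=None):
    # box: (x, y, w, h) ground-truth text box. Perturb position and size
    # by up to `ratio` of the box dimensions, mimicking the mismatch
    # between labeled boxes (training) and detected boxes (testing) so
    # the recognition branch learns to read from imperfect crops.
    rng = rng or random.Random()
    x, y, w, h = box
    dx = rng.uniform(-ratio, ratio) * w
    dy = rng.uniform(-ratio, ratio) * h
    dw = rng.uniform(-ratio, ratio) * w
    dh = rng.uniform(-ratio, ratio) * h
    return (x + dx, y + dy, max(1.0, w + dw), max(1.0, h + dh))

rng = random.Random(0)
jittered = jitter_box((10.0, 20.0, 100.0, 30.0), ratio=0.1, rng=rng)
```

During training, the recognition branch would be fed crops from `jittered` boxes instead of the exact ground-truth boxes.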

Keywords: natural scene images; text detection; text recognition; end-to-end spotting; deep learning
Language: Chinese
Document type: Doctoral dissertation
Identifier: http://ir.ia.ac.cn/handle/173211/39266
Collection: Graduates / Doctoral Dissertations
Recommended citation (GB/T 7714):
盛芬芬. 自然场景文本检测与识别技术研究[D]. 中国科学院自动化研究所. 中国科学院大学, 2020.
Files in this item:
盛芬芬-自然场景文本检测与识别技术研究 (14633 KB), dissertation, restricted access, license: CC BY-NC-SA