Research on Scene Text Recognition Methods Based on Sequence Modeling


	Research on Scene Text Recognition Methods Based on Sequence Modeling
	Gao Yunze (高云泽)
	2020-05-27
Pages	132
Degree type	Doctoral
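Chinese abstract (translated)	With the continued development of the mobile Internet and the rapid spread of smart devices, image and video data have grown explosively and have become an important medium for conveying and exchanging information. Accurate and efficient semantic understanding of massive data is an essential step in the era of big data and intelligence. Images and videos contain rich textual information. Unlike ordinary visual objects, text directly carries high-level semantics and is a key element of scene understanding and analysis. Text appears throughout the scenes of daily life, and scene text recognition has broad application prospects in identity authentication, smart education, content moderation, autonomous driving, and other fields. Research on scene text recognition therefore has significant academic value and practical importance. Traditional text recognition mainly targets document text, which is typically neatly arranged, uniform in font, simple in background, and captured under ideal imaging conditions. By contrast, text in natural scenes often has diverse layouts, varied font appearance, complex backgrounds, and uncontrolled imaging conditions, so scene text recognition is more challenging than document text recognition. In recent years, thanks to progress in deep learning and given the dependencies within text sequences, jointly learning deep features and sequential relations has become the mainstream direction. Building on a deep learning framework based on sequence modeling, this dissertation addresses the problems of computational efficiency, annotation cost, imaging quality, and text deformation in scene text recognition, and improves recognition efficiency and accuracy through appropriately designed network structures and learning methods. The main results and contributions are summarized as follows:
1. Scene text recognition based on a fully convolutional sequence modeling network. Sequence modeling frameworks built on recurrent neural networks involve temporal recurrence and cannot be computed in parallel. To address this, a scene text recognition method based on a fully convolutional sequence modeling network is proposed. Sequence modeling in a recurrent neural network is essentially a process of fusing contextual information, similar to the receptive field mechanism of convolution. The method therefore uses convolution to approximate sequence modeling, which allows efficient parallel computation and significantly improves computational efficiency. In addition, an attention mechanism is introduced during basic feature encoding to enhance the foreground text representation and suppress background noise responses, enabling more precise fusion of contextual information. Experiments on public datasets show that the method significantly improves inference speed while maintaining high recognition accuracy.
2. Scene text recognition based on semi-supervised neural network learning. Fully supervised methods depend on large amounts of fine-grained annotation, which leads to weak generalization and makes high-accuracy recognition difficult when labeled data are insufficient. To address this, a text recognition method based on semi-supervised neural network learning is proposed. For unlabeled data, the method uses a visual-semantic embedding module to map the input image and the predicted sequence into the same semantic space, and designs a label-independent optimization objective that measures their semantic consistency as the basis for assessing prediction quality, providing guidance for network learning. For labeled data, the method designs a word-based global optimization objective and combines it with a character-based local optimization objective, strengthening the guidance that labels provide for network learning. Experiments on public datasets show that the method reduces annotation requirements by about 90% while maintaining high recognition accuracy. When labeled data are limited, accuracy on multiple datasets improves by more than 12%, owing to the influence of unlabeled data on the learning process.
3. Scene text recognition based on a gated bidirectional interactive decoding network. Image quality in natural scenes is uneven, and visual features learned from low-quality data have weak representational and discriminative power. To address this, a scene text recognition method based on a gated bidirectional interactive decoding network is proposed. The method fully mines the grammatical relations in text sequences to provide complementary information for the visual features. While the forward decoder models forward grammatical relations, a backward decoder is introduced to capture backward grammatical relations. The interaction of the two decoders fuses the bidirectional grammatical information, which is combined with visual context information so that the three complement one another. In addition, a gating mechanism regulates the fusion of the three kinds of information, reducing the influence of noise and yielding a more accurate decoding process. Experiments on public datasets show that the method has a significant performance advantage over unidirectional decoding methods and obtained the best results of its time on multiple datasets.
4. Irregular text recognition based on a progressive rectification network. To address the irregular text recognition problem that arises from the deformation of scene text, a text recognition method based on a progressive rectification network is proposed. Hierarchical, progressive rectification decomposes the difficulty of the rectification task, and iterative computation is used to rectify complex and varied irregular text. On this basis, the method uses a text envelope to reflect the position and pose of the text, and continuously refines the position of the text envelope on the original image as the basis for the spatial transformation, thereby preserving the integrity of the text during iterative rectification. The rectification network can also be trained jointly with the recognition network end to end, without additional character position annotations. Experiments on public datasets show that the method effectively rectifies irregular text and thus significantly improves recognition accuracy, achieving the best result on 5 of 7 metrics across 4 irregular text datasets.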
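The analogy in contribution 1 between context fusion and the convolutional receptive field can be illustrated with a small sketch. This is not the dissertation's network: the kernel sizes, the dilations, the all-ones kernels, and the helper names `receptive_field`, `conv1d_same`, and `affected_positions` are assumptions chosen only to show how stacked stride-1 convolutions widen the context that each sequence position sees, with every position computable independently of the others.

```python
def receptive_field(kernel_sizes, dilations=None):
    """Receptive field (in time steps) of stacked stride-1 1D convolutions."""
    if dilations is None:
        dilations = [1] * len(kernel_sizes)
    if len(dilations) != len(kernel_sizes):
        raise ValueError("need one dilation per layer")
    return 1 + sum((k - 1) * d for k, d in zip(kernel_sizes, dilations))


def conv1d_same(seq, kernel, dilation=1):
    """Stride-1 1D convolution with zero padding; odd kernel length assumed."""
    half = len(kernel) // 2
    out = []
    for t in range(len(seq)):  # each output position is independent of the others
        acc = 0.0
        for j, w in enumerate(kernel):
            idx = t + (j - half) * dilation
            if 0 <= idx < len(seq):
                acc += w * seq[idx]
        out.append(acc)
    return out


def affected_positions(length, kernel_sizes, dilations, probe):
    """Output positions that change when the input at `probe` is perturbed."""
    base = [0.0] * length
    bumped = list(base)
    bumped[probe] = 1.0
    for k, d in zip(kernel_sizes, dilations):
        kernel = [1.0] * k  # illustrative all-ones kernel (assumption)
        base = conv1d_same(base, kernel, d)
        bumped = conv1d_same(bumped, kernel, d)
    return [t for t in range(length) if base[t] != bumped[t]]


if __name__ == "__main__":
    print(receptive_field([3, 3, 3]))
    print(affected_positions(21, [3, 3, 3], [1, 1, 1], probe=10))
    print(len(affected_positions(21, [3, 3, 3], [1, 2, 4], probe=10)))
```

Because no output position waits on the result of the previous one, all positions in a layer can be evaluated in parallel, which is the property the abstract contrasts with recurrent computation.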
English abstract	With the continuous development of the mobile Internet and the rapid spread of intelligent terminals, images and videos have grown explosively and have become an important medium for information transmission and interaction. Accurate and efficient semantic understanding of massive data is essential in the era of big data and intelligence. Images and videos contain rich textual information. Unlike general visual objects, text directly carries high-level semantic information and is a key element of scene understanding. Text is common in the scenes of daily life, and text recognition in natural images has broad application prospects in fields such as identity authentication, smart education, content moderation, and autonomous driving. Research on scene text recognition therefore has important academic value and practical significance. Traditional text recognition mainly targets document text, which usually has a regular arrangement, uniform fonts, simple backgrounds, and ideal imaging conditions. In contrast, text in natural images often has varied arrangements, diverse appearances, complex backgrounds, and uncontrollable imaging conditions, so scene text recognition is more challenging than document text recognition. Recently, benefiting from advances in deep learning and exploiting the dependencies in text sequences, scene text recognition has been widely formulated as a sequence recognition problem, and the fusion of deep features and sequential relations has become the main research direction. Based on the deep learning framework of sequence modeling, this dissertation designs network structures and learning strategies to address the problems of computational efficiency, annotation cost, imaging quality, and text deformation, improving both the efficiency and the accuracy of scene text recognition. The main contributions of this dissertation are summarized as follows:
1. This dissertation proposes a fully convolutional sequence modeling network to address the recurrent computation issue of frameworks based on recurrent neural networks (RNNs). Sequence modeling in an RNN is essentially the fusion of contextual information, which is similar to the receptive field of a convolutional operation. A convolutional network is therefore used to approximate the sequence modeling process, which allows parallel computation and significantly improves efficiency. In addition, an attention mechanism is integrated into the basic feature encoder to enhance foreground text and suppress background noise, achieving more accurate contextual information fusion. Experimental results on public datasets show that the proposed method significantly improves efficiency while maintaining good recognition performance.
2. This dissertation proposes a semi-supervised learning method to address the weak generalization of fully supervised learning methods when labeled samples are limited. For unlabeled data, a label-independent optimization objective is designed. Through a visual-semantic embedding, the input image and the predicted sequence are projected into the same semantic space, and their semantic consistency is measured to evaluate the prediction quality. For labeled data, a word-based global optimization objective is designed and combined with a character-level local optimization objective to strengthen the guidance for network learning. Experimental results on public datasets show that the proposed method reduces annotation costs by about 90% while maintaining good recognition performance. When only a small amount of labeled data is available, accuracy on multiple datasets improves by more than 12% owing to the influence of unlabeled data on the learning process.
3. This dissertation proposes a gate-based bidirectional interactive decoding network to address the weak representational power of visual features extracted from low-quality images. The grammatical information in text sequences is fully exploited to supplement the visual features. Alongside the forward grammar modeled by the forward decoder, a backward decoder is introduced to model the reverse grammar. The bidirectional grammar information is fused through the interaction of the two decoders and combined with visual context information to exploit the complementary advantages of the different sources. A gating mechanism adjusts the fusion process to reduce the influence of noise, achieving more accurate decoding. Experimental results on public datasets show that the proposed method is significantly superior to unidirectional decoding methods and achieves state-of-the-art performance on multiple datasets.
4. This dissertation proposes a progressive rectification network to address irregular text recognition. Rectification is performed progressively, so the difficulty of each step is considerably reduced, and iterative refinement is adopted to remove the severe deformation of irregular text. In addition, a text envelope is used to reflect the position and posture of the text. The position of the text envelope on the original image is continuously refined as the basis for spatial transformation, so that the integrity of the text is preserved during the iteration. The rectification network is jointly optimized with the recognition network under the same objective in an end-to-end scheme, without requiring additional character location annotations. Experimental results on public datasets show that the proposed method effectively rectifies irregular text and thus achieves superior recognition performance on irregular benchmarks.
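The gated fusion described in contribution 3 can be sketched in a few lines. The abstract does not give the gate's formulation, so everything here is an assumption made for illustration: the function name `gated_fusion`, one scalar sigmoid gate per source computed from the concatenated features, and the weighted-sum combination. It shows only the general idea of letting learned gates scale how much the forward grammar, backward grammar, and visual context each contribute.

```python
import math


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def gated_fusion(forward, backward, visual, gate_weights, gate_biases):
    """Fuse three equal-length feature vectors with one scalar gate per source.

    gate_weights: three weight vectors over the concatenated features
                  (hypothetical parameterization, not from the dissertation).
    gate_biases:  three scalar biases, one per source.
    Returns (fused_vector, gates).
    """
    sources = [forward, backward, visual]
    dim = len(forward)
    if any(len(s) != dim for s in sources):
        raise ValueError("all sources must have the same length")
    concat = forward + backward + visual
    gates = []
    for w, b in zip(gate_weights, gate_biases):
        if len(w) != len(concat):
            raise ValueError("gate weight length must equal 3 * feature length")
        gates.append(sigmoid(sum(wi * xi for wi, xi in zip(w, concat)) + b))
    # Each gate scales its own source before the sources are summed.
    fused = [sum(g * s[i] for g, s in zip(gates, sources)) for i in range(dim)]
    return fused, gates


if __name__ == "__main__":
    fwd, bwd, vis = [1.0, 0.0], [2.0, 1.0], [3.0, -1.0]
    zero_w = [[0.0] * 6 for _ in range(3)]
    # A strongly negative bias on the third gate mimics suppressing a noisy visual cue.
    fused, gates = gated_fusion(fwd, bwd, vis, zero_w, [0.0, 0.0, -50.0])
    print(gates)
    print(fused)
```

In a trained model the gate parameters would be learned jointly with the decoders; the fixed values above exist only to make the effect of opening or closing a gate visible.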
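Keywords	scene text recognition, sequence modeling, fully convolutional network, semi-supervised learning, grammatical relation modeling, irregular text recognition
Language	Chinese
Seven major research directions: sub-direction classification	Image and video processing and analysis
Document type	Dissertation
Item identifier	http://ir.ia.ac.cn/handle/173211/39294
Collection	Graduates_Doctoral dissertations
Recommended citation (GB/T 7714)	Gao Yunze (高云泽). Research on Scene Text Recognition Methods Based on Sequence Modeling [D]. Institute of Automation, Chinese Academy of Sciences. University of Chinese Academy of Sciences, 2020.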

Files in this item
File name/size	Document type	Version type	Access type	License
博士学位论文_高云泽.pdf (3939KB)	Dissertation		Restricted access	CC BY-NC-SA