Research on Scene Text Detection and Recognition Methods in Natural Scenes
Wang Cong (王聪)
2020-05-30
Pages: 140
Degree type: Doctorate
Abstract

Texts in natural scene images convey rich semantic information, so scene text extraction technology has broad application needs and prospects. However, due to the diversity of text appearance, complex backgrounds, and low imaging quality in natural scene images, extracting text from them is a very challenging problem. Scene text extraction involves two sub-tasks, text detection and text recognition, which are the main research objectives of this thesis. The contributions of this dissertation are summarized as follows:
A scene text detection method with superpixel-based character candidate extraction is proposed. Unlike representative character candidate extraction methods based on extremal regions, the proposed superpixel-based method exploits the color consistency and edge visibility of characters: it fuses color and edge information to segment a scene text image into superpixels, and then extracts character candidates through hierarchical clustering. In addition, we design a convolutional neural network based text/non-text classifier that incorporates the contextual information of each character candidate region, and combine it with a double-threshold strategy to filter character candidates. Experimental results on public datasets show that the proposed scene text detection system outperforms previous representative connected-component-based methods.
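The double-threshold filtering described above can be read as a hysteresis rule: high-scoring candidates are kept outright, and mid-scoring candidates survive only if they neighbor an already-kept one. A minimal numpy sketch under that assumption (the function name, threshold values, and adjacency encoding are illustrative, not the thesis's actual implementation):

```python
import numpy as np

def double_threshold_filter(scores, adjacency, t_high=0.8, t_low=0.4):
    """Hysteresis-style double-threshold filtering (illustrative sketch).

    Candidates scoring above t_high are kept outright; candidates between
    t_low and t_high are kept only if adjacent to an already-kept candidate.
    `adjacency` is a boolean matrix marking which candidates neighbor each other.
    """
    scores = np.asarray(scores, dtype=float)
    keep = scores >= t_high          # confident text candidates
    candidate = scores >= t_low      # possible text candidates
    changed = True
    while changed:                   # propagate acceptance through neighbors
        changed = False
        for i in range(len(scores)):
            if candidate[i] and not keep[i] and np.any(adjacency[i] & keep):
                keep[i] = True
                changed = True
    return keep
```

For example, a chain of four candidates with scores [0.9, 0.5, 0.3, 0.6] keeps the first two: the 0.5 candidate is rescued by its confident neighbor, while the 0.6 candidate is isolated behind a sub-threshold one.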
A memory-augmented attention network for scene text recognition is proposed. Most previous attention-based scene text recognition methods adopt a standard attention network as the decoder, which, when decoding the character at the current time step, does not make full use of the character information from earlier time steps or the alignment information accumulated over all historical time steps. To address this, the proposed memory-augmented attention network (MAAN) augments the standard attention network with memory in two respects: memory of historical character information and memory of historical alignment information. Experimental results on public datasets show that MAAN outperforms the standard attention network and achieves comparable or better performance than previous state-of-the-art methods.
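One simple way to give an attention decoder memory of historical alignments is a coverage-style accumulator that penalizes positions already attended to. The sketch below is a stand-in to convey the idea, not MAAN's actual formulation (the abstract does not specify one); the penalty weight and dot-product scoring are assumptions:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def coverage_attention_step(query, keys, coverage, w_cov=1.0):
    """One decoding step of an attention that remembers past alignments.

    `coverage` is the running sum of alignment weights from all previous
    steps; scores are penalized where coverage is already high, steering
    the decoder away from re-attending the same encoder positions.
    """
    scores = keys @ query - w_cov * coverage   # dot-product score minus coverage penalty
    alpha = softmax(scores)                    # alignment over encoder positions
    context = alpha @ keys                     # attended context vector
    return context, alpha, coverage + alpha    # updated alignment memory
```

Running the same query twice shows the effect: the second step's alignment is flatter, because the first step's attention is remembered and discounted.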
An attention network with gated embedding for scene text recognition is proposed. When decoding the character at the current time step, the standard attention network relies too heavily on the embedding vector of the character from the previous time step, yet the source of that embedding vector differs between the training and test phases. To address this, the proposed attention network with gated embedding (GEAN) adds an adaptive embedding gate that adaptively resets the input information coming from the previous character embedding vector; the gate is constructed from the degree of correlation between the hidden state vector and the embedding vector of the corresponding character at the same time step. Experimental results on public datasets show that GEAN outperforms the standard attention network in recognition performance.
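The gating idea can be sketched as a scalar gate driven by how well the previous character embedding agrees with the current hidden state. Cosine similarity is used here as a simple correlation measure; this is an illustrative assumption, not GEAN's exact construction:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gated_embedding(prev_embedding, hidden, w=1.0, b=0.0):
    """Adaptive embedding gate (illustrative sketch, not the exact GEAN form).

    The gate is driven by how well the previous character embedding agrees
    with the current hidden state (cosine similarity as a stand-in for the
    correlation measure); low agreement shrinks the embedding's influence.
    """
    corr = prev_embedding @ hidden / (
        np.linalg.norm(prev_embedding) * np.linalg.norm(hidden) + 1e-8)
    gate = sigmoid(w * corr + b)      # scalar gate in (0, 1)
    return gate * prev_embedding      # reset (scale down) unreliable input
```

When the hidden state conflicts with the previous embedding, which can happen at test time when the previous prediction was wrong, the gate shrinks the embedding's contribution instead of trusting it fully.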
A multi-branch guided attention network for irregular text recognition is proposed. The method provides a simple but effective way to handle multiple types of irregularity in irregular text images simultaneously. Through mutual guidance among multi-branch data during training, the proposed multi-branch guided attention network (MBAN) can learn semantic representations of predicted character sequences that are invariant between regular text images and their irregular counterparts, and can alleviate the attention drift problem often encountered by the standard attention network, significantly improving the accuracy of the alignment factors at each decoding step. Experiments on public datasets verify the effectiveness of MBAN in recognizing irregular text and alleviating the attention drift problem, and its performance is comparable or superior to previous state-of-the-art irregular text recognition methods.
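Mutual guidance between branches is often implemented as a consistency loss that pulls the per-character prediction distributions of two branches toward each other. A symmetric KL divergence is one plausible form of such a loss; this is an assumption for illustration, not the thesis's actual objective:

```python
import numpy as np

def symmetric_kl(p, q, eps=1e-12):
    """Symmetric KL divergence between two per-character distributions,
    one plausible mutual-guidance consistency term between a regular-image
    branch and an irregular-image branch (an illustrative assumption)."""
    p = np.clip(np.asarray(p, dtype=float), eps, 1.0)
    q = np.clip(np.asarray(q, dtype=float), eps, 1.0)
    kl_pq = np.sum(p * np.log(p / q))   # D_KL(p || q)
    kl_qp = np.sum(q * np.log(q / p))   # D_KL(q || p)
    return 0.5 * (kl_pq + kl_qp)
```

The term is zero when both branches predict the same character distribution and grows as they disagree, so minimizing it encourages the invariant semantic representation described above.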

Keywords: scene text detection, scene text recognition, superpixel segmentation, attention network, mutual guidance mechanism
Language: Chinese
Document type: Doctoral dissertation
Identifier: http://ir.ia.ac.cn/handle/173211/39730
Collection: State Key Laboratory of Multimodal Artificial Intelligence Systems_Pattern Analysis and Learning
Recommended citation (GB/T 7714):
王聪. 自然场景文本检测与识别方法研究[D]. 中国科学院大学. 中国科学院大学, 2020.
Files in this item:
学位论文_王聪_签名版.pdf (8251KB), dissertation, open access, license: CC BY-NC-SA