Scene Text Detection and Recognition Based on Deep Learning (基于深度学习的场景文字检测与识别)
Author: 杜臣
Date: 2021-08-04
Pages: 140
Degree type: Doctoral
Chinese Abstract (translated)

Text in natural scenes is an important carrier of high-level image semantics and can provide rich and accurate semantic information for scene understanding. Enabling computers to automatically understand the high-level semantics contained in images and videos through text detection and recognition, and using the obtained information to drive further applications, is of great significance to the development of computer vision. In recent years, with the development of deep learning, deep-learning-based text detection and recognition has come to play an important role in many applications such as autonomous driving, smart finance, online education, image and text content supervision, and information retrieval, but these applications also bring new problems and challenges. In scenarios such as natural scene images and web images, complex backgrounds, numerous text-like objects, and the diversity of text shape, character orientation, and typesetting format mean that the performance of automatic text detection and recognition is still unsatisfactory, and many problems remain to be solved.

This thesis studies deep-learning-based scene text detection and recognition. Building on a detailed survey of related work at home and abroad, it investigates problems in the design and application of deep-learning-based methods, specifically: text detection in scene images, unconstrained-view text recognition, and end-to-end text detection and recognition in scene images. The main work and contributions of this thesis are as follows:

1. A scene text detection method fusing edge awareness and region awareness

To detect text of arbitrary orientation in scene images, this thesis proposes a scene text detection method that fuses edge awareness and region awareness. The method formulates text detection as three tasks that a convolutional neural network can learn jointly: text region prediction, text edge prediction, and text box prediction. The designed supervision reduces the interference of background information inside text regions on the learning of the detection model, and thereby reduces false detections caused by background texture. Compared with existing text detection methods, this method detects text of arbitrary size and orientation in scene images more precisely and achieves better results on public datasets.
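The three prediction tasks can be pictured as parallel convolutional heads on a shared backbone feature map. The following PyTorch snippet is a minimal sketch of that idea, assuming illustrative channel counts and a generic box parameterization (four boundary distances plus a rotation angle); it is not the exact architecture used in the thesis.

# Hypothetical sketch: three parallel prediction heads for region, edge, and box geometry.
import torch
import torch.nn as nn

class MultiTaskTextHead(nn.Module):
    """Predicts a text-region map, a text-edge map, and box geometry
    from a shared feature map (channel sizes are illustrative)."""
    def __init__(self, in_channels=256):
        super().__init__()
        self.region_head = nn.Conv2d(in_channels, 1, kernel_size=1)    # text vs. non-text score
        self.edge_head = nn.Conv2d(in_channels, 1, kernel_size=1)      # text-edge probability
        self.geometry_head = nn.Conv2d(in_channels, 5, kernel_size=1)  # 4 boundary distances + angle

    def forward(self, feat):
        region = torch.sigmoid(self.region_head(feat))
        edge = torch.sigmoid(self.edge_head(feat))
        geometry = self.geometry_head(feat)
        return region, edge, geometry

feat = torch.randn(1, 256, 128, 128)  # backbone features, e.g. stride 4 of a 512x512 image
region, edge, geometry = MultiTaskTextHead()(feat)
print(region.shape, edge.shape, geometry.shape)

During training, the edge and region maps would be supervised separately, so the model learns where text ends and background begins instead of treating the whole bounding box as foreground.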

2. A scene text detection method based on selective aggregation of multi-scale features

In deep-learning-based scene text detection, high-level and low-level features are usually fused across layers in the convolutional feature extraction stage to cope with variation in text scale. Experiments in this thesis show that the low-level features extracted by a convolutional neural network contain too much background texture, which is hard to distinguish from text texture; when these features are fused with high-level features for the subsequent detection task, the detection model produces many false detections. To address this, this thesis proposes a scene text detection method based on selective aggregation of multi-scale features. Through the proposed selective feature aggregation mechanism, the method suppresses the interference of complex background texture on the detection results at the feature level and improves scene text detection performance.
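One way to picture such a selective aggregation is to let the semantically stronger high-level features gate the low-level features before fusion. The PyTorch sketch below is a hypothetical illustration of this idea; the module name, channel sizes, and the sigmoid-mask gating are assumptions for illustration, not the exact mechanism proposed in the thesis.

# Hypothetical sketch: gate low-level features with a mask predicted from
# upsampled high-level features, then fuse, so background texture is suppressed.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SelectiveAggregation(nn.Module):
    def __init__(self, low_ch, high_ch, out_ch=256):
        super().__init__()
        self.reduce_low = nn.Conv2d(low_ch, out_ch, kernel_size=1)
        self.reduce_high = nn.Conv2d(high_ch, out_ch, kernel_size=1)
        self.gate = nn.Sequential(nn.Conv2d(out_ch, 1, kernel_size=3, padding=1), nn.Sigmoid())

    def forward(self, low_feat, high_feat):
        high = F.interpolate(self.reduce_high(high_feat), size=low_feat.shape[-2:],
                             mode="bilinear", align_corners=False)
        mask = self.gate(high)                  # regions the high-level semantics consider text-like
        low = self.reduce_low(low_feat) * mask  # suppress background-texture responses
        return low + high                       # fused multi-scale feature

low = torch.randn(1, 64, 128, 128)   # stride-4 (low-level) features
high = torch.randn(1, 512, 32, 32)   # stride-16 (high-level) features
fused = SelectiveAggregation(64, 512)(low, high)
print(fused.shape)  # torch.Size([1, 256, 128, 128])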

3. An unconstrained-view text recognition method based on centerline rectification and multi-view feature aggregation

Unconstrained-view text refers to text that is curved or whose character orientation varies; such text appears widely in scene images and is difficult to recognize. To address this, this thesis proposes an unconstrained-view text recognition method based on centerline rectification and multi-view feature aggregation. In this method, the centerline rectification mechanism adaptively rectifies the input image before recognition, straightening curved text lines into horizontal, evenly distributed ones, while the multi-view feature aggregation mechanism adaptively learns the changes of text orientation in the image and extracts features better suited to recognition. Both the centerline rectification model and the multi-view feature aggregation model are trained in a weakly supervised manner guided by the recognition results, requiring no extra manual annotation. The whole algorithm can recognize text images of arbitrary orientation and arrangement and achieves good results on both Chinese and English recognition tasks.
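Centerline rectification can be viewed as resampling the image along the predicted text centerline so that a curved line becomes a straight horizontal strip. The snippet below is a simplified, hypothetical sketch built on PyTorch's grid_sample: the centerline and text height are hard-coded illustrative values, and the perpendicular sampling direction is simplified to vertical, whereas the thesis's module predicts these quantities from the image and is trained with weak supervision from the recognition loss.

# Hypothetical sketch: sample the image along a text centerline to straighten curved text.
import torch
import torch.nn.functional as F

def rectify_along_centerline(image, centerline, half_height, k=15):
    """image: (1, C, H, W); centerline: (N, 2) normalized (x, y) points in [-1, 1];
    half_height: normalized half text height; returns a (1, C, 2k+1, N) rectified strip."""
    n = centerline.shape[0]
    offsets = torch.linspace(-half_height, half_height, 2 * k + 1)   # vertical sampling offsets
    xs = centerline[:, 0].view(1, n).expand(2 * k + 1, n)
    ys = centerline[:, 1].view(1, n) + offsets.view(2 * k + 1, 1)
    grid = torch.stack([xs, ys], dim=-1).unsqueeze(0)                # (1, 2k+1, N, 2)
    return F.grid_sample(image, grid, align_corners=False)

image = torch.randn(1, 3, 64, 256)                    # a curved text-line crop
x = torch.linspace(-0.9, 0.9, 32)                     # illustrative centerline ...
y = 0.3 * torch.sin(3.1416 * x)                       # ... gently curved
rectified = rectify_along_centerline(image, torch.stack([x, y], dim=1), half_height=0.4)
print(rectified.shape)  # torch.Size([1, 3, 31, 32])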

4. An end-to-end scene text detection and recognition method based on on-demand sharing of multi-scale convolutional features and feature rectification

Deep-learning-based text detection and recognition models have a large number of parameters and are difficult to apply on resource-constrained devices. To address this, this thesis proposes an end-to-end scene text detection and recognition method based on on-demand sharing of multi-scale convolutional features and feature rectification. Compared with previous methods, it couples the detection and recognition modules through on-demand sharing of multi-scale convolutional features, resolving the feature incompatibility caused by the different feature descriptions required by the detection and recognition tasks when features are shared. For irregularly shaped text, the method uses the holistic information in the detection features to learn the shape change of the text within a text region and to rectify the text-region features fed to the recognition branch, improving recognition of irregularly shaped text. Compared with two-stage detection and recognition methods, sharing the feature extraction network and rectifying at the feature level effectively reduce the number of model parameters and improve computational efficiency.
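The "on-demand" sharing of multi-scale features can be illustrated by giving each task its own learnable mixture over a shared feature pyramid, so detection and recognition each draw the feature description they need instead of relying on a single fixed fusion. The PyTorch sketch below is a hypothetical simplification; the per-level softmax weights and the three-level pyramid are assumptions for illustration, not the thesis's actual sharing mechanism.

# Hypothetical sketch: detection and recognition learn separate mixtures over a shared pyramid.
import torch
import torch.nn as nn
import torch.nn.functional as F

class OnDemandSharing(nn.Module):
    def __init__(self, num_levels=3):
        super().__init__()
        self.det_weights = nn.Parameter(torch.zeros(num_levels))  # learned per-level weights
        self.rec_weights = nn.Parameter(torch.zeros(num_levels))

    def _mix(self, feats, weights):
        target = feats[0].shape[-2:]
        w = torch.softmax(weights, dim=0)
        resized = [F.interpolate(f, size=target, mode="bilinear", align_corners=False) for f in feats]
        return sum(wi * fi for wi, fi in zip(w, resized))

    def forward(self, feats):
        return self._mix(feats, self.det_weights), self._mix(feats, self.rec_weights)

pyramid = [torch.randn(1, 256, 64, 64), torch.randn(1, 256, 32, 32), torch.randn(1, 256, 16, 16)]
det_feat, rec_feat = OnDemandSharing()(pyramid)
print(det_feat.shape, rec_feat.shape)  # both torch.Size([1, 256, 64, 64])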

5. An end-to-end Chinese text detection and recognition method combined with a language model

In Chinese recognition scenarios, because characters in an image are relatively widely spaced and lines may be laid out horizontally or vertically at random, existing methods are prone to detection ambiguity, where the detected groupings do not match the actual semantics. To address this, this thesis proposes an end-to-end text detection and recognition method combined with language-model processing. The method detects and recognizes all possible horizontal and vertical text lines in the input image, then uses a designed semantic filtering model to judge the semantics of the recognition results and select among overlapping text lines. Compared with existing detection and recognition methods that use only visual features, this method effectively combines semantic information to resolve the ambiguous detection and recognition caused by complex typesetting formats.
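The semantic filtering step can be illustrated with a toy character-bigram language model that scores how "confusing" each candidate reading is and keeps the most fluent one among overlapping horizontal and vertical groupings. Everything below, including the bigram model, the add-one smoothing, and the tiny corpus, is an illustrative stand-in rather than the thesis's actual semantic filtering model.

# Hypothetical sketch: pick the reading with the lowest semantic confusion (toy bigram LM).
import math
from collections import Counter

def train_bigram(corpus):
    unigrams, bigrams = Counter(), Counter()
    for line in corpus:
        for a, b in zip(line, line[1:]):
            unigrams[a] += 1
            bigrams[(a, b)] += 1
    return unigrams, bigrams

def confusion(text, unigrams, bigrams, vocab_size=5000):
    # Average negative log-probability per bigram; lower means more fluent.
    logp = 0.0
    for a, b in zip(text, text[1:]):
        p = (bigrams[(a, b)] + 1) / (unigrams[a] + vocab_size)  # add-one smoothing
        logp += -math.log(p)
    return logp / max(len(text) - 1, 1)

def pick_reading(candidates, unigrams, bigrams):
    return min(candidates, key=lambda t: confusion(t, unigrams, bigrams))

corpus = ["深度学习场景文字检测", "场景文字识别方法研究"]   # illustrative corpus only
uni, bi = train_bigram(corpus)
print(pick_reading(["场景文字", "场文景字"], uni, bi))       # prefers the fluent grouping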

English Abstract

Text is an important carrier of high-level image semantics and can provide rich and accurate semantic information for scene understanding. Enabling computers to automatically understand the high-level semantic information contained in images and videos through text detection and recognition, and to use the obtained information to support further applications, is of great significance to the development of computer vision. In recent years, with the development of deep learning, deep-learning-based text detection and recognition has come to play an important role in many application scenarios, such as autonomous driving, smart finance, online education, network information supervision, and information retrieval. At the same time, these application scenarios bring new problems and challenges to text detection and recognition. Due to complex backgrounds and numerous text-like objects in images, as well as the diversity of text objects in shape, character orientation, and typesetting format, the performance of text detection and recognition in complex scene images is still unsatisfactory. It is therefore necessary to provide effective solutions for detecting and recognizing text in complex scene images.

Based on a detailed review of related work, this thesis conducts in-depth research on deep-learning-based text detection and recognition in complex scene images, covering text detection in complex background images, unconstrained-view text recognition, and end-to-end text detection and recognition in complex scenes. The main contributions and innovations of this dissertation are summarized as follows:

1. Scene text detection based on text edge and region awareness

Aiming at the problem of detecting arbitrarily oriented text in scene images, this thesis proposes a scene text detection method based on text edge and region awareness. The proposed method is optimized in an end-to-end way with multi-task outputs: text/non-text classification, text edge prediction, and text boundary regression. Compared with existing text detection methods, the proposed method uses text edge information to suppress the interference of background within the text region during training, which effectively reduces false detections caused by complex backgrounds. Experiments on several standard datasets demonstrate that the proposed method accurately detects text of arbitrary size and orientation in scene images and outperforms state-of-the-art methods in terms of both accuracy and efficiency.

2. Scene text detection based on selective aggregation of multi-level features

In the feature extraction stage of deep-learning-based scene text detection methods, the combination of low-level and high-level features plays an important role in handling text scale variation. Experiments in this thesis show that the low-level features extracted by a convolutional neural network contain too many background texture features, which are hard to distinguish from text features and lead to false detections when fused with high-level features. This thesis therefore proposes a scene text detection method based on selective aggregation of multi-level features. Through an effective selective feature aggregation mechanism, the method suppresses the interference of complex background texture on the detection results at the feature level and improves the performance of text detection in complex background images.

3. Unconstrained-view text recognition based on a weakly supervised centerline rectification network and multi-view feature aggregation network

In complex scenes, many texts are curved or suffer from orientation changes caused by mirror flipping, which makes recognition more difficult. This thesis proposes a pipeline for unconstrained-view text recognition based on a weakly supervised centerline rectification network and a multi-view feature aggregation network. The centerline rectification network rectifies curved text into horizontal text. The multi-view feature aggregation network learns the direction changes of the sequence features and selects the most suitable features during feature extraction. Both modules are trained with weak supervision: they require no extra annotation and are driven entirely by the gradients back-propagated from the recognition network. The whole pipeline automatically handles multi-view scene text recognition, and experimental results show the superior performance of the proposed method.

4. End-to-end text detection and recognition based on an on-demand multi-scale convolutional feature sharing mechanism and feature rectification

Deep-learning-based text detection and recognition models have a large number of parameters and are therefore difficult to deploy on resource-constrained devices. To address this, this thesis proposes an end-to-end scene text detection and recognition method based on an on-demand multi-scale convolutional feature sharing mechanism and feature rectification. Compared with previous methods, this method couples the detection and recognition modules through on-demand sharing of multi-scale convolutional features, solving the feature incompatibility caused by the different feature descriptions required by the detection and recognition tasks when features are shared. To extract effective features for perspective and curved text recognition, the method uses the holistic information in the detection features to learn the shape change of the text within a text region, which is then used to rectify the text-region features fed to the recognition branch. Compared with two-stage text detection and recognition methods, this method effectively reduces the number of parameters and improves computational efficiency by sharing the feature extraction network and performing rectification at the feature level.

5. Language-model-based end-to-end Chinese text detection and recognition

Due to detection ambiguity caused by large character spacing and characters spread evenly across multiple rows and columns, end-to-end text recognition remains challenging in Chinese scenes. This thesis proposes a language-model-based Chinese text spotting framework. In this framework, a text detection module obtains all visually plausible groupings of the characters, from which a language model later selects the correct readings. A language-model-based filter makes a semantic judgment on the recognition results by calculating their degree of semantic confusion. Unlike previous methods that use only visual features for text detection and recognition, the proposed method effectively handles ambiguous detection and recognition in complex typesetting formats by incorporating semantic information.

Keywords: text detection, text recognition, feature aggregation, text edge awareness, centerline rectification
Language: Chinese
Sub-direction classification (within the seven major directions): Text Recognition and Document Analysis
Document type: Doctoral thesis
Identifier: http://ir.ia.ac.cn/handle/173211/46631
Collection: State Key Laboratory of Management and Control for Complex Systems, Image Analysis and Machine Vision
Recommended citation (GB/T 7714):
杜臣. 基于深度学习的场景文字检测与识别[D]. 中国科学院自动化研究所, 2021.
Files in this item:
杜臣-基于深度学习的场景文字检测与识别. (15118 KB), doctoral thesis, open access, license: CC BY-NC-SA