Research on Text Detection and Recognition Methods in Images and Videos

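Title	Research on Text Detection and Recognition Methods in Images and Videos (图像与视频中的文本检测与识别方法研究)
Author	冯伟 (Feng Wei)
Date	2021-06
Pages	124
Degree type	Doctoral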
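Chinese abstract (in translation)	In recent years, with the widespread use of the Internet, large numbers of natural scene images and videos have been spread over the network. In these images and videos, text often helps humans understand the content. Text detection and recognition therefore also help computers understand and analyze images and videos quickly and effectively. Compared with text in scanned images, text in natural scenes is more challenging in terms of shape, font, color, image resolution, shooting angle, and background complexity. This dissertation studies text detection and recognition in scene images and videos and proposes several effective methods: it starts from the quadrilateral text common in images, extends to arbitrary-shaped text in natural scenes, and finally generalizes to text in videos, which are composed of image sequences. The main contributions are as follows.
1. A quadrilateral scene text detection method based on recurrent instance segmentation is proposed. It targets the adhesion problem between quadrilateral text lines. A fully convolutional network classifies text and non-text regions, and a recurrent neural network then uses the features extracted by the fully convolutional network to detect and segment one text instance at each time step. Because the method detects text lines by instance segmentation, it can effectively resolve the adhesion of adjacent text lines. Experimental results show that the method achieves competitive results on two quadrilateral scene text datasets.
2. A bottom-up method for end-to-end recognition of arbitrary-shaped scene text is proposed. The text detector describes the shape of the text with a series of rotated squares and obtains the final text bounding box by aggregating multiple rotated squares bottom-up. Based on the detected rotated squares, a novel sliding region-of-interest operator extracts the arbitrary-shaped text region from the feature map. Finally, a convolutional neural network based character classifier and a connectionist temporal classification based decoder recognize the extracted features. The method achieves the best performance on two arbitrary-shaped text datasets and competitive results on quadrilateral text data.
3. A residual dual-scale method for end-to-end text recognition that fuses bottom-up and top-down processing is proposed. The bottom-up detector describes the shape of text lines with a series of rotated squares, the top-down detector represents the text region of interest with a minimum enclosing rotated rectangle, and the final bounding box of a text line is determined jointly by the two detectors. In addition, a residual dual-scale mechanism is proposed to improve robustness to scale variation: two end-to-end recognizers take features at different scales as input, and the high-level recognizer also learns the residual of the low-level recognizer. The method achieves the best performance on four English datasets and one Chinese dataset; these datasets contain both common quadrilateral text and arbitrary-shaped text.
4. A video text detection method based on semantic features is proposed. A character center segmentation branch extracts semantic features that encode the category and position of characters, and an appearance-semantic-geometry descriptor is then used to track text instances; the semantic features improve robustness to appearance changes. To overcome the shortage of character-level annotations, a weakly supervised character center detection module is proposed, which uses only real images with word-level annotations to generate character-level annotations. The method achieves the best performance on four video text datasets and two Chinese image text datasets.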
English abstract	In recent years, with the wide use of the Internet, a large number of scene images and videos have been spread through the Internet. Texts in images and videos can help to understand, analyze, and retrieve images and videos quickly and effectively. Compared with texts in scanned images, scene texts have higher diversity and uncertainty, owing to complex image backgrounds and changes in image resolution, illumination, and perspective. This dissertation studies techniques for text detection and recognition in scene images and videos, and considers common quadrilateral texts in images, arbitrary-shaped texts, and texts in videos. The main contributions of this dissertation are summarized as follows.
1. A method for quadrilateral scene text detection with recurrent instance segmentation is proposed. It addresses the adhesion problem in quadrilateral texts. A fully convolutional network is used to classify text and non-text regions, and a recurrent neural network then uses the features extracted by the fully convolutional network to detect and segment one text instance at each time step. As this method adopts the idea of instance segmentation to detect texts, it can effectively solve the adhesion problem. Experimental results show that the proposed method achieves competitive results on two quadrilateral scene text datasets.
2. A bottom-up method for end-to-end arbitrary-shaped text spotting is proposed. In this method, the text detector uses a series of rotated squares to describe the shape of the text, and aggregates multiple rotated squares to get the final bounding box. A novel operator, RoISlide, is then used to extract the arbitrary-shaped text region from the feature map by affine transformation of the detected rotated squares. On the basis of the features extracted by RoISlide, a text recognizer based on a convolutional neural network and connectionist temporal classification is used to recognize the text. The proposed method achieves state-of-the-art performance on two arbitrary-shaped text datasets, and achieves competitive results on one quadrilateral text dataset.
3. A residual dual-scale method fusing bottom-up and top-down processing is proposed for scene text spotting. In the method, the bottom-up detector uses a series of rotated squares to describe the shape of texts, the top-down detector uses the minimum enclosing rotated rectangle to represent the region of interest of the text, and the final bounding box is determined by fusing the outputs of the two detectors. To improve the robustness against scale variation, we further propose a residual dual-scale spotting mechanism, where two spotters work on different feature levels, and the high-level spotter is based on residuals of the low-level spotter. The proposed method achieves state-of-the-art performance on four English datasets and one Chinese dataset, covering both arbitrary-shaped and oriented texts.
4. A semantic-aware video text detection method is proposed. Specifically, a character center segmentation branch is used to extract semantic features that encode the category and position of characters. A novel appearance-semantic-geometry descriptor is then used to track text instances, in which semantic features can improve the robustness against appearance changes. To overcome the lack of character-level annotations, we propose a novel weakly supervised character center detection module, which uses only real images with word-level annotations to generate character-level labels. The proposed method achieves state-of-the-art performance on two video text datasets and two Chinese scene text datasets.
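Keywords	text detection and recognition; instance segmentation; bottom-up; top-down; semantic features
Language	Chinese
Document type	Dissertation
Item identifier	http://ir.ia.ac.cn/handle/173211/45044
Collection	State Key Laboratory of Multimodal Artificial Intelligence Systems_Pattern Analysis and Learning
Recommended citation (GB/T 7714)	Feng Wei. Research on Text Detection and Recognition Methods in Images and Videos[D]. Institute of Automation, Chinese Academy of Sciences. University of Chinese Academy of Sciences, 2021.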
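
The abstract's second contribution mentions a recognizer built on connectionist temporal classification (CTC). As a rough illustration of what CTC decoding does in general, the sketch below shows standard greedy decoding: take the most likely label per frame, collapse consecutive repeats, then drop the blank symbol. It is not the dissertation's implementation; the blank index, the alphabet, and the frame labels are all hypothetical values chosen for the example.

```python
from itertools import groupby

BLANK = 0  # assumption: index 0 is the CTC blank symbol


def ctc_greedy_decode(frame_labels, alphabet):
    """Collapse consecutive repeated labels, then drop blanks."""
    # groupby merges runs of identical consecutive labels into one entry.
    collapsed = [label for label, _ in groupby(frame_labels)]
    return "".join(alphabet[label] for label in collapsed if label != BLANK)


# Hypothetical alphabet and per-frame argmax labels, for illustration only.
ALPHABET = ["-", "a", "c", "t"]
frames = [2, 2, 0, 1, 1, 0, 3, 3, 3]
print(ctc_greedy_decode(frames, ALPHABET))  # prints "cat"
```

A blank between two identical labels keeps them as separate characters, so the frame sequence `[1, 1, 0, 1]` decodes to `"aa"` rather than `"a"`.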

Files in this item
File name/size	Document type	Version type	Access type	License
图像与视频中的文本检测与识别方法研究.p (18533KB)	Dissertation		Open access	CC BY-NC-SA