Research on Text Detection and Recognition Methods in Natural Scene Images (自然场景文本检测与识别方法研究)
Author: 王燕娜 (Wang Yanna)
Degree type: Doctor of Engineering
Supervisor: 王春恒 (Wang Chunheng)
Date: 2018-05-29
Degree-granting institution: 中国科学院研究生院
Place of degree conferral: Beijing
Keywords: natural scene images; text detection; text binarization; text recognition; graph model; text context information; character strokes; character structure information
Abstract

As an important hallmark of the progress of human civilization, text is the primary medium of human communication. With the rapid development of the Internet industry and the widespread use of camera-equipped smart terminals, multimedia information carried by images has brought great convenience to people's lives. Text in images conveys rich and accurate semantic information, so the demand for automatically detecting and recognizing text in images keeps growing and is attracting more and more researchers. In recent years, automatic text recognition for scanned documents has matured, whereas the performance of automatic text detection and recognition in natural scenes remains unsatisfactory, being affected by many interfering factors such as varying fonts, low resolution, uneven illumination, and complex, changeable backgrounds.

Starting from the characteristics of scene text, this thesis carries out a series of studies on the text detection, text binarization, and text recognition problems involved in scene text detection and recognition. The main work and contributions of this thesis are as follows:

1. Owing to the many interfering factors in natural scenes, text exhibits large intra-class variation and the background is highly uncertain, so a single classifier or feature can hardly distinguish text regions from non-text regions effectively. To address this problem, this thesis proposes a graph-model-based text detection method built on convolutional neural networks and context information. Starting from the characteristics of text itself, the method fuses multi-source information into one framework to improve detection performance. Maximally stable extremal regions (MSER) are used to detect character candidates, improving character recall and detection speed; a graph model is then constructed from multiple kinds of context information together with the information of individual character candidate regions to improve detection; context information is further used to recover missing text and raise character recall; finally, to reduce the intra-class variation of text lines, text-line classifiers incorporating grayscale and binary information are designed for text lines of different shapes, improving text-line classification and the final detection performance. Experimental results show that the proposed scene text detection method achieves satisfactory performance on four public datasets, demonstrating its effectiveness and generality.

2. Since binarization methods suited to traditional scanned text blocks perform poorly on scene text, this thesis proposes an adaptive scene text binarization method based on stroke characteristics. To reduce the interference of complex backgrounds and the influence of strokes of adjacent characters, the whole text-line image is first split into several sub-images; an algorithm based on character stroke characteristics then automatically selects foreground and background pixels with high confidence, the selected seed pixels are used to generate initial foreground and background cluster centers, and from these the cluster centers of the whole image are obtained; finally, a graph model combining per-pixel information with context information produces the final binarization. The method is evaluated on video overlay text images and scene text images with both recognition-level and pixel-level criteria, verifying its effectiveness for text image binarization.

3. Character features are a key factor in character recognition. From the perspective of character feature representation, this thesis proposes a convolutional-activation-based representation for scene characters. A convolutional neural network is used to extract features of character stroke parts. Then, regarding a character as a structured object, spatial information is incorporated into the convolutional activation features, and pooling and encoding strategies are used to generate a global character representation. To cope with stroke variation across different image sizes, multi-scale image input is used to enhance the robustness of the character features. To evaluate the effectiveness and generality of the convolutional-activation-based representation, in addition to public English datasets, a Chinese scene character dataset is collected for research on Chinese scene character recognition. The proposed method is evaluated comprehensively on seven scene character datasets, and its performance differences across characters of different languages are explored. Experimental results show that the convolutional-activation-based character representation is effective for multilingual character recognition.

4. Considering that scene characters consist of a series of strokes arranged according to specific rules, this thesis makes full use of the stroke and structural characteristics of characters and further proposes two convolutional-activation-based scene character representations. First, a character feature based on encoding multi-order co-occurrence activations is proposed; the idea is that individual discriminative strokes provide important cues for character recognition, while the co-occurrence of several discriminative strokes provides additional context. Multi-order co-occurrence activations are constructed to capture relationships among strokes at multiple levels and strengthen the representation, and an encoding strategy aggregates the extracted multi-order co-occurrence descriptors into a global character representation. The method is evaluated on public international datasets and on the collected Chinese dataset, and the results verify its effectiveness. To exploit character structure more fully, this thesis regards each character class as a structured object and proposes a character recognition method based on spatially embedded discriminative stroke-part detectors. The method combines stroke detectors with spatial positions, on the premise that different character classes have different discriminative strokes. Convolutional activations represent stroke-part features; stroke detectors are learned automatically, those corresponding to discriminative parts are selected automatically, and each detector is assigned a response region. The discriminative part detectors are then associated with spatial positions to alleviate the effects of character translation, rotation, and deformation, and the detector responses are aggregated into the final character feature. Experimental results show that the proposed character recognition methods achieve excellent performance on both English and Chinese scene character datasets.
Other Abstract (English)
Text is an important symbol of the progress of human civilization, as well as the main medium of human communication. With the rapid development of the Internet and the wide application of camera-based intelligent terminals, image-based multimedia information has brought much convenience to people's lives. Text contained in an image expresses rich and accurate semantic information, so there is increasing demand for automatically detecting and recognizing text in images, which attracts more and more researchers' attention. Nowadays, text recognition technology for scanned documents has matured, but the performance of automatic scene text detection and recognition is still unsatisfactory, as it is affected by many kinds of interference such as varying fonts, low resolution, uneven illumination, and complex backgrounds.
Considering the characteristics of scene text, this thesis conducts a thorough study of scene text detection, scene text binarization, and scene text recognition. The main contributions of this dissertation are summarized as follows:
1. Due to the various interfering factors in natural scenes, scene text has a high degree of intra-class variation and the background is highly uncertain, so a single classifier or feature can hardly distinguish text regions from non-text regions effectively. To deal with these problems, this thesis proposes a graph-model-based text detection method for natural scene images that uses convolutional neural networks and context information. Starting from the characteristics of text, we integrate various sources of information into one framework to improve detection performance. First, we use maximally stable extremal regions (MSER) to detect character candidates, which improves the recall of character candidates and the detection speed. Then, we integrate multiple kinds of neighboring context information with the information of each single character candidate to construct a graph model. Furthermore, we apply contextual information to recover missing character candidates. Finally, we design shape-specific text-line classifiers that integrate gray-level and binary features to reduce the intra-class variation of text lines and improve the final detection performance. Experimental results show that the proposed scene text detection method achieves satisfactory performance on four public scene text detection datasets, demonstrating its effectiveness and generality.
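The candidate-generation step described above can be illustrated with a short sketch. This is a minimal illustration, not the thesis implementation: it uses OpenCV's built-in MSER detector with default stability parameters and ad-hoc geometric filters; the input file name and thresholds are hypothetical placeholders, and the subsequent graph model and text-line classifiers are omitted.

```python
# Minimal sketch: MSER character-candidate generation with simple geometric
# pruning (the thesis follows this step with a graph model and text-line
# classifiers, which are not reproduced here).
import cv2

def detect_character_candidates(image_path):
    gray = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE)
    mser = cv2.MSER_create()                    # default stability parameters
    _, bboxes = mser.detectRegions(gray)        # stable regions and their boxes
    img_h, _ = gray.shape
    candidates = []
    for (x, y, w, h) in bboxes:
        aspect = w / float(h)
        # Keep regions whose aspect ratio and height are plausible for characters
        # (thresholds below are illustrative, not the thesis settings).
        if 0.1 < aspect < 3.0 and 8 < h < 0.9 * img_h:
            candidates.append((int(x), int(y), int(w), int(h)))
    return candidates

if __name__ == "__main__":
    boxes = detect_character_candidates("scene.jpg")   # hypothetical input image
    print(len(boxes), "character candidates")
```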
2. Considering that binarization methods designed for scanned document images often fail on scene text images, we propose an adaptive stroke-based scene text binarization method. To reduce the interference of complex backgrounds and the disturbance caused by strokes of adjacent characters, we first split the whole text-line image into several sub-images. Second, according to character stroke characteristics, we design an algorithm to automatically select foreground and background pixels with high confidence. Afterwards, we aggregate these seed pixels to obtain initial clustering centers for the foreground and background, which are then used to generate the clustering centers of the whole image. Finally, we integrate context information and single-pixel information in a graph model to obtain the binarization result. Experimental results on a video overlay text dataset and a scene text dataset, with both pixel-level and recognition-level evaluation criteria, verify the effectiveness of our text binarization method.
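The seeded-clustering idea behind the binarization method can be sketched roughly as follows. This is a simplified stand-in rather than the thesis algorithm: Otsu thresholding replaces the stroke-based selection of high-confidence seed pixels, a nearest-center assignment replaces the graph-model refinement, and the file names are hypothetical.

```python
# Simplified sketch of seeded two-class clustering for text-line binarization.
import cv2
import numpy as np

def binarize_text_line(gray):
    # Stand-in seed selection: pixels far from the Otsu threshold are treated
    # as high-confidence foreground (dark text) or background seeds.
    otsu_thr, _ = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    margin = 0.2 * 255
    fg_seeds = gray[gray < otsu_thr - margin]
    bg_seeds = gray[gray > otsu_thr + margin]
    if fg_seeds.size == 0 or bg_seeds.size == 0:
        return ((gray < otsu_thr) * 255).astype(np.uint8)

    # Initialize two cluster centers from the seed pixels, then assign every
    # pixel to the nearer center (no graph-model smoothing in this sketch).
    centers = np.array([fg_seeds.mean(), bg_seeds.mean()], dtype=np.float32)
    pixels = gray.astype(np.float32).ravel()
    labels = np.abs(pixels[:, None] - centers[None, :]).argmin(axis=1)
    return ((labels == 0).reshape(gray.shape) * 255).astype(np.uint8)

if __name__ == "__main__":
    gray = cv2.imread("text_line.png", cv2.IMREAD_GRAYSCALE)  # hypothetical input
    cv2.imwrite("binary.png", binarize_text_line(gray))
```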
3. Character features are an important element of character recognition. Taking character representation as the starting point, this thesis proposes convolutional-activation-based representations for scene characters. We first use convolutional neural networks to extract character stroke features. Next, regarding a character as a structured object, we incorporate spatial information into the convolutional activation features to boost recognition performance, and we use pooling and encoding strategies to generate global character representations. To deal with the variation of character strokes across different image sizes, we use multi-scale image input to enhance the robustness of the character features. To evaluate the effectiveness and generality of the convolutional-activation-based character representations, in addition to evaluating on public English datasets, we collect a Chinese scene character dataset that can be used for the study of Chinese scene character recognition. We evaluate the proposed method on seven scene character datasets and explore its performance differences for multilingual character recognition. The experimental results show the effectiveness of our method for multilingual scene character recognition.
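A rough sketch of such a convolutional-activation descriptor with coarse spatial pooling is given below. It is illustrative only: a torchvision VGG-16 stands in for the character network trained in the thesis, the layer cut-off and grid size are arbitrary choices, and the multi-scale input and encoding step are omitted.

```python
# Sketch: take intermediate convolutional activations, max-pool them over a
# coarse spatial grid to keep rough layout information, and concatenate the
# cells into a character descriptor.
import torch
import torch.nn.functional as F
import torchvision.models as models
import torchvision.transforms as T
from PIL import Image

conv_trunk = models.vgg16(weights=models.VGG16_Weights.DEFAULT).features[:17].eval()
preprocess = T.Compose([T.Resize((64, 64)), T.ToTensor()])

@torch.no_grad()
def character_descriptor(image, grid=2):
    x = preprocess(image).unsqueeze(0)          # 1 x 3 x 64 x 64 input
    fmap = conv_trunk(x)                        # 1 x C x H x W activations
    pooled = F.adaptive_max_pool2d(fmap, grid)  # max over each grid cell
    return pooled.flatten()                     # (C * grid * grid)-dim feature

if __name__ == "__main__":
    img = Image.open("char.png").convert("RGB") # hypothetical character crop
    print(character_descriptor(img).shape)
```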
4. Considering that characters are composed of a series of parts arranged in specific structures, this thesis takes advantage of the stroke and structural characteristics of characters and proposes two kinds of convolutional-activation-based representations. We first propose to encode multi-order co-occurrence activations to obtain character features. This method argues that individual discriminative strokes provide important clues for character recognition, and that the co-occurrence of a clique of discriminative strokes provides additional context information that boosts recognition. We therefore construct co-occurrence activation descriptors to capture multi-level stroke relationships, and adopt an encoding algorithm to aggregate the multi-order co-occurrence descriptors into a global character representation. We evaluate the proposed method on public datasets and the collected Chinese dataset, and the experimental results verify its effectiveness. To make fuller use of character structure information, this thesis regards each character class as a structured object and proposes spatially embedded discriminative part detectors for scene character recognition. This method integrates stroke detectors and spatial information to recognize characters, on the premise that characters of different classes have different discriminative strokes. We use convolutional neural networks to extract features for character strokes and automatically learn stroke detectors. Then, we automatically select the stroke detectors corresponding to discriminative strokes and assign each detector a specific spatial region. Afterwards, the spatial region information is embedded into the character representation to alleviate the influence of character translation, rotation, and deformation. Finally, we aggregate the maximal outputs of all the salient stroke detectors to represent characters. Experimental results show that the proposed methods achieve significant improvements on both English and Chinese scene character datasets.
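The spatial embedding of part detectors can be illustrated with the toy sketch below: each detector is a linear filter over local convolutional activations, is tied to one cell of a spatial grid, and contributes its maximum response inside that cell to the character feature. The detectors here are random tensors for illustration; the thesis learns them from convolutional activations and selects the discriminative ones, which is not reproduced here.

```python
# Toy sketch: spatially embedded part detectors pooled by maximum response.
import torch

def part_detector_feature(fmap, detectors, grid=2):
    """fmap: C x H x W activations; detectors: D x C filters, D divisible by grid*grid."""
    C, H, W = fmap.shape
    responses = torch.einsum("dc,chw->dhw", detectors, fmap)   # D x H x W responses
    per_cell = detectors.shape[0] // (grid * grid)
    feature = []
    for gy in range(grid):
        for gx in range(grid):
            ys = slice(gy * H // grid, (gy + 1) * H // grid)
            xs = slice(gx * W // grid, (gx + 1) * W // grid)
            dets = slice((gy * grid + gx) * per_cell, (gy * grid + gx + 1) * per_cell)
            # Each detector only reports its maximum response inside its own cell.
            feature.append(responses[dets, ys, xs].amax(dim=(1, 2)))
    return torch.cat(feature)

if __name__ == "__main__":
    fmap = torch.rand(256, 8, 8)       # stand-in convolutional activations
    detectors = torch.randn(64, 256)   # 64 hypothetical part detectors
    print(part_detector_feature(fmap, detectors).shape)   # 64-dim character feature
```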
In conclusion, we have studied scene text detection, binarization, and recognition, and have made progress on each of these problems.

 
Document type: Doctoral dissertation
Identifier: http://ir.ia.ac.cn/handle/173211/21055
Collection: 毕业生_博士学位论文 (Graduates / Doctoral Dissertations)
Affiliation: 中国科学院自动化研究所 (Institute of Automation, Chinese Academy of Sciences)
Recommended citation (GB/T 7714):
王燕娜. 自然场景文本检测与识别方法研究[D]. 北京. 中国科学院研究生院, 2018.
Files in this item:
Thesis_带签字页_王燕娜.pdf (11176 KB); dissertation; access: restricted, full text available on request; license: CC BY-NC-SA