Research on Text Detection Methods in Complex Scene Images (复杂场景图像中的文字检测方法研究)

	Research on Text Detection Methods in Complex Scene Images
	黄燃东
	2021-05-27
Pages	118
Degree type	Doctoral
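Chinese abstract (translated)	Scene text detection aims to precisely locate text regions in natural scene images and usually serves as the step preceding scene text recognition. It still faces many hard challenges: text varies widely in scale, orientation, shape, and aspect ratio, and image backgrounds can be extremely complex. Overcoming these challenges requires research on extracting robust text features and on designing concise, efficient detection frameworks. In recent years, convolutional neural networks have markedly improved the ability of scene text detection to cope with these challenges. This dissertation builds on convolutional neural networks, and its main contributions are as follows:
1. To address false positives in scene text detection, this dissertation proposes a hybrid text attention mechanism that focuses on both the features and the score map. False positives arise because the score map and its input features discriminate only weakly between text and background. The dissertation studies a feature-focused text attention mechanism, which fuses the attention map's input features with the score map's input features by multiplication and so makes the score map's input features more discriminative. It also studies a score-map-focused text attention mechanism, which fuses the attention map with the score map by multiplication and so makes the score map more discriminative. The proposed hybrid mechanism multiplies an exponential power of the attention map's input features with the score map's input features and also multiplies the attention map with the score map, improving the discriminability of both the input features and the score map at once. Experiments show that the proposed mechanism clearly suppresses false positives.
2. To address the imbalance of training samples in scene text detection, this dissertation proposes a class-balanced first power loss function, which resolves the resulting imbalance in detection accuracy. The dissertation studies a strong-background restrained cross entropy function, which reduces the loss weight of easy negative samples. It also studies a class-balanced self-adaption loss function, which reduces the loss weight of easy negatives while increasing the loss weight of positives, with emphasis on training hard positives. The proposed class-balanced first power loss gives positive and negative samples gradients of equal magnitude and opposite direction, overcoming the gradient imbalance of cross entropy, and sets the gradients of easy negatives to zero, addressing the sample imbalance. Because it accounts for both the loss weights and the gradients of positive and negative samples, the proposed function significantly strengthens a text detector's ability to discriminate text from background.
3. To address the complexity and low efficiency of arbitrary-shape text detection methods, this dissertation proposes a text detection method based on parallel regression and segmentation, which regresses the circumscribed horizontal rectangle of each text instance and segments arbitrary-shape text in parallel. The method comprises four modules: convolutional feature extraction and fusion, network outputs, post-processing, and a feature semantic enhancement mechanism. Convolutional feature extraction and fusion extracts and merges the image's convolutional features. The network outputs consist of a score map branch, a rectangle branch, and a text center-ness branch: the score map branch classifies and segments text in parallel; the rectangle branch regresses the circumscribed horizontal rectangles of text; and the text center-ness branch prevents incomplete text segmentation and makes the features more discriminative between text and background. Post-processing comprises two testing modes, locality-aware non-maximum suppression, and rectangle projection. The feature semantic enhancement mechanism further strengthens the features' ability to discriminate text from background. The method yields a more concise model for arbitrary-shape text detection and surpasses most text detection methods in both detection performance and speed.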
English abstract	Scene text detection aims to accurately detect text regions in natural scene images and usually serves as a preliminary step for scene text recognition. Scene text detection still faces many challenges, such as wide variation in text scale, orientation, shape, and aspect ratio, and highly complex image backgrounds. Overcoming these difficulties requires studying methods for extracting robust text features and for designing concise, efficient detection frameworks. In recent years, convolutional neural networks have effectively improved the ability of scene text detection to cope with these challenges. This dissertation builds on convolutional neural networks, and its main contributions are as follows:
1. To solve the problem of false positives in scene text detection, this dissertation proposes a Features and Score Map Focused Text Attention Hybrid Mechanism (FSFTAHM). False positives occur because the score map and its input features distinguish text from background only weakly. This dissertation investigates a Features Focused Text Attention Mechanism (FFTAM), which multiplies the attention map's input features with the score map's input features to make the latter better at distinguishing text from background. It also investigates a Score Map Focused Text Attention Mechanism (SFTAM), which multiplies the attention map with the score map to make the score map better at distinguishing text from background. The proposed FSFTAHM multiplies the exponential power of the attention map's input features with the score map's input features and multiplies the attention map with the score map, which simultaneously strengthens both the score map's input features and the score map itself. Experiments show that the proposed FSFTAHM clearly suppresses false positives.
2. To solve the imbalance of training samples in scene text detection, this dissertation proposes a Class-Balanced First Power Loss (CBFPL), which addresses the resulting imbalance in detection accuracy. Strong-Background Restrained Cross Entropy (SBRCE) is studied to down-weight the loss assigned to easy negatives. Class-Balanced Self Adaption Loss (CBSAL) is studied to down-weight easy negatives and up-weight positives while focusing training on hard positive samples. The proposed CBFPL provides equal but opposite gradients for positives and negatives to eliminate the gradient imbalance of cross entropy. CBFPL then discards easy negatives by setting their gradients to zero, which handles the imbalance of training samples. Because CBFPL considers both the loss weights and the gradients of positive and negative samples, it significantly enhances the ability of text detectors to distinguish text from background.
3. To solve the complexity and low efficiency of arbitrary-shape text detection methods, this dissertation proposes a text detection method based on parallel regression and segmentation, which regresses the circumscribed horizontal rectangles of text instances and segments arbitrary-shape text in parallel. The method consists of four modules: convolutional feature extraction and fusion, network outputs, post-processing, and a Feature Semantic Enhancement Mechanism (FSEM). Convolutional feature extraction and fusion extracts and merges convolutional features of the image. The network outputs comprise a score map branch, a rectangle branch, and a Text Center-ness (TC) branch. The score map branch classifies and segments text instances in parallel. The rectangle branch regresses the circumscribed horizontal rectangles of text instances. The TC branch prevents incomplete segmentation of text and enhances the ability of the features to distinguish text from background. Post-processing includes two testing modes, locality-aware non-maximum suppression, and rectangle projection. FSEM further enhances the ability of the features to distinguish text from background. The proposed method yields a more concise model for arbitrary-shape text detection and outperforms most text detection methods in both accuracy and speed.
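Illustrative note on contribution 2: the abstract states the gradient behaviour of CBFPL but not its formula. The sketch below is one possible reading, not the dissertation's definition. It assumes a loss that is linear ("first power") in the predicted text probability p, so positives get gradient -1, hard negatives get gradient +1, and easy negatives (p below a cutoff) get gradient 0. The function names, the cutoff `easy_neg_threshold`, and its default value are hypothetical. A cross entropy gradient is included only for comparison.

```python
from typing import List, Sequence, Tuple


def first_power_loss(
    probs: Sequence[float],
    labels: Sequence[int],
    easy_neg_threshold: float = 0.1,  # hypothetical cutoff for "easy" negatives
) -> Tuple[float, List[float]]:
    """Return (summed loss, per-sample gradient d loss / d p).

    Assumed form, not taken from the dissertation:
      positive (label 1):              loss = 1 - p, gradient = -1
      hard negative (p >= threshold):  loss = p,     gradient = +1
      easy negative (p < threshold):   loss = 0,     gradient =  0
    """
    total = 0.0
    grads: List[float] = []
    for p, y in zip(probs, labels):
        if y == 1:
            total += 1.0 - p
            grads.append(-1.0)
        elif p >= easy_neg_threshold:
            total += p
            grads.append(1.0)
        else:
            # Easy negative: dropped from the loss, so its gradient is zero.
            grads.append(0.0)
    return total, grads


def cross_entropy_grads(
    probs: Sequence[float], labels: Sequence[int]
) -> List[float]:
    """Gradient of binary cross entropy w.r.t. p, for comparison.

    Positive: d(-log p)/dp = -1/p; negative: d(-log(1 - p))/dp = 1/(1 - p).
    Requires 0 < p < 1.
    """
    return [
        -1.0 / p if y == 1 else 1.0 / (1.0 - p)
        for p, y in zip(probs, labels)
    ]


if __name__ == "__main__":
    probs = [0.9, 0.2, 0.05, 0.7]
    labels = [1, 1, 0, 0]
    print(first_power_loss(probs, labels))
    print(cross_entropy_grads(probs, labels))
```

Under this reading, every contributing sample has a gradient of magnitude 1 regardless of p, whereas the cross entropy gradient magnitude depends on p and differs between positives and negatives; that contrast is the "gradient imbalance" the abstract refers to.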
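Keywords	scene text detection, attention mechanism, training sample imbalance, parallel regression and segmentation, convolutional neural network
Language	Chinese
Document type	Thesis
Item identifier	http://ir.ia.ac.cn/handle/173211/44557
Collection	Graduates_Doctoral Dissertations
Recommended citation (GB/T 7714)	黄燃东. 复杂场景图像中的文字检测方法研究[D]. 中国科学院自动化研究所. 中国科学院大学,2021.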

Files in this item
File name/size	Document type	Version type	Access type	License
Thesis-最终版-黄燃东-上传至答辩 (21972KB)	Thesis		Restricted access	CC BY-NC-SA