Research on Robust Natural Scene Text Detection and Recognition


	Research on Robust Natural Scene Text Detection and Recognition
	李小倩
	2021-05-26
Pages	140
Degree type	Doctoral
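Chinese abstract (English translation)	Text in natural scene images usually carries explicit, targeted high-level semantic information that helps people quickly understand image content, so research on natural scene text detection and recognition has great theoretical significance and broad application prospects. Deep-learning-based text detection and recognition has achieved good performance. As research has deepened, the targets of detection and recognition have gradually shifted from horizontal or multi-oriented text to arbitrarily shaped text, and from standalone text detection or text recognition to end-to-end text recognition. However, text detection and recognition in natural scene images still faces many challenges, such as complex background interference, blurred and degraded text, and font diversity, and there is considerable room for improvement. The key questions are how to achieve robust detection and recognition of arbitrarily shaped text and how to simplify the end-to-end text extraction framework. This dissertation centers on scene text detection, scene text recognition, and end-to-end scene text extraction. It focuses on feature representations for multi-oriented and arbitrarily shaped text and on feasible, effective training strategies, in order to build robust natural scene text detection and recognition models. The main work and contributions are summarized as follows:
1. A multi-oriented scene text detection method based on aggregated text features. Multi-oriented scene text detection methods are usually built on generic object detection: they treat text as a specific object class and, based on the characteristics of multi-oriented text, design anchor boxes of different scales, aspect ratios, and inclination angles. Anchor design relies on experience, and anchor strategies involve redundant computation and mismatch problems. To address this, the dissertation proposes a multi-oriented scene text detection method based on aggregated text features. The method abandons anchor design and uses pixels as reference points to generate positive and negative sample labels and the values related to text box coordinate offsets. To improve long-text detection, it combines a self-attention mechanism with dilated convolution in a text feature aggregation module that learns long-range features and fuses convolutional features from different receptive fields. The model is a fully convolutional neural network: stacked convolutional layers first extract features, the text feature aggregation module then produces enhanced text features, and the prediction layer finally performs category prediction and coordinate regression. The method removes the tedious anchor design and achieves multi-scale, multi-oriented scene text detection. Extensive experiments on several public datasets demonstrate its effectiveness.
2. An arbitrary-shaped scene text detection method based on adaptive regression. Mainstream arbitrary-shaped scene text detection methods are usually based on image segmentation. Although they are fairly interpretable, their pipelines are relatively cumbersome. In addition, the ambiguity of text box annotation limits the regression of arbitrarily shaped text boxes. To address these problems, the dissertation proposes an arbitrary-shaped text detection method based on adaptive regression. The method uses an adaptive regression loss function so that the model can directly predict the values related to text box coordinates, overcoming the annotation ambiguity problem. It also proposes a text instance accuracy loss function that, guided by intersection over union (IoU), further refines the text box coordinates and predicts more precise text boxes. The method is simple and effective, keeps good inference speed, and achieves comparable or better performance on public datasets.
3. A scene text recognition method based on an attention mechanism. Deep-learning-based scene text recognition usually requires large amounts of training data. Because manual annotation is too costly, existing methods typically use large-scale synthetic text datasets for training and real text datasets for testing. There is a gap between the synthetic training data and the real test data: in the test sets, text styles are more varied, curvature changes more, and backgrounds are more complex. To address these problems, the dissertation builds on the mainstream sequence-to-sequence recognition model and, working at the data level and the feature level, proposes a more robust and effective attention-based scene text recognition method. At the data level, the method applies an S-shaped deformation to the training data to enrich its text curvature variation and improve generalization. At the feature level, it combines instance normalization and batch normalization, applying an instance-batch normalization module to learn style-invariant features and improve recognition accuracy. The method achieves comparable or better recognition accuracy on public regular-text and irregular-text datasets.
4. An end-to-end scene text extraction method based on weakly supervised learning. Most existing scene text extraction methods are two-stage: they train the detection and recognition tasks independently and then extract text in a cascade, ignoring the highly correlated and complementary relationship between detection and recognition. How to effectively integrate text detection and text recognition in an end-to-end text extraction model also deserves attention. The dissertation therefore proposes an end-to-end scene text extraction method based on weakly supervised learning. The model comprises a shared feature module, a feature mapping module, a detection branch, and a recognition branch. The detection and recognition branches share convolutional features and are connected through the feature mapping module, making full use of the strong correlation between detection and recognition. The detection branch obtains pseudo-labeled data through weakly supervised learning and uses it for parameter learning, which reduces the model's dependence on text box annotations in real datasets. The effectiveness of the method is verified on several public datasets.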
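The anchor-free labeling idea in contribution 1 (pixels as reference points, each carrying a positive or negative label plus box offsets) can be illustrated with a minimal Python sketch. It assumes axis-aligned boxes and a distance-to-four-edges encoding, both simplifications chosen here for illustration; the dissertation targets multi-oriented text, and its exact offset encoding is not given in this abstract. The names `pixel_targets` and `decode_box` are hypothetical.

```python
from typing import List, Optional, Tuple

Box = Tuple[int, int, int, int]      # (x_min, y_min, x_max, y_max), inclusive pixel coordinates
Offsets = Tuple[int, int, int, int]  # distances (left, top, right, bottom) from a pixel to the box edges


def pixel_targets(height: int, width: int, boxes: List[Box]):
    """Build per-pixel training targets without any anchor boxes.

    Every pixel inside a text box is a positive sample (label 1) and stores its
    distances to the four box edges; all other pixels are negative (label 0).
    Assumption: axis-aligned boxes; where boxes overlap, later ones overwrite earlier ones.
    """
    labels = [[0] * width for _ in range(height)]
    offsets: List[List[Optional[Offsets]]] = [[None] * width for _ in range(height)]
    for x0, y0, x1, y1 in boxes:
        # Clip the iteration range to the image; offsets still refer to the full box.
        for y in range(max(y0, 0), min(y1, height - 1) + 1):
            for x in range(max(x0, 0), min(x1, width - 1) + 1):
                labels[y][x] = 1
                offsets[y][x] = (x - x0, y - y0, x1 - x, y1 - y)
    return labels, offsets


def decode_box(x: int, y: int, off: Offsets) -> Box:
    """Recover a box from a positive pixel location and its four edge distances."""
    left, top, right, bottom = off
    return (x - left, y - top, x + right, y + bottom)


labels, offsets = pixel_targets(4, 6, [(1, 1, 4, 2)])
```

In this encoding every positive pixel can regress its own box directly, which is why no anchor scales, aspect ratios, or angles need to be designed by hand.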
English abstract	Text in natural scene images usually carries clear, targeted high-level semantic information that helps people quickly understand image content, so research on natural scene text detection and recognition has great theoretical significance and broad application prospects. Deep-learning-based text detection and recognition has achieved good performance. As research has progressed, the focus has gradually shifted from horizontal or multi-oriented text to arbitrarily shaped text, and from standalone text detection or text recognition to end-to-end text spotting. However, text detection and recognition in natural scene images still faces many challenges, such as complex background interference, text blur and degradation, and font diversity. There is much room for improvement; the key issues are how to achieve robust detection and recognition of arbitrarily shaped text and how to simplify the end-to-end text spotting framework. This dissertation centers on scene text detection, scene text recognition, and end-to-end text spotting. It focuses on feature representations for multi-oriented and arbitrarily shaped text and on feasible, effective training strategies, with the goal of robust natural scene text detection and recognition. The main contributions of this dissertation are summarized as follows:
1. A multi-oriented scene text detection method based on aggregated text features. Multi-oriented scene text detection methods are usually built on generic object detection and treat text as a specific object class. To match the characteristics of multi-oriented text, they must design anchor boxes with different scales, aspect ratios, and inclination angles. Anchor design depends on experience, and the anchor strategy suffers from redundant computation and mismatch problems.
To solve these issues, this dissertation proposes a multi-oriented text detection method based on aggregated text features. The method abandons the anchor strategy and uses pixels as reference points to assign positive and negative samples and to generate the coordinate-offset targets for each sample. To improve long-text detection, a text feature aggregation module is proposed that combines self-attention with dilated convolution to enlarge the model's receptive field and capture long-range text features. The model is a fully convolutional neural network: features are first extracted by stacked convolutional layers, then enhanced by the text feature aggregation module, and finally passed to the prediction layers for category prediction and coordinate regression. The method avoids the tedious design of anchor boxes and achieves multi-scale, multi-oriented scene text detection. Extensive experiments on several public datasets demonstrate its effectiveness.
2. An arbitrary-shaped text detection method based on adaptive regression. Mainstream detection methods for arbitrarily shaped text are usually based on image segmentation. Although they are fairly interpretable, their pipelines are relatively cumbersome. In addition, the ambiguity of text box annotation limits the regression of arbitrarily shaped text boxes. To address these issues, an arbitrary-shaped text detection method based on adaptive regression is proposed. The method uses an adaptive regression loss so that the model can directly predict the coordinate-related values of text bounding boxes, overcoming the annotation ambiguity problem. In addition, a text instance accuracy loss guided by intersection over union (IoU) is proposed, which further improves performance without increasing the network's computation.
The method is simple and effective, maintains good inference speed, and achieves comparable or better performance on public datasets.
3. A scene text recognition method based on attention. Deep-learning-based scene text recognition usually requires a large amount of training data. Because manual annotation is costly, most existing methods train on synthetic text datasets and test on real text datasets. However, the distributions of the synthetic and real datasets differ: in the test data, text styles are more varied, curvature changes more, and backgrounds are more complex. To address these problems, a more robust and effective attention-based scene text recognition method is proposed that builds on the mainstream sequence-to-sequence recognition model and works at both the data level and the feature level. At the data level, an S-shape transformation is applied to the training data, enriching its text curvature variation to improve generalization. At the feature level, the method combines instance normalization and batch normalization in an instance-batch normalization (IBN) module to learn style-invariant features and improve generalization. The method achieves comparable or better recognition accuracy on public regular-text and irregular-text datasets.
4. An end-to-end scene text spotting method based on weakly supervised learning. Most existing scene text spotting methods are two-stage: they train the detection and recognition tasks independently and then extract text in a cascade, ignoring the highly correlated and complementary relationship between detection and recognition. How to effectively integrate text detection and text recognition in an end-to-end text spotting model also deserves attention.
Based on the above, this dissertation proposes an end-to-end scene text spotting method based on weakly supervised learning. The model comprises a shared feature module, a feature mapping module, a detection branch, and a recognition branch. The detection and recognition branches share convolutional features and are connected through the feature mapping module, which makes full use of the strong correlation between detection and recognition. The detection branch obtains pseudo-label annotations through weakly supervised learning and uses them for parameter learning, which reduces the model's dependence on text bounding box annotations in real datasets. The effectiveness of the method has been verified on several public datasets.
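Contribution 2 uses intersection over union (IoU) to guide box refinement. As a hedged illustration only, the Python sketch below computes IoU for two axis-aligned boxes and derives a simple `1 - IoU` loss from it. The axis-aligned box format and the `iou_loss` function are assumptions made here; the abstract does not specify the form of the text instance accuracy loss, and this stand-in is not the dissertation's formulation.

```python
from typing import Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes (0.0 if they do not overlap)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def iou_loss(pred: Box, target: Box) -> float:
    """Illustrative stand-in loss: 1 - IoU, so a perfect prediction gives 0."""
    return 1.0 - iou(pred, target)
```

A loss of this kind scores the predicted box as a whole instead of penalizing each coordinate independently, which is one common reason to bring IoU into box regression.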
Keywords	natural scene; text detection; text recognition
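Language	Chinese
Seven major directions: sub-direction classification	Text recognition and document analysis
Document type	Dissertation
Item identifier	http://ir.ia.ac.cn/handle/173211/45005
Collection	数字内容技术与服务研究中心_版权智能与文化计算 (Digital Content Technology and Service Research Center, Copyright Intelligence and Cultural Computing)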
Recommended citation (GB/T 7714)	李小倩. 鲁棒的自然场景文本检测与识别技术研究[D]. 中国科学院自动化研究所. 中国科学院自动化研究所,2021.

Files in this item
File name/size	Document type	Version type	Access type	License
鲁棒的自然场景文本检测与识别技术研究.p（23215KB）	Dissertation		Open access	CC BY-NC-SA