Research on Text Image Generation and Transformation Methods Based on Adversarial Mechanisms
刘希岩
2021-12
Pages: 120
Degree type: Doctoral
Chinese Abstract

As one of the important carriers of cultural communication and inheritance, text images play an important role in our daily lives. With the continuous development of artificial intelligence, automatic text digitization, layout analysis, and artistic creation with the help of computers have attracted more and more interest. As an important research topic in computer vision, the generation and transformation of text images have long received much attention. In practice, both the global transformation of text images and the style transfer of individual characters are in wide demand: for example, geometric rectification of distorted text images benefits subsequent text recognition and understanding, and the analysis and generation of characters facilitates artistic creation. Although related research has made progress, the generation and transformation of text images still face many challenges. First, the diverse content and complex geometric deformations of text images in natural scenes severely limit the quality of document image rectification. In addition, the variability of font styles and the topological differences between fonts make character-level generation tasks very difficult. To address these challenges, this thesis takes generative adversarial networks as its technical foundation, considers text image content at multiple granularities, namely document images, single characters, and character sequences, and studies the concrete tasks of document image rectification, single-character font generation, and character-sequence generation. The main contributions of this thesis are the following three.

For document image rectification, this thesis reformulates the problem as a dense grid prediction task and trains the model by regression so that it directly outputs the rectification grid in an end-to-end manner. Specifically, a pyramid encoder-decoder architecture is proposed to predict multi-scale grids in a coarse-to-fine fashion. Observing that structural cues in document images, such as text lines, text blocks, and table rules, which are crucial for rectification, are non-uniformly distributed in the image, three types of gated modules are proposed to guide the model to focus on this informative content and ignore distracting elements (e.g., large blank areas and complex backgrounds). To produce visually more pleasing results, an adversarial training mechanism is adopted to implicitly constrain the grid estimation. The proposed model can rectify document images with various forms of distortion and is robust to complex page layouts and cluttered backgrounds. Experiments on a public dataset and a synthetic dataset show that the proposed method achieves state-of-the-art performance in terms of OCR accuracy and several commonly used image quality metrics.

For single-character font generation, this thesis fully considers the style consistency and content accuracy of generated character images and proposes a font generation method based on disentangled representations. Building on this disentanglement mechanism, a generative model, FontGAN, is proposed, which integrates font stylization, de-stylization, and multi-font transfer into a unified framework. Specifically, character images are decoupled into a style representation and a content representation, which offers fine-grained control over these two types of variables and thus improves the quality of the generated results. To capture style information effectively, a style consistency module (SCM) is introduced. Technically, the SCM exploits a category-guided Kullback-Leibler divergence to explicitly model the style representations as different prior distributions; in this way, the proposed model can translate between multiple domains within a single framework. In addition, a content prior module (CPM) is proposed to provide a content prior that guides the content encoding process and alleviates missing strokes during font de-stylization. Benefiting from the idea of decoupling and regrouping, FontGAN suffices for many-to-many transfer of glyph structures. Experimental results show that the proposed FontGAN achieves state-of-the-art performance in character font generation.

For character-sequence generation, this thesis proposes an effective generative model, HTG-GAN, which generates handwritten text images directly from a latent prior. Unlike single-character font synthesis, the proposed model can generate character sequences of arbitrary length and pays more attention to the structural relationships between adjacent characters. Specifically, the inter-character structural relationship is modeled as a style representation, which avoids explicitly modeling the stroke layout. Technically, a text image is first disentangled into a style representation and a content representation, where the style representation is mapped to a Gaussian distribution and the content representation is encoded directly by character indices. In this way, the proposed model can generate new images with specified text content and diverse styles; using them for data augmentation improves handwritten text recognition (HTR) performance. Experimental results show that the proposed method outperforms other approaches in handwritten text generation.

English Abstract

As one of the important carriers of cultural communication and inheritance, text images play an important role in our daily lives. With the development of artificial intelligence, automatic text digitization, layout analysis, and artistic creation with the help of computers have aroused more and more interest. As an important topic in computer vision, text image generation and transformation have attracted much attention. In practice, both the geometric transformation of document images and the style transfer of characters are in wide demand: geometric rectification of distorted text images benefits subsequent text recognition and understanding, and the generation of characters boosts artistic creation. Although progress has been made, the generation and transformation of text images are still confronted with many challenges. First of all, the diverse content and complicated deformations of text images greatly limit the quality of rectification. In addition, the variability of font styles and the topological differences between fonts make character-level generation tasks very difficult. To handle these challenges, we consider multi-grained content, i.e., document images, single characters, and character sequences, and study the corresponding text image generation and transformation tasks: document image rectification, character glyph synthesis, and handwritten text generation. The main contributions of this thesis are summarized as follows.

For document image rectification, we recast the task as a dense grid prediction problem and train the model with a regression strategy, so that it outputs the unwarping grid in an end-to-end manner. Specifically, we develop a pyramid encoder-decoder architecture to predict the unwarping grid at multiple resolutions in a coarse-to-fine fashion. Based on the observation that structural visual cues, e.g., text lines, text blocks, and lines in tables, which are critical for estimating the unwarping mapping, are non-uniformly distributed in the images, three gated modules are introduced to guide the network to focus on these informative cues rather than on interferences such as blank areas and complex backgrounds. To generate more visually pleasing rectification results, we further adopt an adversarial training mechanism to implicitly constrain the unwarping grid estimation. Our model can rectify arbitrarily distorted document images with complicated page layouts and cluttered backgrounds. Experiments on a public benchmark dataset and a synthetic dataset demonstrate that our approach outperforms state-of-the-art methods in terms of OCR accuracy and several widely used quantitative evaluation metrics.
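The grid-prediction pipeline can be pictured with a short PyTorch-style sketch (an illustrative sketch under our own assumptions; module names such as GatedBlock and PyramidGridDecoder are hypothetical, not the thesis code): a gated pyramid decoder emits 2-channel sampling grids at increasing resolutions, and the finest grid resamples the distorted image.

```python
# Illustrative sketch only: coarse-to-fine unwarping-grid prediction with
# feature gating. Names, channel sizes, and depths are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class GatedBlock(nn.Module):
    """Learn a spatial mask that re-weights features, emphasizing structural
    cues (text lines, table rules) over blank areas and backgrounds."""
    def __init__(self, ch):
        super().__init__()
        self.gate = nn.Sequential(nn.Conv2d(ch, ch, 3, padding=1), nn.Sigmoid())

    def forward(self, x):
        return x * self.gate(x)

class PyramidGridDecoder(nn.Module):
    """Decode encoder features into backward-mapping grids at several
    resolutions, coarse to fine."""
    def __init__(self, ch=64, levels=3):
        super().__init__()
        self.stages = nn.ModuleList(
            nn.Sequential(nn.ConvTranspose2d(ch, ch, 4, 2, 1),
                          nn.ReLU(inplace=True), GatedBlock(ch))
            for _ in range(levels))
        self.heads = nn.ModuleList(nn.Conv2d(ch, 2, 3, padding=1)
                                   for _ in range(levels))

    def forward(self, feat):
        grids = []
        for stage, head in zip(self.stages, self.heads):
            feat = stage(feat)
            grids.append(torch.tanh(head(feat)))  # sampling coords in [-1, 1]
        return grids  # multi-scale grids, supervised by regression

def unwarp(img, grid):
    """Resample the distorted image with the finest predicted grid."""
    # grid: (B, 2, H, W) -> grid_sample expects (B, H, W, 2)
    return F.grid_sample(img, grid.permute(0, 2, 3, 1), align_corners=True)
```

In such a setup the output of unwarp would also be fed to a discriminator during adversarial training, which implicitly regularizes the grid estimate beyond the pixel-wise regression loss.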

To generate character glyph images with high quality, we take into account both style consistency and content accuracy, and introduce a disentangled representation framework. Based on the proposed decoupling mechanism, we propose a novel generative model named FontGAN, which integrates font stylization, de-stylization, and multi-font transfer into a unified framework. Specifically, we decouple character images into a style representation and a content representation, which offers fine-grained control of these two types of variables and thus improves the quality of the generated results. To effectively capture the style information, a style consistency module (SCM) is introduced. Technically, SCM exploits a category-guided Kullback-Leibler divergence to explicitly model the style representations with different prior distributions. In this way, our model is capable of implementing transformations between multiple domains in one framework. In addition, we propose a content prior module (CPM) to provide a content prior for the model, guiding the content encoding process and alleviating the problem of stroke deficiency during de-stylization. Benefiting from the idea of decoupling and regrouping, our FontGAN suffices to achieve many-to-many translation tasks for glyphs. Experimental results demonstrate that the proposed FontGAN achieves state-of-the-art performance in character glyph synthesis.
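The category-guided KL idea admits a compact sketch (our reading of the abstract; the encoder layout, dimensions, and the unit-variance font-specific priors are assumptions): the style posterior of each image is pulled toward a learnable Gaussian prior belonging to its font category, so different fonts occupy separate regions of the style space.

```python
# Illustrative sketch only: a variational style encoder whose posterior is
# regularized toward a per-font Gaussian prior (category-guided KL).
import torch
import torch.nn as nn

class StyleEncoder(nn.Module):
    def __init__(self, dim=128, n_fonts=10):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(1, 64, 4, 2, 1), nn.ReLU(),
            nn.Conv2d(64, 128, 4, 2, 1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten())
        self.mu = nn.Linear(128, dim)
        self.logvar = nn.Linear(128, dim)
        # One learnable prior mean per font category; a single shared prior
        # would collapse all styles onto one distribution.
        self.prior_mu = nn.Embedding(n_fonts, dim)

    def forward(self, x, font_id):
        h = self.backbone(x)
        mu, logvar = self.mu(h), self.logvar(h)
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()  # reparameterize
        # Closed-form KL( N(mu, diag(sigma^2)) || N(prior_mu[font_id], I) )
        pm = self.prior_mu(font_id)
        kl = 0.5 * (logvar.exp() + (mu - pm) ** 2 - 1.0 - logvar).sum(dim=1)
        return z, kl.mean()
```

Pairing a style code sampled near one font's prior with the content code of another character is what makes stylization, de-stylization, and multi-font transfer expressible in the same framework.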

For the sequence-character generation problem, we propose an effective generative model called HTG-GAN to synthesize handwritten text images from a latent prior. Unlike single-character synthesis, our method can generate images of character sequences of arbitrary length and pays more attention to the structural relationship between characters. Specifically, we model this structural relationship as the style representation to avoid explicitly modeling the stroke layout. Technically, the text image is disentangled into a style representation and a content representation, where the style representation is mapped to a Gaussian distribution and the content representation is embedded using character indices. In this way, our model can generate new handwritten text images with specified contents and various styles to perform data augmentation, thereby boosting handwritten text recognition (HTR). Experimental results show that our method achieves state-of-the-art performance in handwritten text generation.
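The sampling interface can be sketched as follows (a hypothetical PyTorch layout, not the thesis architecture): one Gaussian style vector is shared across all character positions, the content is just a sequence of character indices, and the image width grows with the sequence length, so the same text rendered with different style samples yields diverse handwriting.

```python
# Illustrative sketch only: style ~ N(0, I) shared across positions,
# content given as character indices, output width proportional to length.
import torch
import torch.nn as nn

class TextImageGenerator(nn.Module):
    def __init__(self, n_chars=80, emb=64, style_dim=128):
        super().__init__()
        self.embed = nn.Embedding(n_chars, emb)            # content: indices
        self.fuse = nn.Linear(emb + style_dim, 8 * 8 * 8)  # per-char 8x8x8 block
        self.decode = nn.Sequential(                       # upsample to a strip
            nn.ConvTranspose2d(8, 32, 4, 2, 1), nn.ReLU(),
            nn.ConvTranspose2d(32, 1, 4, 2, 1), nn.Tanh())

    def forward(self, char_ids, style):
        B, L = char_ids.shape
        c = self.embed(char_ids)                   # (B, L, emb)
        s = style.unsqueeze(1).expand(-1, L, -1)   # same style at every position
        f = self.fuse(torch.cat([c, s], dim=-1))   # (B, L, 512)
        f = f.view(B, L, 8, 8, 8).permute(0, 2, 3, 1, 4).reshape(B, 8, 8, L * 8)
        return self.decode(f)                      # (B, 1, 32, 32*L) text strip

g = TextImageGenerator()
style = torch.randn(1, 128)                        # draw a handwriting style
img = g(torch.randint(0, 80, (1, 7)), style)       # any 7-character content
```

For HTR data augmentation, one would sample many style vectors per training transcription and add the synthesized strips to the real training set.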

Keywords: text image geometric transformation; text image generation; adversarial learning; generative model; disentangled representation
Language: Chinese
Sub-direction classification: Text Recognition and Document Analysis
Document type: Thesis
Identifier: http://ir.ia.ac.cn/handle/173211/46644
Collection: State Key Laboratory of Multimodal Artificial Intelligence Systems, Advanced Spatio-temporal Data Analysis and Learning
Corresponding author: 刘希岩
Recommended citation
GB/T 7714
刘希岩. 基于对抗机制的文本图像生成与变换方法研究[D]. 北京: 中国科学院自动化研究所, 2021.
Files in this item
File name/size: 刘希岩-基于对抗机制的文本图像生成与变换 (21375 KB) | Document type: Thesis | Access: Open Access | License: CC BY-NC-SA