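Thesis Advisor: 潘春洪; 孟高峰
Degree Grantor: Institute of Automation, Chinese Academy of Sciences (中国科学院自动化研究所)
Place of Conferral: Conference Room 1, 13th Floor, Automation Building
Degree Discipline: Pattern Recognition and Intelligent Systems
Keyword: text image geometric transformation; text image generation; adversarial learning; generative models; disentangled representation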



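For single-character font generation, this thesis takes both style consistency and content accuracy into account and proposes a font generation method based on disentangled representation. Building on this disentangling mechanism, a generative model named FontGAN is proposed, which brings font stylization, de-stylization, and multi-font transfer into one unified framework. Specifically, character images are decoupled into a style representation and a content representation, which gives fine-grained control over both types of variables and thereby improves the quality of the generated results. To capture style information effectively, a style consistency module (SCM) is introduced. Technically, SCM uses a category-guided Kullback-Leibler divergence to explicitly model the style representation as different prior distributions. In this way, the proposed model can transform between multiple domains within a single framework. In addition, a content prior module (CPM) is proposed to supply a content prior that guides the content encoding process and alleviates missing strokes during font de-stylization. Thanks to the idea of decoupling and recombination, FontGAN can perform many-to-many transfer of glyph structures. Experimental results show that the proposed FontGAN achieves state-of-the-art performance in character font generation.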


Other Abstract

As an important carrier of cultural communication and inheritance, text images play a significant role in daily life. With the development of artificial intelligence, computer-aided text digitization, layout analysis, and artistic creation have attracted growing interest, and text image generation and transformation have become an active topic in computer vision. Both the geometric transformation of document images and the style transfer of characters are needed in practice: for example, geometric rectification of distorted text images benefits subsequent text recognition and understanding, and character generation supports artistic creation. Despite recent progress, the generation and transformation of text images still face many challenges. First, the diverse content and complicated deformations of text images greatly limit rectification quality. Second, the variability of font styles and the topological differences between fonts make character-level generation difficult. To address these challenges, we consider content at multiple granularities, i.e., document images, single characters, and character sequences, and tackle the corresponding generation and transformation tasks: document image rectification, character glyph synthesis, and handwritten text generation. The main contributions of the thesis are summarized as follows.

For document image rectification, we recast the task as a dense grid prediction problem and train the model with a regression objective, so that it outputs the unwarping grid end to end. Specifically, we develop a pyramid encoder-decoder architecture that predicts the unwarping grid at multiple resolutions in a coarse-to-fine fashion. Structural visual cues such as text lines, text blocks, and table lines are critical for estimating the unwarping mapping, yet they are non-uniformly distributed in the image. We therefore introduce three gated modules that guide the network to focus on these informative cues rather than on distractions such as blank areas and complex backgrounds. To produce more visually pleasing rectification results, we further adopt adversarial training to implicitly constrain the unwarping grid estimation. Our model can rectify arbitrarily distorted document images with complicated page layouts and cluttered backgrounds. Experiments on a public benchmark dataset and a synthetic dataset demonstrate that our approach outperforms state-of-the-art methods in terms of OCR accuracy and several widely used quantitative evaluation metrics.
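As a rough illustration of the dense grid formulation, the sketch below applies an unwarping grid to a distorted grayscale image with bilinear sampling. It assumes a backward-mapping convention (each output pixel stores the source coordinate to sample from), which the abstract does not specify. The names `bilinear_sample`, `unwarp_with_grid`, and `identity_grid` are hypothetical, and plain Python lists stand in for the tensors a real model would use; this is not the thesis's implementation.

```python
import math

def bilinear_sample(image, x, y):
    """Sample a 2-D grayscale image (list of rows) at real-valued (x, y)."""
    h, w = len(image), len(image[0])
    # Clamp to the image border so out-of-range grid points stay valid.
    x = min(max(x, 0.0), w - 1.0)
    y = min(max(y, 0.0), h - 1.0)
    x0, y0 = int(math.floor(x)), int(math.floor(y))
    x1, y1 = min(x0 + 1, w - 1), min(y0 + 1, h - 1)
    wx, wy = x - x0, y - y0
    # Interpolate along x on the two neighbouring rows, then along y.
    top = (1 - wx) * image[y0][x0] + wx * image[y0][x1]
    bottom = (1 - wx) * image[y1][x0] + wx * image[y1][x1]
    return (1 - wy) * top + wy * bottom

def unwarp_with_grid(image, grid):
    """Resample `image`; grid[i][j] = (x, y) source point for output pixel (i, j).

    The backward-mapping layout is an assumption made for this sketch.
    """
    return [[bilinear_sample(image, x, y) for (x, y) in row] for row in grid]

def identity_grid(h, w):
    """Grid that maps every output pixel to itself (no deformation)."""
    return [[(float(j), float(i)) for j in range(w)] for i in range(h)]
```

With an identity grid the output equals the input; a rectification network would instead regress a grid whose entries point back into the distorted page.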

To generate high-quality character glyph images, we take both style consistency and content accuracy into account and introduce a disentangled representation framework. Based on this decoupling mechanism, we propose a novel generative model named FontGAN, which integrates font stylization, de-stylization, and multi-font transfer into a unified framework. Specifically, we decouple character images into a style representation and a content representation, which offers fine-grained control over these two types of variables and thus improves the quality of the generated results. To capture style information effectively, a style consistency module (SCM) is introduced. Technically, SCM exploits a category-guided Kullback-Leibler divergence to explicitly model the style representation as different prior distributions. In this way, our model can transform between multiple domains within one framework. In addition, we propose a content prior module (CPM) that provides a content prior to guide the content encoding process and alleviate stroke deficiency during structure de-stylization. Benefiting from the idea of decoupling and regrouping, FontGAN can perform many-to-many glyph translation. Experimental results demonstrate that the proposed FontGAN achieves state-of-the-art performance in character glyph synthesis.
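One common way to realize a category-guided KL term can be written in closed form. The expression below is a sketch under assumptions the abstract does not state: the style encoder outputs a diagonal Gaussian with mean $\mu$ and variance $\sigma^2$ over $D$ dimensions, and each font category $c$ has its own unit-variance Gaussian prior with mean $\mu_c$. All symbols are illustrative rather than taken from the thesis.

```latex
\mathcal{L}_{\mathrm{KL}}
  = D_{\mathrm{KL}}\!\left(\mathcal{N}\big(\mu, \operatorname{diag}(\sigma^2)\big) \,\big\|\, \mathcal{N}(\mu_c, I)\right)
  = \frac{1}{2}\sum_{d=1}^{D}\left(\sigma_d^2 + (\mu_d - \mu_{c,d})^2 - 1 - \log \sigma_d^2\right)
```

Under these assumptions, minimizing the term pulls the style code of every glyph in category $c$ toward that category's prior, so that distinct fonts occupy distinct regions of the style space.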

For character-sequence generation, we propose an effective generative model called HTG-GAN that synthesizes handwritten text images from a latent prior. Unlike single-character synthesis, our method can generate images of character sequences of arbitrary length and pays more attention to the structural relationship between characters. Specifically, we model this structural relationship as part of the style representation, which avoids explicitly modeling the stroke layout. Technically, the text image is disentangled into a style representation and a content representation: the style representation is mapped to a Gaussian distribution, and the content representation is embedded using character indices. In this way, our model can generate new handwritten text images with specified content and varied styles for data augmentation, thereby boosting handwritten text recognition (HTR). Experimental results show that our method achieves state-of-the-art performance in handwritten text generation.
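The input side of such a generator can be sketched as follows: a style vector is drawn from a standard Gaussian prior, and the target text of any length is encoded as a list of character indices. The function names, the alphabet, and the style dimension are hypothetical choices for illustration; the abstract does not describe HTG-GAN's actual interfaces.

```python
import random

def sample_style(dim, rng):
    """Draw a style vector from a standard Gaussian prior (assumed N(0, I))."""
    return [rng.gauss(0.0, 1.0) for _ in range(dim)]

def encode_content(text, alphabet):
    """Map each character to its index in `alphabet`; any length is allowed."""
    index = {ch: i for i, ch in enumerate(alphabet)}
    return [index[ch] for ch in text]

def make_generator_inputs(text, alphabet, style_dim, seed):
    """Pair one sampled style with the index-encoded content of `text`."""
    rng = random.Random(seed)
    return {
        "style": sample_style(style_dim, rng),
        "content": encode_content(text, alphabet),
    }
```

Keeping the content fixed while resampling the style would yield the same text in different handwriting, which is the property the abstract relies on for data augmentation.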

Document Type: Thesis
Corresponding Author: 刘希岩
Recommended Citation
GB/T 7714
刘希岩. 基于对抗机制的文本图像生成与变换方法研究[D]. 自动化大厦13层第一会议室. 中国科学院自动化研究所,2021.
Files in This Item:
File Name/Size | DocType | Version | Access | License
刘希岩-基于对抗机制的文本图像生成与变换 (21375KB) | Thesis | | Open Access | CC BY-NC-SA

Items in the repository are protected by copyright, with all rights reserved, unless otherwise indicated.