非受限场景下文本到图像的生成方法研究 (Research on Text-to-Image Generation Methods in Unrestricted Scenarios)
孙建新
2024-05
Pages: 99
Subtype: 博士 (Doctoral)
Abstract

In artificial intelligence research, text and images play indispensable, central roles: they are the two most fundamental and widely used carriers of information, and they correspond to two basic modes of human cognition, the logical abstraction of language and the intuitive imagery of vision. With the rapid advance of AI technology, the interaction and fusion of these two modes of expression has become increasingly practical, a trend that is especially prominent in text-to-image generation. The evolution of this technology not only bridges the gap between language and vision but also opens up opportunities for innovation in many application areas, including design assistance, education, entertainment, and artistic creation.

Despite significant progress in natural language processing and computer vision, existing text-to-image generation methods rely heavily on predefined template-style inputs and therefore lack the flexibility and adaptability needed in practical applications. This thesis focuses on text-to-image generation in unrestricted scenarios, aiming to remove restrictions on the number and length of input texts, on their content and form, and on the generation scenario. The study starts from a relatively mature sub-area of image generation, facial images, and investigates face generation from multiple text inputs; it then moves to a more diverse and flexible input setting and pursues high-fidelity, high-resolution free-form text-to-face generation. Building on these results and on state-of-the-art diffusion models, the thesis further broadens its scope to the open-environment text-to-image generation problem in larger and more complex scenes. The work follows a research path from the specific to the general and from the simple to the deep, systematically examining the different settings of text-to-image generation and their challenges.

The main contributions and innovations of the thesis are summarized as follows:

1. Multi-text-to-face generation based on semantic embedding. In text-to-image generation research built on generative adversarial networks, much prior work has focused on structurally simple objects such as plants and birds. For the more complex task of face image generation, especially when multiple text descriptions must be combined, research remains scarce, mainly because of the lack of efficient algorithmic frameworks and of large-scale, detail-rich datasets. To address this, the thesis proposes a novel semantic embedding and attention network designed specifically for multi-text input, aiming to generate face images that precisely match the descriptions. The proposed semantic feature injection module gives the model the ability to integrate information from multiple texts, and the introduced multi-text attention mechanism efficiently fuses word-level features from different text sources, improving the faithful reconstruction of facial details. To better build facial attributes that agree with the descriptions, an attribute loss is designed to guide the generation process. In addition, given the limited scale and description accuracy of existing datasets, the thesis builds the first large-scale, manually annotated face description dataset, providing ten detailed textual descriptions for each face image. Extensive experiments confirm the effectiveness and accuracy of the method for generating face images from multiple text descriptions.
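
The following PyTorch sketch illustrates the general idea of fusing word-level features from several captions through cross-attention. It is a minimal sketch under assumed tensor shapes; the class name `MultiTextAttention`, the residual refinement, and the dimensions are illustrative assumptions, not the network described in the thesis.

```python
# Minimal sketch: word features from several captions are concatenated and
# fused into image region features via cross-attention (shapes are assumed).
import torch
import torch.nn as nn

class MultiTextAttention(nn.Module):
    """Cross-attention from image regions to words pooled over N captions."""
    def __init__(self, img_dim: int, word_dim: int, hidden: int = 256):
        super().__init__()
        self.q = nn.Linear(img_dim, hidden)   # image regions -> queries
        self.k = nn.Linear(word_dim, hidden)  # words -> keys
        self.v = nn.Linear(word_dim, hidden)  # words -> values
        self.out = nn.Linear(hidden, img_dim)

    def forward(self, img_feats, word_feats_list):
        # img_feats: (B, R, img_dim) region features of the current face image
        # word_feats_list: list of N tensors, each (B, L_i, word_dim)
        words = torch.cat(word_feats_list, dim=1)        # (B, sum L_i, word_dim)
        q, k, v = self.q(img_feats), self.k(words), self.v(words)
        attn = torch.softmax(q @ k.transpose(1, 2) / k.size(-1) ** 0.5, dim=-1)
        context = attn @ v                                # (B, R, hidden)
        return img_feats + self.out(context)              # residual refinement

if __name__ == "__main__":
    fuse = MultiTextAttention(img_dim=128, word_dim=256)
    regions = torch.randn(2, 64, 128)                      # 8x8 region grid
    captions = [torch.randn(2, 12, 256) for _ in range(3)] # three descriptions
    print(fuse(regions, captions).shape)                   # torch.Size([2, 64, 128])
```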

2. Free-form text-to-face generation.
Text-to-face generation seeks to capture and reconstruct the deep, comprehensive semantic attributes of face images, which requires the flexible use of a rich vocabulary and complex sentence structures. Current text-to-face methods share a common weakness: because the number of sentence templates in training sets is limited, the models generalize poorly. To address this, the thesis defines the task of "free-style" text-to-face generation and manipulation and proposes a dual-branch text-to-face generation framework in which a face reconstruction task guides text-to-face generation. The model incorporates the Contrastive Language-Image Pre-training (CLIP) model to learn a precisely aligned language-vision feature space; thanks to CLIP's training on large-scale data, the vocabulary the model can effectively process and recognize is greatly enlarged. To improve the semantic alignment between text and images, a memory module that accepts descriptions of different lengths and styles is introduced, converting text and image features into normalized latent codes that accurately map the attributes of the target face. A semi-supervised training strategy and multiple objective functions are further adopted to improve the diversity and semantic consistency of the generated images. With these improvements, the method shows clear advantages in handling flexible text descriptions and in generating diverse, richer, and more realistic face images.
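
The sketch below shows one way a free-form caption and a face image could both be embedded with CLIP and mapped into a shared normalized latent code. It uses the open-source OpenAI `clip` package; the file name `face.jpg` and the small `Mapper` MLP are hypothetical stand-ins, and the thesis' memory module, dual-branch generator, and semi-supervised objectives are not reproduced here.

```python
# Illustrative sketch: CLIP embeddings of text and image mapped into a common
# normalized latent space (a simplification, not the thesis implementation).
import torch
import torch.nn as nn
import torch.nn.functional as F
import clip                      # https://github.com/openai/CLIP
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

class Mapper(nn.Module):
    """Maps a CLIP embedding to a normalized latent code of dimension z_dim."""
    def __init__(self, clip_dim: int = 512, z_dim: int = 512):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(clip_dim, z_dim), nn.ReLU(),
                                 nn.Linear(z_dim, z_dim))
    def forward(self, e):
        return F.normalize(self.net(e), dim=-1)

text_mapper, image_mapper = Mapper().to(device), Mapper().to(device)

caption = "a young woman with curly brown hair and a gentle smile"
with torch.no_grad():
    t = model.encode_text(clip.tokenize([caption]).to(device)).float()
    i = model.encode_image(preprocess(Image.open("face.jpg"))
                           .unsqueeze(0).to(device)).float()

z_text, z_image = text_mapper(t), image_mapper(i)
# Training would pull z_text and z_image together (e.g. a cosine loss) so that
# either code can drive the same face generator; here we only print the gap.
print(F.cosine_similarity(z_text, z_image))
```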

3. Refined text-to-image generation based on semantic detailing.
Building on the experience gained in text-to-face generation and on the progress of diffusion models in complex scene synthesis, this part turns to the more demanding general text-to-image generation problem. Although current diffusion-based text-to-image methods can already create highly realistic and creative images, they still struggle with long prompts that contain rich detail, particularly when it comes to deeply understanding and rendering the many sub-details such prompts imply. The thesis argues that this bottleneck partly stems from the CLIP model's limited ability to capture fine-grained semantics in complex, multi-layered long descriptions. A new diffusion-based refined text-to-image generation strategy is therefore proposed, which focuses on the text-driven refinement of the generation process so as to strengthen the depiction of, and guidance by, textual semantics and achieve more precise control over the generated image. Specifically, during the iterative denoising process of the diffusion model, semantic guidance gradients are introduced as an additional input so that the model analyses and renders selected sub-concepts more thoroughly. By directly combining these gradients, the framework can integrate multiple semantic elements; users can take a full descriptive sentence as the basic guidance for generation and can additionally emphasize particular words or phrases to influence the details of the output. Experiments on several representative datasets show that, compared with existing text-to-image methods, the approach improves the refinement of semantic details, achieving finer pixel-level control and greater diversity.
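
To make the guidance mechanism concrete, the minimal sketch below injects a semantic-score gradient into a DDPM-style denoising update, in the spirit of classifier guidance. The denoiser and the concept scorer are untrained toy stand-ins (not the thesis model or CLIP), so only the placement of the gradient term in the update is meaningful.

```python
# Minimal sketch of gradient-based semantic guidance inside a DDPM-style
# denoising loop (classifier-guidance style); all networks are toy stand-ins.
import torch
import torch.nn as nn

T = 50
betas = torch.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bar = torch.cumprod(alphas, dim=0)

eps_model = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.SiLU(),
                          nn.Conv2d(16, 3, 3, padding=1))
concept_scorer = nn.Sequential(nn.Conv2d(3, 8, 3, padding=1), nn.SiLU(),
                               nn.Flatten(), nn.Linear(8 * 32 * 32, 1))

def semantic_score(x):
    # Stand-in for e.g. a CLIP similarity between x and an emphasized sub-concept.
    return concept_scorer(x).sum()

@torch.no_grad()
def guided_sample(shape=(1, 3, 32, 32), guidance_scale=2.0):
    x = torch.randn(shape)
    for t in reversed(range(T)):
        eps = eps_model(x)
        # Semantic guidance gradient w.r.t. the current noisy sample x_t.
        with torch.enable_grad():
            x_in = x.detach().requires_grad_(True)
            grad = torch.autograd.grad(semantic_score(x_in), x_in)[0]
        # Shift the predicted noise along the guidance direction.
        eps = eps - (1 - alpha_bar[t]).sqrt() * guidance_scale * grad
        mean = (x - betas[t] / (1 - alpha_bar[t]).sqrt() * eps) / alphas[t].sqrt()
        noise = torch.randn_like(x) if t > 0 else torch.zeros_like(x)
        x = mean + betas[t].sqrt() * noise
    return x

print(guided_sample().shape)  # torch.Size([1, 3, 32, 32])
```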

Keywords: Generative Adversarial Networks; Diffusion Models; Text-to-Image Generation; Face Image Editing
Subject Area: Information Science and Systems Science (信息科学与系统科学)
MOST Discipline Catalogue: Engineering (工学)
Language: Chinese (中文)
Sub-direction Classification: Biometric Recognition (生物特征识别)
Planning Direction of the National Key Laboratory: Visual Information Processing (视觉信息处理)
Paper Associated Data:
Document Type: 学位论文 (Dissertation)
Identifier: http://ir.ia.ac.cn/handle/173211/57175
Collection: 模式识别实验室 (Pattern Recognition Laboratory); 毕业生_博士学位论文 (Graduates: Doctoral Dissertations)
Recommended Citation (GB/T 7714):
孙建新. 非受限场景下文本到图像的生成方法研究[D], 2024.
Files in This Item:
File Name/Size: 非受限场景下文本到图像的生成方法研究.p (32226 KB) | DocType: 学位论文 (Dissertation) | Access: 开放获取 (Open Access) | License: CC BY-NC-SA