文本指导的视频生成方法研究 (Research on Text-Guided Video Generation Methods)
刘佳伟
2023-05
Pages: 86
Subtype: 硕士 (Master's)
Abstract

      With the development of deep learning technology, AI-based content generation has begun to see wide application across many fields. Since Internet content is becoming increasingly visual and video-centric, text-guided video generation is emerging as a research area that attracts great attention from both academia and industry. Video generation controlled by natural human language is highly controllable and has broad application prospects, including data augmentation in academic research, as well as video material generation and visual effects in industrial scenarios.

      Although existing methods based on generative adversarial networks, vector quantized auto-encoders, and diffusion models have achieved basic text-to-video generation, several problems in text-guided video generation remain to be solved. On the one hand, existing methods focus on generating video frames, whereas a real video is a multi-modal data format composed of both visual and audio content, and audio is an important part of a video. To this end, this thesis proposes a new sub-task of text-guided video generation, namely text-guided sounding video generation, and presents a unified sounding video generation framework. On the other hand, existing video generation models are usually difficult to train and contain large numbers of trainable parameters. In fact, pretrained text-to-image generation models have already learned to generate visual content from text, and a video generation model can be obtained by additionally learning temporal modeling on top of them. Therefore, this thesis proposes to build an efficiently trainable video generation model by adding extra temporal modeling modules to a pretrained text-to-image diffusion model.

      The main work and contributions of this thesis are summarized as follows:

      • Sounding video generation based on visual-audio vector quantized auto-encoders. For text-guided sounding video generation, this thesis proposes a unified generation framework based on vector quantized auto-encoders. Vector quantized auto-encoders are used to encode video frames and audio mel-spectrograms into discrete tokens, and an auto-regressive Transformer sequence model then generates the visual and audio tokens. To introduce multi-modal associations in the encoding stage and improve the quantized audio-visual representations, this thesis proposes a hybrid contrastive learning method, in which inter-modal contrastive learning introduces cross-modal associations and intra-modal contrastive learning keeps each single-modality feature space stable. A cross-modal attention module is further proposed to build local-level multi-modal associations between vision and audio. In the sequence generation stage, this thesis proposes a modality-alternating sequence format so that each generated token can attend to information from all three modalities, i.e., text, vision, and audio (a minimal sketch of this sequence format follows this paragraph). In addition, to address the lack of audio descriptions in existing text-video paired datasets, this thesis builds a large-scale, human-annotated video dataset with descriptions of both the visual and audio modalities. With these methods, this thesis achieves excellent text-to-sounding-video generation.
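
Below is a minimal Python sketch of how such a modality-alternating token sequence could be assembled once the text prompt, video frames, and mel-spectrograms have been quantized into integer token ids. The function name, separator tokens, and per-time-step chunking are illustrative assumptions, not the exact format used in the thesis.

```python
# A minimal sketch of a modality-alternating token sequence. All names and the
# per-time-step chunking are assumptions for illustration, not the thesis code.
from typing import List

def build_alternating_sequence(text_tokens: List[int],
                               visual_chunks: List[List[int]],
                               audio_chunks: List[List[int]],
                               bos: int, sep_v: int, sep_a: int) -> List[int]:
    """Interleave visual and audio token chunks after the text prompt, so that
    a causally masked auto-regressive Transformer generating the next token can
    attend to text, visual, and audio context at the same time."""
    assert len(visual_chunks) == len(audio_chunks), "one audio chunk per visual chunk"
    seq = [bos] + list(text_tokens)
    for v_chunk, a_chunk in zip(visual_chunks, audio_chunks):
        seq += [sep_v] + list(v_chunk)   # visual tokens of this time step
        seq += [sep_a] + list(a_chunk)   # audio tokens of the same time step
    return seq
```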

      • An efficient training method for diffusion-based text-guided video generation. Text-to-image generation models already possess basic multi-modal association and visual content generation capabilities. This thesis therefore builds a video generation model on top of a state-of-the-art pretrained text-to-image diffusion model, inheriting and freezing most of its parameters to reduce the number of trainable parameters while preserving its generation capability. To maintain coherence between frames, this thesis proposes a subject-preserving (identity) attention mechanism that lets the currently generated frame attend to features at all spatial positions of the previously generated frame, keeping the main content of adjacent frames consistent. In addition, to introduce multi-modal associations between the text and multiple frames over time, this thesis proposes a temporal cross-modal cross-attention module, which uses a temporal convolution layer to integrate multi-frame information before the cross-attention. With these designs, this thesis achieves excellent performance on text-guided video generation while greatly reducing the number of trainable parameters.

Other Abstract

      With the development of deep learning models, Artificial Intelligence Generated Content (AIGC) has been widely used in various fields. Among these applications, video generation has attracted a lot of attention from both academia and industry, since it can generate copyright-free videos for media makers and aid data augmentation for deep learning models. Text-to-video generation in particular, which synthesizes videos conditioned on natural language, improves the controllability of video generation and is becoming a popular research subject.

      Previous video generation methods based on Generative Adversarial Networks (GANs), Vector Quantized Variational Auto-Encoders (VQVAEs), and Diffusion Models (DMs) have achieved impressive results on in-domain and open-domain text-to-video generation. However, some key issues remain to be resolved in text-guided video generation. On the one hand, current text-to-video generation approaches mainly concentrate on visual frame generation, whereas video is actually a type of multi-modal data that includes both visual and audio components. Thus, in this paper, we introduce a novel sub-task of text-to-video generation, i.e., text-to-sounding-video generation, and propose a unified framework for generating realistic videos along with audio signals. Besides, we also produce a text-video paired dataset for this novel task. On the other hand, most current successful text-to-video generation models have large numbers of trainable parameters and are costly to train. In fact, pretrained image generation models have already acquired visual generation capabilities and can be utilized for video generation. Thus, in this paper, we propose an efficiently trainable diffusion model for text-to-video generation, built on a pretrained text-to-image generation model with a novel temporal module added.

      The main contributions are summarized as follows:

      • Sounding video generation with visual-audio vector quantized variational auto-encoders. In this paper, we propose a novel Sounding Video Generator (SVG) as a VQVAE-based unified framework for text-to-sounding-video generation. Specifically, we present SVG-VQGAN to transform visual frames and audio mel-spectrograms into discrete tokens. SVG-VQGAN applies a novel hybrid contrastive learning method to model inter-modal and intra-modal consistency and improve the quantized representations (a minimal sketch of such an objective follows this paragraph). A cross-modal attention module is employed to extract associated features of visual frames and audio signals for contrastive learning. Then, a Transformer-based decoder is used to model associations between texts, visual frames, and audio signals at the token level for auto-regressive sounding video generation. AudioSet-Cap, a human-annotated text-video-audio paired dataset, is produced for training SVG. Experimental results demonstrate the superiority of our method compared with existing text-to-video generation methods as well as audio generation methods.
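
Below is a minimal PyTorch sketch of a hybrid contrastive objective of the kind described above, combining an inter-modal term that aligns visual and audio embeddings with intra-modal terms that keep each single-modality embedding space stable. The symmetric InfoNCE form, the augmented-view inputs, and the loss weights are assumptions for illustration and may differ from the actual SVG-VQGAN objective.

```python
# A sketch of hybrid (inter-modal + intra-modal) contrastive learning.
# Function and variable names are illustrative, not the thesis implementation.
import torch
import torch.nn.functional as F

def info_nce(q, k, temperature=0.07):
    """Symmetric InfoNCE over a batch: matching (q_i, k_i) pairs are positives,
    all other pairs in the batch serve as negatives."""
    q = F.normalize(q, dim=-1)
    k = F.normalize(k, dim=-1)
    logits = q @ k.t() / temperature                      # (B, B) similarities
    targets = torch.arange(q.size(0), device=q.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

def hybrid_contrastive_loss(z_v, z_a, z_v_aug, z_a_aug, w_inter=1.0, w_intra=1.0):
    """z_v / z_a: pooled visual / audio embeddings of the same clips, shape (B, D).
    z_v_aug / z_a_aug: embeddings of augmented views of the same inputs.
    The inter-modal term pulls paired visual and audio embeddings together;
    the intra-modal terms stabilize each modality's own feature space."""
    inter = info_nce(z_v, z_a)
    intra = info_nce(z_v, z_v_aug) + info_nce(z_a, z_a_aug)
    return w_inter * inter + w_intra * intra
```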

      • An Efficient Training Framework for Diffusion-based Text-to-Video Generation. In this paper, we construct a text-to-video generation model based on a state-of-the-art text-to-image generation model, and propose an Efficient training framework for Diffusion-based Text-to-Video generation (ED-T2V). Most of the parameters of the pretrained text-to-image generation model are frozen to inherit its generation capabilities and reduce the training cost. To model temporal dynamic information, we propose temporal transformer blocks with novel identity attention and temporal cross-attention (a minimal sketch of both follows this paragraph). The identity attention requires the currently generated frame to attend to all positions of its previous frame, thus providing an efficient way to keep the main content consistent across frames. Besides, the movements in generated videos are controlled by the textual descriptions, so there should be associations between the text condition and multiple frames of a generated video. Thus, we propose temporal cross-attention in the ED-T2V transformer block, which integrates multiple tokens along the time dimension via a convolutional layer. With the aforementioned improvements, ED-T2V not only significantly reduces the training cost of video diffusion models, but also achieves excellent generation fidelity and controllability. In addition, experimental results on video editing are provided to demonstrate the effectiveness and versatility of the proposed methods.
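
Below is a minimal PyTorch sketch of the two attention mechanisms described above: an identity attention in which each frame's tokens attend to all spatial tokens of the previous frame, and a temporal cross-attention in which a temporal convolution integrates tokens across frames before attending to the text condition. Tensor layouts, module names, and the residual wiring are illustrative assumptions; in ED-T2V these blocks are inserted into a largely frozen pretrained text-to-image diffusion backbone.

```python
# Sketches of identity attention and temporal cross-attention as described in
# the text; shapes and wiring are assumptions, not the exact ED-T2V blocks.
import torch
import torch.nn as nn

class IdentityAttention(nn.Module):
    """Each frame's tokens attend to ALL spatial tokens of the previous frame,
    keeping the main content of adjacent frames consistent."""
    def __init__(self, dim, num_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x):                                  # x: (B, T, N, D)
        b, t, n, d = x.shape
        # frame t attends to frame t-1; the first frame attends to itself
        prev = torch.cat([x[:, :1], x[:, :-1]], dim=1)
        q = x.reshape(b * t, n, d)
        kv = prev.reshape(b * t, n, d)
        out, _ = self.attn(q, kv, kv)
        return (q + out).reshape(b, t, n, d)               # residual connection

class TemporalCrossAttention(nn.Module):
    """A temporal convolution integrates tokens along the time dimension before
    cross-attending to the text condition."""
    def __init__(self, dim, num_heads=8, kernel_size=3):
        super().__init__()
        self.temporal_conv = nn.Conv1d(dim, dim, kernel_size,
                                       padding=kernel_size // 2)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x, text):                 # x: (B, T, N, D), text: (B, L, D)
        b, t, n, d = x.shape
        # convolve over the frame axis independently at every spatial position
        h = x.permute(0, 2, 3, 1).reshape(b * n, d, t)
        h = self.temporal_conv(h).reshape(b, n, d, t).permute(0, 3, 1, 2)
        q = h.reshape(b, t * n, d)
        out, _ = self.attn(q, text, text)       # cross-attention to text tokens
        return x + out.reshape(b, t, n, d)      # residual connection
```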

Keywords: AI-Generated Content; Multi-modal; Video Generation
Subject Area: Artificial Intelligence
MOST Discipline Catalogue: Engineering::Computer Science and Technology (degrees may be conferred in Engineering or Science)
Language: Chinese
IS Representative Paper
Sub Direction Classification: Multi-modal Intelligence
Planning Direction of the National Key Laboratory: Visual Information Processing
Paper associated data
Document Type: Thesis (学位论文)
Identifier: http://ir.ia.ac.cn/handle/173211/51922
Collection: 毕业生_硕士学位论文 (Graduates_Master's Theses)
紫东太初大模型研究中心_图像与视频分析 (Zidong Taichu Large Model Research Center_Image and Video Analysis)
Recommended Citation (GB/T 7714):
刘佳伟. 文本指导的视频生成方法研究[D],2023.
Files in This Item:
File Name/Size | DocType | Version | Access | License
毕业论文--刘佳伟.pdf (15246 KB) | 学位论文 (Thesis) | — | 限制开放 (Restricted Access) | CC BY-NC-SA
