CASIA OpenIR > Graduates > Master's Theses
Visual Content Generation Based on Reinforcement Learning
秦子涵
2024-05-16
Pages: 72
Degree Type: Master
Abstract (Chinese)

With the development of deep learning technology, AI-based content generation has begun to find widespread application across many fields. In particular, as internet content shifts toward images and video, visual content generation has become a research area of great interest to both academia and industry, with broad applications in academic research, advertising design, art and media creation, animation and game production, virtual reality, and beyond.

In recent years, visual content generation models have made remarkable progress and can produce realistic, diverse images and videos that leave a deep impression. Nevertheless, the field still faces several open problems: traditional training objectives typically measure the distance between model outputs and specific ground-truth images or videos, ignoring prior knowledge such as the image or video distribution of the dataset; evaluation metrics can provide reliable insight into the quality and diversity of generated content, but they require large numbers of samples to produce accurate scores; and evaluation metrics cannot be used to train models, creating an inconsistency between the training and evaluation stages.

To address these problems, this thesis proposes a ConSensus-based Image Evaluation metric (CSIE) and a ConSensus-based Video Evaluation metric (CSVE), which exploit prior knowledge of the real image or video distribution to strengthen traditional training guidance and improve the overall quality of generated image and video samples. In addition, evaluating generative models with CSIE and CSVE significantly reduces the number of samples required for accurate assessment. The thesis then adopts policy-gradient reinforcement learning to directly optimize the image or video evaluation metric, aligning the training objective with the evaluation metric and improving model performance. Concretely, the generative model is treated as an agent interacting with an external environment: the model's input is the state in reinforcement learning, and the model parameters define a policy that yields an action, namely the prediction of the generated content.

The main contributions and innovations of this thesis are summarized as follows:

    Autoregressive image generation based on stochastic policy gradients. This thesis proposes a new consensus-based image evaluation metric, CSIE, which takes prior image knowledge (such as the real image distribution) into account; the metric correlates strongly with traditional evaluation metrics while being more robust to the number of images. Furthermore, to resolve the inconsistency between training objectives and evaluation metrics for autoregressive image generation models, the stochastic policy-gradient method is applied during training to directly optimize the CSIE score as the reward, bridging the gap between the training and evaluation stages and improving image generation performance. Extensive experiments on the MS COCO dataset demonstrate CSIE's sensitivity to noise, its correlation with traditional metrics, and its robustness to the number of images, as well as the effectiveness of the policy-gradient method for autoregressive image generation models.

    Diffusion image generation based on deterministic policy gradients. A stochastic policy selects an action in a given state at random according to the policy, so the stochastic policy gradient operates on a probability distribution over actions: the policy is sampled and its parameters are optimized toward maximizing the cumulative reward. In contrast, under a deterministic policy, once the policy and the state are fixed the action is uniquely determined. Since a diffusion model outputs a deterministic image rather than a predicted probability distribution, this chapter introduces the deterministic policy-gradient method into diffusion image generation, using it to fine-tune a pre-trained image generation diffusion model. In addition, to counter the regression-to-the-mean problem caused by the L2 loss, the thesis uses adversarial learning to synthesize high-frequency image details. Extensive experiments on the LSUN Bedroom and ImageNet datasets demonstrate the effectiveness of the approach.

    Diffusion video generation based on deterministic policy gradients. First, to exploit prior video knowledge such as the real video distribution when optimizing video generation models, and to remedy the poor robustness of traditional video distribution-based metrics to the number of videos, this thesis proposes a new consensus-based video evaluation metric, CSVE, which extends CSIE by further incorporating the temporal information of videos. Then, to resolve the inconsistency between the training objective and the evaluation metric in video generation, this chapter fine-tunes diffusion-based video generation methods with the deterministic policy-gradient method; experiments on the UCF101 dataset demonstrate the effectiveness of the approach.

Abstract (English)

With the advancement of deep learning technology, AI-based content generation has begun to find widespread applications in various fields. Visual content generation in particular is gaining significant attention from both academia and industry, as internet content increasingly takes the form of images and videos. It has broad applications in academic research, advertising design, art and media creation, animation and game production, virtual reality, and other fields.

 

In recent years, visual content generation models have made significant progress and can generate realistic, diverse images and videos that leave a lasting impression. However, the field of visual content generation still faces many challenges: traditional training objectives often calculate the distance between model outputs and specific real images or videos, overlooking prior knowledge such as the distribution of images or videos in the dataset; evaluation metrics can provide reliable insights into the quality and diversity of generated content, but they require a large number of samples to produce accurate scores; and evaluation metrics cannot be used for model training, resulting in inconsistency between the training and evaluation stages.

 

To address these issues, this thesis first proposes the Consensus-based Image Evaluation (CSIE) and Consensus-based Video Evaluation (CSVE) metrics, which leverage prior knowledge of real image or video distributions to enhance traditional training guidance and improve the overall quality of generated image and video samples. Additionally, using CSIE and CSVE for evaluation significantly reduces the number of samples required for accurate assessment. We then adopt a reinforcement learning approach based on policy gradients to directly optimize the image or video evaluation metric, ensuring consistency between training objectives and evaluation metrics and enhancing model performance. In this framing, the generative model is viewed as an agent interacting with an external environment: the model's input represents the state in reinforcement learning, and the model parameters define a policy that leads to an action, namely the prediction of the generated content.
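The abstract does not give the exact form of CSIE/CSVE, so the following is only a minimal sketch of the consensus idea it describes: rather than comparing a generated sample to one specific ground-truth image, each sample is scored against summary statistics of the real distribution, which is one way a metric can stabilize with far fewer samples. The feature vectors, the mean/variance summary, and the Gaussian-style scoring rule are all illustrative assumptions, not the thesis's definitions.

```python
import math

def fit_consensus(real_features):
    """Summarize the real-data distribution once: per-dimension mean and
    standard deviation over a set of real feature vectors."""
    d = len(real_features[0])
    n = len(real_features)
    mean = [sum(f[i] for f in real_features) / n for i in range(d)]
    std = [math.sqrt(sum((f[i] - mean[i]) ** 2 for f in real_features) / n) + 1e-8
           for i in range(d)]
    return mean, std

def consensus_score(feature, mean, std):
    """Score a single generated sample by its agreement with the real
    distribution's consensus: 1.0 at the mean, decaying with the
    normalized distance.  Being per-sample, the score does not need
    thousands of generations to stabilize."""
    z2 = sum(((feature[i] - mean[i]) / std[i]) ** 2 for i in range(len(feature)))
    return math.exp(-z2 / len(feature))

# Toy 2-D "features" of real images (assumed, for illustration only).
real = [[1.0, 2.0], [1.2, 1.8], [0.8, 2.2]]
mean, std = fit_consensus(real)
on_dist = consensus_score([1.0, 2.0], mean, std)    # near the real consensus
off_dist = consensus_score([5.0, -3.0], mean, std)  # far from it
```

A sample lying at the consensus scores near 1.0, while an outlier scores near 0, so a single generation already yields a meaningful signal.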

The main contributions and innovations of this thesis are summarized as follows:

    Autoregressive Image Generation Based on Stochastic Policy Gradient. This thesis proposes a new Consensus-based Image Evaluation metric (CSIE), which considers prior image knowledge (such as the real image distribution). The metric exhibits a strong correlation with traditional evaluation metrics and better robustness to the number of images. Furthermore, to address the inconsistency between training objectives and evaluation metrics, stochastic policy gradient methods are applied to the training of autoregressive image generation models, directly optimizing CSIE scores as rewards; this bridges the gap between the training and evaluation stages and improves image generation performance. Extensive experiments on the MS COCO dataset demonstrate the sensitivity of CSIE to noise, its correlation with traditional evaluation metrics, and its robustness to the number of images, as well as the effectiveness of the policy gradient method for autoregressive image generation models.
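As a toy illustration of the training scheme this contribution describes (not the thesis's actual model), the sketch below applies REINFORCE, a stochastic policy gradient method, to a miniature "autoregressive generator" whose per-position logits stand in for the model's conditionals. A sequence-level reward plays the role of the CSIE score, and a running-mean baseline reduces gradient variance; every name here is an assumption for illustration.

```python
import math
import random

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sample_and_update(theta, reward_fn, baseline, lr=0.2, rng=random):
    """One REINFORCE step: sample a whole token sequence, score it once
    with a sequence-level reward (the role CSIE plays in the thesis),
    and scale each position's log-likelihood gradient by the advantage."""
    seq = []
    for t in range(len(theta)):
        p1 = softmax(theta[t])[1]           # prob of "token 1" at step t
        seq.append(1 if rng.random() < p1 else 0)
    reward = reward_fn(seq)
    advantage = reward - baseline
    for t, a in enumerate(seq):
        probs = softmax(theta[t])
        for j in (0, 1):
            grad_log_pi = (1.0 if j == a else 0.0) - probs[j]  # d log pi / d theta
            theta[t][j] += lr * advantage * grad_log_pi
    return reward

random.seed(1)
length = 4
theta = [[0.0, 0.0] for _ in range(length)]   # logits per position
baseline = 0.0
for _ in range(500):
    # toy reward: fraction of positions emitting token 1
    r = sample_and_update(theta, lambda s: sum(s) / length, baseline)
    baseline = 0.9 * baseline + 0.1 * r       # running-mean baseline
avg_p1 = sum(softmax(row)[1] for row in theta) / length
```

After training, the policy's probability of the rewarded token rises well above its initial 0.5, even though the reward is observed only for complete sequences.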

    Diffusion Image Generation Based on Deterministic Policy Gradient. A stochastic policy selects an action in a given state at random according to the policy; the stochastic policy gradient therefore operates on a probability distribution over actions, which is sampled while the policy parameters are optimized toward maximizing the cumulative reward. Unlike a stochastic policy, a deterministic policy produces a unique action once the policy and state are fixed. Since the output of a diffusion model is a deterministic image rather than the prediction of a probability distribution, this chapter introduces deterministic policy gradient methods into diffusion image generation, fine-tuning a pre-trained image generation diffusion model with them. Furthermore, to address the regression-to-the-mean issue caused by the L2 loss function, adversarial learning is used to synthesize high-frequency image details. Extensive experiments on the LSUN Bedroom and ImageNet datasets demonstrate the effectiveness of this approach.
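The chain rule behind the deterministic policy gradient can be shown in one line: because the action a = mu_theta(s) is a deterministic function of the parameters, the reward gradient flows straight through the policy, d r(a)/d theta = r'(a) * d mu/d theta. The sketch below is a hedged toy, with mu_theta(s) = theta * s standing in for the denoising network and a hand-written differentiable reward replacing any learned score; none of these names come from the thesis.

```python
def dpg_step(theta, state, reward_grad, lr=0.1):
    """One deterministic policy gradient update.

    The 'policy' maps the state (e.g. the noisy latent fed to a diffusion
    model) deterministically to an output a = mu_theta(s) = theta * s, so
    the reward gradient chains through without any sampling:
        d r(a) / d theta = r'(a) * d mu / d theta
    """
    a = theta * state          # deterministic action (the generated "image")
    da_dtheta = state          # d mu / d theta for this toy policy
    return theta + lr * reward_grad(a) * da_dtheta  # ascend the reward

# Toy reward r(a) = -(a - 2)^2 with gradient r'(a) = -2 (a - 2);
# the reward is maximal at a = 2, so theta should converge to 2.
theta = 0.0
for _ in range(100):
    theta = dpg_step(theta, state=1.0,
                     reward_grad=lambda a: -2.0 * (a - 2.0))
```

Each step contracts the distance to the optimum by a constant factor (theta_next = 0.8 * theta + 0.4 here), so theta converges geometrically to 2.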

    Diffusion Video Generation Based on Deterministic Policy Gradient. First, to leverage prior video knowledge such as the distribution of real videos, and to improve on the poor robustness of traditional video distribution-based metrics to the number of videos, this thesis proposes a novel Consensus-based Video Evaluation metric (CSVE), which extends the CSIE metric by further considering the temporal information of videos. Then, to address the inconsistency between the training objectives of video generation methods and the evaluation metrics, this chapter fine-tunes diffusion-based video generation methods using deterministic policy gradient methods. Experimental results on the UCF101 dataset demonstrate the effectiveness of this approach.
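Purely as an illustration of how a video metric can add a temporal term on top of a per-frame image score (the thesis's CSVE formula is not given in this abstract), the sketch below combines an average frame score with a term rewarding motion statistics close to those of real videos. The motion statistic, the weighting, and the toy frames are all assumptions.

```python
def temporal_consistency(frames):
    """Mean absolute change between consecutive frames; a crude stand-in
    for the temporal statistics a video metric must capture on top of
    per-frame image quality."""
    diffs = []
    for prev, cur in zip(frames, frames[1:]):
        diffs.append(sum(abs(a - b) for a, b in zip(prev, cur)) / len(cur))
    return sum(diffs) / len(diffs)

def video_score(frames, frame_score_fn, real_motion, alpha=0.5):
    """Illustrative CSVE-style combination: a spatial term averaging a
    per-frame image score, plus a temporal term that rewards motion
    statistics close to those measured on real videos."""
    spatial = sum(frame_score_fn(f) for f in frames) / len(frames)
    motion_gap = abs(temporal_consistency(frames) - real_motion)
    temporal = 1.0 / (1.0 + motion_gap)
    return alpha * spatial + (1.0 - alpha) * temporal

# Two toy "videos" with identical per-frame quality but different motion;
# real videos are assumed to move by 0.1 per frame on average.
smooth = [[0.0, 0.0], [0.1, 0.1], [0.2, 0.2]]
static = [[0.0, 0.0], [0.0, 0.0], [0.0, 0.0]]
s_smooth = video_score(smooth, lambda f: 1.0, real_motion=0.1)
s_static = video_score(static, lambda f: 1.0, real_motion=0.1)
```

A purely image-level score could not separate these two clips; the temporal term penalizes the frozen video for motion statistics unlike real footage.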

 

Keywords: Image Generation; Video Generation; Reinforcement Learning
Language: Chinese
Document Type: Thesis
Identifier: http://ir.ia.ac.cn/handle/173211/57631
Collection: Graduates_Master's Theses
Recommended Citation:
GB/T 7714
秦子涵. 基于强化学习的视觉内容生成[D],2024.
Files in This Item:
File Name/Size | Doc Type | Version | Access | License
qzh-硕士毕业论文-最终版-基于强化学(11517KB) | Thesis | | Restricted Access | CC BY-NC-SA