Research on the Theory and Methods of Controllable Face Video Generation (可控人脸视频生成理论和方法研究)
宋林森 (Song Linsen)
2023-05-17
Pages: 146
Degree type: Doctoral
Chinese Abstract

The face is the most direct biological feature that people display. Since ancient times, people have depicted faces in various forms, and in the multimedia era, face videos have become the richest medium for depicting faces. As one of the emerging research directions in computer vision, the theory and methods of controllable face video generation are of great research significance and have wide application demand, such as the production of face videos in the post-production stage of movies and the creation of user-generated content on the internet. Although recent related work has made research progress and achieved practical results in controllable face video generation, there remain emerging directions to study and difficult challenges to solve. On the one hand, faces in videos contain rich and varied expressions as well as emotions of different categories and intensities. How to generate accurate facial expressions from the input driving information and how to freely control the emotions of faces in the generated videos are unavoidable problems in controllable face video generation. On the other hand, most face video generation methods focus on generating frontal face videos and require a large amount of training data, requirements that are hard to meet in real applications. The viewpoint challenge and the data-volume challenge are difficulties that controllable face video generation must confront on its way to practical application. Addressing these problems, this thesis studies controllable face video generation from the perspectives of expression control, emotion control, the viewpoint challenge, and the data-volume challenge, and explores extending controllable video generation methods for real faces to pareidolia faces. The research achievements of this thesis are as follows:

1. A face video generation method with controllable expression and emotion is proposed. To achieve expression control, considering the intra-frame and inter-frame correlations of speech and video, this thesis proposes, based on self-attention models, an audio self-attention model that maps speech to expression across modalities and a visual self-attention model that generates the face video; meanwhile, an information bottleneck is introduced into the visual self-attention model to guarantee the generalization ability of the generative model under limited data. To achieve emotion control, considering that the research community lacks a large-scale multi-view emotional face audio-visual dataset, this thesis builds an audio-visual dataset of 60 actors of multiple ethnicities covering 7 viewpoints, 8 emotions, and 3 emotion intensities. The dataset contains 281,400 audio-visual clips, and each actor's audio-visual data averages about 39 minutes. Among publicly available face audio-visual datasets at home and abroad, it has the most viewpoints, the most emotion categories and intensities, the highest resolution, and the longest average data per actor. Based on this dataset, this thesis designs a neutral-to-emotional translation network and a conditionally controlled generation network to freely control the emotion in face videos.

2. Controllable face video generation methods are proposed for the multi-view challenge and the limited-data challenge. These methods broaden the practical applications of controllable face video generation. For the multi-view challenge, this thesis parameterizes and disentangles the face of the video in 3D space, and uses the disentangled expression as an intermediary through which the driving speech controls the facial expression in 3D space. According to the viewpoint of the face in the video, the driven face in 3D space is projected and rendered to generate controllable face videos under different viewpoints. In addition, disentangling the identity information in the speech and the 3D face model enables the method to achieve many-to-many driven face video generation. For the limited-data challenge, this thesis proposes a search-based semi-parametric video generation model that makes full use of the facial textures in the training data to generate face videos that are as realistic and continuous as possible. Limited data also makes it difficult to learn the style of facial expressions in video data, for which this thesis designs a network that fuses content and style across modalities.

3. An attempt is made at controllable video generation for pareidolia faces, which mainly refers to faces that visually strike people as similar to real faces. Compared with real faces, pareidolia faces show more diversity in shape and texture. For shape diversity, this thesis uses Bézier curves to build a unified shape model and motion representation for real and pareidolia faces, and uses this motion representation to transfer the motion of real faces to pareidolia faces. For texture diversity, this thesis proposes a 2D motion diffusion and approximation method to estimate the optical flow that drives the pareidolia face image. This thesis is the first to propose this exploratory study and presents a solution. The shape and texture diversity of pareidolia faces also appears in toy faces, animal faces, cartoon faces, and other face-like objects; this research may inspire further studies on controllable video generation for such pareidolia faces.

English Abstract

The face is the most direct biological feature that people display. Since ancient times, people have represented faces in various forms, and in the multimedia era, face videos have become the richest medium for representing faces.

As one of the emerging research directions in the field of computer vision, the theory and methods of controllable face video generation are of great research significance and have wide application scenarios, such as the production of face videos in the post-production stage of movies and the creation of user-generated content on the internet.

Although recent related work has made notable research progress and achieved practical applications in controllable face video generation, there remain emerging research directions to explore and difficult challenges to resolve.

On the one hand, faces in videos contain a rich variety of expressions and emotions of different categories and intensities. How to generate accurate facial expressions from the input driving information and how to freely control the emotions of faces in the generated video are unavoidable problems in controllable face video generation.

On the other hand, most face video generation methods focus on generating frontal face videos and require a large amount of training data, requirements that are difficult to meet in practical applications. The multi-view challenge and the limited-data challenge are thus obstacles on the way to applying controllable face video generation methods to real-world tasks. This thesis studies controllable face video generation from the perspectives of expression control, emotion control, the multi-view challenge, and the limited-data challenge. It also explores how to extend controllable video generation methods designed for real faces to pareidolia faces. The research contributions of this thesis are as follows:


1. A face video generation method with controllable expression and emotion is proposed. To achieve expression control, considering the intra-frame and inter-frame correlations of speech and video, this thesis proposes an audio transformer model that maps speech to expression and a visual transformer model that generates face videos from the expression. An information bottleneck is also introduced into the visual transformer model to ensure the generalization ability of the generative model under limited data. To achieve emotion control, considering the lack of large-scale multi-view emotional face audio-visual datasets in the research field, this thesis builds an audio-visual dataset containing 60 actors of various ethnicities, 7 viewpoints, 8 emotions, and 3 emotion intensities. The dataset comprises 281,400 audio-visual clips, with about 39 minutes of data per actor on average. Among publicly available face audio-visual datasets, it has the largest number of viewpoints, emotion categories, and emotion intensities, the highest resolution, and the longest average duration per actor. Based on this dataset, this thesis designs a neutral-to-emotional translation network and a conditional generative network to freely control facial emotions in videos.
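As a concrete illustration of the expression-control pipeline above, the sketch below shows a minimal audio self-attention model with a variational information bottleneck in PyTorch. The dimensions, layer counts, and the variational form of the bottleneck are illustrative assumptions rather than the thesis's actual architecture.

```python
import torch
import torch.nn as nn

class AudioToExpression(nn.Module):
    """Maps per-frame audio features to expression codes via self-attention."""
    def __init__(self, audio_dim=80, model_dim=256, expr_dim=64,
                 n_layers=4, n_heads=4):
        super().__init__()
        self.embed = nn.Linear(audio_dim, model_dim)
        layer = nn.TransformerEncoderLayer(d_model=model_dim, nhead=n_heads,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        # Information bottleneck (assumed variational form): parameterize a
        # Gaussian over expression codes and penalize KL to a standard normal,
        # limiting channel capacity to aid generalization under limited data.
        self.to_mu = nn.Linear(model_dim, expr_dim)
        self.to_logvar = nn.Linear(model_dim, expr_dim)

    def forward(self, audio_feats):
        # audio_feats: (batch, frames, audio_dim), e.g. mel-spectrogram frames;
        # self-attention captures intra- and inter-frame correlations.
        h = self.encoder(self.embed(audio_feats))
        mu, logvar = self.to_mu(h), self.to_logvar(h)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # reparameterize
        kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
        return z, kl  # expression codes + bottleneck penalty for the loss

expr, kl_loss = AudioToExpression()(torch.randn(2, 100, 80))  # 2 clips, 100 frames
```

A visual transformer decoding frames from `z` would follow the same self-attention pattern, with `kl_loss` added to the training objective.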


2. We propose controllable face video generation methods that address the multi-view challenge and the limited-data challenge. These methods expand the practical applications of controllable face video generation.

To tackle the multi-view challenge, we parameterize and disentangle the face in the video in 3D space, and use the disentangled expression as an intermediate representation through which the driving speech controls the facial expression in 3D space. According to the face viewpoint in the target video, the driven face in 3D space is projected and rendered into controllable face videos under different viewpoints. In addition, disentangling the identity information from the voice and the 3D face model enables the proposed method to achieve many-to-many driven face video generation.
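A schematic of this disentanglement step, under the generic assumption of a linear 3DMM-style face model (the bases `B_id`, `B_expr` and the pinhole projection below are hypothetical simplifications, not the thesis's exact formulation): speech updates only the expression coefficients, leaving identity and viewpoint untouched.

```python
import numpy as np

def reconstruct(mean_shape, B_id, B_expr, alpha_id, beta_expr):
    # mean_shape: (N, 3); B_id: (N, 3, K_id); B_expr: (N, 3, K_expr).
    # Linear model: vertices = mean + identity offsets + expression offsets.
    return mean_shape + B_id @ alpha_id + B_expr @ beta_expr

def drive_and_project(mean_shape, B_id, B_expr, alpha_id,
                      beta_from_speech, R, t, f):
    # Only the expression coefficients come from speech; identity (alpha_id)
    # and viewpoint (R, t) are kept separate, so one audio clip can drive many
    # identities under many views (the many-to-many property).
    verts = reconstruct(mean_shape, B_id, B_expr, alpha_id, beta_from_speech)
    cam = verts @ R.T + t               # rotate/translate into the target view
    uv = f * cam[:, :2] / cam[:, 2:3]   # pinhole projection to the image plane
    return uv                           # rasterization/rendering would follow
```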

To address the limited-data challenge, we propose a search-based semi-parametric video generation model that makes full use of the facial texture in the training data to generate face videos that are as realistic and continuous as possible. Limited data also makes it difficult to learn the style of facial expressions in video data, so we design a network that fuses content and style across modalities to address this issue.
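The search-based idea can be sketched as nearest-neighbour retrieval over an indexed bank of training frames; the descriptors and the simple temporal blend below are hypothetical stand-ins for the thesis's actual search and compositing steps.

```python
import numpy as np

def retrieve_frames(target_desc, bank_desc, bank_frames):
    # target_desc: (T, D) per-frame descriptors (e.g. expression/landmark codes);
    # bank_desc: (M, D) descriptors of training frames; bank_frames: (M, H, W, 3).
    dists = np.linalg.norm(target_desc[:, None, :] - bank_desc[None, :, :], axis=-1)
    nearest = dists.argmin(axis=1)    # reuse real texture: pick the closest frame
    return bank_frames[nearest]

def temporal_blend(frames, w=0.7):
    # Exponential smoothing over retrieved frames for temporal continuity.
    out = frames.astype(np.float32)
    for i in range(1, len(out)):
        out[i] = w * out[i] + (1.0 - w) * out[i - 1]
    return out.astype(np.uint8)
```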


3. This thesis makes an attempt to generate controllable videos of pareidolia faces, i.e., visually illusory faces that people perceive as similar to real faces. Compared to real faces, pareidolia faces exhibit more diversity in shape and texture.

To address the shape diversity, this thesis uses Bézier curves to build a unified shape model and motion representation for both real and pareidolia faces, and transfers the motion from real faces to pareidolia faces through this representation.
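A toy version of the shared Bézier representation: both a real-face contour and a pareidolia-face contour are described by cubic Bézier control points, and motion transfers as control-point displacements. The bounding-box scale normalization is a simplifying assumption.

```python
import numpy as np

def cubic_bezier(ctrl, n=50):
    # ctrl: (4, 2) control points -> (n, 2) points sampled along the curve.
    t = np.linspace(0.0, 1.0, n)[:, None]
    return ((1 - t) ** 3 * ctrl[0] + 3 * (1 - t) ** 2 * t * ctrl[1]
            + 3 * (1 - t) * t ** 2 * ctrl[2] + t ** 3 * ctrl[3])

def transfer_motion(src_rest, src_moved, dst_rest):
    # Move the target's control points by the source's control-point motion,
    # scaled by the bounding-box ratio so the transfer is scale-invariant.
    scale = np.ptp(dst_rest, axis=0) / np.ptp(src_rest, axis=0)
    return dst_rest + (src_moved - src_rest) * scale

# Example: a real mouth contour opening, transferred to a pareidolia "mouth".
mouth_rest = np.array([[0., 0.], [1., .2], [2., .2], [3., 0.]])
mouth_open = mouth_rest + np.array([[0., 0.], [0., -.5], [0., -.5], [0., 0.]])
pareidolia_mouth = np.array([[0., 0.], [2., .4], [4., .4], [6., 0.]])
driven = transfer_motion(mouth_rest, mouth_open, pareidolia_mouth)
curve = cubic_bezier(driven)  # sample the driven contour for rendering
```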

To address the texture diversity, we propose a 2D motion diffusion and approximation method to estimate the optical flow that animates the pareidolia face image. This thesis is the first to propose this exploratory research on driving visually illusory faces and presents a solution. The diversity of pareidolia faces in shape and texture also appears in toy faces, animal faces, cartoon faces, and other face-like objects, so this research may inspire more studies on controllable video generation for these types of pareidolia faces.
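One way to read "2D motion diffusion" is as sparse-to-dense propagation: motion vectors known on the driven contour points are held fixed while the rest of the field is iteratively averaged from its neighbours, a Laplacian-diffusion approximation (the thesis's exact scheme may differ).

```python
import numpy as np

def diffuse_flow(sparse_flow, known_mask, iters=500):
    # sparse_flow: (H, W, 2), valid only where known_mask (H, W) is True.
    # Repeated 4-neighbour averaging relaxes toward a smooth interpolating
    # field while clamping the known vectors, yielding a dense flow field.
    flow = sparse_flow.copy()
    for _ in range(iters):
        avg = 0.25 * (np.roll(flow, 1, axis=0) + np.roll(flow, -1, axis=0)
                      + np.roll(flow, 1, axis=1) + np.roll(flow, -1, axis=1))
        flow = np.where(known_mask[..., None], sparse_flow, avg)
    return flow
```

The resulting dense flow can then warp the pareidolia face image frame by frame.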

Keywords: Video Generation; Face Video; Generative Models; Computer Vision
Subject area: Computer Neural Networks
Discipline: Engineering :: Computer Science and Technology (engineering or science degrees conferrable)
Language: Chinese
Sub-direction classification (seven major directions): Image and Video Processing and Analysis
State Key Laboratory planning direction: Intelligent Computing and Learning
Associated dataset requiring deposit:
Document type: Thesis
Identifier: http://ir.ia.ac.cn/handle/173211/51931
Collection: Graduates_Doctoral Dissertations
Recommended citation (GB/T 7714):
宋林森. 可控人脸视频生成理论和方法研究[D],2023.
Files in this item
File name/size · Document type · Version · Access · License
宋林森博士论文_2023_6_2_答辩修 (46441 KB) · Thesis · Restricted access · CC BY-NC-SA