Title: Dynamic Facial Editing Based on Neural Radiance Fields (基于神经辐射场的动态人脸编辑)
Author: 杨嵩林 (Yang Songlin)
Date: 2024-05
Pages: 96
Degree Type: Master's
Abstract

Breakthroughs in generative artificial intelligence have greatly accelerated digital content creation and brought a new paradigm to computer vision research, driving major advances in, and widespread application of, facial synthesis and editing technologies. Despite these remarkable achievements, generative models built on 2D image representations remain constrained by their limited perception of 3D facial structure. This makes it difficult to maintain consistent facial geometry and texture in dynamic facial editing tasks such as viewpoint changes or sequential video generation.

To address this limitation, researchers have introduced facial neural radiance fields (NeRFs). However, dynamic facial editing based on neural radiance fields still faces many challenges: editing operations often disrupt the highly consistent facial representations that the radiance field has learned.
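As background (not part of the original record): a neural radiance field encodes a scene as a learned function mapping a 3D point and viewing direction to color and density, and images are synthesized by volume rendering along camera rays. The standard discretized rendering formula is

```latex
\hat{C}(\mathbf{r}) = \sum_{i=1}^{N} T_i \left(1 - e^{-\sigma_i \delta_i}\right) \mathbf{c}_i ,
\qquad
T_i = \exp\!\left(-\sum_{j=1}^{i-1} \sigma_j \delta_j\right),
```

where \sigma_i and \mathbf{c}_i are the density and color predicted at the i-th sample along ray \mathbf{r}, and \delta_i is the spacing between adjacent samples. The editing methods summarized below all manipulate the networks producing \sigma and \mathbf{c} while trying to preserve the multi-view consistency this rendering provides.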

In response, this thesis starts from using 3D facial priors to control NeRF-based generative models. It first studies 3D-consistent facial attribute editing under spatial viewpoint changes, validating the importance of 3D information for improving temporal consistency. It then extends to the temporal dimension, proposing a method for constructing dynamic neural radiance fields through multimodal information fusion that achieves temporally smooth editing of talking-head videos. Finally, it investigates dense correspondence between facial neural radiance fields without 3D priors, further improving fine-grained facial motion editing.

The thesis comprises three lines of research: dynamic facial attribute editing based on neural radiance field inversion, dynamic facial editing based on multimodal fusion of neural radiance fields, and dynamic facial editing based on dense correspondence between neural radiance fields. The contributions are summarized as follows:

1) To address the poor 3D consistency of attribute editing results obtained through neural radiance field inversion, this thesis proposes a 3D-aware facial encoder that improves the 3D consistency of facial attribute editing under multi-view rendering. The encoder introduces a parametric facial model as a 3D prior, allowing facial geometry and texture representations to be decoupled while inverting the NeRF generative model. Building on this, the thesis designs a dual-stream attribute editing module that fully exploits the decoupled representations for flexible editing of geometry and texture, and extends the 3D-aware encoding to sequential video editing, validating its advantage in maintaining 3D temporal consistency on dynamic video. A hedged code sketch of this design follows below.
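The record gives no implementation details, so the following PyTorch sketch is only a plausible illustration of the kind of dual-stream residual editing described above; every name in it (DualStreamAttributeEditor, w_geo, w_tex, attr_delta) is hypothetical rather than the thesis's actual API.

```python
import torch
import torch.nn as nn

class DualStreamAttributeEditor(nn.Module):
    """Hypothetical sketch: edit decoupled geometry/texture latents
    produced by a 3D-aware inversion encoder."""

    def __init__(self, latent_dim: int = 512, num_attributes: int = 8):
        super().__init__()
        # One editing stream per representation, conditioned on the
        # desired attribute offsets.
        def stream():
            return nn.Sequential(
                nn.Linear(latent_dim + num_attributes, latent_dim),
                nn.LeakyReLU(0.2),
                nn.Linear(latent_dim, latent_dim),
            )
        self.geometry_stream = stream()
        self.texture_stream = stream()

    def forward(self, w_geo, w_tex, attr_delta):
        # Each stream predicts a residual, so unedited attributes stay
        # close to the inverted latents, helping identity preservation
        # and 3D consistency.
        d_geo = self.geometry_stream(torch.cat([w_geo, attr_delta], dim=-1))
        d_tex = self.texture_stream(torch.cat([w_tex, attr_delta], dim=-1))
        return w_geo + d_geo, w_tex + d_tex

# Usage with dummy latents standing in for the encoder's output.
editor = DualStreamAttributeEditor()
w_geo, w_tex = torch.randn(1, 512), torch.randn(1, 512)
attr_delta = torch.zeros(1, 8)
attr_delta[0, 2] = 1.0  # request an edit of one attribute only
w_geo_edit, w_tex_edit = editor(w_geo, w_tex, attr_delta)
```

In such a design, the geometry latent would condition the shape-related branch of the NeRF generator and the texture latent the appearance branch, so each edit touches only the stream it belongs to.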

2) To address the heavy dependence on source video of the target person for training, and the inter-frame jitter, in talking-head video editing based on facial neural radiance fields, this thesis proposes a text-based dynamic facial NeRF framework that supports adding, deleting, and replacing content in talking-head videos. The framework improves on existing talking-head video editing in three ways (see the sketch after this paragraph). First, it casts video motion prediction as a non-autoregressive model, enabling efficient training on large-scale talking-head datasets and improving how well the model learns the "speech-to-video" mapping prior. Second, it adopts a "pre-training + fine-tuning" scheme for personalized dynamic facial NeRF modeling, balancing the competing demands of training time, the amount of source video required for the target person, and generation quality. Third, it introduces the video's contextual motion sequence as a prior, improving the smoothness of predicted and edited facial motion sequences.
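As an illustration of what non-autoregressive motion prediction with a context prior can look like, here is a self-contained PyTorch sketch; the architecture, dimensions, and names (NonAutoregressiveMotionPredictor, edit_mask, and so on) are assumptions for exposition, not the thesis's model.

```python
import torch
import torch.nn as nn

class NonAutoregressiveMotionPredictor(nn.Module):
    """Hypothetical sketch: predict a whole facial-motion sequence from
    audio features in one forward pass, using the unedited context
    frames as a smoothness prior."""

    def __init__(self, audio_dim=80, motion_dim=64, d_model=256, n_layers=4):
        super().__init__()
        self.audio_proj = nn.Linear(audio_dim, d_model)
        self.motion_proj = nn.Linear(motion_dim, d_model)
        layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=4,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.head = nn.Linear(d_model, motion_dim)

    def forward(self, audio_feats, context_motion, edit_mask):
        # Frames inside the edited span carry no motion input; the model
        # fills them in jointly from audio plus surrounding context,
        # instead of frame by frame as an autoregressive model would.
        motion_in = context_motion * (~edit_mask).unsqueeze(-1)
        x = self.audio_proj(audio_feats) + self.motion_proj(motion_in)
        return self.head(self.encoder(x))

# Usage: re-predict frames 30..70 of a 100-frame sequence.
model = NonAutoregressiveMotionPredictor()
audio = torch.randn(1, 100, 80)    # e.g. mel-spectrogram features
motion = torch.randn(1, 100, 64)   # motion codes of the original video
mask = torch.zeros(1, 100, dtype=torch.bool)
mask[:, 30:70] = True              # the span being edited
pred = model(audio, motion, mask)  # (1, 100, 64) in a single pass
```

Predicting the whole sequence at once is what makes large-scale pre-training efficient, and conditioning on the surrounding context frames is what keeps the edited span's motion smooth at its boundaries.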

3) To address the limitations of using 3D facial priors to model dense correspondences between different facial neural radiance fields, this thesis proposes a dense correspondence method that establishes implicit point correspondences between facial NeRFs, enabling fine-grained transfer of facial expression and pose. This overcomes earlier methods' inability to effectively encode regions such as the eyes and hair. The thesis adopts a tri-plane representation as the underlying facial NeRF and decomposes it into a canonical space, an identity deformation, and a motion deformation; the motion deformation maps motion information to a weighted sum over a set of learnable orthogonal plane bases, as sketched below. This is among the first work in the field to animate a facial neural radiance field from a single input image without requiring a 3D facial prior.
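To make the last idea concrete, the sketch below shows one plausible way to express motion deformation as a weighted sum over a bank of learnable tri-plane bases with an orthogonality penalty; shapes and names (MotionPlaneBasis, motion_code) are illustrative assumptions, not the thesis's implementation.

```python
import torch
import torch.nn as nn

class MotionPlaneBasis(nn.Module):
    """Hypothetical sketch: motion deformation as a weighted sum of
    learnable (softly orthogonal) tri-plane bases."""

    def __init__(self, n_bases=20, channels=32, res=64, motion_dim=64):
        super().__init__()
        # Bank of basis tri-planes: (n_bases, 3 planes, C, H, W).
        self.bases = nn.Parameter(
            torch.randn(n_bases, 3, channels, res, res) * 0.01)
        # Map a motion descriptor to per-basis weights.
        self.weight_mlp = nn.Sequential(
            nn.Linear(motion_dim, 128), nn.ReLU(),
            nn.Linear(128, n_bases))

    def orthogonality_penalty(self):
        # Training-time regularizer pushing the flattened bases toward
        # mutual orthogonality (off-diagonal Gram entries toward zero).
        flat = self.bases.flatten(1)
        flat = flat / flat.norm(dim=1, keepdim=True)
        gram = flat @ flat.t()
        return ((gram - torch.eye(gram.size(0))) ** 2).mean()

    def forward(self, motion_code):
        w = self.weight_mlp(motion_code)  # (B, n_bases)
        # Weighted sum of bases -> a motion tri-plane, to be combined
        # with the canonical and identity tri-planes elsewhere.
        return torch.einsum('bn,npchw->bpchw', w, self.bases)

motion_code = torch.randn(2, 64)                 # per-frame motion descriptor
motion_planes = MotionPlaneBasis()(motion_code)  # (2, 3, 32, 64, 64)
```

Because the bases are shared while only the weights vary per frame, points that move together land on the same basis directions, which is one way implicit correspondences across different facial NeRFs could emerge.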

Keywords: Neural Radiance Fields; Dynamic Facial Editing; Facial Attribute Editing; Facial Motion Editing
Subject Area: Pattern Recognition
Discipline: Engineering::Control Science and Engineering; Engineering::Computer Science and Technology (degrees conferrable in engineering or science)
Language: Chinese
Representative Thesis: Yes
Sub-direction Classification (Seven Major Directions): Image and Video Processing and Analysis
State Key Laboratory Planning Direction: Visual Information Processing
Document Type: Thesis
Identifier: http://ir.ia.ac.cn/handle/173211/57539
Collection: Graduates_Master's Theses
Recommended Citation (GB/T 7714):
杨嵩林. 基于神经辐射场的动态人脸编辑[D]. 2024.
Files in This Item:
硕士学位论文-杨嵩林-最终版.pdf (11258 KB) | Document Type: Thesis | Access: Restricted | License: CC BY-NC-SA