Research on Data-Driven Talking Head and Bimodal Emotion Recognition

Title	数据驱动的说话人头像技术及双模态表情识别研究
Alternative title	Research on Data-Driven Talking Head and Bimodal Emotion Recognition
Author	辛乐
Date	2008-05-29
Degree type	Doctor of Engineering
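Abstract (translated from Chinese)	Talking Head technology has in recent years been a very active research direction in natural human-computer interaction, and it is widely used as an intelligent front end in many computer and web application systems. Because it expresses the talking head and speech synchronously, as two modalities working in parallel, it can bring human-computer interaction closer to human-to-human interaction, break through the current interaction bottleneck represented by the mouse and keyboard, and greatly improve the naturalness and efficiency of existing human-computer interaction. The expressive quality of current talking head systems is still subject to many limitations and falls far short of the goal of an articulate, richly expressive face. From the perspective of realizing expressive, intelligent multimodal information presentation with a talking head, this dissertation briefly introduces the research background and significance of person-specific 3D face modeling, facial animation, and multimodal emotion recognition. On this basis, we use a multimodal database built with existing data acquisition equipment and corpus segmentation and annotation tools, together with recent advances in statistical analysis and machine learning, to carry out preliminary investigations of, in turn, fast, realistic, personalized 3D face modeling from video sequences, expressive speech-driven facial animation generation, and bimodal expression recognition that adds the ability to perceive expressions. The main work of this dissertation is as follows: ① Focusing on automatic modeling of realistic, animatable 3D faces, we propose a novel algorithm for automatic and accurate 3D face modeling from a video sequence. First, a low-quality video sequence is captured with readily available hardware (such as an inexpensive webcam). The algorithm then analyzes the sequence efficiently and automatically to recover as much accurate 3D structural information about the face as possible. The technique has the advantages that the input video is easy to obtain (low-quality video shot with a USB camera suffices) and that it is easy to use (the user only needs to turn their head in front of the camera). Moreover, the whole modeling process is fully automatic and requires no user interaction, so that a wide range of people can enjoy more digital entertainment. The algorithm effectively solves the problem of matching across image sequences and advances research on 3D face modeling from images and image sequences. ② In research on speech-driven visual speech synthesis, this dissertation addresses the modeling of the synchronized mapping between speech and facial animation (audio-visual mapping). Based on an analysis of the synchronous interaction between acoustic speech and visual speech, and starting from the HMM, which has achieved good results in audio-visual fusion, we propose a dynamic audio-visual mapping algorithm based on Fused HMM inversion. The algorithm uses a Fused HMM to explicitly represent the two tightly coupled, synchronized audio and visual sequences. For a new speech input, Fused HMM inversion synthesizes the optimal visual output by maximizing the joint probability distribution represented by the Fused HMM, ensuring high-quality visual speech synthesis. We also propose extracting subclasses of the multimodal data by two-layer clustering, which ensures real-time, realistic speech-driven facial animation output. ③ We propose a new boosting-based bimodal emotion recognition method with adaptive weights. When classifying the various emotions, the method takes into account that the features of each modality play a dominant role to different degrees, and it automatically adjusts the weights that reflect this dominance during training. The method improves recognition of easily confused emotion categories more effectively. To strengthen the extraction of visual expression parameters from the lower part of the face, we also propose an utterance-independent lip movement model. Because extracting the visual parameters requires facial feature point tracking results that are not contaminated by noise, we propose searching with a point distribution model in six emotion-specific facial shape deformation manifolds to ensure tracking quality. This dissertation makes some useful attempts and explorations concerning key techniques of data-driven talking head technology and bimodal emotion recognition, and it has obtained some preliminary results. We hope that this work and its conclusions will be helpful to research on expressive talking head technology.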
English abstract	Talking Head has become an active research topic in natural Human-Computer Interaction (HCI). Widely used as the intelligent front end in many computer and web application systems, Talking Head makes HCI both more natural and more efficient. To improve the performance of expressive Talking Head systems, this dissertation focuses on automatic realistic 3D facial modeling from video, voice-driven facial animation, and bimodal emotion recognition, drawing on recent developments in statistical analysis and machine learning. The dissertation mainly includes: ① We develop an efficient technique for fully automatic recovery of accurate 3D face shape from videos captured by a low-cost camera. The input video is easy to obtain, and the method is easy for users to apply. Our method advances research on 3D facial modeling from images and image sequences. ② Realistic audio-visual mapping remains a very challenging problem. We present a new dynamic audio-visual mapping approach based on Fused Hidden Markov Model inversion. When it is applied within pre-built subsets, realistic synthesized facial animation with relatively short time delay is obtained. Experiments on a 3D motion capture bimodal database show that the synthetic results are comparable with the ground truth. ③ We present a novel method for multimodal emotion recognition using boosting algorithms. The method generates adaptive weights for audio and facial features during training and is shown to give better performance on a test set. It is also shown that the two channels differ in importance in bimodal emotion recognition and that this difference needs to be accounted for. In addition to other visual expression parameters extracted from the upper part of the face, we use utterance-independent lip movement (UILM) models to enhance the visual expression parameters extracted from the lower part. The visual parameters are calculated from noise-free facial salient point tracking results, which are ensured by a point distribution model (PDM) search in six emotion-specific facial shape deformation manifolds.
Keywords	自然人机交互; 说话人头像; 三维人脸建模; 人脸动画; 可视语音合成; 语音驱动; 情感识别; 多模态信息融合; 融合隐马尔可夫模型; Natural Human-Computer Interaction; Talking Head; Individual 3D Facial Modeling; Facial Modeling; Visual Speech Synthesis; Speech Driven Facial Animation; Emotion Recognition; Fused Hidden Markov Model
Language	Chinese
Document type	Dissertation
Item identifier	http://ir.ia.ac.cn/handle/173211/6087
Collection	Graduates_Doctoral dissertations
Recommended citation (GB/T 7714)	辛乐. 数据驱动的说话人头像技术及双模态表情识别研究[D]. 中国科学院自动化研究所. 中国科学院研究生院,2008.

Files in this item
File name/size	Document type	Version type	Access type	License
CASIA_20031801460303 (2713KB)			Not currently open	CC BY-NC-SA