With the growing demands of human-computer interaction (HCI), visual speech synthesis has attracted more and more research attention. Visual speech synthesis not only makes HCI more natural, but also improves its accuracy: for example, it can raise the recognition rate of speech recognition systems in noisy environments and help hearing-impaired people better understand a speaker. It is also widely applied in virtual reality, virtual announcers, virtual meetings, filmmaking, and game entertainment. The key difficulty in visual speech synthesis lies in audio-visual synchronized mapping, because people are very familiar with facial movement. Besides establishing the system framework, this paper mainly focuses on audio-visual synchronized mapping.

The paper first gives a brief introduction to the background and research content of visual speech synthesis. It then describes the research work along the four main steps of building such a system. First, a labeled, MPEG-4-based multimodal database, the CASIA Multimodal Database, was built with a motion capture system to meet different research requirements; it contains synchronized speech, 2D video, and 3D face movement data. Second, speech and visual features were extracted and analyzed separately from the multimodal data: the FAP extraction method discarded the redundant information in the large amount of raw data, and for the face movement features the principal component representations were derived and analyzed. Third, two audio-visual mapping algorithms were implemented: a dynamic-unit-selection-based method, which emphasizes the realism and naturalness of the synthesized animation, and an HMM-based method, which emphasizes a real-time, automatic, and efficient system. Finally, after a smoothing algorithm was applied, the synthesized face movement parameters were output to drive a model-based face animation.
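To make the second step concrete, the sketch below shows one way to derive principal component representations from per-frame FAP vectors. It is a minimal illustration, assuming the motion data is stored as a NumPy array of MPEG-4 FAP values per frame; the array shapes, function name, and component count are illustrative and not taken from the paper.

```python
import numpy as np

def fap_pca(fap_frames, n_components=8):
    """Reduce per-frame FAP vectors to their principal components.

    fap_frames: (n_frames, n_faps) array of MPEG-4 FAP values,
    e.g. the low-level FAPs of each captured frame. All names and
    sizes here are illustrative; the paper's feature set may differ.
    """
    mean = fap_frames.mean(axis=0)
    centered = fap_frames - mean
    # SVD of the centered data yields the principal directions.
    _, s, vt = np.linalg.svd(centered, full_matrices=False)
    components = vt[:n_components]        # (k, n_faps) basis vectors
    scores = centered @ components.T      # (n_frames, k) coefficients
    explained = (s ** 2) / (s ** 2).sum() # variance ratio per component
    return mean, components, scores, explained[:n_components]

# Frames can be reconstructed from the compact representation via
# approx = scores @ components + mean
```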
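For the HMM-based mapping of the third step, one simplified realization is sketched below: fit an HMM on the acoustic features, then associate each hidden state with the mean FAP vector of the frames it covers, so that decoding new audio yields a raw visual trajectory. This is only a stand-in under stated assumptions; the paper's actual HMM formulation is not specified here, and the use of the third-party hmmlearn library is an assumption.

```python
import numpy as np
from hmmlearn.hmm import GaussianHMM  # assumption: toolkit not named in the paper

def train_av_hmm(audio_feats, visual_feats, n_states=16):
    """Fit an HMM on acoustic features and learn, per hidden state,
    the mean visual (FAP) vector of the frames it covers.

    audio_feats: (n_frames, d_audio); visual_feats: (n_frames, d_fap);
    the two streams are assumed frame-aligned."""
    hmm = GaussianHMM(n_components=n_states, covariance_type="diag", n_iter=50)
    hmm.fit(audio_feats)
    states = hmm.predict(audio_feats)
    state_to_fap = np.vstack([
        visual_feats[states == s].mean(axis=0) if np.any(states == s)
        else np.zeros(visual_feats.shape[1])
        for s in range(n_states)
    ])
    return hmm, state_to_fap

def audio_to_fap(hmm, state_to_fap, audio_feats):
    """Decode the most likely state path for new audio and emit the
    per-state FAP means as the raw (unsmoothed) trajectory."""
    return state_to_fap[hmm.predict(audio_feats)]
```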
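The final step smooths the synthesized parameters before they drive the face model. The paper does not detail its smoothing algorithm here; the snippet below uses a plain moving average over each FAP channel as one common, simple choice.

```python
import numpy as np

def smooth_faps(fap_traj, window=5):
    """Moving-average smoothing of a synthesized FAP trajectory to
    suppress frame-to-frame jitter before driving the face model.

    fap_traj: (n_frames, n_faps). Window size is illustrative."""
    kernel = np.ones(window) / window
    # Smooth each FAP channel independently; mode="same" keeps length.
    return np.apply_along_axis(
        lambda ch: np.convolve(ch, kernel, mode="same"), 0, fap_traj)
```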