CASIA OpenIR > Graduates > Master's Theses
Alternative Title: Data-Driven Visual Speech Synthesis
Thesis Advisor: 陶建华
Degree Grantor: Graduate University of Chinese Academy of Sciences (中国科学院研究生院)
Place of Conferral: Institute of Automation, Chinese Academy of Sciences (中国科学院自动化研究所)
Degree Discipline: Computer Applied Technology
Keywords: Visual Speech Synthesis, MPEG-4, Visual Prosody, Unit Selection, Talking Head
Abstract: Research on visual speech synthesis greatly narrows the distance between humans and computers. It not only makes human–computer interaction more natural, but also improves the accuracy of recognition and expression, and can be widely applied in virtual reality, virtual anchors, virtual conferencing, film production, games and entertainment, and many other fields. As visual speech synthesis technology has matured, researchers have turned their attention to two questions: 1) how to integrate non-verbal information into facial animation, so that the synthesized face exhibits not only local lip movement but also natural expressions and head motion, moving facial animation from "stiff" to "lively" and producing expressive visual speech; 2) how to balance database size against realism, reducing the database without degrading synthesis quality while improving the flexibility and realism of the system. This thesis follows these two lines of thought. Building on an existing visual speech synthesis system, it analyzes visual prosody in Chinese, adopts a data-driven modeling approach, integrates non-verbal information into the original system, and establishes a more expressive Chinese text-to-visual-speech system. The thesis first briefly introduces the background and research content of visual speech synthesis, and then describes the main work along the three major parts of the system: 1) It studies how prosodic-word boundaries and phoneme articulation itself affect head movement in neutral-emotion read speech in Chinese. Regularities of head movement within two-character prosodic words are obtained; head-raising phonemes that strongly influence head movement are identified, together with the head-initialization movement preceding each utterance, providing theoretical support for the later fusion of visual prosody. 2) Several MPEG-4-compliant multimodal databases for different applications are established. The CASIA multimodal database is built with a real-time motion-capture system; MPEG-4-compliant facial motion features are analyzed and extracted from the multimodal databases, a large amount of redundant data is removed through FAP parameter extraction, and a deformable-template method makes the captured data more robust. 3) A text-to-visual-speech mapping based on dynamic unit selection is implemented. Control parameters are synthesized with a data-driven method; after resampling and smoothing, the synthesized facial motion feature parameters drive an MPEG-4 mesh animation model, forming a complete Chinese visual speech synthesis system.
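The FAP extraction step described above can be illustrated with a minimal sketch. This is not the thesis implementation; it only shows the general MPEG-4 idea of expressing each tracked feature point's displacement from the neutral face in FAPU (Facial Animation Parameter Unit) terms, so that parameters are normalized across faces. All data and the `fapu` value here are made up for illustration.

```python
import numpy as np

def compute_faps(frames, neutral, fapu):
    """Toy MPEG-4-style FAP extraction: per-frame displacement of
    tracked feature points from the neutral face, normalized by a
    FAPU so values are comparable across face models. Illustrative only."""
    faps = []
    for frame in frames:
        disp = frame - neutral   # displacement per feature point (x, y)
        faps.append(disp / fapu) # express displacement in FAPU units
    return np.array(faps)

# Illustrative data: 3 frames of 2 tracked lip points (x, y)
neutral = np.array([[0.0, 0.0], [1.0, 0.0]])
frames = [neutral + np.array([[0.0, 0.1], [0.0, 0.2]]) * t for t in range(3)]
fapu = 0.5  # e.g., a mouth-width FAPU in the same coordinate units

faps = compute_faps(frames, neutral, fapu)
print(faps.shape)  # (3, 2, 2): frames x points x coordinates
```

Storing only these normalized displacements, rather than raw marker trajectories, is what removes much of the redundancy mentioned in the abstract.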
Other Abstract: The development of visual speech synthesis technology has greatly shortened the distance between humans and computers. As the field matures, researchers are turning to the following two questions: 1) How can non-verbal information be integrated into facial animation, so that not only lip movement but also facial expression is synthesized, making the virtual talking head more lifelike? 2) How can database size be balanced against expressiveness; that is, how can the database be reduced without sacrificing the expressiveness of the talking head, making the overall system more flexible and realistic? Our study is carried out with both points in mind: it integrates non-verbal information into a previous TTVS (text-to-visual-speech) system and seeks a new visual speech synthesis method, so as to build a more expressive Chinese TTVS system. The thesis first gives a brief introduction to the background and research content of visual speech synthesis, and then describes the work along the three main steps of establishing such a system: 1) Visual prosody in Chinese articulation is investigated, in particular how prosodic-word boundaries and phoneme identity affect head movement in neutral read speech; this yields conclusions useful for the subsequent synthesis. 2) A labeled MPEG-4-compliant multimodal database, the CASIA Multimodal Database, is established with a motion-capture system; MPEG-4-compliant FAP parameters with little redundancy are extracted from it, and a deformable-template method is applied in this process to make the captured data more robust. 3) An expressive visual speech synthesis system with vivid expression output is implemented, using dynamic unit selection to synthesize the parameters that drive an MPEG-4 face model.
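The dynamic unit selection mentioned in step 3 can be sketched as a small dynamic program. This is a generic illustration, not the thesis's actual implementation: each stored candidate is a short FAP trajectory, a target cost measures how well a candidate matches the target specification, a concatenation cost measures the mismatch at the join between consecutive units, and the lowest-total-cost path is recovered by backtracking. The cost functions and weights here are assumptions for illustration.

```python
import numpy as np

def select_units(targets, candidates, w_t=1.0, w_c=1.0):
    """Minimal dynamic-programming unit selection (illustrative):
    choose one stored trajectory per target unit, minimizing
    weighted target cost plus concatenation cost at each join."""
    T, K = len(targets), len(candidates)
    # target cost: distance between a candidate's mean and the target value
    tc = np.array([[abs(c.mean() - t) for c in candidates] for t in targets])
    # concatenation cost: gap between unit i's last frame and unit j's first
    cc = np.array([[abs(ci[-1] - cj[0]) for cj in candidates]
                   for ci in candidates])

    cost = np.full((T, K), np.inf)
    back = np.zeros((T, K), dtype=int)
    cost[0] = w_t * tc[0]
    for t in range(1, T):
        # total[i, j]: best cost ending in unit j, coming from unit i
        total = cost[t - 1][:, None] + w_c * cc + w_t * tc[t][None, :]
        back[t] = total.argmin(axis=0)
        cost[t] = total.min(axis=0)

    # backtrack the cheapest path
    path = [int(cost[-1].argmin())]
    for t in range(T - 1, 1 - 1, -1):
        if t > 0:
            path.append(int(back[t][path[-1]]))
    return path[::-1]

# Illustrative use: 3 stored trajectories, 2 target units
candidates = [np.array([0.0, 0.1]), np.array([1.0, 1.1]), np.array([0.5, 0.4])]
targets = [0.05, 1.0]
print(select_units(targets, candidates))  # [0, 1]
```

In a real system the selected trajectories would then be resampled and smoothed before driving the face model, as the abstract describes.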
Other Identifier: 200528014628077
Document Type: Thesis (学位论文)
Recommended Citation (GB/T 7714):
周密. 基于数据驱动的可视语音合成研究[D]. 中国科学院自动化研究所, 中国科学院研究生院, 2008.
Files in This Item:
File Name/Size: CASIA_20052801462807 (822 KB) | DocType: Full Text | Access: Not yet open (暂不开放) | License: CC BY-NC-SA
