Visual speech synthesis plays an important role in speech technology and human-computer interaction research. Speech is produced by the human articulators, and visual information accompanies the acoustic signal, including facial expressions and the movements of the articulators. This visual information plays an important part in human communication. The main contribution of this thesis is the construction of a visual speech synthesis system, addressing several key aspects: multi-modal dataset construction, modeling of virtual human articulators, and statistical mapping between acoustic features and visual speech features. Specifically, the thesis covers the following research:

A multi-modal visual speech dataset for Mandarin is collected. It contains electromagnetic articulography (EMA) data from multiple speakers together with synchronously recorded speech waveforms. The text corpus covers all Mandarin vowels and compound vowels as well as high-frequency syllables and sentences. As supplementary material, this dataset will facilitate further studies on visual speech.

A prototype visual speech synthesis system is constructed, which takes the speech waveform of an arbitrary speaker as input and renders real-time visual speech animation of the lips and tongue. The visual speech features are normalized across speakers by defining a set of EMA directional relative displacement features, inspired by the facial animation parameters. A Gaussian mixture model is trained on data from multiple speakers to map acoustic parameters to visual speech features. Graphical articulator models based on curves and meshes are built for synthesizing the animations. A subjective evaluation of the system indicates that the synthesized articulatory animations help subjects distinguish vowels when no sound is provided.
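The GMM-based acoustic-to-visual mapping mentioned above is commonly realized by fitting a joint GMM over stacked acoustic and visual vectors and then taking the conditional expectation of the visual part given an acoustic frame. The following is a minimal numpy sketch of that conditional-mean step; the function name and the toy parameters are illustrative assumptions, not the thesis's actual implementation:

```python
import numpy as np

def gmm_conditional_mean(a, weights, means, covs, da):
    """E[v | a] under a joint GMM over z = [a; v].

    a:       acoustic frame, shape (da,)
    weights: mixture weights, shape (M,)
    means:   component means over [a; v], shape (M, da + dv)
    covs:    component covariances, shape (M, da + dv, da + dv)
    da:      dimensionality of the acoustic part
    """
    M = len(weights)
    dv = means.shape[1] - da
    log_r = np.empty(M)            # unnormalized log responsibilities
    cond_means = np.empty((M, dv))  # per-component E[v | a, m]
    for m in range(M):
        mu_a, mu_v = means[m, :da], means[m, da:]
        S_aa = covs[m, :da, :da]
        S_va = covs[m, da:, :da]
        diff = a - mu_a
        sol = np.linalg.solve(S_aa, diff)
        # log pi_m + log N(a; mu_a, S_aa), dropping the constant term
        log_r[m] = (np.log(weights[m])
                    - 0.5 * (diff @ sol)
                    - 0.5 * np.log(np.linalg.det(S_aa)))
        # conditional mean of the visual part for component m
        cond_means[m] = mu_v + S_va @ sol
    r = np.exp(log_r - log_r.max())
    r /= r.sum()
    return r @ cond_means
```

In practice the GMM parameters would be estimated by EM on joint acoustic-visual training frames pooled from all speakers; the sketch only shows the synthesis-time mapping.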
The visual speech features of each speaker contain speaker-specific characteristics, which makes it difficult to synthesize a target speaker's visual speech from multiple speakers' data. To address this problem, two strategies are used. The first is cross-speaker feature conversion. An EMA data conversion method is proposed that combines a spatial morphing method with a codebook mapping algorithm and also takes the acoustic parameters into consideration. It morphs the source speaker's data using thin-plate spline approximation and then combines the morphed result with the codebook mapping result. This method is ...
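The two stages of the conversion method above, thin-plate spline morphing followed by blending with a codebook mapping result, can be sketched with scipy's `RBFInterpolator`, whose `thin_plate_spline` kernel fits a TPS through corresponding landmark pairs (a positive `smoothing` value gives the approximating rather than interpolating spline). The landmark choice and the linear blend are illustrative assumptions, not the thesis's exact formulation:

```python
import numpy as np
from scipy.interpolate import RBFInterpolator

def tps_morph(src_landmarks, tgt_landmarks, points, smoothing=0.0):
    """Warp EMA coil `points` of the source speaker toward the target
    speaker's space, using a thin-plate spline fitted on landmark pairs
    (e.g. palate trace or reference coil positions; assumed here)."""
    warp = RBFInterpolator(src_landmarks, tgt_landmarks,
                           kernel='thin_plate_spline', smoothing=smoothing)
    return warp(points)

def blend(morphed, codebook_mapped, alpha=0.5):
    """Combine the TPS-morphed frame with the codebook mapping result;
    a simple linear combination is assumed for illustration."""
    return alpha * morphed + (1.0 - alpha) * codebook_mapped
```

Since a thin-plate spline reproduces affine transforms exactly, a pure translation of the landmarks translates every warped point by the same offset, which is a convenient sanity check for the fitted warp.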