With the growing demands of human-computer interaction (HCI), visual speech synthesis has attracted more and more research attention. Visual speech synthesis not only makes HCI more natural, but also improves its accuracy: for example, it can raise the recognition rate of speech recognition systems in noisy environments and help hearing-impaired people better understand a speaker. It is also widely applied in virtual reality, virtual announcers, virtual meetings, filmmaking, and game entertainment. The key difficulty in visual speech synthesis lies in audio-visual synchronized mapping, because people are very familiar with facial movement. Besides establishing the system framework, this paper mainly focuses on audio-visual synchronized mapping.

The paper first gives a brief introduction to the background and research content of visual speech synthesis. It then describes the research work along the four main steps of building such a system. First, a labeled, MPEG-4-based multimodal database, the CASIA Multimodal Database, was built with a motion capture system to meet different research requirements; it contains synchronized speech, 2D video, and 3D face movement data. Second, speech and visual features were extracted and analyzed separately from the multimodal data: the FAP extraction method discarded the redundant information in the large amount of raw data, and for the face movement features the principal component representations were derived and analyzed. Third, two audio-visual mapping algorithms were implemented: a dynamic-unit-selection-based method, which emphasizes the realism and naturalness of the synthesized animation, and an HMM-based method, which emphasizes a real-time, automatic, and efficient system. Finally, after a smoothing algorithm was applied, the synthesized face movement parameters were output to drive a model-based face animation.
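To make the second step concrete, the sketch below shows one way to derive principal component representations from per-frame FAP vectors. It is a minimal illustration, assuming the motion data is stored as a NumPy array of MPEG-4 FAP values per frame; the array shapes, function name, and component count are illustrative and not taken from the paper.

```python
import numpy as np

def fap_pca(fap_frames, n_components=8):
    """Reduce per-frame FAP vectors to their principal components.

    fap_frames: (n_frames, n_faps) array of MPEG-4 FAP values,
    e.g. the low-level FAPs of each captured frame. All names and
    sizes here are illustrative; the paper's feature set may differ.
    """
    mean = fap_frames.mean(axis=0)
    centered = fap_frames - mean
    # SVD of the centered data yields the principal directions.
    _, s, vt = np.linalg.svd(centered, full_matrices=False)
    components = vt[:n_components]        # (k, n_faps) basis vectors
    scores = centered @ components.T      # (n_frames, k) coefficients
    explained = (s ** 2) / (s ** 2).sum() # variance ratio per component
    return mean, components, scores, explained[:n_components]

# Frames can be reconstructed from the compact representation via
# approx = scores @ components + mean
```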
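For the HMM-based mapping of the third step, one simplified realization is sketched below: fit an HMM on the acoustic features, then associate each hidden state with the mean FAP vector of the frames it covers, so that decoding new audio yields a raw visual trajectory. This is only a stand-in under stated assumptions; the paper's actual HMM formulation is not specified here, and the use of the third-party hmmlearn library is an assumption.

```python
import numpy as np
from hmmlearn.hmm import GaussianHMM  # assumption: toolkit not named in the paper

def train_av_hmm(audio_feats, visual_feats, n_states=16):
    """Fit an HMM on acoustic features and learn, per hidden state,
    the mean visual (FAP) vector of the frames it covers.

    audio_feats: (n_frames, d_audio); visual_feats: (n_frames, d_fap);
    the two streams are assumed frame-aligned."""
    hmm = GaussianHMM(n_components=n_states, covariance_type="diag", n_iter=50)
    hmm.fit(audio_feats)
    states = hmm.predict(audio_feats)
    state_to_fap = np.vstack([
        visual_feats[states == s].mean(axis=0) if np.any(states == s)
        else np.zeros(visual_feats.shape[1])
        for s in range(n_states)
    ])
    return hmm, state_to_fap

def audio_to_fap(hmm, state_to_fap, audio_feats):
    """Decode the most likely state path for new audio and emit the
    per-state FAP means as the raw (unsmoothed) trajectory."""
    return state_to_fap[hmm.predict(audio_feats)]
```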
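The final step smooths the synthesized parameters before they drive the face model. The paper does not detail its smoothing algorithm here; the snippet below uses a plain moving average over each FAP channel as one common, simple choice.

```python
import numpy as np

def smooth_faps(fap_traj, window=5):
    """Moving-average smoothing of a synthesized FAP trajectory to
    suppress frame-to-frame jitter before driving the face model.

    fap_traj: (n_frames, n_faps). Window size is illustrative."""
    kernel = np.ones(window) / window
    # Smooth each FAP channel independently; mode="same" keeps length.
    return np.apply_along_axis(
        lambda ch: np.convolve(ch, kernel, mode="same"), 0, fap_traj)
```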