Chinese voice conversion is a technology that changes the speaker-specific characteristics of Chinese speech so that the transformed speech sounds as if another speaker had spoken it. This dissertation analyzes the speaker-individual information carried by speech features, proposes two methods to reduce the over-smoothing problem in the time domain and the frequency domain, and then proposes a pitch conversion method based on the pitch target model. The dissertation contains the following work:
1. Analyzing the individual information in acoustic features. The differences among utterances of the same transcription are divided into physiological differences and attitudinal differences. For physiological differences, we investigate how formant frequencies, which represent vocal tract features, and glottal parameters, which represent speech source features, differ across speakers. For attitudinal differences, we study the distribution of prosodic features in emotional speech compared with neutral speech.
2. Enhancing the voice quality of the transformed speech. Because the over-smoothing of the GMM mapping method degrades the voice quality of the transformed speech, we analyze and address the problem in both the time domain and the frequency domain. For over-smoothing in the time domain, we propose a hybrid mapping method that combines GMM mapping with codebook mapping; for over-smoothing in the frequency domain, we employ a post-filtering method that narrows the formant bandwidths and so sharpens the formant peaks.
3. Proposing a pitch conversion method specific to Chinese. Based on the characteristics of Chinese pitch, we propose a pitch conversion method built on the pitch target model. Experiments show that the pitch target model is highly capable of describing and converting Chinese pitch. The proposed method can not only modify the range of the pitch contour but also change its trend, so that the converted pitch contour matches the target pitch contour in shape.
4. Building an emotional speech conversion system. The dissertation uses the STRAIGHT algorithm to construct a Chinese voice conversion system and implements an emotional speech generation system based on voice conversion. Thanks to the proposed pitch target model based pitch conversion method, the system can generate emotional speech from input neutral speech.
5. Proposing a non-linear formant estimation method based on frequency subband prediction. A novel method is proposed that decomposes speech into mono-component signals by band-pass filtering within predicted subbands instead of within empirically chosen frequency ranges. The method is then applied to formant estimation, and the experiment indicates that it not only calculates formant frequencies correctly but also avoids a complicated formant tracking procedure.
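
As an illustration of the pitch target model mentioned in item 3, the following minimal sketch assumes the commonly cited form of the model, in which the surface F0 of a syllable approaches a linear target a*t + b exponentially: y(t) = beta * exp(-lam * t) + a*t + b. The abstract does not state the exact formulation used in the dissertation, so this form, the function names, the search range for lam, and the synthetic data are all assumptions made for illustration only.

```python
import numpy as np

def pitch_target_contour(t, a, b, beta, lam):
    """Surface F0 under the assumed model: an exponential approach
    (beta, lam) toward the linear pitch target a*t + b."""
    return beta * np.exp(-lam * t) + a * t + b

def fit_pitch_target(t, f0, lam_grid=None):
    """Fit (a, b, beta, lam) to the F0 samples of one syllable.

    For a fixed lam the model is linear in (a, b, beta), so those three
    are solved by least squares; lam is chosen by a simple grid search.
    """
    if lam_grid is None:
        # Hypothetical search range for the decay rate, in 1/s.
        lam_grid = np.linspace(1.0, 100.0, 100)
    best = None
    for lam in lam_grid:
        design = np.column_stack([t, np.ones_like(t), np.exp(-lam * t)])
        coef, *_ = np.linalg.lstsq(design, f0, rcond=None)
        err = float(np.sum((design @ coef - f0) ** 2))
        if best is None or err < best[0]:
            best = (err, float(coef[0]), float(coef[1]), float(coef[2]), float(lam))
    _, a, b, beta, lam = best
    return a, b, beta, lam

# Synthetic falling contour over a 200 ms syllable (made-up values).
t = np.linspace(0.0, 0.2, 41)
f0 = pitch_target_contour(t, a=-150.0, b=220.0, beta=30.0, lam=40.0)
a, b, beta, lam = fit_pitch_target(t, f0)
```

Under this assumed form, the slope a and the height b describe the trend and register of the contour. This is why a conversion that maps these parameters from source to target can change the shape of the contour rather than only its range. The mapping itself is not shown here.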