A typical speech synthesis system consists of four components: a text analysis module, a prosodic analysis module, an acoustic module, and a speech database. These components work closely together to produce high-quality, natural-sounding synthetic speech. Our work focuses on three problems. The first is to construct a speech database that covers most of the phonetic phenomena in natural speech. The second is to determine the boundaries of prosodic words, which form the lowest level of the prosodic hierarchy. The third is to smooth speech segment boundaries at the concatenation stage so that the synthetic speech sounds more natural. We implement a speech synthesis system based on a medium-scale speech database and propose solutions to the problems above. The main contributions are as follows.

First, we design a speech database covering most phonetic phenomena in natural speech for our high-quality text-to-speech (TTS) system. The selection method automatically selects text from a large corpus, taking multiple factors into account: triphone coverage rate, triphone coverage efficiency, triphone sparse rate, and the distribution of commonly used words, among others. The selected text covers 94.1% of triphones and 75.4% of the most commonly used words, and both the coverage rate and the sparse rate are better than those of conventional methods.

Second, we propose a Chinese prosodic phrasing method based on a conditional random field (CRF) model, which addresses problems of the conventional hidden Markov model (HMM) approach. After CRF segmentation, we apply transformation-based learning (TBL), an error-driven learning approach, to refine the results. Experiments show that the proposed method performs much better than the HMM model.

Third, to improve the quality of smoothed speech, we propose a new spectral smoothing algorithm. The source LPC spectral envelopes are first interpolated to generate the smoothed target spectra. A sinusoidal plus all-pole modification is then applied to the source speech so that the spectra of the modified speech coincide with the target spectra. Experimental results show that this method yields a smooth spectral envelope even when the spectral distance across the segment boundary is large. Listening tests show that the algorithm is effective at avoiding quality degradation in the smoothed speech.

Fourth, we propose a new representation of the STRAIGHT spectrum that gives the spectral parameters the interpolation and quantization capabilities needed for most speech manipulation, and for spectral smoothing in particular. The proposed method estimates a 2-band selective LPC model whose spectral envelope fits the given STRAIGHT spectrum. Because line spectral pairs (LSP) interpolate well, the estimated selective LPC coefficients can be converted to LSP and then simply interpolated. We apply this representation in our spectral smoothing experiments, and the results show that it yields a smooth spectral envelope across segment boundaries. Listening tests show that the algorithm smooths speech boundaries effectively with little quality degradation.
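
The automatic text selection in the first contribution can be illustrated with a minimal sketch. The abstract does not specify the selection algorithm, so the greedy strategy below is an assumption, as are the function names `triphones` and `greedy_select` and the scoring rule (newly covered triphones per phone, a stand-in for "coverage efficiency"). The sparse rate and the word-frequency factor are not modeled here.

```python
from typing import List, Sequence, Set, Tuple

Triphone = Tuple[str, ...]


def triphones(phones: Sequence[str]) -> Set[Triphone]:
    """Return the distinct triphones (3-phone windows) in one phone sequence."""
    return {tuple(phones[i:i + 3]) for i in range(len(phones) - 2)}


def greedy_select(corpus: List[Sequence[str]],
                  max_sentences: int) -> Tuple[List[int], float]:
    """Greedily pick sentences that add the most new triphones per phone.

    Hypothetical illustration only: returns the chosen sentence indices and
    the fraction of the corpus's distinct triphones that they cover.
    """
    tri_sets = [triphones(s) for s in corpus]
    all_tri: Set[Triphone] = set().union(*tri_sets)
    covered: Set[Triphone] = set()
    chosen: List[int] = []
    remaining = set(range(len(corpus)))

    while remaining and len(chosen) < max_sentences:
        # Assumed efficiency score: new triphones gained per phone of text.
        def efficiency(i: int) -> float:
            return len(tri_sets[i] - covered) / max(len(corpus[i]), 1)

        # Sorting makes tie-breaking deterministic (lowest index wins).
        best = max(sorted(remaining), key=efficiency)
        if not tri_sets[best] - covered:
            break  # no remaining sentence adds a new triphone
        chosen.append(best)
        covered |= tri_sets[best]
        remaining.discard(best)

    rate = len(covered) / len(all_tri) if all_tri else 0.0
    return chosen, rate


if __name__ == "__main__":
    toy_corpus = [["a", "b", "c", "d"], ["a", "b", "c"], ["x", "y", "z"]]
    print(greedy_select(toy_corpus, max_sentences=2))
```

In this sketch each sentence is a list of phone labels. A real selection procedure would also weigh the other factors named above, which would change the scoring function but not the overall greedy loop.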