Research on the Key Techniques in Mandarin Concatenative TTS (基于波形拼接的汉语语音合成核心技术研究)

Title	基于波形拼接的汉语语音合成核心技术研究
Alternative Title	Research on the Key Techniques in Mandarin Concatenative TTS
Author	Kang Heng (康恒)
	2006-07-30
Degree Type	Doctor of Engineering
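Abstract (translated from Chinese)	A speech synthesis system consists of four main components: a text analysis module, a prosodic analysis module, an acoustic module, and a synthesis speech database. They are closely coupled and together serve a speech synthesis system with high intelligibility and high naturalness. This thesis studies three of the problems involved: first, how to design a speech synthesis corpus so that it covers the phonetic phenomena of natural continuous speech; second, how to perform prosodic analysis at synthesis time to determine the smallest unit of the prosodic hierarchy, so that the synthesized speech is more natural; and third, how to make the speech sound smoother, clearer, and more natural at the concatenation stage.
By implementing a TTS system based on a medium-scale speech database, this thesis presents our approaches and methods for the problems that arise in these parts of a TTS system. The main work and contributions are as follows.
A speech database was designed that covers, fairly comprehensively, the phonetic phenomena of continuous speech. The proposed corpus selection method jointly considers several factors, including triphone coverage rate, triphone coverage efficiency, triphone sparsity, and the distribution of commonly used words. Selection is fully automatic and makes full use of the raw corpus. The selected text reaches a coverage rate of 94.1%, which is 7.1% higher than that of the 863 speech corpus. Coverage efficiency, sparsity, and the distribution of commonly used words are also considerably improved over traditional methods.
A method for Chinese prosodic word segmentation based on CRF (conditional random fields) is proposed, followed by a TBL-based error-driven method that further refines the segmentation. Experiments show that the proposed method reaches 93.22% accuracy on an out-of-set test, about 9% higher than the traditional HMM-based method, and that recall is also considerably improved.
An effective spectral smoothing algorithm is proposed for smoothing speech concatenation boundaries. The algorithm combines LPC model parameters, which describe large-scale spectral information, with Sinusoidal model parameters, which are good at representing spectral detail. It largely overcomes the loss of speech quality seen in traditional LPC smoothing algorithms. After the algorithm is applied, the average ANBM measure over all concatenation boundaries in the test set drops by about 25%.
To allow STRAIGHT parameters to be interpolated effectively, a method based on selective-LPC is proposed to represent the STRAIGHT spectrum. The spectral envelope of a 2-band selective-LPC is fitted to the STRAIGHT spectrum, and the estimated parameters are then converted to their equivalent LSP parameters. Applied to spectral smoothing experiments, the method shows a clear advantage over other traditional methods: the ANBM measure drops by 27.8% on average over all test sets, and the concatenated speech is smoother, clearer, and more natural. Subjective listening tests also indicate that this method yields better speech quality.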
Abstract (English)	A typical speech synthesis system consists of four components: a text analysis module, a prosodic analysis module, an acoustic module, and a speech database. The four parts are closely related and together serve a high-quality, natural-sounding synthesis system. Our work focuses on three problems among them. The first is to construct a speech database that covers most phonetic phenomena in natural speech. The second is to determine the boundaries of prosodic words, which are at the lowest level of the prosodic hierarchy. The last is to smooth speech segment boundaries at the concatenation stage in order to make the synthetic speech sound more natural. We implemented a speech synthesis system based on a medium-scale speech database, and we present ideas and solutions for the problems above. The main contributions and novelties are as follows.
We design a speech database covering most phonetic phenomena in natural speech for our high-quality TTS. The selection method picks text automatically from a large corpus while considering multiple factors: triphone coverage rate, triphone coverage efficiency, triphone sparsity rate, the distribution of commonly used words, and others. The selected text covers 94.1% of triphones and 75.4% of the most commonly used words, and its coverage rate and sparsity rate both improve on those of conventional methods.
We propose a Chinese prosodic phrasing method based on a CRF model, which addresses problems found in the conventional HMM model. After CRF segmentation, we apply a TBL-based error-driven learning approach to refine the results. The experiments show that the proposed method performs much better than the HMM model.
To improve the quality of the smoothed speech, we propose a new spectral smoothing algorithm. The source LPC spectral envelopes are first interpolated to generate the smoothed target spectra. Then a sinusoidal + all-pole modification is performed on the source speech so that the spectra of the modified speech coincide with the target spectra. Experimental results show that this method yields a smooth spectral envelope even if the speech boundaries have a large spectral distance. Listening tests show that the algorithm is effective at avoiding quality degradation in the smoothed speech.
We propose a new method for representing the STRAIGHT spectrum that gives the spectral parameters the capability of interpolation and quantization, which is needed for most speech manipulation and especially for spectral smoothing. The proposed method estimates a 2-band selective-LPC whose spectral envelope fits the given STRAIGHT spectrum. Because LSP parameters interpolate well, the estimated selective-LPC can be converted to LSP and then simply interpolated. We apply this representation in our spectral smoothing experiments, and the results show that this method yields a smooth spectral envelope across segment boundaries. Listening tests show that the algorithm effectively smooths speech boundaries with little quality degradation.
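The automatic text selection summarized in the abstract can be illustrated with a small greedy coverage sketch. This is not the thesis's actual algorithm, which also weighs triphone sparsity and common-word distribution. It is a plain greedy set-cover heuristic that, at each step, picks the sentence adding the most unseen triphones per phone, as a rough stand-in for "coverage efficiency". The names `triphones` and `greedy_select`, the `sil` padding symbol, and the scoring rule are all assumptions made for illustration.

```python
from typing import Dict, List, Set, Tuple

Triphone = Tuple[str, str, str]


def triphones(phones: List[str]) -> Set[Triphone]:
    """Return the set of (left, centre, right) triphones of one sentence.

    The sentence is padded with a hypothetical silence symbol "sil" so that
    the first and last phones also get a triphone context.
    """
    padded = ["sil"] + phones + ["sil"]
    return {
        (padded[i - 1], padded[i], padded[i + 1])
        for i in range(1, len(padded) - 1)
    }


def greedy_select(
    corpus: Dict[str, List[str]], budget: int
) -> Tuple[List[str], float]:
    """Greedily choose up to `budget` sentence ids from `corpus`.

    Assumed scoring rule (not from the thesis): number of not-yet-covered
    triphones divided by sentence length in phones. Ties are broken by
    sentence id so the result is deterministic.
    Returns the chosen ids and the fraction of corpus triphones covered.
    """
    units = {sid: triphones(phones) for sid, phones in corpus.items()}
    universe: Set[Triphone] = set()
    for tri_set in units.values():
        universe |= tri_set

    covered: Set[Triphone] = set()
    chosen: List[str] = []
    remaining = set(units)

    while remaining and len(chosen) < budget:
        best = max(
            sorted(remaining),
            key=lambda sid: len(units[sid] - covered) / max(len(corpus[sid]), 1),
        )
        if not units[best] - covered:
            break  # no remaining sentence adds a new triphone
        chosen.append(best)
        covered |= units[best]
        remaining.remove(best)

    rate = len(covered) / len(universe) if universe else 0.0
    return chosen, rate
```

Calling `greedy_select(corpus, budget)` with a dictionary that maps sentence ids to phone lists returns the selected ids in pick order together with the triphone coverage rate over that corpus. A fuller selector in the spirit of the abstract would add terms for rare triphones and commonly used words to the score.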
Keywords	Speech Synthesis (TTS); Speech Corpus; Prosodic Words; Spectral Smoothing; Speech Signal Representation; STRAIGHT
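Language	Chinese
Document Type	Thesis
Identifier	http://ir.ia.ac.cn/handle/173211/5953
Collection	Graduates_Doctoral Dissertations
Recommended Citation (GB/T 7714)	Kang Heng (康恒). 基于波形拼接的汉语语音合成核心技术研究 [Research on the Key Techniques in Mandarin Concatenative TTS][D]. Institute of Automation, Chinese Academy of Sciences. Graduate University of Chinese Academy of Sciences, 2006.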

Files in This Item
File Name/Size	Document Type	Version Type	Access Type	License
CASIA_20041801469000 (1412KB)			Not yet open access	CC BY-NC-SA