Study on Text-Independent Voice Conversion (文本无关的语音转换方法研究)

Title	文本无关的语音转换方法研究
Alternative title	Study on Text-Independent Voice Conversion
Author	张蒙 (Zhang Meng)
Date	2010-05-25
Degree type	Doctor of Engineering
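Chinese abstract (translated)	Natural human-computer interaction has long been a focus of attention. Voice conversion aimed at personalized speech is an important part of it: it processes one person's voice so that it becomes the voice of another person, and its results matter for personalized speech generation, human-machine dialogue, and related areas. Existing voice conversion techniques, however, generally must be trained on parallel corpora, and this demanding requirement on training data is a major factor limiting practical use. Text-independent techniques that use non-parallel corpora are the main approach to solving this problem. This thesis studies text-independent voice conversion. By analyzing the differences in speaker-specific information in the speech parameters of different speakers, and taking into account the particular technical requirements of text-independent voice conversion, it establishes the idea of using phoneme constraints as auxiliary information. For the central difficulty, the alignment of training data under non-parallel corpora, it proposes two supervised alignment methods guided by the phonemic content of the speech: a linear alignment method based on phoneme clustering and a self-organizing nonlinear alignment method. The phoneme-clustering-based linear alignment method first clusters the source and target training parameters by phoneme. Phonemes common to the source and target speech serve as mapping anchors and are forcibly aligned; this forced alignment is then generalized to the whole parameter space by linear superposition. The superposition takes several common phonemes into account at once, which makes the overall mapping smooth and stable and thus achieves data alignment under non-parallel corpora. To further improve the alignment, a nonlinear alignment method based on self-organizing mapping is proposed, with the linear alignment serving as the initial state of the iteration. The method builds the topological structure of the source parameter set and then, through iterative self-organizing mapping, seeks a balance between a stable mapping topology and agreement between the source and target parameter distributions, while also enforcing the accuracy constraint given by the phoneme information. For cross-lingual conversion, a manifold learning method is proposed to supplement the nonlinear method so that it remains applicable when phonemes are "mismatched". For prosody conversion, a tone-based transformation method is proposed and illustrated with conversion between Mandarin Chinese and a dialect. By analyzing the speaker's fundamental-frequency (F0) prosody information in the speech signal and applying the CART classification and regression algorithm, the F0 differences between Mandarin and the dialect are modeled. Acoustic means are used to change the speaking style of the source speaker while keeping the speech content and background information unchanged, so that the converted speech sounds dialect-accented. Experimental results show that the proposed methods improve both the accuracy and the stability of data alignment, and subjective experiments on conversion systems using them also show improvements in voice quality and similarity. Both methods apply to alignment in intra-lingual as well as cross-lingual voice conversion.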
English abstract	Natural human-computer interaction (HCI) has long received wide attention. As an important aspect of it, voice conversion, which transforms the voice of one speaker so that it is perceived as the voice of another speaker, is significant for speech individualization in HCI. Most methods proposed in the existing literature assume the availability of parallel training sentences, that is, speech data with the same content from the source and target speakers, and are referred to as text-dependent voice conversion. The requirement of a parallel database is inconvenient and sometimes hard to fulfill, which restricts practical applications. Text-independent voice conversion is a main path to solving this problem. This thesis focuses on the study of text-independent voice conversion. Voice conversion can be separated into a training stage and a conversion stage, and the training stage in turn consists of data alignment and model training. The main problem of text-independent voice conversion is data alignment with a nonparallel training database. The thesis proposes new supervised data alignment methods for text-independent voice conversion that use phonetic information as a restriction during alignment: phoneme-cluster-based linear alignment and self-organizing iterative learning. In phoneme-cluster-based linear alignment, a mapping between the source and target parameter spaces is established using weighted linear alignment based on common phonetic clusters. The phoneme clusters common to the source and target speech are regarded as anchors for the mapping from the source speaker onto the parameter space of the target speaker, and several of the nearest phonetic clusters to each vector are taken into account simultaneously to ensure mapping continuity. Furthermore, to fine-tune and improve the alignment, a nonlinear data alignment that uses a self-organizing iterative learning algorithm is proposed. The result of the linear alignment is used as the initialization of the iterative learning. The algorithm establishes an optimal balance between the phonetic restriction and preservation of the topology, and it thus maintains alignment accuracy and stability. Because the nonlinear alignment results are self-organized, the underlying internal structures of the source and target spaces can be associated. As an extension of the algorithm to cross-lingual voice conversion, a manifold expansion algorithm is used in the nonlinear data...
Keywords	Text-independent Voice Conversion; Data Alignment; Supervisory Phonetic Restriction; Self-organized Learning
Language	Chinese
Document type	Thesis
Item identifier	http://ir.ia.ac.cn/handle/173211/6243
Collection	Graduates_Doctoral Dissertations
Recommended citation (GB/T 7714)	张蒙. 文本无关的语音转换方法研究[D]. 中国科学院自动化研究所. 中国科学院研究生院,2010.
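
The phoneme-cluster-based linear alignment summarized in the abstract can be illustrated with a minimal sketch. It is not the thesis's implementation: the function name `linear_align`, the inverse-distance weighting, the Euclidean distance, and the default `k=3` are all assumptions chosen for illustration. The abstract only states that common phoneme clusters act as anchors and that several of the nearest clusters are combined with weights to keep the mapping continuous.

```python
import numpy as np

def linear_align(x, src_centroids, tgt_centroids, k=3, eps=1e-8):
    """Shift source vectors toward the target space using the k nearest
    common-phoneme anchors.

    Assumption: row i of src_centroids and tgt_centroids is the centroid
    of the same phoneme for the source and target speaker. The
    inverse-distance weighting is a hypothetical choice, not taken from
    the thesis.
    """
    x = np.atleast_2d(np.asarray(x, dtype=float))
    src = np.asarray(src_centroids, dtype=float)
    tgt = np.asarray(tgt_centroids, dtype=float)
    k = min(k, len(src))
    # Pairwise Euclidean distances, shape (n_vectors, n_anchors).
    dist = np.linalg.norm(x[:, None, :] - src[None, :, :], axis=2)
    # Indices and distances of the k nearest anchors for each vector.
    nearest = np.argsort(dist, axis=1)[:, :k]
    near_dist = np.take_along_axis(dist, nearest, axis=1)
    # Inverse-distance weights, normalized to sum to 1 per vector.
    w = 1.0 / (near_dist + eps)
    w /= w.sum(axis=1, keepdims=True)
    # Each anchor contributes its own source-to-target offset.
    offsets = (tgt - src)[nearest]  # shape (n_vectors, k, dim)
    return x + np.einsum("nk,nkd->nd", w, offsets)
```

Under these assumptions, a vector lying on a source centroid is moved almost exactly onto the matching target centroid, and vectors between centroids receive a blend of the neighboring offsets, which is one simple way to obtain the smooth mapping the abstract describes.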

Files in this item
File name/size	Document type	Version type	Access type	License
CASIA_20071801462807 (1553KB)			Restricted access	CC BY-NC-SA