This paper introduces a multilingual spoken term detection (STD) system and evaluates it on the CallHome and CallFriend multilingual databases published by the Linguistic Data Consortium. Seven languages, namely Arabic, English, German, Japanese, Korean, Mandarin Chinese, and Spanish, are considered in the multilingual acoustic modeling task. Much of the work focuses on comparing multilingual acoustic models estimated by two different methods: the conventional global phoneme set (GPS) based GMM method and the recently proposed subspace Gaussian mixture model (SGMM) method. The experimental results show that the resulting multilingual STD system supports all seven languages simultaneously without any adaptation, and substantial performance gains are observed for the multilingual system over the monolingual systems.

A language-independent STD method is also discussed for languages and dialects that severely lack training resources. The sparseness of training data in the target language is addressed by using the available training resources of existing languages to estimate a multilingual phoneme classifier, a multi-layer perceptron (MLP). Both keyword audio samples and test utterances are represented as phonetic posteriorgrams, i.e., time sequences of phonetic posterior vectors, generated by the multilingual MLP. A modified dynamic time warping (DTW) algorithm with a "sliding matching window" and a loose path extension constraint is applied to find the warping path between a keyword sample and a test utterance with minimum matching cost.

Currently, the majority of state-of-the-art speech recognition systems depend on large amounts of transcribed speech data to robustly estimate acoustic models. The acquisition of such large training resources, however, is a challenging task.
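The posteriorgram matching described above can be sketched as follows. This is a minimal illustration only: it uses a negative-log inner product as the distance between posterior vectors and a subsequence DTW whose free start and end points on the utterance axis stand in for the sliding matching window; the exact loose path extension constraint of the modified algorithm is not reproduced, and all names here are illustrative.

```python
import numpy as np

def posterior_distance(p, q):
    # negative log inner product, a common distance between posterior vectors
    return -np.log(max(float(np.dot(p, q)), 1e-12))

def subsequence_dtw(keyword, utterance):
    """Align the keyword posteriorgram (K x D) against any contiguous region
    of the utterance posteriorgram (T x D).  Free start and end points on the
    utterance axis approximate a sliding matching window."""
    K, T = len(keyword), len(utterance)
    D = np.full((K + 1, T + 1), np.inf)
    D[0, :] = 0.0  # the match may start at any utterance frame
    for i in range(1, K + 1):
        for j in range(1, T + 1):
            c = posterior_distance(keyword[i - 1], utterance[j - 1])
            D[i, j] = c + min(D[i - 1, j - 1],  # diagonal step
                              D[i - 1, j],      # advance keyword only
                              D[i, j - 1])      # advance utterance only
    end = int(np.argmin(D[K, 1:])) + 1          # best end frame (1-based)
    return D[K, end] / K, end                   # length-normalised cost
```

A keyword would then be declared detected wherever the length-normalised matching cost falls below a threshold tuned on development data.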
In particular, the transcription of audio data typically requires expensive manual labor by language experts and is very time-consuming. For an under-resourced language or dialect, the collection of large training data is a major bottleneck in developing an LVCSR system. Unsupervised learning has therefore been gaining popularity as a way to greatly reduce human effort. This is the first attempt to apply an unsupervised learning method to the subspace acoustic model. The key features presented in this paper include an augmented UBM obtained by bootstrapping and enhanced lattice-based utterance-level confide...
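Since the description of the bootstrapped UBM is truncated here, the sketch below is only an assumption-laden stand-in for the generic idea: a diagonal-covariance GMM trained with EM on a bootstrap resample of pooled feature frames. The function `train_ubm`, the farthest-point initialisation, and all parameter values are illustrative choices, not the paper's method.

```python
import numpy as np

def train_ubm(frames, n_comp=4, n_iter=20, seed=0):
    """Diagonal-covariance GMM trained with EM on a bootstrap resample of
    pooled feature frames, standing in for a UBM (illustrative only)."""
    rng = np.random.default_rng(seed)
    # bootstrap resample of the pooled frames (illustrative "augmentation")
    boot = frames[rng.integers(0, len(frames), size=len(frames))]
    N, D = boot.shape
    # farthest-point initialisation of the component means
    means = boot[[0]]
    for _ in range(n_comp - 1):
        d2 = ((boot[:, None, :] - means[None]) ** 2).sum(-1).min(axis=1)
        means = np.vstack([means, boot[np.argmax(d2)]])
    var = np.tile(boot.var(axis=0) + 1e-3, (n_comp, 1))
    w = np.full(n_comp, 1.0 / n_comp)
    for _ in range(n_iter):
        # E-step: responsibilities of each diagonal Gaussian for each frame
        logp = (-0.5 * (((boot[:, None, :] - means) ** 2) / var
                        + np.log(2 * np.pi * var)).sum(-1) + np.log(w))
        logp -= logp.max(axis=1, keepdims=True)
        r = np.exp(logp)
        r /= r.sum(axis=1, keepdims=True)
        # M-step: re-estimate weights, means, and variances
        nk = r.sum(axis=0) + 1e-10
        w = nk / N
        means = (r.T @ boot) / nk[:, None]
        var = (r.T @ boot ** 2) / nk[:, None] - means ** 2 + 1e-6
    return w, means, var
```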