Research on High-Performance Statistical Language Models for Speech Recognition

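Title	Research on High-Performance Statistical Language Models for Speech Recognition (面向语音识别的高性能统计语言模型的研究)
Alternative title	High Performance Statistical Language Model for Speech Recognition
Author	Chen Langzhou (陈浪舟)
Date	1999-06-01
Degree type	Doctor of Engineering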
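Abstract (Chinese)	A statistical language model uses statistical methods to model natural language as an information source. The traditional n-gram model has been highly successful in speech recognition. As speech recognition technology has developed, however, the demand for high-performance statistical language models has grown, and high robustness and strong predictive power have become the main trends in language model research. This thesis studies how to improve the performance of traditional statistical language models while treating robustness and predictive power jointly, and proposes new modeling methods and ideas. Building on a study of recent progress and the main open problems in statistical language modeling, our work concentrates on the following three aspects.

The first aspect is solutions to the data sparseness problem in statistical language models. In practical applications, we must estimate the probabilities of events that do not occur in the training corpus. There are usually two approaches to data sparseness: smoothing and clustering. Traditional smoothing uses a lower-order model to smooth the unseen events of a model with stronger predictive power, which in effect sacrifices part of the predictive power in exchange for robustness. In contrast to the traditional approach of estimating unseen-event probabilities with a lower-order (n-1)-gram model, we propose a new smoothing method that estimates unseen-event probabilities from the notion of similar words, and we use singular value decomposition to find the similar words. With this method, smoothing is carried out within a model of the same order, which to some extent reconciles the model's predictive performance with its robustness. The other approach to data sparseness is to build class-based language models through clustering. We studied class-based language models in depth and propose a series of algorithms covering both clustering and modeling. First, we propose a fast and efficient hierarchical clustering algorithm based on the similarity between word classes; it is more than 10 times faster than the traditional greedy clustering algorithm based on the likelihood of the training corpus, and its clustering results better reflect the semantic classification system of Chinese. We also examined how to build class-based models and propose an algorithm for constructing vari-gram grammar trees based on node entropy and node confidence; the class-based vari-gram model it produces performs far better than the class-based n-gram grammar. A further attempt combines the word-based and class-based models through a tree-structured lexicon to produce a hybrid language model. This model can smooth data at any level of the lexicon tree, achieves better results than a single smoothing method, and brings different smoothing strategies into a common framework.

The second aspect of our work focuses on using domain information to improve model performance. First, we propose an improved MAP (maximum a posteriori) domain adaptation algorithm. We carefully analyzed the factors that affect the coupling strength between the domain-independent model and the adaptation corpus, and propose a MAP adaptation method based on fuzzy control, which converts these factors into inputs to a fuzzy controller, …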
Abstract (English)	A statistical language model models the natural language source by statistical methods. The traditional n-gram model has had great success in speech recognition. However, as speech recognition has developed, demand has grown for high-performance, highly robust language models. This thesis aims to improve language model performance and to explore new language modeling methods, considering both robustness and predictive ability. After carefully studying current progress and the major open problems, we identified three directions in which language model performance can be improved substantially.

The first is solving the data sparseness problem of language models. There are two ways to address data sparseness: smoothing and building class-based models. Traditional smoothing uses a less specific model to smooth the unseen events of the current model. Because the less specific model performs worse, relying on it too heavily reduces the predictive ability of the model. To address this, we present a new smoothing method that uses a truncated SVD of the word relation matrix to compute the similarity between two words. This method markedly reduces the proportion of the less specific model and reconciles the model's robustness with its predictive ability. Building class-based models is the other way to address data sparseness. We present a series of algorithms spanning word clustering and class-based model building. First, we present a fast, high-performance hierarchical word clustering method based on the similarity between word classes. This algorithm is more than 10 times faster than the traditional greedy method based on the likelihood of the training data, and its clustering results are much closer to the semantic system of Chinese. We also present a method for constructing the vari-gram tree based on the entropy and confidence of nodes; the resulting class-based vari-gram model performs much better than the class-based n-gram model. Another new idea is to build a mixture of class-based and word-based models according to a tree-structured lexicon. Data smoothing can then be performed at an arbitrary level of the lexicon tree, with smoothing performance much better than that of the traditional method, and different smoothing strategies fit into a common framework.

The second aspect of our work focuses on improving model performance using topic information. To introduce topic information into our models, we present an improved MAP method for language model adaptation. The traditional MAP method mixes the task-independent corpus and the task-dependent corpus using a fixed weight. In the new method, a fuzzy controller is introduced into the adaptation process, and the many factors that influence the adaptation process were considered as th
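Language	Chinese
Document type	Thesis
Item identifier	http://ir.ia.ac.cn/handle/173211/5699
Collection	Graduates_Doctoral Dissertations
Recommended citation (GB/T 7714)	Chen Langzhou. Research on High-Performance Statistical Language Models for Speech Recognition [D]. Institute of Automation, Chinese Academy of Sciences, 1999.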
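
Illustrative note: the abstract states that word similarity is computed from a truncated SVD of a word relation matrix, but this record does not give the construction of that matrix or the similarity measure. The sketch below is therefore only a minimal illustration under stated assumptions, not the thesis's algorithm: the toy count matrix, the cosine measure, and the names `similar_words`, `cooc`, `vocab` and `contexts` are all hypothetical.

```python
import numpy as np


def similar_words(cooc, vocab, rank=2, top_k=1):
    """Return the top_k most similar words for every word in vocab.

    cooc[i, j] is an assumed count of how often word i occurs with
    context j. Words are compared by cosine similarity in the space
    spanned by the `rank` largest singular directions.
    """
    # Thin SVD; singular values come back sorted in descending order.
    u, s, _ = np.linalg.svd(cooc, full_matrices=False)
    # Truncate to the leading `rank` dimensions and scale by singular values.
    vecs = u[:, :rank] * s[:rank]
    # Normalize rows so the dot product is the cosine similarity.
    norms = np.linalg.norm(vecs, axis=1, keepdims=True)
    unit = vecs / np.maximum(norms, 1e-12)
    sim = unit @ unit.T
    # A word should not be returned as its own neighbour.
    np.fill_diagonal(sim, -np.inf)
    order = np.argsort(-sim, axis=1)[:, :top_k]
    return {vocab[i]: [vocab[j] for j in order[i]] for i in range(len(vocab))}


# Hypothetical toy data: rows are words, columns are contexts.
vocab = ["cat", "dog", "car", "bus"]
contexts = ["eats", "sleeps", "drives", "stops"]
cooc = np.array(
    [
        [4.0, 3.0, 0.0, 0.0],  # cat
        [3.0, 4.0, 0.0, 0.0],  # dog
        [0.0, 0.0, 5.0, 4.0],  # car
        [0.0, 0.0, 4.0, 5.0],  # bus
    ]
)

neighbours = similar_words(cooc, vocab, rank=2, top_k=1)
print(neighbours)
```

In a smoothing scheme of the kind the abstract describes, the counts observed for such neighbours could stand in for an unseen event at the same model order, rather than backing off to a lower-order model; how the thesis actually combines them is not stated in this record.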