Research on High-Performance Statistical Language Models for Speech Recognition
Alternative title: High Performance Statistical Language Model for Speech Recognition
陈浪舟
1999-06-01
Degree type: Doctor of Engineering
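Chinese abstract: A statistical language model uses statistical methods to model natural language as an information source. The traditional n-gram model has achieved great success in speech recognition. However, as speech recognition technology has developed, the demand for high-performance statistical language models has grown. High robustness and strong predictive power have become the trend in language model research.

This thesis focuses on how to improve the performance of traditional statistical language models while treating robustness and predictive power together, and it proposes new modeling methods and ideas. Building on a study of the latest progress and the main problems of statistical language models, our work concentrates on the following three aspects.

▲ The first aspect is solutions to the data sparseness problem of statistical language models. In practical applications, we must estimate the probability of events that never appear in the training corpus. There are usually two approaches to data sparseness: smoothing and clustering. Traditional smoothing uses a lower-order model to smooth the unseen events of a model with stronger predictive power, which in effect sacrifices part of the predictive power in exchange for robustness. Unlike the traditional approach of estimating unseen-event probabilities with a lower-order (n-1)-gram model, we propose a new smoothing method that estimates unseen-event probabilities based on the notion of similar words, and we apply singular value decomposition to find the similar words. With this method, smoothing is carried out within a model of the same order, which to some extent unifies the model's predictive performance and robustness.

The other approach to data sparseness is to build class-based language models through clustering. We studied class-based language models in depth and propose a series of algorithms covering both clustering and modeling. First, we propose a fast and efficient hierarchical clustering algorithm based on the similarity between word classes. It is more than 10 times faster than the traditional greedy clustering algorithm based on the likelihood of the training corpus, and its clustering results better reflect the semantic classification system of Chinese. We also examine how to build class-based models and propose a vari-gram grammar tree construction algorithm based on node entropy and node confidence. The class-based vari-gram model it produces performs far better than the class-based n-gram grammar. In a further attempt, we use a tree-structured lexicon to combine the word-based model and the class-based model into a mixed language model. This model can perform smoothing at any level of the tree-structured lexicon, achieves better results than a single smoothing method, and brings different smoothing strategies into a common framework.

▲ The second aspect of our work focuses on using domain information to improve model performance. First, we propose an improved MAP (maximum a posteriori) domain adaptation algorithm. We carefully analyzed the factors that affect the coupling strength between the domain-independent model and the adaptation corpus, and we propose a MAP adaptation method based on fuzzy control, which converts these factors into the inputs of a fuzzy controller,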
English abstract: A statistical language model simulates the natural language source using statistical methods. The traditional n-gram model has had great success in speech recognition. However, as speech recognition has developed, demand has grown for high-performance, highly robust language models. This thesis aims to improve language model performance and to explore new language modeling methods with respect to both robustness and predictive ability. After carefully studying current progress and the major open problems, we identified three directions in which language model performance can be improved substantially. The first is the data sparseness problem of language models. There are two ways to address it: smoothing and building class-based models. Traditional smoothing uses a less specific model to smooth the unseen events of the current model. Because the less specific model performs worse, relying on it too heavily reduces the model's predictive ability. To address this, we present a new smoothing method that uses a truncated SVD of the word relation matrix to compute the similarity between two words. This method markedly reduces the share of the less specific model and unifies the model's robustness and predictive ability. Building class-based models is the other way to address data sparseness. We present a series of algorithms, from word clustering to class-based model building. First, we present a fast, high-performance word clustering method based on the similarity between word classes and a hierarchical structure. This algorithm is more than 10 times faster than the traditional greedy method based on the likelihood of the training data, and its clustering results are much closer to the semantic system of Chinese.
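The word-similarity step mentioned above can be illustrated with a small sketch. The abstract does not say how the word relation matrix is built or which similarity measure is used, so the toy co-occurrence counts, the cosine measure, and the function names `truncated_svd_embeddings` and `cosine_similarity` are all assumptions made for illustration, not the thesis's actual procedure.

```python
import numpy as np


def truncated_svd_embeddings(matrix, rank):
    """Return row embeddings U_k * S_k from a rank-`rank` truncated SVD."""
    u, s, _vt = np.linalg.svd(matrix, full_matrices=False)
    return u[:, :rank] * s[:rank]


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (0.0 if either is zero)."""
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom > 0 else 0.0


# Hypothetical toy data: each row is a word, each column counts how often
# some context word follows it. Real data would come from a training corpus.
vocab = ["cat", "dog", "car"]
relation = np.array([
    [4.0, 3.0, 0.0, 0.0],  # cat
    [3.0, 4.0, 0.0, 1.0],  # dog
    [0.0, 0.0, 5.0, 4.0],  # car
])

embeddings = truncated_svd_embeddings(relation, rank=2)
sim_cat_dog = cosine_similarity(embeddings[0], embeddings[1])
sim_cat_car = cosine_similarity(embeddings[0], embeddings[2])
print(f"cat~dog: {sim_cat_dog:.3f}, cat~car: {sim_cat_car:.3f}")
```

In this toy setting, words with similar contexts ("cat" and "dog") end up close in the reduced space, so statistics observed for one could in principle inform estimates for unseen events involving the other without falling back to a lower-order model.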
We also present a method for constructing the vari-gram tree based on the entropy and confidence of nodes. The resulting class-based vari-gram model performs much better than the class-based n-gram model. Another new idea is to build mixture models that combine class-based and word-based models according to a tree-structured lexicon. Smoothing can then be carried out at an arbitrary level of the lexicon tree, with much better smoothing performance than the traditional method, and different smoothing strategies fit into a common framework. The second aspect of our work focuses on improving model performance using topic information. To introduce topic information into our models, we present an improved MAP method for language model adaptation. The traditional MAP method mixes the task-independent corpus and the task-dependent corpus using a fixed weight. In the new method, a fuzzy controller is introduced into the adaptation process, and the many factors that influence the adaptation process are considered as th
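The fixed-weight mixing that the abstract attributes to traditional MAP adaptation can be sketched as count merging. This is only one common formulation, chosen here as an assumption. The function name `map_adapt`, the toy corpora, and the weight value are hypothetical, and the fuzzy-controller variant proposed in the thesis is not reproduced because the abstract is cut off before describing it.

```python
from collections import Counter


def map_adapt(general_counts, domain_counts, weight):
    """Merge counts as c_general + weight * c_domain, then normalize.

    `weight` is the fixed mixing weight; larger values trust the
    task-dependent (domain) corpus more.
    """
    vocab = set(general_counts) | set(domain_counts)
    merged = {
        w: general_counts.get(w, 0) + weight * domain_counts.get(w, 0)
        for w in vocab
    }
    total = sum(merged.values())
    return {w: c / total for w, c in merged.items()}


# Hypothetical toy corpora (unigram counts, for brevity).
general = Counter("the market rose the team won the game".split())
domain = Counter("the market fell the market rose".split())

adapted = map_adapt(general, domain, weight=3.0)
print(f"P(market) general: {general['market'] / sum(general.values()):.3f}")
print(f"P(market) adapted: {adapted['market']:.3f}")
```

With a single fixed weight, every word is pulled toward the domain corpus by the same factor regardless of how reliable the domain evidence is, which is the limitation the abstract's adaptive method is meant to address.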
Language: Chinese
Document type: Thesis
Item identifier: http://ir.ia.ac.cn/handle/173211/5699
Collection: Graduates_Doctoral Theses
Recommended citation
GB/T 7714
陈浪舟. Research on High-Performance Statistical Language Models for Speech Recognition [D]. Institute of Automation, Chinese Academy of Sciences, 1999.