The language model is a crucial component of any speech recognition system. As speech recognition research moves toward broader application domains and more complex tasks, the data sparseness problem becomes more severe. During the five years of my Ph.D. study, I investigated key technologies for data-clustering-based language modeling and language model adaptation. The main research work focused on the following three aspects.

First, I proposed a hierarchical class language model based on Mod-KN smoothing. This model always favors longer contexts; for unseen events, it backs off according to a hierarchical word class tree. It thus combines the power of word n-grams for frequent events with the predictive power of class n-grams for unseen or rare events. The Mod-KN-smoothed hierarchical class language model outperforms its Good-Turing-smoothed counterpart on both frequent and unseen events.

Second, I proposed a shared backoff scheme for random forest language models and applied random forest language models to language identification. Random forest language models use randomness to mitigate the greedy node splitting of decision trees, and the shared backoff method improves model robustness while preserving the randomness of the individual trees. Language identification experiments showed that random forest language models significantly outperform n-gram and binary decision tree language models. Furthermore, for speech recognition tasks we combined the random forest language model with the Mod-KN-based hierarchical class language model and obtained further improvements in both perplexity and recognition accuracy.

Third, broadcast news recognition has become a focal area of speech recognition research.
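The class-based backoff described above can be illustrated with a minimal sketch. This is not the thesis implementation: the toy bigram tables, class assignments, and probability values below are invented for illustration, and real Mod-KN smoothing with a multi-level class tree is far more involved.

```python
# Illustrative sketch, not the thesis implementation: use the word n-gram
# for seen events and back off to a class n-gram for unseen ones.
# All tables below are toy assumptions.
word_bigram = {("stock", "prices"): 0.4}            # P(w2 | w1), seen events
word_to_class = {"stock": "FINANCE", "prices": "FINANCE",
                 "share": "FINANCE", "values": "FINANCE"}
class_bigram = {("FINANCE", "FINANCE"): 0.3}        # P(c2 | c1)
word_given_class = {"prices": 0.2, "values": 0.1}   # P(w | c(w))

def prob(w1, w2):
    """Word-level probability if the bigram was seen; otherwise back off
    to the class level: P(c(w2) | c(w1)) * P(w2 | c(w2))."""
    if (w1, w2) in word_bigram:
        return word_bigram[(w1, w2)]
    c1, c2 = word_to_class[w1], word_to_class[w2]
    return class_bigram.get((c1, c2), 0.0) * word_given_class.get(w2, 0.0)

print(prob("stock", "prices"))  # seen event: word bigram probability
print(prob("share", "values"))  # unseen event: class-level backoff
```

In the hierarchical model, a failed lookup would continue backing off up the class tree rather than stopping at a single class level as this flat sketch does.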
This thesis presents a unified language model adaptation framework for broadcast news recognition that combines a non-iterative new-word extraction approach, a novel open-vocabulary Chinese language model, a perplexity-based corpus selection approach, and an n-gram distribution adaptation module. In our experiments, this framework achieved a 10% relative error reduction.

Finally, I proposed a new template-based method for correcting recognition errors. The method requires no hard error-detection decisions, so it avoids the mistakes a detection module would introduce. It segments the speech recognition output into small parts, which are easier to correct, and it uses edit distance and acoustic confusion scores to select among templates, which improves the robustness of the correction results. Experiments showed that recognition accuracy improved on both a well-covered test set and a normally covered test set.
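The edit-distance-based template selection can be sketched as follows. This is a simplified illustration under assumptions not in the abstract: it matches at the character level, ignores the acoustic confusion scores that the actual method combines with edit distance, and uses invented example templates.

```python
def edit_distance(a, b):
    """Standard Levenshtein distance via dynamic programming."""
    m, n = len(a), len(b)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i                                # delete all of a[:i]
    for j in range(n + 1):
        d[0][j] = j                                # insert all of b[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[m][n]

def select_template(segment, templates):
    """Pick the template closest to a recognized segment; the real method
    would also weigh acoustic confusion scores here."""
    return min(templates, key=lambda t: edit_distance(segment, t))

templates = ["good morning everyone", "good evening everyone"]
print(select_template("good mornin everyone", templates))
```

Because each small segment is matched independently, a misrecognized word in one segment cannot derail the correction of its neighbors, which is part of what makes the soft, detection-free approach robust.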