CASIA OpenIR > Graduates > Master's Theses
Thesis Advisor: 胡包钢 (Hu Baogang); 徐波 (Xu Bo)
Degree Grantor: 中国科学院自动化研究所 (Institute of Automation, Chinese Academy of Sciences)
Place of Conferral: 中国科学院自动化研究所 (Institute of Automation, Chinese Academy of Sciences)
Degree Discipline: Pattern Recognition and Intelligent Systems (模式识别与智能系统)
Keywords: Language Model (语言模型); Speech Recognition (语音识别); Word Segmentation (分词); N-gram (N元文法); Cache-based Adaptation (Cache自适应); Lexicon Adaptation (词典自适应)
Abstract: Language processing and language modeling are essential components of speech recognition; a flexible, efficient language model is crucial for building a high-performance speech recognition system. In the course of the National 863 High-Tech R&D Program project "Chinese Continuous Speech Dictation Machine" (汉语连续语音听写机), the author studied language models and adaptation methods for Chinese speech recognition.

The author first built a 39,925-word lexicon and designed an adaptive word segmentation method combining human interaction with machine learning, achieving a segmentation accuracy of 99.89%. On this basis, word-based N-gram statistical models (N = 1, 2, 3) were trained on an 80-million-character corpus. Applied in a large-vocabulary Chinese dictation machine, these models markedly improved recognition results.

The thesis also discusses the data-sparseness problem of statistical language models and implements a backoff data smoothing method with discounting parameters, which reduced the pinyin-to-character conversion error rate of the original language model by 16.95%. In addition, the role of the trigram model in Chinese speech recognition was investigated: statistical experiments on its limited contribution to linguistic decoding were carried out, and a linguistic explanation was given.

After surveying language model adaptation methods, the author studied and implemented a supervised cache-based language model adaptation method. By learning linguistic knowledge online and revising word occurrence probabilities, this method reduces the pinyin-to-character conversion error rate by 20%–40%.

Given the limited vocabulary used within a specific domain, the thesis further introduces a lexicon and language model adaptation method based on domain keywords, together with automatic new-word detection based on word connection probabilities. This adaptation shrinks the lexicon by more than 60%, the model's probability parameters by 50%, and the running time by more than 50%, while reducing the pinyin-to-character conversion error rate by about 37%.

These results are applicable not only to speech recognition post-processing; as important topics in Chinese information processing, they will also play a significant role in character recognition, machine translation, and Chinese character input.
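The abstract reports a backoff smoothing method with discounting parameters. As a minimal illustrative sketch only (the thesis's exact formulation is not given here), the following Python bigram model uses absolute discounting: each seen bigram count is discounted by a constant, and the freed probability mass is redistributed over unseen successors in proportion to their unigram probabilities. All function and variable names are ours.

```python
from collections import Counter

def train_backoff_bigram(tokens, discount=0.5):
    """Bigram model with absolute discounting and backoff to unigrams.

    Returns prob(w, prev) = P(w | prev). Seen bigrams are discounted by
    `discount`; the leftover mass backs off to a renormalized unigram model.
    """
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    total = len(tokens)

    def prob(w, prev):
        c_bi = bigrams[(prev, w)]
        if c_bi > 0:
            # Discounted maximum-likelihood bigram estimate.
            return (c_bi - discount) / unigrams[prev]
        # Successors of `prev` observed in training.
        seen = [v for (p, v) in bigrams if p == prev]
        # Mass freed by discounting all seen bigrams of `prev`.
        alpha = discount * len(seen) / unigrams[prev] if unigrams[prev] else 1.0
        # Renormalize unigrams over the *unseen* successors only.
        backoff_mass = 1.0 - sum(unigrams[v] / total for v in seen)
        return alpha * (unigrams[w] / total) / backoff_mass if backoff_mass > 0 else 0.0

    return prob
```

With this scheme the conditional probabilities for a given history sum to one: the discounted estimates cover the seen successors, and `alpha` exactly covers the rest.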
Other Abstract: A language model captures and exploits the regular patterns of natural language. It plays an increasingly important role in speech recognition, machine translation, and other natural language processing applications. As part of the Speaker-Independent Chinese Continuous Speech Dictation Machine project, a National High-Tech R&D Program project, this research addresses the language model and its adaptation for Chinese speech recognition.

Before constructing the N-gram (N = 1, 2, 3) statistical language model, an adaptive word segmentation algorithm based on interactive machine learning was designed, achieving a word segmentation accuracy of 99.89%. Trained on 80 million characters of text from People's Daily, this language model greatly improves the performance of our large-vocabulary continuous speech recognition system.

The data smoothing method — backoff with discounting parameters — is analyzed in this thesis; it reduces the pinyin (transcription)-to-character conversion error rate by 16.95%. The role of the trigram in Chinese language modeling is also investigated, and a linguistic explanation is proposed after statistical analysis.

On the basis of a detailed survey and comparison of adaptive language models, a supervised cache-based language model adaptation method is introduced. By learning the content and style of test articles online, this method reduces the pinyin-to-character conversion error rate by 20–40%.

Finally, a lexicon and language model adaptation method based on domain keywords is proposed. By "forgetting" irrelevant words and automatically learning out-of-vocabulary domain keywords, it reduces the lexicon size by 60%, the number of probability parameters by 50%, and the running time by 50%. When combined with the supervised cache-based adaptation, it reduces the pinyin-to-character conversion error rate by 37%.
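The cache-based adaptation described above learns the content and style of the current text online. A common way to realize this idea — sketched here under our own assumptions, not as the thesis's implementation — is to linearly interpolate a static model with a unigram distribution estimated from a bounded cache of recently observed words. All class and parameter names below are illustrative.

```python
from collections import Counter, deque

class CacheUnigramLM:
    """Static unigram model interpolated with a cache of recent words.

    P(w) = (1 - lam) * P_static(w) + lam * P_cache(w), where P_cache is the
    relative frequency of w in the last `cache_size` observed words.
    """

    def __init__(self, static_probs, cache_size=200, lam=0.2):
        self.static = static_probs            # dict: word -> static probability
        self.cache = deque(maxlen=cache_size)  # sliding window of recent words
        self.lam = lam                         # interpolation weight for cache

    def observe(self, word):
        """Record a word from the running text (online supervision)."""
        self.cache.append(word)

    def prob(self, word):
        p_static = self.static.get(word, 0.0)
        if not self.cache:
            return p_static
        p_cache = Counter(self.cache)[word] / len(self.cache)
        return (1 - self.lam) * p_static + self.lam * p_cache
```

Words that recur in the current article gain probability relative to the static model, which is the effect the abstract attributes to the cache component; the bounded deque also gives a crude form of forgetting as the topic drifts.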
Other Identifier: 523
Document Type: Thesis (学位论文)
Recommended Citation
GB/T 7714
黄非. 面向中文语音识别的自适应语言模型研究 (Research on Adaptive Language Models for Chinese Speech Recognition) [D]. 中国科学院自动化研究所, 1999.
