Research on Language Models for Chinese Character Recognition

	Research on Language Models for Chinese Character Recognition
	张胜 (Zhang Sheng)
	2001-05-01
Degree type	Master of Engineering
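Chinese abstract	This thesis studies the theory and practice of language models for character recognition. The language model is an important component of recognition post-processing, and a high-performance language model is crucial to raising the recognition rate of the whole system. Building on classical language models and targeting the characteristics of Chinese character recognition, the thesis proposes several new language models of practical value. Because Chinese character recognition has its own characteristics, recognized text totaling more than 4.5 million characters was examined first, yielding a dictionary of 3,200 characters that are easily misrecognized. A corpus was built at the same time; combining statistical linguistic knowledge with this dictionary raises the recognition rate more effectively. Applied to the proposed models, they each gave fairly satisfactory results. The thesis also gives a comprehensive and systematic account of language models. It first introduces the language model and its decision mechanism; it then analyzes the N-gram language model in detail, setting out the basic idea behind it, explaining its mathematical meaning and interpretation, and pointing out its problems; next it introduces hidden Markov chains and, on that basis, discusses interpolated language models; finally it presents the Cache model, which reflects the characteristics of the document itself, and the word-class-based N-class model. Drawing on these classical models and on the specific characteristics of errors in Chinese character recognition, the thesis first studies and implements a hybrid five-character language model [Appendix (2).2]. Its main difference from traditional language models is that it uses not only past information but also subsequent information, which greatly increases the available information and improves the model's performance. To reflect the word structure within sentences, the thesis then introduces the concept of a variable-length language model [Appendix (2).3]. Compared with previous language models, this model automatically selects and switches among several language models. Next, for cases where recognition errors are dense, for example because of a blurred page, a language model [Appendix (2).4] is proposed that handles this situation fairly well. Finally, the thesis discusses the application of language models in a prediction system. The system first performs information classification, that is, it automatically classifies the target text and then loads the dictionary for that specific domain; during prediction, the dictionary is switched automatically as the text changes. A language model built on this basis predicts well: applied to an online Chinese character recognition system, it can simplify many stages of the recognition process, raise accuracy, and speed up input. These results apply not only to post-processing for Chinese character recognition. As an important topic in Chinese information processing, with minor modifications for different situations they can also play an important role in speech recognition, machine translation, and Chinese character input.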
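The abstract names N-gram, interpolated, and Cache models without giving their formulas. As a rough illustration only, and not the thesis's implementation, the sketch below linearly interpolates a character bigram estimate, a unigram estimate, and a cache estimate built from recently observed characters. The class name `InterpolatedCacheLM`, the weights, and the cache size are all assumptions made for the example.

```python
from collections import Counter, deque


class InterpolatedCacheLM:
    """Toy character model: bigram + unigram + cache, linearly interpolated."""

    def __init__(self, corpus, weights=(0.6, 0.3, 0.1), cache_size=50):
        # weights = (bigram, unigram, cache); hypothetical values that sum to 1.
        self.w_bi, self.w_uni, self.w_cache = weights
        self.unigrams = Counter(corpus)
        self.bigrams = Counter(zip(corpus, corpus[1:]))
        # How often each character occurs as the first element of a bigram.
        self.history_counts = Counter(corpus[:-1])
        self.total = len(corpus)
        # Recently seen characters of the document being processed.
        self.cache = deque(maxlen=cache_size)

    def observe(self, char):
        """Record one character of the current document in the cache."""
        self.cache.append(char)

    def prob(self, prev, char):
        """Interpolated estimate of P(char | prev)."""
        hist = self.history_counts[prev]
        p_bi = self.bigrams[(prev, char)] / hist if hist else 0.0
        p_uni = self.unigrams[char] / self.total if self.total else 0.0
        p_cache = self.cache.count(char) / len(self.cache) if self.cache else 0.0
        return self.w_bi * p_bi + self.w_uni * p_uni + self.w_cache * p_cache
```

Calling `observe` on each accepted character lets the cache term raise the score of characters that recur in the current document, which is the general idea behind a cache component. In practice the weights would be tuned on held-out data rather than fixed by hand.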
English abstract	In this thesis, we study the theory and application of language models for Chinese character recognition. The language model plays an increasingly important role in the post-processing of Chinese character recognition and can improve the performance of the whole recognition system. Building on several traditional models, we propose some new, practical language models. Before building the training corpus, we reviewed sample texts of about 4.5 million Chinese characters and compiled a dictionary of 3,200 Chinese characters. When the new language models incorporate the corpus and the dictionary, they all perform well. The thesis also discusses language models systematically. The fundamental mechanism of the N-gram statistical language model is explained before the model is constructed. Then, on the basis of the HMM (Hidden Markov Model), the data smoothing method is analyzed. Finally, we introduce the Cache model, which can capture long-distance information, and the N-class model, which is based on POS (part of speech). Taking the characteristics of Chinese character recognition into account, we first introduce a combined 5-gram model, which captures both the forward and the backward statistical characteristics of a word. Second, to reflect the structural features of each line of the test text, a variable-length language model is introduced. Unlike previous approaches, in which the language model is fixed, it chooses the language model automatically. Third, another language model is introduced to raise the recognition rate when errors are dense within sentences. At the end of the thesis, we discuss the application of language models in a predictive system. After automatically categorizing the document stream, the recognition system uses a language model to predict Chinese characters accurately. Automatic categorization lets the model make targeted predictions. With this predictive language model, the recognizer's workload can be reduced and the accuracy of the whole system can be raised. All of these language models can be used not only in the post-processing of Chinese character recognition; with some modification they can also be used in speech recognition, machine translation, and so on.
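The combined model is described as using both the preceding and the following context. Its actual formulation is not given in this record, so the sketch below is a simplified stand-in rather than the author's method. Among the recognizer's candidates for one position, it picks the character with the best add-one-smoothed bigram score against its left and right neighbours. The function names, the smoothing choice, and the toy corpus in the usage note are hypothetical.

```python
import math
from collections import Counter


def train_bigrams(corpus):
    """Count characters and adjacent character pairs in a training string."""
    return Counter(corpus), Counter(zip(corpus, corpus[1:]))


def bigram_logprob(unigrams, bigrams, prev, char):
    """Add-one smoothed log P(char | prev).

    The unigram count of `prev` is used as the history count, and the
    smoothing scheme is an assumption made for this example.
    """
    vocab = len(unigrams) + 1  # +1 reserves mass for unseen characters
    return math.log((bigrams[(prev, char)] + 1) / (unigrams[prev] + vocab))


def pick_candidate(unigrams, bigrams, left, candidates, right):
    """Choose the candidate that best fits both its left and right neighbour."""
    def score(c):
        return (bigram_logprob(unigrams, bigrams, left, c)      # past context
                + bigram_logprob(unigrams, bigrams, c, right))  # following context
    return max(candidates, key=score)
```

For example, after `uni, bi = train_bigrams("axdaxdabc")`, the call `pick_candidate(uni, bi, "a", ["x", "b"], "c")` scores each candidate against both neighbours. A model that looked only at the left context would favour `"x"`, because `"ax"` is more frequent than `"ab"` in this toy corpus, while the right-hand term rewards `"b"` for being followed by `"c"`.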
Keywords	language model; Chinese character recognition; N-gram model; Cache model; Language Model; Recognition System; Character Recognition; Markov; N-gram Model; 4-gram; Trigram; Cache-based Model
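Language	Chinese
Document type	Thesis
Item identifier	http://ir.ia.ac.cn/handle/173211/7332
Collection	Graduates_Master's Theses
Recommended citation (GB/T 7714)	Zhang Sheng (张胜). Research on Language Models for Chinese Character Recognition (面向汉字识别的语言模型研究) [D]. Institute of Automation, Chinese Academy of Sciences. Institute of Automation, Chinese Academy of Sciences, 2001.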