Study on the Post-processing Methods of Chinese Character Recognition
Original title: 汉字识别后处理方法研究
Liu Duanzheng (刘端正)
Degree type: Master of Engineering
Advisor: Dai Ruwei (戴汝为)
1991-06-01
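Degree-granting institution: Institute of Automation, Chinese Academy of Sciences
Place of degree conferral: Institute of Automation, Chinese Academy of Sciences
Degree discipline: Pattern Recognition and Intelligent Systems
Keywords: post-processing of Chinese character recognition (CCR); word matching method; relaxation method; fuzzy lexical connection table; fuzzy semantic connection table; syntactic and semantic analysis; lexical functional grammar; artificial neural networks; associative memory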
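Abstract: Chinese character recognition is an important link in Chinese information processing and matters greatly for the application and spread of computers in China. In the course of research on Chinese character recognition, it has become increasingly clear that the recognition rate is hard to raise further using only the information in individual characters; higher-level information about the Chinese language, such as lexical, syntactic, and semantic information, must be used. Post-processing, the stage where this information is actually applied, therefore becomes all the more important.
A complete Chinese character recognition system consists of three main parts: preprocessing, recognition, and post-processing. These parts are not sharply separated; in some systems, preprocessing and recognition, or recognition and post-processing, are already tightly integrated.
By degree of user involvement, post-processing methods fall into three classes: manual processing, interactive processing, and automatic processing by computer. In manual processing, the text file produced by recognition is passed to a standard text editor, such as WordStar or PE, and the user corrects the misrecognized characters one by one and supplies the rejected ones. In interactive processing, the text file is passed to a program that offers candidate characters for each misrecognized or rejected character, and the errors are corrected through interaction with the user. In automatic processing, a program corrects the errors in the recognized text file without user intervention. By the methods applied, post-processing can also be divided into three classes: methods based on lexical information, methods based on syntactic and semantic analysis, and the more recent artificial neural network methods.
Against the twin backgrounds of knowledge-based pattern recognition and natural language processing, this thesis gives, from both theoretical and practical angles, a first systematic study of post-processing methods for Chinese character recognition. Its main contributions are: (1) a post-processing system based on a synthetic word matching method was implemented; (2) the relaxation method was applied to Chinese character recognition post-processing for the first time, and a post-processing method based on a nonlinear probabilistic relaxation process was proposed; (3) a representation of syntactic and semantic information was proposed, namely the fuzzy lexical relation connection table and the fuzzy semantic relation connection table, and a post-processing method based on this representation was described; (4) the basic idea of using lexical functional grammar for syntactic analysis of the preliminary recognition results was put forward; (5) starting from several commonly used artificial neural network (ANN) models, the information-processing principles of ANNs and their connections to and differences from traditional methods were discussed; (6) a representation of Chinese words in an ANN was given, and on that basis NETpocer, a post-processing system that combines supervised and unsupervised learning, was built.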
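The record does not describe how the thesis's matching method works internally. As a rough illustration of the general idea behind lexicon-based post-processing, the Python sketch below assumes the recognizer returns a ranked candidate list for each character position and that a word list is available. The function name `correct_by_word_matching`, its parameters, and the greedy longest-match strategy are all assumptions made for this example, not the author's method.

```python
def correct_by_word_matching(candidates, lexicon, max_len=4):
    """Pick one character per position using a word list.

    candidates: list of ranked candidate lists, one per character position
                (index 0 is the recognizer's top choice).
    lexicon:    set of known words (hypothetical dictionary).
    """
    out = []
    i = 0
    n = len(candidates)
    while i < n:
        best = None
        # Try the longest words first, down to two characters.
        for length in range(min(max_len, n - i), 1, -1):
            matches = []
            for word in lexicon:
                if len(word) != length:
                    continue
                # A word matches if each of its characters is a candidate
                # at the corresponding position.
                if all(word[k] in candidates[i + k] for k in range(length)):
                    # Cost is the sum of candidate ranks; lower is better.
                    cost = sum(candidates[i + k].index(word[k])
                               for k in range(length))
                    matches.append((cost, word))
            if matches:
                best = min(matches)[1]
                break
        if best is None:
            # No word covers this position: keep the top candidate.
            best = candidates[i][0]
        out.append(best)
        i += len(best)
    return "".join(out)


# Hypothetical recognizer output for a four-character string.
example = [["汊", "汉"], ["字", "宇"], ["识", "织"], ["剐", "别"]]
print(correct_by_word_matching(example, {"汉字", "识别"}))  # → 汉字识别
```

With an empty lexicon the function falls back to the recognizer's top choices, so the lexicon can only change positions where some known word is consistent with the candidates.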
Other abstract: Chinese Character Recognition (CCR) is an important part of Chinese information processing and plays a significant role in the application and popularization of computers in China. As research has progressed, it has become increasingly clear that the recognition rate cannot improve much if only the information in the characters themselves is used. High-level information about Chinese, such as morphological, syntactic, and semantic information, must be used. As a result, the post-processing of CCR, which makes use of this high-level information, has become more and more important.
The post-processing methods of CCR, viewed by the degree of user participation, can be divided into three classes: manual correction, interactive correction, and automatic correction by computer. In manual correction, the user processes the recognized text file in a standard text editor, such as WordStar or PE, correcting the wrongly recognized characters and supplying the unrecognized ones. In interactive correction, the recognized text file is passed to a program that offers candidates for each incorrectly recognized or unrecognized character. In automatic correction, a program corrects the mistakes in the recognized text file automatically. Viewed by the methods used, post-processing can be divided into three classes: methods based on word information, methods based on syntactic and semantic analysis, and methods based on artificial neural networks (ANNs).
Against the background of knowledge-based pattern recognition and natural language processing, this thesis makes a systematic study of the post-processing methods of CCR in both theory and practice. The main contents include: (1) We have built a post-processing system for CCR based on a synthetic word matching method. (2) We have used the relaxation method for the post-processing of CCR for the first time, and put forward a post-processing method based on a non-linear probabilistic relaxation process. (3) We have proposed a method for representing lexical and semantic information, the fuzzy lexical connection table and the fuzzy semantic connection table, and described a post-processing method based on this representation. (4) We have put forward the basic idea of using lexical functional grammar for the syntactic analysis of the initial recognition results. (5) We have discussed the connections and differences between the information processing methods of ANNs and traditional methods, starting from some concrete ANN models. (6) We have proposed a representation of Chinese words in an ANN and, based on this representation, built NETpocer, a post-processing system for CCR that uses both supervised and unsupervised learning.
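The thesis's own relaxation formulation is not reproduced in this record. As a stand-in, the Python sketch below uses the classic nonlinear probabilistic relaxation labeling update, in which each candidate's probability is multiplied by one plus its support from neighbouring positions and then renormalized. The function `relax`, the `compat` callback, the choice of immediate left and right neighbours with equal weights, and the toy bigram table are all assumptions made for illustration.

```python
def relax(probs, compat, iterations=10):
    """Nonlinear probabilistic relaxation over character candidates.

    probs:  list of dicts {character: probability}, one per position.
    compat: function compat(a, b) returning a compatibility in [-1, 1]
            for character a immediately followed by character b
            (hypothetical; e.g. derived from a word or bigram table).
    """
    n = len(probs)
    for _ in range(iterations):
        new_probs = []
        for i, p_i in enumerate(probs):
            neighbours = [j for j in (i - 1, i + 1) if 0 <= j < n]
            support = {}
            for c in p_i:
                # q is the average support for c from neighbouring positions.
                q = 0.0
                for j in neighbours:
                    for c2, p2 in probs[j].items():
                        r = compat(c2, c) if j < i else compat(c, c2)
                        q += r * p2 / len(neighbours)
                # Nonlinear update: scale by (1 + q), normalize below.
                support[c] = p_i[c] * (1.0 + q)
            total = sum(support.values())
            if total > 0:
                new_probs.append({c: v / total for c, v in support.items()})
            else:
                # All candidates fully contradicted: keep the old values.
                new_probs.append(dict(p_i))
        probs = new_probs
    return probs


# Toy example: the recognizer slightly prefers the wrong first character.
bigrams = {("汉", "字")}

def compat(a, b):
    return 1.0 if (a, b) in bigrams else 0.0

initial = [{"汊": 0.6, "汉": 0.4}, {"字": 0.5, "宇": 0.5}]
result = relax(initial, compat)
best = "".join(max(p, key=p.get) for p in result)
print(best)
```

All positions are updated from the previous iteration's probabilities, so the result does not depend on the order in which positions are visited.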
Call number: XWLW208
Other identifier: 208
Language: Chinese
Document type: Thesis
Item identifier: http://ir.ia.ac.cn/handle/173211/6996
Collection: Graduates_Master's theses
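Recommended citation
GB/T 7714
Liu Duanzheng (刘端正). Study on the Post-processing Methods of Chinese Character Recognition (汉字识别后处理方法研究) [D]. Institute of Automation, Chinese Academy of Sciences, 1991.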