Research on Unconstrained Handwritten Chinese Character Recognition Methods (自由手写汉字识别方法研究)

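Title	Research on Unconstrained Handwritten Chinese Character Recognition Methods
Alternative Title	Unconstrained Handwritten Chinese Character Recognition
Author	邵允学
	2013-05-28
Degree Type	Doctor of Engineering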
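Chinese Abstract	Handwritten Chinese character recognition has attracted many researchers because of its theoretical significance and potential application value. On datasets of constrained handwriting, recognition has achieved considerable success, but research on unconstrained handwritten Chinese character recognition remains limited and its accuracy is comparatively low, which restricts further deployment of character recognition applications. To overcome several shortcomings of current handwritten Chinese character recognition techniques, such as preprocessing methods that cope poorly with deformation and are only loosely coupled with feature extraction, classifier combination strategies that are not tailored to the task, and insufficient ability to discriminate similar characters, this thesis studies three questions: how to design a preprocessing method that is more robust to character deformation and more tightly coupled with feature extraction; how to design a simple and effective classifier combination method for handwritten Chinese character recognition; and how to exploit structural information in the critical regions of similar characters to discriminate them better. The work consists of the following three parts: 1. To address the problem that preprocessing methods cope poorly with character deformation and are only loosely coupled with feature extraction, this thesis proposes a nonlinear normalization method based on visual word density (VWD). The method considers both the within-class and the between-class variance of the normalized samples, remedying the limitation of traditional normalization methods that consider only the within-class variance. In addition, feature extraction is used while learning the dictionary density, which couples normalization and feature extraction more tightly and lays a good foundation for subsequent classification. Experiments on unconstrained and constrained handwritten Chinese character datasets show that the proposed method outperforms commonly used nonlinear normalization methods in classification performance. 2. Handwritten Chinese character recognition involves a large category set and few training samples, so many existing classifier combination methods are hard to apply to it directly. In view of these characteristics, this thesis proposes a handwritten Chinese character recognition method based on fast self-generation voting (FSGV). First, the proposed fast self-generation method produces a set of test samples; then a single base classifier recognizes each sample in this set; finally, a weighted vote over these recognition results gives the final result. Furthermore, to improve the complementarity among the generated samples, a greedy method is used to learn a small but highly complementary set of generation parameters, which further improves both voting speed and classification performance. Experiments on unconstrained and constrained handwritten Chinese character datasets show that the proposed method is practical and effective. 3. For similar character discrimination, methods based on two-class linear discriminant analysis are commonly used. Such methods work well for similar characters that are linearly separable, but in unconstrained handwriting similar characters are often not linearly separable. To address this, this thesis proposes a similar character discrimination method based on adaptive critical region analysis (ACRA). The method takes full account of the variability in scale and position of the critical regions and of the deformations that may occur, so that it adapts to the test sample. In addition, to address the poor generalization of AdaBoost caused by the small number of training samples, a multi-column AdaBoost method is proposed. Experiments on an unconstrained handwritten Chinese character dataset show that the proposed ACRA method outperforms commonly used similar character discrimination methods.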
English Abstract	Handwritten Chinese character recognition (HCCR) has been investigated by many researchers for its theoretical significance and its potential in many applications. Recognition of constrained handwritten Chinese characters has improved greatly. However, research on unconstrained handwritten Chinese character recognition remains insufficient, and recognition accuracy on unconstrained handwritten characters is still unsatisfactory, which restricts many applications of character recognition. In this thesis, three algorithms are proposed to alleviate the limitations of some existing HCCR methods: learning a better character normalization method, designing a fast and effective classifier combination method, and exploring a similar character discrimination method better suited to unconstrained handwritten Chinese characters. The main work and contributions are as follows. First, a visual word density (VWD) based nonlinear normalization method is proposed. In contrast to traditional nonlinear normalization methods, which only minimize the within-class variance, the proposed method minimizes the ratio of within-class variance to between-class variance, so both are taken into account. Moreover, feature extraction is involved in the learning procedure of the proposed method, which couples normalization and feature extraction more closely than in traditional methods and benefits classification. Experimental results on constrained and unconstrained handwritten Chinese character databases show that the proposed method outperforms traditional normalization methods, including dot density based and line density based nonlinear normalization. Second, a fast self-generation voting (FSGV) based method is proposed, taking the characteristics of HCCR into account.
Combining classifiers can exploit their individual advantages to reach better overall performance than any of them achieves separately. Because of the large number of categories and the lack of training samples, directly applying most existing classifier combination methods to the HCCR problem would not perform well. In the proposed method, a virtual testing set is first generated by the proposed fast self-generation method. Then each sample in the virtual testing set is classified by a baselin...
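The voting procedure summarized in the abstract (generate variants of one test sample, classify each variant with a single base classifier, combine the labels by weighted voting) might be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the thesis's implementation: the generation step is stood in for by simple integer translations, and the names `generate_variants`, `weighted_vote`, `shifts`, and `weights` are hypothetical. The greedy selection of generation parameters mentioned in the abstract is not shown.

```python
import numpy as np


def generate_variants(image, shifts):
    """Assumed stand-in for the generation step: one shifted copy per (dy, dx)."""
    return [np.roll(image, shift=(dy, dx), axis=(0, 1)) for dy, dx in shifts]


def weighted_vote(image, base_classifier, shifts, weights, num_classes):
    """Classify every generated variant and return the label with the largest weighted vote."""
    scores = np.zeros(num_classes)
    for variant, weight in zip(generate_variants(image, shifts), weights):
        label = base_classifier(variant)  # hard decision from the single base classifier
        scores[label] += weight           # each variant contributes its weight to that label
    return int(np.argmax(scores))
```

Here `base_classifier` is any callable that maps an image array to an integer class index; only one classifier is needed, since the diversity comes from the generated variants rather than from multiple models.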
Keywords	Unconstrained Handwritten Chinese Character Recognition; Character Image Normalization; Visual Word Density; Fast Self-generation Voting; Adaptive Critical Region Analysis; Multi-column AdaBoost
Language	Chinese
Document Type	Thesis
Item Identifier	http://ir.ia.ac.cn/handle/173211/6530
Collection	Graduates_Doctoral Dissertations
Recommended Citation (GB/T 7714)	邵允学. 自由手写汉字识别方法研究[D]. 中国科学院自动化研究所. 中国科学院大学,2013.