Research on Segmentation Methods for Offline Chinese Handwritten Character Strings

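	Research on Segmentation Methods for Offline Chinese Handwritten Character Strings
Alternative title	Methods for Offline Chinese Handwritten Character String
	Xu Liang (许亮)
	2013-05-30
Degree type	Doctor of Engineering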
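Abstract (Chinese)	Character segmentation is a key component of offline handwritten Chinese text recognition. Because characters cannot be segmented accurately before they are recognized, over-segmentation is commonly adopted: the character string is split into primitive segments, which are then dynamically combined into characters using character recognition and context. Over-segmentation is generally implemented through connected-component labeling and touching-character splitting. The goal of touching-character splitting is to separate the touching regions while making as few cuts as possible. This remains a difficult research problem: although related work has been published, many issues are unresolved and deserve further study. Through an in-depth study of over-segmentation methods for offline touching handwritten character strings, this thesis effectively improves the segmentation and recognition accuracy of handwritten character strings. The main contributions are as follows. (1) The first publicly released touching-string database in China. Using the annotated offline handwritten text database CASIA-HWDB, we extracted all touching character strings to build an annotated touching-string database, CASIA-HWDB-T. The database contains 56,469 touching strings in total; most are single-touching strings, and the remaining 1,818 are multiple-touching character pairs. (2) An over-segmentation algorithm based on character contour matching. Its main feature is the use of Dynamic Time Warping (DTW) to find, for each contour feature point, the best matching point on the opposite contour. As a result, separating lines can be generated even when no corner point exists on the upper or lower contour near the touching region. Experiments on a large touching-string database show that the method correctly separates the vast majority of touching strings (i.e., a very high recall). (3) An over-segmentation algorithm combining foreground skeleton analysis with contour analysis. Compared with contour analysis, foreground skeleton analysis helps locate the correct separating points more accurately, and our contour-based visibility measure for separating points effectively filters out redundant ones. Experiments on a large touching-string database show that the method correctly separates most touching strings with a moderate proportion of redundant separating points. (4) An over-segmentation algorithm combining rules with learning-based filtering. Learning-based filtering overcomes the lack of robustness of earlier approaches that filter redundant separating points purely with heuristic rules. On labeled samples of correct and redundant separating lines, we extract multi-dimensional geometric features of the separating lines, train linear classifiers (Linear Discriminant Function and Linear Support Vector Machine), and transform the classifier outputs into confidence probabilities with a sigmoid function; redundancy is then removed by considering the confidence of each separating line and comparing it with the confidences of adjacent separating lines. Experimental results show that the method achieves a good trade-off between the recall and precision of separating point detection and helps improve string recognition performance. (5) A separating-line filtering algorithm based on the Hidden Markov Model (HMM). As a method for one-dimensional sequential pattern recognition, the HMM better describes the dependence between consecutive separating lines, so that redundant separating lines can be judged globally. Experiments on a large touching-string database demonstrate the feasibility of the method.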
Abstract (English)	Character segmentation is a very important part of offline handwritten Chinese text recognition. The ambiguity of character segmentation is commonly overcome by over-segmentation, which separates the character string into primitive segments and then combines the primitive segments into characters using character recognition and context. Over-segmentation is usually performed in two steps: connected component labeling and touching character splitting. Over-segmentation of touching characters is a challenging and unsolved task, though numerous works have been published in the past decades. To improve the performance of segmentation and recognition for Chinese handwriting, we provide a large public touching character database and propose four effective over-segmentation methods for handwritten touching characters, summarized as follows. 1. We built the first large public touching character database from Chinese handwriting. We collected all the touching strings from an annotated Chinese handwriting database, CASIA-HWDB, to form our touching string database CASIA-HWDB-T. The database contains 56,469 two-character or multiple-character touching strings, among which 1,818 strings have multiple-touching characters. 2. We propose an over-segmentation algorithm based on contour matching. To reliably locate separating points on the contour of a touching pattern, we pair upper and lower contour points using DTW (Dynamic Time Warping), such that a corner point on the upper or lower contour can always be paired with a proper contour point on the opposite side to form a separating line. Our experimental results show that the proposed method can correctly separate most touching characters (i.e., a high recall rate of touching point detection). 3. We propose an over-segmentation algorithm combining foreground skeleton analysis and contour analysis. Foreground skeleton analysis helps detect touching points more accurately, while the profile visibility analysis of separating points is used to filter out redundant ones efficiently. Our experiment on a large touching character database shows that the proposed method can locate most between-character boundaries, with a moderate percentage of redundant separating points. 4. We propose an effective over-segmentation method with a learning-based filter. The learning-based filter is more robust than filtering by heuristics alone. On extracting geometric features from the samples of corre...
Keywords	Character Segmentation; Touching Characters; Offline Handwritten Text Recognition; Over-segmentation; Separating Line Filtering
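Language	Chinese
Document type	Dissertation
Item identifier	http://ir.ia.ac.cn/handle/173211/6545
Collection	Graduates_Doctoral Dissertations
Recommended citation (GB/T 7714)	Xu Liang. Research on Segmentation Methods for Offline Chinese Handwritten Character Strings [D]. Institute of Automation, Chinese Academy of Sciences. University of Chinese Academy of Sciences, 2013.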
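
Contribution (2) of the abstract pairs upper and lower contour points with Dynamic Time Warping so that every feature point receives a partner on the opposite contour. The record does not say which point features or local cost the thesis uses, so the following is only a minimal sketch under stated assumptions: each contour is reduced to a list of scalar values (for example, x-coordinates of sampled points), the local cost is their absolute difference, and the names `dtw_pairs` and `partners_of` are hypothetical.

```python
from typing import List, Tuple


def dtw_pairs(upper: List[float], lower: List[float]) -> List[Tuple[int, int]]:
    """Align two 1-D contour profiles with classic DTW.

    Returns the warping path as (upper_index, lower_index) pairs.
    Assumption: the local cost is |upper[i] - lower[j]|; the thesis may
    use different contour features.
    """
    n, m = len(upper), len(lower)
    if n == 0 or m == 0:
        return []
    inf = float("inf")
    # cost[i][j]: minimal accumulated cost of aligning upper[:i+1] with lower[:j+1]
    cost = [[inf] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            d = abs(upper[i] - lower[j])
            if i == 0 and j == 0:
                cost[i][j] = d
                continue
            best = inf
            if i > 0:
                best = min(best, cost[i - 1][j])
            if j > 0:
                best = min(best, cost[i][j - 1])
            if i > 0 and j > 0:
                best = min(best, cost[i - 1][j - 1])
            cost[i][j] = d + best

    # Backtrack from the end of both sequences to the start.
    i, j = n - 1, m - 1
    path = [(i, j)]
    while i > 0 or j > 0:
        candidates = []
        if i > 0 and j > 0:
            candidates.append((cost[i - 1][j - 1], i - 1, j - 1))
        if i > 0:
            candidates.append((cost[i - 1][j], i - 1, j))
        if j > 0:
            candidates.append((cost[i][j - 1], i, j - 1))
        _, i, j = min(candidates)
        path.append((i, j))
    path.reverse()
    return path


def partners_of(path: List[Tuple[int, int]], upper_index: int) -> List[int]:
    """Lower-contour indices paired with one upper-contour point (hypothetical helper)."""
    return [j for i, j in path if i == upper_index]
```

Because a warping path visits every index of both sequences, each upper-contour point is guaranteed at least one lower-contour partner, which matches the property the abstract highlights: a candidate separating line can be formed even when the opposite contour has no corner point nearby.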

Files in this item
File name/size	Document type	Version type	Access type	License
CASIA_20091801462806 (5142 KB)			Not currently open	CC BY-NC-SA
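
Contribution (4) of the abstract converts linear classifier outputs into confidence values with a sigmoid and then removes redundant separating lines by looking at each line's own confidence and at how it compares with its neighbours. The record gives neither the sigmoid parameters nor the exact filtering rule, so the sketch below is illustrative only: the parameters `a` and `b`, the thresholds `min_conf` and `margin`, the use of list adjacency as "adjacent", and the function names are all assumptions rather than the thesis's settings.

```python
import math
from typing import List, Sequence


def sigmoid_confidence(score: float, a: float = 1.0, b: float = 0.0) -> float:
    """Map a raw linear classifier score to (0, 1) via 1 / (1 + exp(-(a*score + b))).

    a and b are assumed calibration parameters, not values from the thesis.
    """
    z = a * score + b
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)  # numerically stable branch for negative z
    return e / (1.0 + e)


def filter_separating_lines(
    confidences: Sequence[float],
    min_conf: float = 0.5,
    margin: float = 0.3,
) -> List[int]:
    """Return indices of candidate separating lines kept after filtering.

    Hypothetical rule: drop a candidate if its confidence is below min_conf,
    or if a directly adjacent candidate exceeds it by more than margin.
    """
    kept = []
    for i, c in enumerate(confidences):
        if c < min_conf:
            continue
        neighbours = []
        if i > 0:
            neighbours.append(confidences[i - 1])
        if i + 1 < len(confidences):
            neighbours.append(confidences[i + 1])
        if any(nb - c > margin for nb in neighbours):
            continue
        kept.append(i)
    return kept
```

For instance, `filter_separating_lines([0.9, 0.55, 0.2, 0.8])` keeps candidates 0 and 3 under these assumed thresholds: candidate 1 passes the absolute threshold but is dominated by its stronger neighbour, and candidate 2 fails the absolute threshold.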