Keyword Spotting from Online Chinese Handwritten Documents

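	Keyword Spotting from Online Chinese Handwritten Documents (联机中文手写文档的关键词检索)
Alternative Title	Keyword Spotting from Online Chinese Handwritten Documents
	Zhang Heng (张恒)
	2013-05-30
Degree Type	Doctor of Engineering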
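Chinese Abstract	With the widespread use of pen input devices and pen-based user interfaces, the analysis, recognition and retrieval of online handwritten documents has become an important research direction. Although document recognition has made great progress, it is limited by recognition accuracy: words that are not recognized correctly cannot be found, so recall is not high enough. Keyword spotting, by contrast, does not require accurate recognition of the document. It computes the similarity between the keyword and candidate words in the document and balances recall against precision by adjusting a similarity threshold, which makes it possible to find more useful information. This thesis studies keyword spotting for large-scale, multi-writer online Chinese handwritten documents. Building on character recognition and text recognition, it computes the confidence of candidate characters in the candidate segmentation-recognition lattice and uses character similarities to compute word similarities. Through effective computation of character and word similarities and dynamic search, keyword positions can be located efficiently in a large database. The main work and contributions of this thesis are as follows: (1) A keyword spotting method based on a one-versus-all prototype classifier is proposed. Unlike a multi-class classifier trained with the minimum classification error (MCE) criterion, the one-versus-all prototype classifier is better at rejecting incorrect classes. Experimental results show that the one-versus-all classifier outperforms the multi-class classifier in keyword spotting. (2) A keyword spotting method based on character confidence estimated from the N-best paths in the candidate segmentation-recognition lattice is proposed. The path evaluation criterion is a discriminant function that integrates a character classifier, a bigram language model and geometric models. Soft-max is used to convert path scores into probabilities, and the confidence parameter can be estimated by decoding character confusion networks (CNN) on training text. Experimental results verify the effectiveness of the method. (3) A keyword spotting method based on pruning of the candidate segmentation-recognition lattice and edge probability computation is proposed. Based on the semi-Markov conditional random field (semi-CRF) model, the forward-backward algorithm is used to prune the candidate segmentation-recognition lattice and to compute edge probabilities, which serve as the confidences of candidate characters. To improve the recall of keyword spotting, an error-correcting character-synchronous dynamic search algorithm is proposed. Experimental results verify the effectiveness of the semi-CRF model and of the error-correcting dynamic search algorithm.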
English Abstract	With the increasing use of pen-based input devices and user interfaces, more research attention has been paid to online document analysis techniques, including text segmentation, recognition and retrieval. Despite great progress in handwritten text recognition, the remaining recognition errors can still prevent keywords from being located. Keyword spotting locates instances of a query in a document without requiring accurate recognition of the document. The user can adjust the similarity threshold to balance recall and precision according to different needs. This thesis studies text-query-based keyword spotting techniques on a large database of multi-writer online handwritten Chinese documents. Based on handwriting recognition, candidate character confidences are computed on the candidate segmentation-recognition lattice and combined into word similarities. Thanks to accurate character/word similarity computation and dynamic search, the query can be located efficiently on the lattice. The major contributions of this work are as follows: (1) A keyword spotting method based on a one-vs-all (OVA) trained prototype classifier is proposed. Compared with the prototype classifier trained with the minimum classification error (MCE) criterion, the OVA classifier can better detect target words and reject imposters. Our experimental results demonstrate the effectiveness of keyword spotting using OVA classifiers. (2) A spotting method based on character confidence computed from the N-best list on the candidate segmentation-recognition lattice is proposed. Each path is evaluated by a scoring function that combines multiple contexts, including the character classification score, the bi-gram linguistic score and geometric scores. The scores of the N-best paths are transformed into posterior probabilities using soft-max, with its parameter estimated from the character confusion network, which is generated from the N-best paths of a training set of text lines. The experimental results demonstrate the superiority of this method. (3) A keyword spotting method with edge probability computation on the pruned candidate segmentation-recognition lattice is proposed. Based on the semi-Markov conditional random fields (semi-CRFs) model, the candidate segmentation-recognition lattice is pruned by a forward-backward algorithm and the edge probability is computed as the character confidence. We further propose to improve the recall of the keyword spotting using an error-correcting character...
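Keywords	Online Chinese Handwritten Document; Keyword Spotting; Confidence Computation; Character-synchronous Dynamic Search; Semi-Markov Conditional Random Fields
Language	Chinese
Document Type	Thesis
Identifier	http://ir.ia.ac.cn/handle/173211/6540
Collection	Graduates_Doctoral Dissertations
Recommended Citation (GB/T 7714)	Zhang Heng. Keyword Spotting from Online Chinese Handwritten Documents[D]. Institute of Automation, Chinese Academy of Sciences. University of Chinese Academy of Sciences, 2013.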
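
As an illustration only, the N-best confidence idea in contribution (2) of the abstract (soft-max over path scores, then a confidence per candidate character) might be sketched as below. The abstract gives no formulas, so everything concrete here is an assumption: the function name nbest_char_confidence, the (start, end, char) encoding of a candidate, the scaling parameter alpha, the toy scores, and the choice to sum the posteriors of all paths that contain a candidate. In the thesis the soft-max parameter is estimated from a character confusion network on training text; here it is simply set by hand.

```python
import math
from collections import defaultdict


def nbest_char_confidence(paths, alpha=1.0):
    """Estimate candidate-character confidences from an N-best list.

    paths: list of (score, candidates), where candidates is a list of
           hypothetical (start, end, char) tuples along one lattice path.
    alpha: assumed soft-max scaling parameter (hand-set in this sketch).
    Returns a dict mapping each candidate to the summed posterior of the
    paths that contain it.
    """
    # Subtract the maximum score before exponentiating for numerical stability.
    max_score = max(score for score, _ in paths)
    weights = [math.exp(alpha * (score - max_score)) for score, _ in paths]
    total = sum(weights)

    conf = defaultdict(float)
    for weight, (_, candidates) in zip(weights, paths):
        posterior = weight / total  # soft-max probability of this path
        for cand in set(candidates):  # count each candidate once per path
            conf[cand] += posterior
    return dict(conf)


# Toy 3-best list for one text line (scores are made up for illustration).
paths = [
    (2.0, [(0, 1, "关"), (1, 2, "键")]),
    (1.0, [(0, 1, "关"), (1, 2, "健")]),
    (0.0, [(0, 2, "鍵")]),
]
conf = nbest_char_confidence(paths, alpha=1.0)
for cand, value in sorted(conf.items(), key=lambda kv: -kv[1]):
    print(cand, round(value, 3))
```

A candidate shared by several high-scoring paths ends up with a higher confidence than one that appears only in a low-scoring path. A spotting system could then combine such confidences into a word similarity and compare it with a user-chosen threshold, as the abstract describes.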

Files in This Item
File Name/Size	Document Type	Version	Access	License
CASIA_20091801462806 (10389KB)			Restricted access	CC BY-NC-SA