Applied Research on Understanding for Printed Mathematical Expressions (印刷体数学公式识别应用研究)


Title	印刷体数学公式识别应用研究
Alternative title	Applied Research on Understanding for Printed Mathematical Expressions
Author	郭育生 (Guo Yusheng)
Date	2007-09-01
Degree type	Doctor of Engineering
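Abstract (translated from Chinese)	Scientific and engineering documents contain large numbers of mathematical formulas, yet existing OCR products cannot recognize them effectively, and their output for formulas is often garbled beyond recognition. Addressing the problems in mathematical formula recognition, this thesis studies formula localization, formula image binarization, formula symbol segmentation/recognition, and formula structure analysis in depth, and builds a preliminarily practical formula recognition system. The main contributions are: (1) A formula localization method is proposed. It first uses Chinese character recognition and formula symbol recognition to separate Chinese characters from non-Chinese characters (for English documents this separation step is unnecessary), then extracts embedded formula symbols from the non-Chinese characters using the spatial relations between adjacent symbols and the semantic information of the symbols themselves, and finally locates isolated formulas using the layout information of formulas. On 148 document images containing 3690 formulas, it achieves a formula localization accuracy of 91.19%. (2) To handle broken and touching symbol strokes in formula images, an ensemble binarization method for formula images is proposed. A connected-component-based binarization method is used to reduce broken strokes, and a histogram-based global binarization method is used to reduce touching between adjacent symbols. The two binarization methods are then combined based on the symbol recognition results. (3) A formula symbol segmentation method based on three-stage dynamic programming is proposed. It first uses dynamic programming to segment the formula into sub-images in the vertical direction, then segments formula symbols in the horizontal direction, and finally uses dynamic programming to merge formula symbols that may have been broken. On a test set of 1322 formula images, it achieves a symbol segmentation accuracy of 96.40%. (4) A symbol recognition method with a rejection model is proposed. On the test data set, it achieves a symbol recognition accuracy of 98.58%. (5) A hierarchical structure analysis method is proposed. It reduces the complexity of formula analysis and improves analysis accuracy, achieving a structure analysis accuracy of 87.59% on the test set of 1322 formula images. (6) A preliminarily practical mathematical formula recognition system is built. On a test set of 148 document images containing 3690 formulas, it achieves a formula recognition accuracy of 81.24%. The system has been embedded in Hanwang OCR (汉王OCR) and is commercially available.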
English abstract	Scientific and technical documents contain many mathematical expressions (MEs), yet commercial OCR products cannot recognize the MEs in these documents effectively. This thesis focuses on key techniques for automatic ME recognition: ME identification in Chinese document images, binarization of ME images, symbol segmentation/recognition in MEs, and ME structure analysis. The main contributions of this thesis are: (1) An ME identification algorithm is proposed. Chinese characters and non-Chinese characters are first distinguished using Chinese character recognition and ME symbol recognition (if the document image contains Chinese characters). ME symbols are then extracted from the non-Chinese characters according to features of ME symbols. Finally, isolated MEs are identified using format information. An ME identification accuracy of 91.19% is reached on a database of 148 document images containing 3690 MEs. (2) A binarization algorithm for ME images is proposed. A binarization algorithm based on connected components is adopted to reduce the probability of broken ME symbols, and a histogram-based binarization algorithm is adopted to reduce the probability of touching between adjacent symbols. The two binarization algorithms are integrated based on symbol recognition. (3) A symbol segmentation algorithm for ME images based on three-stage dynamic programming (DP) is introduced. DP is first used to segment sub-images in the vertical direction. The symbols in every block are then segmented in the horizontal direction. Finally, broken symbols are merged using DP. Experiments on a database of 1322 images give a symbol segmentation accuracy of 96.40%. (4) A symbol recognition method with a non-symbol rejection model is proposed. A symbol recognition accuracy of 98.58% is obtained on the database. (5) A hierarchical structure analysis algorithm for MEs is proposed, which reconstructs the global structure of the ME. The method decomposes the ME into several basic sub-expressions, which reduces the complexity of ME structure analysis. The ME structure analysis accuracy on the database of 1322 ME images reaches 87.59%. (6) An automatic ME understanding system is built. The experimental database consists of 148 document images containing 3690 MEs, on which an ME recognition accuracy of 81.24% is obtained. The ME recognition system has been integrated into HWOCR and is commercially available.
Keywords	Chinese Document; Mathematical Expression; Binary Image; Reject Model; Dynamic Programming; Hierarchical Structure
Language	Chinese
Document type	Dissertation
Identifier	http://ir.ia.ac.cn/handle/173211/6036
Collection	Graduates_Doctoral Dissertations
Recommended citation (GB/T 7714)	郭育生. 印刷体数学公式识别应用研究[D]. 中国科学院自动化研究所. 中国科学院研究生院,2007.
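
Illustrative note on contribution (3). The abstract states that the last stage of segmentation uses dynamic programming to merge symbols that may have been broken, but it does not give the formulation. The sketch below is only one plausible reading, not the thesis's algorithm: it assumes the fragments are already in a fixed order, that a hypothetical function `score(i, j)` rates how well fragments `i` to `j-1` form a single symbol, and that at most `max_span` fragments are merged at once. The names `merge_fragments`, `score`, `max_span`, and the toy score table are all assumptions made for illustration.

```python
from typing import Callable, List, Tuple


def merge_fragments(
    n: int,
    score: Callable[[int, int], float],
    max_span: int = 3,
) -> Tuple[float, List[Tuple[int, int]]]:
    """Split fragments 0..n-1 into consecutive groups with maximal total score.

    score(i, j) is a hypothetical confidence that fragments i..j-1 together
    form one symbol; each returned (i, j) pair is one merged symbol.
    """
    best = [float("-inf")] * (n + 1)  # best[j]: best total for the first j fragments
    back = [0] * (n + 1)              # back[j]: start index of the last group
    best[0] = 0.0
    for j in range(1, n + 1):
        for i in range(max(0, j - max_span), j):
            candidate = best[i] + score(i, j)
            if candidate > best[j]:
                best[j] = candidate
                back[j] = i
    # Walk the back-pointers to recover the chosen grouping.
    groups: List[Tuple[int, int]] = []
    j = n
    while j > 0:
        groups.append((back[j], j))
        j = back[j]
    groups.reverse()
    return best[n], groups


# Toy input (assumed): fragments are "x", upper bar of "=", lower bar of "=", "y".
toy_scores = {(0, 1): 0.9, (1, 2): 0.3, (2, 3): 0.3, (1, 3): 0.95, (3, 4): 0.9}
total, groups = merge_fragments(4, lambda i, j: toy_scores.get((i, j), -1.0))
print(groups)  # [(0, 1), (1, 3), (3, 4)]: the two bars are merged into one symbol
```

In this toy case the two bars score poorly on their own (0.3 each) and well together (0.95), so the best grouping merges them. Summing raw scores favors having more groups; a real system would more likely use log-probabilities or a per-symbol cost, which the abstract does not specify.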

Files in this item
File name/size	Document type	Version type	Access type	License
CASIA_20041801462802 (7615KB)			Not yet open	CC BY-NC-SA