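Registration and Recognition Methods for Structured Document Images and Their Applications (格式文档图像配准与识别方法及应用)
He Kun (何坤)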
2017-05-25
Degree type: Master of Engineering
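Abstract (translated from Chinese): Structured documents are documents whose format is fixed to some degree, such as bills, certificates and licenses, and forms. They are widely used in education, finance, logistics, taxation, administrative management, and other industries. At present, electronic archiving of these documents relies mainly on manual data entry, which consumes a great deal of labor and resources. Automatic recognition of structured documents therefore has great economic benefit and broad social value.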
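An automatic recognition system for structured documents comprises image preprocessing, format registration, extraction of the content to be recognized, and text string recognition. This thesis carries out systematic research and development on these stages. The main contributions are as follows: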
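1. A document image binarization method based on the structural symmetry of strokes is proposed. The symmetry of character strokes comprises the symmetry of gradient directions at stroke edges and the symmetry of foreground and background pixels in the transition region at stroke edges. The method uses these properties to extract structurally symmetric elements from the document image and performs local binarization based on them. The algorithm also uses background normalization to address some of the difficulties caused by uneven illumination, as well as stain contamination. It achieves good results on public datasets and on a dataset of real structured document images collected in actual projects.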
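2. A general pattern definition method for structured documents is proposed. A pattern definition is a description of the format layout and generally includes the structural and logical attributes of the layout. The method represents the information in a structured document as Label-Value pairs, so that a pattern definition reduces to a set of Label-Value pairs. An interactive pattern definition tool for structured documents was also developed. The method and the tool have been used in several application systems, including bank statement data and ticket recognition.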
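3. A pattern registration and content extraction algorithm for structured documents is designed. It addresses two main problems: classifying the format of a document image, and extracting the image regions in the document that are to be recognized. A multi-scale elastic framework is built from the predefined pattern definition and is then combined with a sliding-window strategy to find the best registration result. Tests on a bank statement dataset collected in actual projects verify the effectiveness and robustness of the algorithm.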
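4. A text string recognition technique combining a Long Short-Term Memory (LSTM) recurrent neural network with a Convolutional Neural Network (CNN) is designed and implemented. The CNN extracts a feature sequence from the text image, which serves as the input to the LSTM network, and Connectionist Temporal Classification (CTC) is used to solve the alignment problem between data and labels during model training. Because deep neural networks need large numbers of training samples, the thesis also designs and implements a simple algorithm for generating text string image samples. The recognition model is trained on a synthetic taxi ticket dataset and tested on real taxi ticket data, achieving good recognition results.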
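Items 2 and 3 above can be illustrated with a minimal Python sketch. It is not the thesis's multi-scale elastic framework: it assumes a rigid template with one global scale and offset, assumes the label boxes have already been detected on the page by some other means, and scores each candidate placement by overlap area. All names (`LabelValuePair`, `Pattern`, `register`, `value_regions`) and the example coordinates are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List, Optional, Tuple

Box = Tuple[float, float, float, float]  # (x, y, width, height)


@dataclass(frozen=True)
class LabelValuePair:
    name: str
    label_box: Box  # where the printed label sits in the template
    value_box: Box  # where the content to be recognized sits


@dataclass(frozen=True)
class Pattern:
    name: str
    pairs: List[LabelValuePair]


def place(box: Box, scale: float, dx: float, dy: float) -> Box:
    """Map a template box onto the page with a scale and an offset."""
    x, y, w, h = box
    return (x * scale + dx, y * scale + dy, w * scale, h * scale)


def overlap(a: Box, b: Box) -> float:
    """Intersection area of two boxes (0.0 if they do not intersect)."""
    iw = min(a[0] + a[2], b[0] + b[2]) - max(a[0], b[0])
    ih = min(a[1] + a[3], b[1] + b[3]) - max(a[1], b[1])
    return max(iw, 0.0) * max(ih, 0.0)


def register(
    pattern: Pattern,
    detected_labels: Dict[str, Box],
    scales: Iterable[float],
    offsets: Iterable[Tuple[float, float]],
) -> Optional[Tuple[float, float, float, float]]:
    """Brute-force sliding search; returns (score, scale, dx, dy)."""
    offsets = list(offsets)
    best = None
    for s in scales:
        for dx, dy in offsets:
            score = sum(
                overlap(place(p.label_box, s, dx, dy), detected_labels[p.name])
                for p in pattern.pairs
                if p.name in detected_labels
            )
            if best is None or score > best[0]:
                best = (score, s, dx, dy)
    return best


def value_regions(pattern: Pattern, s: float, dx: float, dy: float) -> Dict[str, Box]:
    """Page regions to crop and pass on to the text recognizer."""
    return {p.name: place(p.value_box, s, dx, dy) for p in pattern.pairs}


# Hypothetical two-field template and label detections on a scanned page.
template = Pattern("demo_statement", [
    LabelValuePair("date", (0.0, 0.0, 10.0, 5.0), (12.0, 0.0, 30.0, 5.0)),
    LabelValuePair("amount", (0.0, 10.0, 10.0, 5.0), (12.0, 10.0, 30.0, 5.0)),
])
detected = {"date": (10.0, 20.0, 20.0, 10.0), "amount": (10.0, 40.0, 20.0, 10.0)}
best = register(template, detected, [1.0, 2.0], [(0.0, 0.0), (10.0, 20.0), (5.0, 5.0)])
regions = value_regions(template, *best[1:])
```

Running `register` against several candidate patterns and keeping the highest score would give a simple form of format classification, under the same assumptions.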
Abstract (English): A structured document is a document whose format is fixed to a certain extent, such as a bill, certificate, or bank statement. Such documents are widely used in education, finance, logistics, taxation, administrative management, and other industries. At present, converting paper structured documents to electronic form relies mainly on manual input, which incurs a great cost in manpower and resources. Therefore, automatic recognition of structured documents may bring great economic benefit and wide social value.
An automatic recognition system for structured documents includes image preprocessing, format matching, extraction of the content to be recognized, and text recognition. In this thesis, we work on each of these parts, as follows:
1. A novel document image binarization method based on the structural symmetry of strokes is proposed. The symmetry properties consist of two parts: the opposite gradient directions of stroke edges and the coexistence of foreground and background pixels. The method uses these properties to extract structural symmetry elements in the document image and then performs local binarization. It also uses background normalization to address uneven illumination and stain contamination. The method achieves satisfactory results on several public datasets and on a dataset collected in actual projects.
2. A method of pattern definition is proposed. The pattern definition is a description of the structured layout, covering both the structure and the logic of the layout. We use Label-Value pairs to represent the information in the structured document, and we created an interactive tool that guides the user through the pattern definition. The method and the tool are used in recognition systems for bank documents and tickets.
3. A method of pattern registration and content extraction is proposed. We address two main problems: classifying the format of the document image and extracting the text regions to be recognized in the document. From the pattern definition we build a multi-scale flexible framework and use it together with a sliding-window strategy to find the best match. We test on bank statement datasets to verify the validity and robustness of the proposed algorithm.
4. A text recognition technique based on Long Short-Term Memory (LSTM) and a convolutional neural network (CNN) is designed and implemented. The CNN extracts features from the text image as an input sequence for the LSTM, and Connectionist Temporal Classification (CTC) is used to solve the data-label alignment problem. Because deep learning needs a large number of training samples, the thesis also designs a simple algorithm for generating text image samples. The model is trained on a synthetic dataset of taxi tickets and obtains good test results on a real taxi ticket dataset.
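The role of CTC in item 4 can be illustrated at inference time with a minimal sketch of greedy (best-path) decoding: take the top class per frame, collapse consecutive repeats, and drop blanks. The abstract does not say which decoder the thesis uses, so this is an assumption for illustration only. The function name `ctc_greedy_decode`, the convention that class 0 is the blank, and the toy scores are all hypothetical, and the CTC training loss itself is not shown.

```python
from typing import List, Sequence

BLANK = 0  # assumption: class index 0 is the CTC blank symbol


def ctc_greedy_decode(frame_scores: Sequence[Sequence[float]], charset: str) -> str:
    """Best-path CTC decoding.

    frame_scores[t][k] is the network's score for class k at time step t.
    charset[k] is the character for class k; charset[0] is a placeholder
    for the blank and is never emitted.
    """
    # 1. Pick the highest-scoring class at every time step.
    best_path: List[int] = [
        max(range(len(scores)), key=scores.__getitem__) for scores in frame_scores
    ]
    # 2. Collapse consecutive repeats, then 3. remove blanks.
    decoded: List[str] = []
    previous = BLANK
    for k in best_path:
        if k != BLANK and k != previous:
            decoded.append(charset[k])
        previous = k
    return "".join(decoded)


# Toy per-frame scores over the classes (blank, "a", "b").
toy_scores = [
    [0.1, 0.8, 0.1],    # a
    [0.2, 0.7, 0.1],    # a (repeat, collapsed)
    [0.9, 0.05, 0.05],  # blank (separates the two a's)
    [0.1, 0.8, 0.1],    # a
    [0.1, 0.1, 0.8],    # b
    [0.0, 0.2, 0.8],    # b (repeat, collapsed)
    [0.9, 0.1, 0.0],    # blank
]
text = ctc_greedy_decode(toy_scores, "-ab")
```

The blank between the two runs of "a" is what lets the decoder output a doubled character, which is why the label sequence needs no frame-level alignment with the image.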
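Keywords: structured document; binarization; format registration; content extraction; text string recognition
Subject area: Computer Vision and Pattern Recognition
Document type: Thesis
Identifier: http://ir.ia.ac.cn/handle/173211/14665
Collection: Graduates_Master's Theses
Author affiliation: Institute of Automation, Chinese Academy of Sciences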
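Recommended citation
GB/T 7714
He Kun. Registration and Recognition Methods for Structured Document Images and Their Applications [D]. Beijing: Graduate University of the Chinese Academy of Sciences, 2017.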
Files in this item
File name/size: 格式文档图像配准与识别方法及应用.pdf (4703KB)
Document type: Thesis
Access type: Restricted access
License: CC BY-NC-SA