Research and Application of Recognition Methods for Camera-Captured Documents


Title	Research and Application of Recognition Methods for Camera-Captured Documents
Author	贾馥溪
Date	2019-05-31
Pages	152
Degree type	Doctoral
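Abstract (translated from Chinese)	With the rapid development of computer hardware and software and the spread of camera devices, ever more images are produced in daily life and work. Text in an image usually gives an accurate description of the objects or events it shows, so extracting and recognizing that text matters; one example is recognizing the amount and date on an invoice image. Traditional optical character recognition (OCR) mainly handles images produced by scanners and copes poorly with documents photographed by newer devices such as digital cameras and smartphones. These devices capture images in open, uncontrolled environments, so the images may exhibit perspective and curl distortion, uneven illumination, motion and defocus blur, and complex backgrounds, all of which make camera-captured document recognition challenging. This dissertation takes as its research object the formatted documents (documents whose format is fixed to some degree) that are widely used across industries, and studies the structured recognition of camera-captured documents by combining theory, methods, and practical application. It focuses on three key technologies: document binarization (innovation 1), keyword spotting (innovation 2), and text line segmentation and recognition (innovation 3). Building on these, and taking automatic invoice entry in the financial domain as the entry point, we developed the invoice recognition cloud service, a production system within the intelligent financial system of the Chinese Academy of Sciences, which has now been running and maintained for more than a year. The main work and innovations are summarized as follows.
1. A binarization method based on the symmetry of character strokes, designed and implemented for the degradations found in camera-captured documents. For degradations such as faint text, bleed-through, and ink stains, traditional local thresholding methods rarely give good results, mainly because they compute a single local threshold from all pixels in the neighborhood, random noise and background interference included. To address this, we estimate multiple thresholds from the structurally symmetric pixels in the neighborhood and use a voting strategy to decide whether the center pixel belongs to text. Structurally symmetric pixels are pixels around a character stroke that have large gradient magnitudes and opposite gradient directions; they are symmetric in both gradient direction and gray level. A local threshold estimated from the gray-level statistics of these pixels therefore avoids introducing noise while still separating text from background well. The multi-threshold voting mechanism compensates for misjudgments caused by imprecisely detected structurally symmetric pixels and also removes boundary noise in the neighborhood. Experiments on seven public datasets show that the method is effective and robust, and its use in the production intelligent finance system shows that it is efficient.
2. A Chinese keyword spotting method for camera-captured documents that incorporates recognition feedback. For structured recognition of camera-captured documents, it is important to make good use of the semantic guidance contained in the document itself. Keywords in a document indicate data types, and their recognized meaning and detected position are the main basis for obtaining structured data, so keyword spotting in document images is well worth studying. Most existing methods target English words and treat the problem as two independent tasks, text detection and text recognition, which means that detection errors do irreversible damage to the recognition results. Compared with English words, Chinese keywords vary in length, orientation, and category, and the spacing between their characters can be very large, which makes Chinese keyword spotting harder. To meet these challenges, we first detect individual characters, then use feedback from single-character recognition to filter out non-key characters and to detect further characters. A flexible matching strategy then merges the characters into initial keywords, and the initial recognition results guide a second round of detection and recognition for characters missing inside a keyword. Finally, the keyword results are selected by optimizing a cost function over recognition confidence and geometric distribution. We collected two datasets of invoice images photographed with mobile phones and evaluated the method on them. The experiments show that, compared with existing object and text detection methods based on deep learning models, the proposed Chinese keyword spotting method is effective and adapts well.
3. A segmentation and recognition method for faint camera-captured text based on optimized grayscale projection. Once the keywords of a document have been detected, their semantic information can guide the segmentation and recognition of the other text lines; for example, a date line contains only a limited set of character classes. In practice, a printer that is short of ink may produce invoices with faint text, and image capture by a camera introduces further degradations such as low resolution, overexposure, and defocus or motion blur. To address these problems, we propose an optimal character segmentation and recognition method based on grayscale projection. The method first takes the local minima of the image gradient projection as candidate character segmentation positions, then builds a segmentation tree covering all possible segmentation paths. Each path is scored by combining three kinds of information: the projection values at the segmentation positions, the overall geometric distribution, and the recognition confidence. The path with the highest score gives the final segmentation and recognition result for the text image. To evaluate the method, we collected a faint text line recognition dataset from photographs of faint taxi invoices. The experiments show that the method greatly improves the recognition of faint and blurred text in the actual product.
4. Productization: the Intelligent Finance Shared Service Platform (i-FSSP). To ease the reimbursement difficulties of researchers, in June 2018 the Chinese Academy of Sciences proposed ten specific reform measures covering scientific research project management, asset and financial management, and talent program management, in order to implement the "decentralization, regulation and service" reform and establish a green channel. To address the uneven quality, low efficiency, and high labor cost of collecting and processing basic financial information in practice, we proposed the Intelligent Finance Shared Service Platform and studied several key technologies, including intelligent invoice entry, data analysis, and decision support. Considering labor cost alone, the project is expected to save the Chinese Academy of Sciences 75 million RMB per year. Intelligent invoice entry here means the structured recognition of invoices photographed with a mobile phone. Compared with the partial information recorded in traditional bookkeeping, such as date, amount, and account, it supplies more complete basic financial data for later big data analysis and decision-making. To turn the research algorithms into a product, we optimized running speed, initialization memory, long-term stable operation, and highly concurrent user access. Practical use and the response after launch show that the camera-captured document recognition methods designed in this dissertation are feasible and effective for intelligent invoice entry.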
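The binarization idea in contribution 1 can be illustrated with a minimal Python sketch. It is not the dissertation's implementation. The abstract does not say how structurally symmetric pixels are detected or how the multiple thresholds are formed, so the gradient walk in `find_ssp`, the parameters `mag_thresh` and `max_width`, and the use of several window radii (`radii`) as the source of the multiple thresholds are all assumptions made for illustration. Dark text on a lighter background is assumed.

```python
import numpy as np


def find_ssp(gray, mag_thresh=20.0, max_width=6):
    """Mark pixels whose strong gradient is answered, within max_width steps
    across the stroke, by a strong gradient pointing the opposite way.
    (Assumed detection rule, not taken from the dissertation.)"""
    gy, gx = np.gradient(gray.astype(float))
    mag = np.hypot(gx, gy)
    h, w = gray.shape
    ssp = np.zeros((h, w), dtype=bool)
    for y, x in zip(*np.nonzero(mag > mag_thresh)):
        dx, dy = gx[y, x] / mag[y, x], gy[y, x] / mag[y, x]
        for step in range(1, max_width + 1):
            # Walk against the gradient: from the bright side into the dark stroke.
            yy = int(round(y - dy * step))
            xx = int(round(x - dx * step))
            if not (0 <= yy < h and 0 <= xx < w):
                break
            opposite = gx[yy, xx] * dx + gy[yy, xx] * dy < 0
            if mag[yy, xx] > mag_thresh and opposite:
                ssp[y, x] = ssp[yy, xx] = True
                break
    return ssp


def box_sum(a, r):
    """Sum of `a` over the (2r+1) x (2r+1) window around each pixel,
    computed with an integral image; windows are clipped at the borders."""
    h, w = a.shape
    ii = np.pad(a, ((1, 0), (1, 0))).cumsum(0).cumsum(1)
    y0 = np.clip(np.arange(h) - r, 0, h)
    y1 = np.clip(np.arange(h) + r + 1, 0, h)
    x0 = np.clip(np.arange(w) - r, 0, w)
    x1 = np.clip(np.arange(w) + r + 1, 0, w)
    return ii[y1][:, x1] - ii[y0][:, x1] - ii[y1][:, x0] + ii[y0][:, x0]


def binarize_by_voting(gray, radii=(4, 8, 16), **ssp_kwargs):
    """Return a boolean text mask. Each window radius yields one local
    threshold (mean gray level of the SSPs in the window) and one vote."""
    gray = gray.astype(float)
    ssp = find_ssp(gray, **ssp_kwargs)
    votes = np.zeros(gray.shape, dtype=int)
    for r in radii:
        count = box_sum(ssp.astype(float), r)
        total = box_sum(gray * ssp, r)
        thresh = np.divide(total, count, out=np.zeros_like(total), where=count > 0)
        # A window without any SSP casts no vote for "text".
        votes += ((count > 0) & (gray < thresh)).astype(int)
    return votes * 2 > len(radii)  # simple majority
```

Because the paired pixels sit on both sides of a stroke edge, their mean gray level falls between the text and background levels, which is why it can serve as a local threshold without averaging in unrelated background pixels.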
English abstract	With the development of computer hardware and software and the spread of image acquisition devices, more and more images are produced in daily life and work. Text in an image often gives an accurate description of the things and events it shows, so extracting text from images captured by modern devices is important; one example is the automatic recognition of the amount and date printed on an invoice. Traditional optical character recognition (OCR) technology mainly targets document images generated by scanners and struggles with camera-captured document images produced by newer devices such as digital cameras and mobile phones. This is because these devices usually capture images in an open environment, where interference such as perspective and curl distortion, uneven illumination, motion and defocus blur, and complex backgrounds makes camera-captured document recognition more challenging. This dissertation takes formatted document images (documents whose format is fixed to a certain extent), which are widely used across industries, as its research object. Combining theoretical analysis with practical application, it carries out a series of studies on camera-captured document recognition. In particular, three key technologies are researched and implemented: document image binarization (innovation point 1), keyword spotting (innovation point 2), and text line segmentation and recognition (innovation point 3). On this foundation, taking automated invoice handling as a starting point, we developed the invoice recognition cloud service adopted in the intelligent financial system of the Chinese Academy of Sciences. This service has been in operation and under maintenance for more than one year. The main contributions of this dissertation are summarized as follows.
1. A binarization method using the structural symmetry of strokes has been designed and implemented to deal with degradations. Camera-captured document images show different types of degradation, such as faint characters, bleed-through, and ink stains, and traditional local thresholding methods rarely binarize such images successfully. The likely reason is that they compute a single threshold from all the pixels in the neighborhood, including random noise and background disturbance. To solve these problems, we use structural symmetric pixels (SSPs) to calculate local thresholds in the neighborhood, and a vote over multiple thresholds determines whether a pixel belongs to the foreground. SSPs are defined as pixels around strokes whose gradient magnitudes are large enough and whose gradient orientations are opposite. Since the SSPs contain both text and background pixels, their intensity statistics give a good approximation of the local threshold that separates text from background. The voting framework compensates for some inaccurate SSP detections and effectively eliminates boundary noise in the neighborhood. Experimental results on seven public document image binarization datasets show the effectiveness and robustness of the proposed method, and its use in the intelligent financial system confirms its efficiency in practice.
2. A Chinese keyword spotting method based on recognition feedback has been designed and implemented for camera-captured document images. For structured recognition, it is important to make reasonable use of the semantic information in the document itself. The keywords of an invoice indicate the attributes of the structured data, so keyword spotting for invoice images is of great significance. Most existing work treats word spotting as two separate tasks, word detection and word recognition, which means that detection errors do irreversible damage to the recognition results. In invoice images, Chinese keywords vary in length, orientation, and category, and the spacing between characters may be much larger than in English words. To deal with these challenges, we first detect individual characters directly. Using feedback from the character recognition results, we filter out non-key characters and detect further character candidates. A flexible matching strategy then groups characters into word candidates and provides valuable information for character re-detection. For the final word spotting, a cost function that integrates recognition confidence and geometric distribution is optimized to remove false positives. We collected two datasets of camera-captured invoice images, TID and VATID, and evaluated our method on them. Experimental results demonstrate the effectiveness and robustness of the proposed method.
3. A grayscale-projection-based optimal character segmentation and recognition method has been designed and implemented for faint camera-captured text. After obtaining the keywords from a document image, we can segment and recognize the characters of text lines according to the semantic information of the keywords. In practice, an invoice printer that is short of ink often produces invoices with faint text, so the invoice image is faint from the start. In addition, image capture by a camera introduces further degradations such as low resolution, overexposure, and defocus or motion blur. To solve these problems, we propose a new character segmentation method for faint text images that uses only grayscale information. Instead of extracting character candidates, we use the gradient projection to extract a series of segmentation candidates. We then construct a segmentation tree in which each branch represents a plausible segmentation path. Each path receives an evaluation score that combines three scores: single-point projection, overall distribution, and recognition probability. Finally, the path with the highest score is selected as the optimal segmentation path. We collected a faint text recognition dataset and evaluated our method on it. The experimental results show that the proposed method greatly improves the recognition of faint and blurred text lines.
4. Productization: Intelligent Finance Shared Service Platform (i-FSSP). To ease reimbursement for researchers, in June 2018 the Chinese Academy of Sciences proposed ten specific reform measures, covering scientific research project management, asset and financial management, and talent program management, to implement the "decentralization, regulation and service" reform and establish a green channel. To deal with the uneven quality, low efficiency, and high labor cost of basic financial information collection, we propose the Intelligent Finance Shared Service Platform and study key technologies such as intelligent invoice entry, data analysis, and decision support. Considering labor cost alone, the project is expected to save the Chinese Academy of Sciences 75 million RMB annually. Intelligent invoice entry refers to the structured recognition of camera-captured invoice documents, and our system can provide more comprehensive basic financial data for subsequent big data analysis and decision-making. To productize the proposed algorithms, we optimized the intelligent invoice entry product in many respects, such as running speed, initialization memory, long-term stable operation, and highly concurrent user access. Practical application and the response after launch show that the camera-captured document recognition method designed in this dissertation is feasible and effective for intelligent invoice entry.
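The segmentation procedure in contribution 3 (candidate cuts from the gradient projection, a tree of segmentation paths, and a combined path score) can be sketched in Python as follows. This is an illustrative sketch, not the dissertation's code. The abstract gives no formulas, so the three score terms, their equal weights, the width limits `min_w` and `max_w`, and the `recognize` callback (a hypothetical function that maps a segment image to a `(label, confidence)` pair) are all assumptions.

```python
import numpy as np


def local_minima(proj):
    """Centers of the runs of equal values that are lower than both neighbors."""
    cuts, i, n = [], 0, len(proj)
    while i < n:
        j = i
        while j + 1 < n and proj[j + 1] == proj[i]:
            j += 1
        left_ok = i == 0 or proj[i - 1] > proj[i]
        right_ok = j == n - 1 or proj[j + 1] > proj[i]
        if left_ok and right_ok:
            cuts.append((i + j) // 2)
        i = j + 1
    return cuts


def enumerate_paths(cuts, min_w, max_w):
    """Depth-first walk of the segmentation tree; every root-to-leaf branch
    is one path from the first cut to the last one."""
    def grow(path):
        if path[-1] == cuts[-1]:
            yield path
            return
        for c in cuts:
            if min_w <= c - path[-1] <= max_w:
                yield from grow(path + [c])
    yield from grow([cuts[0]])


def score_path(path, proj, line_img, recognize, weights=(1.0, 1.0, 1.0)):
    """Combine cut quality, width regularity and recognition confidence."""
    widths = np.diff(path)
    cut_score = -np.mean(proj[path])                 # low gradient energy at cuts
    geometry_score = -np.std(widths) / np.mean(widths)  # evenly sized segments
    results = [recognize(line_img[:, a:b]) for a, b in zip(path[:-1], path[1:])]
    recog_score = np.mean([conf for _, conf in results])
    text = "".join(label for label, _ in results)
    total = (weights[0] * cut_score + weights[1] * geometry_score
             + weights[2] * recog_score)
    return total, text


def segment_line(line_img, recognize, min_w=3, max_w=12):
    """Return (cut positions, recognized text, score) of the best path,
    or None when no path satisfies the width limits."""
    img = line_img.astype(float)
    # proj[c] is the gradient energy between columns c-1 and c (cut before c).
    edge = np.abs(np.diff(img, axis=1)).sum(axis=0)
    proj = np.concatenate(([0.0], edge, [0.0]))
    proj = proj / max(proj.max(), 1e-9)
    cuts = local_minima(proj)
    scored = []
    for path in enumerate_paths(cuts, min_w, max_w):
        total, text = score_path(path, proj, line_img, recognize)
        scored.append((total, path, text))
    if not scored:
        return None
    total, path, text = max(scored, key=lambda item: item[0])
    return path, text, total
```

Exhaustive enumeration keeps the sketch close to the description of a tree in which each branch is a path, but the number of paths grows quickly with the number of candidate cuts, so a practical system would need pruning or a search with bounded width.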
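Keywords	camera-captured document recognition, document binarization, structural symmetry of strokes, keyword spotting, intelligent finance
Language	Chinese
Seven major research directions: sub-direction	Text recognition and document analysis
Document type	Thesis
Item identifier	http://ir.ia.ac.cn/handle/173211/23889
Collection	State Key Laboratory of Management and Control for Complex Systems_Image Analysis and Machine Vision
Recommended citation (GB/T 7714)	贾馥溪. 拍照文档的识别方法研究与应用[D]. 中国科学院自动化研究所智能化大厦三层第五会议室. 中国科学院自动化研究所,2019.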

Files in this item
File name/size	Document type	Version type	Access type	License
Thesis-答辩后修改版.pdf (15527KB)	Thesis		Open access	CC BY-NC-SA