Layout Analysis and Character Recognition of Historical Documents

	Layout Analysis and Character Recognition of Historical Documents
	徐玥
	2022-08-22
Pages	160
Degree type	Doctoral
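Chinese abstract	As carriers for recording and disseminating information, documents have long been of great importance to human culture and daily life. In recent years, digitizing historical paper documents has become a trend, because it makes them easier for computers to process and understand. Layout analysis and character recognition, the two key technologies in document digitization, have attracted researchers' attention for many years and have been widely applied across industries. Studying how to analyze the layout of document images accurately and recognize their content therefore has both theoretical significance and practical value. In practice, documents come in many types, and the corresponding layout analysis and character recognition techniques differ accordingly. This thesis focuses on the analysis and recognition of historical document images and aims to improve the performance of the two key stages, layout analysis and character recognition. The main contributions are summarized as follows:
1. Document image layout analysis based on fully convolutional networks. Compared with modern documents, historical documents pose many new challenges for layout analysis, which existing methods often cannot solve well. To address this, the thesis proposes a document image layout analysis method based on a single-task fully convolutional network. The method uses a fully convolutional network to make pixel-level predictions on historical document images and obtains accurate layout analysis results. To improve efficiency, the thesis further proposes a method based on a multi-task fully convolutional network, which simultaneously solves several layout analysis sub-problems, including document image binarization, page segmentation, text line segmentation, and baseline detection. The multi-task network structure improves processing efficiency and, through information exchange between tasks, further improves accuracy. Experiments on a document image binarization dataset and a medieval handwritten document dataset verify the effectiveness and superiority of these methods.
2. Layout analysis and database construction for Chinese ancient documents. The thesis proposes a layout analysis method for Chinese ancient document images that uses a multi-task fully convolutional network to solve several layout analysis sub-problems simultaneously, including document image binarization, text line segmentation, and character segmentation. Building on this method, the thesis designs interactive annotation software and uses it to binarize, segment into text lines, segment into characters, and label character classes for a large number of Chinese ancient document images from the Siku Quanshu (四库全书, Complete Library in Four Sections) and ancient scriptures, establishing a large-scale database of Chinese ancient documents. The database contains more than ten thousand pages of Chinese ancient document images with annotations for binarization, text line segmentation, character segmentation, and character classification, and is suitable for a variety of research problems. The thesis provides basic evaluation metrics and experimental results as reference baselines for related research in the field.
3. Class-incremental learning over large class sets based on convolutional prototype networks. When recognizing characters in Chinese ancient documents, the large number of rare and variant characters often makes it difficult to know all character classes in advance and learn them in a single batch. The model should therefore be able to keep extending its classification capability to new classes, that is, to perform class-incremental learning. For Chinese ancient character recognition, the thesis proposes a class-incremental learning method for large class sets based on the convolutional prototype network. It explains the inherent advantages of convolutional prototype networks over traditional convolutional neural networks in open-world problems such as class-incremental learning, and proposes strategies such as introducing an unsupervised reconstruction loss and increasing the number of pre-training classes to strengthen the network's feature extraction ability and robustness, thereby improving its incremental learning performance. In addition, the thesis proposes an incremental learning method based on prototype constraints in a fixed feature space, and one based on prototype and network parameter constraints in a changing feature space. Both methods reach the state of the art among current class-incremental learning algorithms on a Chinese ancient handwritten character dataset.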
English abstract	As carriers for recording and disseminating information, documents have always been of great significance to human life and culture. In recent years, the digitization of historical paper documents has become a trend, because it enables computers to process and understand them. As two key technologies in document digitization, layout analysis and character recognition have attracted researchers' attention for many years and have been widely used in many applications. It is therefore of both theoretical significance and practical value to study how to analyze document layout and recognize document content accurately. In practice, the variety of document types calls for different layout analysis and character recognition techniques. This thesis focuses on the analysis and recognition of historical document images, with the aim of improving the performance of layout analysis and character recognition. The main contributions are summarized as follows:
1. Layout analysis for document images based on fully convolutional networks. Compared with contemporary documents, historical documents pose many new challenges for layout analysis, and existing methods often fail to solve these problems well. We therefore propose a document image layout analysis method based on a single-task fully convolutional network (FCN). This method uses a fully convolutional network to perform pixel-level prediction on historical document images and obtains accurate layout analysis results. To improve the efficiency of layout analysis, we also propose a method based on a multi-task fully convolutional network. This method solves multiple layout analysis tasks simultaneously, such as document image binarization, page segmentation, text line segmentation, and baseline detection. The multi-task network not only improves processing efficiency but also improves the accuracy of layout analysis through information interaction between tasks. Experiments on a document image binarization dataset and a medieval manuscript dataset demonstrate the effectiveness and superiority of the proposed methods.
2. Layout analysis and database construction for Chinese ancient documents. For Chinese ancient document images, we propose a layout analysis method based on a multi-task fully convolutional network, which simultaneously solves multiple layout analysis tasks such as document image binarization, text line segmentation, and character segmentation. Based on this method, we design an interactive annotation tool to carry out document image binarization, text line segmentation, character segmentation, and character class annotation for a large number of Chinese ancient document images from the Complete Library in Four Sections and ancient scriptures, and build a large database of Chinese ancient document images. The resulting database contains more than 10,000 pages of Chinese ancient document images with labels for binarization, text line segmentation, character segmentation, and character classification, which makes it suitable for a variety of research problems. This thesis presents basic evaluation metrics and experimental results to provide baselines for research in the field.
3. Class-incremental learning over large class sets based on the convolutional prototype network. When recognizing ancient Chinese characters, it is often difficult to know all character classes in advance and learn them in a single batch, because of the large number of rare and variant characters. The model should therefore be able to extend its classification capability to new classes continuously, that is, to perform class-incremental learning. We propose a class-incremental learning method for large class sets based on the convolutional prototype network (CPN) for Chinese ancient character recognition. After explaining the inherent advantages of the convolutional prototype network over the traditional convolutional neural network (CNN) in open-world problems such as class-incremental learning, we propose that the feature extraction ability and robustness of the network can be enhanced by strategies such as adding an unsupervised reconstruction loss and increasing the number of pre-training classes, thereby improving the class-incremental learning performance of the network. We also propose a method based on prototype regularization in a fixed feature space, and a method based on prototype and network parameter regularization in a changing feature space. Both methods achieve state-of-the-art class-incremental learning performance on a Chinese ancient handwritten character dataset.
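To make the idea of prototype-based class-incremental learning in the third contribution more concrete, the following is a minimal sketch and not the method of the thesis. It assumes that feature vectors have already been extracted by some fixed network, and it uses the mean feature vector of each class as its prototype, whereas a convolutional prototype network learns prototypes jointly with the feature extractor. The class name `PrototypeClassifier` and its methods are hypothetical, introduced only for illustration.

```python
import math


class PrototypeClassifier:
    """Nearest-prototype classifier over precomputed feature vectors.

    Assumption: features come from a fixed extractor, and each prototype
    is simply the class mean (a simplification for illustration).
    """

    def __init__(self):
        # Maps a class label to its prototype vector (list of floats).
        self.prototypes = {}

    def add_class(self, label, features):
        """Add a new class from its feature vectors.

        Existing prototypes are left untouched, so earlier classes are
        not altered when a new class arrives.
        """
        count = len(features)
        dim = len(features[0])
        self.prototypes[label] = [
            sum(vec[i] for vec in features) / count for i in range(dim)
        ]

    def predict(self, x):
        """Return the label whose prototype is nearest in Euclidean distance."""
        return min(
            self.prototypes,
            key=lambda label: math.dist(x, self.prototypes[label]),
        )


clf = PrototypeClassifier()
clf.add_class("A", [[0.0, 0.0], [0.2, 0.0]])
clf.add_class("B", [[1.0, 1.0], [1.0, 0.8]])
clf.add_class("C", [[0.0, 2.0]])  # incremental step: a class seen later
print(clf.predict([0.1, 1.9]))  # nearest prototype is that of "C"
print(clf.predict([0.0, 0.1]))  # earlier class "A" is still recognized
```

In this sketch, adding a class only inserts one new prototype, which is why a prototype-based classifier extends to new classes more naturally than a fixed-size softmax output layer. When the feature extractor itself is updated, old prototypes can drift, which is the situation the regularization on prototypes and network parameters described in the abstract is meant to address.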
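Keywords	layout analysis; character recognition; class-incremental learning; document database; historical documents; fully convolutional network; convolutional prototype network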
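Language	Chinese
Seven major directions, sub-direction classification	Character recognition and document analysis
State Key Laboratory planned direction classification	Visual information processing
Document type	Thesis
Item identifier	http://ir.ia.ac.cn/handle/173211/49903
Collection	Graduates_Doctoral theses
Corresponding author	徐玥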
Recommended citation (GB/T 7714)	徐玥. Layout Analysis and Character Recognition of Historical Documents[D]. Institute of Automation, Chinese Academy of Sciences, 2022.

Files in this item
File name/size	Document type	Version type	Access type	License
徐玥-博士论文-历史文档版面分析与文字识 (34832KB)	Thesis		Restricted access	CC BY-NC-SA