Online Handwritten Document Analysis Based on Structured Prediction


	Online Handwritten Document Analysis Based on Structured Prediction
	叶君宇 (Ye Junyu)
	2020-12
Pages	139
Degree type	Doctoral
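Chinese abstract	As electronic input devices such as tablet computers, writing pads, and digital pens become more common, users can enter text and draw tables and figures on a wider range of input surfaces, and the devices capture the pen traces and store them as online documents. Compared with offline documents produced with traditional pen and paper, online documents retain more of the input information and are therefore potentially easier to process. However, the flexibility of unconstrained writing also makes online handwritten documents challenging to analyze. In general, analysis of an online handwritten document first classifies the strokes into different objects, such as text and graphics, and then analyzes each kind of document object separately. Document objects of the same class can be further segmented into instances, for example splitting a text block into text lines or a flowchart into basic graphic units. This dissertation studies three main problems in online handwritten document analysis: text/non-text stroke classification, document object classification, and document instance segmentation. The main contributions are as follows. (1) To exploit both the temporal relationships between strokes and the feature modeling capacity of neural networks, the dissertation proposes NN-CRF, a model that combines neural networks with a conditional random field, for text/non-text stroke classification. The potential functions of the conditional random field are obtained as nonlinear mappings of stroke geometric features computed by neural networks, and all model parameters are learned jointly by minimizing the negative log-likelihood loss. In experiments on an online handwritten English document dataset, the method achieves performance comparable to previous representative methods with fewer parameters and faster speed. (2) To capture the contextual dependencies among document objects, the dissertation proposes a graph attention network with edge feature fusion (EGAT) for classifying the strokes of online handwritten documents into document objects. The method first builds a relational graph over the strokes from their temporal and spatial context, and then learns node features on the graph with the EGAT model. During message passing in EGAT, a self-attention mechanism and an edge attention mechanism based on pairwise stroke features control the information passed between nodes, and an edge feature update step learns the edge features. In experiments on online handwritten English documents, Japanese documents, flowcharts, automata diagrams, and sketches, the method achieves better classification performance than previous representative methods. (3) To handle the complex geometric relationships between objects in online handwritten documents, as well as text lines that may be written in arbitrary directions, need not be parallel, and may lie arbitrarily close to one another, the dissertation proposes a framework that combines a graph attention network with an edge pooling mechanism (EPAT) and distance metric learning for instance segmentation of document objects. The method models instance segmentation as node classification and node clustering on a graph, and obtains distributed node and edge features through message passing on the temporal/spatial relational graph of strokes, which are then used to classify and group the strokes. During message passing, the edge pooling mechanism strengthens the interaction between adjacent edge features so that the model can handle complex geometric relationships between document objects or text lines. A multi-task supervision framework learns the stroke classification and clustering tasks jointly, so online documents can be analyzed at multiple levels within a unified framework. In document instance segmentation experiments on online handwritten English documents, Japanese documents, and flowcharts, the method outperforms previous representative methods on multiple evaluation metrics.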
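The first contribution (NN-CRF) describes conditional random field potentials produced by a neural network and trained with a negative log-likelihood loss. The following is a minimal sketch of that general idea for a linear chain over strokes in writing order, not the dissertation's implementation. The network shape, the weights, the two feature values per stroke, and the fixed transition table are all assumptions made for illustration; in the abstract the potentials come from neural networks over stroke geometric features, whereas here only the unary scores do.

```python
import math

def logsumexp(values):
    """Numerically stable log(sum(exp(v)))."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))

def unary_scores(features, w_hidden, w_out):
    """Hypothetical one-hidden-layer network: stroke features -> one score per label."""
    hidden = [math.tanh(sum(w * x for w, x in zip(row, features))) for row in w_hidden]
    return [sum(w * h for w, h in zip(row, hidden)) for row in w_out]

def sequence_score(unaries, transitions, labels):
    """Unnormalized log-score of one labeling of the stroke sequence."""
    score = sum(unaries[t][y] for t, y in enumerate(labels))
    return score + sum(transitions[a][b] for a, b in zip(labels, labels[1:]))

def log_partition(unaries, transitions):
    """Forward algorithm in log space: log of the summed scores of all labelings."""
    n_labels = len(unaries[0])
    alpha = list(unaries[0])
    for scores in unaries[1:]:
        alpha = [
            scores[j] + logsumexp([alpha[i] + transitions[i][j] for i in range(n_labels)])
            for j in range(n_labels)
        ]
    return logsumexp(alpha)

def negative_log_likelihood(unaries, transitions, labels):
    """CRF training loss for one labeled stroke sequence."""
    return log_partition(unaries, transitions) - sequence_score(unaries, transitions, labels)

# Illustrative data only: three strokes in writing order, two made-up features each.
stroke_features = [[0.9, 0.1], [0.8, 0.3], [0.2, 0.7]]
w_hidden = [[1.0, -1.0], [-0.5, 0.5]]     # 2 hidden units x 2 features (assumed)
w_out = [[1.2, -0.3], [-0.8, 0.6]]        # 2 labels (0 = text, 1 = non-text) x 2 hidden units
transitions = [[0.5, -0.5], [-0.5, 0.5]]  # assumed table favoring equal neighboring labels

unaries = [unary_scores(f, w_hidden, w_out) for f in stroke_features]
nll = negative_log_likelihood(unaries, transitions, labels=[0, 0, 1])
print(f"NLL of labeling [0, 0, 1]: {nll:.4f}")
```

In a joint training setup, the gradient of this loss would flow into both the transition scores and the network weights, which is what lets the sequence model and the feature mapping be learned together.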
English abstract	With the increasing use of tablet PCs, electronic whiteboards, and digital pens, users can input heterogeneous content such as text, tables, and drawings on a large writing area, and the resulting digital ink is captured by those devices. Compared with traditional offline documents recorded with pen and paper, ink documents contain more information and thus offer potential advantages for analysis. However, the unconstrained writing style in handwritten ink documents brings new challenges to document analysis. In general, ink document analysis first classifies the ink strokes into different classes, e.g. text and drawings, which are then analyzed separately. Strokes of the same class can be further segmented into different instances, e.g. a text block into text lines or a flow chart into basic graphic units. This dissertation focuses on three important issues of ink document analysis: text/non-text stroke classification, document object classification, and document instance segmentation. The main contributions of this dissertation are summarized as follows: (1) For text/non-text stroke classification, we propose the NN-CRF model, which combines neural networks (NN) and conditional random fields (CRF) to utilize both the temporal relationships between strokes and the feature learning capacity of neural networks. In this method, the potentials of the CRF are output by neural networks that take stroke geometric features as input. The parameters are learned jointly by minimizing the negative log-likelihood loss. The proposed method is validated on online handwritten English documents. The experimental results show that it achieves performance comparable to previous representative methods with fewer parameters and faster speed. (2) For classifying strokes into different document objects, we propose the edge graph attention network (EGAT) to exploit the contextual dependencies of document objects. In this method, strokes are first formulated as a relational graph based on their temporal and spatial relationships, and distributed node features are then learned through the EGAT model. In the message passing procedure of the EGAT model, we propose to use a self-attention mechanism and an edge attention mechanism to control the information flow between neighboring nodes. In addition, we propose an edge update procedure for learning better edge features. The proposed method is validated on online handwritten English, Japanese, flow chart, automata, and sketch documents. The experimental results show that it outperforms previous representative methods. (3) Since there are complex geometric relationships between document objects, and text lines in online handwritten documents can have arbitrary orientation, need not be parallel, and can be arbitrarily close to each other, we propose the edge pooling attention network (EPAT) with distance metric learning to address the document instance segmentation problem. The proposed method formulates document instance segmentation as node classification and node clustering problems on the graph and learns distributed node and edge features by message passing in the temporal/spatial relational graph. In the message passing procedure, the edge pooling mechanism is used to enhance the interaction between neighboring edge features, so the proposed method can handle complex geometric relationships between document objects or text lines. In addition, a multi-task learning framework is used to learn the stroke classification and clustering tasks simultaneously, so online document analysis can be done in a unified framework. The proposed method is validated on online handwritten English, Japanese, and flow chart documents. The experimental results show that it outperforms previous representative methods on various evaluation metrics.
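The abstract describes building a relational graph over strokes from their temporal and spatial relationships and then passing messages between neighboring nodes under attention that also depends on edge features. The sketch below illustrates that general pattern only; it is not EGAT or EPAT. The centroid-distance rule for spatial edges, the `radius` threshold, the dot-product attention score, and the scalar `edge_weight` are all assumptions introduced for illustration.

```python
import math

def centroid(stroke):
    """Mean point of a stroke given as a list of (x, y) samples."""
    xs, ys = zip(*stroke)
    return (sum(xs) / len(xs), sum(ys) / len(ys))

def build_graph(strokes, radius):
    """Edges: temporal (consecutive strokes) plus spatial (centroids within radius).

    Returns the centroids and a dict mapping directed pairs (i, j) to the
    centroid distance, used here as a one-dimensional edge feature.
    """
    centers = [centroid(s) for s in strokes]
    edges = {}
    for i in range(len(strokes)):
        for j in range(i + 1, len(strokes)):
            dist = math.hypot(centers[i][0] - centers[j][0], centers[i][1] - centers[j][1])
            if j == i + 1 or dist <= radius:
                edges[(i, j)] = dist
                edges[(j, i)] = dist
    return centers, edges

def attention_step(node_feats, edges, edge_weight):
    """One message passing step: softmax attention over neighbors.

    The attention logit is a node term (dot product) plus an edge term
    (edge_weight * edge feature); both choices are illustrative assumptions.
    """
    updated = []
    for i, h_i in enumerate(node_feats):
        neighbors = [j for (src, j) in edges if src == i]
        if not neighbors:
            updated.append(list(h_i))
            continue
        logits = [
            sum(p * q for p, q in zip(h_i, node_feats[j])) + edge_weight * edges[(i, j)]
            for j in neighbors
        ]
        m = max(logits)
        exps = [math.exp(v - m) for v in logits]
        total = sum(exps)
        weights = [e / total for e in exps]
        updated.append([
            sum(w * node_feats[j][k] for w, j in zip(weights, neighbors))
            for k in range(len(h_i))
        ])
    return updated

# Made-up strokes in writing order; stroke 3 is written late but lies near stroke 0.
strokes = [
    [(0, 0), (1, 0)],
    [(1, 0), (2, 0)],
    [(10, 10), (11, 10)],
    [(0, 1), (1, 1)],
]
centers, edges = build_graph(strokes, radius=1.5)
updated = attention_step([list(c) for c in centers], edges, edge_weight=-1.0)
print(sorted(edges))
```

With a negative `edge_weight`, nearer neighbors receive more attention, which is one simple way an edge feature can modulate the information flow between stroke nodes.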
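Keywords	online handwritten document analysis; stroke classification; instance segmentation; text line segmentation; structured prediction; conditional random fields; graph neural networks; attention mechanism
Language	Chinese
Seven major directions: sub-direction category	Text recognition and document analysis
Document type	Thesis
Item identifier	http://ir.ia.ac.cn/handle/173211/43292
Collection	State Key Laboratory of Multimodal Artificial Intelligence Systems_Pattern Analysis and Learning
Recommended citation (GB/T 7714)	叶君宇 (Ye Junyu). Online Handwritten Document Analysis Based on Structured Prediction[D]. Institute of Automation, Chinese Academy of Sciences, 2020.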

Files in this item
File name/size	Document type	Version type	Access type	License
学位论文_叶君宇.pdf (5220 KB)	Thesis		Open access	CC BY-NC-SA