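Online Handwritten Document Analysis Based on Structured Prediction (基于结构化预测的联机手写文档分析)
叶君宇
2020-12
Pages: 139
Degree type: Doctoral
Chinese abstract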

随着平板电脑、书写板、数码笔等电子输入设备的逐渐流行,用户能够在更
广泛的输入界面上进行输入文本、绘制表格与图形等操作,同时电子输入设备
能将用户的笔迹捕获并保存为联机文档。相对于以传统的纸与笔为输入媒介的
脱机文档,联机文档保留了更多的输入信息,从而在处理上具有潜在的便利性。
但是,联机文档中无约束书写的灵活性也给联机手写文档中的分析带来了挑战。
一般来说,对于联机手写文档的分析,首先要将笔划分类为不同的对象,如文本
内容和图形,然后再对不同的文档对象分别进行分析。对于同类的文档对象,可
进一步分割为不同的实例,如将文本块分割为不同的文本行,或将流程图细分为
不同的基本图形单元。本文对联机手写文档分析中的文本/非文本笔划分类、文
档对象分类与文档实例分割三个主要问题进行研究,论文的主要工作包括以下
方面:
一、为了有效地利用笔划之间的时序关系信息以及神经网络的特征建模能
力,本文提出了一种基于神经网络与条件随机场的组合模型 NN-CRF 对笔划进
行文本/非文本分类。条件随机场中的势函数由神经网络关于笔划几何特征的非
线性映射得到,整个模型的参数通过优化负对数似然损失进行联合学习。该方法
在联机手写英文文档数据集上进行了实验,与之前的代表性方法相比,以更少的
参数量与更快的速度达到了相当的性能。

二、为了有效地刻画文档对象的上下文依赖关系,本文提出了一种基于边特
征融合的图注意力网络 (EGAT) 模型对联机手写文档中的笔划进行文档对象分
类。该方法首先将文档中的笔划根据它们之间的时空上下文关系构建关系图,然
后基于 EGAT 模型对图中的节点进行特征学习。在 EGAT 模型的信息传递过程
中,本文提出使用自注意力机制以及基于二元笔划特征的边注意力机制来控制
节点间的信息传递,同时设计了边特征更新过程对边特征进行学习。该方法在联
机手写英文文档、日文文档、流程图、自动机图、草图等数据集上进行了实验,
与之前的代表性方法相比取得了更优的分类性能。
三、为了应对联机手写文档中对象之间的复杂几何关系以及文本行书写方
向任意、相互之间不必平行、距离可以任意靠近等难点,本文提出了一种基于边
池化机制的图注意力网络 (EPAT) 与距离度量学习结合的框架对联机手写文档中
的文档对象进行实例分割。该方法将文档中的实例分割问题建模为图上的节点
分类与聚类问题,并通过在笔划时空上下文关系图上进行信息传递得到分布式
的节点特征与边特征对笔划进行分类与聚合。在信息传递过程中,本文提出了边
池化机制来加强相邻边特征之间的交互,从而使得模型能有效地处理文档对象
或文本行之间的复杂几何关系。同时,本文提出使用多任务监督框架对笔划分类
和聚类任务进行联合学习,从而可以在统一的框架下对联机文档进行多层次的
分析。该方法在联机手写英文文档、日文文档、流程图数据集上进行了文档实例
分割的实验,在多个评价指标上相比之前的代表性方法取得了更优的性能。
 

English abstract

With the increasing use of tablet PCs, electronic whiteboards, and digital pens,
users can enter heterogeneous content such as text, tables, and drawings on a large
writing area, and the devices capture the resulting digital ink as online (ink) documents.
Compared with traditional offline documents produced with pen and paper, ink documents
retain more information about the input and therefore offer potential advantages for
analysis. However, the unconstrained writing style of handwritten ink documents poses
new challenges for document analysis. In general, ink strokes are first classified into
different classes, e.g., text and drawings, which are then analyzed separately. Strokes of
the same class can be further segmented into instances, e.g., a text block into text lines
or a flow chart into basic graphic units. This dissertation focuses on three important
problems in ink document analysis: text/non-text stroke classification, document object
classification, and document instance segmentation. The main contributions are
summarized as follows:

(1) For text/non-text stroke classification, we propose NN-CRF, a model that combines
neural networks (NN) with conditional random fields (CRF) to exploit both the temporal
relationships between strokes and the feature learning capacity of neural networks. The
CRF potentials are produced by neural networks that take stroke geometric features as
input, and all parameters are learned jointly by minimizing the negative log-likelihood
loss. Experiments on an online handwritten English document dataset show that the
proposed method achieves performance comparable to previous representative methods
with fewer parameters and faster speed.
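
The training objective in (1) can be illustrated with a minimal sketch of the negative log-likelihood of a linear-chain CRF over strokes in writing order. This is not the dissertation's implementation: the names `sequence_score`, `log_partition`, and `crf_nll` are hypothetical, and the fixed numbers in `unary` and `pairwise` merely stand in for scores that a neural network would produce from stroke geometric features.

```python
import math

def logsumexp(values):
    """Numerically stable log(sum(exp(v))) for a list of floats."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))

def sequence_score(unary, pairwise, labels):
    """Unnormalized log-score of one label sequence."""
    score = unary[0][labels[0]]
    for t in range(1, len(labels)):
        score += pairwise[labels[t - 1]][labels[t]] + unary[t][labels[t]]
    return score

def log_partition(unary, pairwise):
    """Forward algorithm: log of the summed exp-scores of all label sequences."""
    num_labels = len(unary[0])
    alpha = list(unary[0])
    for t in range(1, len(unary)):
        alpha = [
            unary[t][k]
            + logsumexp([alpha[j] + pairwise[j][k] for j in range(num_labels)])
            for k in range(num_labels)
        ]
    return logsumexp(alpha)

def crf_nll(unary, pairwise, labels):
    """Negative log-likelihood of the given label sequence."""
    return log_partition(unary, pairwise) - sequence_score(unary, pairwise, labels)

# Assumed scores for three strokes; label 0 = text, label 1 = non-text.
unary = [[2.0, 0.1], [1.5, 0.3], [0.2, 1.8]]
pairwise = [[0.5, -0.5], [-0.5, 0.5]]  # favors neighboring strokes sharing a label
nll = crf_nll(unary, pairwise, [0, 0, 1])
```

In joint training, the gradient of `crf_nll` would flow back through the potentials into the network that produces them; the sketch only evaluates the loss.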

(2) For classifying strokes into different document objects, we propose the edge
graph attention network (EGAT) to exploit the contextual dependencies among document
objects. Strokes are first organized into a relational graph based on their temporal and
spatial relationships, and distributed node features are then learned by the EGAT model.
In the message passing procedure of EGAT, a self-attention mechanism and an edge
attention mechanism based on pairwise stroke features control the information flow
between neighboring nodes. In addition, an edge update procedure learns better edge
features. The proposed method is validated on online handwritten English documents,
Japanese documents, flow charts, automata diagrams, and sketches. The experimental
results show that it outperforms previous representative methods.
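
The graph construction and attention-weighted message passing in (2) can be sketched roughly as follows. This is a simplified illustration, not EGAT itself: the distance threshold `radius`, the dot-product attention score, and the `edge_score` callback are assumptions, and the edge update procedure is omitted.

```python
import math

def build_stroke_graph(centroids, radius):
    """Link consecutive strokes (temporal) and strokes closer than `radius` (spatial)."""
    n = len(centroids)
    edges = {(i, i + 1) for i in range(n - 1)}
    for i in range(n):
        for j in range(i + 1, n):
            if math.dist(centroids[i], centroids[j]) < radius:
                edges.add((i, j))
    neighbors = {i: [] for i in range(n)}
    for i, j in sorted(edges):
        neighbors[i].append(j)
        neighbors[j].append(i)
    return neighbors

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_step(features, neighbors, edge_score):
    """One message passing round: attention-weighted average of neighbor features."""
    updated, weights = [], {}
    for i, h_i in enumerate(features):
        nbrs = neighbors[i]
        if not nbrs:  # isolated node: keep its feature unchanged
            updated.append(list(h_i))
            weights[i] = {}
            continue
        # Node term (dot product) plus an edge term from pairwise stroke features.
        scores = [
            sum(a * b for a, b in zip(h_i, features[j])) + edge_score(i, j)
            for j in nbrs
        ]
        alpha = softmax(scores)
        weights[i] = dict(zip(nbrs, alpha))
        updated.append(
            [sum(w * features[j][d] for w, j in zip(alpha, nbrs)) for d in range(len(h_i))]
        )
    return updated, weights

# Hypothetical stroke centroids in writing order and 2-d node features.
centroids = [(0.0, 0.0), (1.0, 0.0), (10.0, 0.0), (0.5, 0.5)]
features = [[1.0, 0.0], [0.8, 0.2], [0.0, 1.0], [0.9, 0.1]]
neighbors = build_stroke_graph(centroids, radius=2.0)
updated, weights = attention_step(
    features, neighbors, lambda i, j: -math.dist(centroids[i], centroids[j])
)
```

Here the edge term simply penalizes distant strokes; a learned model would compute it from richer pairwise features.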

(3) In online handwritten documents, the geometric relationships between document
objects are complex, and text lines can have arbitrary orientations, need not be parallel,
and can be arbitrarily close to one another. To address document instance segmentation
under these conditions, we propose the edge pooling attention network (EPAT) combined
with distance metric learning. The method formulates instance segmentation as node
classification and node clustering on a graph, and learns distributed node and edge
features by message passing on the temporal/spatial relational graph of strokes. During
message passing, an edge pooling mechanism strengthens the interaction between
neighboring edge features, which allows the model to handle complex geometric
relationships between document objects or text lines. In addition, a multi-task learning
framework trains the stroke classification and clustering tasks jointly, so that online
documents can be analyzed at multiple levels within a unified framework. The proposed
method is validated on online handwritten English documents, Japanese documents, and
flow charts. The experimental results show that it outperforms previous representative
methods on multiple evaluation metrics.
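
To make the clustering view in (3) concrete, the sketch below groups strokes into instances from pairwise distances on graph edges. The abstract does not state how learned distances are turned into instances, so the thresholded single-linkage grouping here is an assumption, and `cluster_strokes`, `edge_distances`, and `threshold` are hypothetical names with made-up values.

```python
def cluster_strokes(num_strokes, edge_distances, threshold):
    """Merge strokes joined by an edge whose distance is below `threshold`."""
    parent = list(range(num_strokes))

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for (i, j), distance in edge_distances.items():
        if distance < threshold:
            root_i, root_j = find(i), find(j)
            if root_i != root_j:
                parent[max(root_i, root_j)] = min(root_i, root_j)

    groups = {}
    for stroke in range(num_strokes):
        groups.setdefault(find(stroke), []).append(stroke)
    return sorted(groups.values())

# Assumed distances for four strokes; a small value means "same instance".
edge_distances = {(0, 1): 0.1, (1, 2): 0.9, (2, 3): 0.2, (1, 3): 0.8}
instances = cluster_strokes(4, edge_distances, threshold=0.5)
print(instances)  # [[0, 1], [2, 3]]
```

With a metric learned so that strokes of the same text line or graphic unit are close, each resulting group would correspond to one instance.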

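Keywords: online handwritten document analysis; stroke classification; instance segmentation; text line segmentation; structured prediction; conditional random field; graph neural network; attention mechanism
Language: Chinese
Seven major directions, sub-direction classification: Character recognition and document analysis
Document type: Dissertation
Identifier: http://ir.ia.ac.cn/handle/173211/43292
Collection: State Key Laboratory of Multimodal Artificial Intelligence Systems, Pattern Analysis and Learning
Recommended citation
GB/T 7714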
叶君宇. 基于结构化预测的联机手写文档分析[D]. 中国科学院自动化研究所. 中国科学院自动化研究所,2020.
Files in this item
File name/size; document type; version type; access type; license
学位论文_叶君宇.pdf (5220KB); Dissertation; (version not given); Open access; CC BY-NC-SA