English abstract | With the increasing use of tablet PCs, electronic whiteboards, and digital pens,
users can input various heterogeneous content such as text, tables, and drawings on
a large writing area, and these devices capture the input as digital ink. Compared to
traditional offline documents recorded with pen and paper, ink documents contain more
information and thus offer potential advantages for analysis. However, the unconstrained
writing style in handwritten ink documents brings new challenges to document analysis.
In general, ink document analysis first classifies the ink strokes into different classes, e.g. text and drawings, which are then analyzed separately. Strokes of the
same class can be further segmented into different instances, e.g. a text block
into individual text lines or a flow chart into basic graphic units. This
dissertation focuses on three important issues of ink document analysis: text/non-text
stroke classification, document object classification, and document instance segmentation. The main contributions of this dissertation are summarized as follows:
(1) For text/non-text stroke classification, we propose the NN-CRF model, which combines neural networks (NN) and conditional random fields (CRF) to exploit both the
temporal relationships between strokes and the feature learning capacity of neural networks. In this method, the potentials of the CRF are output by neural networks that take stroke
geometric features as input. All parameters are learned jointly by minimizing the negative log-likelihood loss. The proposed method is validated on online handwritten English documents.
The experimental results show that the proposed method achieves performance comparable to previous representative
methods with fewer parameters and faster speed.
(2) For classifying strokes into different document objects, we propose the edge
graph attention network (EGAT) to exploit the contextual dependencies among document objects. In this method, strokes are first represented as a relational graph based on their
temporal and spatial relationships, and distributed node features are then learned by
the EGAT model. In the message passing procedure of the EGAT model, we use self-attention and edge-attention mechanisms to control the information
flow between neighboring nodes. In addition, we propose an edge update procedure for
learning better edge features. The proposed method is validated on online handwritten
English, Japanese, flow chart, automata, and sketch documents. The experimental results
show that the proposed method outperforms previous representative methods.
(3) In online handwritten documents, document objects have complex geometric relationships, and
text lines can have arbitrary orientations, need not be parallel, and can be arbitrarily close to one another.
We therefore propose the edge pooling attention network (EPAT) with
distance metric learning to address the document instance segmentation problem. The
proposed method formulates the document instance segmentation problem as node classification and node clustering problems in the graph and learns distributed node and edge
features by conducting the message passing procedure in the temporal/spatial relational
graph. In the message passing procedure, an edge pooling mechanism is used to
enhance the interaction between neighboring edge features, so that the proposed method
can handle complex geometric relationships between document objects or text lines. In
addition, a multi-task learning framework is used to learn the stroke classification and
clustering tasks simultaneously, so that online document analysis can be carried out in a
unified framework. The proposed method is validated on online handwritten English,
Japanese, and flow chart documents. The experimental results show that the proposed
method is superior to previous representative methods on various evaluation metrics. |
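As an illustration of the idea behind contribution (1), the following is a minimal sketch of a linear-chain CRF over a temporally ordered stroke sequence, trained with the negative log-likelihood loss. It is not the dissertation's implementation. The one-layer scorer `unary_scores`, the two toy geometric features per stroke, the weight values, and the fixed `pairwise` table are all assumptions made for illustration; in the abstract, the CRF potentials are output by neural networks, whereas here the pairwise potentials are a constant table to keep the example short.

```python
import math
from itertools import product


def logsumexp(values):
    """Numerically stable log(sum(exp(v))) for a list of floats."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def unary_scores(features, weights, bias):
    """Hypothetical one-layer 'network': stroke features -> per-class scores."""
    return [
        [sum(w * x for w, x in zip(row, f)) + b for row, b in zip(weights, bias)]
        for f in features
    ]


def sequence_score(unary, pairwise, labels):
    """Unnormalized log-score of one labeling of the stroke sequence."""
    score = unary[0][labels[0]]
    for t in range(1, len(labels)):
        score += pairwise[labels[t - 1]][labels[t]] + unary[t][labels[t]]
    return score


def log_partition(unary, pairwise):
    """Forward algorithm: log of the summed scores over all labelings."""
    n_classes = len(unary[0])
    alpha = list(unary[0])
    for t in range(1, len(unary)):
        alpha = [
            unary[t][j]
            + logsumexp([alpha[i] + pairwise[i][j] for i in range(n_classes)])
            for j in range(n_classes)
        ]
    return logsumexp(alpha)


def crf_nll(unary, pairwise, labels):
    """Negative log-likelihood of the given labeling under the chain CRF."""
    return log_partition(unary, pairwise) - sequence_score(unary, pairwise, labels)


# Toy input (assumed): three strokes in writing order, two features each.
features = [[0.2, 1.0], [0.3, 0.9], [2.5, 0.1]]
weights = [[-1.0, 1.0], [1.0, -1.0]]    # row 0: text, row 1: non-text
bias = [0.0, 0.0]
pairwise = [[0.5, -0.5], [-0.5, 0.5]]   # favors equal labels on adjacent strokes

unary = unary_scores(features, weights, bias)
loss = crf_nll(unary, pairwise, [0, 0, 1])

# Sanity check: probabilities of all 2**3 labelings should sum to 1.
total = sum(
    math.exp(-crf_nll(unary, pairwise, list(y)))
    for y in product(range(2), repeat=len(unary))
)
print(loss, total)
```

In a joint training setup, `loss` would be differentiated with respect to both the scorer parameters and the pairwise potentials, which is what lets the feature extractor and the CRF be learned together.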