Research on Multi-lingual and Cross-lingual Text Line Recognition Methods
陈卓
2021-06
Pages: 128
Degree Type: Doctoral
Abstract (Chinese)

Most existing text line recognition methods are limited to a specific single language. To handle more complex application scenarios, this thesis studies multi-lingual and cross-lingual text line recognition.

The multi-lingual text line recognition task recognizes the textual content in text line images of multiple languages. It enables massive multi-lingual documents to be processed automatically and is of great significance for document digitization. However, multi-lingual texts differ considerably in character sets, grammatical rules, and writing styles, which poses a great challenge to recognition.

The cross-lingual text line recognition task aims to recognize target-language text content from source-language text line images. It serves people who need to read foreign languages and helps them obtain the information they need from foreign-language scenarios. Many previous works cascade a text line recognizer with a machine translation model for cross-lingual recognition; however, such methods suffer from problems like error accumulation and parameter redundancy.

This thesis studies multi-lingual and cross-lingual text line recognition in depth and proposes several effective methods. The main contributions are as follows:

A multi-task multi-lingual text recognition network (MuLTReNets) is proposed for simultaneous script identification and handwritten text line recognition. The framework has two variants: a unified recognition system with a merged character set (MuLTReNetV1) and a jointly trained cascade of script identification and single-script recognition (MuLTReNetV2). Both systems train four main modules in an end-to-end manner: a feature extractor, a script identifier, a handwritten text line recognizer, and an auto-weighter. In particular, the feature extraction module adopts a separable multi-dimensional long short-term memory network (SepMDLSTM), which effectively captures the contextual semantic information in text line images. Experimental results on handwritten text line datasets of five languages show that the proposed method performs well in both script identification and handwritten text line recognition; its handwriting recognition performance not only surpasses cascade systems but even exceeds recognizers designed for specific scripts.

For practical application scenarios, a task named "end-to-end cross-lingual text line recognition" is introduced, and a cross-lingual text line recognition system is designed on a multi-task framework. Addressing the need for image text translation, the system maps an input image in one language directly to text in another language, which reduces the error accumulation between image recognition and language translation. It takes attention-based cross-lingual text line recognition as the main task and LSTM-based mono-lingual text line recognition as the auxiliary task, so it can simultaneously output the text line content in both the source and the target language, and the two tasks benefit each other. The system is evaluated on a newly collected dataset of Chinese-English bilingual movie subtitle images. Experimental results show that the multi-task learning framework performs well on both tasks.

A cross-lingual text line recognition method based on cross-modal learning is proposed. It uses semantic features from the text modality to teach the features of the image modality, improving the learning ability of the cross-lingual recognition system. With an attention mechanism and adversarial learning, the method exploits both global and local information in the semantic space, so that the target-language model learns more comprehensive semantic features. In this way, the system can be trained with abundant existing large-scale bilingual corpora and only a small amount of text images, and it achieves satisfactory results on the newly collected dataset.

Abstract (English)

Most existing text image recognition methods are limited to specific scripts. To support more complex application scenarios, this thesis studies multi-lingual and cross-lingual text recognition.

Multi-lingual text recognition aims to recognize text from images in different languages. It can process massive multi-lingual documents automatically and is of great significance for document digitization. However, multi-lingual text recognition is highly challenging because of differences in alphabets, grammar, writing styles, and so on.

Cross-lingual text recognition aims to recognize text images in a source language and output text content in a target language. The task helps people obtain information from foreign-language scenarios. Many previous methods simply cascade a text recognizer and a machine translation model for cross-lingual recognition. Nevertheless, these methods suffer from error accumulation and parameter redundancy.

This thesis studies multi-lingual and cross-lingual text recognition and proposes several effective methods. The main contributions of this work are as follows:

A multi-task learning based framework, called MuLTReNets, is proposed for simultaneous script identification and handwritten text recognition. The framework has two schemes: unified recognition with a merged alphabet (MuLTReNetV1) and jointly trained script identification followed by single-script recognition (MuLTReNetV2). In both schemes, four major modules, namely a feature extractor, a script identifier, a handwriting recognizer, and an auto-weighter, are trained in an end-to-end manner. Specifically, the feature extraction module adopts the proposed separable MDLSTM (SepMDLSTM) to effectively capture the contextual semantic information in text line images. Experimental results on handwritten datasets of five languages show that the proposed method performs well in both script identification and handwritten text recognition; in particular, it outperforms cascade systems and script-specific recognizers.
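As a rough illustration (not the thesis implementation), the following PyTorch sketch wires a shared feature extractor to a script-identification head and a CTC-based handwriting-recognition head, and balances the two losses with a learnable auto-weighter. The CNN-plus-BiLSTM extractor is only a stand-in for SepMDLSTM, and the uncertainty-style loss weighting is an assumption about how the auto-weighter might work; all module sizes are arbitrary.

```python
# Hypothetical sketch of a MuLTReNets-style multi-task model (not the thesis code).
# The SepMDLSTM feature extractor is replaced by a CNN + BiLSTM placeholder, and the
# auto-weighter uses learnable uncertainty weights; both are assumptions.
import torch
import torch.nn as nn

class MultiTaskRecognizer(nn.Module):
    def __init__(self, num_scripts=5, num_chars=1000, feat_dim=256):
        super().__init__()
        # Placeholder feature extractor (stands in for SepMDLSTM).
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(), nn.MaxPool2d((2, 1)),
        )
        self.rnn = nn.LSTM(128 * 8, feat_dim, bidirectional=True, batch_first=True)
        # Task heads: script identification and CTC-based text recognition.
        self.script_head = nn.Linear(2 * feat_dim, num_scripts)
        self.char_head = nn.Linear(2 * feat_dim, num_chars + 1)  # +1 for CTC blank
        # Auto-weighter: one learnable log-variance per task (uncertainty weighting).
        self.log_vars = nn.Parameter(torch.zeros(2))

    def forward(self, images):                      # images: (B, 1, 32, W)
        f = self.cnn(images)                         # (B, 128, 8, W/2)
        b, c, h, w = f.shape
        seq = f.permute(0, 3, 1, 2).reshape(b, w, c * h)
        seq, _ = self.rnn(seq)                       # (B, W/2, 2*feat_dim)
        script_logits = self.script_head(seq.mean(dim=1))
        char_logits = self.char_head(seq)            # per-frame logits for CTC decoding
        return script_logits, char_logits

    def weighted_loss(self, loss_script, loss_ctc):
        # Combine the task losses with learned weights (Kendall-style uncertainty weighting).
        losses = torch.stack([loss_script, loss_ctc])
        return (torch.exp(-self.log_vars) * losses + self.log_vars).sum()
```

In a MuLTReNetV2-style cascade, the script posterior would additionally select which single-script recognizer decodes the line; that routing is omitted here for brevity.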

A task called end-to-end cross-lingual text image recognition is introduced for practical applications, and a cross-lingual text recognition method based on multi-task learning is proposed for it. To address the text image translation problem, the system recognizes image text in a source language and outputs the transcript in a target language, reducing the error accumulation caused by cascading text image recognition and machine translation. Specifically, the system consists of an attention-based cross-lingual text image recognition branch as the main task and an LSTM-based mono-lingual text image recognition branch as the auxiliary task, and the two branches are trained jointly under multi-task learning. The system is evaluated on a newly collected dataset of Chinese-English bilingual movie subtitle images, and the experimental results show that the multi-task learning framework performs well on both tasks.
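Below is a minimal sketch of the two-branch design described above, assuming a shared encoder over image frame features, an attention-based decoder for the target language as the main task, and a CTC head for the source language as the auxiliary task. The attention form, the layer sizes, and the loss weight lam are illustrative assumptions rather than the thesis configuration.

```python
# Hypothetical sketch of the two-branch multi-task design: an attention-based decoder
# predicts target-language text (main task) while a CTC head over the shared encoder
# transcribes the source language (auxiliary task). Sizes and lambda are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossLingualRecognizer(nn.Module):
    def __init__(self, src_vocab, tgt_vocab, feat_dim=256, emb_dim=128):
        super().__init__()
        # Shared encoder over frame features extracted from the subtitle image.
        self.encoder = nn.LSTM(feat_dim, feat_dim, bidirectional=True, batch_first=True)
        # Auxiliary branch: frame-wise CTC over the source language.
        self.ctc_head = nn.Linear(2 * feat_dim, src_vocab + 1)      # +1 for CTC blank
        # Main branch: attention-based decoder over the target language.
        self.embed = nn.Embedding(tgt_vocab, emb_dim)
        self.attn = nn.MultiheadAttention(2 * feat_dim, num_heads=4, batch_first=True)
        self.dec_cell = nn.LSTMCell(emb_dim + 2 * feat_dim, 2 * feat_dim)
        self.out = nn.Linear(2 * feat_dim, tgt_vocab)

    def forward(self, frame_feats, tgt_tokens):
        enc, _ = self.encoder(frame_feats)               # (B, T, 2*feat_dim)
        ctc_logits = self.ctc_head(enc)                  # auxiliary source-language logits
        B = enc.size(0)
        h = enc.new_zeros(B, enc.size(2))
        c = enc.new_zeros(B, enc.size(2))
        logits = []
        for t in range(tgt_tokens.size(1)):              # teacher forcing over target tokens
            query = h.unsqueeze(1)
            ctx, _ = self.attn(query, enc, enc)           # attend over encoder frames
            x = torch.cat([self.embed(tgt_tokens[:, t]), ctx.squeeze(1)], dim=-1)
            h, c = self.dec_cell(x, (h, c))
            logits.append(self.out(h))
        return torch.stack(logits, dim=1), ctc_logits

def joint_loss(dec_logits, tgt_y, ctc_logits, src_y, src_y_len, feat_len, lam=0.5):
    # Main cross-entropy loss plus the weighted auxiliary CTC loss.
    ce = F.cross_entropy(dec_logits.transpose(1, 2), tgt_y)
    ctc = F.ctc_loss(ctc_logits.log_softmax(-1).transpose(0, 1), src_y, feat_len, src_y_len)
    return ce + lam * ctc
```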

A cross-lingual text image recognition method based on cross-modal learning is proposed. Through cross-modal learning, information from the text modality guides the image-modality model, improving cross-lingual recognition performance by learning from large bilingual text corpora. Furthermore, based on an attention mechanism and adversarial learning, both global and local information in the semantic space is integrated into the target-language model to learn more representative semantic features. Comprehensive experiments on a newly collected cross-lingual dataset demonstrate the superiority of the proposed method.
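The cross-modal guidance could be sketched roughly as below: a text-modality teacher (assumed to be pretrained on bilingual corpora) provides semantic features, a local attention-based matching loss aligns the image-branch features with them, and a gradient-reversal discriminator aligns the global feature distributions adversarially. The discriminator, the specific losses, and all hyperparameters are assumptions; the abstract does not specify them.

```python
# Hypothetical sketch of cross-modal guidance: features from a text-modality teacher
# supervise the image-modality branch through a local matching loss and a global
# adversarial alignment loss. The gradient-reversal trick and loss forms are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class GradReverse(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.clone()
    @staticmethod
    def backward(ctx, grad):
        return -ctx.lam * grad, None    # reverse gradients for adversarial training

class ModalityDiscriminator(nn.Module):
    """Predicts whether a pooled semantic feature came from the text or the image branch."""
    def __init__(self, dim=512):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, 256), nn.ReLU(), nn.Linear(256, 1))
    def forward(self, feat, lam=1.0):
        return self.net(GradReverse.apply(feat, lam))

def cross_modal_losses(img_feats, txt_feats, disc):
    """img_feats / txt_feats: (B, T, D) sequences in the shared semantic space."""
    # Local guidance: align each image-side step with the teacher's text features via attention.
    attn = torch.softmax(img_feats @ txt_feats.transpose(1, 2) / img_feats.size(-1) ** 0.5, dim=-1)
    local_loss = F.mse_loss(img_feats, attn @ txt_feats.detach())
    # Global guidance: a modality classifier trained through gradient reversal pushes the
    # pooled image features toward the distribution of the (frozen) text features.
    img_pool = img_feats.mean(1)
    txt_pool = txt_feats.mean(1).detach()            # teacher acts as a fixed reference
    logits = torch.cat([disc(img_pool), disc(txt_pool)], dim=0).squeeze(-1)
    labels = torch.cat([torch.zeros(img_pool.size(0)), torch.ones(txt_pool.size(0))])
    adv_loss = F.binary_cross_entropy_with_logits(logits, labels.to(logits.device))
    return local_loss, adv_loss
```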

Keywords: Multi-lingual Text Line Recognition; Cross-lingual Text Line Recognition; Script Identification; Multi-task Learning; Sequence-to-Sequence Learning
Language: Chinese
Sub-direction Classification: Text Recognition and Document Analysis
Document Type: Dissertation
Identifier: http://ir.ia.ac.cn/handle/173211/45033
Collection: National Laboratory of Pattern Recognition_Pattern Analysis and Learning
Recommended Citation
GB/T 7714
陈卓. Research on Multi-lingual and Cross-lingual Text Line Recognition Methods [D]. Institute of Automation, Chinese Academy of Sciences. University of Chinese Academy of Sciences, 2021.
Files in This Item:
File Name/Size: Thesis_chenzhuo_v15.(5867KB)
Document Type: Dissertation
Access: Open Access
License: CC BY-NC-SA