CASIA OpenIR
Research on Text Image Translation Methods Based on Cross-Modal Information Fusion (跨模态信息融合的文本图像翻译方法研究)
马聪 (Ma Cong)
Source: Doctoral dissertation
Date: 2024-05
Pages: 124
Degree type: Doctoral
Chinese Abstract

In recent years, end-to-end neural machine translation has developed rapidly, its application scenarios have expanded from text to multimodal translation involving images and speech, and it has attracted wide attention in academia and industry. Text image translation automatically translates the text embedded in an image (the source language) into another language (the target language), and is widely applied in photo translation, scanned document translation, and video translation. Existing work on text image translation falls into two categories: cascade methods that pipeline a text image recognition model with a machine translation model, and end-to-end methods that use a unified encoder-decoder. Prior studies have found that cascade methods must train and deploy two models, image recognition and machine translation, which leads to large parameter counts and slow decoding; moreover, recognition errors produced by the image recognition model are further amplified by the translation model, causing error propagation. End-to-end methods have the advantages of fewer parameters and faster decoding, but face data scarcity and cannot fully exploit additional image recognition and machine translation data, so end-to-end text image translation performs poorly. To address these problems, this thesis studies how to fuse cross-modal information into end-to-end text image translation from three aspects, namely data, features, and parameters, in order to improve translation quality. The main contributions of the thesis are summarized as follows:

(1) A text image translation method based on cross-modal data interactive fusion

End-to-end text image translation data is scarce and costly to annotate, while training an end-to-end model requires large-scale data for parameter optimization; as a result, end-to-end text image translation models are often inadequately trained and translate poorly. To address this problem, this thesis proposes a text image translation method based on cross-modal data interactive fusion. Through parameter sharing and multi-task training, the method fuses the cross-modal image and text data of the text image recognition and machine translation tasks into the training of the end-to-end text image translation model, and during decoding it uses interactive attention to fuse the history of text image recognition and enhance translation. Experiments show that the proposed method significantly improves translation quality in three text image translation domains: synthetic text images, video subtitle images, and street-view photos.

(2) A text image translation method based on cross-modal feature contrastive fusion

A text image and its corresponding source-language text are semantically identical, so the two kinds of samples should have similar representations in the semantic feature space. However, existing text image translation methods encode images and text separately, so semantically identical images and texts are encoded into different feature subspaces, which makes the parameters of the target-language decoder harder to optimize and harms translation performance. To address this problem, this thesis proposes a text image translation method based on cross-modal feature contrastive fusion. The method uses cross-modal contrastive learning to pull together the features of semantically identical images and texts; on this basis, intra-modal contrastive learning on images pulls together text images that share the same content but have different backgrounds, and intra-modal contrastive learning on texts pulls together source-language sentences with similar content but different wording. Experiments show that the method effectively maps images and texts into a shared semantic space in which semantically similar images and texts have similar feature representations, and that it significantly improves translation performance by strengthening the semantic encoding ability of the image encoder.

(3) A text image translation method based on cross-modal parameter-efficient fusion

The encoder of a text image recognition model and the decoder of a machine translation model serve the same functions as the corresponding modules of a text image translation model, and both models are thoroughly trained on their own tasks, so transferring their pretrained parameters to text image translation promises performance gains. However, because the two models are built for different tasks, the feature dimensions and feature distributions produced by the image and text encoders are inconsistent, so the model parameters cannot be directly connected for end-to-end text image translation. To this end, this thesis proposes a text image translation method based on cross-modal parameter-efficient fusion. The method connects the image encoder of text image recognition and the text decoder of machine translation through a trainable modal adapter, and updates the adapter parameters with parameter-efficient fine-tuning. Furthermore, a multi-level, multi-granularity knowledge transfer loss is introduced during training to transfer the feature knowledge of the machine translation model into text image translation, reducing the difficulty of optimizing the modal adapter. Experiments show that the method effectively connects the pretrained image recognition and machine translation parameters, requires fewer trainable parameters than both cascade and end-to-end text image translation methods, and achieves significant improvements on synthetic text image, video subtitle image, and street-view photo translation test sets.

In summary, this thesis studies text image translation with cross-modal information fusion in depth and, for the three kinds of cross-modal information between images and text, namely data, features, and parameters, proposes corresponding cross-modal information fusion and knowledge transfer methods. Experiments show that the three proposed methods build on one another progressively, can be combined with one another, and significantly improve the translation quality of end-to-end text image translation.

English Abstract

In recent years, end-to-end neural machine translation methods have developed rapidly, expanding translation scenarios from text to multimodal fields such as image and speech. These advances have attracted significant attention in both academia and industry. Text image translation aims to automatically translate the source-language text embedded in an image into the target language, and is widely used in photo translation, scanned document translation, and video translation. Existing work on text image translation falls into two main categories: cascade methods that connect text image recognition and machine translation models, and end-to-end methods that use a unified encoder-decoder architecture. Existing research shows that cascade methods must train and deploy two models, which leads to parameter redundancy and decoding latency; furthermore, recognition errors are propagated into the translation model, causing additional mistakes in the final translation results. End-to-end methods have the advantages of a parameter-efficient architecture and fast decoding, but face the challenge of data scarcity and cannot fully utilize additional image recognition and machine translation datasets, resulting in degraded performance. To address these problems, this thesis explores how to incorporate cross-modal information into the end-to-end text image translation model from the aspects of data, features, and parameters. The main contributions are summarized as follows:

(1) Cross-Modal Data Interactive Fusion Based Text Image Translation

End-to-end text image translation data is scarce and costly to annotate, while training an end-to-end model requires large-scale data for parameter optimization; as a result, end-to-end text image translation models are often inadequately trained and their translation performance is limited. To address this problem, this thesis proposes a text image translation method based on cross-modal data interactive fusion. The method fuses the cross-modal image and text data of the text image recognition and machine translation tasks into the training of the end-to-end text image translation model through parameter sharing and multi-task training. In addition, interactive attention is used in the decoder to fuse text image recognition information and enhance translation generation. Experimental results show that the proposed method significantly improves translation quality on synthetic text image, video subtitle image, and street-view photo test sets.
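To make the parameter-sharing and multi-task training idea concrete, the following PyTorch sketch shows one way recognition batches (image to source text), translation batches (source to target text), and text image translation batches (image to target text) could be trained through a single shared decoder. The toy CNN image encoder, the module names, and the 0.5 loss weights are illustrative assumptions, not the architecture or hyperparameters used in the thesis.

```python
import torch.nn as nn
import torch.nn.functional as F

# Toy sketch of cross-modal multi-task training with a shared decoder.
class SharedDecoderModel(nn.Module):
    def __init__(self, vocab=1000, d=256):
        super().__init__()
        # toy image encoder: a text-line image (B, 1, H, W) -> a feature sequence
        self.cnn = nn.Sequential(
            nn.Conv2d(1, d, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d((1, 32)))
        self.embed = nn.Embedding(vocab, d)
        enc = nn.TransformerEncoderLayer(d, nhead=4, batch_first=True)
        self.text_encoder = nn.TransformerEncoder(enc, num_layers=2)
        dec = nn.TransformerDecoderLayer(d, nhead=4, batch_first=True)
        self.shared_decoder = nn.TransformerDecoder(dec, num_layers=2)  # shared by all tasks
        self.out = nn.Linear(d, vocab)

    def encode(self, src, modality):
        if modality == "image":
            feats = self.cnn(src)                     # (B, d, 1, 32)
            return feats.squeeze(2).transpose(1, 2)   # (B, 32, d)
        return self.text_encoder(self.embed(src))     # (B, T, d)

    def forward(self, src, tgt_in, modality):
        memory = self.encode(src, modality)
        hidden = self.shared_decoder(self.embed(tgt_in), memory)
        return self.out(hidden)                       # (B, T, vocab)

def seq_ce(logits, gold):
    return F.cross_entropy(logits.reshape(-1, logits.size(-1)), gold.reshape(-1))

def multitask_step(model, tit, ocr, mt):
    """tit/ocr/mt are dicts of image tensors and source/target token ids."""
    loss = seq_ce(model(tit["image"], tit["tgt_in"], "image"), tit["tgt_out"])               # image -> target text
    loss = loss + 0.5 * seq_ce(model(ocr["image"], ocr["src_in"], "image"), ocr["src_out"])  # image -> source text
    loss = loss + 0.5 * seq_ce(model(mt["src"], mt["tgt_in"], "text"), mt["tgt_out"])        # source -> target text
    return loss
```

Because the decoder and output projection are shared across all three tasks, the comparatively plentiful recognition and translation corpora help optimize parameters that the data-scarce text image translation task alone could not train well.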

(2) Cross-Modal Feature Contrastive Fusion Based Text Image Translation

Text images and their corresponding source-language texts are semantically identical and differ only in modality, so the two types of samples should have similar representations in the semantic feature space. However, existing text image translation methods encode images and texts separately, so semantically identical images and texts are encoded into different feature subspaces, which makes it difficult to optimize the parameters of the translation decoder and harms translation performance. To address this problem, this thesis proposes a text image translation method based on cross-modal feature contrastive fusion. The method brings semantically identical image and text features closer through cross-modal contrastive learning; in addition, image intra-modal and text intra-modal contrastive losses pull together semantically similar image pairs and text pairs. Experimental results show that the proposed method effectively maps semantically identical images and texts into a shared semantic space with similar feature representations, and significantly improves text image translation performance by enhancing the semantic encoding capability of the image encoder.
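The contrastive objective can be sketched as an in-batch InfoNCE-style loss over pooled sentence-level features; the mean-pooled inputs, the temperature value, and the symmetric formulation below are assumptions for illustration rather than the exact loss used in the thesis.

```python
import torch
import torch.nn.functional as F

# In-batch contrastive loss: matched pairs (the diagonal of the similarity
# matrix) are pulled together, all other in-batch pairs are pushed apart.
def contrastive_loss(feats_a, feats_b, temperature=0.1):
    """feats_a, feats_b: (B, d) pooled features where row i of each is a positive pair."""
    a = F.normalize(feats_a, dim=-1)
    b = F.normalize(feats_b, dim=-1)
    logits = a @ b.t() / temperature                      # (B, B) cosine similarities
    labels = torch.arange(a.size(0), device=a.device)
    return 0.5 * (F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels))

# cross-modal term: pooled image-encoder and text-encoder outputs of the same sentence
img_feats, txt_feats = torch.randn(8, 256), torch.randn(8, 256)
loss_cross = contrastive_loss(img_feats, txt_feats)

# intra-modal terms reuse the same loss, e.g. two renderings of the same text
# with different backgrounds (image side) or two similar source sentences (text side)
loss_intra_img = contrastive_loss(torch.randn(8, 256), torch.randn(8, 256))
```

The same function covers all three terms: the cross-modal term pairs an image with its transcription, while the intra-modal terms pair two renderings of the same text or two similarly worded source sentences.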

(3) Cross-Modal Parameter-Efficient Fusion Based Text Image Translation

The encoder of text image recognition and the decoder of machine translation serve the same functions as the corresponding modules of the text image translation model, and both models are adequately trained on their own tasks, so transferring their pretrained parameters to text image translation promises performance improvements. However, because the two models are built for different tasks, the feature dimensions and feature distributions produced by the image and text encoders are inconsistent, and the model parameters cannot be directly connected for end-to-end text image translation. To address this problem, this thesis proposes a text image translation method based on cross-modal parameter-efficient fusion. The method connects the image encoder of text image recognition and the text decoder of machine translation through a trainable modal adapter, and updates the adapter parameters with parameter-efficient fine-tuning. Furthermore, a multi-level, multi-granularity knowledge transfer loss is used during training to transfer the feature knowledge of the machine translation model into text image translation, reducing the difficulty of optimizing the modal adapter. Experimental results show that the proposed method effectively connects the pretrained models with fewer trainable parameters than both cascade and end-to-end approaches, and significantly improves translation performance on synthetic, subtitle, and street-view text image test sets.
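A minimal sketch of the modal-adapter idea follows, with single layers standing in for the pretrained OCR and MT components and a sentence-level MSE term standing in for the multi-level, multi-granularity knowledge-transfer loss; every module name and size here is an illustrative assumption.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

d_ocr, d_mt, vocab = 256, 512, 1000
ocr_encoder = nn.Linear(80, d_ocr)       # stand-in: image frame features -> OCR features
mt_encoder = nn.Embedding(vocab, d_mt)   # stand-in: source tokens -> MT text features
mt_decoder = nn.Linear(d_mt, vocab)      # stand-in: decoder states -> target logits

adapter = nn.Sequential(                 # trainable modal adapter: OCR space -> MT space
    nn.Linear(d_ocr, d_mt), nn.GELU(), nn.LayerNorm(d_mt), nn.Linear(d_mt, d_mt))

for frozen in (ocr_encoder, mt_encoder, mt_decoder):
    for p in frozen.parameters():
        p.requires_grad = False          # parameter-efficient: only the adapter updates

optimizer = torch.optim.AdamW(adapter.parameters(), lr=1e-4)

# one toy training step
frames = torch.randn(4, 20, 80)                    # a batch of text-line image frames
src_tokens = torch.randint(0, vocab, (4, 20))      # the corresponding source sentences
gold = torch.randint(0, vocab, (4, 20))            # target-language references

adapted = adapter(ocr_encoder(frames))                           # (4, 20, d_mt)
with torch.no_grad():
    teacher = mt_encoder(src_tokens)                             # MT features of the same sentences
transfer = F.mse_loss(adapted.mean(dim=1), teacher.mean(dim=1))  # knowledge-transfer term
logits = mt_decoder(adapted)
nll = F.cross_entropy(logits.reshape(-1, vocab), gold.reshape(-1))

(nll + transfer).backward()
optimizer.step()
```

Only the adapter receives gradient updates, which is what keeps the trainable parameter count well below that of retraining either pretrained model or the full cascade.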

In summary, this thesis investigates text image translation with cross-modal information fusion and proposes corresponding fusion and knowledge transfer methods for the cross-modal data, features, and parameters shared between images and texts. Experimental results show that the three proposed methods build on one another progressively and can be combined with each other, significantly improving the translation quality of end-to-end text image translation.

Keywords: text image translation; cross-modal information fusion; multi-task learning; cross-modal contrastive learning; parameter-efficient fine-tuning
Language: Chinese
Publisher: Doctoral dissertation
Document type: Dissertation
Identifier: http://ir.ia.ac.cn/handle/173211/57614
Collection: 毕业生_博士学位论文 (Graduates / Doctoral Dissertations)
Recommended citation (GB/T 7714):
马聪. 跨模态信息融合的文本图像翻译方法研究[D], 2024.

Files in this item:
File name/size: 马聪-学位论文-含签字-20240516 (11285 KB)
Document type: Dissertation; Access: Restricted; License: CC BY-NC-SA