Research on Neural Machine Translation Methods Incorporating Image Information
黄鑫
2023-06-02
Pages: 113
Degree type: Doctorate
Chinese Abstract

End-to-end neural machine translation has advanced rapidly in recent years. Compared with traditional statistical machine translation, it delivers significantly better translation quality and also shows unique advantages in fusing cross-modal information. Image-aware neural machine translation uses the visual information in images to improve text translation quality within a translation model based on the encoder-decoder framework. Images often carry information beyond the text that can supplement or emphasize the textual content, so two feasible ways to exploit images in machine translation are to add image information during sentence encoding to enrich the text representation, and to provide image information as a reference during decoding to guide the generation of the translation. Although images matter for the translation process, fusing image information into neural machine translation still faces several challenges: weak cross-modal alignment causes the translation model to ignore image information and degenerate into a text-only model; image information is hard to use effectively, so the quality gain from feeding the model a real image is barely larger than the gain from feeding it noise; and the role of image information is unclear, which makes targeted improvements to the model difficult.

This thesis studies how to design effective cross-modal information fusion methods between images and text to improve the quality of neural machine translation, for example by making the role of image information in the text explicit so as to avoid the model's insensitivity to images, or by strengthening the role of image information during training to increase the model's sensitivity to it.

The main contributions of the thesis are summarized as follows:

1. A neural machine translation method based on cross-modal text reconstruction

Conventional approaches to fusing image information into neural machine translation do so implicitly: images and sentences are fed into the translation model together and take part in encoding or decoding, so that image and text information are fully fused. However, in such methods it is unclear how the image information actually acts. To explore whether explicit cross-modal information fusion is feasible, this thesis proposes a neural machine translation method based on cross-modal text reconstruction. During training, the method explicitly replaces the positions of nouns or phrases in the source sentence with the corresponding visual objects from the image, and feeds this cross-modal sequence into a reconstruction model that generates the complete sentence. Finally, the reconstruction model's parameters are shared with the translation model, which improves translation quality. Experiments show that the method improves the performance of the text-only translation model by raising the translation accuracy of entity words.

2. A neural machine translation method based on bidirectional cross-modal entity reconstruction

To exploit image information further in the explicit setting while incorporating the advantages of implicit methods, this thesis proposes a neural machine translation method based on bidirectional cross-modal entity reconstruction. Unlike the previous method, which reconstructs at the text level, this method performs bidirectional reconstruction between textual entities and visual entities. To fuse image information further into the textual context, reconstruction of non-entity text is added as well. The three reconstruction tasks are then combined with the machine translation task through multi-task learning. Experiments show that the method further improves translation quality without requiring images at test time, and that the multi-task combination of bidirectional entity reconstruction and non-entity reconstruction is effective.

3. A neural machine translation method based on image-text contrastive adversarial training

When a sentence contains ambiguous words or incomplete semantics, the image must be fed into the translation model so that its additional information can assist the translation process and yield a more accurate translation. To this end, this thesis proposes a neural machine translation method based on image-text contrastive adversarial training. To draw the bilingual semantics closer, the method adds contrastive learning between image-text pairs and target-language sentences on the encoder side, and introduces adversarial samples, formed by pairing source sentences with wrong images, into the negative sample set. To separate positive from negative samples, the contrastive loss forces the translation model to judge whether the image information is semantically consistent with the source sentence and to fuse the correct image information into the text representation, thereby strengthening the role of visual information in the model. Experiments show that the method enables the translation model to use image information more effectively to improve translation accuracy.

In summary, this thesis aims to design better image information fusion methods and to strengthen the effect of image information in neural machine translation models. The explicit cross-modal information fusion method, the implicit cross-modal information fusion method, and their combination designed in this thesis can effectively fuse image information into the translation model during training to optimize its parameters, or fuse it with the sentence to be translated at test time to enrich the source representation, ultimately improving translation quality.

English Abstract

End-to-end neural machine translation methods have evolved rapidly in recent years. Compared with traditional statistical machine translation, they not only deliver significantly better translation quality but also show unique advantages in cross-modal information fusion. Incorporating image information into neural machine translation means using the visual information of an image to improve text translation quality in a translation model based on the encoder-decoder framework. Images often contain information beyond the text. Image information can therefore be added during sentence encoding to improve the text representation, or provided as a reference during decoding to guide the generation of the translation; both are feasible ways of fusing image information into neural machine translation. Although image information is important for the translation process, cross-modal information fusion and utilization still face several challenges. For example, the cross-modal alignment between sentences and images is weak, so the translation model tends to ignore image information and degenerate into a conventional text-only model; image information is difficult to exploit, and the difference in translation quality between inputting a real image and inputting noise is minimal; and the role of image information in a machine translation model is unclear, which makes it hard to improve the model in a targeted manner.

This thesis focuses on designing effective cross-modal information fusion methods between images and text to improve the quality of neural machine translation, for example by clarifying the role of image information in sentences to avoid the model's insensitivity to images, or by strengthening the role of image information during training to increase the model's sensitivity to it.

The main work and contributions of this thesis are summarized as follows:

1. Neural machine translation based on cross-modal text reconstruction

The conventional approaches to fusing image information into neural machine translation adopt implicit cross-modal information fusion: images and sentences are fed into the translation model together to participate in the encoding or decoding process, so that image and text information are fully fused. However, in these methods it is unclear how the image information actually works. To explore whether explicit cross-modal information fusion is feasible, this thesis proposes a neural machine translation method based on cross-modal text reconstruction. During training, the method explicitly replaces the positions of nouns or phrases in the source-language sentence with the corresponding visual objects from the image, and feeds this cross-modal sequence into a reconstruction model to generate the complete sentence. Finally, the parameters of the reconstruction model are shared with the translation model to improve translation quality. Experiments show that this method improves the performance of the text-only translation model by raising the translation accuracy of entity words.
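The core operation of this explicit fusion can be pictured with a minimal PyTorch-style sketch. This is an illustration, not the thesis's actual implementation: it assumes pre-extracted region features for the detected visual objects and known alignments between noun/phrase spans and objects; all names, dimensions, and the projection layer are assumptions.

```python
import torch
import torch.nn as nn

class VisualProjector(nn.Module):
    """Maps detected visual-object features (e.g., 2048-d region features,
    an assumed representation) into the model's embedding space."""
    def __init__(self, vis_dim=2048, d_model=512):
        super().__init__()
        self.proj = nn.Linear(vis_dim, d_model)

    def forward(self, obj_feats):           # (num_objects, vis_dim)
        return self.proj(obj_feats)         # (num_objects, d_model)

def build_cross_modal_sequence(token_emb, entity_spans, obj_emb):
    """Replace the embeddings at noun/phrase positions with the embeddings
    of their aligned visual objects. The resulting mixed sequence is what
    the reconstruction model would decode back into the full sentence.

    token_emb:    (seq_len, d_model) source token embeddings
    entity_spans: list of (start, end) spans, each aligned to one object
    obj_emb:      (num_objects, d_model) projected object embeddings
    """
    mixed = token_emb.clone()
    for obj, (start, end) in zip(obj_emb, entity_spans):
        mixed[start:end] = obj              # broadcast object over its span
    return mixed

# Usage: a 10-token sentence with two aligned visual objects.
tokens = torch.randn(10, 512)
objects = VisualProjector()(torch.randn(2, 2048))
mixed = build_cross_modal_sequence(tokens, [(1, 3), (6, 7)], objects)
```

Because the reconstruction model shares parameters with the translation model, training it to recover the sentence from this mixed sequence also updates the translation parameters.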

2. Neural machine translation based on bidirectional cross-modal entity reconstruction

To exploit image information more fully in an explicit way while combining the advantages of implicit methods, this thesis proposes a neural machine translation method based on bidirectional cross-modal entity reconstruction. Unlike the previous method, which performs text-level reconstruction, this method performs bidirectional reconstruction between textual and visual entities. At the same time, to further fuse image information into the sentence context, non-entity text reconstruction is also applied. The three reconstruction tasks are then combined with the translation task through multi-task learning, as sketched below. Experiments show that this method further improves translation quality without requiring input images in the testing phase, and that the multi-task combination of bidirectional entity reconstruction and non-entity reconstruction is effective.
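A minimal sketch of the multi-task combination described above, assuming a simple weighted sum of per-task losses; the task decomposition follows the description, but the weights and names are illustrative assumptions.

```python
def multitask_loss(loss_mt, loss_txt2vis, loss_vis2txt, loss_nonentity,
                   w_mt=1.0, w_rec=0.3):
    """Combine the machine translation loss with the three reconstruction
    losses: text-to-visual entity, visual-to-text entity, and non-entity
    text reconstruction."""
    reconstruction = loss_txt2vis + loss_vis2txt + loss_nonentity
    return w_mt * loss_mt + w_rec * reconstruction

# Usage with scalar stand-ins for the per-task losses:
total = multitask_loss(2.31, 0.87, 0.92, 1.05)
```

Since the reconstruction tasks serve only as training signals, they can be dropped at test time, which is why the method needs no image during inference.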

3. Neural machine translation based on image-text contrastive adversarial training

When a sentence contains ambiguous words or incomplete semantics, images must be fed into the translation model so that their additional information can guide the translation process and yield a more accurate translation. To this end, this thesis proposes a neural machine translation method based on image-text contrastive adversarial training. To bridge the semantic gap between the two languages, contrastive learning between image-text pairs and target-language sentences is applied on the encoder side, and adversarial samples, formed by combining source-language sentences with wrong images, are introduced into the negative sample set. To distinguish positive from negative samples, the model must judge whether the image information is consistent with the semantics of the source sentence; this forces it to fuse the correct image information into the text representation, thereby strengthening the role of visual information in the model. Experiments show that this method uses image information more effectively and improves translation accuracy.
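Read as an InfoNCE-style objective, the loss described above might look like the following PyTorch sketch: the fused (source, image) representation should be closest to the target-language sentence representation when the image is correct, while the negatives include adversarial pairs built from wrong images. The pooling into single vectors, the temperature value, and all names are assumptions.

```python
import torch
import torch.nn.functional as F

def contrastive_adversarial_loss(fused_pos, fused_neg, tgt_repr, tau=0.07):
    """fused_pos: (B, d) encoding of each source sentence with its correct image
    fused_neg: (B, K, d) encodings of the same sentence with K wrong images
    tgt_repr:  (B, d) target-language sentence representations
    """
    fused_pos = F.normalize(fused_pos, dim=-1)
    fused_neg = F.normalize(fused_neg, dim=-1)
    tgt_repr = F.normalize(tgt_repr, dim=-1)

    pos_sim = (fused_pos * tgt_repr).sum(-1, keepdim=True) / tau     # (B, 1)
    neg_sim = torch.einsum('bkd,bd->bk', fused_neg, tgt_repr) / tau  # (B, K)

    logits = torch.cat([pos_sim, neg_sim], dim=1)                    # (B, 1+K)
    labels = torch.zeros(logits.size(0), dtype=torch.long,
                         device=logits.device)
    return F.cross_entropy(logits, labels)   # positive sits at index 0

# Usage: batch of 4 sentences, 8 adversarial negatives each, 512-d vectors.
loss = contrastive_adversarial_loss(
    torch.randn(4, 512), torch.randn(4, 8, 512), torch.randn(4, 512))
```

Minimizing this loss can only separate positives from adversarial negatives if the encoder actually attends to the image, which is the mechanism by which the method strengthens the role of visual information.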

To sum up, this thesis aims to design better image information fusion methods and to strengthen the effect of image information in neural machine translation models. The explicit cross-modal information fusion method, the implicit cross-modal information fusion method, and their combination designed in this thesis can effectively fuse image information into the translation model during the training phase to optimize model parameters, or fuse it with the sentence to be translated during the test phase to enrich the source representation, ultimately improving the model's translation quality.

Keywords: neural machine translation; cross-modal information fusion; multi-task learning; contrastive learning
Language: Chinese
Sub-direction classification: Natural Language Processing
State Key Laboratory planned research direction: Speech and Language Processing
Associated dataset requiring deposit:
Document type: Dissertation
Identifier: http://ir.ia.ac.cn/handle/173211/52128
Collection: Graduates - Doctoral Dissertations
Recommended citation (GB/T 7714): 黄鑫. 融合图片信息的神经机器翻译方法研究[D], 2023.
Files in this item: thesis.final.2023062 (10395 KB), dissertation, restricted access, license CC BY-NC-SA