跨语言语义关联增强的无监督机器翻译方法研究 (Research on Unsupervised Machine Translation Methods with Enhanced Cross-Lingual Semantic Correlations)
陆金梁
2024-05-14
Pages: 116
Subtype: Doctoral dissertation
Abstract

In recent years, end-to-end neural machine translation has advanced rapidly, achieving significant improvements in translation quality over traditional methods. However, training neural machine translation models relies heavily on large-scale bilingual parallel data, which makes it difficult to build advanced machine translation systems for low-resource languages. To alleviate this problem, unsupervised machine translation methods have emerged; they aim to model translation using only monolingual data, and their core idea is to mine implicit cross-lingual semantic correlations from the data distribution patterns of the two languages. However, because current methods lack explicit modeling of cross-lingual semantic correlations, the resulting models converge slowly and show a clear performance gap compared with supervised machine translation. Therefore, explicitly modeling and enhancing cross-lingual semantic correlations at each stage of unsupervised machine translation is important for accelerating model convergence, improving translation performance, and promoting practical applications. This thesis studies methods for enhancing cross-lingual semantic correlations around three stages of unsupervised machine translation, namely parameter initialization, training, and decoding, with the goal of substantially improving unsupervised translation performance. The main contributions and innovations of this thesis are summarized as follows:

 

1. A semantic-level code-switching based pre-training method for model parameter initialization

 

The parameters of unsupervised machine translation models are generally initialized with cross-lingual pre-trained models, so the cross-lingual nature of the pre-trained parameters determines the basic performance of unsupervised machine translation. Because parallel corpora are unavailable, cross-lingual pre-trained models generally do not model explicit cross-lingual signals during training, which limits the cross-lingual alignment capability of the initialization parameters. This thesis finds through analysis that language models can learn accurate lexical alignment patterns from monolingual data in multiple languages alone. However, the absence of explicit cross-lingual signals causes the isomorphism between the semantic spaces of the two languages to weaken as the number of model layers increases, which restricts the model's cross-lingual ability. To address this problem, this thesis proposes a translation model pre-training method based on semantic-level code-switching, which uses monolingual data to construct language-mixed training samples in the semantic space and requires the model to predict masked tokens or sequences from the language-mixed context. The proposed method can effectively model cross-lingual correlations without any external parallel corpora or bilingual dictionaries, enhancing the cross-lingual alignment capability of the pre-trained parameters. Experiments show that initializing unsupervised translation models with the resulting pre-trained parameters brings significant performance improvements.

 

2. A back-translated data filtering method based on sample difficulty and translation quality

 

During the training of unsupervised machine translation, monolingual data is typically turned into pseudo-parallel training data through iterative back-translation. However, because samples differ in translation difficulty and the model's ability varies, the quality of the pseudo-parallel sentence pairs produced by back-translation is often uneven. High-quality training data provide correct lexical or phrase alignment patterns, whereas low-quality samples do the opposite. Therefore, this thesis proposes a back-translated data filtering method based on sample difficulty and translation quality, which on the one hand balances data quality during training and on the other hand encourages the model to learn correct cross-lingual semantic alignments from high-quality parallel data while suppressing the influence of erroneous alignments. Specifically, the method first defines sample translation difficulty according to the degree of cross-lingual mapping and collects samples within a difficulty range appropriate to the model's current ability, reducing the amount of training data with markedly uneven quality constructed in the early stage of training. The method then uses the model's own cross-lingual properties to compute, in real time, quality scores for the tokens and sentences of the pseudo-parallel pairs and uses these scores to weight the loss at the corresponding positions. This strategy weights the losses of samples of different quality and reduces the influence of low-quality back-translated data. Experimental results show that the method effectively improves the translation quality of unsupervised machine translation models and accelerates convergence.

 

3. A model self-correction based method for improving back-translated data

 

The back-translated data constructed during unsupervised machine translation training inevitably contains inaccurate cross-lingual semantic alignments, which easily biases the translation knowledge the model learns. To alleviate this problem, this thesis proposes a back-translated data refinement method based on model self-correction. Specifically, the method first computes the lexical alignments of pseudo-parallel sentence pairs using optimal transport theory and, from the degree of alignment, identifies token positions that are likely erroneous together with the magnitude of the corresponding semantic deviations. It then searches for plausible translations using the lexical alignments encoded in the model's word embedding parameters and softly corrects the erroneous positions according to the deviation magnitude, thereby improving the quality of the back-translated data. As training progresses and the model's translation ability improves, the method turns into a data augmentation method in the later stage, producing training samples that are semantically equivalent but different in form. On this basis, the thesis proposes a second-stage training objective that constrains the output consistency between the original and augmented samples. Experiments show that the method effectively improves the quality of back-translated data and significantly improves the translation performance of both traditional unsupervised machine translation models and unsupervised machine translation models based on large language models.

 

4. A mutual information based re-ranking method for translation decoding

 

The emergence and development of large language models have gradually shifted research on unsupervised machine translation toward LLM-based methods. Traditional unsupervised machine translation models the translation probability from the source language to the target language through iterative back-translation, so decoding translations with greedy search or beam search is highly reliable. Large language models, however, do not involve direct translation modeling during pre-training, so given a translation instruction they are prone to decoding results that are unfaithful to the source text, such as mistranslations, omissions, and translation hallucinations. To mitigate this problem, this thesis proposes a mutual information based re-ranking decoding method for unsupervised machine translation. Unlike traditional decoding strategies, the proposed method maximizes not only the output probability but also the mutual information between the translation and the source text. Through derivation, the computation of mutual information can be transformed into computing the probability gain of reconstructing the source text from translation fragments. Specifically, when multiple candidate translation fragments exist at a decoding step, the method feeds these fragments back into the model, computes the probability gain of the source text given each fragment, and selects the better translation fragment accordingly. Experiments show that the proposed decoding method significantly improves the decoding performance of unsupervised machine translation based on large language models.

 

In summary, this thesis studies in depth methods for enhancing cross-lingual semantic correlations in unsupervised machine translation, provides theoretical or experimental analyses for the parameter initialization, model training, and decoding stages, and proposes corresponding semantic correlation optimization strategies. Experiments show that the proposed methods accelerate the convergence of unsupervised machine translation models and significantly improve their translation quality.

 

Other Abstract

In recent years, end-to-end neural machine translation has made rapid progress, showing significant improvements in translation quality compared to traditional methods. However, training neural machine translation models heavily relies on large-scale parallel corpora, making it difficult to construct advanced machine translation systems for low-resource language pairs. To alleviate this problem, unsupervised neural machine translation (UNMT) methods have been proposed, aiming to model translation using only monolingual corpora; the core idea lies in mining implicit cross-lingual semantic correlations from the data distributions of the two languages. However, current unsupervised machine translation methods lack adequate and explicit modeling of cross-lingual semantic correlations, leading to slow convergence and a significant performance gap compared with supervised machine translation. Therefore, enhancing cross-lingual semantic correlations at different stages of unsupervised machine translation is important for accelerating convergence, improving translation performance, and promoting practical applications. This thesis focuses on three stages of unsupervised machine translation, namely parameter initialization, model training, and decoding, and investigates methods to enhance cross-lingual semantic correlations at each of them. The main contributions of this thesis are summarized as follows:

 

1. Semantic-Level Code-Switching based Cross-Lingual Pre-Training for Parameter Initialization

 

The parameters of UNMT models are generally initialized with cross-lingual pre-trained language models, so the cross-lingual nature of the pre-trained parameters determines the basic performance of unsupervised machine translation. Due to the lack of parallel sentences, cross-lingual pre-trained language models generally do not model explicit cross-lingual signals during training, which limits the cross-lingual capability of the pre-trained parameters. This thesis first analyzes cross-lingual pre-trained models and finds that language models can learn accurate token alignments from monolingual data in multiple languages alone and store them in the embedding table. However, the lack of explicit cross-lingual signals decreases the geometric similarity of token representations at higher layers, restricting the model's cross-lingual capabilities. To address this issue, this thesis proposes a semantic-level code-switching based cross-lingual pre-training method. Specifically, the proposed method uses monolingual data to construct language-mixed training samples in the semantic space and requires the model to predict masked tokens or sequences based on the language-mixed contexts. The method can effectively model cross-lingual correlations without any external parallel corpora or bilingual dictionaries, thereby enhancing the cross-lingual alignment capability of the pre-trained parameters. Experimental results show that initializing unsupervised translation models with these pre-trained parameters significantly improves translation performance.
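As a rough illustration only (not the procedure implemented in the thesis), the sketch below builds one language-mixed masked-LM sample by swapping a fraction of source tokens for their nearest target-language neighbours in a shared embedding space and then masking some positions. The embedding tables `src_emb`/`tgt_emb`, the ratios, and the `[MASK]` symbol are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def semantic_code_switch(tokens, src_emb, tgt_emb, tgt_vocab,
                         switch_ratio=0.3, mask_ratio=0.15, mask_token="[MASK]"):
    """Build one language-mixed masked-LM sample from a monolingual sentence.

    tokens   : list of source-language tokens
    src_emb  : {token: vector} source embeddings (from a shared embedding table)
    tgt_emb  : (V_tgt, d) matrix of target-language embeddings
    tgt_vocab: list of target-language tokens aligned with the rows of tgt_emb
    """
    mixed = list(tokens)
    # 1) Replace a fraction of tokens with their nearest target-language
    #    neighbour in the shared semantic (embedding) space.
    for i, tok in enumerate(tokens):
        if tok in src_emb and rng.random() < switch_ratio:
            v = src_emb[tok]
            sims = tgt_emb @ v / (np.linalg.norm(tgt_emb, axis=1) * np.linalg.norm(v) + 1e-9)
            mixed[i] = tgt_vocab[int(np.argmax(sims))]
    # 2) Mask a fraction of positions; the model must recover the original
    #    tokens from the language-mixed context.
    labels = [None] * len(mixed)
    for i in range(len(mixed)):
        if rng.random() < mask_ratio:
            labels[i] = mixed[i]
            mixed[i] = mask_token
    return mixed, labels
```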

 

2. Back-Translated Data Filtering based on Sample Difficulty and Translation Quality

 

In the training process of unsupervised machine translation, monolingual data is generally turned into pseudo-parallel training samples through iterative back-translation. However, due to differences in sample difficulty and model capability, the quality of the pseudo-parallel sentence pairs constructed by back-translation is often uneven. High-quality training examples provide reliable token alignments, while low-quality examples have the opposite effect. Therefore, this thesis proposes a back-translated data filtering method based on sample difficulty and translation quality. On the one hand, the proposed method balances data quality during training; on the other hand, it encourages the UNMT model to learn correct cross-lingual semantic alignments from high-quality parallel sentences while suppressing the influence of erroneous semantic alignments. Specifically, the proposed method first defines sample translation difficulty based on the degree of cross-lingual mapping and collects samples within a difficulty range appropriate to the model's current capability, which reduces the number of training examples with significant quality differences in the early stage of training. Then, based on the cross-lingual properties of the model itself, the method computes quality scores for the tokens and sentences of the pseudo-parallel pairs in real time and uses these scores to weight the corresponding losses, which decreases the impact of noisy examples. Experimental results show that the proposed method effectively improves the translation quality of unsupervised machine translation models and accelerates convergence.
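The quality-weighting idea can be pictured with a short sketch. The following is a minimal, assumed formulation rather than the thesis' exact loss: a token-level cross-entropy in which each position is scaled by hypothetical per-token and per-sentence quality scores.

```python
import torch
import torch.nn.functional as F

def quality_weighted_nll(logits, targets, token_quality, sent_quality, pad_id=0):
    """Token-level cross-entropy weighted by estimated back-translation quality.

    logits        : (B, T, V) decoder outputs on pseudo-parallel pairs
    targets       : (B, T) target token ids
    token_quality : (B, T) per-token quality scores in [0, 1]
    sent_quality  : (B,)   per-sentence quality scores in [0, 1]
    """
    loss = F.cross_entropy(
        logits.reshape(-1, logits.size(-1)), targets.reshape(-1),
        ignore_index=pad_id, reduction="none",
    ).view_as(targets)
    mask = (targets != pad_id).float()
    # Down-weight tokens and sentences whose alignments are judged unreliable.
    weights = token_quality * sent_quality.unsqueeze(1) * mask
    return (weights * loss).sum() / weights.sum().clamp(min=1.0)
```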

 

3. Self-Correction for Back-Translated Data Enhancement

 

The back-translated data constructed during the training of unsupervised machine translation inevitably contains inaccurate cross-lingual semantic alignments, which easily biases the translation knowledge the model learns. To alleviate this problem, this thesis proposes a self-correction based method for improving the quality of back-translated training examples. Specifically, the proposed method first computes the token alignments of pseudo-parallel sentence pairs with optimal transport, estimating which tokens are likely incorrect and the magnitude of the corresponding semantic deviations. It then searches for possible translations of under-translated tokens and softly corrects the mistranslated ones in the continuous space, thus improving the quality of the back-translated sentences. As training progresses and the translation ability of the UNMT model gradually improves, the proposed method transforms into a data augmentation method in the later stage, constructing training examples that are semantically equivalent but different in form. On this basis, the thesis proposes a second-stage training objective that constrains the consistency between the outputs of the original examples and the augmented ones. Experimental results show that the proposed method effectively improves the quality of back-translated data and significantly enhances the translation capability of both traditional UNMT models and UNMT models based on large language models.
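To illustrate the optimal-transport alignment step only, a minimal sketch is given below. It runs entropy-regularised Sinkhorn iterations over cosine costs between token representations and flags hypothesis tokens whose assigned transport mass has a high average cost. The cosine cost, the uniform marginals, and the flagging threshold are assumptions for illustration, not the thesis' exact procedure.

```python
import numpy as np

def sinkhorn(cost, n_iters=50, eps=0.1):
    """Entropy-regularised optimal transport with uniform marginals."""
    K = np.exp(-cost / eps)
    a = np.ones(cost.shape[0]) / cost.shape[0]
    b = np.ones(cost.shape[1]) / cost.shape[1]
    u, v = a.copy(), b.copy()
    for _ in range(n_iters):
        u = a / (K @ v + 1e-9)
        v = b / (K.T @ u + 1e-9)
    return np.diag(u) @ K @ np.diag(v)   # transport plan (soft alignment matrix)

def flag_misaligned(src_vecs, hyp_vecs, threshold=0.5):
    """Flag back-translated tokens that absorb only high-cost transport mass.

    src_vecs, hyp_vecs: (n, d) and (m, d) token representations of the source
    sentence and of its back-translation. Returns the indices of suspicious
    hypothesis positions and a per-token deviation score.
    """
    s = src_vecs / (np.linalg.norm(src_vecs, axis=1, keepdims=True) + 1e-9)
    h = hyp_vecs / (np.linalg.norm(hyp_vecs, axis=1, keepdims=True) + 1e-9)
    cost = 1.0 - s @ h.T                  # cosine distance between token pairs
    plan = sinkhorn(cost)
    mass = plan.sum(axis=0) + 1e-9
    # Average cost of the mass assigned to each hypothesis token: high values
    # mean the token has no semantically close counterpart in the source.
    deviation = (plan * cost).sum(axis=0) / mass
    return np.where(deviation > threshold)[0], deviation
```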

 

4. Mutual Information based Re-Ranking for Translation Decoding

 

The emergence and development of large language models (LLMs) have gradually shifted research on unsupervised machine translation toward LLM-based methods. Traditional unsupervised machine translation generally models translation probabilities from the source language to the target language using iterative back-translation, so conventional decoding methods such as greedy search or beam search are reliable. However, LLMs do not directly involve translation modeling in the pre-training stage, so, given a translation instruction, they are prone to decoding results that are unfaithful to the source sentence, such as mistranslations, omissions, and hallucinations. To address this issue, this thesis proposes a mutual information based re-ranking method for translation decoding. Besides maximizing the output probability, the proposed method also maximizes the mutual information between the generated translation and the source sentence. Through derivation, the calculation of mutual information can be transformed into the probability gain of reconstructing the source sentence from generated translation fragments. Concretely, when multiple potential translation fragments exist at a decoding step, the proposed method feeds these fragments back into the model, calculates the probability gain of the source sentence given each fragment, and selects the best fragment based on these gains. Experimental results show that the proposed decoding method significantly improves the decoding performance of LLM-based unsupervised machine translation.
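A stylised view of the re-ranking rule is sketched below, under the assumption of a hypothetical helper `lm_logprob(prompt, continuation)` that returns the log-probability of a continuation under the LLM. The sketch combines the forward translation score with the probability gain of reconstructing the source; the exact scoring function used in the thesis may differ.

```python
def mi_rerank(source, prefix, candidates, lm_logprob, lam=1.0):
    """Re-rank candidate continuation fragments with a mutual-information term.

    source     : source-language sentence
    prefix     : translation generated so far
    candidates : list of (fragment, forward_logprob) pairs, where
                 forward_logprob = log p(fragment | source, prefix)
    lm_logprob : callable(prompt, continuation) -> log-probability of
                 `continuation` given `prompt` (hypothetical helper)
    """
    # Baseline: how well the model reconstructs the source without the fragment.
    base = lm_logprob(prefix, source)
    scored = []
    for fragment, fwd in candidates:
        # Probability gain of the source text once the fragment is available;
        # this approximates the mutual information between fragment and source.
        gain = lm_logprob(prefix + " " + fragment, source) - base
        scored.append((fwd + lam * gain, fragment))
    return max(scored)[1]
```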

 

In summary, this thesis thoroughly investigates methods to enhance cross-lingual semantic correlations in unsupervised machine translation. Theoretical or experimental analyses are provided for the parameter initialization, model training, and decoding stages, and corresponding semantic correlation enhancement strategies are proposed for each of them. Experimental results demonstrate that the proposed methods accelerate the convergence of unsupervised machine translation models and significantly improve translation quality.

Keywords: neural machine translation, cross-lingual pre-training, translation quality estimation, back-translation, mutual information
Language: Chinese
Sub-direction classification: Natural Language Processing
Planning direction of the State Key Laboratory: Speech and Language Processing
Document Type: Dissertation
Identifier: http://ir.ia.ac.cn/handle/173211/57389
Collection: Doctoral Dissertations (Graduates)
Recommended Citation
GB/T 7714
陆金梁. 跨语言语义关联增强的无监督机器翻译方法研究[D]. 2024.
Files in This Item:
File Name/Size: 陆金梁_博士论文.pdf (3544 KB) | DocType: Dissertation | Access: Restricted | License: CC BY-NC-SA