统计机器翻译中模型的训练、自适应和学习算法研究


	统计机器翻译中模型的训练、自适应和学习算法研究
Alternative title	Research on the Training, Adaptation and Learning for the Models in Statistical Machine Translation
	卢世祥
	2014-05-27
Degree type	Doctor of Engineering
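Chinese abstract (translated)	Although statistical machine translation (SMT) consists of three main sub-problems, namely the language model, the translation model and the decoding algorithm, the "major bottleneck" is the model itself: the model determines the upper limit of translation performance. With the overall goal of improving translation performance, this thesis systematically studies the key algorithms of SMT from three aspects of its models (the language model and the translation model): training, adaptation and learning. The main research content is as follows:
1. Distributed fast training of large-scale language models. This thesis proposes a fast training method for large-scale language models based on a distributed computing platform. In this method, a client-server distributed computing platform increases the capacity for processing large-scale data; in-memory interaction improves the training speed of the language model; the modified Kneser-Ney smoothing algorithm enables parallel training in blocks; and distributed storage solves the storage problem of large-scale language models.
2. Model adaptation based on corpus selection.
2.1. Language model adaptation based on cross-lingual corpus selection. In SMT, the performance of the language model is limited by the quantity of the training corpus (corpus size) and its quality (whether the corpus matches the translation task). Many researchers try to improve it by selecting, from the training corpus, the data that are similar to the translation task. However, most traditional corpus selection methods are monolingual methods based on the bag-of-words model, which are prone to "noise proliferation" and ignore contextual information. This thesis proposes two cross-lingual corpus selection methods based on contextual information. Experiments show that, compared with bag-of-words based monolingual selection, both methods further improve the quality of corpus selection and the translation performance. The two methods are:
• Translation model based cross-lingual corpus selection: combining the ideas of the phrase translation model and cross-lingual information retrieval, this thesis proposes a translation model based cross-lingual corpus selection method. Given a source-language sentence from the current translation task, we can estimate directly, before translation, whether a sentence in the target-side language model training corpus is similar to it, and select similar data according to this similarity. During selection, the method works at the phrase level and thereby introduces contextual information. In addition, this thesis introduces a linear ranking model framework, into which many different models can be incorporated as features.
• Bilingual topic model based cross-lingual corpus selection: treating semantically related words as global contextual information from the perspective of the topic distribution of words, this thesis proposes a bilingual topic model based cross-lingual corpus selection method. We regard cross-lingually similar sentence pairs in the corpus as language-independent cross-lingual semantic representations, and assume that they share the same or similar topic distributions, that is, the same or similar global contextual information.
2.2. Translation model adaptation based on bilingual corpus selection. The performance of the translation model is likewise limited by the quantity and quality of the bilingual corpus. In-domain bilingual corpora are hard to obtain and usually small. In contrast, general-domain bilingual corpora are easy to obtain and usually large, but at the cost of a mismatch: they are either entirely unrelated to the translation task or differ considerably from it in topic or domain. First, to improve translation performance, we collect a large amount of general-domain bilingual data from the web. Then, this thesis proposes a phrase-based bilingual corpus selection method and applies it to selecting "pseudo in-domain" data from the general-domain bilingual corpus. Experiments show that the selected "pseudo in-domain" data further improve translation performance.
3. Translation model learning based on deep neural networks.
3.1. Phrase feature learning based on deep auto-encoders (DAE). Although many features have been introduced into SMT and have improved translation performance, the vast majority of them are from the bilingual language pair based ...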
English abstract	Although statistical machine translation (SMT) mainly consists of three sub-problems, namely the language model (LM), the translation model (TM) and the decoding algorithm, the "major bottleneck" is the model itself: the model determines the upper limit of translation performance. With the objective of improving translation performance, we conduct systematic research on the key technologies of the models (LM and TM) in SMT from the aspects of training, adaptation and learning. Our main contributions are listed here:
1. Distributed Training of Large-Scale Language Model. We propose a fast language model training method for large-scale corpora based on a distributed architecture. In our distributed training infrastructure, we use a client-server distributed computing platform to increase the capacity for large-scale data processing, a memory-interactive paradigm to improve the training speed, and modified Kneser-Ney discounting for parallel training in blocks.
2. Corpus Selection based Model Adaptation.
2.1. Cross-lingual Corpus Selection based Language Model Adaptation. Improving the performance of an LM requires not only more training data but also training data that match the translation task. Many researchers have therefore preferred to select, from the training corpus, data similar to the translation task. However, most traditional corpus selection methods are monolingual methods based on bag-of-words models, which cause noise proliferation and do not take any contextual information into account. We propose two context-aware cross-lingual corpus selection methods, and experimental results show that both significantly outperform the traditional methods in corpus selection and translation performance. Our two methods are:
• Translation model based cross-lingual corpus selection method. Inspired by the phrase-based TM and cross-lingual information retrieval, we propose phrase TM based cross-lingual corpus selection for LM adaptation. Given a source sentence in the translation task, our method directly estimates, before translation, the probability that a sentence in the target LM training corpus is similar to it, and then selects data by this probability. Our method operates at the phrase level and captures some contextual information by modeling the selection of a phrase as a whole. Moreover, we propose a linear ranking model framework to further improve the performance, where different models a...
Keywords	统计机器翻译; 大规模语言模型; 翻译模型; 领域自适应; 深度神经网络; Statistical Machine Translation; Large-scale Language Model; Translation Model; Domain Adaptation; Deep Neural Network
Language	Chinese
Document type	Thesis
Item identifier	http://ir.ia.ac.cn/handle/173211/6619
Collection	Graduates_Doctoral Theses
Recommended citation (GB/T 7714)	卢世祥. 统计机器翻译中模型的训练、自适应和学习算法研究[D]. 中国科学院自动化研究所. 中国科学院大学,2014.

Files in this item
File name/size	Document type	Version type	Access type	License
CASIA_20111801462805 (2569KB)			Not yet open	CC BY-NC-SA