Title: Approaches to Bilingual Alignment for Statistical Machine Translation (面向统计机器翻译的双语对齐方法研究)
Author: Zhou Yu (周玉)
Date Issued: 2008-02-03
Degree Type: Doctor of Engineering (工学博士)
Abstract (Chinese, translated): With the acceleration of globalization, the obstacles that language differences pose to interpersonal communication and information exchange have become increasingly prominent, making it ever more important to overcome the language barrier. Automatic translation between languages by computer is a major avenue toward solving this problem. Statistical methods currently dominate the field of machine translation, and among them phrase-based translation models are the most mature. As is well known, the quality of bilingual word alignment and phrase pair extraction has always been a key factor in the performance of a translation system; bilingual word alignment is also a fundamental technology underlying other statistical translation models and cross-lingual information retrieval. The topic of this thesis therefore has significant theoretical and practical value.

This thesis focuses on methods for the automatic acquisition of translation knowledge, exploring through extensive experiments how to mine more translation knowledge from the training corpus so as to better serve the decoding process. The main contributions are summarized as follows:

(1) A method for filtering the training corpus using a suffix-array data structure combined with n-gram information, together with an un-greedy expansion feature algorithm for preprocessing the filtered corpus. The quality of statistical machine translation depends to a large extent on the size and quality of the training data: estimating the parameters of a translation model from well-chosen training data brings them closer to their true values. This thesis proposes an effective method to preprocess the raw training corpus in order to obtain the "concentrated, homogeneous, and accurate" training data we need. The method filters and selects the corpus with an algorithm based on the suffix-array data structure combined with n-gram information; we then propose an "un-greedy expansion feature algorithm" to segment and recombine the filtered corpus, yielding well-aligned training data of high precision.

(2) A multi-granularity bilingual word alignment method. In a statistical translation system, almost all subsequent mappings of translation knowledge are built on word alignment, so word alignment is crucial. This thesis proposes a multi-granularity word alignment algorithm whose guiding idea is "divide and conquer": word alignment is restricted to relatively reliable small ranges instead of being searched over the whole sentence. On this basis, a log-linear model fuses the word alignment results obtained at different granularities, mining the translation knowledge of the training sentence pairs more fully and producing better word alignments.

(3) A phrase pair extraction method based on multi-layer filtering. Phrase translation pairs are an important knowledge source supporting a statistical machine translation system. Among the many phrase extraction methods, the most popular are the method proposed by Franz J. Och and the improved hierarchical phrase extraction algorithm that David Chiang built on it. These methods use only word alignment information and are simple and effective; the problem is that as the corpus grows, the number of extracted phrase pairs increases sharply, which not only requires excessive storage for the phrase pairs but also burdens the decoder. We therefore propose a phrase pair extraction method based on multi-layer filtering, which generates multiple groups of phrase pairs directly from the word alignment of the current sentence pair, filters phrase pairs effectively, and in particular controls the unrestricted expansion of null words.

(4) Using the 2007 International Workshop on Spoken Language Translation evaluation task (IWSLT 2007), we analyze the role of each module in a phrase-based translation engine (the baseline system) and test the methods proposed in this thesis in the Chinese-to-English translation track. Experiments show that all of the above methods improve the translation performance of the baseline system to varying degrees.

In summary, this thesis conducts extensive experiments and in-depth research on training corpus preprocessing, bilingual word alignment, and automatic phrase pair extraction for statistical machine translation, effectively improving the performance of the existing experimental system and laying a solid foundation for further exploration of new translation methods.
Abstract (English): Automatic machine translation has become a major research focus for overcoming the language barrier. During the past decade, statistical machine translation has shown considerable success, and phrase-based translation models have become the state of the art among statistical methods. Bilingual word alignment and phrase pair extraction, however, remain the two main issues influencing translation quality. This thesis therefore makes an intensive study of the automatic acquisition of translation knowledge, with extensive experiments aimed at extracting more useful translation knowledge from the training data. The main contributions are summarized as follows:

(1) A method for filtering the training data using a suffix-array-based data structure combined with n-gram information, and an un-greedy step-expanding feature algorithm for preprocessing the filtered data. We propose a feasible way to preprocess the training data in order to obtain "concentrated, homogeneous, and accurate" training data: we first filter the corpus with a suffix-array-based algorithm combined with n-gram information, and then apply an un-greedy step-expanding feature algorithm to segment and recombine the filtered data into more accurate training data.

(2) A multi-granularity word alignment method. The method adopts a divide-and-conquer strategy, restricting word alignment to relatively accurate, smaller ranges instead of the whole sentence, and uses a log-linear model to combine the word alignments produced at the different granularities into a better overall alignment.
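The log-linear fusion in contribution (2) is described only at a high level. Assuming each granularity yields per-link scores, one hedged sketch of such a combination looks as follows (the interface, the floor probability, and the threshold are illustrative assumptions, not the thesis's actual feature functions):

```python
import math

def combine_alignments(score_tables, weights, threshold=0.5):
    """Log-linearly combine per-link scores from several aligners.

    score_tables: list of dicts {(i, j): prob}, one per granularity.
    weights: one log-linear weight per table.
    Returns the set of links whose combined score passes the threshold.
    """
    links = set().union(*score_tables)
    kept = set()
    for link in links:
        # A link missing from a table gets a small floor probability.
        logscore = sum(w * math.log(table.get(link, 1e-9))
                       for w, table in zip(weights, score_tables))
        if math.exp(logscore) >= threshold:
            kept.add(link)
    return kept
```

With equal weights this reduces to a geometric mean of the per-granularity probabilities, so a link must be supported by all granularities to survive; in practice the weights would be tuned on held-out data.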
(3) A phrase extraction method with multi-layer filtering. A major problem with existing extraction methods is that the number of extracted phrase pairs grows rapidly with the size of the training data, which greatly increases storage requirements and burdens the decoding process. We therefore propose a multi-layer filtering method to address this problem; the method can also generate more kinds of phrase pairs from a given word alignment and effectively control the unrestricted expansion of null words.

(4) Using the 2007 International Workshop on Spoken Language Translation (IWSLT 2007) evaluation task, we analyze the function of each module in the phrase-based translation system and, through a series of experiments, validate the positive effect of our methods on it.

The work described in this thesis focuses on the preprocessing of the training data, word alignment, and phrase extraction; it has substantially improved translation results and established a good basis for future research on new translation methods.
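The abstract names the popular extraction method of Franz J. Och as the baseline that the multi-layer filtering builds on. As background, here is a minimal sketch of that baseline, consistent phrase pair extraction from a word alignment (names are ours; the full method also extends spans over unaligned boundary words, which this sketch omits, and the thesis's filtering layers are not shown):

```python
def extract_phrases(src_len, alignment, max_len=4):
    """Extract phrase pairs consistent with a word alignment.

    alignment: set of (i, j) links, i over source positions,
    j over target positions. A source span [i1, i2] and the target
    span [j1, j2] covering its links form a valid pair only if no
    link crosses out of the pair in either direction.
    """
    phrases = set()
    for i1 in range(src_len):
        for i2 in range(i1, min(i1 + max_len, src_len)):
            # Target positions linked to the source span [i1, i2].
            tps = [j for (i, j) in alignment if i1 <= i <= i2]
            if not tps:
                continue
            j1, j2 = min(tps), max(tps)
            if j2 - j1 >= max_len:
                continue
            # Consistency: no link from inside the target span may
            # point to a source word outside [i1, i2].
            if any(j1 <= j <= j2 and not (i1 <= i <= i2)
                   for (i, j) in alignment):
                continue
            phrases.add(((i1, i2), (j1, j2)))
    return phrases
```

Even this baseline makes the scaling problem visible: every consistent sub-span pair is emitted, so the phrase table grows quickly with sentence length and corpus size, which is precisely what the proposed filtering targets.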
Keywords: Statistical Machine Translation; Training Data Preprocessing; Word Alignment; Phrase Pair Filtering; Phrase Pair Extraction
Language: Chinese
Document Type: Thesis
Identifier: http://ir.ia.ac.cn/handle/173211/6049
Collection: Graduates / Doctoral Dissertations
Recommended Citation (GB/T 7714):
Zhou Yu. Approaches to Bilingual Alignment for Statistical Machine Translation [D]. Institute of Automation, Chinese Academy of Sciences. Graduate University of Chinese Academy of Sciences, 2008.
Files in This Item:
File Name / Size: CASIA_20021801460323 (4031 KB)
Access: Not yet open; License: CC BY-NC-SA