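Research on Phrase Pair Extraction Methods and Reordering Models in Phrase-Based Statistical Translation (基于短语的统计翻译中短语对抽取方法和调序模型研究)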

Alternative title	Research on Phrase Extraction Method and Reordering Model for Statistical Machine Translation
	He Yanqing (何彦青)
	2009-05-31
Degree type	Doctor of Engineering
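Chinese abstract	Machine translation (MT) is the use of computers to automatically translate text or speech from one language into another. In today's society, characterized chiefly by the knowledge economy, increasingly frequent international exchange and accelerating globalization have sharply increased the volume of cross-language information exchange, and the natural language barriers between countries and regions have become ever more prominent. As a computer technology capable of breaking through language barriers, machine translation plays an increasingly important role in economic development and social life. Statistical methods have gradually come to dominate machine translation, and among them phrase-based translation models remain a research focus. However, three main problems have hindered the development of phrase-based translation: phrase table construction is not robust; the contiguity of phrases leaves them without generalization ability; and phrase reordering ability is weak. This thesis focuses on building high-performance phrase extraction methods and phrase reordering models for phrase-based statistical machine translation, so as to improve the performance of phrase-based statistical machine translation systems. The main work is summarized as follows: (1) A phrase extraction method based on a “flexible scale” is proposed. Phrase table construction is a key technique in phrase-based statistical translation. The phrase pair extraction method proposed by Och is currently the most widely used, but it relies too heavily on the word alignment and can therefore extract only phrase pairs that are fully consistent with it. We therefore propose a phrase extraction method based on a “flexible scale” which, for phrase pairs that are not fully consistent with the word alignment, uses part-of-speech tagging information and dictionary information to decide whether to extract them. Because the method relaxes the “full consistency” restriction, it can find corresponding target phrases for more source-language phrases and mines more translation knowledge from the parallel corpus, which helps improve the quality of phrase-based statistical machine translation. (2) A generalized reordering model is proposed that introduces non-contiguous phrases into Bracketing Transduction Grammar (BTG), thereby increasing the generalization ability of phrases in the grammar. To overcome the poor generalization of contiguous phrases in traditional phrase-based statistical translation models, we propose a reordering model with generalization ability (GREM). It introduces non-contiguous phrases into BTG and uses rules to combine contiguous and non-contiguous phrases so as to obtain as many contiguous target translations as possible. The model not only captures both local and global phrase reordering rules but also further strengthens phrase generalization through non-contiguous phrases. (3) A reordering strategy based on multi-layer phrases is proposed. Inspired by hierarchical translation models, the strategy applies different reordering models according to the characteristics of different phrases. It segments a long source sentence into multiple layers of phrases and applies different reordering models at different layers to obtain the final target translation. The strategy readily combines phrase reordering models of different styles (for example, hierarchical phrase reordering models, BTG-style reordering models, and monotone-translation reordering models), and can even integrate more complex reordering models (for example, reordering models based on linguistic syntax), confining them to a small scope while using simpler reordering models over larger scopes, thereby balancing translation quality and translation speed. In summary, this thesis presents an in-depth study of phrase table construction, the generalization of contiguous phrases, and the design of reordering models for phrase-based statistical translation. The proposed methods effectively improve the performance of phrase-based statistical machine translation systems, and for further exploring new translation methods...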
English abstract	Machine translation investigates the use of computers to translate text or speech from one language to another. In the era of the knowledge economy, increasingly frequent international communication and accelerating globalization are rapidly increasing the volume of cross-language information exchange. The natural language barriers between countries and regions are becoming more and more prominent. As a computer technology for overcoming language barriers, machine translation is playing an increasingly important role in economic development and social life. Statistical machine translation has become the mainstream approach to machine translation, and within it phrase-based models remain an active area of research. However, three main problems hinder the development of phrase-based models: lack of robustness in phrase table construction, poor generalization of contiguous phrases, and weak phrase reordering. This thesis focuses on two subfields of phrase-based statistical machine translation: phrase extraction methods and phrase reordering models. The major contributions of this thesis are as follows: (1) We propose a flexible-scale-based method of phrase pair extraction. Phrase translation pair extraction is a key technique in phrase-based statistical machine translation. Och’s method of phrase extraction is the most widely applied; it depends heavily on word alignments and extracts only the phrase pairs fully consistent with them. We propose a method of phrase pair extraction with a flexible scale. It retains the merits of Och’s method and also extracts phrase alignments that Och’s method cannot obtain. The flexible scale is based on two features: part-of-speech (POS) and dictionary information. Our method relaxes the restriction of “total consistency with word alignment” and can find corresponding target phrases for more source phrases. In this way it extracts more translation information from parallel data and improves the translation performance of phrase-based statistical machine translation. (2) We propose a generalized reordering model for phrase-based statistical machine translation which introduces non-contiguous phrases into bracketing transduction grammar and increases its capability for phrase generalization. Phrase-based statistical machine translat...
Keywords	Statistical Machine Translation; Phrase Pair Extraction; Non-contiguous Phrase; Phrase Reordering Model
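Language	Chinese
Document type	Thesis
Item identifier	http://ir.ia.ac.cn/handle/173211/6200
Collection	Graduates_Doctoral Dissertations
Recommended citation (GB/T 7714)	He Yanqing. Research on Phrase Pair Extraction Methods and Reordering Models in Phrase-Based Statistical Translation (基于短语的统计翻译中短语对抽取方法和调序模型研究) [D]. Institute of Automation, Chinese Academy of Sciences. Graduate University of Chinese Academy of Sciences, 2009.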

Files in this item
File name/size	Document type	Version type	Access type	License
CASIA_20051801462809 (1599KB)			Not yet open	CC BY-NC-SA
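
Illustrative note: the abstract contrasts the proposed flexible-scale method with extraction of phrase pairs that are "fully consistent" with the word alignment. The sketch below is a minimal Python illustration of that baseline consistency criterion only, as it is commonly described for phrase-based systems. It is not the thesis's flexible-scale method, and it omits the usual extension over unaligned boundary words. The function name `extract_consistent_pairs`, the `max_len` parameter, and the toy sentence pair are assumptions made for illustration.

```python
from typing import List, Set, Tuple

Alignment = Set[Tuple[int, int]]  # (source index, target index) links


def extract_consistent_pairs(
    src: List[str], tgt: List[str], alignment: Alignment, max_len: int = 4
) -> Set[Tuple[str, str]]:
    """Return phrase pairs fully consistent with the word alignment.

    Hypothetical helper for illustration; simplified baseline criterion.
    """
    pairs = set()
    for s_start in range(len(src)):
        for s_end in range(s_start, min(s_start + max_len, len(src))):
            # Target positions linked to any word inside the source span.
            linked = [t for (s, t) in alignment if s_start <= s <= s_end]
            if not linked:
                continue  # spans without any alignment link are skipped
            t_start, t_end = min(linked), max(linked)
            if t_end - t_start + 1 > max_len:
                continue
            # Consistency check: no word inside the target span may be
            # linked to a source word outside the source span.
            if any(
                t_start <= t <= t_end and not (s_start <= s <= s_end)
                for (s, t) in alignment
            ):
                continue
            pairs.add(
                (" ".join(src[s_start:s_end + 1]),
                 " ".join(tgt[t_start:t_end + 1]))
            )
    return pairs


# Toy example with a crossing alignment: b<->z and c<->y are swapped.
example = extract_consistent_pairs(
    ["a", "b", "c"], ["x", "y", "z"], {(0, 0), (1, 2), (2, 1)}
)
```

In this toy example the source span "a b" yields no pair, because its target span would include "y", which is linked to "c" outside the span. A relaxed criterion such as the one the abstract describes would decide whether to admit such candidates using POS and dictionary information, which this sketch does not attempt.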