Research on Statistical Machine Translation Model and System Based on Multiple System Combination


Title	基于多系统融合的统计机器翻译模型及系统研究
Alternative title	Research on Statistical Machine Translation Model and System Based on Multiple System Combination
Author	Du Jinhua (杜金华)
	2008-05-23
Degree type	Doctor of Engineering
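Chinese abstract (translated)	Within a multiple-system combination framework, this thesis analyzes and studies in depth the main problems of Chinese-English bilingual corpus optimization, multi-engine platform construction, and phrase-based model optimization, proposes solutions, and verifies them through extensive comparative experiments. The main work is summarized as follows: 1. A corpus construction specification and implementation workflow for statistical machine translation is proposed, and the content-based corpus optimization method is improved. 2. A construction and implementation workflow for a multi-engine statistical machine translation platform is proposed, and several optimizations are applied to the key modules of the phrase-based translation system and to the system-independent common modules of the platform. The result is a sound multi-engine experimental platform for research on statistical machine translation models and algorithms, which also serves as a transfer platform for engineering-oriented development. Among the module optimizations of the phrase-based translation system, the focus is on optimizing the phrase translation model. 3. A reordering model for phrase-based translation systems based on position vector prediction is proposed. The main problem of phrase-based statistical machine translation systems is phrase reordering. This thesis proposes a position vector prediction model based on the relative positions and orientation relations of phrases. 4. A multi-feature system combination framework based on confusion network decoding is proposed. The framework performs system combination at the word level and is a multi-feature combination framework built on MBR decoding and confusion network decoding. The decoding model is a log-linear model that uses word posterior probabilities, a language model, a part-of-speech language model, and a sentence length penalty as features, and beam search is used to find the best path through the confusion network.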
English abstract	Under the framework of multiple system combination, this thesis analyzes and studies key problems such as Chinese-English bilingual corpus processing and optimization, multi-engine platform construction, and phrase-based model optimization. It also proposes solutions and verifies their effectiveness through extensive experiments. The main contributions are as follows: 1. We study Chinese-English bilingual corpus construction and implementation, and propose a content-based optimization method for bilingual corpus processing. 2. We study multi-engine SMT platform construction and implementation, and propose strategies for optimizing the phrase-based model and the common modules. We construct a multi-engine experimental platform for research on SMT models and algorithms, which can also serve as a transfer platform for application development. In optimizing the phrase-based model, we focus on phrase extraction and probability computation. 3. We propose a local prediction re-ordering model based on relative position vectors for phrase-based systems. The major problem of phrase-based SMT is phrase re-ordering. This thesis proposes a prediction model based on the relative positions and orientations of phrases. 4. We propose a framework for multiple system combination based on Confusion Network decoding. The proposed framework performs word-level combination and uses Minimum Bayes Risk (MBR) decoding and Confusion Network decoding. We add the word posterior, language model, POS language model, and word penalty as features in a log-linear model, and then search for the best output path using beam search.
Keywords	Statistical Machine Translation; Bilingual Corpus Construction; Multi-engine Translation Platform; Relative Position Vector Re-ordering Model; MBR Decoding; Confusion Network; System Combination Framework
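Language	Chinese
Document type	Thesis
Identifier	http://ir.ia.ac.cn/handle/173211/6063
Collection	Graduates_Doctoral Dissertations
Recommended citation (GB/T 7714)	Du Jinhua. Research on Statistical Machine Translation Model and System Based on Multiple System Combination[D]. Institute of Automation, Chinese Academy of Sciences. Graduate University of Chinese Academy of Sciences, 2008.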

Files in this item
File name/size	Document type	Version type	Access type	License
CASIA_20041801462808 (1499KB)			Not currently open	CC BY-NC-SA