Research on Hierarchical Phrase-Based Statistical Machine Translation Technology and Its Usability

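	Research on Hierarchical Phrase-Based Statistical Machine Translation Technology and Its Usability
Alternative title	Research on Hierarchical Phrase-based Translation and Availability of Statistical Machine Translation
	魏玮 (Wei Wei)
	2010-05-28
Degree type	Doctor of Engineering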
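Chinese abstract	In recent years, as machine translation has moved from word-based models to phrase-based models, statistical methods have come to dominate the field. However, phrase-based translation is limited by the model itself and has shortcomings that are hard to overcome in global reordering, phrase matching, and phrase segmentation, so syntax-based statistical machine translation has gradually become a research focus. At the same time, translation evaluation, the twin of machine translation, is receiving growing attention. Evaluation makes it possible to classify and summarize the problems in translation output and to analyze the factors that limit improvements in translation quality, which in turn advances machine translation. Working on a statistical machine translation platform based on formal syntax, this dissertation progressively incorporates linguistic structures of different granularity to strengthen the model's expressive power and to put linguistic knowledge to practical use in statistical machine translation. It also examines the usability of statistical machine translation through an analysis of translation errors and, in response to practical needs, builds an automatic evaluation platform for spoken language translation. The main contributions are summarized as follows:
	1. A hierarchical phrase extraction method based on suffix arrays is proposed, and an efficient statistical machine translation platform capable of handling large-scale corpora is built. We set up an experimental platform for research on formal-syntax-based statistical machine translation models and algorithms, optimized each key module of the system, and proposed a suffix-array-based hierarchical phrase extraction method that targets the time and space cost caused by the sharp growth of the translation model. The method builds a chart structure for each side of the Chinese-English sentence pairs and uses an efficient search algorithm to mark substring positions, which improves training efficiency. When scaling to large corpora, the decoding module was extended and improved, yielding substantial gains for both the Chinese-to-English and English-to-Chinese hierarchical phrase-based translation systems.
	2. Translation models that incorporate syntactic information of different granularity into a hierarchical phrase-based translation system are proposed, offering a new way to combine statistics and syntax. The dissertation first applies shallow parsing, introducing chunking based on conditional random fields (CRF) into the hierarchical phrase model to build a statistical machine translation system based on hierarchical chunk analysis. It then incorporates deeper syntactic information, namely dependency trees, using canonical substructures to constrain the generation of translation rules and effectively filter redundant information out of the hierarchical phrase translation model. CYK decoding is also optimized: tree nodes are traversed bottom-up and a chart index is built, so that syntactic structure directly guides decoding. Experiments verify that the hierarchical phrase-based system incorporating dependency trees has a more significant advantage in translation performance.
	3. Building on evaluation and error classification of statistical translation output for written and spoken language, a clarification-based spoken language translation method is proposed to improve the usability of the technology. To examine the performance and maturity of machine translation comprehensively and to improve the usability of translation technology, the dissertation draws on an error analysis framework, builds a visual platform for human evaluation, and carries out translation evaluation and error analysis on written and spoken statistical translation results. Combining the analysis results with users' actual needs, it explores the feasibility of correcting spoken language translation errors through clarification dialogue. The method makes full use of the dialogue participants' own contextual understanding and interactive ability, and builds a clarification dialogue management platform to address the bottleneck of spoken language translation.
	4. A confidence estimation method for clarification-based spoken language translation is proposed. To keep the dialogue between the two parties fluent, the quality of the translation at each step must be assessed before the dialogue management module starts a clarification dialogue. For this purpose, the dissertation proposes a translation confidence estimation method based on round-trip translation (RTT). Without human involvement, it uses information external to the translation system and also incorporates internal information from the RTT process, such as translation probabilities and word alignments, to perform sentence-level automatic evaluation. More importantly, the dissertation uses a regression learning strategy to effectively capture, within the RTT process, ...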
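The round-trip translation (RTT) idea in contribution 4 can be illustrated with a minimal Python sketch. It is not the dissertation's method: the abstract describes a regression model over external and internal features (translation probabilities, word alignments), whereas this sketch scores a sentence only by token-level F1 between the source and its back-translation. The names `token_f1`, `rtt_confidence`, `make_word_translator`, and the toy dictionaries are hypothetical, introduced purely for illustration.

```python
from collections import Counter
from typing import Callable, Dict, List

Translator = Callable[[List[str]], List[str]]


def token_f1(reference: List[str], candidate: List[str]) -> float:
    """Bag-of-words F1 between two token lists (0.0 if there is no overlap)."""
    if not reference or not candidate:
        return 0.0
    # Multiset intersection counts each shared token at most min(count) times.
    overlap = sum((Counter(reference) & Counter(candidate)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(candidate)
    recall = overlap / len(reference)
    return 2 * precision * recall / (precision + recall)


def rtt_confidence(source: List[str], forward: Translator, backward: Translator) -> float:
    """Translate source -> target -> source and score how much survives."""
    target = forward(source)
    round_trip = backward(target)
    return token_f1(source, round_trip)


def make_word_translator(table: Dict[str, str]) -> Translator:
    """Hypothetical stand-in for an MT system: word-by-word dictionary lookup."""
    return lambda tokens: [table.get(tok, tok) for tok in tokens]


# Toy, deliberately lossy dictionaries (assumptions, not real MT output).
EN_FR = {"i": "je", "want": "veux", "a": "un", "ticket": "billet"}
FR_EN = {"je": "i", "veux": "want", "un": "one", "billet": "ticket"}

score = rtt_confidence(
    ["i", "want", "a", "ticket"],
    make_word_translator(EN_FR),
    make_word_translator(FR_EN),
)
```

In a clarification-based setup, a dialogue manager could compare such a score against a threshold to decide whether to ask the speaker to rephrase; the threshold and the scoring function would need to be tuned on real data.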
English abstract	In recent years, statistical machine translation (SMT) has become dominant as machine translation (MT) has moved from word-based to phrase-based models. Because phrase-based models face major problems in global reordering, phrase matching, and phrase segmentation, syntax-based SMT is gradually becoming an attractive area of MT research. At the same time, MT evaluation, the twin of the translation system, has drawn increasing attention from MT researchers. Evaluation allows translation problems to be classified, generalized, and summarized, which helps identify the factors that restrict improvements in translation quality and thereby gives impetus to MT development. Within the framework of formal-syntax-based SMT, this study uses linguistic knowledge of different granularity and achieves the goal of applying syntactic structure in SMT. The dissertation also discusses the usability of SMT by analyzing translation errors and proposes an automatic evaluation method for spoken language translation. The main contributions of this dissertation are summarized as follows:
	1. Construction and optimization of a large-scale hierarchical phrase-based (HPB) SMT platform, and development of a hierarchical phrase extraction method based on suffix arrays. An efficient experimental platform for studying formal-syntax SMT models and algorithms was constructed. To reduce the time and space consumed in training translation models, we propose a hierarchical phrase extraction method based on suffix arrays, in which the training sentences are converted into chart structures and the positions of substrings are marked using efficient search algorithms. In addition, several new techniques are introduced in the decoding module to improve MT performance on large-scale training data.
	2. Strategies for applying linguistic knowledge of different granularity to HPB. A CRF-based chunking method was first introduced into the formal-syntax SMT model to build an SMT system using hierarchical chunking phrases, an initial attempt to apply shallow parsing knowledge to HPB. Dependency parsing results were then integrated into HPB as syntactic knowledge. On one hand, the re...
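As a rough illustration of why a suffix array helps with phrase extraction, the sketch below indexes a tokenized corpus and finds every occurrence of a phrase by binary search. It covers only the monolingual lookup step, under simplifying assumptions: the construction is naive, and the chart structures and bilingual rule extraction described in the abstract are not modeled. The function names `build_suffix_array` and `find_occurrences` are hypothetical.

```python
from typing import List, Sequence


def build_suffix_array(tokens: Sequence[str]) -> List[int]:
    """Return suffix start positions, sorted by the token sequence each begins."""
    # Naive construction that compares whole suffixes; for illustration only.
    return sorted(range(len(tokens)), key=lambda i: list(tokens[i:]))


def find_occurrences(tokens: Sequence[str], sa: List[int], phrase: Sequence[str]) -> List[int]:
    """Return the sorted start positions of `phrase` in `tokens`."""
    pattern = list(phrase)
    m = len(pattern)

    def prefix(rank: int) -> List[str]:
        # First m tokens of the suffix at the given rank in the suffix array.
        start = sa[rank]
        return list(tokens[start:start + m])

    # Lower bound: first rank whose m-token prefix is >= pattern.
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if prefix(mid) < pattern:
            lo = mid + 1
        else:
            hi = mid
    first = lo

    # Upper bound: first rank whose m-token prefix is > pattern.
    hi = len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if prefix(mid) <= pattern:
            lo = mid + 1
        else:
            hi = mid

    # All ranks in [first, lo) are suffixes that start with the pattern.
    return sorted(sa[first:lo])


corpus = "the cat sat on the mat near the cat".split()
suffix_array = build_suffix_array(corpus)
positions = find_occurrences(corpus, suffix_array, ["the", "cat"])
```

Because the suffixes are sorted, all occurrences of a phrase occupy one contiguous range of the array, so each lookup costs a logarithmic number of prefix comparisons instead of a full corpus scan. A production system would use a linear-time or near-linear-time construction algorithm rather than the naive sort shown here.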
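Keywords	Hierarchical Phrase; Suffix Array; Error Analysis of MT; Clarification-based Spoken Language Translation; Confidence Estimation for MT
Language	Chinese
Document type	Dissertation
Item identifier	http://ir.ia.ac.cn/handle/173211/6252
Collection	Graduates_Doctoral Dissertations
Recommended citation (GB/T 7714)	魏玮. Research on Hierarchical Phrase-Based Statistical Machine Translation Technology and Its Usability [D]. Institute of Automation, Chinese Academy of Sciences. Graduate University of Chinese Academy of Sciences, 2010.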

Files in this item
File name/size	Document type	Version type	Access type	License
CASIA_20061801462806 (1307KB)			Restricted access	CC BY-NC-SA