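Research on and Implementation of Human-Computer Interactive Machine Translation Methods (人机交互式机器翻译方法研究与实现)
Huang Guoping (黄国平)1,2
Degree type: Doctor of Engineering
Advisor: Zong Chengqing (宗成庆)
2017-05-26
Degree-granting institution: University of Chinese Academy of Sciences
Place of conferral: Beijing
Keywords: statistical machine translation; human-computer interaction; Chinese input method; terminology translation; online learning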
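Abstract: In recent years, machine translation research has made considerable progress. Translation quality keeps improving, and in certain domains and settings machine translation has entered practical use. However, computer-assisted translation software based on translation memory still holds a distinct advantage in the professional translation market. This is because, within a given domain, when the text to be translated closely matches entries in the memory, the translation-memory output is clearly better than automatic machine translation. In most cases, professional translators do not even want to spend much time reading the automatic translations. Yet the productivity of computer-assisted translation has also reached a bottleneck. Therefore, studying methods and implementation techniques for human-computer interactive machine translation, in order to further improve human translation efficiency, has significant theoretical and practical value for improving machine translation quality and promoting the use of machine translation in specialized domains.
Starting from an examination of the characteristics of statistical machine translation and computer-assisted translation systems, this thesis studies methods and implementation techniques for human-computer interactive machine translation. The main work and contributions are summarized as follows: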
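1. A Chinese input method that integrates statistical machine translation
In practice, people often use only the final output of a machine translation system. The drawback is that when the output does not meet the quality requirement, high-quality intermediate results are discarded along with it, which makes it hard for machine translation to deliver value. We therefore propose a Chinese input method that integrates statistical machine translation. The method draws on translation rules, translation hypothesis lists, and translation candidate lists from the statistical translation system, so that an accurate translation can be produced with relatively few keystrokes. In addition, to guide the statistical machine translation system toward results better suited to this input method, we propose an automatic translation evaluation metric oriented to the input method. Experimental results show that the input method substantially reduces the amount of revision translators must do and significantly improves translation efficiency and quality. The proposed metric also lets the input method draw on more suitable statistical translation results, further improving human translation efficiency and significantly improving the human-computer interaction experience.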
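2. A term recognition and translation method based on term boundary information
Terminology translation is critical for machine translation in specialized domains, yet existing machine translation systems usually do not handle it specifically. To improve terminology translation quality in specialized domains, we propose a term recognition and translation method based on term boundary information. Because current term recognition performance is still relatively low, the method builds a term decoding procedure on term boundary information and makes full use of terminology translation knowledge mined from parallel sentence pairs and from monolingual Internet corpora. It has three parts: a joint word alignment model that incorporates bilingual term recognition, for mining terminology translation knowledge from parallel sentence pairs; a term translation mining method based on bilingual parenthetical sentences, for mining terminology translation knowledge from monolingual corpora; and a term decoding method for statistical translation based on term boundary information. Experimental results show that the proposed method significantly improves the translation accuracy of technical terms in the computer domain and thereby effectively improves the quality of statistical translation output.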
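3. An online learning method for statistical translation based on random forests
To enable the machine translation system to make effective use of the bilingual sentence pairs that translators have already completed during interaction, acquiring translation knowledge in real time and improving the automatic translations, we propose an online learning method for statistical translation based on random forests. During interaction, the method extracts translation knowledge in real time from parallel sentence pairs formed by the input source text and the user feedback, and continually updates the random-forest-based statistical translation model, thereby improving translation quality. Because low-frequency and out-of-vocabulary words directly affect word alignment and translation knowledge extraction, we also propose an anchor-based incremental hidden Markov word alignment method. It uses prior knowledge such as mutual information and dictionaries to generate alignment anchors, and then jointly performs anchor-based bilingual phrase segmentation and hidden Markov word alignment. Simulation results show that, as user feedback accumulates, the online learning method significantly improves the automatic translation of subsequent related sentences, and that its translation quality is comparable to that of an offline baseline trained on a corpus of the same size. The human-computer interaction experience is significantly improved.
Finally, based on the methods above, we designed and implemented a human-computer interactive English-Chinese machine translation system, and summarized the key problems encountered during development and the strategies for addressing them.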
Other abstract: In recent years, research on machine translation (MT) has made great progress, and MT performance has improved considerably. In some specific domains and scenarios, MT has been put into practical use. However, computer-assisted translation (CAT), based on translation memory (TM) rather than MT, still dominates the professional translation market. Occasionally, only the final results of machine translation are displayed for reference. This is because, for sentences that have high fuzzy matches in the TM database, the quality of TM output is still significantly higher than that of MT. In most cases, professional translators do not even want to spend time reading automatic translations. In such a scenario, the current use of MT is greatly limited. At the same time, the productivity of CAT has reached a bottleneck. Therefore, it is of great theoretical and practical value to study how to combine MT with CAT to further improve the efficiency of human translation and promote the application of MT in specialized areas.
Based on a detailed analysis of the advantages and disadvantages of MT and CAT, this thesis proposes and implements approaches to human-computer interactive machine translation. The main contributions are summarized as follows:
1. A Novel Input Method for Translation
In the current CAT environment, translators use only the final result of the underlying MT system. To give MT fuller play and to improve the human-computer interaction experience, we propose a novel input method that makes full use of the knowledge produced by SMT systems, such as translation rules, decoding hypotheses and n-best translation lists. The proposed input method contains two parts: a phrase generation model, which allows human translators to type target sentences quickly, and an n-gram prediction model, which helps users select correct MT fragments with little effort. In addition, to tune the underlying SMT system to generate results better suited to the input method, we design a new evaluation metric for the MT system. Extensive experiments demonstrate that our methods greatly reduce keystrokes and translation time, and significantly improve the efficiency of human translation.
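As a rough illustration of how n-best lists can reduce keystrokes, the sketch below indexes word n-grams from an n-best list and suggests fragments that start with what the user has typed so far. It is a toy under stated assumptions, not the thesis's phrase generation or n-gram prediction model: the function names `build_fragment_index` and `suggest`, the additive scoring, and the use of space-separated English target text are all hypothetical choices made for illustration.

```python
from collections import defaultdict


def build_fragment_index(nbest, max_len=3):
    """Collect word n-grams (up to max_len words) from an n-best list.

    nbest is a list of (hypothesis, score) pairs. Each fragment accumulates
    the scores of the hypotheses it occurs in (an assumed scoring rule).
    """
    scores = defaultdict(float)
    for hypothesis, score in nbest:
        words = hypothesis.split()
        for i in range(len(words)):
            for n in range(1, max_len + 1):
                if i + n <= len(words):
                    scores[" ".join(words[i:i + n])] += score
    return scores


def suggest(scores, typed_prefix, k=3):
    """Return up to k fragments starting with typed_prefix.

    Ranking: higher score first, then longer fragment, then alphabetical.
    """
    matches = [(frag, s) for frag, s in scores.items()
               if frag.startswith(typed_prefix)]
    matches.sort(key=lambda item: (-item[1], -len(item[0]), item[0]))
    return [frag for frag, _ in matches[:k]]


nbest = [("machine translation is useful", 0.6),
         ("machine translation helps", 0.4)]
index = build_fragment_index(nbest)
print(suggest(index, "mach", k=2))
```

With this ranking, a fragment shared by several hypotheses is offered before fragments found in only one, so a few typed characters can select a multi-word span.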
2. Flexible Terminology Translation Approaches
Terminology translation is essential for machine translation in specialized areas. However, current MT systems usually do not treat it specifically. To improve the quality of terminology translation, we propose flexible terminology translation approaches. The proposed approaches contain three parts: a joint model that extracts terminology translation knowledge from parallel sentences by jointly conducting bilingual term detection and word alignment, an approach that learns terminology translation knowledge from parenthetical sentences on the Internet, and a terminology translation method that incorporates identified term boundary information. Experiments show that our proposed approaches substantially enhance the performance of domain-specific terminology translation and sentence translation.
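To make the idea of parenthetical sentences concrete, the sketch below pulls candidate term pairs out of text of the form "Chinese term (English term)". It is only a pattern-matching illustration under assumed conventions, not the mining approach of the thesis: the regular expression, the 2 to 10 character limit on the Chinese side, and the name `mine_candidates` are hypothetical.

```python
import re

# A run of 2-10 CJK characters, then an English phrase in ASCII or
# full-width parentheses. The length limit is an arbitrary assumption.
PAREN = re.compile(
    r"([\u4e00-\u9fff]{2,10})[(\uff08]([A-Za-z][A-Za-z \-]*[A-Za-z])[)\uff09]"
)


def mine_candidates(sentence):
    """Return (Chinese candidate, English term) pairs found in sentence."""
    return [(m.group(1), m.group(2)) for m in PAREN.finditer(sentence)]


print(mine_candidates("术语如下:翻译记忆(translation memory)。"))
```

The pattern simply takes the CJK characters immediately before the parenthesis, so when no punctuation precedes the term it will over-extend to the left. Deciding where the term actually begins is the boundary problem that the abstract says the proposed approaches address.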
3. An Online Learning Method for Translation Models Based on Online Random Forests
Professional translators expect the underlying MT system to learn in real time during human-computer interaction and to improve subsequent translation results. To make the most of up-to-the-minute human translations, we propose an online learning method for the translation model based on online random forests (ORFs). The method continually extracts translation knowledge from each single parallel sentence pair produced by user feedback, and updates the translation model in real time to improve the automatic translations. In addition, to extract translation knowledge for low-frequency and unknown words, we also propose an anchor-based hidden Markov model (HMM) word alignment method. Simulation results demonstrate that the proposed online learning method significantly improves translation quality as the number of feedback sentences increases, and that the translation quality is comparable to that of the offline baseline system trained on all the training data. The human-computer interaction experience is significantly improved.
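The sketch below illustrates only the incremental flavor of learning from one feedback sentence pair at a time. It keeps running co-occurrence counts and uses pointwise mutual information (PMI) to pick candidate alignment anchors. It does not implement online random forests or HMM word alignment, and the class name `IncrementalCooccurrence`, the sentence-level counting, and the threshold are assumptions made for illustration.

```python
import math
from collections import Counter


class IncrementalCooccurrence:
    """Sentence-level co-occurrence counts, updated one pair at a time."""

    def __init__(self):
        self.n = 0             # number of sentence pairs seen so far
        self.src = Counter()   # sentences containing each source word
        self.tgt = Counter()   # sentences containing each target word
        self.pair = Counter()  # sentences containing both words

    def update(self, src_sentence, tgt_sentence):
        """Absorb one (source, user-confirmed translation) pair."""
        s_words = set(src_sentence.split())
        t_words = set(tgt_sentence.split())
        self.n += 1
        self.src.update(s_words)
        self.tgt.update(t_words)
        self.pair.update((s, t) for s in s_words for t in t_words)

    def pmi(self, s, t):
        """PMI of a word pair; -inf if the pair never co-occurred."""
        joint = self.pair[(s, t)]
        if joint == 0:
            return float("-inf")
        return math.log(joint * self.n / (self.src[s] * self.tgt[t]))

    def anchors(self, src_sentence, tgt_sentence, threshold=0.0):
        """Word pairs whose PMI exceeds threshold, as anchor candidates."""
        pairs = [(s, t)
                 for s in src_sentence.split()
                 for t in tgt_sentence.split()
                 if self.pmi(s, t) > threshold]
        return sorted(pairs)


stats = IncrementalCooccurrence()
for src, tgt in [("the cat", "le chat"), ("the dog", "le chien"),
                 ("a cat", "un chat")]:
    stats.update(src, tgt)
print(stats.anchors("the cat", "le chat"))
```

Each call to `update` costs time proportional to the product of the two sentence lengths and needs no retraining pass, which is the property that matters in an interactive setting.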
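Document type: Thesis
Item identifier: http://ir.ia.ac.cn/handle/173211/14814
Collection: Graduates_Doctoral Dissertations
Author affiliations: 1. University of Chinese Academy of Sciences
2. National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences
Recommended citation
GB/T 7714
Huang Guoping. Research on and Implementation of Human-Computer Interactive Machine Translation Methods[D]. Beijing: University of Chinese Academy of Sciences, 2017.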
Files in this item
File name/size: 黄国平.人机交互式机器翻译方法研究与实现 (5563KB)
Document type: Thesis
Access type: Not yet open
License: CC BY-NC-SA