Research on English-Chinese Name Transliteration Methods

	英汉人名音译方法研究 (Research on English-Chinese Name Transliteration Methods)
Other title	Research on English-Chinese Name Transliteration
	邹波
	2008-05-30
Degree type	Master of Engineering
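Chinese abstract (translated)	Name translation takes a person's name written in a source language as input and outputs its rendering in a target language. During translation, the source-language name is adapted to the conventions of the target language while keeping the pronunciation in the two languages essentially the same. Automatic name translation is an important component of many cross-language applications. In recent years name transliteration has attracted growing attention, especially when the character sets of the two languages differ greatly, as with English and Chinese. Although there are many Chinese-English cross-language applications, automatic transliteration between the two languages still lacks comprehensive, systematic study. This thesis studies English-Chinese name transliteration and systematically compares the performance of several transliteration models on the task. The main contributions are as follows. (1) English-Chinese name transliteration is cast as a sequence labeling problem and tackled with three machine learning methods: memory-based learning, the maximum entropy model, and the conditional random field model. Experiments compare the transliteration performance of these methods on several feature sets and show that, given the same features, the conditional random field model performs best. (2) Two statistical translation models, one phrase-based and one based on bilingual N-grams, are applied to English-Chinese name transliteration and their performance is compared. Experiments show that when the translation model and the language model are obtained from the same training corpus, the bilingual N-gram transliteration model outperforms the phrase-based one. The two statistical transliteration models are also examined under different language models; the results show that a good language model has a strong reranking effect and can substantially improve English-Chinese name transliteration performance. (3) Analysis of the experimental results of these five transliteration models shows that, although their performance differs, the differences are not pronounced and their outputs overlap heavily. Moreover, the correct result appears in the candidate list in most cases, but often at a low rank. This may indicate that statistical methods alone are not sufficient for English-Chinese name transliteration and that other means are needed to obtain better results. Guided by this idea, the thesis designs an English-Chinese bilingual name transliteration system that combines web mining with statistical transliteration, and implements its statistical transliteration module. This work lays the foundation for developing such a combined system.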
English abstract	Transliteration is the process of mapping source-language phonemes or graphemes into target-language approximations. It is useful in many cross-language applications. Recently, automatic transliteration has attracted increasing attention, especially between languages whose character representations differ significantly, e.g., English and Chinese. Although there are many English-Chinese cross-language applications, transliteration between the two languages has not been studied comprehensively. This thesis concentrates on English-Chinese name transliteration and compares the performance of several transliteration methods. The contributions of this work are summarized as follows: (1) Viewing transliteration as a sequence-tagging problem, we test three machine learning methods (memory-based learning, the maximum entropy model, and conditional random fields) and compare their performance on different feature sets. Conditional random fields give the best performance when the same features are used. (2) We also apply two machine translation approaches, the phrase-based model and the bilingual N-gram model, to the transliteration problem. The influence of the size of the language models on transliteration performance is also studied. Experiments show that the reranking capability of language models is crucial to transliteration performance. (3) The above results show that the performance differences among the statistical models are small and that their transliteration results overlap substantially. The candidate lists usually contain the correct name, so selecting the best one is largely a matter of ranking. To compensate for the shortcomings of the statistical models, we design a name transliteration system that integrates statistical models with web mining, and implement the statistical transliteration module within that framework.
Keywords	English-Chinese Name Transliteration; Machine Learning; Statistical Translation Model; Web Mining
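Language	Chinese
Document type	Thesis
Item identifier	http://ir.ia.ac.cn/handle/173211/7450
Collection	Graduates_Master's theses
Recommended citation (GB/T 7714)	邹波. 英汉人名音译方法研究 [Research on English-Chinese Name Transliteration Methods][D]. Institute of Automation, Chinese Academy of Sciences. Graduate University of the Chinese Academy of Sciences, 2008.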
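
Illustrative note on contribution (1) of the abstract: the record does not say how the thesis encodes transliteration as sequence tagging, so the following is only a minimal sketch under assumptions. It assumes each English letter chunk is already aligned to one Chinese character, and it uses a hypothetical tag scheme in which the first letter of a chunk is tagged "B-<character>" and the remaining letters are tagged "I". The function names and the scheme are illustrative, not the author's.

```python
from typing import List, Tuple

# Hypothetical input format: (English letter chunk, aligned Chinese character).
Alignment = List[Tuple[str, str]]


def to_tagged_sequence(alignment: Alignment) -> List[Tuple[str, str]]:
    """Expand a chunk-level alignment into one (letter, tag) pair per letter.

    Assumed tag scheme (not taken from the thesis): the first letter of a
    chunk carries 'B-<hanzi>', every following letter of the chunk carries 'I'.
    """
    tagged = []
    for chunk, hanzi in alignment:
        for position, letter in enumerate(chunk):
            tag = f"B-{hanzi}" if position == 0 else "I"
            tagged.append((letter, tag))
    return tagged


def decode_tags(tags: List[str]) -> str:
    """Recover the Chinese name from a tag sequence by reading the 'B-' tags."""
    return "".join(tag[2:] for tag in tags if tag.startswith("B-"))


# "Smith" is conventionally rendered as 史密斯; the chunking is an assumption.
smith = [("s", "史"), ("mi", "密"), ("th", "斯")]
tagged = to_tagged_sequence(smith)
print(tagged)
print(decode_tags([tag for _, tag in tagged]))  # prints 史密斯
```

Under this encoding, a tagger such as a conditional random field would be trained to predict one tag per English letter, and decoding the predicted tags yields the Chinese name.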

Files in this item
File name/size	Document type	Version type	Access type	License
CASIA_20052801462806 (1233KB)			Not yet open	CC BY-NC-SA