Research on Chinese-English Named Entity Translation and Alignment Methods (汉英命名实体翻译及对齐方法研究)


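Title	Research on Chinese-English Named Entity Translation and Alignment Methods
Alternative title	Chinese-English Named Entity Translation and Alignment
Author	Chen Yufeng (陈钰枫)
Date	2008-06-01
Degree type	Doctor of Engineering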
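Abstract (translated from the Chinese)	Named entity (NE) translation and bilingual NE alignment aim to convert and match named entities between two languages, and are an important task in multilingual information processing fields such as machine translation and cross-lingual information retrieval. In machine translation systems in particular, the quality of NE translation is one of the key factors affecting system performance. Bilingual NE alignment, meanwhile, not only produces bilingual NE dictionaries that assist translation, but also directly affects the quality of phrase-pair extraction during statistical machine translation training. Research on NE translation and alignment methods is therefore crucial to improving machine translation performance and has both theoretical significance and practical value. Starting from the characteristics of named entities themselves and drawing on various machine translation methods, this thesis studies Chinese-English NE translation and bilingual NE alignment in depth through extensive data analysis and experiments. The main work is summarized as follows:
(1) Based on an analysis and summary of NE translation characteristics, a method for word alignment within named entities is proposed and an NE translation framework is established. Because different categories of named entities have their own translation characteristics and regularities, good translation requires starting from the characteristics of each category and fully exploiting the available entity information. Using a large-scale Chinese-English bilingual NE corpus, the thesis analyzes the translation characteristics of person names, location names and organization names and proposes an NE-internal word alignment method. It then statistically analyzes the internal alignment information of these three categories (transliteration and semantic translation regularities, among others) and compares the translation emphasis of each. On this basis, an overall framework for NE translation is built.
(2) Based on the structural characteristics of organization names, a structure-based method for translating Chinese organization names is proposed. Among all named entities, organization names have the largest granularity, the most complex internal structure and the most varied forms. How to make full use of the inherent structure of organization names in translation is one of the focuses of this thesis. First, the thesis gives a definition of "chunk" and decomposes organization names structurally with the chunk as the unit. Then, according to semantic relations and positional regularities, it divides the constituent chunks of organization names into three types and uses this chunk structure to describe all patterns of organization name translation. Finally, following the ordering regularities of chunk translation, it implements organization name translation through the derivation process of a hierarchical synchronous context-free grammar. The method has a clear advantage in word reordering for organization name translation and achieves good translation results. Experiments show that adding this module to a phrase-based statistical machine translation system effectively improves system performance.
(3) Based on a theoretical derivation framework for bilingual NE alignment, three alignment modes for bilingual entities are implemented, and, building on experimental analysis, an approach that combines bilingual entity recognition with alignment is proposed. In the bilingual NE alignment task, recognition quality and alignment performance are closely related. It is therefore necessary to place recognition and alignment within the same theoretical framework and analyze the factors through which they affect each other. To this end, the thesis first gives a theoretical derivation of the bilingual NE alignment task, establishes three alignment tasks through a series of conditional assumptions and problem transformations, and then implements each of the three alignment modes. The experiments show that bilingual entity recognition errors severely limit improvements in alignment performance, yet the commonly used alignment features cannot effectively overcome the negative effect of recognition errors. After weighing these problems, the thesis proposes combining recognition and alignment: a refinement alignment method is introduced that integrates the two processes.
(4) Based on the characteristics of bilingual NE recognition and alignment, a bilingual NE alignment method based on translation ratio and type constraints is proposed. Large-scale corpus analysis shows that the way a named entity is translated (transliteration or semantic translation) is closely related to its type, and that the proportion between semantic translation and transliteration differs greatly across entity types (the share of semantic translation in the overall translation is defined as the translation ratio). In addition, the two entities in each NE translation pair should have the same type. Based on this analysis, a bilingual NE alignment model based on translation ratio and type constraints is proposed, comprising basic alignment and refinement alignment. Experiments show that the model not only significantly improves Chinese-English NE alignment performance but also effectively improves the precision and recall of Chinese-English NE recognition, with an especially large improvement in determining entity types.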
English abstract	Named entities (NEs), especially person, location and organization names, convey essential meaning in human languages. NE translation and bilingual NE alignment are therefore very important in multilingual language processing tasks such as machine translation and cross-lingual information retrieval. In a statistical machine translation (SMT) system in particular, NE translation is an important factor affecting system performance. Moreover, bilingual NE alignment, which extracts NE pairs from a bilingual corpus, not only constructs a bilingual NE dictionary that assists machine translation, but also affects the quality of phrase pair extraction in the SMT training process. Research on NE translation and bilingual NE alignment is thus crucial for improving the performance of machine translation. The main contributions and novelties are summarized as follows:
(1) Study on the NE translation properties of different NE types and two approaches to NE internal word alignment, as well as an NE translation framework.
(2) Study on a structure-based model for Chinese organization name translation. First, the inherent structures of organization names are analyzed using an appropriate chunk unit, which reveals that the components of organization names follow a definite pattern and allows three types of chunks to be defined. A hierarchical synchronous CFG (context-free grammar) derivation is then proposed to implement organization name translation. The experimental results show that the proposed model translates Chinese organization names into English well and yields a significant improvement in translation quality when it is integrated into a statistical machine translation system.
(3) Study on a theoretical framework of bilingual NE alignment and different alignment strategies. After constructing a general theoretical framework of bilingual NE alignment, we propose three alignment strategies accordingly and implement each of them. In the experiments, we find that NE recognition errors carried into the NE alignment stage have a strong negative effect on the final output. A refinement alignment approach, which identifies and aligns bilingual NEs jointly, is therefore introduced to recover from this error propagation.
(4) Study on a novel bilingual NE alignment model with translation ratio and NE type constraints. Based on a bilingual NE corpus, it is observed that whether a given NE is translated semantically or phonetically depends greatly on its entity type, and that the entities within an aligned pair should share the same type. Accordingly, we propose a novel bilingual NE alignment model that combines basic alignment and refinement alignment. The experimental results show that the model significantly improves Chinese-English NE alignment quality as well as NE recognition performance.
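The translation ratio named in contribution (4) is defined in the Chinese abstract as the share of semantic translation in the overall translation, and the same contribution requires both sides of an aligned pair to have the same type. The following is a minimal sketch of those two ideas only, not the thesis's alignment model. The labels `PER`, `LOC`, `ORG`, `semantic` and `phonetic`, the function names, and the toy annotations are all assumptions made for illustration.

```python
from collections import defaultdict


def translation_ratios(aligned_words):
    """Return, per entity type, the share of word pairs translated semantically.

    aligned_words: iterable of (entity_type, mode) pairs, where mode is
    "semantic" or "phonetic" (hypothetical labels for this sketch).
    """
    semantic = defaultdict(int)
    total = defaultdict(int)
    for entity_type, mode in aligned_words:
        if mode not in ("semantic", "phonetic"):
            raise ValueError(f"unknown translation mode: {mode!r}")
        total[entity_type] += 1
        if mode == "semantic":
            semantic[entity_type] += 1
    return {t: semantic[t] / total[t] for t in total}


def same_type(zh_type, en_type):
    """Type constraint: both sides of an aligned pair must share one type."""
    return zh_type == en_type


# Invented toy annotations, not data from the thesis.
toy = [
    ("PER", "phonetic"), ("PER", "phonetic"),
    ("LOC", "phonetic"), ("LOC", "semantic"),
    ("ORG", "semantic"), ("ORG", "semantic"),
    ("ORG", "semantic"), ("ORG", "phonetic"),
]
ratios = translation_ratios(toy)
print(ratios)
```

On this toy input the ratio is lowest for person names and highest for organization names, which only illustrates the kind of per-type difference the abstract reports. How the thesis actually uses the ratio inside its alignment scores is not stated in this record.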
Keywords	Machine Translation (机器翻译); Named Entity (命名实体); Named Entity Translation (命名实体翻译); Bilingual Named Entity Alignment (双语命名实体对齐)
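Language	Chinese
Document type	Thesis
Item identifier	http://ir.ia.ac.cn/handle/173211/6111
Collection	Graduates_Doctoral Dissertations
Recommended citation (GB/T 7714)	Chen Yufeng. Research on Chinese-English Named Entity Translation and Alignment Methods [D]. Institute of Automation, Chinese Academy of Sciences. Graduate University of Chinese Academy of Sciences, 2008.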

Files in this item
File name/size	Document type	Version type	Access type	License
CASIA_20051801462800 (1131KB)			Not yet open	CC BY-NC-SA