Research on Chinese-English Entity Translation and Entity Pair Extraction

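	Research on Chinese-English Entity Translation and Entity Pair Extraction
Alternative title	Chinese-English Named Entity Translation
	Lu Min (陆敏)
	2006-06-19
Degree type	Master of Engineering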
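Abstract (translated from Chinese)	Named entity translation and the extraction of bilingual named entity pairs play an important role in statistical machine translation, cross-lingual retrieval, and related fields, and are therefore receiving growing attention from researchers. Because this is an emerging direction, techniques for named entity translation and bilingual correspondence extraction are still immature, and many problems remain to be studied and solved. This thesis first designs an overall framework for acquiring named entity translations that integrates translation and extraction, and then focuses on two of its components: organization name translation and named entity pair extraction from bilingual comparable corpora. The main contributions are as follows.
(1) An overall framework that combines Chinese-English named entity translation and extraction. In this framework, a Chinese named entity is either translated directly into the corresponding English named entity by the translation module, or a set of translation candidates is generated and then re-evaluated by a Web search module to obtain the correct English named entity that matches common usage. In addition, bilingual corpora (including bilingual comparable corpora and mixed Chinese-English corpora) can be collected from the Internet, and Chinese-English named entity correspondences can be extracted from them to build a bilingual named entity list that assists the translation module.
(2) A rule-constrained method for Chinese-English organization name translation. The method extracts a set of keyword-based translation rules tailored to the characteristics of Chinese-English organization name translation, and applies these rules in both the training and the decoding of statistical machine translation. Specifically, the translation rules are combined with several other statistical models under the maximum entropy machine translation framework. These statistical models are the four phrase translation models commonly used in statistical machine translation, a phrase penalty model, a lexical mapping model, and a permutation model. Experiments show that the translation rules help in both training and decoding, and that the method outperforms two baseline systems on every evaluation metric.
(3) A multi-feature method for extracting named entity pairs from comparable corpora. The method combines features internal and external to named entities to extract bilingual named entity pairs from comparable corpora: a transliteration feature, a context feature, a translation feature, and a word-length feature. When computing feature scores, the thesis exploits the specific properties of each of the three types of named entities; in particular, the translation feature score takes into account the positional correspondence between words in translation. Experiments show that both internal and external features contribute positively to bilingual named entity extraction, and that the proposed translation score clearly outperforms existing translation scores that ignore word position.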
English abstract	Named entity translation and bilingual named entity extraction are important in many natural language processing tasks, such as machine translation and cross-lingual information retrieval, and they are attracting increasing attention from researchers. As new areas, the technologies for named entity translation and extraction are not fully developed, and many problems remain to be studied and solved. This thesis designs a framework for obtaining translations of named entities that combines translation and extraction, and concentrates on a method for organization name translation and a method for extracting named entity pairs from bilingual comparable corpora.
(1) We design a framework for Chinese-English entity translation and extraction. In this framework, Chinese named entities are either translated directly into the corresponding English named entities by the translation module, or translation candidates are generated and then evaluated by a Web search module to obtain the correct English translations. In addition, we extract Chinese-English named entity pairs from bilingual corpora on the Internet, including bilingual comparable corpora and Web corpora, so that named entity translation lists can be constructed to assist translation.
(2) We design and implement a rule-constrained Chinese-English organization name translation method. In this method, a series of keyword-triggered translation rules is generated according to the characteristics of Chinese-English organization name translation, and the rules are used in both the training and the decoding process of statistical organization name translation. In more detail, we integrate the translation rules with several other statistical models under the framework of maximum entropy statistical machine translation. These statistical models include four types of phrase translation models, word penalty, a lexical mapping model, and a permutation model. The experimental results show that the translation rules play a positive role in both training and decoding, and that the rule-constrained Chinese-English organization name translation method performs better than the two baseline systems.
(3) We design and implement a multi-feature method for extracting Chinese-English named entity pairs from bilingual comparable corpora. This method integrates features inside and outside named entities to extract bilingual named entity pairs from comparable corpora. These features include a transliteration feature, a contextual feature, a word translation feature, and a length feature. In calculating the feature scores, we make full use of the characteristics of the three types of named entities. In particular, we account for changes in word order in translation when computing the translation feature score. The experimental results show that all features are useful, and that this way of calculating the translation score performs better than a method that does not consider word order.
Keywords	Named Entity; Named Entity Translation; Bilingual Named Entity Extraction; Comparable Corpus
Language	Chinese
Document type	Thesis
Identifier	http://ir.ia.ac.cn/handle/173211/7418
Collection	Graduates_Master's Theses
Recommended citation (GB/T 7714)	陆敏. 汉英实体翻译与实体对抽取技术研究[D]. 中国科学院自动化研究所. 中国科学院研究生院,2006.
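
Point (3) of the abstract contrasts a translation score that respects word position with one that ignores it. The abstract gives no formula, so the following is only a minimal sketch under stated assumptions, not the thesis's actual method. The toy lexicon, the relative-position credit, the choice to try both monotone and inverted word order, and the weighted-sum combination of features are all hypothetical choices made for illustration. The transliteration and context features are not implemented here.

```python
from typing import Dict, List, Set

# Hypothetical toy bilingual lexicon (Chinese word -> English translations).
LEXICON: Dict[str, Set[str]] = {
    "中国": {"china"},
    "银行": {"bank"},
    "北京": {"beijing", "peking"},
    "大学": {"university"},
}


def _rel(index: int, length: int) -> float:
    """Relative position in [0, 1]; a single-token name sits at 0.5."""
    return index / (length - 1) if length > 1 else 0.5


def bag_translation_score(zh: List[str], en: List[str],
                          lexicon: Dict[str, Set[str]] = LEXICON) -> float:
    """Position-blind baseline: share of Chinese words that have some
    dictionary translation anywhere in the English candidate."""
    if not zh:
        return 0.0
    en_set = {w.lower() for w in en}
    return sum(1 for z in zh if lexicon.get(z, set()) & en_set) / len(zh)


def positional_translation_score(zh: List[str], en: List[str],
                                 lexicon: Dict[str, Set[str]] = LEXICON) -> float:
    """Assumed position-aware variant: each dictionary match earns credit
    1 - |p - q|, where p and q are relative positions of the two words.
    The pair is scored once assuming monotone order and once assuming
    fully inverted order, and the better of the two is kept."""
    if not zh or not en:
        return 0.0
    en_lower = [w.lower() for w in en]
    best = 0.0
    for inverted in (False, True):
        total = 0.0
        for i, z in enumerate(zh):
            p = _rel(i, len(zh))
            credit = 0.0
            for j, e in enumerate(en_lower):
                if e in lexicon.get(z, set()):
                    q = _rel(j, len(en_lower))
                    if inverted:
                        q = 1.0 - q
                    credit = max(credit, 1.0 - abs(p - q))
            total += credit
        best = max(best, total / len(zh))
    return best


def combine(features: Dict[str, float], weights: Dict[str, float]) -> float:
    """Assumed combination: weighted sum of feature scores.
    Features without a weight contribute nothing."""
    return sum(weights.get(name, 0.0) * value
               for name, value in features.items())


if __name__ == "__main__":
    # Inverted order is common in organization names, so this scores 1.0.
    print(positional_translation_score(["中国", "银行"], ["Bank", "of", "China"]))
    # Both words are found, but their positions fit poorly: 0.75 versus 1.0.
    scrambled = ["University", "Beijing", "News"]
    print(positional_translation_score(["北京", "大学"], scrambled))
    print(bag_translation_score(["北京", "大学"], scrambled))
```

In this sketch the position-blind score cannot tell the two candidates apart, while the position-aware score penalizes the candidate whose matched words sit in implausible places. The weights passed to `combine` would have to be tuned on held-out data; the abstract does not say how the thesis sets them.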

Files in this item
File name/size	Document type	Version type	Access type	License
CASIA_20042801462804 (919KB)			Not yet open	CC BY-NC-SA