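Research on Chinese-English Entity Translation and Entity Pair Extraction (汉英实体翻译与实体对抽取技术研究)
Alternative Title: Chinese-English Named Entity Translation
陆敏 (Lu Min)
Subtype: Master of Engineering
Thesis Advisor: 赵军 (Zhao Jun)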
2006-06-19
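Degree Grantor: Graduate University of the Chinese Academy of Sciences (中国科学院研究生院)
Place of Conferral: Institute of Automation, Chinese Academy of Sciences (中国科学院自动化研究所)
Degree Discipline: Pattern Recognition and Intelligent Systems
Keyword: Named Entity (命名实体); Named Entity Translation (命名实体翻译); Bilingual Named Entity Pair Extraction (双语命名实体对抽取); Comparable Corpus (可比语料)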
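Abstract: Named entity translation and the extraction of bilingual named entity pairs play an important role in statistical machine translation, cross-language retrieval, and related fields, and are therefore receiving growing attention from researchers. Because the area is new, the techniques for named entity translation and bilingual correspondence extraction are still immature, and many problems remain to be studied and solved.

This thesis first designs an overall framework for acquiring named entity translations that tightly couples translation with extraction, and then focuses on two of its components: organization name translation and named entity pair extraction from bilingual comparable corpora. The main contributions are as follows.

(1) An overall framework combining Chinese-English named entity translation and extraction. In this framework, a Chinese named entity is either translated directly into the corresponding English named entity by the translation module, or a set of translation candidates is generated and then re-evaluated by a Web retrieval module to obtain the correct English named entity that matches common usage. In addition, bilingual corpora (including bilingual comparable corpora and mixed Chinese-English corpora) can be collected from the Internet. Chinese-English named entity correspondences extracted from these corpora yield a bilingual named entity list that assists the translation module.

(2) A rule-constrained method for Chinese-English organization name translation. Based on the characteristics of Chinese-English organization name translation, the method extracts a set of keyword-based translation rules and applies them in both the training and the decoding of statistical machine translation. Specifically, the translation rules are combined with several statistical models within a maximum entropy machine translation framework. These statistical models are: four phrase translation models commonly used in statistical machine translation, a phrase penalty model, a lexical mapping model, and a permutation model. Experiments show that the translation rules help in both training and decoding, and that the method outperforms two baseline systems on every evaluation metric.

(3) A multi-feature method for extracting named entity pairs from comparable corpora. The method combines features internal and external to named entities to extract bilingual named entity pairs from comparable corpora: a transliteration feature, a context feature, a translation feature, and a word length feature. When computing feature scores, the thesis exploits the specific characteristics of each of the three named entity types. In particular, the translation feature score takes into account the positional correspondence of words under translation. Experiments show that both internal and external features contribute to bilingual named entity extraction, and that the proposed translation score clearly outperforms existing translation scores that ignore word position.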
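Item (2) of the abstract combines translation rules with several statistical models inside a maximum entropy translation framework. In such a framework each candidate translation is typically scored as a weighted sum of log feature values, and the highest-scoring candidate is chosen. The following is a minimal sketch of that scoring step only, not the thesis implementation: the feature names, weights, feature values, and the example rule are all hypothetical, since the abstract does not give them.

```python
import math

def loglinear_score(features, weights):
    """Return sum_i lambda_i * log h_i for one candidate translation."""
    return sum(weights[name] * math.log(value) for name, value in features.items())

def best_candidate(candidates, weights):
    """Pick the candidate with the highest log-linear score."""
    return max(candidates, key=lambda c: loglinear_score(c["features"], weights))

# Hypothetical weights (lambda_i); in practice they would be tuned on held-out data.
weights = {"phrase_tm": 1.0, "lexical": 0.5, "rule": 2.0}

# Hypothetical candidates for one organization name. The "rule" feature stands
# for agreement with a keyword-triggered rule (assumed example:
# "X 研究所" -> "Institute of X"). All values are made up and must be > 0.
candidates = [
    {"text": "Institute of Automation",
     "features": {"phrase_tm": 0.20, "lexical": 0.30, "rule": 0.9}},
    {"text": "Automation Institute",
     "features": {"phrase_tm": 0.25, "lexical": 0.30, "rule": 0.1}},
]

best = best_candidate(candidates, weights)
print(best["text"])
```

In this toy setup the rule feature outweighs the slightly higher phrase translation probability of the second candidate, which illustrates how a rule can steer decoding without replacing the statistical models.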
Other Abstract: Named entity translation and bilingual named entity extraction are important in many natural language processing tasks, such as machine translation and cross-lingual information retrieval, and are therefore attracting growing attention from researchers. Because these are new areas, the techniques for named entity translation and extraction are not yet mature, and many problems remain to be studied and solved. This thesis designs a framework for obtaining translations of named entities that combines translation with extraction, and concentrates on two methods: organization name translation and the extraction of named entity pairs from bilingual comparable corpora.

(1) We design a framework for Chinese-English entity translation and extraction. In this framework, Chinese named entities are either translated directly into the corresponding English named entities by the translation module, or translation candidates are generated and then evaluated by a Web retrieval module to obtain the correct English translations. In addition, we extract Chinese-English named entity pairs from bilingual corpora collected from the Internet, including bilingual comparable corpora and Web corpora. The resulting named entity translation lists can then assist translation.

(2) We design and implement a rule-constrained Chinese-English organization name translation method. In this method, a set of keyword-triggered translation rules is generated according to the characteristics of Chinese-English organization name translation and is used in both the training and the decoding of statistical organization name translation. More specifically, we integrate the translation rules with several statistical models within a maximum entropy statistical machine translation framework. These statistical models include four types of phrase translation models, a word penalty, a lexical mapping model, and a permutation model. Experimental results show that the translation rules play a positive role in both training and decoding, and that the rule-constrained method outperforms the two baseline systems.

(3) We design and implement a multi-feature method for extracting Chinese-English named entity pairs from bilingual comparable corpora. The method integrates features internal and external to named entities: a transliteration feature, a contextual feature, a word translation feature, and a length feature. When calculating feature scores, we make full use of the characteristics of the three types of named entities. In particular, we take changes of word order in translation into account when computing the translation feature score. Experimental results show that all features are useful and that the proposed translation score performs better than a score that does not consider word order.
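Item (3) of the abstract scores candidate named entity pairs with four features and exploits the characteristics of each entity type. One plausible reading, sketched below under stated assumptions, is a per-type weighted combination of the feature scores followed by keeping the best English candidate for each Chinese entity. The abstract does not say how the features are computed or combined, so the type labels (`PER`, `ORG`), the weights, the threshold, and all scores here are hypothetical placeholders.

```python
FEATURES = ("transliteration", "context", "translation", "length")

# Hypothetical per-type weights: transliteration is assumed to matter more for
# person names, word translation more for organization names.
WEIGHTS = {
    "PER": {"transliteration": 0.6, "context": 0.2, "translation": 0.1, "length": 0.1},
    "ORG": {"transliteration": 0.1, "context": 0.2, "translation": 0.6, "length": 0.1},
}

def combined_score(ne_type, scores):
    """Weighted sum of the four feature scores, using the weights of the entity type."""
    w = WEIGHTS[ne_type]
    return sum(w[name] * scores[name] for name in FEATURES)

def extract_pairs(candidates, threshold=0.5):
    """For each Chinese entity, keep its best English candidate scoring >= threshold."""
    best = {}
    for ne_type, zh, en, scores in candidates:
        s = combined_score(ne_type, scores)
        if s >= threshold and (zh not in best or s > best[zh][1]):
            best[zh] = (en, s)
    return {zh: en for zh, (en, _) in best.items()}

# Made-up feature scores in [0, 1] for candidate pairs from one comparable document pair.
candidates = [
    ("ORG", "自动化研究所", "Institute of Automation",
     {"transliteration": 0.1, "context": 0.8, "translation": 0.9, "length": 0.9}),
    ("ORG", "自动化研究所", "Academy of Sciences",
     {"transliteration": 0.1, "context": 0.4, "translation": 0.2, "length": 0.8}),
    ("PER", "史密斯", "Smith",
     {"transliteration": 0.9, "context": 0.6, "translation": 0.0, "length": 0.8}),
    ("PER", "史密斯", "Jones",
     {"transliteration": 0.1, "context": 0.5, "translation": 0.0, "length": 0.8}),
]

pairs = extract_pairs(candidates)
print(pairs)
```

The position-aware translation score described in the abstract would replace the placeholder `translation` values; its exact definition is not given in this record, so it is left out of the sketch.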
Shelf Number: XWLW1121
Other Identifier: 200428014628047
Language: Chinese
Document Type: Thesis
Identifier: http://ir.ia.ac.cn/handle/173211/7418
Collection: Graduates_Master's Theses (毕业生_硕士学位论文)
Recommended Citation
GB/T 7714
陆敏. 汉英实体翻译与实体对抽取技术研究[D]. 中国科学院自动化研究所. 中国科学院研究生院,2006.
Files in This Item:
File Name/Size: CASIA_20042801462804 (919KB)
Access: Not yet open (暂不开放)
License: CC BY-NC-SA
Items in the repository are protected by copyright, with all rights reserved, unless otherwise indicated.