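Research on Chinese-English Entity Translation and Entity Pair Extraction Techniques (汉英实体翻译与实体对抽取技术研究)
Alternative title: Chinese-English Named Entity Translation
Author: 陆敏
Degree type: Master of Engineering
Advisor: 赵军
Date: 2006-06-19
Degree-granting institution: Graduate School of the Chinese Academy of Sciences
Degree-granting location: Institute of Automation, Chinese Academy of Sciences
Major: Pattern Recognition and Intelligent Systems
Keywords: Named Entity; Named Entity Translation; Bilingual Named Entity Extraction; Comparable Corpus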
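Abstract: Named entity translation and bilingual named entity pair extraction play an important role in fields such as statistical machine translation and cross-lingual retrieval, and are therefore receiving increasing attention from researchers. Because the direction is new, techniques for named entity translation and bilingual correspondence extraction remain immature, and many problems are still open. This thesis first designs an overall framework for acquiring named entity translations that integrates translation with extraction, and then focuses on two of its components: organization name translation and named entity pair extraction from bilingual comparable corpora. The main contributions are as follows:
(1) An overall framework combining Chinese-English named entity translation and extraction. In this framework, a Chinese named entity is either translated directly into the corresponding English named entity by the translation module, or a set of translation candidates is generated and then re-evaluated by a web search module to obtain the correct English named entity that matches common usage. In addition, bilingual corpora (including bilingual comparable corpora and mixed Chinese-English corpora) can be collected from the Internet, and the Chinese-English named entity correspondences extracted from them yield a bilingual named entity list that assists the translation module.
(2) A rule-constrained Chinese-English organization name translation method. The method extracts a series of keyword-based translation rules tailored to the characteristics of Chinese-English organization name translation, and applies them in both the training and the decoding of statistical machine translation. Specifically, the translation rules are combined with several statistical models within a maximum entropy machine translation framework. These statistical models are the four phrase translation models commonly used in statistical machine translation, a phrase penalty model, a lexical mapping model, and a permutation model. Experiments show that the translation rules help in both training and decoding, and that the method outperforms two baseline systems on every evaluation metric.
(3) A multi-feature method for extracting named entity pairs from comparable corpora. The method combines features internal and external to named entities to extract bilingual named entity pairs from comparable corpora: a transliteration feature, a context feature, a translation feature, and a word length feature. When computing feature scores, the thesis exploits the distinct properties of each of the three named entity types; in particular, the translation feature score accounts for the positional correspondence between words under translation. Experiments show that both internal and external features contribute to bilingual named entity extraction, and that the proposed translation score clearly outperforms existing translation scores that ignore word position.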
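The maximum entropy framework mentioned for the organization name translation method combines several model scores log-linearly. The following is a minimal sketch of that general idea only, not the thesis's implementation: the feature names, the weights in `WEIGHTS`, and the helper functions are all hypothetical, and in a real system the weights would be tuned on held-out data.

```python
import math
from typing import Dict, List, Tuple

# Hypothetical feature weights (assumption: not taken from the thesis).
WEIGHTS: Dict[str, float] = {
    "phrase_translation": 1.0,  # log-probability from a phrase translation model
    "phrase_penalty": -0.5,     # number of phrases used
    "lexical_mapping": 0.8,     # log-probability from a lexical model
    "permutation": 0.6,         # log-probability of the reordering
    "rule_match": 1.2,          # 1.0 if a keyword rule licenses the candidate
}

Candidate = Tuple[str, Dict[str, float]]


def loglinear_score(features: Dict[str, float],
                    weights: Dict[str, float] = WEIGHTS) -> float:
    """Weighted sum of feature values; missing features count as 0."""
    return sum(w * features.get(name, 0.0) for name, w in weights.items())


def rank_candidates(candidates: List[Candidate]) -> List[Tuple[str, float]]:
    """Sort translation candidates by log-linear score, best first."""
    scored = [(text, loglinear_score(feats)) for text, feats in candidates]
    return sorted(scored, key=lambda item: item[1], reverse=True)


def posterior(candidates: List[Candidate]) -> Dict[str, float]:
    """Normalize the scores into a distribution over candidates (softmax)."""
    scores = [loglinear_score(feats) for _, feats in candidates]
    top = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return {text: e / total for (text, _), e in zip(candidates, exps)}
```

Under these assumptions, a candidate that a keyword rule licenses gains `1.2` in score over an otherwise identical candidate, which is one simple way a rule can influence decoding.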
Alternative abstract: Named entity translation and bilingual named entity extraction are important in many natural language processing tasks, such as machine translation and cross-lingual information retrieval, and they are attracting growing attention from researchers. As new areas, the technologies of named entity translation and extraction are not yet fully developed, and many problems remain to be studied and solved. This dissertation designs a framework for obtaining translations of named entities that combines translation and extraction, and concentrates on a method for organization name translation and a method for extracting named entity pairs from bilingual comparable corpora.
(1) We design a framework for Chinese-English entity translation and extraction. In this framework, Chinese named entities are translated directly into the corresponding English named entities by a translation module, or translation candidates are generated and then evaluated by a web search module to obtain the correct English translations. We also extract Chinese-English named entity pairs from bilingual corpora collected from the Internet, including bilingual comparable corpora and Web corpora. Named entity translation lists can thus be constructed to assist translation.
(2) We design and implement a rule-constrained Chinese-English organization name translation method. A series of keyword-triggered translation rules is generated according to the characteristics of Chinese-English organization name translation, and the rules are used in the training and decoding processes of statistical organization name translation. In more detail, we integrate the translation rules and several other statistical models under the framework of maximum entropy statistical machine translation. These statistical models include four types of phrase translation models, a word penalty, a lexical mapping model, and a permutation model. Experimental results show that the translation rules play a positive role in both training and decoding, and that the rule-constrained method outperforms the two baseline systems.
(3) We design and implement a multi-feature method for extracting Chinese-English named entity pairs from bilingual comparable corpora. The method integrates features inside and outside named entities: a transliteration feature, a contextual feature, a word translation feature, and a length feature. In calculating feature scores, we make full use of the characteristics of the three types of named entities. In particular, we take changes of word order in translation into account when computing the translation feature score. Experimental results show that all features are useful and that our translation score performs better than a translation score that does not consider word order.
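The four features named for the extraction method (transliteration, context, translation, length) can be illustrated with a small scoring sketch. Everything below is an assumption made for illustration: the feature definitions, the equal weights, and the function names are hypothetical, and the translation score shown is the simple position-free kind, since the abstract does not specify the position-aware formula.

```python
import math
from collections import Counter
from typing import Dict, List, Set


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed with a rolling row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def transliteration_score(romanized_zh: str, en: str) -> float:
    """String similarity between a romanized Chinese name and an English name."""
    a = romanized_zh.lower().replace(" ", "")
    b = en.lower().replace(" ", "")
    if not a and not b:
        return 1.0
    return 1.0 - edit_distance(a, b) / max(len(a), len(b))


def translation_score(zh_words: List[str], en_words: List[str],
                      lexicon: Dict[str, Set[str]]) -> float:
    """Position-free: share of Chinese words with a lexicon translation present."""
    if not zh_words:
        return 0.0
    en = {w.lower() for w in en_words}
    hits = sum(1 for z in zh_words if lexicon.get(z, set()) & en)
    return hits / len(zh_words)


def context_score(zh_context: Counter, en_context: Counter) -> float:
    """Cosine similarity; assumes the Chinese context is already mapped to English."""
    dot = sum(count * en_context[word] for word, count in zh_context.items())
    norm = (math.sqrt(sum(v * v for v in zh_context.values()))
            * math.sqrt(sum(v * v for v in en_context.values())))
    return dot / norm if norm else 0.0


def length_score(n_zh: int, n_en: int) -> float:
    """Ratio of the shorter to the longer length, in words."""
    if n_zh == 0 or n_en == 0:
        return 0.0
    return min(n_zh, n_en) / max(n_zh, n_en)


# Hypothetical equal weights (assumption: not taken from the thesis).
WEIGHTS = {"transliteration": 0.25, "context": 0.25,
           "translation": 0.25, "length": 0.25}


def pair_score(features: Dict[str, float]) -> float:
    """Weighted sum of the four feature scores."""
    return sum(w * features.get(name, 0.0) for name, w in WEIGHTS.items())


def best_match(candidates: Dict[str, Dict[str, float]]) -> str:
    """Pick the English candidate with the highest combined score."""
    return max(candidates, key=lambda en: pair_score(candidates[en]))
```

In a sketch like this, the transliteration feature would matter most for person names and the translation feature for organization names, which is one way the per-type characteristics mentioned in the abstract could be reflected through type-specific weights.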
Call number: XWLW1121
Other identifier: 200428014628047
Language: Chinese
Document type: Thesis
Item identifier: http://ir.ia.ac.cn/handle/173211/7418
Collection: Graduates_Master's Theses
Recommended citation (GB/T 7714):
陆敏. 汉英实体翻译与实体对抽取技术研究[D]. 中国科学院自动化研究所. 中国科学院研究生院,2006.
Files in this item:
File name/size: CASIA_20042801462804 (919KB)
Access: Not yet open
License: CC BY-NC-SA