Research on Chinese Word and Sentence Segmentation Techniques and Machine Translation Evaluation Methods

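	Research on Chinese Word and Sentence Segmentation Techniques and Machine Translation Evaluation Methods
Other title	Approaches to Chinese Word Analysis, Utterance Segmentation and Automatic Evaluation of Machine Translation
	Liu Ding (刘丁)
	2004-06-01
Degree type	Master of Engineering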
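Abstract (translated from the Chinese)	Building on statistical models and drawing on a large body of prior work, this thesis studies Chinese lexical analysis, spoken-utterance segmentation, and machine translation evaluation in some depth. Chinese lexical analysis is the first step in most Chinese language processing, so its importance is self-evident. Utterance segmentation is the bridge between speech recognition and text translation in speech translation: however well speech recognition and text translation perform individually, overall performance cannot improve if this bridge is poorly built. Automatic evaluation of machine translation is important supporting work in building a machine translation system, since it speeds up development and shortens the development cycle. In short, all three topics belong to the foundational research areas of natural language processing, and their quality directly affects the level of higher-level applications. For lexical analysis, we use hidden Markov models (HMMs) to propose a unified lexical analysis method that integrates word segmentation, part-of-speech tagging, and named entity recognition. We started with a class-based HMM, whose advantages are broad word coverage and low system overhead, and whose drawback is that it cannot accurately predict the probability of a word occurring. To improve accuracy, we introduced a word-based HMM, combined the two, and smoothed the word-based HMM with a "word-to-character" probability smoothing method. Experimental results show that our hybrid model, because it jointly takes into account knowledge of characters, words, parts of speech, and named entities, clearly outperforms a purely class-based or purely word-based HMM in segmentation precision and recall. In addition, on the implementation side, we describe the overall framework and functional modules of the general-purpose word segmentation system APCWS and use them to discuss technical details such as how to store and load data efficiently. For spoken-utterance segmentation, we propose a segmentation algorithm based on a bidirectional N-gram model and a maximum entropy model. Because it uses maximum entropy to combine forward and reverse N-gram segmentation, the algorithm takes into account the context on both the left and the right of a candidate segmentation point and thus achieves very good segmentation results. We trained and tested the model on Chinese and English corpora, and the results show that it clearly outperforms basic forward N-gram segmentation. On this basis, we analyzed and compared the segmentation results of the individual models, confirming our original expectation: the model preserves the correct segmentations of the forward N-gram algorithm while using the reverse N-gram algorithm to effectively avoid the forward algorithm's incorrect segmentations. For automatic evaluation of machine translation, we first introduce two commonly used reference-based evaluation algorithms, BLEU and NIST, and then present E3, a sentence fluency evaluation method based on N-gram models. This method requires no reference translations; by treating the transition probabilities of different words in a sentence differently, it achieves very good evaluation results. In summary, this thesis proposes novel methods based on statistical models for Chinese lexical analysis, spoken-utterance segmentation, and machine translation evaluation; they are not only of significant reference value as scientific methods, but for …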
English abstract	This thesis proposes novel statistical approaches to Chinese word analysis, utterance segmentation, and automatic evaluation of machine translation (MT). Word analysis is the first step for most applications based on Chinese language technologies; utterance segmentation is the bridge that connects speech recognition and text translation in a speech translation system; automatic evaluation of MT systems can speed up the research and development of an MT system and reduce its development cost. In short, all three topics belong to the basic research area of Natural Language Processing (NLP) and are significant for many important applications such as text translation and speech translation. In Chinese word analysis, we propose a novel unified approach based on the hidden Markov model (HMM) that efficiently combines word segmentation, part-of-speech (POS) tagging, and named entity (NE) recognition. Our first model is a class-based HMM. To increase its accuracy, we introduce a word-based HMM and combine it with the class-based HMM. Finally, we use a "word-to-character" smoothing method to predict the probability of words that do not occur in the training set. The experimental results show that our combined model, by jointly considering information about Chinese characters, words, POS, and NE, achieves much better precision and recall in Chinese word segmentation than either the class-based or the word-based HMM alone. Building on the combined model, we describe the implementation details of the general word segmentation system APCWS. We discuss some technical problems in saving and loading data, and describe our modules for knowledge management and word lattice construction. In utterance segmentation, this thesis proposes a novel approach based on a bidirectional N-gram model and a maximum entropy model.
This method, which effectively combines the normal (forward) and reverse N-gram algorithms, is able to make use of both the left and right context of a candidate site and achieves very good performance in utterance segmentation. We conducted experiments in both Chinese and English. The results show that our method performs much better than the normal N-gram algorithm alone. By analyzing the experimental results, we found why our method achieves better results: on the one hand it retains the correct segmentations of the normal N-gram algorithm, and on the other hand it avoids incorrect segmentations by making use of the reverse N-gram algorithm. In automatic evaluation of MT systems, we first introduce two classic automatic evaluation methods that rely on reference translations. We then propose a novel sentence fluency evaluation method based on an N-gram model. This method, called E3, does not need any reference translations and achieves very good evaluation performance by discrimi
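Keywords	Machine translation
Language	Chinese
Document type	Thesis
Item identifier	http://ir.ia.ac.cn/handle/173211/6753
Collection	Graduates_Master's theses
Recommended citation (GB/T 7714)	Liu Ding (刘丁). Research on Chinese Word and Sentence Segmentation Techniques and Machine Translation Evaluation Methods[D]. Institute of Automation, Chinese Academy of Sciences. Graduate School of the Chinese Academy of Sciences, 2004.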
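
The bidirectional idea described in the abstract (a forward N-gram model looking at the left context of a candidate site, a reverse N-gram model looking at the right context) can be illustrated with a minimal Python sketch. This is not the thesis's implementation: it assumes bigram models with add-one smoothing, and it replaces the maximum entropy combination with a fixed-weight average of the two boundary probabilities. The names `BOUNDARY`, `train_bigram`, `boundary_prob`, `segment`, and the `weight` and `threshold` parameters are all hypothetical.

```python
from collections import Counter

BOUNDARY = "<b>"  # hypothetical utterance-boundary marker


def train_bigram(utterances, reverse=False):
    """Count contexts and bigrams; reverse=True trains on reversed utterances."""
    contexts, bigrams = Counter(), Counter()
    for utt in utterances:
        tokens = list(reversed(utt)) if reverse else list(utt)
        seq = [BOUNDARY] + tokens + [BOUNDARY]
        contexts.update(seq[:-1])           # every token that has a successor
        bigrams.update(zip(seq, seq[1:]))   # (token, successor) pairs
    return contexts, bigrams


def boundary_prob(model, word, vocab_size):
    """Add-one smoothed probability that a boundary follows `word`."""
    contexts, bigrams = model
    return (bigrams[(word, BOUNDARY)] + 1) / (contexts[word] + vocab_size)


def segment(words, fwd, rev, vocab_size, weight=0.5, threshold=0.2):
    """Cut the word stream where the combined boundary score exceeds threshold.

    Assumption: a fixed-weight average stands in for the maximum entropy
    combination mentioned in the abstract.
    """
    utterances, current = [], []
    for i, word in enumerate(words):
        current.append(word)
        if i == len(words) - 1:
            break
        p_end = boundary_prob(fwd, word, vocab_size)            # left context
        p_start = boundary_prob(rev, words[i + 1], vocab_size)  # right context
        if weight * p_end + (1 - weight) * p_start > threshold:
            utterances.append(current)
            current = []
    if current:
        utterances.append(current)
    return utterances


if __name__ == "__main__":
    # Toy training data, invented for illustration only.
    train = [["how", "are", "you"], ["i", "am", "fine"],
             ["thank", "you"], ["i", "am", "here"]]
    fwd = train_bigram(train)
    rev = train_bigram(train, reverse=True)
    vocab_size = len({w for utt in train for w in utt}) + 1  # + boundary marker
    print(segment(["how", "are", "you", "i", "am", "fine"], fwd, rev, vocab_size))
```

In this toy setup the forward model sees that "you" often ends an utterance and the reverse model sees that "i" often starts one, so the two scores reinforce each other at that site. A real system would use higher-order N-grams and learned combination weights.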