Research on Key Technologies for Base Noun Phrase Recognition (基本名词短语识别的关键技术研究)


	Research on Key Technologies for Base Noun Phrase Recognition (基本名词短语识别的关键技术研究)
Alternative title	Approaches to Base Noun Phrase Chunking
	徐昉
	2007-06-19
Degree type	Master of Engineering
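Chinese abstract	Base noun phrase (base NP) recognition is an important foundational research topic in natural language processing. Its goal is to extract from text simple, non-nested noun phrases that contain no other noun phrases. Base NPs carry rich syntactic and semantic information, and the results of base NP recognition can serve many natural language processing tasks, such as information retrieval and machine translation. Base NP recognition is also a typical classification and sequence labelling problem for which machine learning is an important approach, so studying it can further advance research on machine learning methods. The research therefore has considerable practical value and theoretical significance. This thesis focuses on new methods for base NP recognition using combined classifiers. The main work is summarized as follows: (1) We build a Chinese base NP corpus on the basis of the Penn Chinese Treebank (version 5.0) and extract a larger-scale English corpus from English parsing corpora. (2) We construct a Chinese base NP analyzer with a classifier combination method that joins rules with probability information from the base classifiers. Support vector machines (SVM) and conditional random fields (CRF) are chosen as the base classifiers. To uncover the correct result where the two disagree, we formulate rules targeted at the grammatical structure of Chinese base NPs and take into account the posterior probabilities provided by the CRF model, eliminating ambiguities and errors in the base classifiers. In tests on corpora of different sizes, the method improved the overall recognition results of the system. (3) We apply a new error-driven classifier combination method to Chinese base NP recognition. Starting from a comparison of the outputs of two different types of classifier, transformation-based learning (TBL) and CRF, an SVM learns the error patterns and corrects the cases where the two classifiers disagree, thereby improving overall system performance. In cross-validation experiments on Chinese base NP recognition over the base NP corpus, this method outperforms TBL, CRF, and SVM used individually, reaching an F-measure of 89.72%, a maximum improvement of 2.35% over the other methods discussed in the thesis. (4) We study the theoretical feasibility and practical effect of combined classifiers based on several fusion algorithms for base NP recognition. In the fusion algorithms we make full use of the probability information provided by the classifiers and design classifiers over different feature sets. Test results on the Chinese and English corpora show that introducing probability information and multiple feature sets improves base NP recognition. Among the fusion algorithms, voting with probability information (the VotPro method) achieved the best results.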
English abstract	In natural language processing (NLP), chunking is defined as extracting non-overlapping segments from a stream of text. The task of base noun phrase (base NP) chunking focuses on recognizing simple, non-recursive noun phrases that have no other noun phrase descendants. Base NP chunking is considered a significant and challenging NLP problem of both theoretical and practical value. Many tasks in information retrieval, information extraction, and question analysis can be performed adequately by identifying noun phrases, verb phrases, and so on, together with the relationships between these entities. First, building on previous work on English and Chinese base NP corpora, we used available standard tools, together with manual adjustments based on handcrafted rules, to convert the Penn Chinese Treebank 5.0 into the form required for base NP chunking. We also constructed a large English base NP chunking corpus of about 3,000,000 tokens, extracted from the parsing corpus of the Linguistic Data Consortium (LDC) at the University of Pennsylvania. With these data available for training and testing, we chose Support Vector Machines (SVM) and Conditional Random Fields (CRF) for the chunking task. The remainder of the thesis focuses chiefly on classifier combination approaches to base NP chunking. We discuss the following three sub-topics. First, we propose a hybrid approach to chunking Chinese base NPs that combines SVM and CRF models. To choose between the results of the two chunkers, we use a discriminative post-processing method whose criterion is the conditional probability generated by the CRF chunker. Given the particular structure of Chinese base NPs and a thorough analysis of the results, we also write handcrafted grammar rules to resolve ambiguities and prune errors. In our experiments, the method achieved higher accuracy in the final results.
Second, to overcome some shortcomings of the method described above, we continue with an error-driven combination approach to chunking Chinese base NPs that combines transformation-based learning (TBL) and a CRF model. To analyze the results of the two classifiers and improve chunking performance, an error-driven SVM classifier is designed to learn the errors revealed by comparing the two classifiers' outputs and to correct them. This method achieved higher accuracy in the final results, with an F-measure of 89.72% and an improvement of up to 2.35%. In closing, we put forward some challenging topics for future work on base NP chunking and related shallow parsing methods. In our view, research on sequence labelling problems in natural language processing is significant for the development of the field.
Keywords	Natural Language Processing; Base Noun Phrase Chunking; Shallow Parsing; Sequence Labelling; Multiple Classifier System
Language	Chinese
Document type	Thesis
Item identifier	http://ir.ia.ac.cn/handle/173211/7412
Collection	Graduates_Master's Theses
Recommended citation (GB/T 7714)	徐昉. 基本名词短语识别的关键技术研究[D]. 中国科学院自动化研究所. 中国科学院研究生院, 2007.

Files in this item
File name/size	Document type	Version type	Access type	License
CASIA_20042801462806 (1167KB)			Not yet open	CC BY-NC-SA