Research on Methods of Dependency Parsing and System Implementation (依存句法分析方法研究与系统实现)


	依存句法分析方法研究与系统实现
Alternative title	Research on Methods of Dependency Parsing and System Implementation
	鉴萍
	2010-05-27
Degree type	Doctor of Engineering
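Abstract (translated from Chinese)	Syntactic parsing is one of the key tasks in natural language processing. By resolving the structural ambiguity of natural language sentences, it supports further natural language processing. Syntactic parsing builds on lexical analysis and mainly studies how words and phrases form correct sentences, what roles they play in a sentence, and how they relate to one another. Syntactic structure can be analyzed in several forms. Phrase structure parsing is the traditional research direction, but the advantages of dependency grammar for understanding natural language structure, together with the great value that the word-to-word relations found by dependency parsing have for semantic understanding and other application tasks, have brought dependency parsing increasing attention and wide use in natural language processing. This thesis focuses on the key problem of how to improve the speed and accuracy of dependency parsing algorithms, and presents in-depth research and practice. The main work and contributions are:
(1) A layer-by-layer dependency parsing method based on sequence labeling models is proposed. In dependency parsing, graph-based and transition-based methods are the two mainstream data-driven approaches. They take the whole sentence and a pair of words, respectively, as the basic unit for searching for the optimal structure, and are the two extremes of dependency analysis. We introduce an intermediate structure, the dependency layer, to build the parsing model. Within a layer, the optimal sub-structure of the layer is found by exhaustive search; between layers, the established dependency structure is passed on deterministically. The method has lower algorithmic complexity than typical graph-based models and, compared with transition-based methods, alleviates to some extent the greediness of the deterministic process. In addition, the method uses sequence labeling to search for the dependency structure of a layer, which shows that sequence labeling models can fully replace hierarchical analysis models in solving structure prediction problems such as syntactic parsing. The bottom-up layered structure, the adjacent-relation analysis scheme, and conditional random field sequence labeling give the method very high parsing efficiency while keeping accuracy comparable to mainstream methods; the advantage is especially pronounced when annotating large-scale corpora.
(2) For long Chinese sentences, a simple and efficient two-pass dependency parsing method is proposed. Parsing long Chinese sentences has long been a hard problem in parsing research, and punctuation-based segmentation of long sentences is an effective remedy. The biggest obstacle for traditional segmentation methods that classify punctuation marks is that classification accuracy cannot reach a satisfactory level, which causes segmentation errors and keeps dependency accuracy from improving. The two-step method proposed here uses no punctuation classifier or root finder; instead, it splits the sentence at every comma, semicolon, and colon and introduces a second parsing pass to correct sentence segmentation errors. Experiments show that the method captures the overall structure of long Chinese sentences well: on a deterministic parsing model, the dependency error rate and the root identification error rate drop by about 10% and 16%, respectively, and parsing speed also improves considerably. The method is equally effective when applied to a graph-based parsing model and to our layered model based on sequence labeling. We also find that the more deterministic the original parsing system is, the larger the performance gain from two-step parsing, which further clarifies the essential differences and connections among the three parsing models.
(3) A method for identifying Chinese maximal-length phrases that fuses bidirectional labeling results is proposed, and the identification results are used in Chinese dependency parsing. The pervasive nesting of Chinese phrases creates great difficulty for parsing. If the maximal-length phrase constituents of a sentence can be separated out accurately, the interference that nested phrases cause in parsing can be reduced to a large extent. With a classifier-based deterministic labeling method, the results reveal that maximal-length phrase identification is complementary across the forward and backward directions of a Chinese sentence. Based on this observation, the thesis uses deterministic bidirectional labeling to identify Chinese maximal-length noun phrases and prepositional phrases, and proposes a probabilistic fusion strategy based on "divergence points". Experiments show...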
English abstract	Syntactic parsing is one of the fundamental problems in Natural Language Processing (NLP). It eliminates the structural ambiguities of natural language for more advanced NLP tasks. Building on morphological analysis, parsing studies how words and phrases make up a sentence, and what roles and relations they have within it. Among the syntactic formalisms, phrase structure parsing is the traditional line of work. However, its advantages in understanding natural language structure and its significance for semantic analysis and other purposes have earned dependency parsing more and more attention in recent years. This thesis is devoted to increasing the speed and accuracy of dependency parsing. The novelties and main contributions are summarized as follows:
(1) A layer-by-layer dependency parser based on sequence labeling models is proposed. Graph-based models and transition-based models are the two dominant data-driven paradigms in the dependency parsing community. The units over which they search for the optimal structure are the whole sentence and a pair of words, respectively, so the two kinds of methods represent the two extremes of optimal structure search. In this thesis, we adopt an intermediate structure for parser modeling: the dependency layer. Inside a layer, the dependency graphs are searched exhaustively, while between layers the parser state is transferred deterministically. Taking the dependency layer as the parsing unit, the proposed parser has a lower computational complexity than graph-based models and alleviates the error propagation that transition-based models suffer from. Furthermore, the parser adopts sequence labeling models to find the optimal graph of a layer, which demonstrates that sequence labeling techniques are also competent for hierarchical structure analysis. The layer-based framework, the neighboring-relation analysis mechanism, and CRF-based labeling give the proposed approach desirable accuracy and, especially, a fast parsing speed, which will be quite helpful for large-scale corpus analysis.
(2) A two-pass dependency parsing approach for long Chinese sentences is presented. Sentence segmentation is one of the effective ways to handle the problems of long sentence parsing. Traditional approaches use classified punctuation marks as dividers, but the poor accuracy of punctuation classification holds back improvement in the final parsing performance. In the propose...
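The first step of the two-pass approach in contribution (2), splitting a long sentence at every comma, semicolon, and colon without any punctuation classifier, can be illustrated with a minimal sketch. The function name `split_at_punctuation`, the punctuation set, and the choice to keep each punctuation mark at the end of its segment are assumptions made for illustration; the thesis record does not specify them, and the second parsing pass that corrects segmentation errors is not shown.

```python
# Hypothetical sketch: the names and the segment-boundary convention below
# are assumptions, not details taken from the thesis.

# Full-width and half-width commas, semicolons, and colons.
SPLIT_PUNCT = {"，", "；", "：", ",", ";", ":"}


def split_at_punctuation(tokens):
    """Split a tokenized sentence into sub-sentences.

    Every comma, semicolon, or colon closes the current segment and is
    kept as that segment's last token (an assumed convention).
    """
    segments = []
    current = []
    for tok in tokens:
        current.append(tok)
        if tok in SPLIT_PUNCT:
            segments.append(current)
            current = []
    if current:  # trailing tokens after the last split point
        segments.append(current)
    return segments


if __name__ == "__main__":
    tokens = ["他", "来", "了", "，", "我", "走", "了", "。"]
    for segment in split_at_punctuation(tokens):
        print(" ".join(segment))
```

In a two-pass setting, each segment would presumably be parsed on its own first, with a second pass relating the segments to one another; how the thesis carries out that second pass is not stated in this record.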
Keywords	Dependency Parsing; Efficient; Chinese Long Sentence Segmentation; Maximal-length Phrase Identification
Language	Chinese
Document type	Thesis
Item identifier	http://ir.ia.ac.cn/handle/173211/6248
Collection	Graduates_Doctoral Dissertations
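Recommended citation (GB/T 7714)	鉴萍. Research on Methods of Dependency Parsing and System Implementation [D]. Institute of Automation, Chinese Academy of Sciences; Graduate University of Chinese Academy of Sciences, 2010.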

Files in this item
File name/size	Document type	Version type	Access type	License
CASIA_20061801462804 (1073KB)			Restricted access	CC BY-NC-SA