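汉语句法分析方法研究 (Research on Syntactic Parsing Methods for Chinese)
Alternative title: Approaches to Syntactic Parsing of Chinese
Author: Li Xing (李幸)
Degree type: Master of Engineering
Advisor: Zong Chengqing (宗成庆)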
2005-05-01
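Degree-granting institution: Graduate School of the Chinese Academy of Sciences (中国科学院研究生院)
Degree-granting location: Institute of Automation, Chinese Academy of Sciences (中国科学院自动化研究所)
Degree discipline: Pattern Recognition and Intelligent Systems
Keywords: syntactic parsing (句法分析); Natural Language Processing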
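Abstract: Syntactic parsing is one of the key problems in natural language processing. Its main task is to automatically identify the syntactic structure of a sentence, that is, the syntactic units the sentence contains and the relations among them. Solving the parsing problem is of great importance to natural language processing systems such as machine translation, natural language understanding, information extraction, and automatic summarization. In statistical parsing, the two most critical issues are the design of the parsing algorithm and of the disambiguation model, which together determine the efficiency and accuracy of a parsing system. This thesis addresses both issues and implements an efficient Chinese syntactic parser. The main research work is as follows:
1. Parsing algorithm. Traditional parsing algorithms are analyzed and compared in terms of processing strategy and time and space complexity. On this basis, the thesis studies in detail an improved variant of the Chart algorithm, the "role inverse algorithm" (角色反演算法), and proposes two further improvements to it. First, the construction of the static data tables used by the algorithm is improved, so that the original input tags the algorithm can handle extend from the smallest syntactic unit, the word, to higher-level syntactic units, namely phrases and sentences. This raises the processing capability and efficiency of the algorithm at the cost of a small amount of extra space. Second, the probabilities of grammar rules are used to sort the static tables, which benefits the search and pruning steps of subsequent analysis.
2. Long sentences. To address the difficulty of parsing complex long sentences, the thesis analyzes the role and regularities of punctuation in the composition of long sentences and proposes a hierarchical parsing method. The method divides punctuation marks into two classes, dividing punctuation and ordinary punctuation. A complex long sentence is split at the dividing punctuation into a sequence of sentence units, which are parsed independently in a first-level analysis. The first-level results then serve as the input to a second-level analysis, whose final output is a complete parse tree. In addition, extracting grammar rules that contain both classes of punctuation helps, to some extent, to resolve syntactic structural ambiguity. Experiments show that the method greatly reduces the time complexity of parsing long sentences and improves both accuracy and recall by 7% over the traditional single-pass search method.
3. Disambiguation model. Building on the traditional probabilistic context-free grammar (PCFG) model, the thesis proposes a PCFG model that includes information about the internal structure of constituents, and then introduces head word information to obtain a lexicalized PCFG model containing both internal structure information and head word information. The thesis also proposes a method for determining head words from the internal constituent structure tags, which is more accurate and more intuitive than traditional head-finding methods.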
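The two-level analysis described in contribution 2 can be illustrated with a minimal sketch. It is not the thesis implementation: the set of dividing punctuation marks, the function names `split_into_units` and `parse_hierarchically`, and the `parse_unit` / `parse_top` callbacks are all assumptions made for illustration, since the abstract does not say which marks count as dividing punctuation or how each level is parsed.

```python
# Assumption: these marks are only an illustrative guess at "dividing" punctuation;
# the abstract does not list the actual classes.
DIVIDING_PUNCTUATION = {",", ";", ":"}


def split_into_units(tokens):
    """Split a token list into sentence units at dividing punctuation.

    Returns a list of (unit_tokens, delimiter) pairs. The delimiter is the
    dividing mark that closed the unit, or None for a trailing unit.
    Ordinary punctuation stays inside its unit.
    """
    units = []
    current = []
    for token in tokens:
        if token in DIVIDING_PUNCTUATION:
            if current:  # ignore empty units between adjacent delimiters
                units.append((current, token))
                current = []
        else:
            current.append(token)
    if current:
        units.append((current, None))
    return units


def parse_hierarchically(tokens, parse_unit, parse_top):
    """First level: parse each unit independently. Second level: combine.

    parse_unit and parse_top are hypothetical callbacks standing in for
    whatever parser is used at each level.
    """
    first_level = [(parse_unit(unit), delim)
                   for unit, delim in split_into_units(tokens)]
    return parse_top(first_level)
```

Because each unit is parsed on its own, the first level works on short spans only; the second level then sees one item per unit rather than one item per word, which is where the reduction in search effort for long sentences would come from.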
Other abstract: The main contributions are summarized as follows:
1. Parsing algorithm. Most traditional parsing algorithms are analyzed and compared, mainly in terms of processing strategy, time consumption, and space consumption. A "role inverse algorithm", an improved version of the Chart parsing algorithm, is studied in detail, and this thesis proposes two improvements to it. First, the static rule tables are extended so that the original input of the algorithm can range from words to phrases and sentences, which improves the processing ability and efficiency of the algorithm. Second, the probabilities of grammar rules are used to sort the rule tables, which aids later pruning.
2. Long sentences. To address the difficulty of parsing long Chinese sentences, the usage and function of Chinese punctuation in syntactic parsing are studied, and a hierarchical parsing approach is proposed. It differs from most previous approaches in two main respects. First, Chinese punctuation marks are classified as "dividing" punctuation and "ordinary" punctuation. Long complex sentences that contain dividing punctuation are broken into suitable units, so that parsing is carried out in two stages. Most previous approaches use a single-level strategy that must find the boundaries of sub-sentences and the syntactic structure of sub-sentences or phrases simultaneously; this divide-and-conquer strategy greatly reduces that difficulty. Second, a system of grammar rules that includes all punctuation marks is built and used in parsing and disambiguating sentences. Experiments show that, when parsing long complex sentences, the approach significantly reduces the time consumption and the number of ambiguous edges of traditional methods, and improves accuracy and recall by 7% over the traditional method.
3. Disambiguation model. Starting from the classical probabilistic context-free grammar (PCFG) model, inner structure information is incorporated to form a new model. Head word information is then introduced into this model, yielding a lexicalized PCFG model that includes both inner structure information and head word information. Finally, this thesis proposes a new method for finding head words, which is simple and considerably more accurate than most other methods.
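As background for contribution 3, the sketch below shows how a plain PCFG scores a parse tree: the tree probability is the product of the probabilities of the rules used in it. This covers only the classical baseline, not the inner-structure or head-word extensions proposed in the thesis. The toy grammar, its probabilities, and the names `RULE_PROBS`, `label_of`, and `tree_log_prob` are invented for illustration.

```python
import math

# Hypothetical toy grammar: (parent, children) -> rule probability.
RULE_PROBS = {
    ("S", ("NP", "VP")): 1.0,
    ("NP", ("我",)): 0.6,
    ("NP", ("书",)): 0.4,
    ("VP", ("V", "NP")): 1.0,
    ("V", ("读",)): 1.0,
}


def label_of(node):
    """A leaf is a plain string; an inner node is (label, child, ...)."""
    return node if isinstance(node, str) else node[0]


def tree_log_prob(tree, rule_probs):
    """Sum of log rule probabilities over every inner node of the tree."""
    if isinstance(tree, str):  # a word contributes no rule of its own
        return 0.0
    label, *children = tree
    rhs = tuple(label_of(child) for child in children)
    log_prob = math.log(rule_probs[(label, rhs)])  # KeyError if rule unknown
    return log_prob + sum(tree_log_prob(child, rule_probs) for child in children)


# "我 读 书" (I read books) under the toy grammar above.
example_tree = ("S", ("NP", "我"), ("VP", ("V", "读"), ("NP", "书")))
```

Disambiguation then amounts to choosing, among the candidate trees for a sentence, the one with the highest score, for example `max(candidates, key=lambda t: tree_log_prob(t, RULE_PROBS))`.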
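Call number: XWLW883
Other identifier: 200228014603551
Language: Chinese
Document type: Thesis
Item identifier: http://ir.ia.ac.cn/handle/173211/6884
Collection: Graduates_Master's Theses
Recommended citation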
GB/T 7714
李幸. 汉语句法分析方法研究[D]. 中国科学院自动化研究所. 中国科学院研究生院,2005.