English Abstract | In natural language processing (NLP), chunking is the task of extracting non-overlapping segments from a stream of text. Base noun phrase (base NP) chunking focuses on recognizing simple, non-recursive noun phrases that contain no other noun phrase as a descendant. Base NP chunking is considered both significant and challenging in NLP, with theoretical merit and practical application value: many tasks in information retrieval, information extraction, and question analysis can be performed adequately by identifying noun phrases, verb phrases, and similar units, together with the relationships between them. Building on previous work on English and Chinese base NP corpora, we first used available standard tools, together with manual adjustments based on handcrafted rules, to convert the Penn Chinese Treebank 5.0 into the form required for base NP chunking. We also constructed a large English base NP chunking corpus of about 3,000,000 tokens, extracted from the parsed corpus of the Linguistic Data Consortium (LDC) at the University of Pennsylvania. With these data available for training and testing, we chose Support Vector Machines (SVM) and Conditional Random Fields (CRF) for the chunking task. The remainder of the thesis focuses chiefly on classifier combination approaches to base NP chunking. We discuss the following three sub-topics. First, we propose a hybrid approach to chunking Chinese base NPs that combines SVM and CRF models. To compare the results of the two chunkers, we use a discriminative post-processing method whose criterion is the conditional probability generated by the CRF chunker. Given the special structure of Chinese base NPs and a thorough analysis of the results, we also design handcrafted grammar rules to resolve ambiguities and prune errors. In our experiments, this method achieved higher accuracy in the final results.
Second, to overcome some shortcomings of the method described above, we propose an error-driven combination approach to chunking Chinese base NPs that combines Transformation-Based Learning (TBL) and CRF models. To analyze the results of the two classifiers and improve the performance of the base NP chunkers, an error-driven SVM classifier was designed to learn the errors revealed by comparing the outputs of the two classifiers and to correct them. This method achieved higher accuracy in the final results, with an F-measure of 89.72% and an improvement of up to 2.35%. Finally, we put forward some challenging topics for future work on base NP chunking and related shallow parsing methods. In our view, research on sequence labelling problems in natural language processing is significant for the development of the field. |
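The abstract treats base NP chunking as a sequence labelling problem and reports results as an F-measure. The following is a minimal sketch of how such a setup is commonly represented, not the thesis's actual implementation. It assumes a simple B/I/O tag scheme (B = first token of a base NP, I = inside, O = outside), which the abstract does not specify, and the function names `extract_base_nps` and `f_measure` are hypothetical.

```python
from typing import List, Optional, Tuple

Chunk = Tuple[int, int]  # (start, end) token indices, end exclusive


def extract_base_nps(tags: List[str]) -> List[Chunk]:
    """Collect non-overlapping base NP spans from a B/I/O tag sequence.

    Assumption: a stray "I" with no open chunk is treated as a "B".
    """
    chunks: List[Chunk] = []
    start: Optional[int] = None
    for i, tag in enumerate(tags):
        if tag == "B" or (tag == "I" and start is None):
            # A new chunk begins here; close any chunk that is still open.
            if start is not None:
                chunks.append((start, i))
            start = i
        elif tag == "O" and start is not None:
            # An outside tag ends the open chunk.
            chunks.append((start, i))
            start = None
    if start is not None:
        chunks.append((start, len(tags)))
    return chunks


def f_measure(gold: List[Chunk], predicted: List[Chunk]) -> float:
    """Chunk-level F-measure: a chunk counts only if both boundaries match."""
    gold_set, pred_set = set(gold), set(predicted)
    correct = len(gold_set & pred_set)
    if correct == 0:
        return 0.0
    precision = correct / len(pred_set)
    recall = correct / len(gold_set)
    return 2 * precision * recall / (precision + recall)


# Illustrative sentence: "The quick fox saw a dog"
gold_tags = ["B", "I", "I", "O", "B", "I"]
pred_tags = ["B", "I", "I", "O", "O", "B"]

gold_chunks = extract_base_nps(gold_tags)   # [(0, 3), (4, 6)]
pred_chunks = extract_base_nps(pred_tags)   # [(0, 3), (5, 6)]
print(f_measure(gold_chunks, pred_chunks))  # 0.5
```

Scoring whole chunks rather than individual tags means that a chunker which gets one boundary wrong receives no credit for that phrase, which is why chunk-level F-measure is usually stricter than per-token accuracy.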