Research on the Collection and Analysis of a Large-Scale Balanced Corpus and Text Categorization Methods
Alternative Title: Collection and Analysis of Large-Scale Balance-Corpus and Approach to Text Categorization
Chen Keli (陈克利)
Degree Type: Master of Engineering
Supervisor: Zong Chengqing (宗成庆)
2004-06-01
Degree-Granting Institution: Graduate School of the Chinese Academy of Sciences
Place of Degree Conferral: Institute of Automation, Chinese Academy of Sciences
Degree Discipline: Pattern Recognition and Intelligent Systems
Keywords: Balanced Corpus; Corpora; Text Categorization; Feature Extraction
Abstract: Corpora and lexicons are essential resources for natural language processing research. Linguistic research must be grounded in linguistic facts, and the complexity of linguistic phenomena means that a comprehensive understanding of their characteristics requires the support of large-scale corpora; without them, such research is water without a source, a tree without roots. With the application of statistical models in natural language processing, the role of large-scale corpora has become even more prominent: they are of great importance both for the study of linguistic phenomena themselves and for the research and development of applications such as information retrieval, machine translation, text categorization, and automatic word segmentation. Likewise, lexicon construction is not only foundational work for natural language processing but also an important part of dictionary compilation and language teaching. Therefore, the collection and analysis of a large-scale balanced corpus undertaken in this thesis, and the research on text categorization built upon it, have significant theoretical and practical value.

Supported by the European Union project LC-STAR, we first carried out the collection and analysis of a large-scale balanced Chinese corpus. The main goal was to build an annotated Chinese corpus that reflects the characteristics of modern Chinese and is suitable for Chinese language analysis, speech recognition, and speech synthesis, and on this basis to construct a corresponding information lexicon. This work comprised: (1) after investigating and analyzing methods for collecting large-scale balanced Chinese corpora, collecting and annotating a balanced Chinese corpus of 30.87 million characters; (2) based on the collected corpus, building a large-scale modern Chinese information lexicon (over 100,000 words), whose entries are annotated with part of speech, pronunciation, word frequency, and domain information for specialized words.

Building on this work, we conducted in-depth research on text categorization methods. The main contributions are:
First, in feature weighting, after analyzing and comparing common feature weighting algorithms, we proposed replacing TF with the n-th power of TF in the TF*IDF algorithm and introducing a DBV variable, which improved the algorithm's F1-measure by 4-5%.
Second, applying the same treatment to the TF*IWF algorithm, namely replacing TF with its n-th power and introducing the DBV variable, improved that algorithm's F1-measure by 12.28%.
Third, in feature extraction, we comprehensively compared common feature extraction methods on a Rocchio classifier and then proposed using the TF*IDF algorithm for feature extraction; experiments showed that this method outperforms other common feature extraction algorithms across different numbers of keywords.
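The modified weighting scheme described above can be sketched as follows. This is an illustrative reconstruction, not the thesis's exact formula: the abstract does not define the DBV term or fix the exponent n, so the value of n below is an assumption and the DBV factor is omitted.

```python
import math
from collections import Counter

def modified_tfidf(docs, n=0.5):
    """Modified TF*IDF weighting: raw TF is replaced by TF**n.

    docs: list of tokenized documents (lists of terms).
    n:    exponent applied to term frequency (illustrative value;
          the thesis's choice of n is not given in the abstract).
    Note: the DBV term from the thesis is omitted here, as the
    abstract does not define it.
    """
    num_docs = len(docs)
    df = Counter()                      # document frequency per term
    for doc in docs:
        df.update(set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)
        weights.append({t: (tf[t] ** n) * math.log(num_docs / df[t])
                        for t in tf})
    return weights
```

Dampening TF with an exponent below 1 reduces the dominance of very frequent terms within a document, which is the intuition behind the replacement described in the abstract.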
Other Abstract: Corpora and lexicons are important linguistic resources for natural language processing. Linguistic research should be based on linguistic facts, and large-scale corpora are necessary for probing into linguistic phenomena, which are complex. Especially with the wide application of statistical models in natural language processing, large-scale corpora are playing an increasingly important role: they matter not only for the study of linguistic phenomena but also for systems for information retrieval, machine translation, text classification, automatic POS tagging, and so on. In addition, the Chinese lexicon is a foundation of natural language processing, as well as a necessary part of dictionary compilation and language teaching. Therefore, the collection and analysis of a large-scale balanced corpus in our work, on which our text classification research is based, has both theoretical significance and practical value. Our work, supported by the European Union project LC-STAR, includes the collection and analysis of a large-scale balanced corpus and aims to build a tagged Chinese corpus and an information lexicon for speech recognition and speech synthesis. The main work can be summarized as follows: (1) after investigating and analyzing strategies for building a large-scale balanced Chinese corpus, we collected and tagged a Chinese corpus of 30.87 million Chinese characters; (2) based on the collected corpus, we created a Chinese information lexicon of 103,192 words, with each entry annotated with POS tag, pronunciation, word frequency, and domain information for specialized words.
We have carried out research on text classification based on all of the above. Our contributions can be summarized as follows: (1) for feature weighting, we analyzed the advantages and disadvantages of common feature weighting algorithms and introduced two improvements into TF*IDF, one of the most common feature weighting algorithms: replacing TF with its n-th power and introducing a DBV variable into the expression. The F1-measure of the classifier improved by 4-5%, demonstrating the effectiveness of the improvements. (2) Similarly, we introduced the same two improvements into the TF*IWF feature weighting algorithm, resulting in a 12.28% improvement in F1-measure. (3) For feature extraction, we compared several common feature extraction algorithms and proposed using the TF*IDF algorithm for feature extraction. Our subsequent experiments showed this algorithm to be more effective than the others.
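The Rocchio classifier mentioned above, used as the testbed for comparing feature extraction methods, can be sketched as a nearest-centroid classifier. This is a minimal illustrative formulation assuming cosine similarity to per-class centroids; the thesis's exact variant (e.g. any weighting of negative examples) is not specified in the abstract.

```python
import numpy as np

def rocchio_train(X, y):
    """Build one centroid (class prototype) per class.

    X: 2-D array of document feature vectors (e.g. TF*IDF weights).
    y: class label per row of X.
    """
    y = np.asarray(y)
    return {c: X[y == c].mean(axis=0) for c in np.unique(y)}

def rocchio_predict(x, centroids):
    """Assign x to the class whose centroid is most cosine-similar."""
    def cos(a, b):
        return float(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12)
    return max(centroids, key=lambda c: cos(x, centroids[c]))
```

Because training reduces to one mean vector per class, Rocchio is cheap to retrain, which makes it a convenient fixed baseline when the variable under study is the feature extraction method rather than the classifier itself.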
Call Number: XWLW761
Other Identifier: 761
Language: Chinese
Document Type: Thesis
Identifier: http://ir.ia.ac.cn/handle/173211/6752
Collection: Graduates / Master's Theses
Recommended Citation (GB/T 7714):
Chen Keli. Research on the Collection and Analysis of a Large-Scale Balanced Corpus and Text Categorization Methods [D]. Institute of Automation, Chinese Academy of Sciences. Graduate School of the Chinese Academy of Sciences, 2004.

Unless otherwise stated, all content in this system is protected by copyright, with all rights reserved.