A Text Mining System for Chinese Journal Articles in the Automation Discipline

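Title	A Text Mining System for Chinese Journal Articles in the Automation Discipline
Alternative title	The Text Mining System Subjected to Automatic Journal Articles in Chinese
Author	Liu Yu (刘禹)
Date	2012-05-24
Degree type	Master of Engineering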
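Abstract (translated from Chinese)	The project “Research on Innovative Ideas and Methods in the Automation Discipline” systematically analyzes the factors that influence the development of the automation discipline in China, uses the relationships among those factors to build a knowledge system for the discipline, and makes forward-looking predictions about the discipline's direction by summarizing how existing ideas and methods formed and evolved. The project's ultimate goal is to develop, on top of this knowledge system, a web platform for disciplinary knowledge services that serves researchers and engineers in related fields and thereby promotes knowledge innovation. Knowledge elements (including research objects, research methods, research tools, researchers, and research institutions) are the basic building blocks of the knowledge system, so acquiring them is the project's first step. Taking the project's knowledge element acquisition requirement as its research topic, this thesis designs and implements, on the basis of an extensive literature survey and experiments, a text mining system for knowledge element extraction, which has been applied successfully in the project. The main work and contributions are as follows: ① Application of text classification and feature word selection to data cleaning. The thesis implements the vector space model (VSM) for text classification and uses it to separate automation literature from non-automation literature. It also proposes a feature word selection method based on the chi-square goodness-of-fit test (chifit), which achieves good classification results with a relatively low feature dimension. ② A keyword semantic clustering algorithm based on a second computation of edit distance. Many keywords in the project data are similar in form and identical in meaning; the algorithm exploits this property to convert the semantic clustering problem into a morphological clustering problem. ③ A scheme for building knowledge pedigrees, proposed and implemented. For a queried knowledge point, the scheme presents the knowledge points that may be related to it over time by inheritance, development, or evolution, organized by degree of closeness and by time slice, to assist users in literature retrieval and knowledge understanding. ④ A binary splitting algorithm based on distance attributes. This is a divisive hierarchical clustering algorithm whose execution amounts to building a binary tree in level order; the leaf nodes are the final clusters. The algorithm effectively solves the problem of aligning person names with institution names. ⑤ A person name disambiguation algorithm based on graph clustering. Duplicate personal names are very common in Chinese, which makes it hard to count scholars' academic output accurately. The algorithm treats each name as a node in a graph, decides whether to add an edge between two nodes according to the similarity of their attributes, and finally, using graph connectivity, treats each connected component as a cluster referring to the same person entity. ⑥ An unsupervised institution name normalization algorithm. It exploits the relationships among the institution names associated with the same person entity to extract top-level institution names, and requires neither a prepared list of canonical institution names nor complex rules describing institution name structure. Keywords: text mining, text classification, text clustering, named entity recognition and disambiguation, knowledge service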
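Contribution ⑤ can be illustrated with a minimal sketch. It is not the thesis's implementation: the record layout (`name`, `attrs`), the similarity rule (number of shared attributes), and the `min_shared` threshold are all assumptions made for illustration. Only the overall shape follows the abstract: name occurrences are nodes, an edge is added when two nodes are similar enough, and each connected component is taken as one person entity.

```python
from itertools import combinations


def disambiguate(records, min_shared=1):
    """Group name occurrences into person entities via connected components.

    records: list of dicts with a "name" string and an "attrs" set
             (hypothetical layout; e.g. co-authors, institutions, keywords).
    Returns a sorted list of clusters, each a list of record indices.
    """
    parent = list(range(len(records)))  # union-find forest over record indices

    def find(i):
        # Follow parent links to the root, compressing the path as we go.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    def union(i, j):
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[rj] = ri

    for i, j in combinations(range(len(records)), 2):
        a, b = records[i], records[j]
        if a["name"] != b["name"]:
            continue  # only occurrences of the same name string can be merged
        # Assumed similarity rule: add an edge if enough attributes are shared.
        if len(a["attrs"] & b["attrs"]) >= min_shared:
            union(i, j)

    clusters = {}
    for i in range(len(records)):
        clusters.setdefault(find(i), []).append(i)
    return sorted(clusters.values())


records = [
    {"name": "Li Wei", "attrs": {"inst:A", "co:Zhang"}},
    {"name": "Li Wei", "attrs": {"co:Zhang", "kw:control"}},
    {"name": "Li Wei", "attrs": {"inst:B", "kw:vision"}},
    {"name": "Li Wei", "attrs": {"kw:control", "inst:C"}},
    {"name": "Wang Fang", "attrs": {"inst:A"}},
]
print(disambiguate(records))
```

Records 0, 1, and 3 end up in one cluster even though records 0 and 3 share no attribute directly, because connectivity is transitive. That is the property the connected-component formulation relies on, and it is also why the edge rule has to be strict enough to avoid chaining distinct people together.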
Abstract (English)	The project “Research on Innovative Ideology and Methodology in the Automation Discipline” aims to analyze systematically the factors that play important roles in the development of the automation discipline in China. It also explores the relationships among those factors in order to build a knowledge system, with the ultimate goal of developing a web platform that offers knowledge services to potential users. These factors, which include research objects, researchers, institutions, methods, theories, and tools, are so vital to the knowledge system that retrieving them precisely is of great importance. This thesis designs and implements a text mining system focused on information extraction. The main contributions are summarized as follows: ① The application of text categorization and feature word selection techniques to data cleaning. The vector space model approach is implemented to predict the categories of articles. A feature selection method named chifit is proposed, which can achieve higher precision with a lower feature dimension. ② A method that reduces the problem of semantic clustering to morphological similarity computation is proposed for keyword clustering. ③ A novel scheme, the “knowledge pedigree”, is proposed and implemented to help users with literature research and knowledge understanding. ④ A divisive clustering approach is used for person-institution alignment. The method closely resembles constructing a binary tree in level order. ⑤ To evaluate scholars' academic influence precisely, a graph-based clustering approach is presented for person name disambiguation. ⑥ An unsupervised institution name normalization method is proposed that fully exploits the institution data associated with each person entity. Key Words: text mining, text classification, text clustering, named entity recognition and disambiguation, knowledge service
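The idea in contribution ②, treating keywords that look nearly identical as the same concept, can be sketched as follows. This is an illustrative assumption, not the thesis's algorithm: the abstract does not specify how edit distance is applied, so the greedy single-pass grouping, the normalization by the longer string, and the `max_ratio` threshold are all hypothetical choices.

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, 1):
        cur = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, 1):
            cur.append(min(
                prev[j] + 1,               # delete ca
                cur[j - 1] + 1,            # insert cb
                prev[j - 1] + (ca != cb),  # substitute, or match at no cost
            ))
        prev = cur
    return prev[-1]


def group_keywords(keywords, max_ratio=0.25):
    """Greedily group keywords whose normalized edit distance to a group's
    first member is at most max_ratio (an assumed threshold)."""
    groups = []
    for kw in keywords:
        for group in groups:
            rep = group[0]  # the first keyword acts as the group representative
            longest = max(len(kw), len(rep), 1)  # guard against empty strings
            if edit_distance(kw, rep) / longest <= max_ratio:
                group.append(kw)
                break
        else:
            groups.append([kw])  # no existing group was close enough
    return groups


keywords = [
    "neural network",
    "neural networks",
    "fuzzy control",
    "neural-network",
    "fuzzy controller",
]
print(group_keywords(keywords))
```

Because Python strings are sequences of Unicode code points, the same functions work unchanged on Chinese keywords, where a single character difference is a larger share of a short term; the threshold would need tuning accordingly.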
Keywords	text mining; text classification; text clustering; named entity recognition and disambiguation; knowledge service
Language	Chinese
Document type	Thesis
Item identifier	http://ir.ia.ac.cn/handle/173211/7632
Collection	Graduates_Master's Theses
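Recommended citation (GB/T 7714)	Liu Yu. A Text Mining System for Chinese Journal Articles in the Automation Discipline [D]. Institute of Automation, Chinese Academy of Sciences. Graduate University of Chinese Academy of Sciences, 2012.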

Files in this item
File name/size	Document type	Version type	Access type	License
CASIA_2009M801462901 (6060KB)			Not yet open	CC BY-NC-SA