Data Description Based on Reduct Theory (基于reduct理论的数据描述)

	基于reduct理论的数据描述
Alternative title	Data Description Based on Reduct Theory
	赵岷 (Zhao Min)
	2004-05-01
Degree type	Doctor of Engineering
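Chinese abstract (translated)	Data description is a special class of data mining task: according to user requirements, an information system (data set) defined over a symbolic domain is reduced to human-readable texts of varying conciseness, and the exceptions produced during the reduction are analyzed. This task is consistent with the "rule-plus-exception" principle of cognitive psychology and has three key points: (1) obtaining solutions according to user requirements; (2) texts of varying conciseness; (3) exception analysis. This thesis uses the reduct theory of rough set theory as its tool to formalize the data description task and to solve the associated computational problems. Describing rules and exceptions directly with the positive region and the boundary region does not match human cognition; to characterize the "rule-plus-exception" model accurately, this thesis modifies them into the cognitive positive region and the cognitive boundary region. Because the positive region is the foundation of reduct theory and is unique for a given information system, whereas the cognitive positive region does not satisfy the uniqueness condition, we redefine and re-prove all concepts and properties that were originally defined on the basis of the positive region. Users usually want the description to be as concise as possible under a given requirement, so we define text granules using reducts based on the cognitive positive region and take them as concise descriptions of the data set. Traditional rough set research generally pays little attention to the structure of the boundary region, yet "exceptions" are closely tied to it. We therefore study the structure and properties of the boundary region in detail, in order to understand the structure of the "exception" space and to lay the foundation for exception analysis. To identify exceptions effectively, we design a special discernibility matrix for analyzing the structure of the boundary region and the process by which exceptions form, and we propose an exception identification method based on core attributes. Core and reduct are the two basic concepts of reduct theory. Core has an important property: if an attribute is a core attribute, deleting it from the information system necessarily changes the boundary region. This property is the basis for computing exceptions. In addition, there is a special relation between reduct and core: in the new information system formed from a reduct of a given information system, every attribute is a core attribute. This suggests that, if we can compute a reduct of the information system, then deleting attributes from this reduct step by step generates texts of varying conciseness together with the derived exceptions. For information systems built from large-scale data, the prerequisite for using this method is a fast algorithm that computes reducts according to requirements. By analyzing earlier algorithms, this thesis finds that the intermediate representation of the reduct space is the key factor in algorithm efficiency, and accordingly proposes a tree-representation algorithm for computing reducts that is linear in the number of samples; under this representation, the other concepts of reduct theory can be computed equally effectively. We prove that this algorithm is complete for reducts and equivalent to the attribute-order-based reduct algorithm. The main results of this thesis are: 1. A fast method, based on the tree representation, for computing basic concepts such as reduct and core, whose complexity is linear in the number of samples. 2. Proposing the "cognitive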
English abstract	Data description is a special kind of data mining task: given a user's requirements, it transforms an information system (data set) defined on a symbolic domain into human-readable texts of varying conciseness and, at the same time, analyzes the exceptions produced during the transformation. This task conforms to the "rule-plus-exception" principle in cognitive psychology. It has three points: (1) finding solutions according to the user's requirements, (2) obtaining texts of varying conciseness, and (3) analyzing exceptions. We employ the reduct theory of rough set theory as the tool to formalize the problems of data description and to design the corresponding algorithms. Using the notions of "positive region" and "boundary region" directly to represent rules and exceptions is inadequate. We modify the two notions into "cognitive positive region" and "cognitive boundary region", respectively, so as to depict the rule-plus-exception model accurately. Since the positive region serves as the basis of reduct theory and is unique for a given information system, while the cognitive positive region does not satisfy the uniqueness condition, we redefine and re-prove all notions and properties originally defined on the basis of the positive region. Users usually hope to obtain a concise description with respect to the given requirement, so we define the text granule by the notion of "reduct" based on the cognitive positive region, as the concise description of the data set. Traditionally, research on rough set theory has paid little attention to the structure of the boundary region. However, "exception" relates closely to the boundary region. Hence, we investigate the structure and properties of the boundary region in order to gain insight into the structure of the exception space and to ground exception analysis. To identify exceptions effectively, we design a special discernibility matrix to analyze the structure of the boundary region and the process that produces exceptions, and we present an approach to identifying exceptions based on "core". "Core" and "reduct" are two basic notions in reduct theory. Core has an important property: if a core attribute is removed from the given information system, then the boundary region of the information system changes. This property is the basis of computing exceptions. In addition, there is a critical relation between reduct and core: if a new information system is constructed from a reduct of a given information system, then all attributes in the new information system belong to its core. This implies that, if a reduct is computed from the given information system, then removing attributes from the reduct step by step produces texts of varying conciseness and the corresponding exceptions. The precondition for employing the above approach to describe large-scale information systems is to find fast algorithms for comput
Keywords	Data Description; Rule-plus-exception Model; Reduct; Exception Analysis; Granular Computing
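Language	Chinese
Document type	Dissertation
Item identifier	http://ir.ia.ac.cn/handle/173211/5800
Collection	Graduates_Doctoral Dissertations
Recommended citation (GB/T 7714)	Zhao Min. Data Description Based on Reduct Theory [D]. Institute of Automation, Chinese Academy of Sciences. Graduate School of the Chinese Academy of Sciences, 2004.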
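
The notions the abstract relies on (positive region, boundary region, core, reduct) can be illustrated on a toy decision table. The sketch below is only an illustration under stated assumptions: it uses the classical rough-set definitions, not the thesis's cognitive positive region, and it enumerates attribute subsets by brute force, not with the thesis's tree-representation algorithm. The table `ROWS`, the attribute names `a`, `b`, `c`, and all function names are hypothetical.

```python
from itertools import combinations

# Hypothetical toy decision table: (condition attribute values, decision).
ATTRS = ("a", "b", "c")
ROWS = [
    ({"a": 0, "b": 0, "c": 0}, "no"),
    ({"a": 0, "b": 1, "c": 0}, "yes"),
    ({"a": 1, "b": 0, "c": 1}, "yes"),
    ({"a": 1, "b": 1, "c": 1}, "yes"),
]


def partition(rows, attrs):
    """Group row indices into indiscernibility classes with respect to attrs."""
    blocks = {}
    for i, (cond, _) in enumerate(rows):
        key = tuple(cond[a] for a in attrs)
        blocks.setdefault(key, []).append(i)
    return list(blocks.values())


def positive_region(rows, attrs):
    """Rows whose indiscernibility class carries a single decision value."""
    pos = set()
    for block in partition(rows, attrs):
        if len({rows[i][1] for i in block}) == 1:
            pos.update(block)
    return pos


def boundary_region(rows, attrs):
    """Rows outside the positive region (candidate exceptions)."""
    return set(range(len(rows))) - positive_region(rows, attrs)


def core(rows, attrs):
    """Attributes whose removal changes the positive region."""
    full = positive_region(rows, attrs)
    return {
        a for a in attrs
        if positive_region(rows, [x for x in attrs if x != a]) != full
    }


def reducts(rows, attrs):
    """Brute force: all minimal attribute subsets preserving the positive region."""
    full = positive_region(rows, attrs)
    found = []
    for k in range(len(attrs) + 1):  # smaller subsets first
        for subset in combinations(attrs, k):
            candidate = set(subset)
            if any(r <= candidate for r in found):
                continue  # contains a smaller reduct, so not minimal
            if positive_region(rows, subset) == full:
                found.append(candidate)
    return found


if __name__ == "__main__":
    print("core:", core(ROWS, ATTRS))
    print("reducts:", reducts(ROWS, ATTRS))
    print("core within reduct (a, b):", core(ROWS, ("a", "b")))
    print("exceptions using only b:", boundary_region(ROWS, ("b",)))
```

On this toy table the core is `{b}` and the reducts are `{a, b}` and `{b, c}`. Restricting the table to the reduct `{a, b}` makes both attributes core, and dropping `a` from that reduct leaves a shorter description on `b` alone, with rows 0 and 2 falling into the boundary region as exceptions. This mirrors, in the classical setting, the stepwise-removal idea the abstract describes.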