基于互信息的代价缺失学习在不平衡数据中的研究
Alternative Title: Cost-Free Learning in the Class Imbalance Problem based on Mutual Information
张晓晚 (Zhang Xiaowan)
Subtype: Doctor of Engineering
Thesis Advisor: 胡包钢 (Hu Bao-Gang)
2014-05-20
Degree Grantor: University of Chinese Academy of Sciences
Place of Conferral: Institute of Automation, Chinese Academy of Sciences
Degree Discipline: Pattern Recognition and Intelligent Systems
Keyword: Class Imbalance; Cost-free Learning; Cost-sensitive Learning; Mutual Information; Abstaining; Graphical Evaluation Method
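Abstract: Learning methods for imbalanced data can be divided, according to whether they require cost information in their computation, into cost-sensitive learning and methods that need no costs. This thesis defines all learning that does not require costs in its computation as cost-free learning. If costs are unknown, cost-sensitive learning is inapplicable, and cost-free learning, such as sampling and some criterion-based methods, can be used instead. However, neither existing cost-sensitive learning nor existing cost-free learning can handle abstaining classification when information about errors and rejects is unknown. Therefore, building on information theory, this thesis proposes a new cost-free learning strategy that maximizes the normalized mutual information between the true classes and the predicted classes of the data. The method automatically balances the different types of errors and rejects, and it handles binary and multi-class classification, both with and without abstaining. The main contributions of this thesis are as follows:

1. For the case of unknown costs in imbalanced-data learning, and considering both the limitations of cost-sensitive learning and the shortcomings of existing imbalanced learning methods on abstaining classification, this thesis proposes a general cost-free learning strategy. Hu had already proposed the mutual information classifier and verified through numerical experiments that mutual information automatically protects minority-class samples on imbalanced data, but that method was not applied to real data. This thesis learns from real imbalanced data. Exploiting the ability of normalized mutual information to automatically distinguish error types and reject types, it uses normalized mutual information as the learning target and applies conventional classifiers to imbalanced data simply and directly. With costs unknown, for binary and multi-class problems with or without abstaining, the strategy automatically balances misclassifications and rejects and yields reasonable classification results. The proposed strategy therefore compensates for the shortcomings of existing imbalanced learning methods and effectively resolves the problems that conventional classification methods encounter on imbalanced data.

2. In cost-sensitive learning, unknown costs are a common and well-recognized difficulty. Many methods therefore try to avoid explicit costs or attempt to learn them. Once a reject option is introduced, the problem becomes more complex, because no existing learning method can learn reasonable reject information. With the proposed cost-free learning strategy, optimizing the learning target yields two kinds of meaningful optimal parameters. For binary and multi-class classification with abstaining, the optimal reject thresholds are obtained automatically. For binary classification without abstaining, this thesis establishes a connection with Elkan's classic work on costs and decision thresholds and derives "equivalent" misclassification costs. For binary classification with abstaining, Hu has pointed out that cost-sensitive learning suffers from parameter redundancy; this thesis uses the "equivalent" misclassification costs obtained in the non-abstaining case as prior knowledge to resolve the redundancy and obtain "equivalent" reject costs. The reject thresholds and "equivalent" costs are not specified by hand. They depend entirely on the data distribution and the base classifier, and are therefore objective. These "equivalent" costs can also serve as an objective reference for setting the subjective costs in cost-sensitive learning, which links cost-free learning to cost-sensitive learning.

3. Graphical evaluation methods allow classifier performance to be analyzed vividly and intuitively. This thesis gives, for the first time, a geometric interpretation of abstaining classification in ROC space, and further explores how rejects relate across PR space and cost space. The graphical evaluation curves also show that classification without abstaining and Chow's reject rule are special cases of general abstaining classification. This work provides a clear graphical interpretation for classification performance analysis and makes it convenient for users to adjust parameters interactively with graphical methods.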
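To make the learning target concrete, here is a minimal Python sketch of normalized mutual information computed from a confusion matrix. It assumes the normalization NI = I(T;Y)/H(T), where T is the true class and Y is the decision output. The abstract does not state the normalization, so that choice, the function name, and the matrix layout (rows are true classes, columns are decisions, with an optional extra column for rejects) are assumptions for illustration and not the thesis's implementation.

```python
import numpy as np


def normalized_mutual_information(confusion):
    """Return I(T;Y) / H(T) for a confusion matrix of counts.

    Assumed layout: rows are true classes, columns are decisions.
    A trailing extra column may hold rejected samples.
    """
    counts = np.asarray(confusion, dtype=float)
    joint = counts / counts.sum()      # joint distribution p(t, y)
    p_t = joint.sum(axis=1)            # marginal over true classes
    p_y = joint.sum(axis=0)            # marginal over decisions
    outer = np.outer(p_t, p_y)

    nonzero = joint > 0                # skip empty cells (0 * log 0 = 0)
    mutual_info = np.sum(joint[nonzero] * np.log2(joint[nonzero] / outer[nonzero]))

    p_t_pos = p_t[p_t > 0]
    entropy_t = -np.sum(p_t_pos * np.log2(p_t_pos))
    return float(mutual_info / entropy_t) if entropy_t > 0 else 0.0


# Hypothetical 95:5 imbalanced data set.
all_majority = [[95, 0], [5, 0]]         # every sample labeled as the majority class
with_rejects = [[90, 2, 3], [1, 3, 1]]   # third column counts rejected samples

print(normalized_mutual_information(all_majority))
print(normalized_mutual_information(with_rejects))
```

Under this definition, the classifier that assigns every sample to the majority class reaches 95% accuracy but an NI of zero, because its output carries no information about the true class. This illustrates why maximizing NI does not reward ignoring the minority class.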
Other Abstract: Approaches to learning in the class imbalance problem fall into two categories: cost-sensitive learning (CSL) and learning that does not require any cost information. In this thesis, we define cost-free learning (CFL) as any learning approach that seeks optimal classification results without requiring cost information. If the costs are unknown, CSL cannot be applied, whereas CFL approaches such as sampling and some criteria-based methods can. However, to the best of our knowledge, no existing CSL or CFL approach can properly handle abstaining classification when no information is given about errors and rejects. Based on information theory, we propose a novel CFL strategy that maximizes the normalized mutual information between the targets and the decision outputs of classifiers. As the degree of class imbalance changes, the proposed strategy balances the errors and rejects accordingly and automatically. With this strategy, we can handle binary and multi-class classification with or without abstaining. The main contributions of this thesis are as follows:

1. For the situation in which costs are unknown in the class imbalance problem, this thesis presents a general CFL strategy, motivated by the limitations of CSL and the deficiencies of existing learning methods in dealing with abstaining classification. Hu[1] proposed the mutual information classifier and verified its effectiveness on numerical data. In this thesis, we focus on learning from real data sets. Using normalized mutual information (NI) as the learning target, we apply conventional classifiers directly, which keeps the implementation simple. The proposed CFL strategy balances the errors and rejects accordingly and automatically, and it handles binary and multi-class classification with or without abstaining. Therefore, the proposed strategy not only compensates for the limitations of existing imbalanced learning methods but also solves the problems caused by applying conventional classification algorithms to class imbalance learning.

2. Unknown costs are not unusual in real-world applications, and they are regarded as a challenge in the field of CSL. Various methods intend to avoid specifying cost values, or attempt to learn the costs. However, the problem becomes more complex when a reject option is involved, since existing learning methods cannot get p...
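The full Chinese abstract links the learned decision threshold to Elkan's work on costs and decision thresholds. The relation itself can be sketched in a few lines of Python. The sketch assumes a binary problem, zero cost for correct decisions, and a classifier that outputs a calibrated posterior probability of the positive class. The function names are hypothetical, and the thesis may derive its "equivalent" costs differently.

```python
def elkan_threshold(cost_fp, cost_fn):
    """Posterior threshold at or above which predicting positive minimizes expected cost.

    Assumes correct decisions cost nothing; cost_fp and cost_fn are the
    costs of a false positive and a false negative.
    """
    return cost_fp / (cost_fp + cost_fn)


def equivalent_cost_ratio(threshold):
    """Ratio cost_fn / cost_fp implied by a decision threshold in (0, 1)."""
    if not 0.0 < threshold < 1.0:
        raise ValueError("threshold must lie strictly between 0 and 1")
    return (1.0 - threshold) / threshold


# A false negative assumed four times as costly as a false positive.
print(elkan_threshold(1.0, 4.0))

# Conversely, a threshold chosen without any costs implies a cost ratio.
print(equivalent_cost_ratio(0.2))
```

Read in the second direction, a threshold chosen by some cost-free criterion implies a cost ratio that would have produced the same decision rule, which is one way to interpret costs that come from the data and not from the user.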
Shelf Number: XWLW1988
Other Identifier: 201018014628073
Language: Chinese
Document Type: Dissertation
Identifier: http://ir.ia.ac.cn/handle/173211/6582
Collection: Graduates_Doctoral Dissertations
Recommended Citation
GB/T 7714
张晓晚. 基于互信息的代价缺失学习在不平衡数据中的研究[D]. 中国科学院自动化研究所. 中国科学院大学,2014.
Files in This Item:
File Name/Size: CASIA_20101801462807 (3357KB)
Access: Not yet open
License: CC BY-NC-SA
Items in the repository are protected by copyright, with all rights reserved, unless otherwise indicated.