排序学习的理论和方法研究

CASIA OpenIR > 毕业生 > 博士学位论文

	排序学习的理论和方法研究
其他题名	On the theory and algorithm of ranking
	夏粉
	2008-05-28
学位类型	工学博士
中文摘要	随着信息技术的迅速发展和互联网规模的不断扩大，互联网已经成为了全球最大、最广泛使用的信息库，如何有效检索这些海量信息成为当前重要的研究课题，因而信息检索(Information Retrieval, IR) 技术越来越受到人们的重视。在目前绝大多数的信息检索系统中，其检索出来的信息（如文档等）都以排序的方式返回给用户，因此，信息检索模型研究的核心问题也就归结为如何高效地为信息进行排序，实质上是要解决“排序学习”问题。排序学习主要存在两大类问题：序回归问题，序列学习问题。序回归问题旨在预测样本的等级，而序列学习问题旨在为目标对象集合确定一个顺序。本文提出了一种排序学习的学习框架，针对两种排序学习问题分别给出了相应的统计意义下的学习定义，借助统计学习理论工具对排序学习进行了理论分析，并提出了一系列有效算法，主要的工作和贡献如下： 1. 在序回归的学习理论方面，建立了统计意义下的序回归的定义，提出了一种将序回归约简为多分类的框架，指明了序回归与分类之间的关系。基于此框架，提出一种解决序回归的样本加权方法，并证明了加权样本集上的风险是序回归风险的上界。 2. 在序回归的算法设计方面，从三个角度提出了有效的算法： (a) 从序回归的约简框架出发，提出了一种优化加权样本集风险的算法，与传统的多类Logistic Boosting 算法和主流的域值模型算法相比，该算法能够处理加权样本集的优化问题，能够使用样本空间多维信息，在一些数据集上取得了明显优异的效果。 (b) 从样本对之间偏好关系出发，定义了一种新的集合上的序不纯度度量，提出了一种新的决策树算法（排序树），排序树继承了决策树算法的众多优点，同时推广了决策树算法的使用范围。 (c) 从排序所需的特征出发，提出了一种多特征的序回归学习算法（包含两个部分，首先递归地从训练样本中提取特征，然后使用排序树来组合这些特征生成排序规则），该算法克服了主流的域值模型算法通常只用到样本空间的一维特征，无法充分挖掘样本空间的信息的缺点，实验结果表明，该算法在一些数据集上抽取了多个特征，取得了比只用一个特征的阈值模型算法好的效果。 3. 在序列学习的学习理论方面，首次提出了统计意义下的序列学习定义，即输入空间为待排对象集合，输出空间为待排对象所有可能的排列，假设空间为评分函数与排序函数的复合函数，损失函数为定义在预测排列和真实排列上的实值函数。根据这个定义，从损失函数的角度提出了分析学习算法好坏的四个评价标准，即：统计方面的一致性，有限样本方面的稳健性，数学方面的连续、可微和凹凸性，计算方面的有效性，该评价准则为排序学习奠定了坚实的理论基础，为分析现有算法以及提出新的算法提出了理论指导。最后，还提出了损失函数一致性的充分条件。 4. 在序列学习算法设计方面，以序列学习理论为指导，提出了一种新的损失函数，该损失函数具有一致性，良好的稳健性，连续、可微、凸函数，以及线性的计算复杂度。由于损失函数的优良特性，它对应的优化问题不但容易求解，并且其解明显优于其他算法。 5. 在实际的序列学习应用中，我们发现序列顶部的信息要比序列其他位置重要，以及数据集常以偏序形式给出，针对这两个特点，我们为前者提出了一种top-k损失函数以及相应的算法，为后者提出了一种从偏序数据中学习排序规则的迭代算法。在国际序列学习的标准数据集上，验证了所提出算法的优异性能。
英文摘要	Ranking problems can be categorized into two classes, ordinal regression and learning to rank. The former aims to predict samples of ordinal scale. The latter aims to predict a ranked list for a set of items. This paper proposes a framework for the ranking problem, where two ranking problems are defined in terms of machine learning language, respectively. Base on this framework, we investigate the theory of the ranking problem and propose some effective algorithms. Overall, the work and contributions are the following: 1. On the theory level of ordinal regression, we give a formal definition for the ordinal regression and propose a reduction framework which transforms the ordinal regression to the stand multi-classification. The reduction framework not only indicates the relation between the ordinal regression and the classification, but also leads to a new weighting mechanism to solve the ordinal regression. We also prove that the risk of the ordinal regression is bounded by that of the weighted training set. On the algorithm level of ordinal regression, we propose three effective algorithms: (1) an algorithm to optimize the risk of the weighted training set; (2) a new decision tree algorithm, i.e., Ranking Tree; (3) a multi-feature ordinal regression algorithm. 2. On the theory level of learning to rank, we propose a supervised learning definition for it. On this definition, we investigate statistical properties of the surrogate loss functions in terms of consistency and soundness, which provides a theoretical foundation for analyzing the strength and weakness of existing approaches to learning to rank, and for devising more advanced algorithms. On the algorithm level of learning to rank, we propose a novel surrogate loss function for learning to rank, which is convex, sound, statistically consistent, and computationally efficient. The corresponding optimization problem is not only easy to solve, but also has a better solution than others in terms of ranking accuracy. 3. In the practical application of learning to rank, we find that items on the top of a ranked list are more important than those on other positions and the data set is usually with partial order relation. Therefore, we propose a new top-k loss function and develop an iterative optimization algorithm which can not only learn the ranking function but also mine the "true" ranked lists from the partial constraints. Experiments on a benchmark data set validate the usefulness of the proposed algorithm.
关键词	机器学习排序学习代价敏感学习一致性 Machine Learning Ranking Learning Cost-sensitive Learning Consistency
语种	中文
文献类型	学位论文
条目标识符	http://ir.ia.ac.cn/handle/173211/6082
专题	毕业生_博士学位论文
推荐引用方式 GB/T 7714	夏粉. 排序学习的理论和方法研究[D]. 中国科学院自动化研究所. 中国科学院研究生院,2008.

条目包含的文件
文件名称/大小	文献类型	版本类型	开放类型	使用许可
CASIA_20051801462808（1388KB）			暂不开放	CC BY-NC-SA