Research on Recommendation Algorithms for Large-Scale Datasets


	Research on Recommendation Algorithms for Large-Scale Datasets
Alternative title	Automated Recommendation With Large Data Sets
	陈才
	2012-08-22
Degree type	Master of Engineering
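Abstract (translated from Chinese)	In recent years, recommendation has become a basic service on virtual-life platforms such as Amazon, Netflix, Flickr, and Delicious. On these platforms users can act on items in various ways: on Amazon they can browse and buy books, and on Flickr they can browse, comment on, rate, and share photos. Because there are too many items for a user to browse one by one, automatic recommendation by the platform becomes especially important, and its quality directly affects the user experience. The large volume of user-item interaction data accumulated on these platforms also makes such recommendation feasible, and the research community has proposed many mature recommendation algorithms based on this data. With the spread of the Internet, the number of users on these platforms has grown sharply in recent years while user engagement has deepened, so the data on these platforms keeps growing. For example, Taobao has more than 200 million users, Jingdong (京东) more than 100 million, and Youku more than 300 million. Such large-scale datasets place new demands on algorithms: (1) time efficiency: because the data is large, the algorithm must run efficiently; (2) space efficiency: typical recommendation algorithms are storage-hungry because they keep many intermediate results, so on massive data the algorithm must be sufficiently space-efficient; (3) effectiveness: data sparsity is more severe on massive data, so the algorithm must still recommend well. Meeting these challenges requires a distributed algorithm. In recent years, particularly since Hadoop became popular, cluster technology has developed rapidly, and mature cluster technologies make distributed implementations of recommendation algorithms simple and feasible. This thesis therefore proposes a distributed recommendation algorithm for very large datasets. The algorithm first splits the original dataset into several sub-datasets using a fuzzy partitioning method, then independently computes a matrix factorization for each sub-dataset, and finally combines these results to produce the final recommendations. It can be implemented easily on various parallel platforms such as Hadoop. This thesis (1) describes the parallel algorithm in detail; (2) implements it on Hadoop; (3) validates its time efficiency, space efficiency, and recommendation quality through experiments on real datasets; and (4) analyzes the algorithm's effectiveness and scalability in detail and outlines directions for future work.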
Abstract (English)	In recent years, automated recommendation has become an essential service on virtual-life platforms such as Amazon, Netflix, Flickr, and Delicious. On such platforms, a user can take a variety of actions on the items the platform provides. For example, on Amazon a user can browse, review, and buy books; on Flickr a user can browse, comment on, rate, and share photos. Because there are too many items for users to browse one by one, automated recommendation becomes very important, and its quality directly affects the user experience. The rapidly accumulating interaction data between users and items that the platforms collect also makes this kind of recommendation possible. In response, the research community has proposed many recommendation algorithms. As Internet usage deepens, the number of users on these platforms keeps growing, and the user community becomes more active. As a result, these websites have collected huge amounts of data. For example, Taobao.com has more than 200 million users, 360buy.com has more than 100 million users, and Youku.com has more than 300 million users. Massive datasets bring new challenges: (1) time efficiency: because the data is so large, the algorithm must be time-efficient; (2) space efficiency: traditional recommendation algorithms usually need a lot of space to store intermediate results, so with large data the algorithm must be space-efficient; (3) performance: the data sparsity problem is more serious with massive datasets. To tackle these problems, this thesis proposes a scalable distributed recommendation algorithm that can be used for very large datasets. The algorithm first partitions the original dataset into many sub-datasets with a fuzzy partition method. It then performs matrix factorization on each sub-dataset, and finally combines these results to produce the final recommendation. The proposed algorithm can be easily implemented on many parallel platforms, such as the popular Hadoop platform. In this thesis, we (1) give a detailed introduction to the proposed algorithm, (2) implement the algorithm on the Hadoop platform, (3) validate the efficacy and the time and space efficiency of the proposed algorithm through experiments, and (4) analyze the algorithm and outline future work.
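Keywords	Recommender System; Big Data; Matrix Factorization; Hadoop Platform
Language	Chinese
Document type	Thesis
Item identifier	http://ir.ia.ac.cn/handle/173211/7651
Collection	Graduates_Master's Theses
Recommended citation (GB/T 7714)	陈才. Research on Recommendation Algorithms for Large-Scale Datasets [D]. Institute of Automation, Chinese Academy of Sciences. University of Chinese Academy of Sciences, 2012.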
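
The abstract names three stages (fuzzy partition, independent matrix factorization of each sub-dataset, combination of the results) but does not specify the partition rule, the factorization method, or the combination rule. The sketch below is only an illustration of that pipeline shape under assumptions that are not taken from the thesis: `fuzzy_partition` places each user's ratings in `overlap` consecutive shards, `factorize` uses plain SGD with L2 regularization, and `predict` averages the predictions of every shard model that knows both the user and the item. All function names, parameters, and the toy ratings are hypothetical.

```python
import random
from collections import defaultdict


def fuzzy_partition(ratings, n_parts, overlap=2):
    """Copy each user's ratings into `overlap` shards (assumed rule, not the thesis's)."""
    parts = [[] for _ in range(n_parts)]
    for user, item, score in ratings:
        for k in range(overlap):
            parts[(user + k) % n_parts].append((user, item, score))
    return parts


def factorize(part, rank=2, epochs=300, lr=0.02, reg=0.02, seed=0):
    """Factorize one shard with SGD; returns (user_factors, item_factors) as dicts."""
    rng = random.Random(seed)
    new_vec = lambda: [rng.uniform(0.1, 1.0) for _ in range(rank)]
    P, Q = defaultdict(new_vec), defaultdict(new_vec)
    for _ in range(epochs):
        for user, item, score in part:
            pu, qi = P[user], Q[item]
            err = score - sum(a * b for a, b in zip(pu, qi))
            for f in range(rank):
                puf, qif = pu[f], qi[f]  # keep old values for a symmetric update
                pu[f] += lr * (err * qif - reg * puf)
                qi[f] += lr * (err * puf - reg * qif)
    return dict(P), dict(Q)


def predict(models, user, item):
    """Average the predictions of all shard models that contain both user and item."""
    preds = [
        sum(a * b for a, b in zip(P[user], Q[item]))
        for P, Q in models
        if user in P and item in Q
    ]
    return sum(preds) / len(preds) if preds else None


# Toy (user, item, rating) triples, invented for illustration only.
ratings = [
    (0, 0, 5), (0, 1, 3), (1, 0, 4), (1, 2, 1),
    (2, 1, 2), (2, 2, 5), (3, 0, 4), (3, 2, 2),
]
parts = fuzzy_partition(ratings, n_parts=3, overlap=2)
# Each shard is factorized independently, so this loop could run in parallel.
models = [factorize(part, seed=k) for k, part in enumerate(parts)]
print(predict(models, 0, 0))
```

Because each call to `factorize` touches only its own shard, the per-shard step maps naturally onto independent tasks on a cluster, which is the property the abstract relies on for a Hadoop implementation; the overlap between shards is what lets the combination step draw on more than one model per user.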

Files in this item
File name/size	Document type	Version type	Access type	License
CASIA_20092801462802 (1136KB)			Not yet open	CC BY-NC-SA