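Research on Recommendation Algorithms for Large-Scale Datasets (大规模数据集下推荐算法研究)
Alternative title: Automated Recommendation With Large Data Sets
Chen Cai (陈才)
Degree type: Master of Engineering
Advisor: Zeng Dajun (曾大军)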
2012-08-22
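Degree-granting institution: University of Chinese Academy of Sciences
Place of conferral: Institute of Automation, Chinese Academy of Sciences
Discipline: Pattern Recognition and Intelligent Systems
Keywords: Recommender System, Big Data, Matrix Factorization, Hadoop platform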
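Chinese abstract (translated): In recent years, recommendation has become a basic service on platforms for users' online lives such as Amazon, Netflix, Flickr, and Delicious. On these platforms users can act on items in many ways: on Amazon they can browse and buy books, and on Flickr they can browse, comment on, rate, and share photos. Because there are too many items for a user to browse one by one, it is especially important that the platform recommend items automatically, and recommendation quality directly affects the user experience. The large volume of user-item interaction data accumulated on these platforms makes such recommendation feasible, and based on this data the research community has proposed many mature recommendation algorithms.
With the spread of the Internet, the number of users on these platforms has grown sharply in recent years and user engagement has deepened, so the data on these platforms keeps growing. For example, Taobao has more than 200 million users, JD.com (京东) more than 100 million, and Youku more than 300 million. Such large-scale datasets place new demands on algorithms: (1) time efficiency: the data is large, so the algorithm must be fast; (2) space efficiency: typical recommendation algorithms consume a lot of storage for large intermediate results, so with massive data the algorithm must use space efficiently; (3) recommendation quality: data sparsity becomes more severe with massive data, and the algorithm must still recommend well.
Meeting these challenges requires a distributed algorithm. In recent years, especially since Hadoop became popular, cluster technology has developed rapidly, and mature cluster technologies make distributed implementations of recommendation algorithms simple and feasible.
This thesis therefore proposes a distributed recommendation algorithm for very large datasets. The algorithm first splits the original dataset into several sub-datasets with a fuzzy partition method, then computes the matrix factorization of each sub-dataset independently, and finally combines these results to generate the final recommendations. It can be implemented easily on a variety of parallel platforms, such as Hadoop. The thesis (1) describes the parallel algorithm in detail; (2) implements it on Hadoop; (3) verifies its time efficiency, space efficiency, and recommendation quality through experiments on real datasets; and (4) analyzes its effectiveness and scalability in detail and summarizes directions for future work.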
English abstract: In recent years, automated recommendation has become an essential service on virtual life platforms such as Amazon, Netflix, Flickr, and Delicious. On such platforms, a user can take a variety of actions on the items the platform provides. For example, on Amazon a user can browse, review, and buy books; on Flickr a user can browse, comment on, rate, and share photos. Because the number of items is so large that users cannot browse them one by one, automated recommendation becomes very important, and the quality of the recommendations directly affects the user experience. The rapidly accumulating interaction data between users and items that the platforms collect also makes this kind of recommendation possible. In response, the research community has proposed many recommendation algorithms.
As Internet usage deepens, the number of users on these platforms keeps growing, and at the same time the user community becomes more active. Because of these factors, these websites have collected huge amounts of data. For example, Taobao.com has more than 200 million users, 360buy.com has more than 100 million users, and Youku.com has more than 300 million users. Massive datasets bring new challenges: (1) time efficiency: because the data is so large, the algorithm must be time efficient; (2) space efficiency: traditional recommendation algorithms usually need a lot of space to store intermediate results, so with large data the algorithm must be space efficient; (3) performance: the data sparsity problem is more serious with massive datasets, and the algorithm must still recommend well.
To tackle these problems, this thesis proposes a scalable distributed recommendation algorithm that can be used for very large datasets. The algorithm first partitions the original dataset into many sub-datasets with a fuzzy partition method. It then performs matrix factorization on each sub-dataset, and finally combines these results to produce the final recommendations. The proposed algorithm can be implemented easily on many parallel platforms, such as the popular Hadoop platform. In this thesis, we (1) give a detailed introduction to the proposed algorithm, (2) implement the algorithm on the Hadoop platform, (3) validate the efficacy and the time and space efficiency of the proposed algorithm through experiments, and (4) analyze the algorithm and outline future work.
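Illustrative sketch (not from the thesis): the three steps named in the abstract, partition, per-part matrix factorization, and combination, can be pictured with the small Python example below. The abstract does not specify how the fuzzy partition works or how the sub-results are combined, so this sketch assumes that each user is copied into a fixed number of randomly chosen sub-datasets and that predictions from the sub-models are simply averaged. All function names, parameters, and the toy data are hypothetical, and the code runs in a single process rather than on Hadoop.

```python
import random

def fuzzy_partition(ratings, n_parts, overlap, seed=0):
    """Split (user, item, rating) triples into overlapping sub-datasets.

    ASSUMPTION: each user is copied into `overlap` randomly chosen parts.
    The abstract does not say how the thesis's fuzzy partition works.
    """
    rng = random.Random(seed)
    users = sorted({u for u, _, _ in ratings})
    membership = {u: rng.sample(range(n_parts), overlap) for u in users}
    parts = [[] for _ in range(n_parts)]
    for u, i, r in ratings:
        for p in membership[u]:
            parts[p].append((u, i, r))
    return parts

def factorize(ratings, rank=2, epochs=200, lr=0.02, reg=0.05, seed=0):
    """Fit user and item factors to one sub-dataset with plain SGD."""
    rng = random.Random(seed)
    P = {u: [rng.uniform(0.0, 0.5) for _ in range(rank)]
         for u in sorted({u for u, _, _ in ratings})}
    Q = {i: [rng.uniform(0.0, 0.5) for _ in range(rank)]
         for i in sorted({i for _, i, _ in ratings})}
    for _ in range(epochs):
        for u, i, r in ratings:
            pu, qi = P[u], Q[i]
            err = r - sum(a * b for a, b in zip(pu, qi))
            for k in range(rank):
                pu_k = pu[k]  # keep the old value for the item update
                pu[k] += lr * (err * qi[k] - reg * pu_k)
                qi[k] += lr * (err * pu_k - reg * qi[k])
    return P, Q

def predict(models, user, item):
    """Average the predictions of every sub-model that knows both ids.

    ASSUMPTION: simple averaging; the thesis's combination rule is not given.
    Returns None when no sub-model has seen both the user and the item.
    """
    scores = [sum(a * b for a, b in zip(P[user], Q[item]))
              for P, Q in models if user in P and item in Q]
    return sum(scores) / len(scores) if scores else None

# Toy data (hypothetical): rating = user scale * item scale, so it is low-rank.
user_scale = {"u1": 1.0, "u2": 2.0, "u3": 1.0, "u4": 2.0}
item_scale = {"a": 1.0, "b": 2.0, "c": 2.5}
ratings = [(u, i, su * si) for u, su in user_scale.items()
           for i, si in item_scale.items()]

parts = fuzzy_partition(ratings, n_parts=3, overlap=2)
models = [factorize(part) for part in parts]  # each call is independent
```

Because each `factorize` call reads only its own sub-dataset, the calls could in principle run as separate tasks on a cluster, which is the property the abstract relies on; how the thesis maps this onto Hadoop is not described in this record.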
Call number: XWLW1825
Other identifier: 200928014628027
Language: Chinese
Document type: Thesis
Item identifier: http://ir.ia.ac.cn/handle/173211/7651
Collection: Graduates_Master's theses
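Recommended citation
GB/T 7714
Chen Cai (陈才). Research on Recommendation Algorithms for Large-Scale Datasets (大规模数据集下推荐算法研究) [D]. Institute of Automation, Chinese Academy of Sciences. University of Chinese Academy of Sciences, 2012.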
Files in this item
File name/size: CASIA_20092801462802 (1136KB)
Access type: not yet open
License: CC BY-NC-SA