English Abstract | In recent years, automated recommendation has become an essential service on online platforms such as Amazon, Netflix, Flickr, and Delicious. On such platforms, a user can take a variety of actions on the items the platform provides. For example, on Amazon a user can browse, review, and buy books; on Flickr a user can browse, comment on, rate, and share photos. Because the number of items is too large for users to browse one by one, automated recommendation becomes very important, and its quality directly affects the user experience. The rapidly accumulating user-item interaction data that the platforms collect also makes this kind of recommendation possible. In response, the research community has proposed many recommendation algorithms. As Internet usage deepens, the number of users on these platforms keeps growing and the user community becomes more active. As a result, these websites have collected huge amounts of data. For example, Taobao.com has more than 200 million users, 360buy.com has more than 100 million users, and Youku.com has more than 300 million users. Massive datasets bring new challenges: (1) time efficiency: because the data is so large, the algorithm must be time efficient; (2) space efficiency: traditional recommendation algorithms usually need a large amount of space to store intermediate results, so with large data the algorithm must also be space efficient; (3) recommendation performance: the data sparsity problem becomes more serious with massive datasets. To tackle these problems, in this thesis we propose a scalable distributed recommendation algorithm that can be used for very large datasets. The algorithm first partitions the original dataset into many sub-datasets with a fuzzy partition method. 
It then performs matrix factorization on each sub-dataset, and finally combines the results to produce the final recommendations. The proposed algorithm can be easily implemented on many parallel platforms, such as the popular Hadoop platform. In this thesis, we (1) give a detailed introduction to the proposed algorithm, (2) implement the algorithm on the Hadoop platform, (3) validate the effectiveness and the time and space efficiency of the proposed algorithm through experiments, and (4) analyze the algorithm and discuss future work. |
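The three steps named in the abstract (fuzzy partition, per-block matrix factorization, combination) can be illustrated with a minimal single-machine sketch. This is not the thesis's implementation: the abstract does not specify the partition rule, the factorization method, or the combination rule. The sketch therefore assumes an overlapping modulo assignment of users to blocks, plain stochastic gradient descent for the factorization, and a simple average of block predictions. All function names, parameters, and the toy ratings are hypothetical.

```python
import random


def fuzzy_partition(ratings, num_blocks, overlap=2):
    """Copy each (user, item, rating) triple into `overlap` blocks.

    Assumed rule (not from the thesis): a user belongs to block
    `user % num_blocks` and to the next `overlap - 1` blocks, so
    neighbouring blocks share users. Expects overlap <= num_blocks.
    """
    blocks = [[] for _ in range(num_blocks)]
    for user, item, value in ratings:
        for shift in range(overlap):
            blocks[(user + shift) % num_blocks].append((user, item, value))
    return blocks


def factorize(block, rank=2, epochs=200, lr=0.02, reg=0.02, seed=0):
    """Factorize one block with plain SGD; returns user and item factor dicts."""
    rng = random.Random(seed)
    user_f, item_f = {}, {}
    for user, item, _ in block:
        if user not in user_f:
            user_f[user] = [rng.uniform(0.0, 0.5) for _ in range(rank)]
        if item not in item_f:
            item_f[item] = [rng.uniform(0.0, 0.5) for _ in range(rank)]
    for _ in range(epochs):
        for user, item, value in block:
            p, q = user_f[user], item_f[item]
            err = value - sum(a * b for a, b in zip(p, q))
            for f in range(rank):
                pf, qf = p[f], q[f]
                p[f] += lr * (err * qf - reg * pf)
                q[f] += lr * (err * pf - reg * qf)
    return user_f, item_f


def predict(models, user, item):
    """Average the predictions of every block model that knows user and item."""
    scores = [
        sum(a * b for a, b in zip(user_f[user], item_f[item]))
        for user_f, item_f in models
        if user in user_f and item in item_f
    ]
    return sum(scores) / len(scores) if scores else None


# Toy data for illustration only: 6 users, 4 items, synthetic ratings in 1..5.
ratings = [(u, i, float(1 + (u + i) % 5)) for u in range(6) for i in range(4)]
blocks = fuzzy_partition(ratings, num_blocks=3)
# Each block is factorized independently of the others.
models = [factorize(block, seed=n) for n, block in enumerate(blocks)]
print(predict(models, 0, 1))
```

Because each block is factorized without reference to the others, the per-block step could in principle run as separate parallel tasks on a platform such as Hadoop, with the combination step run afterwards. The sketch runs the blocks sequentially for simplicity.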