Research on User Behavior Prediction Based on Historical Data


	Research on User Behavior Prediction Based on Historical Data
	邱泽宇1,2
	2017-05-26
Degree type	Master of Engineering
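Chinese abstract	With the continued development of computing technology and the growing adoption of Internet applications, every industry now generates large volumes of data, and academia and industry have made many attempts to exploit and mine such historical data effectively. Historical e-commerce transaction data can help improve user experience and reduce merchants' costs; historical payment data can help assess financial risk and build credit mechanisms; historical academic research data can help measure the level of science and technology and estimate its future development. These applications depend, directly or indirectly, on the analysis and modeling of user behavior, so user behavior prediction based on historical data has become a widely studied research topic. Because the data are diverse in type, the relationships among types are complex, the degree of structure is low, and information is hard to collect, historical data commonly suffer from sparsity and noise, which makes user behavior prediction based on historical data very challenging. This thesis proposes solutions to two problems in this area, joint modeling of multiple kinds of information and modeling assisted by associated information, and improves the accuracy of user behavior prediction. Specifically, the main work and contributions are as follows: 1. For the problem of jointly modeling multiple kinds of historical user information, this thesis proposes an algorithm that predicts the influence of academic institutions through heterogeneous graph mining. To capture how the multiple factors in historical information, and the relationships among them, affect future user behavior, the thesis builds a heterogeneous graph that summarizes and describes these factors and their relationships, extracts features describing the graph, and combines them with a gradient boosting regression tree model to predict future behavior. Concretely, because the influence of an academic institution is proportional to the number of papers it publishes, predicting the number of papers an institution will publish over a future period gives a good estimate of how its influence will change. Traditional methods for predicting paper counts mostly rely on an institution's historical paper counts, but publication is affected by many factors, such as the number of authoritative scholars and the institution's research directions. The proposed algorithm combines these factors to predict and evaluate an institution's future paper count (its influence). Experimental results on data from several major conferences, together with the result of the KDDCup2016 academic institution influence prediction competition (first place in the second phase, among 341 teams), show that the algorithm is effective at modeling multiple factors and their relationships to help predict institutional influence. 2. For the problem of insufficient historical data in user behavior prediction, this thesis proposes a repeat buyer prediction method based on mining associated information over multiple time windows. To predict whether new users attracted by a promotion are likely to become repeat buyers of a shop, the method works on three fronts, data augmentation, feature design, and model selection, and brings in rich associated information about users and shops. Associated data from e-commerce are very sparse and come with a large amount of noise, so the thesis proposes a user behavior augmentation method based on multiple time windows, which enriches the associated information and the user behavior data by relaxing the time constraint on user behavior. The thesis then designs a range of associated-information features and combines two different models for repeat buyer prediction: a gradient boosting decision tree model, which captures the relationship between different feature combinations and repeat buyers, and an improved factorization machine model, which makes full use of the information gain from pairwise feature combinations. The proposed method identifies and predicts repeat buyers among a shop's new users well. Prediction experiments on Tmall user data and the result of the IJCAI2015 repeat buyer prediction competition (5th of 753) both confirm that the method effectively brings in associated information to help predict repeat buyers.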
English abstract	With the rapid development of information technology, massive amounts of data have been collected in various domains. Both academia and industry have extensively explored approaches to exploiting such data effectively. E-commerce transaction data reveal customers' preferences and even potential demands; exploiting such data can help improve user experience and reduce merchants' operating costs. Online payment histories are already widely used to evaluate financial risk and to build credit mechanisms. Academic research records, in a similar manner, can be a good indicator of the development of technology. Such applications rely, to a greater or lesser extent, on user behavior modeling and prediction, which has put the study of user behavior prediction based on historical data in the spotlight. Real-world user behavior data often suffer from sparsity, noise, and complexity, making it challenging to model and predict user behavior. Focusing on joint modeling of multiple data sources and on modeling with associated data, we propose several solutions. Our work and contributions can be summarized as follows: 1. To address the problem of jointly modeling multiple data sources, we propose an algorithm that predicts the influence of research institutions via heterogeneous graph mining. Existing approaches often fail to model multiple factors and their relationships simultaneously, so we build a heterogeneous graph that describes the multiple data sources and their relationships. By carefully designing features that describe the graph and feeding them into a gradient boosting regression tree (GBRT) model, we can predict future user behavior. Specifically, we apply the algorithm to forecast the number of papers published by research institutions, which is a strong indicator positively correlated with the influence of institutions. Most traditional methods focus on previously published papers and ignore other factors, such as famous scholars and research domains. The proposed algorithm tackles this problem through heterogeneous graph mining. Experiments on predicting future publications in several important conferences show that the algorithm is effective in modeling multiple data sources and their relationships to predict the influence of research institutions. Moreover, using this algorithm we won first place among 341 teams in the second phase of the research institution influence prediction competition in KDDCup2016. 2. To address the issue of insufficient data in user behavior modeling, we propose a user behavior prediction framework based on associated information modeling. To deal with the lack of data on new buyers attracted by a promotion, we identify repeat buyers by using various kinds of associated information about users and merchants. Since user behavior data in e-commerce are very sparse and noisy, we slide multiple windows over the time domain for data augmentation and smoothing. On this basis, we design multiple features that describe the characteristics of user- and merchant-related information. Two algorithms are then adopted to recognize potential repeat buyers: 1) gradient boosting decision tree (GBDT) and 2) ensemble factorization machines (e-FM). The first captures the benefit of feature combinations, and the second makes full use of pairwise information from features. The proposed framework is evaluated on data from the T-Mall commerce platform, and the experimental results show that it is effective in mining user-and-merchant associated information for repeat buyer prediction. Our framework also took fifth place among 753 teams in the repeat buyer prediction competition in IJCAI2015.
Keywords	user behavior prediction; multi-factor modeling; associated information; heterogeneous graph; multiple time windows
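Subject area	Pattern recognition
Document type	Thesis
Item identifier	http://ir.ia.ac.cn/handle/173211/14717
Collection	Graduates_Master's Theses
Author affiliations	1. National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences 2. University of Chinese Academy of Sciences
First author affiliation	Institute of Automation, Chinese Academy of Sciences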
Recommended citation (GB/T 7714)	邱泽宇. 基于历史数据的用户行为预测研究[D]. 北京. 中国科学院研究生院,2017.
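
The abstract states that the factorization machine model exploits pairwise information between features. As a rough illustration only, the sketch below scores one sample with the standard second-order factorization machine, not the improved or ensemble (e-FM) variant used in the thesis, whose details this record does not give. The function names `fm_predict` and `fm_predict_naive` and the parameter names `w0`, `w`, and `V` are assumptions made for this example.

```python
from itertools import combinations


def fm_predict(x, w0, w, V):
    """Score one sample with a standard second-order factorization machine.

    Assumed (hypothetical) parameter layout:
      x  -- feature values, length n
      w0 -- global bias
      w  -- linear weights, length n
      V  -- n latent vectors, each of length k
    """
    n, k = len(x), len(V[0])
    linear = sum(w[i] * x[i] for i in range(n))
    # Pairwise term computed in O(k * n) using the identity
    # sum_{i<j} <v_i, v_j> x_i x_j
    #   = 0.5 * sum_f [(sum_i v_if x_i)^2 - sum_i (v_if x_i)^2]
    pairwise = 0.0
    for f in range(k):
        s = sum(V[i][f] * x[i] for i in range(n))
        s_sq = sum((V[i][f] * x[i]) ** 2 for i in range(n))
        pairwise += 0.5 * (s * s - s_sq)
    return w0 + linear + pairwise


def fm_predict_naive(x, w0, w, V):
    """Same score, computed by enumerating every feature pair explicitly."""
    n = len(x)
    linear = sum(w[i] * x[i] for i in range(n))
    pairwise = sum(
        sum(a * b for a, b in zip(V[i], V[j])) * x[i] * x[j]
        for i, j in combinations(range(n), 2)
    )
    return w0 + linear + pairwise
```

The two functions should agree on any input; the first avoids the quadratic loop over feature pairs, which matters when the feature vector is long and sparse, as e-commerce features typically are.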

Files in this item
File name/size	Document type	Version type	Access type	License
邱泽宇_硕士学位论文_基于历史数据的用户（1858KB）	Thesis		Restricted access	CC BY-NC-SA