CASIA OpenIR > Graduates > Master's Theses
Research on Recommendation Strategies Based on User Behavior Prediction and Reinforcement Learning
Zhang Zhiyuan
2024-05
Pages: 88
Degree type: Master
Chinese Abstract (translated)

Against the backdrop of the internet information explosion and users' growing demand for personalization, recommendation strategies have become a key technology for delivering precise services. Reinforcement learning (RL), a method capable of handling complex decision-making problems, has been widely applied to recommender systems in recent years to further optimize recommendation performance, owing to its strong adaptability to the environment and its long-term planning ability. By modeling sequential recommendation as a Markov decision process, RL's sequential decision-making capability can be used to optimize long-term user experience and overall platform revenue; after Netflix designed an RL-based personalized content recommendation strategy, research teams at YouTube, Alibaba, Tencent, Spotify, and other companies actively explored this direction. However, existing RL recommendation strategies still face problems such as sparse rewards, high state-action space complexity, dynamically changing user interests, large recommendation latency, and insufficient content-encoding precision. To address these challenges, this thesis studies RL recommendation strategies in depth; the main research results are as follows:

First, to address the problem that user interests change dynamically and are affected by short-term stochastic factors, this thesis proposes an RL recommendation framework that explicitly integrates a user behavior model. The framework treats the user as an active agent with dynamic interests and integrates the user's behavior model into the RL algorithm. This not only provides the agent with supervised training signals from user behavior, but also guides it to learn more accurate user interests, rewards, and state transitions, and hence a better recommendation policy. The thesis adopts the Recurrent State Space Model (RSSM) to model user behavior in depth, jointly accounting for long-term preferences and short-term stochastic factors so as to mine user interests precisely. Offline experiments on the Meituan Waimai dataset show that applying this strategy improves platform revenue metrics by 3.55% and user experience metrics by 1.75% on average.

Second, to address the low computational efficiency and slow response of the serially computed RSSM user behavior model, together with the sparse-reward problem, this thesis builds parallel RSSM models based on Transformer and RetNet (T-RSSM and R-RSSM). A parallel computing architecture designed for latency optimization and a multi-step Q-update training scheme significantly improve computational efficiency while mitigating the impact of reward sparsity on policy learning, markedly improving the recommendation strategy's performance. Offline experiments on the Meituan Waimai dataset show average improvements of 4.31% in platform revenue metrics and 2.42% in user experience metrics; compared with the RSSM model, the R-RSSM-based improvement speeds up inference by 21.8%. Results on the NetEase RL4RS dataset further confirm the strong performance of the proposed method.

In addition, to address the difficulty traditional deep-learning content encoders have in obtaining accurate semantic representations, this thesis introduces large language models (LLMs) to further strengthen the recommendation strategy. Applying an LLM within the Actor-Critic framework and using its strong semantic understanding to enhance the Critic's value estimates not only guides the Actor toward a better recommendation policy, but also avoids the high computational cost and increased inference latency of deploying the LLM directly in the online recommendation path, achieving more precise and personalized recommendations. Experiments on the MovieLens 1M dataset show that this strategy better understands item content and user interests, yielding more accurate movie recommendations.

The theoretical work in this thesis demonstrates the potential of user behavior prediction and reinforcement learning for recommendation strategies; experiments verify the effectiveness of the proposed strategies on multiple datasets, where they outperform the strongest existing baselines. Online deployment on the Meituan platform further demonstrates the framework's strong performance and commercial value in real-world scenarios, serving personalized recommendations to hundreds of millions of users. This research provides new perspectives and solutions for the development of recommendation strategies and solid technical support for efficient, personalized recommendation.

English Abstract

In the context of the explosive growth of information on the internet and the increasing demand for personalized services, recommendation strategies have become a key technology for providing precise services. Reinforcement learning (RL), as a method capable of handling complex decision-making problems, has been widely applied in recommendation systems in recent years due to its excellent adaptability and long-term planning capabilities. By modeling sequential recommendations as a Markov decision process, RL leverages its sequential decision-making abilities to optimize long-term user experience and overall platform revenue. For example, after Netflix designed a personalized content recommendation strategy based on RL, research teams from companies such as YouTube, Alibaba, Tencent, and Spotify have actively explored similar approaches. However, existing RL-based recommendation strategies still face challenges such as sparse rewards, high complexity of state-action space, dynamic changes in user interests, significant recommendation delays, and insufficient precision in content encoding. This paper delves into RL-based recommendation strategies and presents the following key research findings:
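The MDP framing mentioned above can be sketched minimally. This is an illustration only (all names and values are invented, not from the thesis): the state is the user's interaction history, the action is the recommended item, and the reward is the observed feedback.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class RecStep:
    """One transition of sequential recommendation cast as an MDP."""
    state: tuple        # item ids the user has interacted with so far
    action: int         # id of the item the policy recommends next
    reward: float       # immediate feedback (click, order, dwell time, ...)
    next_state: tuple   # history extended with the recommended item

def transition(history: tuple, item: int, feedback: float) -> RecStep:
    # The next state appends the new interaction to the history;
    # an RL agent maximizes the discounted sum of such feedback signals.
    return RecStep(history, item, feedback, history + (item,))

step = transition((12, 7), 42, 1.0)
```

Optimizing the discounted return over such transitions is what lets RL trade a weaker immediate click for better long-term engagement.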

Firstly, to address the issue of dynamic user interests influenced by short-term stochastic factors, this paper proposes a reinforcement learning recommendation framework that explicitly integrates user behavior models. This framework treats users as proactive agents with dynamic interests, integrating their behavior models into the RL algorithm. This not only provides the agent with supervised training signals from user behavior but also guides it to learn more accurate user interests, rewards, and state transition relationships, thereby developing better recommendation strategies. By adopting the Recurrent State Space Model (RSSM) for in-depth modeling of user behavior, considering both long-term preferences and short-term stochastic factors, the framework achieves precise extraction of user interests. Offline experimental results on the Meituan Waimai dataset show that the application of this recommendation strategy increases platform revenue metrics by an average of 3.55% and user experience metrics by an average of 1.75%.
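The RSSM idea described above combines a deterministic recurrent path (long-term preference) with a stochastic latent (short-term randomness). The following NumPy sketch illustrates that structure only; all dimensions, weight shapes, and the reward head are assumptions for illustration, not the thesis implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

class TinyRSSM:
    """Toy RSSM-style user model (illustrative only, not the thesis code).

    A deterministic recurrent state h_t carries long-term preference, while a
    stochastic latent z_t ~ N(mu(h_t), sigma(h_t)) captures short-term
    randomness in user behavior; a linear head predicts user feedback."""

    def __init__(self, h_dim=8, z_dim=4, a_dim=3):
        self.z_dim = z_dim
        self.W = rng.normal(0, 0.1, (h_dim, h_dim + z_dim + a_dim))
        self.W_mu = rng.normal(0, 0.1, (z_dim, h_dim))
        self.W_std = rng.normal(0, 0.1, (z_dim, h_dim))
        self.W_r = rng.normal(0, 0.1, h_dim + z_dim)  # reward head

    def step(self, h, z, a):
        # Deterministic update over [previous state, latent, action embedding].
        h_next = np.tanh(self.W @ np.concatenate([h, z, a]))
        # Stochastic latent via the reparameterization trick.
        mu = self.W_mu @ h_next
        std = 0.1 * np.exp(self.W_std @ h_next)
        z_next = mu + std * rng.normal(size=self.z_dim)
        # Predicted user feedback for the recommended item.
        reward = float(self.W_r @ np.concatenate([h_next, z_next]))
        return h_next, z_next, reward

model = TinyRSSM()
h, z = np.zeros(8), np.zeros(4)
for _ in range(5):                  # roll out five recommendation steps
    action = rng.normal(size=3)     # stand-in embedding of the shown item
    h, z, reward = model.step(h, z, action)
```

Training such a model on logged user behavior is what supplies the supervised signal the abstract refers to, alongside the RL objective.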

Secondly, to address the low computational efficiency and response speed of the RSSM user behavior model in serial computation and the issue of sparse rewards, this paper constructs parallel RSSM models based on Transformer and RetNet (T-RSSM and R-RSSM). By designing a parallel computing architecture suitable for delay optimization and a multi-step Q-update training method, the model’s computational efficiency is significantly improved, and the impact of sparse rewards on policy learning is mitigated, thereby enhancing the performance of the recommendation strategy. Offline experimental results on the Meituan Waimai dataset show that applying this recommendation strategy increases platform revenue metrics by an average of 4.31% and user experience metrics by an average of 2.42%; compared with the RSSM model, the R-RSSM variant also speeds up inference by 21.8%. Results on the NetEase RL4RS dataset further confirm the effectiveness of the proposed method.
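The multi-step Q-update mentioned above can be illustrated with the standard n-step target: summing several discounted rewards before bootstrapping lets a single sparse reward reach earlier states in one update. A minimal sketch (function name and values are illustrative):

```python
def n_step_target(rewards, gamma, bootstrap_q):
    """n-step Q-learning target: discounted sum of the next n rewards plus a
    discounted bootstrap value from n steps ahead. With sparse rewards, one
    payoff propagates back n steps per update instead of a single step."""
    g = sum((gamma ** k) * r for k, r in enumerate(rewards))
    return g + (gamma ** len(rewards)) * bootstrap_q

# Sparse-reward episode: only the third step pays off.
one_step = n_step_target([0.0], 0.9, 0.0)              # sees no reward yet
three_step = n_step_target([0.0, 0.0, 1.0], 0.9, 0.0)  # sees 0.9^2 * 1
```

With a 1-step target the update is zero until the reward is adjacent; the 3-step target already credits the earlier state, which is why multi-step updates ease sparse-reward learning.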

Additionally, to address the difficulty of traditional deep learning content encoders in obtaining accurate semantic representations, this paper introduces large language models (LLMs) to further enhance recommendation strategies. By applying LLMs within the Actor-Critic framework and leveraging their strong semantic understanding to enhance the Critic’s value estimation, the framework not only guides the Actor to develop better recommendation strategies but also avoids the high computational resource demands and increased inference delay of deploying LLMs directly in online recommendation, achieving more precise and personalized recommendations. Experimental results on the MovieLens 1M dataset show that this strategy excels in understanding item content and user interests, leading to more accurate movie recommendations.
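The design above, an LLM-augmented Critic used only at training time with a lightweight Actor served online, can be sketched as follows. The embedding function is a hash-based stand-in for a real frozen LLM encoder, and all shapes and names are assumptions.

```python
import hashlib
import numpy as np

rng = np.random.default_rng(1)

def llm_embed(text, dim=16):
    """Stand-in for a frozen LLM text encoder (hypothetical): a real system
    would embed item content with an actual language model."""
    seed = int.from_bytes(hashlib.sha256(text.encode()).digest()[:4], "big")
    return np.random.default_rng(seed).normal(size=dim)

class LLMCritic:
    """Critic conditioned on LLM item embeddings. It runs only during
    training, so the online serving path never pays the LLM's latency."""
    def __init__(self, state_dim=8, emb_dim=16):
        self.w = rng.normal(0, 0.1, state_dim + emb_dim)

    def value(self, state, item_text):
        return float(self.w @ np.concatenate([state, llm_embed(item_text)]))

class Actor:
    """Lightweight policy; no LLM call at inference time."""
    def __init__(self, state_dim=8, n_items=3):
        self.W = rng.normal(0, 0.1, (n_items, state_dim))

    def probs(self, state):
        logits = self.W @ state
        e = np.exp(logits - logits.max())  # stable softmax over candidates
        return e / e.sum()

actor, critic = Actor(), LLMCritic()
state = rng.normal(size=8)
p = actor.probs(state)  # online path: actor only
v = critic.value(state, "A sci-fi film about memory and identity")  # training
```

Because the LLM only shapes the Critic's gradients, the served policy keeps its original latency while still inheriting the richer content semantics.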

The theoretical research in this paper demonstrates the potential of user behavior prediction and RL in recommendation strategies, and the experiments verify the effectiveness of the proposed strategies across various datasets, outperforming existing optimal baselines. The online application on the Meituan platform further proves the excellent performance and commercial value of this recommendation framework in real-world scenarios, providing personalized recommendation services to hundreds of millions of users. This research offers new perspectives and solutions for the development of recommendation strategies and provides strong technical support for achieving efficient and personalized recommendations.

Keywords: Reinforcement Learning; Recommender Systems; User Behavior Modeling
Language: Chinese
Document type: Thesis
Identifier: http://ir.ia.ac.cn/handle/173211/57579
Collection: Graduates_Master's Theses
Recommended citation:
GB/T 7714
Zhang Zhiyuan. Research on Recommendation Strategies Based on User Behavior Prediction and Reinforcement Learning [D], 2024.
Files in this item
Name/Size | Document type | Access | License
_____________1_ (2). (3505KB) | Thesis | Restricted | CC BY-NC-SA

Except where otherwise noted, all content in this system is protected by copyright, with all rights reserved.