Courses on Deep Reinforcement Learning (DRL) and DRL papers for recommender systems
- Reinforcement Mechanism Design for e-commerce. Qingpeng Cai, Aris Filos-Ratsikas, Pingzhong Tang, Yiwei Zhang. WWW 2018. paper
- A Reinforcement Learning Framework for Explainable Recommendation. Xiting Wang, Yiru Chen, Jie Yang, Le Wu, Zhengtao Wu, Xing Xie. ICDM 2018. paper
- Aggregating E-commerce Search Results from Heterogeneous Sources via Hierarchical Reinforcement Learning. Ryuichi Takanobu, Tao Zhuang, Minlie Huang, Jun Feng, Haihong Tang, Bo Zheng. WWW 2019. paper
- Reinforcement Knowledge Graph Reasoning for Explainable Recommendation. Yikun Xian, Zuohui Fu, S. Muthukrishnan, Gerard de Melo, Yongfeng Zhang. SIGIR 2019. paper
- A Model-Based Reinforcement Learning with Adversarial Training for Online Recommendation. Xueying Bai, Jian Guan, Hongning Wang. NeurIPS 2019. paper
- DRCGR: Deep reinforcement learning framework incorporating CNN and GAN-based for interactive recommendation. Rong Gao, Haifeng Xia, Jing Li, Donghua Liu, Shuai Chen, and Gang Chun. ICDM 2019. paper
- End-to-End Deep Reinforcement Learning based Recommendation with Supervised Embedding. Feng Liu, Huifeng Guo, Xutao Li, Ruiming Tang, Yunming Ye, Xiuqiang He. WSDM 2020. paper
- Reinforced Negative Sampling over Knowledge Graph for Recommendation. Xiang Wang, Yaokun Xu, Xiangnan He, Yixin Cao, Meng Wang, Tat-Seng Chua. WWW 2020. paper
- An MDP-Based Recommender System. Guy Shani, David Heckerman, Ronen I. Brafman. JMLR 2005. paper
- Usage-Based Web Recommendations: A Reinforcement Learning Approach. Nima Taghipour, Ahmad Kardan, Saeed Shiry Ghidary. Recsys 2007. paper
- Reinforcement Learning based Recommender System using Biclustering Technique. Sungwoon Choi, Heonseok Ha, Uiwon Hwang, Chanju Kim, Jung-Woo Ha, Sungroh Yoon. arxiv 2018. paper
- Deep Reinforcement Learning based Recommendation with Explicit User-Item Interactions Modeling. Feng Liu, Ruiming Tang, Xutao Li, Weinan Zhang, Yunming Ye, Haokun Chen, Huifeng Guo, Yuzhou Zhang. arxiv 2018. paper
- End-to-End Deep Reinforcement Learning based Recommendation with Supervised Embedding. Feng Liu, Huifeng Guo, Xutao Li, Ruiming Tang, Yunming Ye, Xiuqiang He. WSDM 2020. paper (paywall)
- A Reinforcement Learning Framework for Relevance Feedback. Ali Montazeralghaem, Hamed Zamani, James Allan. SIGIR 2020. paper (paywall)
- Reinforcement Learning to Rank with Pairwise Policy Gradient. Jun Xu, Zeng Wei, Long Xia, Yanyan Lan, Dawei Yin, Xueqi Cheng, Ji-Rong Wen. SIGIR 2020. paper
- Leveraging Demonstrations for Reinforcement Recommendation Reasoning over Knowledge Graphs. Kangzhi Zhao, Xiting Wang, Yuren Zhang, Li Zhao, Zheng Liu, Chunxiao Xing, Xing Xie. SIGIR 2020. paper
- Interactive Recommender System via Knowledge Graph-enhanced Reinforcement Learning. Sijin Zhou, Xinyi Dai, Haokun Chen, Weinan Zhang, Kan Ren, Ruiming Tang, Xiuqiang He, Yong Yu. SIGIR 2020. paper
- Adversarial Attack and Detection on Reinforcement Learning based Recommendation System. Yuanjiang Cao, Xiaocong Chen, Lina Yao, Xianzhi Wang, Wei Emma Zhang. SIGIR 2020. paper
- Reinforcement Learning based Recommendation with Graph Convolutional Q-network. Yu Lei, Hongbin Pei, Hanqi Yan, Wenjie Li. SIGIR 2020. paper
- Nonintrusive-Sensing and Reinforcement-Learning Based Adaptive Personalized Music Recommendation. D Hong, L Miao, Y Li. SIGIR 2020. paper
- Keeping Dataset Biases out of the Simulation: A Debiased Simulator for Reinforcement Learning based Recommender Systems. Jin Huang, Harrie Oosterhuis, Maarten de Rijke, Herke van Hoof. RecSys 2020. paper
- Learning to Collaborate in Multi-Module Recommendation via Multi-Agent Reinforcement Learning without Communication. Xu He, Bo An, Yanghua Li, Haikai Chen, Rundong Wang, Xinrun Wang, Runsheng Yu, Xin Li, Zhirong Wang. RecSys 2020. paper
- User Response Models to Improve a REINFORCE Recommender System. Minmin Chen, Bo Chang, Can Xu, Ed Chi. WSDM 2021. paper
- Reinforcement Recommendation with User Multi-aspect Preference. Xu Chen, Yali Du, Long Xia, Jun Wang. WWW 2021. paper
- Towards Content Provider Aware Recommender Systems: A Simulation Study on the Interplay between User and Provider Utilities. Ruohan Zhan, Konstantina Christakopoulou, Ya Le, Jayden Ooi, Martin Mladenov, Alex Beutel, Craig Boutilier, Ed H. Chi, Minmin Chen. WWW 2021. paper
- Generative Inverse Deep Reinforcement Learning for Online Recommendation. Xiaocong Chen, Lina Yao, Aixin Sun, Xianzhi Wang, Xiwei Xu, Liming Zhu. CIKM 2021. paper
- Explore, Filter and Distill: Distilled Reinforcement Learning in Recommendation. Ruobing Xie, Shaoliang Zhang, Rui Wang, Feng Xia, Leyu Lin. CIKM 2021. paper
- Choosing the Best of All Worlds: Accurate, Diverse, and Novel Recommendations through Multi-Objective Reinforcement Learning. Dusan Stamenkovic, Alexandros Karatzoglou, Ioannis Arapakis, Xin Xin, Kleomenis Katevas. WSDM 2022. paper
- Reinforcement Learning over Sentiment-Augmented Knowledge Graphs towards Accurate and Explainable Recommendation. Sung-Jun Park, Dong-Kyu Chae, Hong-Kyun Bae, Sumin Park, and Sang-Wook Kim. WSDM 2022.
- Virtual-Taobao: Virtualizing Real-World Online Retail Environment for Reinforcement Learning paper
- PyRecGym: a reinforcement learning gym for recommender systems paper
- Environment Reconstruction with Hidden Confounders for Reinforcement Learning based Recommendation paper
- A General Offline Reinforcement Learning Framework for Interactive Recommendation. Teng Xiao, Donglin Wang. AAAI 2021. paper
- Self-Supervised Reinforcement Learning for Recommender Systems. Xin Xin, Alexandros Karatzoglou, Ioannis Arapakis, Joemon M. Jose. SIGIR 2020, pp. 931–940. paper
- Reinforcement Learning to Optimize Long-term User Engagement in Recommender Systems paper
- We formalize the feed streaming recommendation as a Markov decision process (MDP) and design a Q-Network to directly optimize the metrics of user engagement. To avoid the instability of convergence in offline Q-Learning, we further introduce an S-Network, which simulates the environment, to assist the policy learning.
- Further integrates a hierarchical LSTM with temporal cells into the Q-Network to characterize fine-grained user behaviors.
- Our approach augments standard recommendation models with two output layers: one for self-supervised learning and the other for RL. The RL part acts as a regularizer that drives the supervised layer to focus on specific rewards (e.g., recommending items which may lead to purchases rather than clicks), while the self-supervised layer with cross-entropy loss provides strong gradient signals for parameter updates.
- Based on this approach, two frameworks are proposed: Self-Supervised Q-learning (SQN) and Self-Supervised Actor-Critic (SAC).
- Offline RL has been widely studied in robot-control domains [20], but its use for recommendation remains underexplored. [21] outlines the following problems with offline-RL approaches for recommender systems:
- Unknown logging policy – the logged feedback comes from a mixture of unknown policies.
- Discrete stochastic policy – an RS must generate a top-k item list according to a discrete policy, creating a mismatch between the training and test datasets.
- Extrapolation error – the error between the dataset and the real-world state-action interactions.
Environment Reconstruction with Hidden Confounders for Reinforcement Learning based Recommendation
- Proposes an approach that learns the environment together with a hidden confounder, adopting the multi-agent generative adversarial imitation learning framework.
- Proposes a novel environment reconstruction method to tackle the practical situation where hidden confounders exist in the environment.
- Tested on private datasets from Didi Chuxing.
MaHRL: Multi-goals Abstraction based Deep Hierarchical Reinforcement Learning for Recommendations. Dongyang Zhao, Liang Zhang, Bo Zhang, Lizhou Zheng, Yongjun Bao, Weipeng Yan. SIGIR 2020. paper
- Propose a deep hierarchical reinforcement learning based recommendation framework, which consists of two components, i.e., high-level agent and low-level agent.
- The high-level agent catches long-term sparse conversion signals and automatically sets abstract goals for the low-level agent, while the low-level agent follows the abstract goals and interacts with the real-time environment.
Reinforcement Learning to Optimize Lifetime Value in Cold-Start Recommendation. Luo Ji, Qi Qin, Bingqing Han, Hongxia Yang. CIKM 2021. paper
- we model the process as a Partially Observable and Controllable Markov Decision Process (POC-MDP), and propose an actor-critic RL framework (RL-LTV) to incorporate the item lifetime values (LTV) into the recommendation. In RL-LTV, the critic studies historical trajectories and predicts the future LTV of fresh items.
- Incorporated the item LTVs into the online ranking by RL-LTV.
- State: observable part of current item life stage, including the item's time on the market
- Action: preference score for a specific item, a_t = [y_rl,t; p_t], where p_t is the discount degree of the averaged paid price.
- Reward: the return of the next-step IPV (item page view) with investment of the current recommended PV (page view).
- Long-term Gain and Discount factor.
- LSTM to encode the sequential information of historical observations into a continuous hidden state.
Toward Pareto Efficient Fairness-Utility Trade-off in Recommendation through Reinforcement Learning. Yingqiang Ge, Xiaoting Zhao, Lucia Yu, Saurabh Paul, Diane Hu, Chu-Cheng Hsieh, Yongfeng Zhang. WSDM 2022. paper
- we propose a fairness-aware recommendation framework using multi-objective reinforcement learning (MORL)
- In the area of fairness-aware recommendation, methods can be roughly divided into three categories: pre-processing, in-processing and post-processing algorithms.
- Formulate the fairness-aware task as a Multi-Objective Markov Decision Process (MOMDP), with one recommendation objective, e.g., CTR, and one fairness objective, e.g., item exposure fairness (the method generalizes to more recommendation objectives as well as more fairness objectives).
- Fairness in recommendations: gender, race, item popularity, user feedback biases. Individual fairness and group fairness.
- State: the representation of the user's most recent positive interaction history H_t with the recommendation system
- Action: recommendation list with K items to a user u at time t with current state s_t
- Reward: vector-valued reward function; each component corresponds to one objective – the utility objective and the fairness objective.
- Discount rate: measures the contribution of long-term rewards.
- Concatenate the state representation s_t with the preference vector ω and train a neural network on this joint feature space. A GRU is used for the user representation.
- Conditioned critic: concatenate the state representation s_t with the vector ω as well as the embedding of action a_t, and require the output to be a Q-value vector with size equal to the number of objectives (see the sketch after this list).
- Tested on Movielens 100k, CIAO, Etsy
- Evaluation metrics: Recall, F1 Score, NDCG, Popularity Rate (the ratio of the number of popular items in the recommendation list to the total number of items in the list).
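  A minimal sketch of such a conditioned critic (hypothetical shapes and layer sizes; not the paper's implementation):

  ```python
  # Hypothetical sketch: a conditioned critic for MORL that concatenates the
  # state s_t, the preference vector w, and the action embedding a_t, and
  # outputs one Q-value per objective.
  import torch
  import torch.nn as nn

  class ConditionedCritic(nn.Module):
      def __init__(self, state_dim, action_dim, n_objectives, hidden=128):
          super().__init__()
          # input = [s_t ; w ; a_t]; w holds one weight per objective
          self.net = nn.Sequential(
              nn.Linear(state_dim + n_objectives + action_dim, hidden),
              nn.ReLU(),
              nn.Linear(hidden, n_objectives),  # Q-value vector, one entry per objective
          )

      def forward(self, state, w, action_emb):
          x = torch.cat([state, w, action_emb], dim=-1)
          return self.net(x)  # shape: (batch, n_objectives)

  # usage: scalarize the Q-vector with the same preference vector w
  critic = ConditionedCritic(state_dim=64, action_dim=32, n_objectives=2)
  s, w, a = torch.randn(4, 64), torch.softmax(torch.randn(4, 2), -1), torch.randn(4, 32)
  q_vec = critic(s, w, a)            # per-objective Q-values (e.g., utility, fairness)
  q_scalar = (q_vec * w).sum(-1)     # scalarized value under preference w
  ```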
Self-Supervised Reinforcement Learning for Recommender Systems. Xin Xin, Alexandros Karatzoglou, Ioannis Arapakis, Joemon Jose. SIGIR 2020. paper
- State: a continuous state space to describe the user states. This is modeled based on the user (sequential) interactions with the items.
- Action: The action a of the agent is the selected item to be recommended. In offline data, we can get the action at timestamp t from the user-item interaction (i.e., a_t = x_{t+1}).
- Reward: flexible scheme
- Proposed a self-supervised Actor-Critic (see the sketch below). The learned Q-values are unbiased and learned from positive user-item interactions (feedback); as a result, they can be regarded as an unbiased measurement of how well the recommended item satisfies the defined rewards. In this model the "actor" is self-supervised.
- Utilizes the cross-entropy loss as the self-supervised loss.
- Offline metrics used: hit rate (click; purchase) and NDCG.
- Tested on RetailRocket and RC15 datasets
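  A rough sketch of the SQN idea under assumed dimensions (the GRU backbone, item counts, and reward handling here are illustrative, not the authors' code):

  ```python
  # Hypothetical sketch of SQN: one sequential encoder with two output heads,
  # a cross-entropy head for self-supervised next-item prediction and a Q head
  # whose TD error acts as a reward-aware regularizer.
  import torch
  import torch.nn as nn
  import torch.nn.functional as F

  class SQN(nn.Module):
      def __init__(self, n_items, emb_dim=64, hidden=64):
          super().__init__()
          self.emb = nn.Embedding(n_items, emb_dim)
          self.gru = nn.GRU(emb_dim, hidden, batch_first=True)
          self.head_ce = nn.Linear(hidden, n_items)  # self-supervised head
          self.head_q = nn.Linear(hidden, n_items)   # RL head (Q-values per item)

      def forward(self, item_seq):
          h, _ = self.gru(self.emb(item_seq))
          s = h[:, -1]                               # state = last hidden step
          return self.head_ce(s), self.head_q(s)

  def sqn_loss(model, seq, next_item, reward, next_seq, gamma=0.9):
      logits, q = model(seq)
      with torch.no_grad():
          _, q_next = model(next_seq)
      target = reward + gamma * q_next.max(dim=1).values
      td = F.mse_loss(q.gather(1, next_item.unsqueeze(1)).squeeze(1), target)
      ce = F.cross_entropy(logits, next_item)        # strong gradient signal
      return ce + td                                 # RL part regularizes the CE head
  ```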
Reinforcement Learning to Rank in E-Commerce Search Engine: Formalization, Analysis, and Application. Yujing Hu, Qing Da, Anxiang Zeng, Yang Yu, Yinghui Xu. KDD 2018. paper
- We formally define the concept of a search session Markov decision process (SSMDP) to formulate the multi-step ranking problem.
- Tackles the high reward variance and unbalanced reward distribution of SSMDP. The reward variance is high because the deal price m(h) normally varies over a wide range, and the immediate reward distribution of (s, a) is unbalanced because conversion events led to by (s, a) occur much less frequently than the other two cases.
- State: non-terminal state set that contains all continuation events. In the simulation, a state is represented by a 180-dim feature vector extracted from the last 4 item pages of the current search session.
- Action: Action space, that contains all possible ranking functions of the search engine.
- The reward is designed to encourage more successful transactions.
- The objective of the agent is to find an optimal parameter which maximizes the expectation of the T -step returns along all possible trajectories
- Rely on the deterministic policy gradient (DPG) algorithm to learn an optimal ranking policy in the SSMDP.
- Experiments on internal Taobao data.
- Each item is represented by a feature vector, and a ranking action of the search engine is an n-dim weight vector (see the sketch below).
- Network parameters are optimized by Adam
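  A tiny illustration of this ranking action, with made-up dimensions (the 180-dim feature size follows the note above; everything else is assumed):

  ```python
  # Hypothetical sketch: the agent's action is a weight vector w, and items
  # are ranked by the inner product of w with each item's feature vector.
  import numpy as np

  def rank_items(item_features: np.ndarray, action_weights: np.ndarray, top_k: int = 10):
      """item_features: (n_items, n_dims); action_weights: (n_dims,)"""
      scores = item_features @ action_weights          # one score per item
      order = np.argsort(-scores)                      # descending by score
      return order[:top_k], scores[order[:top_k]]

  rng = np.random.default_rng(0)
  items = rng.normal(size=(100, 180))                  # e.g. 180-dim item features
  w = rng.normal(size=180)                             # action produced by the policy
  top_idx, top_scores = rank_items(items, w)
  ```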
Text-Based Interactive Recommendation via Constraint-Augmented Reinforcement Learning. Ruiyi Zhang, Tong Yu, Yilin Shen, Hongxia Jin, Changyou Chen, Lawrence Carin. NeurIPS 2019. paper
- Employ an RL-based formulation for sequential recommendation of items to users, utilizing user feedback in natural language
- The user then views the recommended item and gives feedback in natural language, describing the desired aspects that the current recommended item lacks. The system then incorporates the user feedback and recommends (ideally) more-suitable items, until the desired item is found.
- Uses feedback from users as constraints and formulates text-based interactive recommendation as a constrained policy optimization problem.
- Applied a constrained MDP with the following constraint functions: (i) constraints sequentially added via natural-language feedback; (ii) constraints parameterized by a neural network for better generalization.
- State: state of sequential recommender implemented via GRU state tracker
- Actions: recommended item at time t
- Reward: interaction with text-based recommendation. Can be a metric reward (e.g., BLEU) or a learned reward function with general discriminator.
- To obtain a good initialization of our model, some components are pre-trained in a supervised manner
Model-Based Reinforcement Learning for Whole-Chain Recommendations. Xiangyu Zhao, Long Xia, Yihong Zhao, Dawei Yin, Jiliang Tang. arxiv 2019. paper
- Proposed a multi-agent model-based RL approach (DeepChain), which can capture the sequential correlation among different scenarios and jointly optimize multiple recommendation strategies. There are 3 scenarios of recommendation agents (RAs), and each one serves as a recommendation strategy that recommends items to a user.
- Formulate the recommendation tasks within multiple consecutive scenarios as a whole-chain recommendation problem, and leverage multi-agent reinforcement learning (MARL) to jointly optimize multiple recommendation strategies, which is capable of maximizing the overall performance of the whole recommendation session
- State: chronologically sorted sequence of a user’s historical clicked or purchased items before time t, which represents the user’s preference at time t.
- Action space: recommending a list of relevant items corresponding to state s_t
- Reward: corresponding to skip, click, purchase, or leave.
- Discount factor: When γ = 1, all future rewards can be fully counted into the current action; when γ = 0, only the immediate reward is considered
- Actor-Critic architecture. The goal of the actors is to suggest recommendations based on users' historical browsing behaviors (state), which should address two challenges: (i) how to capture users' dynamic preference in one recommendation session, and (ii) how to generate recommendations according to the learned users' preference.
- A recurrent neural network (RNN) with Gated Recurrent Units (GRU) to capture users’ sequential browsing behaviors.
- The global critic controls all RAs to work cooperatively to maximize the global performance. Specifically, the input of the critic is the current state-action pair (s_t, a_t), and the output is an action-value Q(s_t, a_t) that evaluates the future cumulative rewards starting from the current state and action.
- The actor inputs the current state s and aims to output a deterministic action (i.e., recommending a deterministic page of M items), tackling three problems: 1) setting an initial preference at the beginning of a new recommendation session, 2) learning the real-time preference in the current session, which should capture the dynamic nature of the user's preference and the user's preferred item display patterns in a page, and 3) jointly generating a set of recommendations and displaying them in a 2-D page.
- The critic inputs only this state-action pair rather than all potential state-action pairs, and leverages an approximator to learn an action-value function Q(s, a), which judges whether the action a (i.e., a recommendation page) generated by the actor matches the current state.
- A GRU architecture captures the user's initial preference.
- The inputs of the GRU are the user's last clicked/purchased items {e_1, ..., e_N} (sorted in chronological order) before the current session, while the output is a vector representation of the user's initial preference.
- Tested offline and online; baselines include W&D, DFM, GRU, DDPG, MA.
- No code
- Proprietary JD.com dataset
Stabilizing Reinforcement Learning in Dynamic Environment with Application to Online Recommendation. Shi-Yong Chen, Yang Yu, Qing Da, Jun Tan, Hai-Kuan Huang, Hai-Hong Tang. KDD 2018. paper
- Addresses the problem of reward estimation in dynamic environments. Proposes stratified sampling replay to replace traditional experience replay, one of the most frequently used techniques in deep reinforcement learning.
- To alleviate the sample variance problem, a stratified sampling replay strategy is proposed to replace the original experience replay (see the sketch after this list). First, over a long period of time, attributes are selected on which the customer distributions seem to be stable over time, such as gender, age, and geography.
- Taobao offers a product called "tip", which is essentially a shopping guide providing such options. Usually, a tip is displayed as a text box with several phrases.
- State. A state is expected to express the customer’s features, preference, and recent behavior. So the state includes: the customer’s gender, age, purchasing power and some other basic features, the features of the current query, the customer preference for tip, as well as some more real-time features, such as the features of tip and goods that the customer recently clicks, etc.
- Action - an ID that can be mapped to a type of tip
- Reward - a customer's click on a tip indicates that the display is valuable, so a positive reward is given, which aims to lead the system to provide more helpful guidance to customers. The sooner the user clicks, the more valuable the display, as the tip is designed to guide users to find the needed goods faster.
- The strategies proposed in this paper improve the estimation of reward and thus effectively increase the accuracy of recommendations; however, the distribution of observed data in real-world environments is still wide.
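  A simplified sketch of stratified sampling replay keyed by assumed user attributes (not the paper's implementation):

  ```python
  # Hypothetical sketch: transitions are stored in strata keyed by (assumed)
  # stable customer attributes such as gender/age bucket, and each minibatch
  # draws roughly evenly from every non-empty stratum.
  import random
  from collections import defaultdict

  class StratifiedReplay:
      def __init__(self, capacity_per_stratum=10000):
          self.buffers = defaultdict(list)
          self.capacity = capacity_per_stratum

      def add(self, stratum_key, transition):
          buf = self.buffers[stratum_key]
          buf.append(transition)
          if len(buf) > self.capacity:
              buf.pop(0)                       # drop the oldest transition

      def sample(self, batch_size):
          strata = [b for b in self.buffers.values() if b]
          per_stratum = max(1, batch_size // len(strata))
          batch = []
          for buf in strata:
              batch.extend(random.sample(buf, min(per_stratum, len(buf))))
          return batch

  replay = StratifiedReplay()
  replay.add(("female", "18-24"), ("s", "a", 1.0, "s_next"))
  replay.add(("male", "25-34"), ("s", "a", 0.0, "s_next"))
  minibatch = replay.sample(batch_size=32)
  ```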
Deep Reinforcement Learning for Page-wise Recommendations. Xiangyu Zhao, Long Xia, Liang Zhang, Zhuoye Ding, Dawei Yin, Jiliang Tang. RecSys 2018. paper
- we propose a principled approach to jointly generate a set of complementary items and the corresponding strategy to display them in a 2-D page
- propose a novel page-wise recommendation framework based on deep reinforcement learning, DeepPage, which can optimize a page of items with proper display based on real-time feedback from users
- state space: defined as user’s current preference, which is generated based on user’s browsing history, i.e., the items that a user browsed and her corresponding feedback
- action space: page of M item to a user based on the current state s
- reward: after recommending a page of items to a user, the user browses these items and provides her feedback. She can skip (not click), click, or purchase these items, and the agent receives an immediate reward r(s, a) according to the user's feedback. The reward r of a skipped/clicked/purchased item is empirically set to 0, 1, and 5, respectively (sketched after this list).
- transition: the state transition from s to s′ when RA takes action a
- discount factor: when we measure the present value of future reward
- Specifically, we model the recommendation task as a MDP in which a recommender agent (RA) interacts with environment E (or users) over a sequence of time steps
- Uses an Actor-Critic architecture. The critic inputs a state-action pair and outputs the corresponding Q-value, making use of the optimal action-value function Q∗(s, a).
- Tested in both online and offline environments. Online test: DeepPage interacts with users and receives real-time feedback on the recommended items. Offline test: DeepPage is evaluated on users' historical browsing data.
- Metrics: Precision@20, Recall@20, F1-score@20 [12], NDCG@20 [16] and MAP.
- Compared with CF, FM, GRU, DQN, DDPG.
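  A toy illustration of this reward scheme (the 0/1/5 values come from the note above; the page-level summation and discounting details are assumptions):

  ```python
  # Hypothetical sketch: per-item rewards of 0 / 1 / 5 for skip / click /
  # purchase, summed over a recommended page, plus a discounted return over
  # successive pages.
  ITEM_REWARD = {"skip": 0.0, "click": 1.0, "purchase": 5.0}

  def page_reward(feedback):
      """feedback: list of 'skip' | 'click' | 'purchase', one per displayed item."""
      return sum(ITEM_REWARD[f] for f in feedback)

  def discounted_return(page_rewards, gamma=0.95):
      g = 0.0
      for r in reversed(page_rewards):
          g = r + gamma * g
      return g

  rewards = [page_reward(["skip", "click", "skip", "purchase"]),
             page_reward(["click", "skip", "skip", "skip"])]
  print(discounted_return(rewards))
  ```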
Reinforced Negative Sampling over Knowledge Graph for Recommendation. Xiang Wang, Yaokun Xu, Xiangnan He, Yixin Cao, Meng Wang, Tat-Seng Chua. WWW 2020. paper
- existing negative sampling strategies, either static or adaptive ones, are insufficient to yield high quality negative samples both informative to model training and reflective of user real needs.
- Most interactions are in the form of implicit feedback, e.g., clicks and purchases, which provide only signals on positive feedback. This brings a fundamental challenge to recommender model learning — how to distill negative signals from positive-only data — also known as the one-class problem.
- Propose a new negative sampling model, KGPolicy (short for Knowledge Graph Policy Network), which employs a reinforcement learning (RL) agent to explore the KG and discover high-quality negative examples. At its core is the designed exploration operation, which navigates over the KG starting from the positive item.
- Devise a neighbor attention module to achieve this goal, which specifies the varying importance of one- and two-hop neighbors conditioned on the positive user-item pair, so as to adaptively capture personal tastes on KG entities and yield potential items.
- The pairwise BPR loss is the objective function used to optimize and learn the parameters (see the sketch after this list). Specifically, it assumes that, for a target user, her historical items, which reflect more personal interest, should be assigned higher prediction scores than unobserved items.
- Introduced multi-hop paths: for a positive (u, i) interaction, we can traverse paths rooted at node i, terminate at an unobserved item j, and view the multi-hop connections as the relationships between i and j.
- Enumerating possible paths towards all unobserved items is infeasible in large-scale KGs: it requires labor-intensive feature engineering, is memory-consuming to store the paths, and time-consuming to distill useful signals. RL is used to address this.
- state: at step t, the state s_t conditioned on the target user u is defined as (u, e_t), where e_t is the node the sampler currently visits.
- action: all possible exploration operations starting from node e_t, excluding the historical trajectory.
- reward: measures the quality of the proposed item e_t for the pairwise preference (u, i, j) at step t. Split into a prediction reward and a similarity reward.
- evaluation metrics: Recall@K and NDCG@K.
- 3 research questions. RQ1: How does KGPolicy perform compared with existing methods w.r.t. two dimensions — the effectiveness of negative sampling and the usage of knowledge entries? RQ2: How do different components (e.g., number of exploration operations, reward functions) affect KGPolicy? RQ3: Can KGPolicy provide in-depth analyses of negative samples?
- used Amazon-book, Last-FM, and Yelp2018 datasets.
- [code](https://github.com/xiangwang1223/kgpolicy)
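  A small sketch of the pairwise BPR objective that the sampled negatives feed into (embedding dimensions and batch handling are assumed):

  ```python
  # Hypothetical sketch of the pairwise BPR objective: for user u, the observed
  # item i should score higher than a negative item j drawn by the RL sampler.
  import torch
  import torch.nn.functional as F

  def bpr_loss(user_emb, pos_item_emb, neg_item_emb):
      """All inputs: (batch, dim). neg_item_emb comes from the negative sampler."""
      pos_score = (user_emb * pos_item_emb).sum(-1)
      neg_score = (user_emb * neg_item_emb).sum(-1)
      # -log sigmoid(pos - neg): pushes positives above the sampled negatives
      return -F.logsigmoid(pos_score - neg_score).mean()

  u, i, j = torch.randn(8, 64), torch.randn(8, 64), torch.randn(8, 64)
  loss = bpr_loss(u, i, j)
  ```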
KERL: A Knowledge-Guided Reinforcement Learning Model for Sequential Recommendation. Pengfei Wang, Yu Fan, Long Xia, Wayne Xin Zhao, Shaozhang Niu, Jimmy Huang. SIGIR 2020. [paper](https://dl.acm.org/doi/pdf/10.1145/3397271.3401134)
- propose a novel Knowledge-guided Reinforcement Learning model (KERL for short) for fusing KG information into a RL framework for sequential recommendation
- For sequence-level reward, we borrow the BLEU metric and measure the overall quality of the recommendation sequence
- Consider a sequential interaction scenario in recommender systems, where a user selects interested items at different time steps.
- In addition to users’ interaction histories, a knowledge graph (KG) G is available for our task, where each record is a knowledge triple involving two entities with some relation
- The environment’s state can be considered to contain all useful information for sequential recommendation, including interaction history and KG.
- The action behavior is modeled by a policy π(st ), which defines a function that takes the state st as input and outputs a distribution over all the possible items.
- State representation: combination of the output of a GRU network and a knowledge-level representation.
- To make a good trade-off between exploitation and exploration, two kinds of knowledge-based preference are modeled for a user: the current knowledge-based preference (current preference: average pooling over the KG embeddings of the historical items the user has interacted with) and the future knowledge-based preference (future preference: a Multi-Layer Perceptron network).
- The reward decomposes into a sequence-level reward and a knowledge-level reward.
- For sequential-based models, compared with the shallow model FPMC, the deep models DREAM and GRU4Rec obtain better performance on all evaluation metrics.
- The major novelty of KERL is that it formalizes the sequential recommendation task in an MDP setting, incorporating knowledge information into both state representations and reward functions.
Keeping Dataset Biases out of the Simulation: A Debiased Simulator for Reinforcement Learning based Recommender Systems. Jin Huang, Harrie Oosterhuis, Maarten de Rijke, Herke van Hoof. RecSys 2020. paper
- Applying RL4Rec online comes with risks: exploration may lead to periods of detrimental user experience. Moreover, few researchers have access to real-world recommender systems. Simulations have been put forward as a solution where user feedback is simulated based on logged historical user data, thus enabling optimization and evaluation without being run online. While simulators do not risk the user experience and are widely accessible, we identify an important limitation of existing simulation methods.
- Furthermore, online deployment takes time, costs money, and many researchers – both in academia and industry – simply do not have access to an actual platform with live users.
- To evaluate the effects of bias on RL4Rec simulations, we propose a novel evaluation approach for simulators that considers the performance of policies optimized with the simulator
- A fully synthetic approach oversimplifies users' behavior.
- Two influential types of bias are popularity bias (users interact mostly with popular items, producing a long-tail distribution) and positivity bias (users tend to interact with and rate the items they like most).
- States: historical interactions of users till the t-th turn of interaction, consisting of the recommended items and the corresponding feedback.
- Actions: Items to recommend
- Rewards: After receiving action at , consisting of item it being recommended by the RS, the (simulated) user gives feedback ft ∈ {0,1} (i.e.,skip or click) on this item.
- Transition probability: determines the next state once the user has provided feedback.
- Discount factor: Balancing factor between the immediate and future rewards.
- RL4Rec simulator components: user model, item model, user-choice model. The user model and item model aim to capture user preference for items, while the user-choice model simulates user feedback when an item is recommended by an RS.
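  A hypothetical skeleton of such a simulator, with a logistic choice model standing in for the paper's debiased user-choice model:

  ```python
  # Hypothetical skeleton of an RL4Rec simulator with the three components named
  # above: a user model and an item model that capture preference, and a
  # user-choice model that turns a recommended item into {0: skip, 1: click}.
  import numpy as np

  class Rl4RecSimulator:
      def __init__(self, n_users, n_items, dim=16, seed=0):
          rng = np.random.default_rng(seed)
          self.user_model = rng.normal(size=(n_users, dim))   # user preference vectors
          self.item_model = rng.normal(size=(n_items, dim))   # item vectors

      def choice_model(self, user_id, item_id):
          # click probability from the (ideally debiased) preference estimate
          score = self.user_model[user_id] @ self.item_model[item_id]
          return 1.0 / (1.0 + np.exp(-score))

      def step(self, user_id, item_id):
          feedback = int(np.random.random() < self.choice_model(user_id, item_id))
          return feedback                                      # 0 = skip, 1 = click

  sim = Rl4RecSimulator(n_users=100, n_items=500)
  print(sim.step(user_id=3, item_id=42))
  ```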
Pseudo Dyna-Q: A Reinforcement Learning Framework for Interactive Recommendation. Lixin Zou, Long Xia, Pan Du, Zhuo Zhang, Ting Bai, Weidong Liu, Jian-Yun Nie, Dawei Yin. WSDM 2020. paper
- Applying reinforcement learning (RL) in recommender systems is attractive but costly due to the constraint of the interaction with real customers, where performing online policy learning through interacting with real customers usually harms customer experiences.
Learning to Collaborate: Multi-Scenario Ranking via Multi-Agent Reinforcement Learning. Jun Feng, Heng Li, Minlie Huang, Shichen Liu, Wenwu Ou, Zhirong Wang, Xiaoyan Zhu. WWW 2018. paper
- Multi-scenario ranking as a fully cooperative, partially observable, multi-agent sequential decision problem.
- Propose a novel model named Multi-Agent Recurrent Deterministic Policy Gradient (MA-RDPG) which has a communication component for passing messages, several private actors (agents) for making actions for ranking, and a centralized critic for evaluating the overall performance of the co-working actors.
- Agents collaborate with each other through a shared global action-value function and by passing messages that encode historical information across scenarios. Each ranking strategy in one scenario is treated as an agent. Each agent takes local observations (user behavior data) and makes local actions for ranking items with its private actor network. Different agents share a global critic network so that they can accomplish the same goal collaboratively; the critic network evaluates the future overall rewards starting from a current state and the taken actions. The agents communicate with each other by sending messages.
- Collaborative optimization of the key metrics (CTR, GMV, CVR)
- Model evaluation results on Taobao data
- The environment is partially observable, and each agent only receives a local observation instead of observing the full state of the environment
- In multi-agent settings the state of the environment is global, shared by all agents while the observation, the action and the intermediate reward are all private, only processed by the agent itself.
- Communication component: the message encodes the local observation and the previous message, and the component generates the next message from the previous message and the current observation. An LSTM architecture is applied for this purpose (see the sketch below).
Scenario | Feature | Description |
---|---|---|
Main search | CTR | A CTR estimation using logistic regression, considering features of users, items and their interactions |
Main search | Rating Score | Average user ratings on a certain item |
Main search | Search popularity | Popularity of the item shop |
In-shop search | Latest Collection | Whether an item is the latest collection or new arrivals of the shop |
In-shop search | Sales volume | Sales volume of an in-shop item |
- Environment - online E-commerce platform
- Agents - one is the search engine for the main search and the other is the in-shop search engine. At each step, one of the search engines returns a ranked list of products according to its ranking algorithm. The two agents work together to maximize the overall performance.
- States are partially observable.
- To rank items, the ranking algorithm computes an inner product of the feature value vector and the weight vector. Changing an action means to change the weight vector for the ranking features.
- Reward. The rewards consider not only purchase behaviors but also other user behaviors: a click yields a positive reward of 1; if there is neither a purchase nor a click, a negative reward of −1 is received; and if a user leaves the page without buying any product, there is a negative reward of −5.
- Baseline models: Empirical weights, learning to Rank
- No code
- Performance is measured in terms of the GMV gap in both search scenarios.
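  A minimal sketch of a message-passing module of this kind (dimensions and interfaces are assumed, not taken from the paper):

  ```python
  # Hypothetical sketch of the MA-RDPG communication component: an LSTM cell
  # produces the next message from the previous message and the current local
  # observation, so agents can pass history across scenarios.
  import torch
  import torch.nn as nn

  class MessageModule(nn.Module):
      def __init__(self, obs_dim, msg_dim):
          super().__init__()
          self.cell = nn.LSTMCell(obs_dim + msg_dim, msg_dim)

      def forward(self, obs, prev_msg, prev_state):
          x = torch.cat([obs, prev_msg], dim=-1)
          h, c = self.cell(x, prev_state)
          return h, (h, c)   # the hidden state is broadcast as the new message

  msg_dim, obs_dim = 32, 64
  comm = MessageModule(obs_dim, msg_dim)
  obs = torch.randn(1, obs_dim)
  prev_msg = torch.zeros(1, msg_dim)
  state = (torch.zeros(1, msg_dim), torch.zeros(1, msg_dim))
  msg, state = comm(obs, prev_msg, state)   # message passed to the next agent/step
  ```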
DJ-MC: A Reinforcement-Learning Agent for Music Playlist Recommendation. Elad Liebman, Maytal Saar-Tsechansky, Peter Stone. AAMAS 2015. paper
- Formulate the selecting which sequence of songs to play as a Markov Decision Process, and demonstrate the potential effectiveness of a reinforcement-learning based approach in a new practical domain
- DJ-MC is able to generate personalized song sequences within a single listening session of just 25–50 songs.
- Playlists are of length k
- Finite set of n musical tracks
- States - ordered sequence of songs played
- Actions - next song to play
- Reward - utility of the listener's pleasure when hearing song a in state s. Divided into two types of rewards: a reward function over songs and a reward function over transitions.
- Deterministic transition function P
- The agent must internally represent states and actions in such a way that enables generalization of the listener’s preferences
- Use descriptors for song representations. Each song can be factored as a vector of scalar descriptors that reflect details about the spectral fingerprint of the song, its rhythmic characteristics, its overall loudness, and their change over time.
- To represent different listener types, we generate 10 different playlist clusters by using k-means clustering on the playlists
- Used the Million Song Dataset together with Yes.com and Last.fm data for the experimental setup.
- For each experiment, we sample a 1000-song corpus from the Million Song Dataset.
- The architecture consists of two components: learning the listener's parameters and planning the sequence of songs.
- Uses a tree-search heuristic for planning.
- It repeats this process as many times as possible, selecting the randomly generated trajectory which yields the highest expected payoff (see the sketch after this list).
- Compared the agent against two baselines: random choice and a greedy algorithm that always plays the song with the highest song reward.
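  A toy version of this planning loop (the reward model and song descriptors are stand-ins; the real agent plans over learned listener parameters):

  ```python
  # Hypothetical sketch of the DJ-MC planning step: sample random song
  # trajectories under a learned reward model and play the first song of the
  # best-scoring trajectory (a simple Monte Carlo tree-search heuristic).
  import random

  def plan_next_song(candidate_songs, reward_model, horizon=10, n_rollouts=100):
      """reward_model(trajectory) -> expected listener payoff (assumed learned)."""
      best_payoff, best_first = float("-inf"), None
      for _ in range(n_rollouts):
          trajectory = random.sample(candidate_songs, k=min(horizon, len(candidate_songs)))
          payoff = reward_model(trajectory)
          if payoff > best_payoff:
              best_payoff, best_first = payoff, trajectory[0]
      return best_first

  # toy reward model: prefer songs with a higher (made-up) "tempo" descriptor
  songs = [{"id": i, "tempo": random.random()} for i in range(50)]
  next_song = plan_next_song(songs, reward_model=lambda tr: sum(s["tempo"] for s in tr))
  ```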
DRN: A Deep Reinforcement Learning Framework for News Recommendation. Guanjie Zheng, Fuzheng Zhang, Zihan Zheng, Yang Xiang, Nicholas Jing Yuan, Xing Xie, Zhenhui Li. WWW 2018. paper
- In order to better model the dynamic nature of news characteristics and user preference, propose to use Deep Q-Learning (DQN) framework
- Consider user return as another form of user feedback information, by maintaining an activeness score for each user
- Applies a Dueling Bandit Gradient Descent (DBGD) method for exploration by choosing random item candidates in the neighborhood of the current recommender, which avoids using totally unrelated items. Baselines for exploration: epsilon-greedy and upper confidence bound. The agent generates a recommendation list L using the current network Q and another list L~ using an explore network Q~, whose parameters W~ are obtained by adding small noise to W. A probabilistic interleave of L and L~ then generates the merged recommendation list L^.
- News recommendation systems:
  - content-based systems that maintain news term-frequencies and user profiles (based on historical news); the recommender then selects the most similar item
  - collaborative filtering models that utilize past ratings of the current user or similar users
  - a combination of these two
- Challenges: 1) online learning is needed due to the highly dynamic nature of news characteristics and user preferences; 2) click/no-click labels do not capture the full feedback on news; 3) traditional recommender methods tend to recommend similar items and will narrow down the user's choices.
- Multi-layer DQN is used to predict the reward
- Push: In each timestamp (t1, t2, t3, t4, t5, ...), when a user sends a news request to the system, the recommendation agent G will take the feature representation of the current user and news candidates as input, and generate a top-k list of news to recommend L. L is generated by combining the exploitation of current model and exploration of novel items
- Feedback: User u who has received recommended news L will give their feedback B by his clicks on this set of news.
- Minor update: after each timestamp, given the feature representation of the previous user U and news list L and the feedback B, agent G updates the model by comparing the recommendation performance of the exploitation network Q and the exploration network Q~. If Q~ gives a better result, the current network is updated towards Q~.
- Major update: after a certain period of time, agent G uses the feedback B and the user activeness stored in the memory to update the network Q. At each update, agent G samples a batch of records to update the model. Major updates happen after a certain time interval.
- Feature construction:
- News features: one-hot features describing whether a certain property appears in this piece of news
- User features: mainly describe the features (i.e., headline, provider, ranking, entity name, category, and topic category) of the news that the user clicked in the last 1 hour, 6 hours, 24 hours, 1 week, and 1 year respectively
- User-news features: the interaction between the user and one certain piece of news, i.e., the frequency for the entity (also category, topic category and provider) to appear in the history of the user's readings
- Context features: the context when a news request happens, including time, weekday, and the freshness of the news
- State features: user features and context features
- Action features: user-news features and news features
- Divides the Q-function into a value function V(s) and an advantage function A(s, a), where V(s) is determined only by the state features, and A(s, a) is determined by both the state features and the action features (see the sketch below).
- Use survival model to model user return and user activities
- Closed dataset
- No code
- Metrics: Precision@k, NDCG
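  A small sketch of this dueling decomposition with assumed feature sizes (not the DRN code):

  ```python
  # Hypothetical sketch of the dueling decomposition: V(s) depends only on
  # state features, A(s, a) on state and action features, and Q = V + A.
  import torch
  import torch.nn as nn

  class DuelingQ(nn.Module):
      def __init__(self, state_dim, action_dim, hidden=64):
          super().__init__()
          self.value = nn.Sequential(nn.Linear(state_dim, hidden), nn.ReLU(),
                                     nn.Linear(hidden, 1))
          self.advantage = nn.Sequential(nn.Linear(state_dim + action_dim, hidden),
                                         nn.ReLU(), nn.Linear(hidden, 1))

      def forward(self, state_feats, action_feats):
          v = self.value(state_feats)                                      # V(s)
          a = self.advantage(torch.cat([state_feats, action_feats], -1))   # A(s, a)
          return v + a                                                     # Q(s, a)

  q_net = DuelingQ(state_dim=40, action_dim=20)
  q = q_net(torch.randn(8, 40), torch.randn(8, 20))
  ```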
Recommendations with Negative Feedback via Pairwise Deep Reinforcement Learning. Xiangyu Zhao, Liang Zhang, Zhuoye Ding, Long Xia, Jiliang Tang, Dawei Yin. KDD 2018. paper
- Collaborative filtering and content-based models may fail given the dynamic nature of users' preferences, and current RL systems are designed to maximize short-term rewards.
- Propose a framework DEERS to model positive and negative feedback simultaneously.
- State space: Browsing history of user before time t
- Action space: Recommend items to a user at time t.
- Rewards: immediate reward (click, skip, etc.)
- Transition probability - probability of the state transition from s to s′ after the agent takes an action
- Discount factor - factor to measure the present value of future response
- Uses DQN as the base for the DEERS model. Instead of just concatenating clicked/ordered items as the positive state s+, an RNN with a GRU architecture is introduced to capture the user's sequential preferences. Positive and negative inputs are fed separately into the input layer, with separate first hidden layers (see the sketch after this list).
- Add regularization term to maximize the difference between the target and competitor items
- Utilized experience replay and introduced separate evaluation and target networks, which help smooth the learning and avoid divergence of the parameters.
- Closed dataset on JD.com data
- MAP and NDCG@40 are selected as the key evaluation metrics.
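  A rough sketch of the separate positive/negative streams (layer sizes and the way the candidate item enters the Q-head are assumptions):

  ```python
  # Hypothetical sketch of the DEERS idea: clicked/ordered items and skipped
  # items are fed through separate GRUs (and separate first hidden layers)
  # before being combined into a Q-value for a candidate item.
  import torch
  import torch.nn as nn

  class DeersQ(nn.Module):
      def __init__(self, emb_dim=32, hidden=32, item_dim=32):
          super().__init__()
          self.gru_pos = nn.GRU(emb_dim, hidden, batch_first=True)
          self.gru_neg = nn.GRU(emb_dim, hidden, batch_first=True)
          self.fc_pos = nn.Linear(hidden, hidden)   # separate first hidden layers
          self.fc_neg = nn.Linear(hidden, hidden)
          self.out = nn.Linear(2 * hidden + item_dim, 1)

      def forward(self, pos_seq, neg_seq, item_emb):
          _, h_pos = self.gru_pos(pos_seq)
          _, h_neg = self.gru_neg(neg_seq)
          s = torch.cat([torch.relu(self.fc_pos(h_pos[-1])),
                         torch.relu(self.fc_neg(h_neg[-1])),
                         item_emb], dim=-1)
          return self.out(s)                        # Q(s, a) for the candidate item

  q = DeersQ()(torch.randn(4, 10, 32), torch.randn(4, 10, 32), torch.randn(4, 32))
  ```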
Top-K Off-Policy Correction for a REINFORCE Recommender System. Minmin Chen, Alex Beutel, Paul Covington, Sagar Jain, Francois Belletti, Ed H. Chi. WSDM 2019. paper, video
- Leverage on a policy-based algorithm, REINFORCE.
- Live experiment on Youtube data.
- State - user interest, context
- Action - items available for recommendations. Constrain by sampling each item according to the softmax policy.
- Reward - clicks, watch time. As the recommender system presents a page of k items to the user at a time, the expected reward of the set equals the sum of the expected rewards of each item in the set.
- Discount rate - the trade-off between an immediate reward and long-term reward
- Relies on the logged feedback of actions chosen by a historical policy (or a mixture of policies), as it is not possible to perform online updates of the policy.
- Model state transition with a recurrent neural network (tested LSTM, GRU, Chaos Free RNN)
- Feedback is biased by the previous recommender. The policy is not updated frequently, which also causes bias. Feedback can also come from other agents in production.
- Agent is refreshed every 5 days
- Uses a target (on-policy) and a behavior (off-policy) policy, with importance weighting to downweight feedback gathered from non-target policies (see the sketch after this list).
- The epsilon-greedy policy is not acceptable for YouTube as it causes inappropriate recommendations and a bad user experience; Boltzmann exploration is employed instead.
- The target policy is trained with a weighted softmax to maximize long-term reward.
- The behavior policy is trained using state/action pairs from logged behavior/feedback.
- In the online experiment, the reward is aggregated over a time horizon of 4-10 hours.
- Parameters and architecture are primarily shared.
- RNN is used to represent the user state
- Private dataset
- Code reproduction by the Bayes group: [code](https://github.com/awarebayes/RecNN)
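  A sketch of the corrected policy-gradient weight described above, using the top-K multiplier K(1 − π)^(K−1) from the paper and a capped importance ratio (the cap value and the interfaces are assumptions):

  ```python
  # Hypothetical sketch of the top-K off-policy corrected REINFORCE weight:
  # importance ratio pi(a|s)/beta(a|s), capped for variance control, times the
  # top-K multiplier K * (1 - pi(a|s))**(K - 1).
  import torch

  def topk_off_policy_weight(pi_prob, beta_prob, k=16, ratio_cap=10.0):
      """pi_prob: target-policy prob of the logged action; beta_prob: behavior-policy prob."""
      ratio = torch.clamp(pi_prob / beta_prob.clamp_min(1e-8), max=ratio_cap)
      lambda_k = k * (1.0 - pi_prob).pow(k - 1)
      return ratio * lambda_k

  def corrected_reinforce_loss(log_pi, pi_prob, beta_prob, returns, k=16):
      w = topk_off_policy_weight(pi_prob.detach(), beta_prob, k)
      return -(w * returns * log_pi).mean()

  log_pi = torch.log(torch.tensor([0.02, 0.05]))
  loss = corrected_reinforce_loss(log_pi, log_pi.exp(), torch.tensor([0.03, 0.01]),
                                  returns=torch.tensor([1.0, 3.0]))
  ```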
Generative Adversarial User Model for Reinforcement Learning Based Recommendation System. Xinshi Chen, Shuang Li, Hui Li, Shaohua Jiang, Yuan Qi, Le Song. ICML 2019. paper
- Environment: corresponds to a logged online user who can click on one of the k items displayed by the recommendation system in each page view (or interaction).
- State S - ordered sequence of the user's historical clicks.
- Actions - the subset of k items chosen by the recommender from the list to display to the user. Actions represent subsets of size k of the total item space I at time t.
- State transition - the user behavior model, which returns the transition probability given the previous state and the set A of items displayed by the system.
- Reward - the user's utility or satisfaction after making her choice in the state.
- Policy - the probability of displaying a subset of items at the current user state.
- Inspired by GANs, the behavior function and the reward function are estimated in a min-max fashion: given the trajectory of T observed actions and the corresponding clicked items, the behavior and reward functions are learned jointly.
Reinforcement Learning to Optimize Long-term User Engagement in Recommender Systems. Lixin Zou, Long Xia, Zhuoye Ding, Jiaxing Song, Weidong Liu, Dawei Yin. KDD 2019. paper
- Introduces an RL framework, FeedRec, to optimize long-term user engagement. It is based on a Q-Network designed with hierarchical LSTMs and an S-Network that simulates the environment.
- The S-Network assists the Q-Network and avoids the instability of convergence in policy learning. Specifically, in each round of recommendation, aligned with the user feedback, the S-Network generates the user's response, the dwell time, the revisit time, and a flag indicating whether the user will leave the platform.
- The model handles both instant metrics (clicks, likes, CTR) and delayed metrics (dwell time, revisit, etc.).
- Trained on an internal e-commerce dataset with pre-trained embeddings, learned by modeling users' clicking streams with skip-gram.
- State - the user's browsing history
- Action - the finite space of items
- Transition - probability of reaching the next state after taking action a_t at state s_t
- Reward - a weighted sum of different metrics; the paper gives several instantiations of the reward function, covering both instant and delayed metrics.
- The primary user behaviors, such as clicks, skips, and purchases, are tracked separately with different LSTM pipelines, where each type of user behavior is captured by its corresponding LSTM layer, to avoid dominance by intensive behaviors and to capture behavior-specific characteristics.
- Environment reconstruction with hidden confounders for reinforcement learning based recommendation. Wenjie Shang, Yang Yu, Qingyang Li, Zhiwei Qin, Yiping Meng, Jieping Ye. KDD 2019. paper
- Exact-K Recommendation via Maximal Clique Optimization. Yu Gong, Yu Zhu, Lu Duan, Qingwen Liu, Ziyu Guan, Fei Sun, Wenwu Ou, Kenny Q. Zhu. KDD 2019. paper
- Large-scale Interactive Recommendation with Tree-structured Policy Gradient. Haokun Chen, Xinyi Dai, Han Cai, Weinan Zhang, Xuejian Wang, Ruiming Tang, Yuzhou Zhang, Yong Yu. AAAI 2019. paper
Virtual-Taobao: Virtualizing real-world online retail environment for reinforcement learning. Jing-Cheng Shi, Yang Yu, Qing Da, Shi-Yong Chen, An-Xiang Zeng. AAAI 2019. paper
- Instead of training reinforcement learning in Taobao directly, the approach first builds Virtual Taobao, a simulator learned from historical customer behavior data through the proposed GAN-SD (GAN for Simulating Distributions) and MAIL (Multi-agent Adversarial Imitation Learning), and then trains policies in Virtual Taobao with no physical cost, where an ANC (Action Norm Constraint) strategy is proposed to reduce over-fitting.
- Compared with the traditional supervised learning approach, the strategy trained in Virtual Taobao achieves more than 2% improvement in revenue in the real environment.
- Agent - the search engine
- Environment - the customers
- The commodity search and shopping process can be seen as a sequential decision process for both the engine and the customers; the engine and the customers are each other's environments.
Hierarchical Reinforcement Learning for Course Recommendation in MOOCs. Jing Zhang, Bowen Hao, Bo Chen, Cuiping Li, Hong Chen, Jimeng Sun. AAAI 2019. paper
- Although existing attention-based models improve recommendation performance, unsolved challenges remain. Firstly, when a user has enrolled in diverse courses, the effect of the courses that truly reflect the user's interest in the target course is diluted by the many irrelevant courses.
- We formalize the revising of a user profile to be a hierarchical sequential decision process.
- We propose a hierarchical reinforcement learning algorithm to revise the user profiles, which enables the model to remove the noise courses without explicit annotations.
- Adopts an attention mechanism to estimate an attention coefficient a^u_{it} for each historical course e^u_t when recommending course c_i.
- The attention coefficient is parameterized as a function with an embedding vector as input.
- Decompose the task into low-level and high-level tasks
- High-level action: a binary value representing whether to revise the whole profile of a user or not. Low-level action: a binary value representing whether to remove the historical course e^u_t or not.
- Reward: The reward is a signal to indicate whether the performed actions are reasonable or not.
- First, pre-train the basic recommendation model on the original dataset; then fix the parameters of the basic recommendation model and pre-train the profile reviser to automatically revise the user profiles; finally, jointly train the models together.
Deep Reinforcement Learning for List-wise Recommendations. Xiangyu Zhao, Liang Zhang, Long Xia, Zhuoye Ding, Dawei Yin, Jiliang Tang. arXiv 2017. paper
- Utilizes the Actor-Critic framework to cope with the increasing number of items for recommendation.
- Proposes an online environment simulator to pre-train parameters offline and to evaluate the model before deployment.
- The recommender agent interacts with the environment (users), choosing items over a sequence of steps.
- State - browsing history of the user, i.e., the previous N items that a user browsed, stored in a memory M. A better way is to consider only the N previously clicked/ordered items, since the positive items carry the key information about the user's preference.
- Action - a list of items recommended to a user at time t based on the current state.
- Reward - clicks, orders, etc.
- Transition probability - probability of the state transition from s to s′.
- At each step, K items are observed in temporal order.
- Utilizes the DDPG algorithm with experience replay to train the parameters of the framework.
- MAP and NDCG to evaluate performance.
- No code.
- Private dataset.
Deep Reinforcement Learning based Recommendation with Explicit User-Item Interactions Modeling. Feng Liu, Ruiming Tang, Xutao Li, Weinan Zhang, Yunming Ye, Haokun Chen, Huifeng Guo, Yuzhou Zhang. arXiv 2019. paper
- The DRR framework treats recommendation as a sequential decision-making procedure and adopts an Actor-Critic reinforcement learning scheme to model the interactions between the user and the recommended items, considering both immediate and long-term reward.
- State - the user's positive interaction history with the recommender, as well as her demographics.
- Actions - a continuous parameter vector a. Each item has a ranking score, defined as the inner product of the action and the item embedding.
- Transitions - once the user's feedback is collected, the transition is determined.
- Reward - clicks, ratings, etc.
- Discount rate - the trade-off between immediate and long-term reward.
- Developed three state-representation models (DRR-ave is sketched after this list):
  - DRR-p: utilizes the product operator to capture the pairwise local dependencies between items, computing pairwise interactions via element-wise products.
  - DRR-u: in addition to the local dependencies between items, the pairwise user-item interactions are also taken into account.
  - DRR-ave: average pooling over DRR-p; the resulting vector models the interactions with the input user and is finally concatenated with the user embedding. In offline evaluation, DRR-ave outperformed DRR-p and DRR-u.
- Epsilon-greedy exploration in the Actor network.
- Evaluated on offline datasets (MovieLens, Yahoo! Music, Jester).
- Online evaluation with an environment simulator: a PMF (probabilistic matrix factorization) model is pre-trained as the simulator.
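  A sketch of the DRR-ave state module and the inner-product ranking rule (the exact pooling weights and concatenation order are assumptions):

  ```python
  # Hypothetical sketch: average-pool the item-history embeddings, interact the
  # result with the user embedding, concatenate everything into the state, and
  # score each candidate item by the inner product of the action vector and the
  # item embedding.
  import torch
  import torch.nn as nn

  class DRRAveState(nn.Module):
      def __init__(self, emb_dim):
          super().__init__()
          self.weights = nn.Linear(emb_dim, emb_dim, bias=False)  # weighted average pooling

      def forward(self, user_emb, item_hist_emb):
          """user_emb: (B, d); item_hist_emb: (B, n, d) positive-interaction history."""
          ave = self.weights(item_hist_emb.mean(dim=1))
          interaction = user_emb * ave                      # element-wise user-item interaction
          return torch.cat([user_emb, interaction, ave], dim=-1)

  def ranking_scores(action_vec, candidate_item_embs):
      """action_vec: (B, d) produced by the actor; candidates: (n_items, d)."""
      return action_vec @ candidate_item_embs.t()           # inner-product scores

  state = DRRAveState(emb_dim=32)(torch.randn(4, 32), torch.randn(4, 10, 32))
  scores = ranking_scores(torch.randn(4, 32), torch.randn(100, 32))
  ```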
Reinforcement Learning for Slate-based Recommender Systems: A Tractable Decomposition and Practical Methodology. Eugene Ie, Vihan Jain, Jing Wang, Sanmit Narvekar, Ritesh Agarwal, Rui Wu, Heng-Tze Cheng, Morgane Lustman, Vince Gatto, Paul Covington, Jim McFadden, Tushar Chandra, Craig Boutilier. arXiv 2019. paper video
- Developed SlateQ, a decomposition of value-based temporal-difference and Q-learning that renders RL tractable with slates.
- The LTV of a slate can be decomposed into a tractable function of the item-wise LTVs of its component items.
- Introduced a recommender simulation environment, RecSim that allows the straightforward configuration of an item collection (or vocabulary), a user (latent) state model and a user choice model.
- Session optimization
- State - static user features such as demographics, declared interests, user context, and a summarization of the previous history
- Actions - the set of all possible slates.
- Transition probability
- Reward - measurement of user engagement.
- In the MDP model, each user should be viewed as a separate environment or separate MDP. Hence it is critical to allow generalization across users, since few if any users generate enough experience to allow reasonable recommendations.
- Combinatorial optimization problem - find the slate with the maximum Q-value. SlateQ decomposes the slate Q-value into a combination of the item-wise Q-values of its constituent items (see the sketch after this list). Tried top-k, greedy, and LP-based methods.
- Items and user interests as a topic modeling problem
- As a choice model, use an exponential cascade model that accounts for document position in the slate. This choice model assumes "attention" is given to one document at a time, with exponentially decreasing attention given to documents as a user moves down the slate.
- The user choice model determines which document (if any) from the slate is consumed by the user. The assumption is that a user can observe any recommended document's topic before selection, but cannot observe its quality before consumption.
- The user's interest in document d defines the relative document appeal to the user and serves as the basis of the choice function.
- Models are trained periodically and pushed to the server. The ranker uses the latest model to recommend items and logs user feedback, which is used to train new models. With LTV labels, this iterative training-and-pushing loop can be viewed as a form of generalized policy iteration.
- A candidate generator retrieves a small subset (hundreds) of items from a large corpus that best match the user context. The ranker scores/ranks the candidates using a DNN with both user context and item features as input, optimizing a combination of several objectives (e.g., clicks, expected engagement, several other factors).
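  A toy illustration of the decomposition and a greedy slate builder (the softmax choice model here is a simplification; SlateQ's choice model also includes a no-click option):

  ```python
  # Hypothetical sketch of the SlateQ decomposition: the slate value is the sum
  # over items of the user-choice probability of that item times its item-wise
  # long-term value, Q(s, A) = sum_i P(i | s, A) * Qbar(s, i).
  import numpy as np

  def slate_q_value(item_scores, item_ltv):
      """item_scores: unnormalized appeal v(s, i) of each item in the slate;
      item_ltv: item-wise LTV estimates Qbar(s, i). Both shape (slate_size,)."""
      choice_probs = np.exp(item_scores) / np.exp(item_scores).sum()  # simple softmax choice model
      return float(choice_probs @ item_ltv)

  def greedy_slate(scores, ltv, slate_size):
      """Greedy slate construction: repeatedly add the item that most increases Q(s, A)."""
      chosen, remaining = [], list(range(len(scores)))
      for _ in range(slate_size):
          best = max(remaining,
                     key=lambda i: slate_q_value(scores[chosen + [i]], ltv[chosen + [i]]))
          chosen.append(best)
          remaining.remove(best)
      return chosen

  scores, ltv = np.random.randn(20), np.random.rand(20)
  print(greedy_slate(scores, ltv, slate_size=5))
  ```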
RECSIM: A Configurable Simulation Platform for Recommender Systems. Eugene Ie, Chih-wei Hsu, Martin Mladenov, Vihan Jain, Sanmit Narvekar, Jing Wang, Rui Wu, Craig Boutilier. 2019. paper
- RECSIM is a configurable platform that allows the natural, albeit abstract, specification of an environment in which a recommender interacts with a corpus of documents (or recommendable items) and a set of users, to support the development of recommendation algorithms.
- The user model samples users from a prior distribution over (configurable) user features: these may include latent features such as personality, satisfaction, interests; observable features such as demographics; and behavioral features such as session length, visit frequency, or time budget
- The document model samples items from a prior over document features, which again may incorporate latent features such as document quality, and observable features such as topic, document length and global statistics (e.g., ratings, popularity)
- User response is determined by the user choice model. Once the document is consumed, the user state undergoes a transition through a configurable (user) transition model.
- The user model assumes that users have various degrees of interest in topics (drawn from some prior distribution), ranging from -1 to 1.
Contextual Meta-Bandit for Recommender Systems Selection. Marlesson R. O. Santana, Luckeciano C. Melo, Fernando H. F. Camargo, Bruno Brandão, Anderson Soares, Renan M. Oliveira, Sandor Caetano. RecSys 2020. paper
- In this work, we propose an approach of model selection, interpreted as a multi-armed bandit problem. Our goal is to create a meta-bandit to choose which model to run depending on the current context.
- Shifting interest can be context-dependent.
- For each interaction, the environment provides an observation. The meta-bandit uses it to select one of the recommenders and lets the selected one decide the action. The environment receives this action and calculates the reward (see the sketch below).
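  A minimal sketch of the meta-bandit loop; an epsilon-greedy arm selector stands in for the paper's contextual policy, and all interfaces are assumed:

  ```python
  # Hypothetical sketch: pick one of several recommenders per interaction,
  # let it act on the observation, and update that arm with the observed reward.
  import random

  class MetaBandit:
      def __init__(self, recommenders, epsilon=0.1):
          self.recommenders = recommenders
          self.epsilon = epsilon
          self.counts = [0] * len(recommenders)
          self.values = [0.0] * len(recommenders)

      def select(self):
          if random.random() < self.epsilon:
              return random.randrange(len(self.recommenders))
          return max(range(len(self.recommenders)), key=lambda i: self.values[i])

      def update(self, arm, reward):
          self.counts[arm] += 1
          self.values[arm] += (reward - self.values[arm]) / self.counts[arm]

      def step(self, observation, environment):
          arm = self.select()
          action = self.recommenders[arm](observation)   # the chosen model decides the action
          reward = environment(action)
          self.update(arm, reward)
          return reward
  ```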