Netflix has developed a new machine learning algorithm, based on reinforcement learning, to create an optimal list of recommendations while accounting for the user's finite time budget. Recommendation use cases, and decision problems in general, often ignore the fact that the user has only a finite amount of time to make a decision; Netflix added this dimension to its recommendation system alongside the relevance of the recommendations.
The evaluation problem can be measured in time because the user takes time to choose what to watch: watching trailers and previews, and so on, and different shows require different amounts of evaluation time. This time budget can be factored into the recommendation system, so the recommendation model should construct the recommendation list considering both the relevance of the items and the evaluation cost they impose on the user.
The user's time budget, like the user's preferences, may not be directly observable, but the goal of the recommendation algorithm is to construct a recommendation list that has a higher chance of engaging the user. The recommender system therefore needs to learn the user's time budget in addition to the user's latent preferences.
A typical recommendation system works with a bandit-style approach to list construction, and it constructs a slate of K items as follows:
Image source: A bandit-style recommender system for slate construction
The item scorer scores all N available items and may use the slate constructed so far as additional context. The scores are then passed to a sampler that selects one item from the candidates. The scorer and the sampling step, the main components of the recommender system, are how an element is inserted at slot k of the slate.
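To make the loop concrete, below is a minimal Python sketch of this bandit-style slate construction. The `item_scorer` function, the softmax sampling step, and all names are illustrative assumptions rather than Netflix's actual implementation.

```python
import numpy as np

def build_slate(item_scorer, items, user_context, slate_size):
    """Greedily build a slate of `slate_size` items.

    `item_scorer` is a hypothetical scoring function taking the user context,
    a candidate item, and the partial slate built so far, returning a score.
    """
    slate = []
    candidates = list(items)
    for _ in range(slate_size):
        scores = np.array([item_scorer(user_context, it, slate) for it in candidates])
        # Turn scores into a sampling distribution (softmax) and sample one item.
        probs = np.exp(scores - scores.max())
        probs /= probs.sum()
        chosen_idx = np.random.choice(len(candidates), p=probs)
        slate.append(candidates.pop(chosen_idx))
    return slate
```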
In a simplified setting, the recommendation system presents a one-dimensional list of K items to the user, and the user has a time budget modeled as a positive real number. The problem can then be modeled with two quantities per item: relevance and cost. When the user evaluates an item on the list, its cost (time) is consumed, and once the budget runs out the user cannot evaluate any further items in the slate. Each item has a probability between 0 and 1 of being consumed, so the probability of choosing an item depends not only on its relevance but also on how many items the user is able to examine.
A recommendation system trying to maximize the user's engagement with the slate needs to pack as many relevant items as possible within the user's budget, making a trade-off between relevance and cost.
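The following sketch illustrates one way such a budget-constrained user could be simulated. The sequential scan, the per-item relevance probability, and the abandonment rule are assumptions made for illustration, not the exact user model from the article.

```python
import numpy as np

def simulate_user(slate, relevance, cost, budget):
    """Simulate a user scanning a slate with a finite time budget.

    `relevance[i]` is the probability the user plays item i if examined;
    `cost[i]` is the time needed to evaluate item i; `budget` is the user's
    total time budget. Returns (played_item_or_None, items_examined).
    """
    examined = 0
    remaining = budget
    for item in slate:
        if remaining < cost[item]:
            break                      # budget exhausted, user abandons the slate
        remaining -= cost[item]
        examined += 1
        if np.random.rand() < relevance[item]:
            return item, examined      # user chooses to play this item
    return None, examined              # no item was played
```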
The problem is connected to the 0/1 knapsack problem in theoretical computer science: the goal is to find the subset of items with the highest total utility such that the total cost of the subset does not exceed the budget. The 0/1 knapsack problem is NP-complete, and many approximate solutions exist. Netflix proposes to model budget-constrained recommendation as a Markov decision process (MDP) and to use reinforcement learning to find a solution. In an MDP, the key concepts are the current state and the action taken by the agent.
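For reference, a standard dynamic-programming solution to the 0/1 knapsack problem (assuming integer costs and budget) looks like the sketch below. It ignores item ordering and the stochastic choice behavior, so it is only meant to show the combinatorial structure the recommendation problem inherits.

```python
def knapsack(utilities, costs, budget):
    """Classic 0/1 knapsack via dynamic programming.

    Returns the maximum total utility achievable without exceeding `budget`.
    """
    best = [0] * (budget + 1)
    for u, c in zip(utilities, costs):
        # Iterate budgets downward so each item is used at most once.
        for b in range(budget, c - 1, -1):
            best[b] = max(best[b], best[b - c] + u)
    return best[budget]
```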
In this problem, the recommender system is the agent, and the interaction between the user and the recommender system is modeled as the environment. The recommender system (agent) constructs the list of recommendations by repeatedly selecting an appropriate item for each of the K slots. The environment (the interaction between the user and the recommendation system) is characterized by the time budget and by the items examined in the list at each step. In the real world, the user's time budget is unknown and can be estimated from the user's historical data (e.g., how far the user scrolled before abandoning, as recorded in the logs).
The reinforcement learning algorithm used for this problem is based on estimating the return, specifically using temporal-difference (TD) learning to estimate the value function.
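As a reminder of what temporal-difference learning does, here is a minimal tabular TD(0) update; the dictionary-based value table and the alpha/gamma values are illustrative, not the parameterization used by Netflix.

```python
def td0_update(V, state, reward, next_state, alpha=0.1, gamma=1.0):
    """One TD(0) update of a tabular state-value estimate.

    V maps states to value estimates; the transition is
    state -> next_state with observed reward.
    """
    td_target = reward + gamma * V.get(next_state, 0.0)
    td_error = td_target - V.get(state, 0.0)
    V[state] = V.get(state, 0.0) + alpha * td_error
    return V
```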
Simulation is very helpful for studying and better understanding the problem. With simulation, various recommendation algorithms can be trained and compared. To evaluate performance, a simple metric is used: the average number of successes of the generated slates, referred to as play rate. In addition to play rate, it is important to consider the effective slate size: one way to improve the play rate is to construct larger effective slates, and this metric helps in understanding the mechanism of the recommendation algorithms.
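Assuming per-slate outcomes of the form (played_item_or_None, items_examined), such as those produced by the simulation sketch above, the two metrics could be computed with hypothetical helpers like these:

```python
def play_rate(outcomes):
    """Fraction of generated slates that resulted in a play (success)."""
    return sum(1 for played, _ in outcomes if played is not None) / len(outcomes)

def effective_slate_size(outcomes):
    """Average number of items the user was able to examine per slate."""
    return sum(examined for _, examined in outcomes) / len(outcomes)
```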
Thanks to the flexibility of simulation and its configurable parameters, the Netflix team learned how to construct optimal slates in an on-policy manner using the SARSA algorithm. Comparing the RL model with a contextual bandit, the reinforcement learning approach performs much better; in particular, there is a statistically significant increase in play rate for small- to medium-budget users.
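For context, the on-policy SARSA update has the following tabular form: the target bootstraps from the action the current policy actually takes in the next state. This is a generic textbook sketch with illustrative parameters, not Netflix's implementation.

```python
def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=1.0):
    """One on-policy SARSA update: the target uses the action a_next
    actually chosen by the current policy in the next state."""
    target = r + gamma * Q.get((s_next, a_next), 0.0)
    Q[(s, a)] = Q.get((s, a), 0.0) + alpha * (target - Q.get((s, a), 0.0))
    return Q
```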
On-policy learning is easy to simulate but difficult to execute in realistic recommender settings, because the data are generated by a different policy (the behavior policy) and the goal is to learn a new, better policy from this data. In this case, Q-learning is the technique that allows learning the optimal value function in an off-policy setting.
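For comparison, the corresponding off-policy Q-learning update bootstraps from the greedy action in the next state, which is what allows learning from data generated by a different behavior policy. Again, this is a generic tabular sketch with illustrative parameters.

```python
def q_learning_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=1.0):
    """One off-policy Q-learning update: the target uses the maximum value
    over next actions, regardless of what the behavior policy did."""
    best_next = max((Q.get((s_next, a2), 0.0) for a2 in actions), default=0.0)
    Q[(s, a)] = Q.get((s, a), 0.0) + alpha * (r + gamma * best_next - Q.get((s, a), 0.0))
    return Q
```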
Comparing Q-learning with the on-policy SARSA setup, the result is that Q-learning seems to generate very large effective slate sizes without much difference in play rate. This result is interesting but still not fully understood, and it needs more investigation.