Practical Diversified Recommendations on YouTube with Determinantal Point Processes

Text from PDF Page: 005

then:

    L_Y = \begin{pmatrix} L_{11} & L_{12} \\ L_{21} & L_{22} \end{pmatrix}.    (8)

The determinant of this submatrix is det(L_Y) = L_{11} L_{22} − L_{12} L_{21}. So, it is a product of item qualities minus scaled item similarities. The expression for the determinant is more complex for larger submatrices, but a similar intuition holds there as well. In the following sections, we discuss various ways in which L can be constructed from the readily available system inputs, such as the pointwise item quality scores, q, that Section 3.2 described.

4.2 Kernel Parameterization

In our current deployment, as shown in Figure 2, diversification happens fairly late in the pipeline, so a typical input set size is N = 100. For each of these N videos, we have two main input features: a personalized quality score q, and a sparse embedding φ derived from the topical content of the video. These features are generated by completely independent subsystems. By stacking our diversification system on top of them, we take advantage of the continuous improvements of those subsystems.

For the initial introduction of DPPs into our system, we first used a relatively simple parameterization of the N × N DPP kernel matrix L:

    L_{ii} = q_i^2,    (9)
    L_{ij} = \alpha \, q_i q_j \exp\!\left(-\frac{D_{ij}}{2\sigma^2}\right), for i ≠ j.    (10)

Each D_{ij} is computed from φ_i and φ_j; Section 5 describes the exact embedding φ and distance function that we found worked best in practice. The α and σ are free variables. Note that this formulation is identical to a standard (Gaussian) radial basis function (RBF) kernel when α = 1. For smaller values, α ∈ [0, 1), the off-diagonal of the matrix is scaled down, which essentially corresponds to considering all items to be more diverse. For larger values, α > 1, the off-diagonal of the matrix is scaled up, with the opposite effect of making all items more similar. As α grows, the probability of small sets grows while the probability of large sets shrinks. Thus, a large α is a good fit for user behavior in the setting where a relatively small subset of the videos in a feed are interacted with (|Y| is small). It is valuable for us to be able to use a large α because, as we will see in Section 4.3, it provides a better fit to real user data.

However, there is one technical issue with allowing α > 1. Recall from Equation 6 that to have a proper DPP, the kernel matrix L must be positive semi-definite (PSD). The PSD condition ensures that the determinants of all submatrices of L are non-negative. This is important because the probability of a set Y is proportional to det(L_Y), and negative "probabilities" do not make sense. Allowing α > 1, though, has the potential to make L non-PSD. In practice, we resolve this issue by simply projecting any non-PSD matrix that a large α value produces back to the space of PSD matrices. (The projection is simple: we compute L's eigendecomposition and replace any negative eigenvalues with zeros.)
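As a minimal illustration (not the production implementation), the kernel of Equations (9)-(10) and the eigenvalue-clipping projection can be sketched in NumPy as follows. The function names, and the assumption that the pairwise distances D_ij arrive as a precomputed N × N matrix, are illustrative choices.

    import numpy as np

    def build_kernel(q, D, alpha, sigma):
        # q: length-N array of pointwise quality scores (Section 3.2).
        # D: N x N matrix of pairwise embedding distances D_ij (Section 5).
        # Off-diagonal: L_ij = alpha * q_i * q_j * exp(-D_ij / (2 * sigma^2))  (Equation 10)
        L = alpha * np.outer(q, q) * np.exp(-D / (2.0 * sigma ** 2))
        # Diagonal: L_ii = q_i^2  (Equation 9)
        np.fill_diagonal(L, q ** 2)
        return L

    def project_to_psd(L):
        # Eigendecompose the symmetric kernel and zero out any negative
        # eigenvalues, projecting L back to the space of PSD matrices.
        eigvals, eigvecs = np.linalg.eigh(L)
        return (eigvecs * np.clip(eigvals, 0.0, None)) @ eigvecs.T

The projection step only matters in the α > 1 regime that the text flags as potentially producing non-PSD kernels.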
4.3 Training Approach

Our training set consists of approximately forty thousand examples sampled from one day of data collected from the YouTube mobile homepage feed. Each training example is a single homepage feed impression: a single instance of a user going to the YouTube mobile homepage and being presented with an ordered list of recommended videos. For each such impression we have a record of which videos the user enjoyed, which constitutes the set Y. We note that there is a partial-label bias associated with training the model from such data, since we only observe users' interactions with the videos that we chose to present them with in the past, rather than interactions with videos chosen uniformly at random. Generally, we use the same style of solutions to this issue as have been used in the past when training pointwise models, such as an ε-greedy exploration strategy.

For the basic kernel described in the previous section there are only two parameters, α and σ, so we can simply do a grid search to find the values that maximize the cumulative gain (Equation 2). Figure 3 shows the cumulative gain obtained for various choices of α and σ; the darker the color, the worse the result. Interestingly, one can observe the catastrophic cliff in the upper right quadrant, and the subsequent plateau. This has to do with the DPP kernels for training examples becoming increasingly non-PSD. Recall that as α grows, the off-diagonal of L grows, increasing the chance of a non-PSD L. Since the off-diagonal also grows somewhat with σ, large α, σ combinations result in non-PSD matrices for many of the training examples. Intuitively, then, it might seem like the entire upper right corner of the plot should have low cumulative gain values, rather than the low values being concentrated in the observed band. However, also recall that we project any non-PSD matrices back to the PSD space. This projection is not linear with respect to α or σ, so the quality of the matrices after projection cannot be expected to correlate with our intuition about those parameters. Overall, we find that the highest cumulative gain is achieved in the mid-range for σ and in the upper half of the range for α. The L kernels produced by these parameters are mostly PSD, so only an occasional training example's kernel requires projection.

4.4 Deep Gramian Kernels

As discussed earlier, one of the main advantages of using DPPs over heuristics is that DPPs allow us to build a system that scales gracefully in complexity over time. We argued that the complexity of a heuristic scales poorly, because tuning it requires a grid search over its parameters, so the runtime for training a heuristic is exponential in the number of parameters. In this section, we discuss how, with DPPs, we can move beyond grid searches to efficiently train a model with many parameters.

There is a substantial body of work on learning DPP kernel matrices that are parameterized in a variety of ways [3, 12, 14, 20, 21, 26, 27]. Such work usually seeks to maximize the log-likelihood of the training data. More concretely, suppose that:

• the parameters of L are some length-r vector w, and
• we have M training examples, each consisting of: 1) a set of N items, and 2) the subset Y of these items that a user interacted with.
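The setup continues beyond this excerpt, but the log-likelihood that such work maximizes is the standard DPP one, log P(Y) = log det(L_Y) − log det(L + I), summed over the M training examples. The sketch below evaluates that objective for a kernel-building function parameterized by w; kernel_fn, the structure of examples, and the function names are assumptions for illustration, not the paper's code, and in practice a gradient-based optimizer (rather than a grid search) would be used to fit w.

    import numpy as np

    def dpp_log_likelihood(L, Y):
        # log P(Y) = log det(L_Y) - log det(L + I), the standard DPP likelihood.
        L_Y = L[np.ix_(Y, Y)]                    # principal submatrix for the interacted items Y
        _, logdet_Y = np.linalg.slogdet(L_Y)
        _, logdet_all = np.linalg.slogdet(L + np.eye(L.shape[0]))
        return logdet_Y - logdet_all

    def training_objective(kernel_fn, w, examples):
        # kernel_fn: maps (parameter vector w, per-example features) to an N x N kernel L.
        # examples:  M pairs (features, Y), where Y lists the indices of the videos
        #            the user interacted with in that impression.
        return sum(dpp_log_likelihood(kernel_fn(w, feats), Y) for feats, Y in examples)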
