Practical Diversified Recommendations on YouTube with Determinantal Point Processes

Figure 3: Cumulative gain for a grid of α and σ values on a dataset from the YouTube mobile homepage feed.

Let L(w) be the N × N kernel matrix induced by the parameters w. Then the log-likelihood of the training data is:

\mathrm{LogLike}(w) = \sum_{j=1}^{M} \log\big(P_{L(w)}(Y_j)\big)   (11)

= \sum_{j=1}^{M} \Big[ \log\big(\det(L(w)_{Y_j})\big) - \log\big(\det(L(w) + I)\big) \Big],   (12)

where Y_j is the subset of items from training example j that the user interacted with. The ability to use log-likelihood as an objective function allows us to learn DPP parameters with more sophisticated (and more efficient) methods than grid search.

We have begun to explore learning a kernel with many more parameters than the α and σ of the previous section, by using gradient descent on LogLike. We still use as input the φ embeddings that characterize video content. For the personalized video quality scores, though, rather than a scalar score q_i, we are able to obtain from existing infrastructure an entire vector of quality scores q_i, so we use this vector to make our model more general. (Each entry of the vector q_i captures some aspect of what might make a video a good choice for a user.) The full kernel L(φ, q) that we learn from this input data can be expressed in the following manner:

L_{ij} = f(q_i)\, g(\phi_i)^{\top} g(\phi_j)\, f(q_j) + \delta \mathbb{1}_{i=j},   (13)

where f and g are separate stacks in a neural network. (δ is simply a regularization parameter that we have for now fixed at a small value.) Note that the quantity f(q_i) is a scalar, while g(φ_i) is a vector. The neural network for computing f is relatively shallow, while g's network is deeper, and effectively re-embeds φ in a space which better describes the utility correlation of videos (see Figure 4).

Figure 4: Architecture example for a deep DPP kernel.

We also note that, unlike the basic kernel parameterization discussed earlier, where large values of α could result in a non-PSD L, this more complex parameterization is guaranteed to always produce PSD matrices without need for projection. This follows from the fact that this construction makes L a Gramian matrix (plus a small nonnegative diagonal), and all such matrices are PSD.

To learn all of the parameters of the neural networks for computing f and g, we optimize LogLike from Equation 11 using TensorFlow [1]. The resulting deep DPP models have already shown utility improvements in live experiments (see the Deep DPPs entry in Table 1). However, these deeper models change the ranking substantially enough from the un-diversified baseline that secondary business metrics begin to be significantly impacted, requiring additional tuning.
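As a concrete illustration, the following is a minimal TensorFlow sketch of the kernel construction in Equation 13 and the per-example log-likelihood terms of Equations 11 and 12. It is not the authors' implementation; the layer widths and depths, the variable names, and the value of δ are assumptions chosen for readability.

```python
# Illustrative sketch (not the paper's code) of the deep DPP kernel of Eq. 13
# and the per-example log-likelihood terms of Eqs. 11-12. Layer sizes and the
# delta value below are assumptions for illustration.
import tensorflow as tf

# f: shallow network mapping a quality-score vector q_i to a scalar.
f_net = tf.keras.Sequential([
    tf.keras.layers.Dense(16, activation="relu"),
    tf.keras.layers.Dense(1),
])

# g: deeper network re-embedding the content embedding phi_i.
g_net = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(32),
])

DELTA = 1e-3  # small fixed regularizer on the diagonal (assumed value)


def build_kernel(q, phi):
    """L_ij = f(q_i) g(phi_i)^T g(phi_j) f(q_j) + delta * 1_{i=j}  (Eq. 13)."""
    fq = f_net(q)       # shape (N, 1): one scalar per video
    gphi = g_net(phi)   # shape (N, d): re-embedded content vectors
    b = fq * gphi       # row i is f(q_i) * g(phi_i), so L = B B^T + delta*I
    n = tf.shape(q)[0]
    return tf.matmul(b, b, transpose_b=True) + DELTA * tf.eye(n)


def log_like_term(L, interacted_idx):
    """One training example's contribution to Eqs. 11-12:
    log det(L_Y) - log det(L + I), where Y indexes the interacted items."""
    L_Y = tf.gather(tf.gather(L, interacted_idx, axis=0),
                    interacted_idx, axis=1)
    n = tf.shape(L)[0]
    return tf.linalg.logdet(L_Y) - tf.linalg.logdet(L + tf.eye(n))

# Training would minimize the negative of Eq. 11 summed over examples,
# e.g. with a standard optimizer inside a tf.GradientTape loop.
```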
4.5 Efficient Ranking Algorithm with DPP

In this section, we describe how we use at serving time the DPP parameters that were learned as described in Section 4.3 or Section 4.4. That is, when a user goes to the YouTube mobile homepage, how does the DPP decide which videos go at the top of their recommendations feed? For any given user, the underlying parts of the YouTube system infrastructure send to the DPP layer of the system the personalized quality scores q and video embedding vectors φ for a set of N videos. We construct a DPP kernel L from these scores and embeddings and the learned parameters, as described in the previous section. We then fix some window size k ≪ N and ask the DPP for a high-probability set of k videos. We put these videos at the top of the feed, then again ask the DPP for a high-probability set of k videos from the remaining N − k unused videos. These videos become the next k in the feed. We repeat this process until we have ordered the entire feed of N videos.

The idea behind constructing sub-windows of the data with stride size k is that the repulsion between two similar items diminishes as the distance between them in the feed increases. That is, having video 1 and video 100 be similar is not as detrimental to user enjoyment as having video 1 and video 2 be similar. In practice, for ordering a feed where N consists of several hundred videos, we use sub-windows where k is a dozen or so videos.

When we "ask the DPP for a high-probability set of k videos", what we are actually doing is asking for the size-k set Y that has the highest probability that the user interacts with every one of those k items.¹ This corresponds to the following maximization:

\max_{Y : |Y| = k} \det(L_Y).

¹One could consider alternative quantities, such as the probability that a user interacts with at least one item in a given subset. We plan to consider such alternative formulations in future work.
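Exactly maximizing det(L_Y) over size-k subsets is NP-hard, so as an illustration of the windowed scheme above, here is a minimal NumPy sketch that substitutes the standard greedy determinant-maximization heuristic for each "ask the DPP for a high-probability set" step. The function names and the greedy choice are assumptions of this sketch, not a description of the production system.

```python
# Illustrative sketch (not the production system) of the windowed ranking
# scheme: repeatedly take a high-probability size-k subset of the remaining
# videos and append it to the feed. Subset selection uses the standard greedy
# approximation to DPP MAP inference, an assumption of this sketch.
import numpy as np


def greedy_map_subset(L, candidates, k):
    """Greedily pick up to k indices from `candidates`, at each step adding
    the item that yields the largest det(L_Y) for the selected set Y."""
    selected = []
    remaining = list(candidates)
    for _ in range(min(k, len(remaining))):
        best, best_det = None, -np.inf
        for i in remaining:
            trial = selected + [i]
            # Determinant of the principal submatrix indexed by the trial set.
            d = np.linalg.det(L[np.ix_(trial, trial)])
            if d > best_det:
                best, best_det = i, d
        selected.append(best)
        remaining.remove(best)
    return selected


def rank_feed(L, k):
    """Order all N videos by repeatedly requesting a size-k high-probability
    window from the videos not yet placed in the feed (stride size k)."""
    n = L.shape[0]
    unused = list(range(n))
    feed = []
    while unused:
        window = greedy_map_subset(L, unused, k)
        feed.extend(window)
        unused = [i for i in unused if i not in window]
    return feed
```

With N in the several hundreds and k around a dozen, as in the text, this naive loop is small enough to run at serving time, though a real system would cache and incrementally update the determinant computations rather than recompute them from scratch.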
