Deep Neural Networks for YouTube Recommendations


of continuous features was critical for convergence. A continuous feature x with distribution f is transformed to x̃ by scaling the values such that the feature is equally distributed in [0, 1) using the cumulative distribution, x̃ = ∫_−∞^x df. This integral is approximated with linear interpolation on the quantiles of the feature values, computed in a single pass over the data before training begins.

In addition to the raw normalized feature x̃, we also input powers x̃² and √x̃, giving the network more expressive power by allowing it to easily form super- and sub-linear functions of the feature. Feeding powers of continuous features was found to improve offline accuracy.

4.2 Modeling Expected Watch Time

Our goal is to predict expected watch time given training examples that are either positive (the video impression was clicked) or negative (the impression was not clicked). Positive examples are annotated with the amount of time the user spent watching the video. To predict expected watch time we use the technique of weighted logistic regression, which was developed for this purpose.

The model is trained with logistic regression under cross-entropy loss (Figure 7). However, the positive (clicked) impressions are weighted by the observed watch time on the video. Negative (unclicked) impressions all receive unit weight. In this way, the odds learned by the logistic regression are Σᵢ Tᵢ / (N − k), where N is the number of training examples, k is the number of positive impressions, and Tᵢ is the watch time of the ith impression. Assuming the fraction of positive impressions is small (which is true in our case), the learned odds are approximately E[T](1 + P), where P is the click probability and E[T] is the expected watch time of the impression. Since P is small, this product is close to E[T]. For inference we use the exponential function eˣ as the final activation function to produce these odds that closely estimate expected watch time.
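The quantile normalization described at the top of this page (end of Section 4.1) can be sketched as follows. This is a minimal illustration, not the paper's production pipeline; the function names are ours, and the quantile grid size is an arbitrary choice:

```python
import numpy as np

def fit_quantiles(values, num_quantiles=256):
    """One pass over training data: store evenly spaced quantiles of the feature."""
    probs = np.linspace(0.0, 1.0, num_quantiles)
    return np.quantile(values, probs)

def normalize(x, quantiles):
    """Approximate the CDF by linear interpolation on the stored quantiles,
    mapping raw values to an (approximately) uniform distribution on [0, 1)."""
    probs = np.linspace(0.0, 1.0, len(quantiles))
    return np.interp(x, quantiles, probs)

def expand(x_tilde):
    """Also feed the network x~^2 and sqrt(x~), letting it easily form
    super- and sub-linear functions of the feature."""
    return np.stack([x_tilde, x_tilde ** 2, np.sqrt(x_tilde)], axis=-1)

# Usage on a heavy-tailed synthetic feature
raw = np.random.lognormal(size=10_000)
q = fit_quantiles(raw)
x_tilde = normalize(raw, q)
features = expand(x_tilde)   # shape (10000, 3), each column in [0, 1]
```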
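The watch-time weighting of Section 4.2 can likewise be sketched with a toy gradient-descent model. This is a hedged illustration of the idea (positives weighted by Tᵢ, negatives by 1, eˣ at serving), not the paper's implementation; the data, learning rate, and iteration count are arbitrary assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 5000, 8
X = rng.normal(size=(n, d))
clicked = rng.random(n) < 0.1                  # ~10% positive impressions
watch_time = np.where(clicked, rng.exponential(120.0, n), 0.0)
weights = np.where(clicked, watch_time, 1.0)   # T_i for positives, 1 for negatives
y = clicked.astype(float)

# Weighted logistic regression under cross-entropy loss: with these weights
# the learned odds approach sum(T_i) / (N - k), i.e. roughly E[T].
w = np.zeros(d)
b = 0.0
lr = 0.1
for _ in range(200):
    p = 1.0 / (1.0 + np.exp(-(X @ w + b)))
    grad = weights * (p - y)                   # gradient of weighted cross-entropy
    w -= lr * (X.T @ grad) / n
    b -= lr * grad.mean()

# Serving: exponentiate the logit to produce odds ~ expected watch time.
expected_watch_time = np.exp(X @ w + b)
```

Because the features here are pure noise, the model mostly learns the bias, whose exponent approximates the global odds Σ Tᵢ / (N − k).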
4.3 Experiments with Hidden Layers

Table 1 shows the results we obtained on next-day holdout data with different hidden layer configurations. The value shown for each configuration ("weighted, per-user loss") was obtained by considering both positive (clicked) and negative (unclicked) impressions shown to a user on a single page. We first score these two impressions with our model. If the negative impression receives a higher score than the positive impression, then we consider the positive impression's watch time to be mispredicted watch time. Weighted, per-user loss is then the total amount of mispredicted watch time as a fraction of total watch time over heldout impression pairs.

These results show that increasing the width of hidden layers improves results, as does increasing their depth. The trade-off, however, is server CPU time needed for inference. The configuration of a 1024-wide ReLU followed by a 512-wide ReLU followed by a 256-wide ReLU gave us the best results while enabling us to stay within our serving CPU budget.

For the 1024 → 512 → 256 model we tried only feeding the normalized continuous features without their powers, which increased loss by 0.2%. With the same hidden layer configuration, we also trained a model where positive and negative examples are weighted equally. Unsurprisingly, this increased the watch time-weighted loss by a dramatic 4.1%.

Hidden layers                        | weighted, per-user loss
None                                 | 41.6%
256 ReLU                             | 36.9%
512 ReLU                             | 36.7%
1024 ReLU                            | 35.8%
512 ReLU → 256 ReLU                  | 35.2%
1024 ReLU → 512 ReLU                 | 34.7%
1024 ReLU → 512 ReLU → 256 ReLU      | 34.6%

Table 1: Effects of wider and deeper hidden ReLU layers on watch time-weighted pairwise loss computed on next-day holdout data.

5. CONCLUSIONS

We have described our deep neural network architecture for recommending YouTube videos, split into two distinct problems: candidate generation and ranking.
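The "weighted, per-user loss" metric defined back in Section 4.3 can be sketched as a short function. This is our own illustrative reading of the metric, with hypothetical held-out pairs; it is not code from the paper:

```python
import numpy as np

def weighted_per_user_loss(pos_scores, neg_scores, watch_times):
    """For each heldout (positive, negative) impression pair, the positive's
    watch time counts as mispredicted when the negative scores higher.
    Returns mispredicted watch time as a fraction of total watch time."""
    pos_scores = np.asarray(pos_scores)
    neg_scores = np.asarray(neg_scores)
    watch_times = np.asarray(watch_times, dtype=float)
    mispredicted = neg_scores > pos_scores
    return watch_times[mispredicted].sum() / watch_times.sum()

# Three hypothetical heldout pairs; only the second pair is mispredicted,
# so the loss is 50 / (100 + 50 + 150) = 1/6.
loss = weighted_per_user_loss(
    pos_scores=[2.0, 0.5, 3.1],
    neg_scores=[1.0, 0.9, 2.7],
    watch_times=[100.0, 50.0, 150.0],
)
```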
Our deep collaborative filtering model is able to effectively assimilate many signals and model their interactions with layers of depth, outperforming previous matrix factorization approaches used at YouTube [23]. There is more art than science in selecting the surrogate problem for recommendations, and we found classifying a future watch to perform well on live metrics by capturing asymmetric co-watch behavior and preventing leakage of future information. Withholding discriminative signals from the classifier was also essential to achieving good results; otherwise the model would overfit the surrogate problem and not transfer well to the homepage.

We demonstrated that using the age of the training example as an input feature removes an inherent bias towards the past and allows the model to represent the time-dependent behavior of popular videos. This improved offline holdout precision results and increased the watch time dramatically on recently uploaded videos in A/B testing.

Ranking is a more classical machine learning problem, yet our deep learning approach outperformed previous linear and tree-based methods for watch time prediction. Recommendation systems in particular benefit from specialized features describing past user behavior with items. Deep neural networks require special representations of categorical and continuous features, which we transform with embeddings and quantile normalization, respectively. Layers of depth were shown to effectively model non-linear interactions between hundreds of features.

Logistic regression was modified by weighting training examples with watch time for positive examples and unity for negative examples, allowing us to learn odds that closely model expected watch time. This approach performed much better on watch-time-weighted ranking evaluation metrics compared to predicting click-through rate directly.

6. ACKNOWLEDGMENTS

The authors would like to thank Jim McFadden and Pranav Khaitan for valuable guidance and support. Sujeet Bansal, Shripad Thite and Radek Vingralek implemented key components of the training and serving infrastructure. Chris Berg and Trevor Walker contributed thoughtful discussion and detailed feedback.
