
Embedding-based News Recommendation for Millions of Users

Abstract
1 Introduction
2 Our service and process flow
3 Article representations
3.1 Generating Method
4 User representations
4.1 Notation
4.2 Word-based Model
4.3 Decaying Model
4.4 Recurrent Models
5 Offline Experiments
5.1 Training Dataset
5.2 Test Dataset
5.3 Offline Metrics
5.4 Models and Training
5.5 Experimental results
6 Deployment
6.1 Settings
6.2 Online Metrics
6.3 Experimental Results
6.4 Deployment challenges to large-scale deep-learning-based services
7 Related Work
8 Conclusion
References
Embedding-based News Recommendation for Millions of Users

Shumpei Okura, Yahoo Japan Corporation, Tokyo, Japan (sokura@yahoo-corp.jp)
Shingo Ono, Yahoo Japan Corporation, Tokyo, Japan (shiono@yahoo-corp.jp)
Yukihiro Tagami, Yahoo Japan Corporation, Tokyo, Japan (yutagami@yahoo-corp.jp)
Akira Tajima, Yahoo Japan Corporation, Tokyo, Japan (atajima@yahoo-corp.jp)

ABSTRACT

Effective news recommendation requires understanding both the content of articles and the preferences of users. ID-based methods, such as collaborative filtering and low-rank factorization, are well known for making recommendations, but they are not suitable for news recommendation because candidate articles expire quickly and are replaced with new ones within short spans of time. Word-based methods, which are often used in information retrieval settings, are good candidates in terms of system performance, but they have trouble coping with synonyms and orthographical variants and with defining "queries" from users' historical activities. This paper proposes an embedding-based method that uses distributed representations in a three-step, end-to-end manner: (i) start with distributed representations of articles based on a variant of a denoising autoencoder, (ii) generate user representations by using a recurrent neural network (RNN) with browsing histories as input sequences, and (iii) match and list articles for users based on inner-product operations, taking system performance into consideration.

The proposed method performed well in an experimental offline evaluation using past access data on Yahoo! JAPAN's homepage. Based on these experimental results, we implemented it in our actual news distribution system and compared its online performance with the method previously incorporated into the system. As a result, the click-through rate (CTR) improved by 23% and the total duration improved by 10% compared with the previous method. Services that incorporate the proposed method are already open to all users; they provide recommendations to over ten million individual users per day, who make billions of accesses per month.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org. KDD'17, August 13–17, 2017, Halifax, NS, Canada. © 2017 Copyright held by the owner/author(s). Publication rights licensed to ACM. ISBN 978-1-4503-4887-4/17/08...$15.00. DOI: http://dx.doi.org/10.1145/3097983.3098108

KEYWORDS

News recommendations, Neural networks, Distributed representations, Large-scale services

1 INTRODUCTION

Users of news distribution services cannot read all available articles in the limited time they have. Thus, users prefer news services that selectively provide articles. In traditional media, such as television news programs and newspapers, this selection is typically done manually by editors, and a common set of selected stories is provided to all users.
On the Internet, however, we can identify users before they select articles by using information such as user ID cookies, and we can personalize the articles provided to individual users [3, 22].

ID-based methods, such as collaborative filtering and low-rank factorization, are well known for making recommendations. However, Zhong et al. [22] suggested that such methods are not suitable for news recommendation because candidate articles expire too quickly and are replaced with new ones within short time spans. Thus, the three keys in news recommendation are:

- understanding the content of articles,
- understanding user preferences, and
- listing selected articles for individual users based on content and preferences.

In addition, real-world recommendation [14] must cope with scalability and with noise in the learning data. Applications also need to return responses within hundreds of milliseconds for every user access.

A baseline implementation covering the three keys would be as follows. An article is regarded as a collection of the words in its text. A user is regarded as a collection of the words in the articles he/she has browsed. The implementation learns click probability using the co-occurrence of words between candidate articles and browsing histories as features. This method has some practical advantages. It can immediately reflect the latest trends, because the model is simple enough to be trained and updated in a short period of time. Priority scores can also be calculated quickly with existing search engines that use inverted indices on words.
The previous version of our implementation was based on this method for these reasons. However, it had issues that may have hurt the quality of recommendations. The first concerned the representation of words. When a word is used as a feature, two words with the same meaning are treated as completely different features if their notations differ. This problem tends to emerge in news articles, because multiple providers separately submit articles on the same event. The second concerned the handling of browsing histories. Browsing histories were handled as a set in this approach. However, they are actually a sequence, and the order of browsing should represent the transition of user interests. We also had to note the large variance in history length across users, which ranged from private browsing to visiting sites multiple times per hour.

Deep-learning-based approaches have recently been reported to be effective in various domains. Distributed representations of words capture semantic information well [11, 16], and recurrent neural networks (RNNs) have provided effective results for handling input sequences of variable length [9, 15, 17]. However, if we build a deep network with an RNN to directly estimate the degree of interest between users and articles, it is difficult to satisfy the response time constraints of accesses in real systems. This paper therefore proposes an embedding-based method that uses distributed representations in a three-step, end-to-end manner, from representing each article to listing articles for each user based on relevance and duplication:

- Start with distributed representations of articles based on a variant of the denoising autoencoder (which addresses the first issue; Section 3).
- Generate user representations by using an RNN with browsing histories as input sequences (which addresses the second issue; Section 4).
- Match and list articles for each user based on the inner product of article and user representations for relevance, and of article pairs for de-duplication (outlined in Section 2).

The key to our method is using a simple inner product to estimate article-user relevance. Article representations and user representations can be calculated before user visits, when there is sufficient time. When a user accesses our service, we only fetch his/her representation and calculate its inner product with the candidate articles. Our method therefore both expresses the complex relations contained in the user's browsing history and satisfies the response time constraints of the real system.

The proposed method was applied to our news distribution service for smartphones, which is described in the next section. We compared our method with a conventional approach, and the results (see Section 6) revealed that the proposed method outperformed the conventional approach on a real service as well as on static experimental data, even when disadvantages such as increased learning time and larger latency in model updates are taken into consideration.

2 OUR SERVICE AND PROCESS FLOW

The methods discussed in this paper were designed to be applied to the news distribution service on the homepage of Yahoo! JAPAN on smartphones. The online experiments described in Section 6 were also conducted on this page. This section introduces our service and process flow.

Figure 1: Example of Yahoo! JAPAN's homepage on smartphones. This paper discusses methods of providing articles in the Personalized module.
Figure 1 shows a mockup of our service, which was renewed in May 2015. At the top, as a header, there are a search window and links to other services. The middle part, called the Topics module, provides six articles on major news selected by human professionals for a general readership. The bottom part, called the Personalized module, provides many articles and advertisements personalized for individual users. Users can see as many articles in the Personalized module as they want by scrolling down to the bottom; typical users scroll down to browse roughly the top 20 articles. This paper describes the optimization used to provide articles in the Personalized module.

Five processes are executed for each user access to select articles for millions of users:

- Identify: Obtain user features calculated from the user history in advance.
- Matching: Extract articles from all those available using the user features.
- Ranking: Rearrange the list of articles according to certain priorities.
- De-duplication: Remove articles that contain the same information as others.
- Advertising: Insert ads if necessary.
These processes have to be completed in the hundreds of milliseconds between a user's request and the display of the page, because the available articles are constantly changing. In fact, since all articles in our service expire within 24 hours for reasons of information freshness, tens of thousands of new articles are posted every day and the same number of old articles are removed due to expiration. Thus, each process adopts computationally light methods that leverage pre-computed distributed representations of articles (described in Section 3) and of users (described in Section 4).

In matching, we use the inner product of the distributed representations of a user and the candidate articles to quantify relevance and select promising candidates. In ranking, we determine the order of priority by considering additional factors, such as the expected number of page views and the freshness of each article, in addition to the relevance used for matching. In de-duplication, we skip similar articles in a greedy manner based on the cosine similarity of their distributed representations: an article is skipped when the maximum cosine similarity between it and the articles with higher priority exceeds a threshold. This is an important process in real news distribution services because similar articles tend to receive similar ranking scores; if similar articles are displayed close to one another, user satisfaction may decrease due to the reduced diversity of the display. Comparison experiments for this process are discussed in our previous study [12]. Advertising is also important, but several studies [2, 10] have already reported on the relationship between advertising and user satisfaction, so that discussion is omitted here.

3 ARTICLE REPRESENTATIONS

Section 1 discussed how using words as article features does not work well in certain cases of extraction and de-duplication. This section describes a method for representing articles as distributed representations. We proposed this method in our previous study [12], from which part of this section is excerpted.

3.1 Generating Method

Our method generates distributed representation vectors on the basis of a denoising autoencoder [19] with weak supervision. The conventional denoising autoencoder is formulated as:

$\tilde{x} \sim q(\tilde{x} \mid x)$
$h = f(W \tilde{x} + b)$
$y = f(W' h + b')$
$\theta = \arg\min_{W, W', b, b'} \sum_{x \in X} L_R(y, x),$

where $x \in X$ is the original input vector and $q(\cdot \mid \cdot)$ is the corrupting distribution. The stochastically corrupted vector, $\tilde{x}$, is obtained from $q(\cdot \mid x)$. The hidden representation, $h$, is mapped from $\tilde{x}$ through the network, which consists of an activation function, $f(\cdot)$, a parameter matrix $W$, and a parameter vector $b$. In the same way, the reconstructed vector, $y$, is mapped from $h$ with parameters $W'$ and $b'$. Using a loss function, $L_R(\cdot, \cdot)$, we learn these parameters to minimize the reconstruction error between $y$ and $x$.

The $h$ is usually used as the representation vector corresponding to $x$. However, $h$ holds only the information of $x$; we also want the inner product of two representation vectors, $h_0^\top h_1$, to be larger when $x_0$ and $x_1$ are more similar.
To achieve that end, we use a triplet, $(x_0, x_1, x_2) \in X^3$, as input for training and modify the objective function to preserve their categorical similarity:

$\tilde{x}_n \sim q(\tilde{x}_n \mid x_n)$
$h_n = f(W \tilde{x}_n + b) - f(b)$
$y_n = f(W' h_n + b')$
$L_T(h_0, h_1, h_2) = \log\bigl(1 + \exp(h_0^\top h_2 - h_0^\top h_1)\bigr)$
$\theta = \arg\min_{W, W', b, b'} \sum_{(x_0, x_1, x_2) \in T} \sum_{n=0}^{2} L_R(y_n, x_n) + \alpha L_T(h_0, h_1, h_2),$   (1)

where $T \subset X^3$ is such that $x_0$ and $x_1$ are in the same or similar categories and $x_0$ and $x_2$ are in different categories. The $h$ in Eq. 1 satisfies the property $x = 0 \Rightarrow h = 0$, which means that an article with no available information is not similar to any other article. The notation $L_T(\cdot, \cdot, \cdot)$ is a penalty function that makes article similarity correspond to categorical similarity, and $\alpha$ is a hyperparameter for balancing the two terms. Figure 2 provides an overview of this method.

Figure 2: Encoder for triplets of articles

We use the elementwise sigmoid function, $\sigma(x)_i = 1 / (1 + \exp(-x_i))$, as $f(\cdot)$, elementwise cross entropy as $L_R(\cdot, \cdot)$, and masking noise as $q(\cdot \mid \cdot)$. We train the model, $\theta = \{W, W', b, b'\}$, by mini-batch stochastic gradient descent (SGD).

In the application phase, we construct $\tilde{x}$ by constant decay instead of the stochastic corruption used in the training phase:

$\tilde{x} = (1 - p) x$
$h = f(W \tilde{x} + b) - f(b),$

where $p$ is the corruption rate used in the training phase. Thus, $h$ is uniquely determined at application time. Multiplying by $1 - p$ equalizes the input distribution of each neuron in the middle layer between learning with masking noise and applying the model without this noise.

We use the $h$ generated above as the representation of the article in three applications: (i) as input to the user-state function described in Section 4, (ii) to measure the relevance between a user and an article in matching, and (iii) to measure the similarity between articles in de-duplication.
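To make the training and application phases above concrete, the following is a minimal NumPy sketch of the triplet objective (Eq. 1) and of the application-phase encoding for a single triplet. The dimensions, initialization, corruption rate p, and balancing weight alpha are illustrative assumptions, not the paper's actual configuration.

```python
import numpy as np

rng = np.random.default_rng(0)
V, H, p, alpha = 1000, 500, 0.3, 1.0              # vocab size, hidden dim, corruption rate, balance (assumed)
W, b = 0.01 * rng.standard_normal((H, V)), np.zeros(H)
W2, b2 = 0.01 * rng.standard_normal((V, H)), np.zeros(V)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def encode(x_tilde):
    # h = f(W x~ + b) - f(b); guarantees x = 0 -> h = 0
    return sigmoid(W @ x_tilde + b) - sigmoid(b)

def triplet_loss(x0, x1, x2):
    """x0/x1: same or similar categories, x0/x2: different categories."""
    xs = [x0, x1, x2]
    # masking noise in the training phase: each element is dropped with probability p
    hs = [encode(x * (rng.random(V) >= p)) for x in xs]
    ys = [sigmoid(W2 @ h + b2) for h in hs]
    # reconstruction term L_R: elementwise cross entropy between y_n and x_n
    recon = sum(-(x * np.log(y + 1e-9) + (1 - x) * np.log(1 - y + 1e-9)).sum()
                for x, y in zip(xs, ys))
    # categorical-similarity penalty L_T
    h0, h1, h2 = hs
    lt = np.log1p(np.exp(h0 @ h2 - h0 @ h1))
    return recon + alpha * lt

def article_representation(x):
    # application phase: constant decay (1 - p) instead of stochastic corruption
    return encode((1 - p) * x)
```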
4 USER REPRESENTATIONS

This section describes several variations of the method used to calculate user preferences from the browsing history of a user. First, we formulate the problem and a simple word-based baseline method and discuss their issues. We then describe methods that use the distributed representations of articles explained in the previous section.

4.1 Notation

Let $A$ be the entire set of articles. The representation of an element $a \in A$ depends on the method: $a$ is a sparse vector in the word-based method described in Section 4.2, in which each element of the vector corresponds to a word in the vocabulary (i.e., $x$ in Section 3), whereas $a$ is the distributed representation vector of the article (i.e., $h$ in Section 3) in the methods using distributed representations described in Sections 4.3 and 4.4.

Let $\{a^u_t \in A\}_{t=1,\dots,T_u}$ be the browsing history of user $u \in U$. Browse means that the user visits the uniform resource locator (URL) of the page of an article. Session means that the user visits our recommendation service and clicks one of the articles in the recommended list.

When $u$ clicks an article in our recommendation service (a session occurs), he/she immediately visits the URL of the clicked article (a browse occurs). Thus, there is never more than one session between browses $a^u_t$ and $a^u_{t+1}$, and this session is referred to as $s^u_t$. However, $u$ can also visit the URL of an article without our service, e.g., via a Web search, so $s^u_t$ does not always exist.

Since a session corresponds to the list presented to $u$, we express a session, $s^u_t$, as a list of articles $\{s^u_{t,p} \in A\}_{p \in P}$. The notation $P \subset \mathbb{N}$ denotes the set of positions of the recommended list that are actually displayed on the screen in this session. Let $P_+ \subseteq P$ be the clicked positions and $P_- = P \setminus P_+$ the non-clicked positions. Although $P$, $P_+$, and $P_-$ depend on $u$ and $t$, we omit these subscripts to simplify the notation. Figure 3 outlines the relationships between these notations.

Figure 3: Browsing history and session

Let $u_t$ be the user state depending on $a^u_1, \dots, a^u_t$, i.e., $u_t$ represents the preference of $u$ immediately after browsing $a^u_t$. Let $R(u_t, a)$ be the relevance between the user state $u_t$ and an article $a$, which represents the strength of $u$'s interest in $a$ at time $t$. Our main objective is to construct a user-state function $F(\cdot, \dots, \cdot)$ and a relevance function $R(\cdot, \cdot)$ that satisfy the property:

$\forall s^u_t,\ \forall p_+ \in P_+,\ \forall p_- \in P_-:\quad u_t = F(a^u_1, \dots, a^u_t),\quad R(u_t, s^u_{t,p_+}) > R(u_t, s^u_{t,p_-}).$   (2)

Considering the constrained response time of a real news distribution system with large traffic volumes, $R(\cdot, \cdot)$ must be a simple function that can be calculated quickly. Because candidate articles are frequently replaced, it is impossible to pre-calculate the relevance score, $R(u_t, a)$, for all users $\{u_t \mid u \in U\}$ and all articles $\{a \in A\}$; it must be calculated in the very short time between a visit to our service page and the display of the recommended list. In contrast, there is sufficient time to calculate the user-state function, $F(\cdot, \dots, \cdot)$, between the browsing of article pages and the next session.

For these reasons, we restrict the relevance function $R(\cdot, \cdot)$ to a simple inner product, $R(u_t, a) = u_t^\top a$, and only optimize the user-state function, $F(\cdot, \dots, \cdot)$, to minimize the objective:

$- \sum_{s^u_t} \sum_{p_+ \in P_+} \sum_{p_- \in P_-} \frac{\log\bigl(\sigma\bigl(R(u_t, s^u_{t,p_+}) - R(u_t, s^u_{t,p_-})\bigr)\bigr)}{|P_+|\,|P_-|},$   (3)

where $\sigma(\cdot)$ is the logistic sigmoid function. Equation 3 is a relaxation of Eq. 2.

In practice, there is a bias in the click probability depending on the display position when articles are arranged vertically, so we use the following objective, which includes a bias term $B(\cdot, \cdot)$ to correct for such effects. Although $B(\cdot, \cdot)$ is a parameter determined by learning, its description is omitted below because it is a common term in all the models that follow:

$- \sum_{s^u_t} \sum_{p_+ \in P_+} \sum_{p_- \in P_-} \frac{\log\bigl(\sigma\bigl(R(u_t, s^u_{t,p_+}) - R(u_t, s^u_{t,p_-}) + B(p_+, p_-)\bigr)\bigr)}{|P_+|\,|P_-|}.$
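As an illustration of the objective above, the following sketch computes the per-session term of the bias-corrected loss. The user state, the article vectors, and the bias values are assumed to be given here, whereas in the paper the user-state function and B(·,·) are what is learned; it also assumes both position lists are non-empty.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def session_loss(u_t, articles, clicked, not_clicked, bias=None):
    """Negative log-likelihood contribution of one session s^u_t.

    u_t:          user-state vector
    articles:     dict position -> article vector s^u_{t,p}
    clicked:      clicked positions P+ (non-empty)
    not_clicked:  non-clicked positions P- (non-empty)
    bias:         optional dict (p_plus, p_minus) -> B(p+, p-)
    """
    loss = 0.0
    for p_pos in clicked:
        for p_neg in not_clicked:
            diff = u_t @ articles[p_pos] - u_t @ articles[p_neg]   # R(u_t, s+) - R(u_t, s-)
            if bias is not None:
                diff += bias[(p_pos, p_neg)]                        # position-bias correction B(p+, p-)
            loss -= np.log(sigmoid(diff))
    return loss / (len(clicked) * len(not_clicked))                 # divide by |P+| |P-|
```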
4.2 Word-based Model

We formulate the word-based model introduced in Section 1 as a baseline implementation on the basis of the notation in the previous section. Recall the three steps in the baseline implementation:

- An article is represented by the collection of words included in its text,
- A user is represented by the collection of words included in the articles he/she has browsed, and
- The relevance between a user and an article is expressed by a linear function of the co-occurrence of words between them.

This model can be regarded as a special case of the notation in the previous section if the article representation $a$ and the user-state function $F$ are defined as:

$V$: vocabulary set, $\quad a, u_t \in \{0, 1\}^{V}$
$(a)_v = 1$ (if the article contains the word $v$), $\ 0$ (otherwise)
$(u_t)_v = (F(a^u_1, \dots, a^u_t))_v = \alpha_v \max_{1 \le t' \le t} (a^u_{t'})_v,$   (4)
where $(x)_v$ is the $v$-th element of $x$. The relevance function then becomes a simple linear model with parameters $\{\alpha_v\}$:

$R(u_t, a) = u_t^\top a = \sum_{v \in V} (u_t)_v (a)_v = \sum_{v \in V} \alpha_v 1_{v \in u_t \cap a}.$

There are two major issues with this model, as explained in Section 1. The first is the sparseness of the representation. $R(u_t, a)$ in this model increases if and only if $u_t$ and $a$ contain the same words. Thus, even if a user is interested in similar articles, the relevance scores may differ greatly depending on whether those articles are written with the same words as the articles in his/her browsing history. The second issue is intensity. Because the browsing history is treated as a set, information about the order and frequency of browsing is lost, and the model is therefore vulnerable to noise mixed into the browsing history. The subsections below describe other models that use distributed representations and take these issues into account.

4.3 Decaying Model

We first introduce a simple model with distributed representations to address these issues. This model differs from the baseline in two ways:

- It uses the distributed representation constructed in Section 3 as the representation vector, $a$, of an article, instead of a bag-of-words (BoW) representation (which addresses the first issue).
- It aggregates browsing histories with a weighted average instead of the maximum value. More specifically, we increase the weights of recent browses and reduce the weights of browses from previous days (which addresses the second issue).

In summary, $u_t$ can be expressed as:

$u_t = \alpha \odot \frac{\sum_{1 \le t' \le t} \beta^{t - t'} a^u_{t'}}{\sum_{1 \le t' \le t} \beta^{t - t'}},$

where $\alpha$ is a parameter vector with the same dimension as $a^u_t$, $\odot$ is the elementwise multiplication of two vectors, and $0 < \beta \le 1$ is a scalar hyperparameter that represents the strength of the time decay. If $\beta$ is 1, the aggregation is a simple average, and the browsing order is not taken into consideration. The only training parameters in this model are $\alpha$, similar to the baseline model.

4.4 Recurrent Models

4.4.1 Simple Recurrent Unit. Although the decaying model in the previous subsection addresses the issues of the word-based model, it still has limitations: it is linear with respect to frequency, and its forgetting effect is limited to exponential decay with a hyperparameter. More generally, $u_t$ should be determined by the previous state, $u_{t-1}$, and the latest browse, $a^u_t$:

$u_t = f(a^u_t, u_{t-1}).$

We therefore try to learn this function with an RNN. A simple RNN is formulated as:

$u_t = \varphi(W^{in} a^u_t + W^{out} u_{t-1} + b),$

where $\varphi(\cdot)$ is the activation function; we use the hyperbolic tangent, $\tanh(\cdot)$, in the following. The training parameters of this model are the square matrices $W^{in}$ and $W^{out}$, the bias vector $b$, and the initial state vector $u_0$, where $u_0$ is a common initial value that does not depend on $u$. We learn these parameters by end-to-end mini-batch SGD with the objective function in Section 4.1. However, when the input sequence is too long, a simple RNN is difficult to train because of the vanishing and exploding gradient problems [5]. Additional structures attached to the hidden layer can alleviate these problems in such cases. The following subsections introduce two models that use such structures.
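Before moving to those gated structures, the following is a minimal NumPy sketch contrasting the two simpler user-state functions above. The array shapes and the choice of beta are illustrative assumptions; in practice, alpha and the RNN parameters are learned with the objective in Section 4.1.

```python
import numpy as np

def decaying_user_state(history, alpha, beta=0.9):
    """Decaying model (Section 4.3).

    history: (t, d) array whose rows are the representations a^u_1 ... a^u_t
    alpha:   parameter vector of dimension d
    beta:    decay hyperparameter in (0, 1]; beta = 1 gives the simple average
    """
    t = len(history)
    w = beta ** np.arange(t - 1, -1, -1)        # beta^(t - t') for t' = 1 .. t
    return alpha * (w @ history) / w.sum()      # alpha ⊙ weighted average of browses

def simple_rnn_user_state(history, W_in, W_out, b, u0):
    """Simple recurrent unit (Section 4.4.1): u_t = tanh(W_in a^u_t + W_out u_{t-1} + b)."""
    u = u0
    for a_t in history:
        u = np.tanh(W_in @ a_t + W_out @ u + b)
    return u
```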
4.4.2 Long-short Term Memory Unit. Long short-term memory (LSTM) [6] is a well-known structure for mitigating the vanishing and exploding gradient problems [5]. We formulate an LSTM-based model as:

$gi_t = \sigma(W^{in}_{gi} a^u_t + W^{out}_{gi} u_{t-1} + W^{mem}_{gi} h^u_{t-1} + b_{gi})$
$gf_t = \sigma(W^{in}_{gf} a^u_t + W^{out}_{gf} u_{t-1} + W^{mem}_{gf} h^u_{t-1} + b_{gf})$
$enc_t = \varphi(W^{in}_{enc} a^u_t + W^{out}_{enc} u_{t-1} + b_{enc})$   (5)
$h^u_t = gi_t \odot enc_t + gf_t \odot h^u_{t-1}$   (6)
$go_t = \sigma(W^{in}_{go} a^u_t + W^{out}_{go} u_{t-1} + W^{mem}_{go} h^u_t + b_{go})$
$dec_t = \varphi(W^{mem}_{dec} h^u_t + b_{dec})$   (7)
$u_t = go_t \odot dec_t,$   (8)

where $\sigma(\cdot)$ is the elementwise logistic sigmoid function and $h^u_t$ is a hidden memory state. Figure 4 shows a network image of the structure of the LSTM-based model. The center flow is the main flow from the input (browsed article) to the output (user state): the input, $a^u_t$, is encoded from the article vector space into the hidden space (Eq. 5), merged into the previous hidden state (Eq. 6), and decoded back to the article vector space as the user state (Eqs. 7 and 8).

In addition, this unit has three gates, called the input gate ($gi_t$), forget gate ($gf_t$), and output gate ($go_t$). We assume that each gate conceptually plays the following role. The input gate filters out inputs that are unnecessary for constructing a user state, such as those caused by sudden interest. The forget gate represents the decline of the user's interest; it can represent a more complex forgetting effect than the exponential decay used in the decaying model. The output gate filters out components that should not be focused on in the next session.

The training parameters of this model are the weight matrices $W$, the bias vectors $b$, and the initial state vectors $u_0$ and $h^u_0$, where $u_0$ and $h^u_0$ are common initial values that do not depend on $u$.

Figure 4: LSTM-based model
4.4.3 Gated Recurrent Unit. The gated recurrent unit (GRU) [1] is another structure for avoiding the gradient vanishing and explosion problems [5]. We formulate a GRU-based model as:

$gz_t = \sigma(W^{in}_{gz} a^u_t + W^{mem}_{gz} h^u_{t-1} + b_{gz})$
$gr_t = \sigma(W^{in}_{gr} a^u_t + W^{mem}_{gr} h^u_{t-1} + b_{gr})$
$enc_t = \varphi(W^{in}_{enc} a^u_t + W^{out}_{enc} (gr_t \odot h^u_{t-1}) + b_{enc})$
$h^u_t = gz_t \odot enc_t + (1 - gz_t) \odot h^u_{t-1}$   (9)
$dec_t = \varphi(W^{mem}_{dec} h^u_t + b_{dec})$   (10)
$u_t = dec_t.$

We describe these formulations using symbols that correspond to those of the LSTM-based model as much as possible. More precisely, this model is constructed from one GRU layer and one fully connected layer, because Eq. 10 is not contained in the original GRU configuration. Figure 5 outlines a network image of the GRU-based model structure. Except for the omission of some arrows, this structure is similar to that of the LSTM-based model.

Figure 5: GRU-based model

However, there is an important difference between Eqs. 6 and 9. The $gz_t$ gate in this model plays the role of two gates, i.e., $gi_t$ and $gf_t$ in the LSTM-based model. As a result, the following difference arises in the upper limit of the norm $\|h^u_t\|_\infty$:

$\|h^u_t\|_\infty \le \|h^u_0\|_\infty + t \sup_x |\varphi(x)|$ in LSTM,   (11)
$\|h^u_t\|_\infty \le \max\bigl(\|h^u_0\|_\infty, \sup_x |\varphi(x)|\bigr)$ in GRU.   (12)

Equation 11 can become large for a very long input sequence, whereas Eq. 12 never exceeds a constant. Therefore, we think the GRU-based model is better suited to avoiding the gradient explosion problem than the LSTM-based model. In fact, the LSTM-based model occasionally failed in training due to gradient explosion when we did not use gradient clipping [13] in the experiments described in the next section, whereas the GRU-based model did not suffer gradient explosion without any supplementary processing.
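The following NumPy sketch shows one step of the GRU-based update (Eqs. 9 and 10) and how a browsing history is folded into a user state. The dimensions follow the settings reported in Section 5.4, while the initialization, helper names, and parameter values are assumptions for illustration rather than the deployed model.

```python
import numpy as np

D, H = 500, 200                       # article/user-state dim and internal-state dim (Section 5.4)
rng = np.random.default_rng(0)

def mat(rows, cols):
    return 0.01 * rng.standard_normal((rows, cols))

# gate, candidate, and output parameters (assumed random initialization)
W_in_gz, W_mem_gz, b_gz = mat(H, D), mat(H, H), np.zeros(H)
W_in_gr, W_mem_gr, b_gr = mat(H, D), mat(H, H), np.zeros(H)
W_in_enc, W_out_enc, b_enc = mat(H, D), mat(H, H), np.zeros(H)
W_mem_dec, b_dec = mat(D, H), np.zeros(D)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(a_t, h_prev):
    """One browse a^u_t updates the internal state h^u_t and yields the user state u_t."""
    gz = sigmoid(W_in_gz @ a_t + W_mem_gz @ h_prev + b_gz)            # update gate gz_t
    gr = sigmoid(W_in_gr @ a_t + W_mem_gr @ h_prev + b_gr)            # reset gate gr_t
    enc = np.tanh(W_in_enc @ a_t + W_out_enc @ (gr * h_prev) + b_enc)
    h_t = gz * enc + (1.0 - gz) * h_prev                              # Eq. 9
    u_t = np.tanh(W_mem_dec @ h_t + b_dec)                            # Eq. 10 (extra fully connected layer)
    return u_t, h_t

def user_state(history, h0=np.zeros(H)):
    """Fold a browsing history a^u_1 ... a^u_t into the final user state u_t."""
    u, h = np.zeros(D), h0
    for a_t in history:
        u, h = gru_step(a_t, h)
    return u
```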
5 OFFLINE EXPERIMENTS

This section evaluates the effectiveness of the distributed-representation-based method with the models from the previous section through an offline evaluation using past serving logs. We compared three word-based models, which are variants of the model introduced in Section 4.2, and five distributed-representation-based models introduced in Sections 4.3 and 4.4. These models are summarized in Table 1.

5.1 Training Dataset

First, we sampled approximately 12 million users who had clicked at least one article from the service logs of Yahoo! JAPAN's homepage on smartphones between January and September 2016. We extracted logs for each user over a two-week period chosen at random to include at least one click. This method of extraction was used to mitigate the impact of epidemic articles within a specific period. As a result, there were about 166 million sessions, one billion browses, and two million unique articles in the training data. We also created another dataset for the same period and used it as a validation dataset to optimize the hyperparameters.

5.2 Test Dataset

We sampled 500,000 sessions in which users clicked articles above position 20 in October 2016. We extracted the browse logs of the previous two weeks for each session. We used the articles at positions 1 to 20 for evaluation, regardless of whether they were actually displayed on the screen. This was based on our observation of timeline-based user interfaces: users scroll from top to bottom and tend to leave our service once they click an article. That is, if we only used the data actually displayed for evaluation, a method that arranges the actually displayed order in reverse would be evaluated as disproportionately better.

5.3 Offline Metrics

We evaluated the rankings provided by each model using three popular metrics, i.e., the area under the ROC curve (AUC), mean reciprocal rank (MRR), and normalized discounted cumulative gain (nDCG), regarding clicks as positive labels. Let $S$ be the set of sessions used for evaluation, and let $c_{s,i}$ be one when the article at position $i$ in session $s$ is clicked and zero otherwise. Each metric is formulated as:

$\mathrm{AUC} = \frac{1}{|S|} \sum_{s \in S} \frac{|\{(i, j) \mid i < j,\ c_{s,i} = 1,\ c_{s,j} = 0\}|}{|\{i \mid c_{s,i} = 1\}|\,|\{j \mid c_{s,j} = 0\}|}$

$\mathrm{MRR} = \frac{1}{|S|} \sum_{s \in S} \frac{1}{\min\{i \mid c_{s,i} = 1\}}$

$\mathrm{nDCG} = \frac{1}{|S|} \sum_{s \in S} \frac{\sum_i c_{s,i} / \log_2(i + 1)}{\max_\pi \sum_i c_{s,\pi(i)} / \log_2(\pi(i) + 1)},$

where $\pi$ is an arbitrary permutation of positions.

The AUC is the metric most directly related to the objective of training (see Eq. 2). The MRR and nDCG are popular ranking metrics; the former focuses on the first occurrence of a positive instance, and the latter evaluates the ranks of all positive instances. Each metric is a value between zero and one, calculated as the average score over sessions; the larger, the better.
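As a concrete reading of these definitions, the sketch below computes the three per-session scores from a 0/1 click vector ordered by display position (it assumes the session contains at least one clicked and one non-clicked position); the reported metrics are the averages of these scores over all sessions in S.

```python
import numpy as np

def session_metrics(clicks):
    """clicks: 1-D 0/1 array with clicks[i] = c_{s, i+1} for positions 1..N."""
    clicks = np.asarray(clicks, dtype=float)
    pos = np.flatnonzero(clicks == 1) + 1          # clicked positions (1-indexed)
    neg = np.flatnonzero(clicks == 0) + 1          # non-clicked positions
    # AUC: fraction of (clicked, non-clicked) pairs ranked in the correct order
    auc = np.mean([i < j for i in pos for j in neg])
    # MRR: reciprocal of the first clicked position
    mrr = 1.0 / pos.min()
    # nDCG: DCG of the displayed order divided by the DCG of the ideal order
    ranks = np.arange(1, len(clicks) + 1)
    dcg = np.sum(clicks / np.log2(ranks + 1))
    ideal = np.sum(np.sort(clicks)[::-1] / np.log2(ranks + 1))
    ndcg = dcg / ideal
    return auc, mrr, ndcg

# e.g. a session in which positions 2 and 5 of 6 displayed articles were clicked
print(session_metrics([0, 1, 0, 0, 1, 0]))
```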
5.4 Models and Training

We evaluated the eight models listed in Table 1.

Table 1: Model descriptions

Name | Description | Section
BoW | The simplest word-based model | 4.2
BoW-Ave | Word-based model that uses the average function instead of max in Eq. 4 | 4.2 +α
BoW-Dec | Word-based model that uses a decayed average function with β = 0.9, similar to that introduced in Section 4.3 | 4.2 +α
Average | Decaying model with β = 1 (no decaying) | 4.3
Decay | Decaying model with β = 0.9 | 4.3
RNN | Recurrent model using the simple RNN unit | 4.4.1
LSTM | Recurrent model using the LSTM-based unit | 4.4.2
GRU | Recurrent model using the GRU-based unit | 4.4.3

As described in Section 4.2, word-based models can be regarded as simple linear logistic regressions on pairwise data with sparse feature vectors, like the ranking support vector machine (SVM) [7]. Thus, we used LIBLINEAR [4] to train L2-regularized logistic regression (solver type 0).

For the other models, we used mini-batch SGD with RMSprop to adjust the learning rate. The mini-batch size was 20 and the initial learning rate was 0.005. The number of dimensions of the distributed representations of an article $a$ and of the user state $u_t$ was 500. The number of dimensions of the internal state $h^u_t$ was 200 in LSTM and GRU. These parameters were determined using the development dataset. We used the gradient clipping technique [13] to avoid gradient explosion in RNN and LSTM; GRU did not suffer gradient explosion even without this technique in these experiments.

5.5 Experimental results

Table 2 lists all the metrics from the experiments. We split the test dataset into ten sub-datasets and calculated each metric per sub-dataset. The values in Table 2 are the averages of each metric over the sub-datasets, together with 99% confidence intervals estimated based on the t-distribution.

Table 2: Results from offline experiments. Values indicate the average of each metric and its 99% confidence interval over the ten split test sets.

Model | AUC | MRR | nDCG
BoW | 0.582 ± 0.003 | 0.300 ± 0.003 | 0.446 ± 0.002
BoW-Ave | 0.579 ± 0.004 | 0.310 ± 0.003 | 0.452 ± 0.002
BoW-Dec | 0.560 ± 0.004 | 0.297 ± 0.004 | 0.442 ± 0.003
Average | 0.608 ± 0.003 | 0.313 ± 0.003 | 0.457 ± 0.002
Decay | 0.596 ± 0.003 | 0.302 ± 0.002 | 0.449 ± 0.001
RNN | 0.612 ± 0.004 | 0.309 ± 0.004 | 0.455 ± 0.003
LSTM | 0.648 ± 0.004 | 0.344 ± 0.004 | 0.481 ± 0.003
GRU | 0.652 ± 0.003 | 0.347 ± 0.004 | 0.484 ± 0.003

GRU had the best score for all metrics by a sufficient margin, and the difference against all other models was significant according to a paired t-test with p < 0.01.

Because Average exhibited a significant improvement over BoW-Ave, distributed representations of articles appear to work better than BoW representations. Decay and BoW-Dec were worse than Average and BoW-Ave, which suggests that browsing-order information cannot be expressed with simple attenuation. RNN also exhibited a slight improvement in AUC over Average; however, LSTM and GRU were significantly better than Average. We believe this is because the gate structures in these models can express more complex relations in the order of browsing sequences.
6 DEPLOYMENT

We began using GRU for the main traffic in our service in December 2016 (we denote this the Proposed bucket). However, we continued to apply the conventional BoW model to 1% of users to enable comparisons of performance (we denote this the Control bucket). This section reports the results of comparisons after deployment.

6.1 Settings

Our system presents articles to users who visit our service page according to a three-step procedure:

- The user state, $u_t$, of each model is calculated from the browsing history in advance.
- When user $u$ visits our service, we calculate the relevance scores, $R(u_t, a) = u_t^\top a$, for all articles $a$ that were newly published within a certain period of time.
- We present the articles to the user in order of decreasing relevance score, applying de-duplication.

The representations of $u_t$ and $a$ used for the relevance scores depend on the bucket. We applied de-duplication using the distributed representations described in Section 2 to both the Control and Proposed buckets. The effects of de-duplication on users are not discussed here; refer to our past paper [12] for details of an experiment on these effects.

We can identify the user from browser cookies in addition to the login ID. Therefore, we can extract some history for most users, including those who are not logged in. However, some users have no history, such as new users and those who browse privately. Although we can provide articles to them by other methods that use popularity, diversity, and freshness instead of relevance scores, all results for such users have been excluded from the following reports.
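The following sketch illustrates the serving-time flow for one request under these settings: inner-product matching followed by greedy de-duplication on cosine similarity. Ranking by relevance alone, the candidate format, and the threshold value are simplifying assumptions; as noted in Section 2, the production ranking also considers expected page views and freshness.

```python
import numpy as np

def recommend(u_t, candidates, sim_threshold=0.9, top_k=20):
    """u_t: user-state vector; candidates: list of (article_id, h) where h is the
    distributed article representation from Section 3. Returns article ids to display."""
    # matching + (simplified) ranking: order candidates by inner-product relevance R(u_t, a)
    scored = sorted(candidates, key=lambda c: float(u_t @ c[1]), reverse=True)

    # greedy de-duplication: skip an article whose cosine similarity with any
    # already-selected, higher-priority article exceeds the threshold
    selected, selected_vecs = [], []
    for article_id, h in scored:
        hn = h / (np.linalg.norm(h) + 1e-9)
        if any(float(hn @ v) > sim_threshold for v in selected_vecs):
            continue
        selected.append(article_id)
        selected_vecs.append(hn)
        if len(selected) == top_k:
            break
    return selected
```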
6.2 Online Metrics

The list below gives the four online metrics we used.

- Sessions: The average number of times one user used our service per day.
- Duration: The average time (in seconds) the user spent on our service per session. It is the total of the time spent looking at the recommendation list and the time spent reading articles after clicking on them.
- Clicks: The average number of clicks per session (corresponding to $|P_+|$ in Section 4).
- Click-through rate (CTR): Clicks divided by the number of displayed articles. These decrease if article retrieval becomes inefficient and users spend more time exploring the recommendation list.

A session in this section means a collection of accesses by a user; accesses separated by a gap of more than 10 min are regarded as different sessions. Our most important metric is the total duration per user, which is the product of Sessions and Duration.

6.3 Experimental Results

The results obtained from the comparison are summarized in Figure 6 and Table 3. Figure 6 plots the daily transition of the lift rate of each metric (i.e., the metric in the Proposed bucket relative to that in the Control bucket). All metrics improved significantly with the Proposed bucket. Duration, Clicks, and CTR showed a certain rate of improvement from the first day the new model was applied. However, Sessions showed a relatively small rate of improvement on the first day and then gradually improved a great deal. This suggests that a good recommendation model first increases clicks, and repeated good click experiences then encourage users to use the service more frequently.

Table 3 summarizes the average metric lift rates in the seventh week and those by user segments split according to the frequency of visits. The segments are defined as follows; the user composition ratios are roughly Heavy : Medium : Light = 3 : 2 : 1.

- Heavy: Users who visited on more than five days during the previous week.
- Medium: Users who visited on between two and five days during the previous week.
- Light: Users who visited on fewer than two days during the previous week.

Table 3: Average metric lift rates in the 7th week and those by user segments.

Metric | Heavy | Medium | Light | ALL
Sessions | +1.1% | +1.8% | +1.0% | +2.3%
Duration | +13.3% | +17.4% | +7.8% | +4.9%
Clicks | +26.3% | +42.3% | +19.1% | +14.3%
CTR | +23.0% | +18.7% | +29.8% | +45.1%

Improvements in the metrics were confirmed for all segments. Light users demonstrated particularly large improvement rates. We therefore derived the following hypotheses: users with little history were strongly influenced by feature sparsity when we used BoW representations, and noisy browses caused by sudden interest seriously affected the accuracy of recommendations. Our method with the GRU-based model may successfully prevent these problems thanks to its distributed representations and the gate structures of the GRU. This is why the Proposed bucket worked better than the Control bucket with the word-based model, especially for light users.

In addition to the improvement in metrics within each segment, we also observed a shift of users from Light to Medium and from Medium to Heavy, as listed in Table 4. This is why the overall rate of increase in Sessions was higher than the increase in each segment in Table 3.

Table 4: Average daily percentage of user composition in the 7th week for both buckets.

Bucket | ALL (%) | Heavy (%) | Medium (%) | Light (%)
Control | 100.0 | 50.0 | 33.9 | 16.1
Proposed | 100.0 | 51.4 | 33.6 | 15.0
0th week (ref.) | 100.0 | 49.7 | 34.3 | 16.0
6.4 Deployment challenges to large-scale deep-learning-based services

It is very important to keep updating models with the latest data when applying machine learning to actual news distribution systems. In fact, the Control bucket described in the previous section uses the latest article and session data to update its model every three hours. The Proposed bucket using the GRU-based model, on the other hand, could not update its model as frequently, for four reasons:

- It takes a long time to learn the model; the model actually used for the main traffic is computed over about a week using a GPU.
- If we update the article-representation model (discussed in Section 3), we have to re-calculate the representations of all available articles and re-index them in the article search engine.
- If we update the user-representation model (discussed in Section 4), we have to re-calculate the user state from the first browse in the past. In practice, there is no choice but to give up re-calculating histories older than a certain period.
- The data store holding the user representations and the search index holding the article representations must be updated synchronously.

Therefore, we prepared two identical systems and switched them alternately to update the model once every two weeks. Of course, adding fresh articles and updating user states from users' browsing are executed more frequently, because they require far less computation than updating the models.

In our experience, word-based models respond sensitively to buzzwords and deteriorate within a few days after updates stop; however, the distributed-representation-based model maintains sufficient accuracy at this frequency of model updates. In experiments without model updates, the GRU-based model left unchanged for about three months deteriorated to the same accuracy as the word-based model updated every three hours.

7 RELATED WORK

This section presents work related to our proposed approach. We generated distributed representations of articles by using a denoising autoencoder with weak supervision, as was described in Section 3.