GLMix: Generalized Linear Mixed Models For Large-Scale Response Prediction

Xianxing Zhang, Yitong Zhou, Yiming Ma, Bee-Chung Chen, Liang Zhang, Deepak Agarwal
{xazhang, yizhou, yma, bchen, lizhang, dagarwal}@linkedin.com
LinkedIn, Mountain View, CA, USA

ABSTRACT

The generalized linear model (GLM) is a widely used class of models for statistical inference and response prediction problems. For instance, in order to recommend relevant content to a user or optimize for revenue, many web companies use logistic regression models to predict the probability of the user's clicking on an item (e.g., ad, news article, job). In scenarios where the data is abundant, having a more fine-grained model at the user or item level would potentially lead to more accurate prediction, as the user's personal preferences on items and the item's specific attraction for users can be better captured. One common approach is to introduce ID-level regression coefficients in addition to the global regression coefficients in a GLM setting; such models are called generalized linear mixed models (GLMix) in the statistical literature. However, for big data sets with a large number of ID-level coefficients, fitting a GLMix model can be computationally challenging. In this paper, we report how we successfully overcame the scalability bottleneck by applying parallelized block coordinate descent under the Bulk Synchronous Parallel (BSP) paradigm. We deployed the model in the LinkedIn job recommender system, and generated 20% to 40% more job applications for job seekers on LinkedIn.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.
KDD '16, August 13-17, 2016, San Francisco, CA, USA
© 2016 ACM. ISBN 978-1-4503-4232-2/16/08. $15.00
DOI: http://dx.doi.org/10.1145/2939672.2939684

1. INTRODUCTION

Accurate prediction of users' responses to items is one of the core functions of many recommendation applications. Examples include recommending movies, news articles, songs, jobs, advertisements, and so forth. Given a set of features, a common approach is to apply generalized linear models (GLM). For example, when the response is numeric (e.g., the rating a user gives a movie), a linear regression on the features is commonly used to predict the response. For a binary response (e.g., whether a user clicks when shown a job recommendation), logistic regression is often used. Sometimes the response is a count (e.g., the number of times a user listens to a song), and Poisson regression becomes a natural choice. All of the above models are special cases of GLM.
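As a quick illustration of these three special cases, the following is a minimal sketch (not code from the paper) of the inverse link functions that map a linear score x'b to the predicted mean response for each response type; the function name and interface are made up for illustration.

import numpy as np

def predicted_mean(score, response_type):
    """Map a linear score x'b to E[y] for three common GLMs."""
    if response_type == "numeric":   # linear regression, identity link
        return score
    if response_type == "binary":    # logistic regression, logit link
        return 1.0 / (1.0 + np.exp(-score))
    if response_type == "count":     # Poisson regression, log link
        return np.exp(score)
    raise ValueError(response_type)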
The features available in recommender systems often include user features (e.g., age, gender, industry, job function) and item features (e.g., title and skills for jobs, title and named entities for news articles). An approach that is widely adopted in industry to model interactions between users and items is to form the outer (cross) product of user and item features, followed by feature selection to reduce the dimensionality and mitigate overfitting. In reality, we often observe a lot of heterogeneity in the amount of data per user or item that cannot be sufficiently modeled by user/item features alone, which provides an opportunity to improve model accuracy by adding more granularity to the model. Specifically, for a user who has interacted with many items in the past, we should have sufficient data to fit regression coefficients that are specific to that user, to capture his/her personal interests. Similarly, for an item that has received many users' responses, it is beneficial to model its popularity and its interactions with user features through regression coefficients that are specific to the item.

One common approach to capture such behavior of each individual user and item is to use ID-level features, i.e., the outer product of user IDs and item features, and the outer product of item IDs and user features. Models with ID-level features are usually referred to as generalized linear mixed models (GLMix) in statistics [15]. Although conceptually simple, this approach can generate a very large number of regression coefficients to be learned. For example, for a data set of 10 million users, each with 1,000 non-zero coefficients on item features, the total number of regression coefficients can easily go beyond 10^10. Therefore, fitting GLMix models for big data is computationally challenging. Dimension reduction methods such as feature hashing [1] or principal component analysis can reduce the number of features. However, they also reduce our ability to interpret the model or explain the predictions in the original feature space (e.g., at the user's ID level), making it difficult to debug or investigate system issues or user complaints.

1.1 Our contributions

In this paper we develop a parallel block-wise coordinate descent (PBCD) algorithm under the Bulk Synchronous Parallel (BSP) paradigm [21] for the GLMix model, with a novel model score movement scheme that allows both the coefficients and the data to stay in local nodes. We show that our algorithm successfully scales model fitting of GLMix for very large data sets. We empirically demonstrate that this conceptually simple class of models achieves high accuracy
on two public data sets and the LinkedIn job recommendation data set. We also report our successful deployment of GLMix in LinkedIn's job recommendation production system, which improved the total number of job applications from active job seekers by 20% to 40%. To our knowledge, this is the first paper that discusses the practical lessons learned in scaling GLMix to real web recommender system applications.

We also extend GLMix by adding a matrix factorization component in order to further leverage recent advances in scaling up matrix factorization in distributed environments [24], and see some slight improvement on the LinkedIn job recommendation data. Since the parallelization of matrix factorization is well studied, in this paper we mainly focus on the scalability and parallelization of fitting GLMix. We provide an implementation of the GLMix model with Apache Spark (https://spark.apache.org), and the code base is open sourced as part of the Photon-ML library (https://github.com/linkedin/photon-ml).

1.2 Organization

The paper is organized as follows. We start by introducing the LinkedIn job recommendation problem as our motivating example and describing the formulation of the GLMix model in that context in Section 2. The fitting algorithm using PBCD and the implementation details are described in Section 3. We show our offline and online experiments for LinkedIn job recommendation in Section 4, and show results for two public data sets in Section 5. We review the literature background of GLMix and related work in Section 6. We conclude the paper in Section 7.

2. OUR MODEL

In this section, we describe the generalized linear mixed model (GLMix) in detail. We start with our motivating example in Section 2.1, the LinkedIn job recommendation problem, to provide the context for GLMix. We introduce the model in Section 2.2 and later describe a general formulation in Section 2.3. Section 2.4 describes extensions of GLMix such as adding a matrix factorization component.

2.1 Job Recommendation at LinkedIn

As the world's largest professional social network, LinkedIn provides a unique value proposition for its over 400M members to connect with all kinds of professional opportunities for their career growth. One of the most important products is the Jobs Homepage, which serves as a central place for members with job-seeking intention to come and find good jobs to apply for. Figure 1 is a snapshot of the LinkedIn Jobs Homepage. One of the main modules on the page is "Jobs you may be interested in", where relevant job thumbnails are recommended to members based on their public profile data and past activity on the site. If a member is interested in a recommended job, she can click on it to go to the job detail page, where the original job post is shown with information such as the job title, description, responsibilities, required skills and qualifications. The job detail page also has an "apply" button which allows the member to apply for the job with one click, either on LinkedIn or on the website of the company posting the job.

Figure 1: A snapshot of the LinkedIn jobs homepage.

One of the key success metrics for LinkedIn's jobs business is the total number of job application clicks (i.e., the number of clicks on the "apply" button), which is the focus of the job recommendation problem in this paper. Another second-order metric is "job detail views", the total number of clicks generated from the "Jobs you may be interested in" module to the job detail pages. We consider job application clicks, rather than job detail views, to be the key metric to optimize, because job posters mostly care about how many applications they receive, not how many people view their job postings.

2.2 GLMix Model

Now we consider the GLMix model for the job recommendation problem. To measure whether job j is a good match for member m and to select the best jobs according to this measure, the key is to predict the probability that member m would apply for job j given an impression on the "Jobs you may be interested in" module. Let y_mjt denote the binary response of whether member m would apply for job j in context t, where the context usually includes the time and location where the job is shown. We use q_m to denote the feature vector of member m, which includes features extracted from the member's public profile, e.g., the member's title, job function, education history, industry, etc. We use s_j to denote the feature vector of job j, which includes features extracted from the job post, e.g., the job title, desired skills and experience, etc. Let x_mjt represent the overall feature vector for the (m, j, t) triple, which can include q_m and s_j for feature-level main effects, the outer product between q_m and s_j for interactions among member and job features, and features of the context. We assume that x_mjt does not contain member IDs or item IDs as features, because IDs will be treated differently from regular features. The GLMix model for predicting the probability of member m applying for job j using logistic regression is:

    g(E[y_mjt]) = x'_mjt b + s'_j α_m + q'_m β_j,    (1)

where g(E[y_mjt]) = log( E[y_mjt] / (1 - E[y_mjt]) ) is the logit link function, b is the global coefficient vector (also called the fixed effect coefficients in the statistical literature), and α_m and β_j are the coefficient vectors specific to member m and job j, respectively. α_m and β_j are called random effect coefficients; they capture member m's personal preference for different job features and job j's attraction for different member features.
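To make the structure of Equation (1) concrete, the following is a minimal sketch (not the Photon-ML implementation) of how a GLMix score and application probability could be computed for a single (member, job, context) triple; the feature vectors and coefficient dictionaries are hypothetical.

import numpy as np

def glmix_probability(x_mjt, q_m, s_j, b, alpha, beta, member_id, job_id):
    """Score one (member, job, context) triple with the GLMix model of Eq. (1).

    x_mjt : global feature vector (member, job, interaction and context features)
    q_m   : member feature vector; s_j : job feature vector
    b     : global (fixed effect) coefficients
    alpha : dict member_id -> per-member coefficients on job features
    beta  : dict job_id    -> per-job coefficients on member features
    Members or jobs without fitted random effects fall back to zero vectors,
    i.e., to the global model x'b alone.
    """
    score = x_mjt @ b
    score += s_j @ alpha.get(member_id, np.zeros_like(s_j, dtype=float))
    score += q_m @ beta.get(job_id, np.zeros_like(q_m, dtype=float))
    return 1.0 / (1.0 + np.exp(-score))   # inverse logit link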
Priors: To mitigate the risk of overfitting due to the large number of parameters and the sparsity of the data, we put the following Gaussian priors on both the fixed effects and the random effects:

    b ~ N(0, (1/λ_b) I),   α_m ~ N(0, (1/λ_α) I),   β_j ~ N(0, (1/λ_β) I),    (2)

where I is the identity matrix, 1/λ_b is the prior variance of b, and 1/λ_α and 1/λ_β are the prior variances of α_m and β_j.

Model behavior: Like many other recommender system applications, we observe a lot of heterogeneity in the amount of data per member or job. There are new members to the site (hence almost no data) as well as members who have strong job search intentions and have applied for many jobs in the past. Similarly, for jobs there are both popular and unpopular ones. For a member m with many responses to different items in the past, we are able to accurately estimate her personal coefficient vector α_m and provide personalized predictions. On the other hand, if member m does not have much past response data, the posterior mean of α_m will be close to zero, and the model for member m will fall back to the global fixed effect component x'_mjt b. The same behavior applies to the per-job coefficient vector β_j (a closed-form illustration of this shrinkage is sketched at the end of this section).

2.3 General Formulation

We now consider a more generic formulation of GLMix where more than two sets of random effects can be present in the model. This is useful for scenarios such as multi-context modeling, where we can have random effects to model the interactions between the context ID and user/item features. The general formulation of GLMix is:

    g(E[y_n]) = x'_n b + Σ_{r∈R} z'_rn γ_{r,i(r,n)},    (3)
    b ~ p(b),   γ_rl ~ p(γ_rl),   ∀ r ∈ R,  1 ≤ l ≤ N_r.

Comparing Equation (3) with (1), the following new notation is introduced. Let R denote the collection of random effect types being modeled. Let i(r, n) denote an indexing function that retrieves the index of random effect type r in the n-th training sample; for example, if random effect type r corresponds to the per-job random effect, then i(r, n) returns the job ID associated with sample n. Given this definition of the indices, we use γ_{r,i(r,n)} to denote the random effect coefficient vector, and z_rn the corresponding feature vector, for random effect type r in the n-th sample. For the specific case of job recommendation, there are two types of random effects: R = {member, job}, and n is a per-sample index that represents the triple (m, j, t). For the member-level random effect (i.e., r = member), z_rn represents s_j and γ_{r,i(r,n)} represents α_m. For the job-level random effect (i.e., r = job), z_rn represents q_m and γ_{r,i(r,n)} represents β_j. We generalize the Gaussian priors on the fixed effects b and the random effects γ to p(·), and use N_r to denote the total number of instances of random effect type r; e.g., when r represents member, N_r is the total number of members in the data set.

2.4 Extensions

We note that GLMix can be extended by incorporating other popular modeling approaches in recommender systems, such as matrix factorization [11]. For example, for the job recommendation application, a simple extension of the model in Equation (1) with an additional matrix factorization component is:

    g(E[y_mjt]) = x'_mjt b + s'_j α_m + q'_m β_j + u'_m v_j,    (4)

where u_m and v_j are k-dimensional member and job factors, respectively.
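The shrinkage behavior described under "Model behavior" has a closed form in the Gaussian (identity-link) case, where the per-member MAP update is a ridge regression; the sketch below is only an illustration of that special case (the logistic case has no closed form), and all names are illustrative.

import numpy as np

def per_member_ridge_update(S_m, residual_m, lam_alpha):
    """Posterior mean of alpha_m for the identity-link (Gaussian) case.

    S_m        : (n_m, d) job-feature rows for member m's past impressions
    residual_m : length-n_m residuals y_n - x_n'b left over from the global model
    lam_alpha  : prior precision; with n_m = 0 rows this returns exactly zero,
                 i.e., the member falls back to the global model x'b.
    """
    d = S_m.shape[1]
    A = S_m.T @ S_m + lam_alpha * np.eye(d)
    return np.linalg.solve(A, S_m.T @ residual_m)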
3. ALGORITHM

In this section we describe our scalable algorithm for fitting the GLMix model using parallel block-wise coordinate descent (PBCD). We start with the algorithm itself in Section 3.1, and in Section 3.2 we discuss the implementation details under the Bulk Synchronous Parallel (BSP) paradigm [21] and the lessons learned, such as how to take the network I/O communication cost into account and how to reduce the size of the random effect coefficients. Throughout this section, we use the notation of the general formulation described in Section 2.3.

Given the set of sample responses y(Ω) = {y_n}_{n∈Ω}, where Ω is the set of sample indices, the main parameters to learn are the fixed effect coefficients b and the random effect coefficients Γ = {Γ_r}_{r∈R}, where Γ_r = {γ_rl}, l = 1, ..., N_r. The hyper-parameters for the priors on b and Γ, such as λ_b, λ_α and λ_β in the job recommendation case, can be tuned through cross-validation. For the general formulation of the GLMix model specified in Equation (3), the posterior mode of b and Γ can be found by solving the following optimization problem:

    max_{b, Γ}  Σ_{n∈Ω} log p(y_n | s_n) + log p(b) + Σ_{r∈R} Σ_{l=1..N_r} log p(γ_rl),    (5)

where s_n is defined as

    s_n = x'_n b + Σ_{r∈R} z'_rn γ_{r,i(r,n)},    (6)

and p(y_n | s_n) is the likelihood of response y_n under the link function g(E[y_n]), given b and Γ.

3.1 Parallel Block-wise Coordinate Descent

Traditionally, fitting algorithms for random effect models require Γ to be integrated out either analytically or numerically (a literature review can be found in Section 6), which becomes infeasible when facing industry-scale large data sets. Similarly, both deterministic and MCMC-sampling-based algorithms that operate on Γ as a whole become cumbersome. In Algorithm 1, we propose a parallel block-wise coordinate descent based iterative conditional mode algorithm, where the posterior mode of the random effect coefficients Γ_r for each random effect type r is treated as a block-wise coordinate to be optimized in the space of unknown parameters. Given the scores s defined in Equation (6), the optimization problems for updating the fixed effects b and the random effects Γ are given in Equations (7) and (8), respectively:

    b = argmax_b { log p(b) + Σ_{n∈Ω} log p(y_n | s_n - x'_n b^old + x'_n b) }    (7)
Algorithm 1: Parallel block-wise coordinate descent (PBCD) for GLMix
1  while not converged do
2      Update fixed effect b as in (7)
3      foreach n ∈ Ω in parallel do
4          Update s_n as in (9)
5      foreach r ∈ R do
6          foreach l ∈ {1, ..., N_r} in parallel do
7              Update random effect γ_rl as in (8)
8          foreach n ∈ Ω in parallel do
9              Update s_n as in (10)

    γ_rl = argmax_{γ_rl} { log p(γ_rl) + Σ_{n: i(r,n)=l} log p(y_n | s_n - z'_rn γ^old_rl + z'_rn γ_rl) }    (8)

As in most coordinate descent algorithms, we perform incremental updates of s = {s_n}_{n∈Ω} for computational efficiency. More specifically, when the fixed effects b are updated, Equation (9) is applied to update s; and when the random effects Γ are updated, Equation (10) is used.

    s^new_n = s^old_n - x'_n b^old + x'_n b^new    (9)
    s^new_n = s^old_n - z'_rn γ^old_{r,i(r,n)} + z'_rn γ^new_{r,i(r,n)}    (10)

Note that Equations (7) and (8) are both standard optimization problems for generalized linear models, for which many numerical optimizers are available for large data sets, such as L-BFGS, SGD, or TRON, and they can be easily parallelized for scalability [9, 14]; a compact single-machine sketch of the whole procedure is given at the end of Section 3.2.2 below.

3.2 Implementation Details

In this section, we discuss some details of how to implement the parallel block-wise coordinate descent (PBCD) algorithm for GLMix described in Algorithm 1 under the BSP paradigm. Alternatives for implementing PBCD, such as using a parameter server, are discussed in Section 6. We first introduce some additional notation, and then discuss the implementation details for updating the fixed effects and the random effects, respectively. An overview of the k-th iteration of the block coordinate descent algorithm for GLMix is given in Figure 2.

3.2.1 Some Notation

At iteration k of Algorithm 1, let s^k denote the current value of s = {s_n}_{n∈Ω}. Let P denote the dimension of the fixed effect feature space, i.e., x_n ∈ R^P, and P_r denote the dimension of the feature space of random effect type r, i.e., z_rn ∈ R^{P_r}. We use C to denote the overall dimension of the feature space, that is:

    C = P + Σ_{r∈R} P_r N_r,    (11)

where N_r denotes the number of random effects of type r (e.g., the number of users). For the set of sample responses y(Ω) = {y_n}_{n∈Ω}, we use |Ω| to denote the size of Ω, i.e., the total number of training samples. We also let |R| be the number of random effect types, and M the number of computing nodes in the cluster. These numbers will be used to compute and discuss the network I/O cost of our proposed approach, which is usually one of the major bottlenecks preventing an algorithm from scaling up in a distributed computing environment [27, 16].

3.2.2 Updating Fixed Effects

We now consider the details of updating the fixed effect b at iteration k, following Equation (7). We start by preparing the training data with both the feature set x_n and the latest scores s^k_n, and partitioning them across M nodes. This is identified as Superstep 1 in Figure 2. Given the training data, numerous types of distributed algorithms can be applied to learn b, such as [16, 6]. Here we adopt the all-reduce [1] type of implementation for generalized linear models, as in MLlib [12]. First, the gradient of b for each sample n is computed and aggregated from each node to the master node. The master node computes the total gradient and updates b using an optimizer such as L-BFGS. This corresponds to Superstep 2 in Figure 2. The new coefficients b^new are then broadcast back to each node together with b^old to update the scores s as in Equation (9), in order to prepare for the next effect's update. This phase corresponds to Superstep 3 in Figure 2.
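To make Algorithm 1 and the incremental updates (7)-(10) concrete, here is a minimal single-machine sketch for the logistic case with Gaussian priors (so log p(y_n|s_n) is the negative logistic loss and the log-priors become L2 penalties), using scipy's L-BFGS as the per-block solver. It only illustrates the block updates; the distributed BSP implementation described in this section partitions data and shuffles scores instead of looping in memory, and all names are illustrative.

import numpy as np
from scipy.optimize import minimize

def neg_log_post(w, F, y, s_rest, lam):
    """Penalized logistic loss for one block; s_rest holds the other blocks' scores. y in {-1,+1}."""
    s = s_rest + F @ w
    loss = np.logaddexp(0.0, -y * s).sum() + 0.5 * lam * w @ w
    grad = F.T @ (-y / (1.0 + np.exp(y * s))) + lam * w
    return loss, grad

def fit_block(F, y, s_rest, lam, w0):
    res = minimize(neg_log_post, w0, args=(F, y, s_rest, lam), jac=True, method="L-BFGS-B")
    return res.x

def pbcd(X, Z, ids, y, lam_b, lam_r, iters=10):
    """Block coordinate descent for logistic GLMix (Eqs. 7-10); y in {-1, +1}."""
    N, P = X.shape
    b = np.zeros(P)
    gamma = {r: {} for r in Z}          # r -> id -> coefficient vector
    s = np.zeros(N)                     # scores s_n as in Eq. (6)
    for _ in range(iters):
        # Fixed effect update, Eq. (7), then incremental score refresh, Eq. (9)
        b_new = fit_block(X, y, s - X @ b, lam_b, b)
        s += X @ (b_new - b)
        b = b_new
        # Random effect updates, Eq. (8), one ID at a time, then Eq. (10)
        for r, Zr in Z.items():
            for l in np.unique(ids[r]):
                rows = np.where(ids[r] == l)[0]
                old = gamma[r].get(l, np.zeros(Zr.shape[1]))
                new = fit_block(Zr[rows], y[rows], s[rows] - Zr[rows] @ old, lam_r, old)
                s[rows] += Zr[rows] @ (new - old)
                gamma[r][l] = new
    return b, gamma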
Since the main network communication here is the transmission of b from the master node to the worker nodes, the overall network communication cost for one iteration of updating the fixed effects is O(MP). One trick we found empirically useful is that it sometimes improves convergence to update b multiple times before updating the random effects, within each iteration of Algorithm 1. In summary, updating the fixed effects consists of three supersteps: (1) prepare the training data with scores s; (2) update the coefficients b; (3) update the scores s.

3.2.3 Updating Random Effects

The main challenge in designing a scalable algorithm for GLMix on data sets with a huge number of random effects is that the dimension of the random effect coefficient space can potentially be as large as N_r P_r. Therefore, if we naively applied the same approach as the one used for updating the fixed effects, the network communication cost for updating the random effects of type r would become O(M N_r P_r). Even for data of moderate size, e.g., N_r = 10^6, P_r = 10^5 and a cluster with M = 100, the network I/O cost already amounts to 10^13! As a result, the key to making the algorithm scalable is to avoid communicating or broadcasting the random effect coefficients across the computing nodes in the cluster.

Before the random effect updating phase, as a preprocessing step, for each random effect type r and ID l we group the feature vectors z_rn to form a feature matrix Z_rl, which consists of all the z_rn that satisfy i(r, n) = l. At iteration k and for random effect type r, we shuffle the current values of s = {s^k_n}_{n∈Ω} using the same strategy, i.e., for ID l we group the s^k_n to form a vector S^k_l, which consists of all the s^k_n that satisfy i(r, n) = l. With the right partitioning strategies, S^k_l can be made to collocate with the corresponding feature matrix Z_rl, to prepare the training data for updating the random effects γ_rl using Equation (8). This step corresponds to Superstep 4 in Figure 2 (a small sketch of this grouping step is given below).
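As a concrete single-machine illustration of this grouping step, the sketch below builds Z_rl and S^k_l for every ID l of one random effect type by grouping rows that share the same i(r, n); in the actual BSP implementation this corresponds to a shuffle/partition by ID rather than an in-memory loop, and the data structures are hypothetical.

import numpy as np
from collections import defaultdict

def group_by_random_effect_id(Z_r, s, ids_r):
    """Group feature rows z_rn and scores s_n by random effect ID l = i(r, n).

    Z_r   : (N, P_r) feature matrix for random effect type r
    s     : length-N array of current scores s_n
    ids_r : length-N array with the ID of type r for each sample
    Returns a dict l -> (Z_rl, S_l, row_indices), the local training data for ID l.
    """
    rows_by_id = defaultdict(list)
    for n, l in enumerate(ids_r):
        rows_by_id[l].append(n)
    grouped = {}
    for l, rows in rows_by_id.items():
        rows = np.asarray(rows)
        grouped[l] = (Z_r[rows], s[rows], rows)   # collocated Z_rl and S_l
    return grouped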
With the training data ready for each ID l, any optimizer (e.g., L-BFGS) can again be applied to solve the optimization problem locally, so that the random effects γ_rl can be learned in parallel without any additional network communication cost. Note that because both the coefficients and the data collocate on the same node, the scores can also be updated locally within the same step; this phase corresponds to Superstep 5 in Figure 2. Also note that during the whole process, the random effect coefficients γ_rl live with the data and never get communicated through the network; only s gets shuffled around the nodes through network I/O. As a result, the overall network communication cost for one iteration of updating one random effect type is O(|Ω|), and O(|R||Ω|) for |R| random effect types.

Figure 2: Overview of the k-th iteration of PBCD in the BSP paradigm, with five supersteps highlighted in the blue boxes.

Since the optimization problem in Equation (8) can be solved locally, we have an opportunity to apply some tricks that further reduce the memory complexity C as defined in Equation (11). Note that although the overall feature space size is P_r for random effect type r, the underlying dimension of the feature matrix Z_rl can sometimes be smaller than P_r, due to the lack of support for certain features. For example, a member who is a software engineer is unlikely to be served jobs that require the skill "medicine". Hence there will be no data for the feature "job skill=medicine" in this member's random effect, and in such a scenario Z_rl ends up with an empty column. As a result, for each random effect type r and ID l, we can condense Z_rl by removing all the empty columns and reindexing the features to form a more compact feature matrix, which also reduces the size of the random effect coefficients γ_rl and potentially improves the overall efficiency of solving the local optimization problem in Equation (8); a toy sketch of this condensing step is given at the end of this subsection. An example is shown in Figure 3, where we compare the random effect coefficient sizes before and after applying this condensed data storage strategy to a data set consisting of four months' worth of LinkedIn's job recommendations; more details on the data set can be found in Section 4. In summary, updating the random effects consists of two supersteps: (1) prepare the training data with scores s; (2) update the random effect coefficients and the scores.

Figure 3: The member and job coefficient sizes before and after the reindexing and compression, on the LinkedIn job recommendation data set.

3.2.4 Summary

The overall network I/O cost of our implementation of GLMix is O(MP + |R||Ω|). Since |R| is a small constant (usually less than 5), the communication cost is mainly bounded by the number of training instances |Ω|. This is a significant improvement over the naive implementation, whose cost of O(MC), where C is defined in (11), could approach O(M P_r |Ω|) if the number of random effects N_r is of the same order as the number of training instances |Ω| (e.g., under a power-law distribution). On the other hand, note that even for a large-scale data set with trillions of training samples, the size of s is bounded by a couple of terabytes. Given recent advances in shuffling techniques [26], shuffling terabytes of data is no longer a significant challenge. At the same time, since the total number of random effects N_r no longer depends on the number of nodes M in the cluster, our algorithm has the chance to scale linearly with the number of executors.
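To illustrate the feature-matrix condensing described above, the following sketch removes the empty (all-zero) columns of a local Z_rl and keeps the index mapping needed to expand the condensed coefficients back to the full P_r-dimensional space; the exact handling in Photon-ML may differ, so treat this as an assumption-laden toy version.

import numpy as np

def condense_feature_matrix(Z_rl):
    """Drop all-zero columns of Z_rl and remember the original column indices."""
    active_cols = np.where(np.abs(Z_rl).sum(axis=0) > 0)[0]
    return Z_rl[:, active_cols], active_cols

def expand_coefficients(gamma_condensed, active_cols, P_r):
    """Map coefficients learned in the condensed space back to the full space."""
    gamma_full = np.zeros(P_r)
    gamma_full[active_cols] = gamma_condensed
    return gamma_full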
4. EXPERIMENTS ON LINKEDIN JOB RECOMMENDATION

In this section, we describe our experiments for the LinkedIn job recommendation problem. Section 4.1 presents the offline experimental results comparing several variations of the GLMix model with a couple of baselines on prediction accuracy, as well as experiments on the scalability of GLMix. In Section 4.2 we share implementation details of how we launched GLMix in LinkedIn's production system. We discuss the online A/B test results in Section 4.3.

4.1 Offline Experiments

4.1.1 Setup

All the methods used in the offline experiments are implemented using Apache Spark. In particular, the baseline
methods such as ridge regression, ℓ2-regularized logistic regression, and matrix factorization with alternating least squares are from Spark's open-source machine learning library MLlib. The experiments were conducted on a cluster consisting of 135 nodes managed by Apache YARN (https://hadoop.apache.org). Each node has 24 Intel Xeon(R) CPU E5-2640 processors with 6 cores at 2.50GHz each, and every node has 250GB of memory. In the following experiments, Spark runs on top of YARN as a client, and the memory for each Spark executor is configured to 10GB, where each executor corresponds to a container (e.g., a core of the node) in YARN.

4.1.2 Our Data

Our offline experiments are conducted using a sample of four months' worth of data (07/2015 - 11/2015) of members' interactions with the "Jobs You May Be Interested In" module on the LinkedIn jobs homepage. The data include 500M samples with 5M members and 3M jobs. We split the data by time into training (90%), validation (5%) and test (5%) sets. Impressions that resulted in application clicks provide the positive responses, and impressions that did not result in application clicks provide the negative responses. Note that job detail views are not considered in this experiment, since that is not the metric we optimize for. There are about 2K member features for q_m, mostly extracted from public member profiles, e.g., industry, job function, education history, skills, and so forth. There are also around 2K job features for s_j, e.g., the job title, keywords in the description, required skills and qualifications, etc. We consider two types of interactions among the member and job features: a) the cosine similarity of the member and job features in the same domain, e.g., keywords in the summary of the member/job profile, or the member's skills versus the job's skill requirements; there are around 100 such cosine-similarity features, and we denote this feature vector by f_c. b) the simple outer product of the member and job features; after applying standard feature selection techniques we obtained 20K outer-product features (denoted by f_o).
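As an illustration of the two interaction feature types used here, the sketch below computes a cosine-similarity feature for one shared domain (e.g., skills) and the outer-product features between member and job vectors; the names and dimensions are made up, and the feature-selection step that trims the outer product down to ~20K features is omitted.

import numpy as np

def cosine_similarity_feature(member_vec, job_vec):
    """One entry of f_c: cosine similarity of member and job vectors in the same domain."""
    denom = np.linalg.norm(member_vec) * np.linalg.norm(job_vec)
    return float(member_vec @ job_vec / denom) if denom > 0 else 0.0

def outer_product_features(q_m, s_j):
    """f_o: flattened outer product of member features q_m and job features s_j."""
    return np.outer(q_m, s_j).ravel()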
4.1.3 Experiments of Performance

We consider the following models in our offline experiments (where g is shorthand for the link function g(E[y_mjt])):

• LR-Cosine: the baseline logistic regression model that only includes the cosine-similarity features, g = f'_c b.
• LR-Full: the logistic regression model that includes all global features, x_mjt = {q_m, s_j, f_c, f_o}, g = x'_mjt b.
• MF: a baseline matrix factorization model, g = x'_mjt b + u'_m v_j.
• GLMix-Member: the GLMix model that includes only the member-level random effects, g = x'_mjt b + s'_j α_m.
• GLMix-Job: the GLMix model that includes only the job-level random effects, g = x'_mjt b + q'_m β_j.
• GLMix-Full: the full GLMix model as described in Equation (1).
• GLMix-Full+MF: an extension of GLMix that adds a matrix factorization component, as in Equation (4).

For each model listed above, we tried multiple values of the hyper-parameters for the priors (i.e., λ_b, λ_α and λ_β), picked the ones with the best AUC performance on the validation data, and report their AUC on the test data in Table 1.

Table 1: Offline model performance for the LinkedIn job recommendation data.

    Model            AUC
    LR-Cosine        0.664
    LR-Full          0.723
    MF               0.758
    GLMix-Member     0.772
    GLMix-Job        0.780
    GLMix-Full       0.799
    GLMix-Full+MF    0.801

It is not surprising that adding the member and job random effects to the model dramatically improves the AUC performance (0.723 for LR-Full vs. 0.799 for GLMix-Full). There are also two other interesting observations: a) GLMix-Full significantly outperforms MF, which indicates that for this data, using features to interact with member/job IDs is more effective than interactions between member IDs and job IDs. We believe this is due to the sparse structure of the data and also because the features are very informative. b) Adding matrix factorization to GLMix-Full provides very little improvement in AUC (0.799 to 0.801). This indicates that GLMix already captures most of the signal in the data, and adding matrix factorization to pick up additional signal in the residual is not effective.
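The hyper-parameter search described above can be sketched as a simple grid search that refits the model for each combination of prior precisions and keeps the one with the best validation AUC; `fit_glmix` and `predict` stand in for an actual training routine (e.g., the PBCD sketch in Section 3) and scorer, and are hypothetical.

from itertools import product
from sklearn.metrics import roc_auc_score

def select_hyperparameters(fit_glmix, predict, train, valid, grid=(0.1, 1.0, 10.0)):
    """Pick (lambda_b, lambda_alpha, lambda_beta) by validation AUC."""
    best = None
    for lam_b, lam_a, lam_beta in product(grid, repeat=3):
        model = fit_glmix(train, lam_b, lam_a, lam_beta)   # hypothetical trainer
        auc = roc_auc_score(valid["y"], predict(model, valid))
        if best is None or auc > best[0]:
            best = (auc, (lam_b, lam_a, lam_beta), model)
    return best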
4.1.4 Experiments of Scalability

In this section, we study a) how GLMix scales up as both the data size and the number of Spark executors increase, and b) how GLMix speeds up as the number of Spark executors increases while the data size is fixed. In this experiment, the size of the data is controlled by the number of days used to collect it; e.g., 60 days of data is about half the size of 120 days of data. The experimental results can be found in Figure 4. In general, we find that our proposed algorithm has good scalability properties and achieves roughly linear speed-up on the Spark cluster, provided that the amount of data processed per executor does not become so small that system overheads start to dominate. In Figure 4, we also provide details on how the data complexity C, as defined in Equation (11), changes as the data size increases. We note that the scalability of our implementation is quite robust to the increasing complexity of the data set. If we had used the naive implementation, where the network I/O complexity is proportional to the data complexity, it would have been much worse.

Figure 4: Left and mid panels: scale-up and speed-up results on the Spark cluster. The scale-up experiment (left panel) increases the number of days of data and the number of Spark executors at the same time; the speed-up experiment (mid panel) fixes the number of days of data and increases the number of Spark executors. Right panel: data complexity as a function of the number of days of data.

4.2 GLMix in Production System

LinkedIn operates one of the largest job recommendation systems in the world, which serves relevant jobs to members who visit the Jobs Homepage every day. For each member visit, we are required to rank millions of jobs using relevance models and serve the top-ranked ones in less than a second. Apart from the model training challenges, to productionize a GLMix model with a huge number of features and random effects, scoring and ranking become another critical challenge to overcome.

Our approach to mitigating this challenge is a two-stage ranking strategy. In the first stage of the ranking, we convert the problem into a search-like problem by leveraging the open-source library Apache Lucene (https://lucene.apache.org/). In this stage, an inverted job index is built, where the keys are the job features and the values are the corresponding job IDs. We run the baseline model LR-Cosine in this stage, which works efficiently with Lucene (since all of its features are cosine-similarity based), and select the top K results (e.g., K = 1000) for the second-stage ranking. In the second stage, we re-score the top K results using the full GLMix model and show the top-ranked jobs to the member.

Although such a two-stage strategy can be implemented online, at this time we choose an offline approach, shown in Figure 5, for more flexibility in experimentation and because jobs are usually more evergreen than news articles. Through Hadoop Map-Reduce, we pre-generate the top job recommendations for all members and push them to an online distributed key-value store using Voldemort (http://www.project-voldemort.com/voldemort/), where the keys are member IDs and the values are the recommended job IDs. When a member visits the Jobs Homepage, we simply retrieve the recommended jobs from the store, do some post-processing (e.g., freshness control), and show them to the member. This offline approach to serving online recommendations has proven to be very valuable in facilitating our experimentation and exploration of new features. Through A/B tests, whatever member or job features become available on our Hadoop system, we can quickly test them before fully implementing the features in the online retrieval systems, which usually takes much more time.

Figure 5: High-level architecture of GLMix in the job recommendation production system.

Note that the two-stage scoring approach is very important even if we move the computation offline: although it is possible to use Hadoop Map-Reduce to score millions of jobs for each of the 400M LinkedIn members using GLMix, the amount of Hadoop resources it consumes is too high to be cost-efficient. We also note that it is necessary to retrain the GLMix model frequently (e.g., every hour or every day). Our offline experiments show that the AUC of GLMix-Full drops from 0.799 to 0.765 if we do not update the model for 20 days.
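The two-stage ranking strategy can be sketched as follows: a cheap first-stage scorer selects the top K candidate jobs, and only those are re-scored with the full GLMix model; `first_stage_score` and `glmix_score` are placeholders for the Lucene-backed LR-Cosine scorer and the full model, respectively.

import heapq

def two_stage_rank(member, candidate_jobs, first_stage_score, glmix_score, k=1000, top=20):
    """Stage 1: keep the K best jobs under the cheap model; Stage 2: re-rank them with GLMix."""
    stage1 = heapq.nlargest(k, candidate_jobs, key=lambda job: first_stage_score(member, job))
    stage2 = sorted(stage1, key=lambda job: glmix_score(member, job), reverse=True)
    return stage2[:top]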
4.3 Online Experimental Results

We deployed the GLMix-Full model to the LinkedIn job recommendation system as described in Section 4.2 and ran an A/B test to compare it with the existing baseline model in the production system, which uses the LR-Cosine model. Each model served a randomly selected 2% of members, and we report the online experimental results for a week's worth of data in November 2015.

In Figure 6 we show the lift in job application clicks and job detail views comparing the GLMix model against the baseline LR-Cosine model, for each day of the week. We observe that every day there is a 20%-40% lift in job application clicks and, interestingly, also a consistent 10%-20% lift in job detail views. This shows that a more relevant job recommendation model, which is optimized for job applications, can also dramatically increase the number of times
people click on the job thumbnails to view a given job posting in detail.

Figure 6: The lift in job application clicks and job detail views comparing GLMix-Full and the baseline model LR-Cosine, for each day of the week.

To further understand the experimental results and investigate the effectiveness of the job random effects in the GLMix model, we segment our online A/B test data by the number of impressions per job, ranging from 1 to 100, and show the lift in job application clicks and detail views comparing GLMix with LR-Cosine in Figure 7. It is interesting to observe that as a job receives more impressions (which is when the per-job random effects become more effective), GLMix generates larger lifts over the baseline. In the meantime, note that the distribution of the number of impressions per job changes very little, indicating that the GLMix model does not simply serve more popular jobs to members to obtain such a lift in application clicks (in that case the total number of jobs that receive one impression would have shrunk dramatically); instead, it serves more relevant jobs to members while keeping the distribution of impressions per job unchanged.

Figure 7: The #app and #view curves are the lift in job application clicks and job detail views comparing GLMix-Full against the baseline model LR-Cosine, for segments of jobs with different numbers of impressions. The #job curve shows the lift in the number of jobs per segment comparing GLMix-Full and LR-Cosine.

Similarly, we study the effectiveness of the member random effects by segmenting the data into 8 buckets based on the number of visits per member, in Figure 8. We observe similar rising patterns in the lift of job application clicks as the number of visits per member increases. There is a slight upward pattern in the job detail view curve as well, but it is not very significant. The distribution of the total number of visits per member remains unchanged, since this is a randomized A/B test experiment.

5. EXPERIMENTS ON PUBLIC DATA SETS

In this section, we report the performance of GLMix on two public data sets to demonstrate its modeling capability.

KDD Cup 2012 Track 2: We obtained this data set from https://www.kddcup2012.org/c/kddcup2012-track2. Recall that N_r and P_r denote the random effect size and the feature dimension of random effect type r ∈ R, respectively, and complexity = N_r * P_r, which represents the size of the design matrix resulting from the outer product between the random effect IDs and the features. Basic statistics of the data set are summarized in Table 2. All the features used in the experiments are raw features from the data set. No feature engineering has been
The accuracy results of di↵erent models, measured by AUC are shown in Table 3. The GLMix model with all random e↵ects (G-All) significantly outperforms LR and other variants of GLMix with fewer random e↵ects (e.g., G-Query, which only contains the per-query random e↵ect coecients). This shows that adding random e↵ects can usually improve model accuracy. We also compare GLMix to other existing methods. In particular, G-All outperforms both RankMF [22] and FM [18] based on their published results. When comparing to the best models in the KDD Cup competition, G-All is ranked between top 5 to top 10. One thing to note is that all of the models that outperform G-All are ensemble methods, and G-All is the best among the single models. Type user advertiser ad query title description keyword Nr NrPr 22,023,547 286,306,111 14,847 641,707 24,122,076 3,735,796 2,934,101 1,188,089 14,450,644,488 624,575,989,928 4.5192951e+12 115,809,676 90,957,131 36,830,759 Table 2: Summary of KDD Cup 2012 Track 2 Data Yahoo! Music: We obtain this data set from Yahoo! Web-