logo资料库

Deep Collaborative Filtering via Marginalized Denoising.pdf

第1页 / 共10页
第2页 / 共10页
第3页 / 共10页
第4页 / 共10页
第5页 / 共10页
第6页 / 共10页
第7页 / 共10页
第8页 / 共10页
资料共10页,剩余部分请下载后查看
Deep Collaborative Filtering via Marginalized Denoising Auto-encoder ∗ Sheng Li Northeastern University shengli@ece.neu.edu Boston, MA, USA Jaya Kawale Adobe Research San Jose, CA, USA kawale@adobe.com Yun Fu Northeastern University yunfu@ece.neu.edu Boston, MA, USA ABSTRACT Collaborative filtering (CF) has been widely employed within rec- ommender systems to solve many real-world problems. Learning effective latent factors plays the most important role in collabora- tive filtering. Traditional CF methods based upon matrix factoriza- tion techniques learn the latent factors from the user-item ratings and suffer from the cold start problem as well as the sparsity prob- lem. Some improved CF methods enrich the priors on the latent factors by incorporating side information as regularization. How- ever, the learned latent factors may not be very effective due to the sparse nature of the ratings and the side information. To tackle this problem, we learn effective latent representations via deep learning. Deep learning models have emerged as very appealing in learning effective representations in many applications. In particular, we propose a general deep architecture for CF by integrating matrix factorization with deep feature learning. We provide a natural in- stantiations of our architecture by combining probabilistic matrix factorization with marginalized denoising stacked auto-encoders. The combined framework leads to a parsimonious fit over the latent features as indicated by its improved performance in comparison to prior state-of-art models over four large datasets for the tasks of movie/book recommendation and response prediction. Categories and Subject Descriptors H.2.8 [Database Management]: Data Mining; H.3.3 [Information Search Retrieval]: Information Filtering Keywords Collaborative filtering; matrix factorization; deep learning; denois- ing auto-encoder 1. INTRODUCTION Recommendation is a fundamental problem that has gained ut- most importance in the modern era of information overload. The ∗ Part of the work is completed during Sheng Li’s internship at Adobe Research. Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full cita- tion on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or re- publish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from Permissions@acm.org. CIKM’15, October 19–23, 2015, Melbourne, Australia. c 2015 ACM. ISBN 978-1-4503-3794-6/15/10 ...$15.00. DOI: http://dx.doi.org/10.1145/2806416.2806527. goal of recommendation is to help users find the item that they maybe potentially interested in from a large repository of items. Recommender systems are widely used by websites (e.g., Amazon, Google News, Netflix, and Last.fm) in various contexts to target customers and provide them with useful information. A widely used setting of recommendation system is to predict how a user would rate an item (such as a movie) if only given the past rating history of the users. Many classical recommendation methods have been proposed during the last decade. The two broad categories of recommendation systems are content filtering approaches and collaborative filtering (CF) based methods. The CF based meth- ods have attracted more attention due to their impressive perfor- mance [1]. Among various CF methods, matrix factorization has emerged as a powerful tool to perform recommendations in large datasets [2, 3]. Learning effective latent factors plays the most important role in matrix factorization based CF methods. Traditional matrix fac- torization methods for CF directly learn the latent factors from the user-item rating matrix [2, 4]. One of the main challenges faced by these systems is to provide a rating when a new user/item ar- rives in the system, which is also known as the cold start scenario. The cold start problem is circular in nature as - the system will not recommend an item unless it has some ratings for it and unless the system recommends it will not get ratings for it. Another practical challenge is learning the appropriate latent factors when the rating matrix is sparse, which is often the case in many real world scenar- ios. In order to overcome these challenges, researchers have sug- gested to incorporate additional sources of information about the users or items, also known as the side information. This side infor- mation could be obtained from the user/item profiles, for example, demographics of a user, genre of a movie, etc. The user demo- graphics could be used to infer the relationships between the users and similarly the item similarity can be used to automatically as- sign ratings to new items. The use of side information to aid matrix factorization has been successfully applied by various prior work, for example [5, 6, 7]. These methods, however, only utilize the side information as regularizations in the model, and the learned latent factors may not be very effective due to the sparse nature of the rat- ings and the side information. In order to make matrix factorization based methods effective in such a setting, it is highly desirable to learn and extract discriminative features from the datasets. One of the powerful approaches to capture feature interactions and complex relations that has emerged in the recent past is deep learning [8, 9]. Deep learning has attracted a lot of attention be- cause of its promising performance to learn representations on var- ious tasks. Deep neural networks have been shown to achieve state- of-the-art results in computer vision, speech recognition and ma- chine translation. The application of deep learning in recommen- 811
dation systems, however, is very recent. With large-scale data and rich-side information available, it is now practicable to learn latent factors through deep architectures. Researchers have invested in modifying deep learning algorithms like Restricted Botzmann Ma- chines or Convolutional Neural Networks or Deep Belief Networks directly for the task of collaborative filtering [10, 11, 12, 13, 14]. However, there are no prior work that bridge together matrix fac- torization with deep learning methods with the notable exception of [15]. In this paper, we present a deep learning model for collab- orative filtering that tightly couples matrix factorization based col- laborative filtering with deep learning algorithm namely marginal- ized denoising auto-encoders (mDA) [16]. Unlike [15] which inte- grates collaborative topic regression and bayesian stacked denois- ing auto-encoders and requires learning of a large number of hy- per parameters using an EM style algorithm, our approach uses a much more efficient architecture based upon mDA and stochastic gradient descent and is thus computationally efficient and highly scalable. This paper makes the following contributions: • We propose a general deep architecture named deep collab- orative filtering (DCF), which integrates matrix factorization and deep feature learning. It models the mappings between the latent factors used in CF and the latent layers in deep models. • We present a practical instantiation (i.e., mDA-CF and mSDA- CF) of the proposed architecture, by utilizing the probabilis- tic matrix factorization and mDA. The scalability and low computational cost of the mDA makes it a highly attractive deep learning tool, which is unlike the prior work [15]. • We evaluate the performance of our model on three real- world applications, movie recommendation, book recommen- dation and response prediction. Our model outperforms con- ventional CF methods. 2. RELATED WORK In general, our work is closely related to the following topics: matrix factorization based collaborative filtering, and deep learn- ing based collaborative filtering. We will discuss the two in the following subsections. 2.1 Matrix Factorization for Collaborative Fil- tering The importance of accurate recommendation techniques moti- vated by wide ranging applications has fuelled a great amount of academic as well as industrial research in this area [17]. Recom- mender systems are most often based on collaborative filtering and there are typically two approaches that are widely used. In neigh- borhood methods, the similarity between users based on the content they have consumed and rated is the basis of a new recommenda- tion. A related but intrinsically more powerful approach has been the use of latent factor models. Matrix factorization (MF) is the most popular technique to derive latent factor models and their suc- cess at the Netflix competition have highlighted their strength [18, 2]. For example, the given matrix X ∈ RN×M consisting of the item preferences of the users can be decomposed as a product of two low dimensional matrices U and V . The decomposition can be carried out by a variety of methods ranging from SVD based approaches [19] to the relatively new non-negative matrix factor- ization approach [20]. One classical MF method is probabilistic matrix factorization (PMF) [4]. The underlying assumption behind this method is that the prior probability distribution of the latent factors and the probability of the observed ratings given the latent factors follows a Gaussian distribution. Many algorithms have been developed to enhance the performance of PMF, by designing the Bayesian versions [21, 22, 23], or incorporating side information, such as social relationships [24, 5, 25]. Although promising, matrix factorization methods suffer from the problem of cold-start, i.e. what recommendations to make when a new user/item arrives in the system. Another problem often pre- sented in many real world applications is data sparsity or reduced coverage. Incorporating side information has shown promising per- formance in collaborative filtering in such scenarios. Porteous et al. proposed a Bayesian matrix factorization (BMF) approach with side information and Dirichlet process mixtures [26]. A varia- tional BMF method and a hierarchical BMF method that utilizes side information were also proposed in [27] and [28], respectively. Hu et al. proposed a cross-domain triadic factorization (CDTF) method [29], which leverages the information from other domains. The methods discussed above are proposed for addressing recom- mendation problems. Recently, MF based collaborative filtering is also applied to response prediction [30, 31]. The afore-mentioned approaches can alleviate the problem of cold start and data sparsity but might still suffer when the side information is sparse. Learn- ing effective features is critical in matrix factorization. Recently, deep learning based methods have emerged as a powerful tool for learning representation and are widely used in many applications ranging from computer vision to speech recognition and machine translation. In this paper, our goal is to combine deep learning based methods with matrix factorization for collaborative filtering. In the next subsection, we survey the application of deep learning based methods for collaborative filtering. 2.2 Deep Learning for Collaborative Filtering The application of deep learning models to the task of collabo- rative filtering is very new and there are not much attempts in this direction. Salakhutdinov et al. [10] were the first to apply deep learning to the task of collaborative filtering. They modified the restricted Boltzmann machines as a two-layer undirected graphi- cal model consisting of binary hidden units and softmax visible units for the task of collaborative filtering. They designed an effi- cient learning procedure called the Contrastive Divergence (CD) to maximize an approximation to the true likelihood function. They also proposed a conditional RBM model and inference procedures. They tested the performance of the model on the Netflix dataset for movie recommendation and showed that their model performs well as compared to the baseline methods. Truyen et al. [14] proposed ordinal Boltzmann machines for col- laborative filtering. They studied the parameterizations for han- dling the ordinal nature of ratings, and presented the integration of multiple Boltzmann machines for user-based and item-based pro- cesses. Recently, some deep learning models learn latent factors from content information such as raw features of audio or articles [32, 33]. Wang et al. [12] utilized deep belief nets (DBN) for music recommendation, which unifies feature extraction and recommen- dation of songs in a joint framework. They assumed that a user has a feature vector βu drawn from a Gaussian prior and the songs have a feature vector xv. They automatically learned the feature vec- tors of the songs using a deep belief network which is a generative probabilistic graphical model with hidden nodes and observation. It has millions of parameters to be learned from the training data. The authors used stacked layers of Restricted Boltzmann Machines for pretraining in an unsupervised fashion, and then employed the Maximum Likelihood Estimation (MLE) for supervised learning. 812
Oord et al. [13] addressed the music recommendation problem using the convolutional neural networks. They first conducted a weighted matrix factorization to handle implicit feedback and ob- tained latent factors for all songs. After that they used deep learning to map audio content to those latent factors. In particular, they ex- tracted local features from audio signals and aggregated them into a bag-of-words representation. Finally, the deep convolutional net- work was employed to map this feature representation to the latent factors. They tested their algorithm on the Million song dataset and showed that their model improved the recommendation perfor- mance by augmenting the audio signals. All the previously mentioned approaches mainly modify the deep learning algorithms for the task of collaborative filtering and do not directly couple matrix factorization with deep learning models. Most recently, Wang et al. [15] proposed a hierarchical Bayesian model called collaborative deep learning (CDL) which tightly cou- ples stacked denoising auto-encoders (SDA) and collaborative topic regression (CTR). This work is the closest to our work but differs from ours in many significant ways as follows - (i) CDL utilized a Bayesian formulation of SDA. The generative process of CDL consists of drawing samples for CDL uses an EM-style algorithm for obtaining the MAP estimates of Bayesian SDA, and thus it has to learn a large number of parameters. Our model employs a more efficient architecture, marginalized SDA (mSDA), which computes the parameters in closed form and is thus highly efficient and scal- able. (ii) CDL only extracts deep features for items, whereas our model learns deep features for both items and users. 3. PRELIMINARIES Before we describe our general framework, we discuss the pre- liminaries as follows. 3.1 Matrix Factorization Matrix Factorization (MF) is the most effective collaborative fil- tering approach. It allows us to discover the latent factors of user- item interactions by factorizing the interactions matrix into a joint latent space of user and item features respectively. It proceeds by decomposing the original rating matrix R ∈ Rm×n consist- ing of ratings by m users for n items into two low-rank matrices U ∈ Rm×d and V ∈ Rn×d consisting of the user and item features respectively of rank d. The system learns the latent factors by minimizing the following objective function - l(R, U, V ) +β (U2 F + V 2 F), arg min U,V (1) where l(R, U, V ) is the loss function of predicting rating using the latent factors U and V and the last two terms are the regularizations used to avoid overfitting. · F denotes the Frobenius norm. Many MF-based methods have been proposed, by designing a sophisticated loss function l(R, U, V ). Existing works usually pose some assumptions on the latent factors U and V in (1). Probabilis- tic matrix factorization (PMF) [4] provides a probabilistic founda- tion by assuming a probabilistic linear model with Gaussian obser- vation noise and Gaussian priors on the latent factors. M N i=1 j=1 p(R|U, V, σ2) = M N (Ui|0, σ2 i Vj, σ2)Iij N (Rij|U T N (2) (3) p(U|σ2 u) = u) ; p(V |σ2 v) = N (Vj|0, σ2 v). i=1 j=1 The model is fitted by finding a MAP estimate of the parameters. Maximizing the log posterior leads to the following objective func- tion that can be solved using stochastic gradient descent (SGD): arg minU,V E = R − U V 2 F + β(U2 F + V 2 F). To improve the recommendation performance of PMF, Bayesian probabilistic matrix factorization method (BPMF) [34] considers a full Bayesian treatment of the parameter space instead of a point estimate used in PMF. In the weighted matrix factorization (WMF), )2 l(R, U, V ) = C (R − U V F, where C is the weight ma- trix [35]. Our model is based upon the probabilistic matrix factor- ization approach as it has shown to have a very good performance on several datasets and at the same time is computationally more efficient as compared to BPMF. When side information are available, some MF methods make use of these additional information via regression to predict the rat- ings [28]. 3.2 Marginalized Denoising Auto-encoder (mDA) As a specific form of neural network, an autoencoder takes a given input and maps it (encodes) to a hidden representation via a deterministic mapping. Denoising autoencoders reconstruct the input from a corrupted version of the data with the motivation of learning a more robust mapping from the data. Various types of au- toencoders have been developed in the literature and have shown promising results in several domains [36, 37]. Moreover, de- noising autoencoders can be stacked to construct a deep network also known as stacked denoising autoencoder (SDA) which allows learning higher level representations [38]. Despite their state-of- the-art performance, one of the main drawbacks of SDA is the high computational cost of training, as they rely upon the iterative and numerical optimization techniques to learn a large amount of model parameters. Marginalized denoising auto-encoder (mDA) [16] is a variant of SDA that avoids the high computational cost by marginalizing out the random feature corruption and thus has a closed-form solution to learn model parameters. Therefore, mDA is highly scalable and faster than SDA. It proceeds as follows: Given a sample set X = [x1,··· , xk], mDA considers multiple passes (e.g., c-times) of random corruptions over X to obtain X. It then reconstructs the input with a mapping W that minimizes the squared loss as follows: L(W ) = 1 2ck xi − W ˜xij2, (4) c k j=1 i=1 where ˜xij represents the jth corrupted version of the original input xi and W represents the mapping that is expected to minimizes the loss function. The above objective can be rewritten in the matrix form as L(W ) = ¯X − W X2 F, (5) where ¯X = [X,··· , X] is the c-times repeated version of X, and ˜X is the corresponding corrupted version. This problem is similar to the ordinary least squares problem and has the analytical solution −1, where S = ˜X ˜X T , Q = ¯X ˜X T When as given by W = SQ c → ∞ in (4), we can derive the expectations of Q and P , and obtain the closed form solution of the mDA [16]. Further, multi- ple mDAs can be stacked to form a deep architecture, marginalized stacked denoising auto-encoder (mSDA). mSDA usually enhances the performance of mDA. Most recently, a nonlinear version of mDA is presented [39]. 813
Table 1: Summary of notations. Notation m n d p q Description Number of users Number of items Dimension of latent factors Dimension of user features Dimension of item features Rating matrix Latent factors of users Latent factors of items R ∈ Rm×n U ∈ Rm×d V ∈ Rn×d X ∈ Rp×m Side information of users Y ∈ Rq×n Side information of items W1 ∈ Rp×p Mapping function for X in auto-encoder P1 ∈ Rp×d Projection matrix for U 4. OUR APPROACH As we noted earlier, deep learning models have been proven to be very effective in extracting high-level representations from the raw input data in several learning tasks. The learned features represent high-level knowledge. In the collaborative filtering problem, we face a similar challenge of inferring effective latent and high-level knowledge on user preferences from the raw inputs, including the rating matrix and related features. MF based CF methods are able to capture the implicit relationship between the users and the items successfully, but they suffer from the cold start and data sparsity problems. Therefore, it is reasonable to draw strength from the deep models to assist the collaborative filtering process. Table 1 summarizes the symbols used in our approach. Next, we describe a general framework that integrates matrix factorization and deep feature learning. 4.1 Deep Collaborative Filtering (DCF): A Gen- eral Framework In this section, we introduce the proposed deep collaborative fil- tering (DCF) framework which unifies the deep learning models with MF based collaborative filtering. Fig. 1 illustrates the idea of our DCF framework. DCF is a hybrid model, which makes use of both rating matrix and side information and bridges together matrix factorization and feature learning. Given a user-item rating matrix R, the user side information X and the item side information Y , DCF jointly decomposes R and learns latent factors (i.e., U, V ) from ratings and side information (i.e., X and Y ) through the following formulation: arg min U,V l(R, U, V ) +β (U2 F + V 2 F) +γL(X, U ) +δ L(Y, V ), (6) where β, γ and δ are the trade-off parameters. There are two key components in the DCF framework: (i) the function l(R, U, V ) for decomposing the rating matrix R into the two latent matrices; (ii) the functions L(X, U ) and L(Y, V ) that connect the user/item contextual features with the latent factors. The first component derived through matrix factorization extracts latent knowledge from the rating matrix. The second component devised using deep learning models establishes connections of the side information with the latent factors. 4.2 DCF using PMF + mDA A natural instantiation of DCF is combining probabilistic matrix factorization (PMF) with marginalized denoising auto-encoders (mDA). PMF is a widely applied CF approach with excellent performance, and mDA is a powerful tool in extracting high-level features from Input layer Hidden layer Output layer User--Item Rating matrix User Feature Feature Item Input layer Hidden layer Output layer Figure 1: Illustration of DCF framework. The inputs are user- item rating matrix R, the user feature set X and the item fea- ture set Y . Our approach jointly decomposes R and learns la- tent factors (i.e., U, V ) from ratings and side information (i.e., X and Y ). In particular, the latent factors are extracted from the hidden layer of deep networks. raw inputs. The combination of the two leverages their benefits for learning even richer models. 4.2.1 mDA based Collaborative Filtering (mDA-CF) Let ¯X ∈ Rp×cm and ¯Y ∈ Rq×cn denote the c-times repeated versions of X and Y respectively and let ˜X and ˜Y denote their corrupted versions. As discussed before, we utilize the loss func- tion of PMF to decompose rating matrix R, i.e., l(R, U, V ) = )2 A (R − U V F, where A is the indicator matrix indicating the non-empty entries in R and denotes the Hadamard or point- wise product. The objective function of mDA-CF is formulated as follows: LU (W1, P1, U ) +L V (W2, P2, V )+ αA (R − U V )2 F + β(U2 F + V 2 F), (7) arg min U,V,W1, W2,P1,P2 where LU (W1, P1, U ) = ¯X − W1 ˜X2 LV (W2, P2, V ) = ¯Y − W2 ˜Y 2 F + λP1U F + λP2V − W1X2 F, − W2Y 2 F, W1 ∈ Rp×p and W2 ∈ Rq×q are reconstructive mappings, P1 ∈ Rp×d and P2 ∈ Rq×d are projection matrices, α, β and λ are trade- off parameters. Note that, we set γ and δ in (6) to 1 for simplicity. The first term in LU (W1, P1, U ) denotes the learning process in marginalized denoising auto-encoder. It measures the reconstruc- tion error between input user features ¯X and the mapped features of corrupted inputs, i.e., W1 ˜X. W1 is the learned mapping that is ex- pected to minimize the loss. The second term connects the hidden layer feature W1X and the latent factor U. Generally, the latent factor has much lower dimension than the raw features. Therefore, we add a low-dimensional projection P1 that maps latent factor to the feature space. 814
Optimization Although the optimization problem in (7) is not jointly convex in all the variables, it is convex to each of them when fixing the others. Hence, we can alternately optimize for each of the variables in (7). The detailed procedures are provided below. First, we derive a solution to solve W1 and W2 using [16]. By ignoring the variables irrelevant to W1, the objective (7) can be rewritten as ¯X − W1 ˜X2 F + λP1U − W1X2 F. (8) arg min W1 Inspired by mDA, we consider the infinitely many copies of noisy data, and obtain the optimal solution W1 = E[S1]E[Q1] −1, (9) where S1 = ¯X ˜X . An efficient solver for calculating the expectations E[S1] and E[Q1] is provided in [16]. and Q1 = ¯X ˜X + λP1U + λXX X Similarly, we can derive the closed-form solution to W2 W2 = E[S2]E[Q2] −1, where S2 = ¯Y ˜Y + λP2V Y and Q2 = ¯Y ˜Y + λY Y . Next, by dropping the irrelevant variables w.r.t. P1, the objective function becomes λP1U − W1X2 F. arg min P1 We can obtain the closed-form solution as P1 = W1XU (U −1. U ) Similarly, the optimal solution of P2 is P2 = W2Y V (V −1. V ) To solve for the latent factors U and V , we use the popular stochastic gradient descent (SGD) algorithm. In particular, when other variables irrelevant to U and V are fixed, we use f (U, V ) to denote the objective in (7). The update rules are: ui = ui − η ∂ vj = vj − η ∂ ∂ui ∂vj f (U, V ), f (U, V ), (14) where η is the learning rate , and the detailed derivatives are defined as ∂f (U,V ) ∂ui ∂f (U,V ) ∂vj 1 (P1ui − (W1X)i)) + βui =λ(P − α (Ri,j − uiv j )vj. (i,j)∈A 2 (P2vj − (W2Y )j)) + βvj =λ(P − α (Ri,j − uiv j )ui. (i,j)∈A The above steps are repeated until convergence. Finally, we ob- tain the latent factors U and V . Algorithm Complexity The steps of our mDA-CF approach are summarized in Algorithm 1. The learned latent factors U and V can be used to predict miss- ing entries in the rating matrix. In Algorithm 1, we have analytical solutions of Steps 3-6 which are efficient to compute. The ma- trix multiplication and inversion used in Step 5 and Step 6 cost O(p2m + pmd + d3) and O(q2n + qnd + d3), respectively. The Steps 8-9 are implemented in a batch-learning fashion, and cost Algorithm 1. mDA-CF Algorithm Input: Rating matrix R, user features X, item features Y , parameters λ, α, β. Output: Latent factors U, V 1: Initialize U, V , P1 and P2; 2: while validation error decreases, do 3: 4: 5: 6: 7: 8: 9: end for 10: 11: end while Update W1 using (9); Update W2 using (10); Update P1 using (12); Update P2 using (13); for each observed Rij, do Update ui using (14); Update vj using (14); O(tN ) to evaluate the gradients, where t is the number of iterations and N is the number of training ratings/responses in R. Consider- ing that N max{m, n, d}, the time complexity of Algorithm 1 is mainly determined by O(tN ). Hence, our approach owns a good scalability. We discuss the settings of the parameters in the exper- imental section. To further reduce the computational cost, some advanced distributed optimization algorithms could be applied to our model [40]. 4.2.2 Stacked mDA based Collaborative Filtering (mSDA- CF) Existing literature show that stacking multiple deep learning lay- ers together can usually generate rich features in the form of hid- den layers, and therefore results in better performance for vari- ous learning tasks. Inspired by the marginalized stacked denoising auto-encoders (mSDA) [16], we stack multiple mDA together, and present the mSDA-CF approach. We assume that only one hidden layer should be close to the latent factor. The reasons are two-fold. First, latent factors are high-level representations, which should correspond to the deeper layers in deep models. Secondly, latent factors should be unique, but different hidden layers have various representations. Therefore, enforcing the similarity between multiple hidden layers and latent factors is unreasonable. l+1 In our mSDA-CF model, we assume that the latent factors are layer, given the total number of layers generated from the is l. When we train the model for the rest layers, the parameters λ, α and β are simply set to 0. In particular, if i = , we only need to update W i 2 and ignore the other steps, where W i 2 denote the mappings in the i-th layer. One benefit of such setting is the time efficiency, as we do not increase too much computational burden when adding multiple layers. 1 and W i 1 and W i l+1 Another interesting problem is how to set the number of layers. The number of layers implies the model complexity, which is usu- ally related to the learning task and the size of training data. In the experiments we will discuss the influence of different number of layers. The detailed procedures of mSDA-CF are summarized in Algorithm 2. 2 2 4.3 Discussion We notice that most existing deep learning based collaborative filtering methods can be unified in our DCF framework. For exam- ple, Oord et al. [13] use deep convolutional neural networks (CNN) to predict latent factors from music audio using the following ob- (10) (11) (12) (13) (15) (16) 815
Algorithm 2. mSDA-CF Algorithm Input: Rating matrix R, user features X, item features Y λ, α, β, layers l. Output: Latent factors U, V 1: for i 1 :l , do l+1 2: 3: if i = , do 2 Update U and V using Algorithm 1, by setting valid values to λ, α and β; Update W i λ = 0, α = 0 and β = 0; 1 and W i 2 using Algorithm 1, by setting otherwise 4: 5: end if 6: 7: end for jective function: u,i min θ cu,i(pu,i − x u y i)2, (17) where xu, the latent factor of user, is estimated from weighted ma- i, the latent factors of audio, are trix factorization beforehand, and y i = CNN(fv, Ω). Here, fv denotes audio learned from CNN, i.e., y contents, and Ω denotes the model parameters in CNN. Eqn (17) can be interpreted in our DCF formulation (i.e., Eqn (6)). First, it utilizes the weighted matrix factorization as the loss function l(·). Secondly, the regularization function L(Y, V ) is implemented by CNN. Wang et al. [12] utilize deep belief networks (DBN) to build a hybrid model for collaborative filtering. The objective function is (ruv − β uxv − r +λγγ2 F + λyy2 F. uyv)2 + λββ − μ2 F (18) u,v∈I LHybrid = where xv, the latent factor of item, is obtained from DBN, i.e., xv = DBN(fv, Ω). Also, we can interpret model (18) in our DCF framework. First, loss function l(·) is implemented in a hybrid way, i.e., rating ruv is predicted by the sum of the CF part r uyv and the content part uxv. Secondly, DBN is employed to map the content features fv β and latent factor xv, which can be formulated by L(Y, V ) in DCF framework. Meanwhile, γ in (6) is set to 0. L = − Wang et al. [15] propose a model with Bayesian stacked denois- ing auto-encoders (SDAE) and collaborative topic regression are integrated. The objective function is 2 (rij − u i vj)2 − λn vj − fe(X0,j∗, W +) fr(X0,j∗, W +) − Xc,j∗2 2 − freg, 2 − λv cij i,j 2 2 j 2 j (19) where fe(·) and fr(·) denote the encoding and decoding functions in SDAE, λn and λv denote the trade-off parameters, and freg de- note the regularization terms that prevent overfitting. Obviously, the model (19) can also be interpreted in the DCF framework. The loss function for decomposing rating matrix in (19) is in the standard matrix factorization fashion. Further, the second and the third terms in (19) infer the item latent factor using SDAE, which can be abstracted as L(Y, V ) in DCF. In summary, the existing deep collaborative filtering methods [13, 12, 15] can be unified in a common framework, DCF. We also notice that the existing models only infer latent factors of items us- ing deep models, whereas the latent factors of users are generated in traditional ways. Compared to existing works, our DCF framework provides a more flexible way to explore effective latent factors for users and/or items via deep models. 5. EXPERIMENTS We evaluate the performance of our mDA-CF and mSDA-CF approaches on three challenging tasks that are movie recommenda- tion, book recommendation and response prediction. 5.1 Movie Recommendation For movie recommendation, we conduct experiments on two bench- mark datasets MovieLens-100K and MovieLens-1M 1, which are commonly used for evaluating collaborative filtering algorithms. The MovieLens-100K dataset contains 100K ratings of 943 users and 1682 movies, and the MovieLens-1M dataset consists of about 1 million ratings of 6040 users and 3706 movies. Each rating is an integer between 1 (worst) and 5 (best). The ratings are highly sparse. Table 2 summarizes the statistics of datasets. We extract the features from side information of users and movies to construct X and Y . To summarize, the user information which consists of the user‘s age, gender and occupation were encoded into a binary valued vector of length 28. Similarly, the item feature information which consists of the 18 category of movie genre were encoded into a binary valued vector of length 18. Ratings were normalized to be zero-mean. As our model (7) is an extension of the representative collab- orative filtering method PMF, we mainly compare our approach with PMF and its several variants, such as the Biased PMF [41] and sparse covariance matrix factorization (SCMF) [23]. PMF and Biased PMF are special cases of our approach mDA-CF. For ex- ample, by adding zero weight to the first two terms in (7), mDA- CF degrades to PMF. Further, since our approach takes advantage of the side information of users and movies, we also compare our approach with the collaborative filtering method that incorporates side information, such as the Bayesian matrix factorization with side information (BMFSI) [26]. We employ the root mean squared error (RMSE) as the evalua- tion metric. RMSE is defined as: ij (Rij − ˆRij)2, Z P (20) i,j RMSE = 1 N where Rij is the ground-truth rating of user i for item j, ¯Rij de- notes the corresponding predicted rating, N is the total number of ratings in the test set, and Z P ij is a binary matrix that indicates test ratings. For all the compared methods, we set the regularization param- eters (e.g., λ, α and β) via 5-fold cross validation. Following the experimental settings in [23], we train each compared method with different percentages (50%, 80%, and 99%) of ratings. The train- ing data are randomly chosen from each dataset, and the remaining data are used for testing. This process is repeated five times, and we report the average RMSE. For the MovieLens-100K dataset, the parameters α, β and λ are set to 0.7, 0.004 and 0.2, respectively. The learning rate used in SGD is set to 0.002. Table 3 shows the average RMSE (with stan- dard deviations) of baselines PMF, Biased PMF, BMFSI, SCMF and our approaches, mDA-CF and mSDA-CF, on the MovieLens- 100K dataset. For each method, we have two settings for the di- mensions of latent factors, including d = 10 and d = 20. We can observe from Table 3 that (a) Our approaches (mDA-CF and mSDA-CF) achieve much bet- ter performance than PMF and Biased PMF, which are special cases of our approach. It demonstrates the effectiveness of in- corporating side information and deep architectures. 1http://grouplens.org/datasets/movielens/ 816
Table 2: Statistics of datasets used in our experiments. Dataset MovieLens-100K MovieLens-1M Book-Crossing Advertising #Users 943 6040 278858 448158 #Items 1682 3706 271379 737 User Features Sparsity 93.7% 95.8% 99.9% 99.7% Age, geolocation, domain, etc. Age, gender, and occupation Age, gender, and occupation Age, country, city, etc. Item Features Genres Genres Title, year, publisher, etc. Ad size, advertisor, etc. Table 3: Average RMSE (with standard deviation) of compared methods with different percentages of training data on MovleLens- 100K dataset. 99% 80% Method PMF [4] Biased PMF [41] BMFSI [26] SCMF [23] mDA-CF (Ours) mSDA-CF (Ours) d=10 0.9184±0.0265 0.8953±0.0189 0.8912±0.0127 0.8891±0.0146 0.8874±0.0142 0.8852±0.0135 d=20 0.9164±0.0261 0.8923±0.0150 0.8905±0.0154 0.8896±0.0198 0.8861±0.0153 0.8849±0.0167 d=10 0.9223±0.0056 0.9135±0.0039 0.9114±0.0031 0.9092±0.0033 0.9043±0.0043 0.9035±0.0028 d=20 0.9190±0.0052 0.9087±0.0030 0.9065±0.0029 0.9068±0.0036 0.9040±0.0045 0.9024±0.0030 d=10 0.9524±0.0023 0.9388±0.0029 0.9371±0.0023 0.9334±0.0025 0.9312±0.0026 0.9309±0.0026 50% d=20 0.9506±0.0024 0.9337±0.0020 0.9335±0.0025SS 0.9331±0.0021 0.9311±0.0025 0.9308±0.0028 (b) BMFSI is a Bayesian matrix factorization method that utilizes side information, so it performs better than PMF and Biased PMF that ignore such information. Our approach outperforms BMFSI, which validates the strengths of the latent factors learned by marginalized denoising auto-encoders. (c) Usually, deep models with multiple layers lead to better per- formance. Our mSDA-CF slightly enhances the performance of mDA-CF. We will show the influence of different number of layers in the next section. (d) Note that the basic component in our approach is PMF. Ac- tually, DCF is a general framework for collaborative filtering. When we implement l(R, U, V ) in (6) as some advanced MF methods (e.g., weighted matrix factorization), the results could be further improved. For the MovieLens-1M dataset, the parameters α, β and λ are set to 0.8, 0.003 and 0.3, respectively. Table 4 shows the aver- age RMSE (with standard deviations) of our approach and com- pared methods on the MovieLens-1M dataset. Basically, Table 4 shows similar phenomenon to that we observed from Table 3. The proposed mDA-CF and mSDA-CF approaches consistently achieve lower RMSE than compared methods. As before, we evaluate each method in two settings with different dimension of latent factors. Usually, d = 20 generates better results than d = 10 on the two MovieLens datasets. In addition, we compare our approaches with the state-of-the-art deep learning based collaborative filtering method, a joint user-item based restricted Boltzmann machine (UI-RBM) [11]. To conduct fair comparisons with UI-RBM, we adopt the mean absolute error (MAE) used in [11] as evaluation metric. The MAE is defined as follows L MAE = i |gi − pi| , L (21) where gi is the ground truth rating, pi is the predicted rating, and L is the total number of ratings. We follow the experimental settings in [11], and conduct 5-fold cross-validations. The dimension of latent factors is set to 20. Ta- Table 5: MAE of compared methods on MovieLens-100K dataset. Method PMF [4] U-RBM [11] I-RBM [11] I-RBM+INB [11] UI-RBM [11] mDA-CF (Ours) mSDA-CF (Ours) MAE 0.793 0.779 0.775 0.699 0.690 0.683 0.680 ble 5 shows the average MAE of compared methods on the MovieLens- 100K dataset. U-RBM and I-RBM denote the user-based model and item-based model, respectively. They are the baselines used in [11]. Table 5 shows that UI-RBM achieves much better perfor- mance than traditional methods like PMF, as it takes advantage of the deep feature learning. Our mDA-CF and mSDA-CF approaches obtain lower MAE than UI-RBM, demonstrating the effectiveness of our DCF framework compared to the RBM based deep learning models. 5.2 Book Recommendation For book recommendation, we utilize the Book-Crossing dataset 2, which contains 1149780 ratings for 271379 books from 278858 users. The rating scale is from 0 to 10 with the higher score indi- cating the more preference. Some attributes of users and books are also provided in this dataset. These attributes are encoded to binary vectors, which form the feature sets for users and books. Table 2 shows some statistics of this dataset. We follow the settings in [43], and conduct 5-fold cross-validation. The baselines include PMF, the implicit social matrix factoriza- tion (ISMF) [42] and the coupled item-based matrix factorization (CIMF) [43]. ISMF incorporates the implicit social relationships 2http://www2.informatik.uni-freiburg.de/ ~cziegler/BX/ 817
Table 4: Average RMSE (with standard deviation) of compared methods with different percentages of training data on MovleLens- 1M dataset. 99% Method PMF [4] Biased PMF [41] BMFSI [26] SCMF [23] mDA-CF (Ours) mSDA-CF (Ours) d=10 0.8424±0.0071 0.8408±0.0070 0.8391±0.0067 0.8364±0.0065 0.8335±0.0064 0.8320±0.0063 d=20 0.8388±0.0059 0.8367±0.0067 0.8340±0.0069 0.8323±0.0065 0.8317±0.0062 0.8304±0.0057 80% 50% d=10 0.8559±0.0022 0.8531±0.0019 0.8503±0.0017 0.8496±0.0019 0.8449±0.0015 0.8416±0.0014 d=20 0.8512±0.0017 0.8493±0.0020 0.8478±0.0019 0.8465±0.0018 0.8429±0.0013 0.8407±0.0011 d=10 0.8790±0.0009 0.8766±0.0015 0.8742±0.0016 0.8707±0.0013 0.8655±0.0007 0.8628±0.0005 d=20 0.8745±0.0011 0.8722±0.0012 0.8703±0.0010 0.8678±0.0007 0.8645±0.0006 0.8613±0.0006 Table 6: RMSE of compared methods on Book-Crossing data. RMSE (d = 10) RMSE (d = 50) Method PMF [4] ISMF [42] CIMF [43] mDA-CF (Ours) mSDA-CF (Ours) 3.7483 3.7440 3.7398 3.6610 3.6592 3.7452 3.7415 3.7372 3.6528 3.6513 between users and between items. CIMF makes use of the attributes of books in the model. Table 6 shows the RMSE of all compared methods. We can ob- serve that ISMF and CIMF obtain better results than the conven- tional PMF method, as they incorporate side information to their models. Our approaches obtain much lower RMSE than PMF, ISMF and CIMF in different settings. mSDA-CF achieves the best results among all competitors. 5.3 Response Prediction Response prediction is another interesting application of collab- orative filtering [30]. In particular, we consider a specific type of response prediction, which is click prediction. Given an online ad- vertisement, our task is to predict whether a given user will click it or not in the near future. Previous research works have proved that collaborative filtering (CF) methods are suitable for addressing the click prediction prob- lem. Unfortunately, there are few datasets available for evaluating the click prediction performance of CF models. To evaluate the performance of our model in real-world applications, we collected an advertising dataset at a large software company. The dataset is collected from its website, which contains the click responses for advertisements over 3 million users. For our purpose, we analyze the data from a 2-month period, from October 1, 2013 to November 30, 2013. The dataset used in our experiments contains 737 ads and 448,158 users. It also has the impression and click data of the advertise- ments. For each click event, we have the user ID, day, ad ID, page ID, country, browser, advertiser ID, and size of ad. Each record can be uniquely identified by a (user, ad, day) triplet. In addition, we have information about the user profiles. For each user, we have some attributes such as country, domain, etc. Apart from the user information, our dataset contains the meta-information of the ads such as ad size. In the experiments, we encode the demographic information (e.g., country, state, domain) into a binary valued vec- tor for each user. The attributes of ads (e.g., advertiser, ad size) are also encoded into binary vectors. Some statistics of this dataset can be found in Table 2. 1 0.9 0.8 0.7 0.6 0.5 0.4 0.3 0.2 0.1 R R F PMF Biased PMF BMFSI SCMF mDA−CF mSDA−CF 0 0 0.2 0.4 0.6 0.8 1 FAR Figure 2: ROC curves of our approach and compared methods on Advertising dataset (d=10). The advertising dataset used in our experiments is very sparse, which contains 880,569 click responses (density: 0.27%). We use the first 50% click responses for training, and the remaining data for testing. Following [44], we use the receiver operating charac- teristic (ROC) curve and the area under ROC (AUC-ROC) as our evaluation metrics. The true positive rate (TPR) and false positive rate (FPR) used for generating ROC curves are defined as follows: T P R = T P T P +F N , F P R = F P F P +T N , (22) where T P represents the true positives, F N represents the false negatives, T N represents the true negatives, and F P represents the false positives. We evaluate the performance of each compared method on the Advertising dataset. The major parameters α, β and λ are set to 0.8, 0.02 and 0.12, respectively. Two settings are utilized, by setting the latent factor dimension to 10 and 20, respectively. Fig. 2 shows the ROC curves of PMF, Biased PMF, BMFSI, SCMF, mDA-CF and mSDA-CF. Table 7 shows the corresponding AUC of all compared methods. We can observe that our approaches obtain higher AUC than other methods, which demonstrates the effectiveness of our framework. 5.4 Discussion So far, we have seen that our approach outperforms the exist- ing approaches on the different datasets. We also analyze the con- vergence property and parameter settings of our approach on the 818
分享到:
收藏