Building a Music Recommendation
Engine with Probabilistic Matrix
Factorization in PyTorch
Recommendation systems are one of the most widespread forms of
machine learning in modern society. Whether you are looking for
your next show to watch on Netflix or listening to an automated
music playlist on Spotify, recommender systems impact almost all
aspects of the modern user experience. One of the most common
ways to build a recommendation system is with matrix factorization,
which finds ways to predict a user’s rating for a specific product based
on previous ratings and other users’ preferences. In this article, I will
use a dataset of music ratings to build a recommendation engine for
music with PyTorch in an attempt to clarify the details behind the
theory and implementation of a probabilistic matrix factorization
model.
Introduction to Recommendation
Systems
Cameron Wolfe · Mar 22 · 15 min read
In this section, I will attempt to provide a quick, but comprehensive
introduction to recommendation systems and matrix factorization
before introducing the actual implementation and results for the
music recommendation project.
What is a Recommendation System?
To create a recommendation system, we need a dataset that includes
users, items, and ratings. Users can be customers buying products
from a store, a family streaming a movie from home, or, in our case,
even a person listening to music. Similarly, items are the products
that a user is buying, rating, listening to, etc. Therefore, the data set
for a recommendation system generally has many entries that include
a user-item pair and a value representing the user’s rating for that
item, such as the data set seen below:
User ratings can be collected either explicitly (asking user for a 5 star
ratings, asking whether a user likes a product, etc.) or implicitly
(examining purchase history, mouse movements, listening history,
etc.). Additionally, the values for ratings could be binary (a user
buying a product), discrete (rating a product 1–5 stars), or almost
anything else. In fact, the dataset used for this project represents
ratings as the total number of minutes a user has spent listening to an
artist (this is an implicit rating!).

Figure 1: Simple Data Set for Recommendation Systems

Each of the entries in the data set
represents a user-item pairing with a known rating, and, from these
known user-item pairings, a recommendation system tries to predict
user-item ratings that are not yet known. Recommendations may be
created by finding items that are similar to those a user likes (item-
based), finding similar users and recommending the stuff they like
(user-based), or finding a relationship between user-item interactions
as a whole. In this way, the model recommends different items that it
believes the user might enjoy that the user has not yet seen.
It’s pretty easy to understand the purpose and value of a good
recommender system, but here is a nice figure that I believe
summarizes recommendation systems nicely:
What is Matrix Factorization?
Matrix factorization is a common and effective way to implement a
recommendation system. Using the previously-mentioned user-item
dataset (Figure 1), matrix factorization attempts to learn ways to
quantitatively characterize both users and items in a lower-
dimensional space (as opposed to looking at every item the user has
ever rated), such that these characterizations can be used to predict
user behavior and preferences for a known set of possible items. In
recent years, matrix factorization has become increasingly popular
due to its accuracy and scalability.
Figure 2: Recommender systems aim to find items that a user enjoys, but has not yet seen [1]
Using the data set of user-item pairings, one can create a matrix such
that all rows represent different users, all columns represent different
items, and each entry at location (i, j) in the matrix represents user i’s
rating for item j. This matrix is illustrated in the following figure:
In the above figure, each row contains a single user’s ratings for every
item in the set of possible items. However, some of these ratings
might be empty (represented by “???”). These empty spots represent
user-item pairings that are not yet known, or that the
recommendation system is trying to predict. Moreover, this partially-
filled matrix, referred to as a sparse matrix, can be thought of as the
product of two matrices. For example, the above 4x4 matrix could be
created by multiplying two 4x4 matrices with each other, a 4x10
matrix with a 10x4 matrix, a 4x2 matrix with a 2x4 matrix, etc. This
process of decomposing the original matrix into a product of two
matrices is called matrix factorization.
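This decomposition is easy to verify numerically. The sketch below (with made-up factor values, not learned ones) multiplies a 4x2 matrix by a 2x4 matrix to produce a full 4x4 rating matrix:

```python
import torch

# Two small factor matrices: a 4x2 "user" matrix and a 2x4 "item" matrix
# (illustrative values, not learned ones).
users = torch.tensor([[1., 0.],
                      [2., 1.],
                      [0., 3.],
                      [1., 2.]])
items = torch.tensor([[5., 3., 1., 1.],
                      [1., 1., 4., 5.]])

# Their product is a full 4x4 user-item matrix: entry (i, j) is the
# inner product of user i's row and item j's column.
ratings = users @ items
print(ratings.shape)         # torch.Size([4, 4])
print(ratings[1, 2].item())  # 2*1 + 1*4 = 6.0
```

Matrix factorization runs this process in reverse: given the (partially known) 4x4 matrix, it searches for factor matrices whose product matches it.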
Matrix Factorization for Recommendation
Systems
So, we know that we can decompose a matrix by finding two matrices
that can be multiplied together to form our original matrix. But, how
does matrix factorization actually work in a recommendation system?
Figure 3: A User-Item Recommendation Matrix
When these two separate matrices are created, they each carry
separate information about users, items, and the relationships
between them. Namely, one of the matrices will store information
that characterizes users, while the other will store information about
the items. In fact, every row of the user (left) matrix is a vector of size
k that quantitatively describes a single user, while every column of
the item (right) matrix is a vector of size k that characterizes a single
item. The size of these vectors, k, is called the latent dimensionality
(or embedding size), and is a hyperparameter that must be tuned in
the matrix factorization model: a larger latent dimensionality will
allow for the model to capture more complex relationships and store
more information, but can also lead to overfitting.
This idea may seem weird at first, but if you examine closely how the
matrices are multiplied to create the user-item matrix, it actually
makes a lot of sense — the rating of user i for item j is obtained by
finding the inner product of user i’s vector from the left matrix and
item j’s vector from the right matrix. Therefore, if these vector
embeddings for each user and item are trained to store useful,
characteristic information, the inner product can accurately predict a
user-item rating based on an item’s characteristics and a user’s
preferences for such characteristics, which are all stored in the values
of the embedding vector. This concept is illustrated in the following
figure:
In the above figure, note how each entry of the product matrix is
derived. For example, entry (0, 0) of the user-item matrix,
representing user 0’s rating for item 0, is obtained by taking the inner
product of the 0th row on the left matrix (the embedding vector for
user 0) and the 0th column on the right matrix (the embedding
vector for item 0); this pattern continues for every rating in the user-
item matrix.

Figure 4: Matrix Factorization with User (left) and Item (right) Matrices

Therefore, as previously described, every rating is
predicted by taking the inner product of the corresponding user and
item embedding vectors, which is the core idea behind how matrix
factorization works for recommendation systems. Using common
training methods, like stochastic gradient descent or alternating least
squares, these two matrices, and the embedding vectors within them,
can be trained to yield a product that is very similar to the original
user-item matrix, thus creating an accurate recommendation model.
But wait… didn’t you say Probabilistic
Matrix Factorization?
Yes, I did! Don’t worry, probabilistic matrix factorization is very
similar to what I have been explaining so far. The key difference
between normal matrix factorization and probabilistic matrix
factorization is the way in which the resulting matrix is created. For
normal matrix factorization, all unknown values (the ratings we are
trying to predict) are set to some constant (usually 0) and the
decomposed matrices are trained to reproduce the entire matrix,
including the unknown values. It is not very hard to realize why this
might become an issue, as the matrix factorization model is
predicting values that we do not actually know and that could be
completely wrong. Additionally, how can we predict unknown user-
item ratings when we are already training our model to predict a
random user-defined constant for all of them?
To solve this problem, we have probabilistic matrix factorization.
Instead of trying to reproduce the entire resulting matrix,
probabilistic matrix factorization only attempts to reproduce ratings
that are in the training set, or the set of known ratings. Therefore, the
model is trained using only known ratings, and unknown ratings can
be predicted by taking the inner product of the user and item
embedding vectors.
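This distinction is easy to express in code. The sketch below (hypothetical ratings, with a boolean mask marking which entries are observed) contrasts a loss over every entry with a loss over known entries only:

```python
import torch

# Hypothetical ratings: zeros mark entries that are *unknown*,
# and a boolean mask records which entries are observed.
ratings = torch.tensor([[5., 3., 0., 1.],
                        [4., 0., 0., 1.],
                        [1., 1., 0., 5.]])
mask = ratings > 0

pred = torch.full_like(ratings, 3.0)  # some model's predicted ratings

# Plain matrix factorization: error over every entry, unknowns included.
full_loss = ((pred - ratings) ** 2).mean()

# Probabilistic matrix factorization: error over known entries only.
known_loss = ((pred - ratings)[mask] ** 2).mean()

print(full_loss.item(), known_loss.item())  # ≈5.0833 and 3.125
```

Only the masked loss is used for training in the probabilistic setting, so the model never wastes capacity reproducing placeholder zeros.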
. . .
Probabilistic Matrix Factorization
in PyTorch
Now that you understand the basics behind recommender systems
and probabilistic matrix factorization, I am going to outline how a
model for such a recommender system can be implemented using
PyTorch. If you are unfamiliar with PyTorch, it is a robust Python
framework often used for deep learning and scientific computing. I
encourage anyone who is curious to learn more about PyTorch to
check it out, as I use the framework heavily in my research and
personal projects.
What are we trying to implement?
As seen in Figure 4, our model for probabilistic matrix factorization is
just two matrices — how simple! One of the matrices will represent the
embeddings for each user, while the other matrix will contain
embeddings for each item. Each embedding is a vector of k values (k
is the latent dimensionality) that describe a user or item. Using these
two matrices, a rating prediction for a user-item pairing can be
obtained by taking the inner product of the user’s embedding vector
and the item’s embedding vector, yielding a single value that
represents the predicted rating. Such a process can be observed in the
following figure:
Figure 5: Predicting an unknown user-item rating
As can be seen above, a single rating can be determined by finding
the corresponding user and item vectors from the user (left) and item
(right) matrices and computing their inner product. In this case, the
result shows that the chosen user is a big fan of ice cream — what a
surprise! However, it should be noted that each of these matrices are
randomly initialized. Therefore, in order for these predictions and
embeddings to accurately characterize both users and items, they
must be trained using a set of known ratings so that the accuracy of
predictions generalizes to ratings that are not yet known.
Such a model can be implemented with relative ease using the
Embedding class in PyTorch, which creates a 2-dimensional
embedding matrix. Using two of these embeddings, the probabilistic
matrix factorization model can be created in a PyTorch module as
follows:
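The original code figure does not survive in this text; a minimal PyTorch sketch consistent with the description (the class and variable names here are my reconstruction, not necessarily the original's) might look like:

```python
import torch
import torch.nn as nn

class MatrixFactorization(nn.Module):
    # Two embedding matrices: one for users, one for items.
    # Names and details are a reconstruction, not the original figure.
    def __init__(self, n_users, n_items, k):
        super().__init__()
        self.user_emb = nn.Embedding(n_users, k)  # n_users x k matrix
        self.item_emb = nn.Embedding(n_items, k)  # n_items x k matrix

    def forward(self, cats):
        # cats: an N x 2 mini-batch of (user ID, item ID) index pairs.
        users = self.user_emb(cats[:, 0])  # N x k user vectors
        items = self.item_emb(cats[:, 1])  # N x k item vectors
        # Predicted rating = inner product of each user/item vector pair.
        return (users * items).sum(dim=1)

# Example: predict ratings for two user-item index pairs.
model = MatrixFactorization(n_users=100, n_items=50, k=8)
cats = torch.tensor([[0, 3], [7, 21]])
preds = model(cats)
print(preds.shape)  # torch.Size([2])
```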
The matrix factorization model contains two embedding matrices,
which are initialized inside of the model’s constructor and later
trained to accurately predict unknown user-item ratings. In the
forward function (used to predict a rating), the model is passed a
mini-batch of index values, which represent the identification
numbers (IDs) of different users and items. Using these indices,
passed in the “cats” parameter (an Nx2 matrix) as user-item index
pairs, the predicted ratings are obtained by indexing the user and
item embedding matrices and taking the inner product of the
corresponding vectors.
Adding Bias
The model from Figure 6 is lacking an important feature that is
needed for the creation of a powerful recommendation system — the
Figure 6: Simple matrix factorization implementation