Foundations and Trends® in Information Retrieval
Vol. xx, No. xx (2008) 1–112
© 2008 xxxxxxxxx
DOI: xxxxxx
Learning to Rank for Information Retrieval
Tie-Yan Liu1
1 Microsoft Research Asia, Sigma Center, No. 49, Zhichun Road, Haidian
District, Beijing, 100190, P. R. China, Tie-Yan.Liu@microsoft.com
Abstract
Learning to rank for information retrieval (IR) is the task of automatically constructing a ranking model from training data, such that the model can sort new objects according to their degrees of relevance, preference, or importance. Many IR problems are by nature ranking problems, and many IR technologies can potentially be enhanced by using learning-to-rank techniques. The objective of this tutorial is to give an introduction to this research direction. Specifically, existing learning-to-rank algorithms are reviewed and categorized into three approaches: the pointwise, pairwise, and listwise approaches. The advantages and problems of each approach are analyzed, and the relationships between the loss functions used in these approaches and IR evaluation measures are discussed. Then empirical evaluations of typical learning-to-rank methods are presented, using the LETOR collection as a benchmark dataset; the results seem to suggest that the listwise approach is the most effective one among the three. After that, a statistical ranking theory is introduced, which can describe different learning-to-rank algorithms and be used to analyze their query-level generalization abilities. At the end of the tutorial, we summarize and discuss potential future work on learning to rank.
Contents
1 Introduction
1.1 Ranking in Information Retrieval
1.2 Learning to Rank
1.3 About this Tutorial
2 The Pointwise Approach
2.1 Regression based Algorithms
2.2 Classification based Algorithms
2.3 Ordinal Regression based Algorithms
2.4 Discussions
3 The Pairwise Approach
3.1 Example Algorithms
3.2 Discussions
4 The Listwise Approach
4.1 Direct Optimization of IR Evaluation Measures
4.2 Minimization of Listwise Ranking Losses
4.3 Discussions
5 Analysis of the Approaches
5.1 The Pointwise Approach
5.2 The Pairwise Approach
5.3 The Listwise Approach
5.4 Discussions
6 Benchmarking Learning-to-Rank Algorithms
6.1 The LETOR Collection
6.2 Experimental Results on LETOR
7 Statistical Ranking Theory
7.1 Conventional Generalization Analyses on Ranking
7.2 A Query-level Ranking Framework
7.3 Query-level Generalization Analysis
7.4 Discussions
8 Summary and Outlook
References
Acknowledgements
1 Introduction
With the rapid development of the Web, every one of us is experiencing a flood of information. It was estimated that there were about 25 billion pages on the Web as of October 2008 (see http://www.worldwidewebsize.com/), which makes it generally impossible for common users to locate their desired information by browsing the Web. As a consequence, efficient and effective information retrieval (IR) has become more important than ever, and search engines (or IR systems) have become an essential tool for many people.
Ranking is a central problem in IR. Many IR problems are by nature
ranking problems, such as document retrieval, collaborative filtering
[58], key term extraction [30], definition finding [130], important email
routing [23], sentiment analysis [94], product rating [36], and anti-Web spam [56]. In this tutorial, we will mainly take document retrieval as an example. Note that document retrieval is not a narrow task: Web pages, emails, academic papers, books, and news stories are just a few of the many kinds of documents. There are also many different ranking scenarios for document retrieval that are of interest.
Scenario 1: Rank the documents purely according to their relevance with regard to the query.
Scenario 2: Consider the relationships of similarity [118], website
structure [35], and diversity [139] between documents in the ranking
process. This is also referred to as relational ranking [102].
Scenario 3: Aggregate several candidate ranked lists to get a bet-
ter ranked list. This scenario is also referred to as meta search [7]. The
candidate ranked lists may come from different index servers, or differ-
ent vertical search engines, and the target ranked list is the final result
presented to users.
Scenario 4: Find whether, and to what degree, a property of a webpage influences the ranking result. This is referred to as "reverse engineering" in search engine optimization (SEO; see http://www.search-marketing.info/newsletter/reverse-engineering.htm).
To tackle the problem of document retrieval, many heuristic ranking
models have been proposed and used in the literature of IR. Recently,
given the amount of potential training data available, it has become
possible to leverage machine learning (ML) technologies to build effec-
tive ranking models. Specifically, we call those methods that learn how
to combine predefined features for ranking by means of discriminative
learning “learning-to-rank” methods.
In recent years, learning to rank has become a very hot research
direction in IR, and a large number of learning-to-rank algorithms have
been proposed, such as [49] [73] [33] [90] [78] [34] [59] [114] [26] [9] [29]
[14] [122] [47] [62] [97] [16] [117] [136] [134] [13] [104] [99] [17] [129]. We
can foresee that learning to rank will have an even bigger impact on
IR in the future.
When a research area reaches this stage, several questions naturally arise.
• In what respects are these learning-to-rank algorithms similar, and in which aspects do they differ? What are the strengths and weaknesses of each algorithm?
• Empirically, which of the many learning-to-rank algorithms performs the best? What kind of datasets can be used to make a fair comparison among different learning-to-rank algorithms?
• Theoretically, is ranking a new machine learning problem, or
can it be simply reduced to existing machine learning prob-
lems? What are the unique theoretical issues for ranking that
should be investigated?
• Are there many remaining issues regarding learning to rank
to study in the future? What are they?
The above questions have been brought to the attention of the IR
and ML communities in a variety of contexts, especially during recent
years. The aim of this tutorial is to review recent work that attempts
to answer these questions. Needless to say, a comprehensive understanding of the task of ranking in IR is the key to finding the right answers. Therefore, we will first give a brief introduction to ranking in IR, and then formalize the problem of learning to rank so as to set the stage for the upcoming detailed reviews.
1.1 Ranking in Information Retrieval
In this section, we briefly review representative ranking models in the
literature of IR, and introduce how these models are evaluated.
1.1.1 Conventional Ranking Models for IR
In the literature of IR, many ranking models have been proposed [8].
They can be roughly categorized as query-dependent models and query-
independent models.
Query-dependent Models
The early models retrieve documents based on the occurrences of
the query terms in the documents. Examples include the Boolean model
[8]. Basically these models can only predict whether a document is
relevant to the query or not, but cannot predict the degree of relevance.
To further model the relevance degree, the Vector Space model
(VSM) was proposed [8]. Both documents and queries are represented
as vectors in a Euclidean space, in which the inner product of two vectors can be used to measure their similarity. To get an effective vector representation of the query and the documents, TF-IDF weighting has
been widely used.3 The TF of a term t in a vector is defined as the normalized number of its occurrences in the document, and its IDF is defined as follows:
IDF(t) = log(N / n(t)),    (1.1)
where N is the total number of documents in the collection, and n(t)
is the number of documents containing term t.
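As a concrete illustration, Eqn.(1.1), together with the normalized TF described above, can be sketched in a few lines of Python (the toy collection and the function names are hypothetical, for illustration only):

```python
import math

# Toy collection: each document is a list of terms (hypothetical data).
docs = [
    ["learning", "to", "rank"],
    ["rank", "documents", "by", "relevance"],
    ["machine", "learning", "for", "retrieval"],
]
N = len(docs)  # total number of documents in the collection

def idf(term):
    # Eqn.(1.1): IDF(t) = log(N / n(t)), where n(t) is the number
    # of documents containing term t.
    n_t = sum(1 for d in docs if term in d)
    return math.log(N / n_t)

def tf(term, doc):
    # Normalized TF: number of occurrences divided by document length.
    return doc.count(term) / len(doc)

def tf_idf(term, doc):
    return tf(term, doc) * idf(term)
```

For example, tf_idf("rank", docs[1]) weights the single occurrence of "rank" in a four-term document by log(3/2), since two of the three documents contain that term.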
While the VSM implicitly assumes independence between terms, Latent Semantic Indexing (LSI) [37] tries to avoid this assumption. In particular, Singular Value Decomposition (SVD) is used
to linearly transform the feature space and thus a “latent semantic
space” is generated. Similarity in this new space is then used to define
the relevance between the query and the documents.
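The LSI projection just described can be sketched with a truncated SVD (a minimal illustration using NumPy; the term-document matrix, the rank k, and the query vector are all hypothetical):

```python
import numpy as np

# Toy term-document matrix X: rows are terms, columns are documents.
X = np.array([
    [2.0, 0.0, 1.0, 0.0],
    [1.0, 1.0, 0.0, 0.0],
    [0.0, 2.0, 0.0, 1.0],
    [0.0, 0.0, 1.0, 2.0],
])

# Singular Value Decomposition: X = U diag(s) Vt.
U, s, Vt = np.linalg.svd(X, full_matrices=False)

k = 2  # number of latent dimensions kept (chosen by hand here)
doc_latent = np.diag(s[:k]) @ Vt[:k, :]  # documents in the latent space

# Fold a query vector into the same latent space: q_k = U_k^T q.
q = np.array([1.0, 1.0, 0.0, 0.0])  # query term vector
q_latent = U[:, :k].T @ q

# Relevance is defined by cosine similarity between the query and the
# documents in the latent semantic space.
sims = (doc_latent.T @ q_latent) / (
    np.linalg.norm(doc_latent, axis=0) * np.linalg.norm(q_latent) + 1e-12)
ranking = np.argsort(-sims)  # document indices, most similar first
```

Keeping only the top k singular directions is what makes the space "latent": documents sharing co-occurring terms can be similar even without exact term overlap with the query.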
Compared with the above, models based on the probabilistic ranking principle [83] have garnered more attention and achieved more success in the past decades. Famous ranking models like the BM25 model [111]4 and the language model for IR can both be categorized as probabilistic ranking models.
The basic idea of BM25 is to rank documents by the log-odds of their relevance. In fact, BM25 is not a single model but defines a whole family of ranking models, with slightly different components and parameters. One popular instantiation of the model is as follows.
Given a query q containing terms t1, . . . , tM, the BM25 score of a document d is computed as below:

BM25(d, q) = Σ_{i=1..M} [ IDF(ti) · TF(ti, d) · (k1 + 1) ] / [ TF(ti, d) + k1 · (1 − b + b · LEN(d)/avdl) ],    (1.2)
3 Note that there are many different definitions of TF and IDF in the literature. Some are purely based on the frequency and others include smoothing or normalization [116]. Here we just give some simple examples to illustrate the main idea.
4 The name of the actual model is BM25. In the literature, however, it is usually referred to as "Okapi BM25", since the Okapi system was the first system to implement this model.
where TF(t, d) is the term frequency of t in document d, LEN(d) is the length (number of words) of document d, and avdl is the average document length in the text collection from which documents are drawn. k1 and b are free parameters, and IDF(t) is the IDF weight of term t, computed, for example, using Eqn.(1.1).
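Eqn.(1.2) translates almost directly into code. The sketch below assumes the commonly used default parameters k1 = 1.2 and b = 0.75, and takes the IDF weights as given (e.g., computed by Eqn.(1.1)); all names are illustrative:

```python
def bm25_score(query_terms, doc_terms, idf, avdl, k1=1.2, b=0.75):
    """BM25 score of Eqn.(1.2) for a single document.

    query_terms: the query terms t1, ..., tM
    doc_terms:   the document's terms (a list of strings)
    idf:         a dict mapping term -> IDF weight
    avdl:        average document length in the collection
    """
    doc_len = len(doc_terms)  # LEN(d)
    score = 0.0
    for t in query_terms:
        tf = doc_terms.count(t)  # TF(t, d)
        denom = tf + k1 * (1 - b + b * doc_len / avdl)
        score += idf.get(t, 0.0) * tf * (k1 + 1) / denom
    return score
```

Note the saturation effect built into Eqn.(1.2): as TF(t, d) grows, the per-term contribution approaches IDF(t) · (k1 + 1), so repeated occurrences of a term yield diminishing returns.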
The language model for IR [96] is an application of statistical language modeling to IR. A statistical language model assigns a probability to a sequence of terms. When used in IR, a language model is
associated with a document. With query q as input, documents are
ranked based on the query likelihood, or the probability that the doc-
ument’s language model would generate the terms in the query (i.e.,
P(q|d)). By further assuming independence among terms, one has P(q|d) = Π_{i=1..M} P(ti|d), if query q contains terms t1, . . . , tM.
To learn the document’s language model, a maximum likelihood
method is used. As in many maximum likelihood methods, the issue
of smoothing the estimate is critical. Usually a background language
model estimated using the entire collection is used for this purpose.
Then, the document’s language model can be constructed as follows.
p(ti|d) = (1 − λ) · TF(ti, d) / LEN(d) + λ · p(ti|C),    (1.3)
where p(ti|C) is the background language model for term ti, and λ ∈
[0, 1] is a smoothing factor.
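Eqn.(1.3) and the query likelihood P(q|d) can be sketched together as follows (a minimal illustration; the smoothing factor λ and the toy data are hypothetical, and the background model p(t|C) is estimated here from a flat list of all collection terms):

```python
import math

def smoothed_prob(term, doc_terms, collection_terms, lam=0.5):
    # Eqn.(1.3): p(t|d) = (1 - lam) * TF(t, d)/LEN(d) + lam * p(t|C)
    p_ml = doc_terms.count(term) / len(doc_terms)                # document model
    p_bg = collection_terms.count(term) / len(collection_terms)  # p(t|C)
    return (1 - lam) * p_ml + lam * p_bg

def query_log_likelihood(query_terms, doc_terms, collection_terms, lam=0.5):
    # log P(q|d) = sum_i log p(ti|d), by the term-independence assumption.
    return sum(
        math.log(smoothed_prob(t, doc_terms, collection_terms, lam))
        for t in query_terms
    )
```

Working in log space avoids numerical underflow for long queries, and smoothing keeps log p(ti|d) finite for query terms that are absent from the document.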
There are many variants of the language model for IR; some of them even go beyond the query likelihood retrieval model (e.g., the models based on K-L divergence [140]). We will not introduce them further here; readers are encouraged to read the tutorial authored by Zhai [138].
In addition to the above examples, many other models have also
been proposed to compute the relevance between a query and a
document. Some of them [119] take the proximity of the query terms
into consideration, and some others consider the relationship between
documents in terms of content similarity [118], hyperlink structure
[113], website structure [101], and topic diversity [139].