Foundations and Trends® in Information Retrieval, Vol. xx, No. xx (2008) 1–112, © 2008 xxxxxxxxx, DOI: xxxxxx

Learning to Rank for Information Retrieval

Tie-Yan Liu
Microsoft Research Asia, Sigma Center, No. 49, Zhichun Road, Haidian District, Beijing, 100190, P. R. China, Tie-Yan.Liu@microsoft.com

Abstract

Learning to rank for information retrieval (IR) is the task of automatically constructing a ranking model from training data, such that the model can sort new objects according to their degrees of relevance, preference, or importance. Many IR problems are by nature ranking problems, and many IR technologies can potentially be enhanced by learning-to-rank techniques. The objective of this tutorial is to give an introduction to this research direction. Specifically, the existing learning-to-rank algorithms are reviewed and categorized into three approaches: the pointwise, pairwise, and listwise approaches. The advantages and problems of each approach are analyzed, and the relationships between the loss functions used in these approaches and IR evaluation measures are discussed. Then empirical evaluations of typical learning-to-rank methods are presented, with the LETOR collection as a benchmark dataset; the results seem to suggest that the listwise approach is the most effective of the three. After that, a statistical ranking theory is introduced, which can describe different learning-to-rank algorithms and be used to analyze their query-level generalization abilities. At the end of the tutorial, we give a summary and discuss potential future work on learning to rank.
Contents

1 Introduction
  1.1 Ranking in Information Retrieval
  1.2 Learning to Rank
  1.3 About this Tutorial

2 The Pointwise Approach
  2.1 Regression based Algorithms
  2.2 Classification based Algorithms
  2.3 Ordinal Regression based Algorithms
  2.4 Discussions

3 The Pairwise Approach
  3.1 Example Algorithms
  3.2 Discussions

4 The Listwise Approach
  4.1 Direct Optimization of IR Evaluation Measures
  4.2 Minimization of Listwise Ranking Losses
  4.3 Discussions

5 Analysis of the Approaches
  5.1 The Pointwise Approach
  5.2 The Pairwise Approach
  5.3 The Listwise Approach
  5.4 Discussions

6 Benchmarking Learning-to-Rank Algorithms
  6.1 The LETOR Collection
  6.2 Experimental Results on LETOR

7 Statistical Ranking Theory
  7.1 Conventional Generalization Analyses on Ranking
  7.2 A Query-level Ranking Framework
  7.3 Query-level Generalization Analysis
  7.4 Discussions

8 Summary and Outlook

References

Acknowledgements
1 Introduction

With the fast development of the Web, every one of us is experiencing a flood of information. It was estimated that there were about 25 billion pages on the Web as of October 2008¹, which makes it generally impossible for common users to locate their desired information by browsing the Web. As a consequence, efficient and effective information retrieval (IR) has become more important than ever, and search engines (or IR systems) have become an essential tool for many people.

Ranking is a central problem in IR. Many IR problems are by nature ranking problems, such as document retrieval, collaborative filtering [58], key term extraction [30], definition finding [130], important email routing [23], sentiment analysis [94], product rating [36], and anti Web spam [56]. In this tutorial, we will mainly take document retrieval as an example. Note that document retrieval is not a narrow task. Web pages, emails, academic papers, books, and news stories are just a few of the many examples of documents. There are also many different ranking scenarios for document retrieval that are of interest to us.

Scenario 1: Rank the documents purely according to their relevance with regard to the query.

¹ http://www.worldwidewebsize.com/
Scenario 2: Consider the relationships of similarity [118], website structure [35], and diversity [139] between documents in the ranking process. This is also referred to as relational ranking [102].

Scenario 3: Aggregate several candidate ranked lists to get a better ranked list. This scenario is also referred to as meta search [7]. The candidate ranked lists may come from different index servers or different vertical search engines, and the target ranked list is the final result presented to users.

Scenario 4: Find whether, and to what degree, a property of a webpage influences the ranking result. This is referred to as "reverse engineering" in search engine optimization (SEO)².

To tackle the problem of document retrieval, many heuristic ranking models have been proposed and used in the literature of IR. Recently, given the amount of potential training data available, it has become possible to leverage machine learning (ML) technologies to build effective ranking models. Specifically, we call those methods that learn how to combine predefined features for ranking by means of discriminative learning "learning-to-rank" methods. In recent years, learning to rank has become a very hot research direction in IR, and a large number of learning-to-rank algorithms have been proposed, such as [49] [73] [33] [90] [78] [34] [59] [114] [26] [9] [29] [14] [122] [47] [62] [97] [16] [117] [136] [134] [13] [104] [99] [17] [129]. We can foresee that learning to rank will have an even bigger impact on IR in the future.

When a research area comes to this stage, several questions naturally arise.

• In what respects are these learning-to-rank algorithms similar, and in which aspects do they differ? What are the strengths and weaknesses of each algorithm?
• Empirically, which of those many learning-to-rank algorithms performs the best? What kind of datasets can be used to make a fair comparison among different learning-to-rank algorithms?

² http://www.search-marketing.info/newsletter/reverse-engineering.htm
• Theoretically, is ranking a new machine learning problem, or can it simply be reduced to existing machine learning problems? What are the unique theoretical issues for ranking that should be investigated?
• Are there many remaining issues regarding learning to rank to study in the future? What are they?

The above questions have been brought to the attention of the IR and ML communities in a variety of contexts, especially during recent years. The aim of this tutorial is to review the recent work that attempts to answer these questions. Needless to say, a comprehensive understanding of the task of ranking in IR is the key to finding the right answers. Therefore, we will first give a brief introduction to ranking in IR, and then formalize the problem of learning to rank so as to set the stage for the upcoming detailed reviews.

1.1 Ranking in Information Retrieval

In this section, we briefly review representative ranking models in the literature of IR, and introduce how these models are evaluated.

1.1.1 Conventional Ranking Models for IR

In the literature of IR, many ranking models have been proposed [8]. They can be roughly categorized as query-dependent models and query-independent models.

Query-dependent Models

The early models retrieve documents based on the occurrences of the query terms in the documents. Examples include the Boolean model [8]. Basically, these models can only predict whether a document is relevant to the query or not, but cannot predict the degree of relevance.

To further model the relevance degree, the Vector Space Model (VSM) was proposed [8]. Both documents and queries are represented as vectors in a Euclidean space, in which the inner product of two vectors can be used to measure their similarity. To get an effective vector representation of the query and the documents, TF-IDF weighting has been widely used³. The TF of a term t in a vector is defined as the normalized number of its occurrences in the document, and the IDF of t is defined as follows:

$$\mathrm{IDF}(t) = \log \frac{N}{n(t)}, \tag{1.1}$$

where N is the total number of documents in the collection, and n(t) is the number of documents containing term t.

³ Note that there are many different definitions of TF and IDF in the literature. Some are purely based on the frequency and the others include smoothing or normalization [116]. Here we just give some simple examples to illustrate the main idea.
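To make the VSM with TF-IDF weighting concrete, below is a minimal Python sketch (not from the original tutorial; the function names and the toy collection are our own illustrative choices). It builds TF-IDF vectors following Eqn (1.1) and scores documents against a query by the inner product.

```python
import math
from collections import Counter

def tf_idf_vectors(docs):
    """Build sparse TF-IDF vectors for a list of tokenized documents.

    TF is the term count normalized by document length; IDF follows
    Eqn (1.1): IDF(t) = log(N / n(t)).
    """
    N = len(docs)
    # n(t): number of documents containing term t
    df = Counter(t for doc in docs for t in set(doc))
    idf = {t: math.log(N / n_t) for t, n_t in df.items()}
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        vectors.append({t: (c / len(doc)) * idf[t] for t, c in counts.items()})
    return vectors, idf

def inner_product(v, w):
    """VSM relevance score: inner product of two sparse vectors."""
    return sum(v[t] * w.get(t, 0.0) for t in v)

docs = [["learning", "to", "rank"],
        ["rank", "documents", "by", "relevance"],
        ["statistical", "learning", "theory"]]
doc_vecs, idf = tf_idf_vectors(docs)

# Represent the query in the same TF-IDF space, then rank by score
query = ["learning", "rank"]
q_counts = Counter(query)
q_vec = {t: (c / len(query)) * idf.get(t, 0.0) for t, c in q_counts.items()}
scores = sorted(((i, inner_product(q_vec, dv)) for i, dv in enumerate(doc_vecs)),
                key=lambda x: -x[1])
print(scores)  # document indices in descending order of relevance score
```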
While VSM implies an assumption of independence between terms, Latent Semantic Indexing (LSI) [37] tries to avoid this assumption. In particular, Singular Value Decomposition (SVD) is used to linearly transform the feature space so that a "latent semantic space" is generated. Similarity in this new space is then used to define the relevance between the query and the documents.

Compared with the above models, models based on the probabilistic ranking principle [83] have garnered more attention and achieved more success in the past decades. Famous ranking models like the BM25 model [111]⁴ and the language model for IR can both be categorized as probabilistic ranking models.

The basic idea of BM25 is to rank documents by the log-odds of their relevance. Actually, BM25 is not a single model, but defines a whole family of ranking models, with slightly different components and parameters. One popular instantiation of the model is as follows. Given a query q containing terms t_1, ..., t_M, the BM25 score of a document d is computed as

$$\mathrm{BM25}(d, q) = \sum_{i=1}^{M} \mathrm{IDF}(t_i) \cdot \frac{\mathrm{TF}(t_i, d) \cdot (k_1 + 1)}{\mathrm{TF}(t_i, d) + k_1 \cdot \left(1 - b + b \cdot \frac{\mathrm{LEN}(d)}{avdl}\right)}, \tag{1.2}$$

where TF(t, d) is the term frequency of t in document d, LEN(d) is the length (number of words) of document d, and avdl is the average document length in the text collection from which documents are drawn. k_1 and b are free parameters, and IDF(t) is the IDF weight of term t, computed by using Eqn (1.1), for example.

⁴ The name of the actual model is BM25. However, it is often referred to as "Okapi BM25", since the Okapi system was the first system to implement this model.
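As a concrete illustration of Eqn (1.2), here is a minimal Python sketch. The parameter defaults k1 = 1.2 and b = 0.75 are common choices in practice but are not prescribed by the text, and the function names are our own.

```python
import math
from collections import Counter

def bm25_score(query_terms, doc_terms, doc_freqs, num_docs, avdl,
               k1=1.2, b=0.75):
    """Score one document against a query using Eqn (1.2).

    doc_freqs maps each term t to n(t), the number of documents
    containing t; num_docs is N. k1 and b are the free parameters
    (1.2 and 0.75 are commonly used defaults, assumed here).
    """
    tf = Counter(doc_terms)
    doc_len = len(doc_terms)
    score = 0.0
    for t in query_terms:
        if t not in doc_freqs:
            continue  # term never occurs in the collection
        idf = math.log(num_docs / doc_freqs[t])                 # Eqn (1.1)
        numerator = tf[t] * (k1 + 1)
        denominator = tf[t] + k1 * (1 - b + b * doc_len / avdl)  # length norm
        score += idf * numerator / denominator
    return score

docs = [["learning", "to", "rank"],
        ["rank", "documents", "by", "relevance"]]
doc_freqs = Counter(t for d in docs for t in set(d))
avdl = sum(len(d) for d in docs) / len(docs)

for d in docs:
    print(bm25_score(["rank", "relevance"], d, doc_freqs, len(docs), avdl))
```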
The language model for IR [96] is an application of the statistical language model to IR. A statistical language model assigns a probability to a sequence of terms. When used in IR, a language model is associated with a document. With query q as input, documents are ranked based on the query likelihood, or the probability that the document's language model would generate the terms in the query (i.e., P(q|d)). By further assuming independence among terms, one has $P(q|d) = \prod_{i=1}^{M} P(t_i|d)$, if query q contains terms t_1, ..., t_M.

To learn the document's language model, a maximum likelihood method is used. As in many maximum likelihood methods, the issue of smoothing the estimate is critical. Usually a background language model estimated from the entire collection is used for this purpose. Then the document's language model can be constructed as follows:

$$p(t_i|d) = (1 - \lambda)\,\frac{\mathrm{TF}(t_i, d)}{\mathrm{LEN}(d)} + \lambda\, p(t_i|C), \tag{1.3}$$

where p(t_i|C) is the background language model for term t_i, and λ ∈ [0, 1] is a smoothing factor (a code sketch of this smoothed query-likelihood model is given at the end of this subsection).

There are many variants of the language model for IR, some of which even go beyond the query likelihood retrieval model (e.g., the models based on K-L divergence [140]). We will not introduce them in more detail here; readers are encouraged to read the tutorial authored by Zhai [138].

In addition to the above examples, many other models have also been proposed to compute the relevance between a query and a document. Some of them [119] take the proximity of the query terms into consideration, and some others consider the relationship between documents in terms of content similarity [118], hyperlink structure [113], website structure [101], and topic diversity [139].
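The code sketch referred to above: a minimal Python implementation of the smoothed query-likelihood model of Eqn (1.3), i.e., linear interpolation with a background collection model. The value λ = 0.5 and the function names are our own illustrative choices; in practice λ is tuned.

```python
import math
from collections import Counter

def query_log_likelihood(query_terms, doc_terms, collection_counts,
                         collection_len, lam=0.5):
    """log P(q|d) under the smoothed language model of Eqn (1.3).

    p(t|d) = (1 - lam) * TF(t, d) / LEN(d) + lam * p(t|C),
    where p(t|C) is the background (collection) model estimated from
    the entire collection. lam = 0.5 is an illustrative choice.
    """
    tf = Counter(doc_terms)
    doc_len = len(doc_terms)
    log_prob = 0.0
    for t in query_terms:
        p_background = collection_counts.get(t, 0) / collection_len
        p_t = (1 - lam) * tf[t] / doc_len + lam * p_background
        if p_t == 0.0:  # term unseen in both document and collection
            return float("-inf")
        log_prob += math.log(p_t)
    return log_prob

docs = [["learning", "to", "rank"],
        ["rank", "documents", "by", "relevance"]]
collection_counts = Counter(t for d in docs for t in d)
collection_len = sum(len(d) for d in docs)

# Rank documents by descending query log-likelihood
query = ["rank", "relevance"]
ranked = sorted(docs, key=lambda d: -query_log_likelihood(
    query, d, collection_counts, collection_len))
print(ranked)
```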