logo资料库

大名鼎鼎的斯坦福搜索引擎原理.pdf

第1页 / 共506页
第2页 / 共506页
第3页 / 共506页
第4页 / 共506页
第5页 / 共506页
第6页 / 共506页
第7页 / 共506页
第8页 / 共506页
资料共506页,剩余部分请下载后查看
Cover
Half-title
Title
Copyright
Contents
Table of Notation
Preface
1 Boolean retrieval
1.1 An example information retrieval problem
1.2 A first take at building an inverted index
1.3 Processing Boolean queries
1.4 The extended Boolean model versus ranked retrieval
1.5 References and further reading
2 The term vocabulary and postings lists
2.1 Document delineation and character sequence decoding
2.1.1 Obtaining the character sequence in a document
2.1.2 Choosing a document unit
2.2 Determining the vocabulary of terms
2.2.1 Tokenization
2.2.2 Dropping common terms: stop words
2.2.3 Normalization (equivalence classing of terms)
2.2.4 Stemming and lemmatization
2.3 Faster postings list intersection via skip pointers
2.4 Positional postings and phrase queries
2.4.1 Biword indexes
2.4.2 Positional indexes
2.4.3 Combination schemes
2.5 References and further reading
3 Dictionaries and tolerant retrieval
3.1 Search structures for dictionaries
3.2 Wildcard queries
3.2.1 General wildcard queries
3.2.2 k-gram indexes for wildcard queries
3.3 Spelling correction
3.3.1 Implementing spelling correction
3.3.2 Forms of spelling correction
3.3.3 Edit distance
3.3.4 k-gram indexes for spelling correction
3.3.5 Context-sensitive spelling correction
3.4 Phonetic correction
3.5 References and further reading
4 Index construction
4.1 Hardware basics
4.2 Blocked sort-based indexing
4.3 Single-pass in-memory indexing
4.4 Distributed indexing
4.5 Dynamic indexing
4.6 Other types of indexes
4.7 References and further reading
5 Index compression
5.1 Statistical properties of terms in information retrieval
5.1.1 Heaps' law: Estimating the number of terms
5.1.2 Zipf's law: Modeling the distribution of terms
5.2 Dictionary compression
5.2.1 Dictionary as a string
5.2.2 Blocked storage
5.3 Postings file compression
5.3.1 Variable byte codes
5.3.2 Gamma codes
5.4 References and further reading
6 Scoring, term weighting, and the vector space model
6.1 Parametric and zone indexes
6.1.1 Weighted zone scoring
6.1.2 Learning weights
6.1.3 The optimal weight g
6.2 Term frequency and weighting
6.2.1 Inverse document frequency
6.2.2 Tf--idf weighting
6.3 The vector space model for scoring
6.3.1 Dot products
6.3.2 Queries as vectors
6.3.3 Computing vector scores
6.4 Variant tf--idf functions
6.4.1 Sublinear tf scaling
6.4.2 Maximum tf normalization
6.4.3 Document and query weighting schemes
6.4.4 Pivoted normalized document length
6.5 References and further reading
7 Computing scores in a complete search system
7.1 Efficient scoring and ranking
7.1.1 Inexact top K document retrieval
7.1.2 Index elimination
7.1.3 Champion lists
7.1.4 Static quality scores and ordering
7.1.5 Impact ordering
7.1.6 Cluster pruning
7.2 Components of an information retrieval system
7.2.1 Tiered indexes
7.2.2 Query term proximity
7.2.3 Designing parsing and scoring functions
7.2.4 Putting it all together
7.3 Vector space scoring and query operator interaction
7.4 References and further reading
8 Evaluation in information retrieval
8.1 Information retrieval system evaluation
8.2 Standard test collections
8.3 Evaluation of unranked retrieval sets
8.4 Evaluation of ranked retrieval results
8.5 Assessing relevance
8.5.1 Critiques and justifications of the concept of relevance
8.6 A broader perspective: System quality and user utility
8.6.1 System issues
8.6.2 User utility
8.6.3 Refining a deployed system
8.7 Results snippets
8.8 References and further reading
9 Relevance feedback and query expansion
9.1 Relevance feedback and pseudo relevancefeedback
9.1.1 The Rocchio algorithm for relevance feedback
9.1.2 Probabilistic relevance feedback
9.1.3 When does relevance feedback work?
9.1.4 Relevance feedback on the web
9.1.5 Evaluation of relevance feedback strategies
9.1.6 Pseudo relevance feedback
9.1.7 Indirect relevance feedback
9.1.8 Summary
9.2 Global methods for query reformulation
9.2.1 Vocabulary tools for query reformulation
9.2.2 Query expansion
9.2.3 Automatic thesaurus generation
9.3 References and further reading
10 XML retrieval
10.1 Basic XML concepts
10.2 Challenges in XML retrieval
10.3 A vector space model for XML retrieval
10.4 Evaluation of XML retrieval
10.5 Text-centric versus data-centric XML retrieval
10.6 References and further reading
11 Probabilistic information retrieval
11.1 Review of basic probability theory
11.2 The probability ranking principle
11.2.1 The 1/0 loss case
11.2.2 The probability ranking principle with retrieval costs
11.3 The binary independence model
11.3.1 Deriving a ranking function for query terms
11.3.2 Probability estimates in theory
11.3.3 Probability estimates in practice
11.3.4 Probabilistic approaches to relevance feedback
11.4 An appraisal and some extensions
11.4.1 An appraisal of probabilistic models
11.4.2 Tree-structured dependencies between terms
11.4.3 Okapi BM25: A nonbinary model
11.4.4 Bayesian network approaches to information retrieval
11.5 References and further reading
12 Language models for information retrieval
12.1 Language models
12.1.1 Finite automata and language models
12.1.2 Types of language models
12.1.3 Multinomial distributions over words
12.2 The query likelihood model
12.2.1 Using query likelihood language models in IR
12.2.2 Estimating the query generation probability
12.2.3 Ponte and Croft's experiments
12.3 Language modeling versus other approachesin information retrieval
12.4 Extended language modeling approaches
12.5 References and further reading
13 Text classification and Naive Bayes
13.1 The text classification problem
13.2 Naive Bayes text classification
13.2.1 Relation to multinomial unigram language model
13.3 The Bernoulli model
13.4 Properties of Naive Bayes
13.4.1 A variant of the multinomial model
13.5 Feature selection
13.5.1 Mutual information
13.5.2 Chi2 Feature selection
13.5.3 Frequency-based feature selection
13.5.4 Feature selection for multiple classifiers
13.5.5 Comparison of feature selection methods
13.6 Evaluation of text classification
13.7 References and further reading
14 Vector space classification
14.1 Document representations and measures of relatedness in vector spaces
14.2 Rocchio classification
14.3 k nearest neighbor
14.3.1 Time complexity and optimality of k nearest neighbor
14.4 Linear versus nonlinear classifiers
14.5 Classification with more than two classes
14.6 The bias--variance tradeoff
14.7 References and further reading
15 Support vector machines and machine learning on documents
15.1 Support vector machines: The linearly separable case
15.2 Extensions to the support vector machine model
15.2.1 Soft margin classification
15.2.2 Multiclass support vector machines
15.2.3 Nonlinear support vector machines
15.2.4 Experimental results
15.3 Issues in the classification of text documents
15.3.1 Choosing what kind of classifier to use
15.3.2 Improving classifier performance
15.4 Machine-learning methods in ad hoc information retrieval
15.4.1 A simple example of machine-learned scoring
15.4.2 Result ranking by machine learning
15.5 References and further reading
16 Flat clustering
16.1 Clustering in information retrieval
16.2 Problem statement
16.2.1 Cardinality -- The number of clusters
16.3 Evaluation of clustering
16.4 K-means
16.4.1 Cluster cardinality in K-means
16.5 Model-based clustering
16.6 References and further reading
17 Hierarchical clustering
17.1 Hierarchical agglomerative clustering
17.2 Single-link and complete-link clustering
17.2.1 Time complexity
17.3 Group-average agglomerative clustering
17.4 Centroid clustering
17.5 Optimality of hierarchical agglomerative clustering
17.6 Divisive clustering
17.7 Cluster labeling
17.8 Implementation notes
17.9 References and further reading
18 Matrix decompositions and latent semantic indexing
18.1 Linear algebra review
18.1.1 Matrix decompositions
18.2 Term--document matrices and singular value decompositions
18.3 Low-rank approximations
18.4 Latent semantic indexing
18.5 References and further reading
19 Web search basics
19.1 Background and history
19.2 Web characteristics
19.2.1 The web graph
19.2.2 Spam
19.3 Advertising as the economic model
19.4 The search user experience
19.4.1 User query needs
19.5 Index size and estimation
19.6 Near-duplicates and shingling
19.7 References and further reading
20 Web crawling and indexes
20.1 Overview
20.1.1 Features a crawler must provide
20.1.2 Features a crawler should provide
20.2 Crawling
20.2.1 Crawler architecture
20.2.2 DNS resolution
20.2.3 The URL frontier
20.3 Distributing indexes
20.4 Connectivity servers
20.5 References and further reading
21 Link analysis
21.1 The Web as a graph
21.1.1 Anchor text and the web graph
21.2 PageRank
21.2.1 Markov chains
21.2.2 The PageRank computation
21.2.3 Topic-specific PageRank
21.3 Hubs and authorities
21.3.1 Choosing the subset of the Web
21.4 References and further reading
Bibliography
Index
This page intentionally left blank
P1: KRU/IRP irbook CUUS232/Manning 978 0 521 86571 5 June 26, 2008 21:26 Introduction to Information Retrieval Introduction to Information Retrieval is the first textbook with a coherent treat- ment of classical and web information retrieval, including web search and the related areas of text classification and text clustering. Written from a computer science perspective, it gives an up-to-date treatment of all aspects of the design and implementation of systems for gathering, indexing, and searching documents and of methods for evaluating systems, along with an introduction to the use of machine learning methods on text collections. Designed as the primary text for a graduate or advanced undergraduate course in information retrieval, the book will also interest researchers and professionals. A complete set of lecture slides and exercises that accompany the book are available on the web. Christopher D. Manning is Associate Professor of Computer Science and Lin- guistics at Stanford University. Prabhakar Raghavan is Head of Yahoo! Research and a Consulting Professor of Computer Science at Stanford University. Hinrich Sch ¨utze is Chair of Theoretical Computational Linguistics at the In- stitute for Natural Language Processing, University of Stuttgart. i
P1: KRU/IRP irbook CUUS232/Manning 978 0 521 86571 5 June 26, 2008 21:26 ii
P1: KRU/IRP irbook CUUS232/Manning 978 0 521 86571 5 June 26, 2008 21:26 Introduction to Information Retrieval Christopher D. Manning Stanford University Prabhakar Raghavan Yahoo! Research Hinrich Sch ¨utze University of Stuttgart iii
CAMBRIDGE UNIVERSITY PRESS Cambridge, New York, Melbourne, Madrid, Cape Town, Singapore, São Paulo Cambridge University Press The Edinburgh Building, Cambridge CB2 8RU, UK Published in the United States of America by Cambridge University Press, New York www.cambridge.org Information on this title: www.cambridge.org/9780521865715 © Cambridge University Press 2008 This publication is in copyright. Subject to statutory exception and to the provision of relevant collective licensing agreements, no reproduction of any part may take place without the written permission of Cambridge University Press. First published in print format 2008 ISBN-13 978-0-511-41405-3 eBook (EBL) ISBN-13 978-0-521-86571-5 hardback Cambridge University Press has no responsibility for the persistence or accuracy of urls for external or third-party internet websites referred to in this publication, and does not guarantee that any content on such websites is, or will remain, accurate or appropriate.
P1: KRU/IRP irbook CUUS232/Manning 978 0 521 86571 5 June 26, 2008 21:26 Contents Table of Notation Preface 1 Boolean retrieval 1.1 An example information retrieval problem 1.2 A first take at building an inverted index 1.3 Processing Boolean queries 1.4 The extended Boolean model versus ranked retrieval 1.5 References and further reading 2 The term vocabulary and postings lists 2.1 Document delineation and character sequence decoding 2.2 Determining the vocabulary of terms 2.3 Faster postings list intersection via skip pointers 2.4 Positional postings and phrase queries 2.5 References and further reading 3 Dictionaries and tolerant retrieval 3.1 Search structures for dictionaries 3.2 Wildcard queries 3.3 Spelling correction 3.4 Phonetic correction 3.5 References and further reading 4 Index construction 4.1 Hardware basics 4.2 Blocked sort-based indexing 4.3 Single-pass in-memory indexing 4.4 Distributed indexing 4.5 Dynamic indexing page xi xv 1 3 6 9 13 16 18 18 21 33 36 43 45 45 48 52 58 59 61 62 63 66 68 71 v
P1: KRU/IRP irbook CUUS232/Manning 978 0 521 86571 5 June 26, 2008 21:26 vi Contents 4.6 Other types of indexes 4.7 References and further reading 5 Index compression 5.1 Statistical properties of terms in information retrieval 5.2 Dictionary compression 5.3 Postings file compression 5.4 References and further reading 6 Scoring, term weighting, and the vector space model 6.1 Parametric and zone indexes 6.2 Term frequency and weighting 6.3 The vector space model for scoring 6.4 Variant tf–idf functions 6.5 References and further reading 7 Computing scores in a complete search system 7.1 Efficient scoring and ranking 7.2 Components of an information retrieval system 7.3 Vector space scoring and query operator interaction 7.4 References and further reading 8 Evaluation in information retrieval 8.1 Information retrieval system evaluation 8.2 Standard test collections 8.3 Evaluation of unranked retrieval sets 8.4 Evaluation of ranked retrieval results 8.5 Assessing relevance 8.6 A broader perspective: System quality and user utility 8.7 Results snippets 8.8 References and further reading 9 Relevance feedback and query expansion 9.1 Relevance feedback and pseudo relevance feedback 9.2 Global methods for query reformulation 9.3 References and further reading 10 XML retrieval 10.1 Basic XML concepts 10.2 Challenges in XML retrieval 10.3 A vector space model for XML retrieval 10.4 Evaluation of XML retrieval 73 76 78 79 82 87 97 100 101 107 110 116 122 124 124 132 136 137 139 140 141 142 145 151 154 157 159 162 163 173 177 178 180 183 188 192
分享到:
收藏