
Relevant Search(Manning,2016).pdf

Front cover
brief contents
contents
foreword
preface
acknowledgments
about this book
Who should read this book
How this book is organized
About the code
Author Online
Other online resources
about the authors
about the cover illustration
1 The search relevance problem
1.1 Your goal: gaining the skills of a relevance engineer
1.2 Why is search relevance so hard?
1.2.1 What’s a “relevant” search result?
1.2.2 Search: there’s no silver bullet!
1.3 Gaining insight from relevance research
1.3.1 Information retrieval
1.3.2 Can we use information retrieval to solve relevance?
1.4 How do you solve relevance?
1.5 More than technology: curation, collaboration, and feedback
1.6 Summary
2 Search—under the hood
2.1 Search 101
2.1.1 What’s a search document?
2.1.2 Searching the content
2.1.3 Exploring content through search
2.1.4 Getting content into the search engine
2.2 Search engine data structures
2.2.1 The inverted index
2.2.2 Other pieces of the inverted index
2.3 Indexing content: extraction, enrichment, analysis, and indexing
2.3.1 Extracting content into documents
2.3.2 Enriching documents to clean, augment, and merge data
2.3.3 Performing analysis
2.3.4 Indexing
2.4 Document search and retrieval
2.4.1 Boolean search: AND/OR/NOT
2.4.2 Boolean queries in Lucene-based search (MUST/MUST_NOT/SHOULD)
2.4.3 Positional and phrase matching
2.4.4 Enabling exploration: filtering, facets, and aggregations
2.4.5 Sorting, ranked results, and relevance
2.5 Summary
3 Debugging your first relevance problem
3.1 Applications to Solr and Elasticsearch: examples in Elasticsearch
3.2 Our most prominent data set: TMDB
3.3 Examples programmed in Python
3.4 Your first search application
3.4.1 Your first searches of the TMDB Elasticsearch index
3.5 Debugging query matching
3.5.1 Examining the underlying query strategy
3.5.2 Taking apart query parsing
3.5.3 Debugging analysis to solve matching issues
3.5.4 Comparing your query to the inverted index
3.5.5 Fixing our matching by changing analyzers
3.6 Debugging ranking
3.6.1 Decomposing the relevance score with Lucene’s explain feature
3.6.2 The vector-space model, the relevance explain, and you
3.6.3 Practical caveats to the vector space model
3.6.4 Scoring matches to measure relevance
3.6.5 Computing weights with TF × IDF
3.6.6 Lies, damned lies, and similarity
3.6.7 Factoring in the search term’s importance
3.6.8 Fixing Space Jam vs. alien ranking
3.7 Solved? Our work is never over!
3.8 Summary
4 Taming tokens
4.1 Tokens as document features
4.1.1 The matching process
4.1.2 Tokens, more than just words
4.2 Controlling precision and recall
4.2.1 Precision and recall by example
4.2.2 Analysis for precision or recall
4.2.3 Taking recall to extremes
4.3 Precision and recall—have your cake and eat it too
4.3.1 Scoring strength of a feature in a single field
4.3.2 Scoring beyond TF × IDF: multiple search terms and multiple fields
4.4 Analysis strategies
4.4.1 Dealing with delimiters
4.4.2 Capturing meaning with synonyms
4.4.3 Modeling specificity in search
4.4.4 Modeling specificity with synonyms
4.4.5 Modeling specificity with paths
4.4.6 Tokenize the world!
4.4.7 Tokenizing integers
4.4.8 Tokenizing geographic data
4.4.9 Tokenizing melodies
4.5 Summary
5 Basic multifield search
5.1 Signals and signal modeling
5.1.1 What is a signal?
5.1.2 Starting with the source data model
5.1.3 Implementing a signal
5.1.4 Signal modeling: data modeling for relevance
5.2 TMDB—search, the final frontier!
5.2.1 Violating the prime directive
5.2.2 Flattening nested docs
5.3 Signal modeling in field-centric search
5.3.1 Starting out with best_fields
5.3.2 Controlling field preference in search results
5.3.3 Better best_fields with more-precise signals?
5.3.4 Letting losers share the glory: calibrating best_fields
5.3.5 Counting multiple signals using most_fields
5.3.6 Boosting in most_fields
5.3.7 When additional matches don’t matter
5.3.8 What’s the verdict on most_fields?
5.4 Summary
6 Term-centric search
6.1 What is term-centric search?
6.2 Why do you need term-centric search?
6.2.1 Hunting for albino elephants
6.2.2 Finding an albino elephant in the Star Trek example
6.2.3 Avoiding signal discordance
6.2.4 Understanding the mechanics of signal discordance
6.3 Performing your first term-centric searches
6.3.1 Working with the term-centric ranking function
6.3.2 Running a term-centric query parser (into the ground)
6.3.3 Understanding field synchronicity
6.3.4 Field synchronicity and signal modeling
6.3.5 Query parsers and signal discordance
6.3.6 Tuning term-centric search
6.4 Solving signal discordance in term-centric search
6.4.1 Combining fields into custom all fields
6.4.2 Solving signal discordance with cross_fields
6.5 Combining field-centric and term-centric strategies: having your cake and eating it too
6.5.1 Grouping “like fields” together
6.5.2 Understanding the limits of like fields
6.5.3 Combining greedy naïve search and conservative amplifiers
6.5.4 Term-centric vs. field-centric, and precision vs. recall
6.5.5 Considering filtering, boosting, and reranking
6.6 Summary
7 Shaping the relevance function
7.1 What do we mean by score shaping?
7.2 Boosting: shaping by promoting results
7.2.1 Boosting: the final frontier
7.2.2 When boosting—add or multiply? Boolean or function query?
7.2.3 You choose door A: additive boosting with Boolean queries
7.2.4 You choose door B: function queries using math for ranking
7.2.5 Hands-on with function queries: simple multiplicative boosting
7.2.6 Boosting basics: signals, signals everywhere
7.3 Filtering: shaping by excluding results
7.4 Score-shaping strategies for satisfying business needs
7.4.1 Search all the movies!
7.4.2 Modeling your boosting signals
7.4.3 Building the ranking function: adding high-value tiers
7.4.4 High-value tier scored with a function query
7.4.5 Ignoring TF × IDF
7.4.6 Capturing general-quality metrics
7.4.7 Achieving users’ recency goals
7.4.8 Combining the function queries
7.4.9 Putting it all together!
7.5 Summary
8 Providing relevance feedback
8.1 Relevance feedback at the search box
8.1.1 Providing immediate results with search-as-you-type
8.1.2 Helping users find the best query with search completion
8.1.3 Correcting typos and misspellings with search suggestions
8.2 Relevance feedback while browsing
8.2.1 Building faceted browsing
8.2.2 Providing breadcrumb navigation
8.2.3 Selecting alternative results ordering
8.3 Relevance feedback in the search results listing
8.3.1 What information should be presented in listing items?
8.3.2 Relevance feedback through snippets and highlighting
8.3.3 Grouping similar documents
8.3.4 Helping the user when there are no results
8.4 Summary
9 Designing a relevance-focused search application
9.1 Yowl! The awesome new start-up!
9.2 Gathering information and requirements
9.2.1 Understand users and their information needs
9.2.2 Understand business needs
9.2.3 Identify required and available information
9.3 Designing the search application
9.3.1 Visualize the user’s experience
9.3.2 Define fields and model signals
9.3.3 Combine and balance signals
9.4 Deploying, monitoring, and improving
9.4.1 Monitor
9.4.2 Identify problems and fix them!
9.5 Knowing when good is good enough
9.6 Summary
10 The relevance-centered enterprise
10.1 Feedback: the bedrock of the relevance-centered enterprise
10.2 Why user-focused culture before data-driven culture?
10.3 Flying relevance-blind
10.4 Relevance feedback awakenings: domain experts and expert users
10.5 Relevance feedback maturing: content curation
10.5.1 The role of the content curator
10.5.2 The risk of miscommunication with the content curator
10.6 Relevance streamlined: engineer/curator pairing
10.7 Relevance accelerated: test-driven relevance
10.7.1 Understanding test-driven relevance
10.7.2 Using test-driven relevance with user behavioral data
10.8 Beyond test-driven relevance: learning to rank
10.9 Summary
11 Semantic and personalized search
11.1 Personalizing search based on user profiles
11.1.1 Gathering user profile information
11.1.2 Tying profile information back to the search index
11.2 Personalizing search based on user behavior
11.2.1 Introducing collaborative filtering
11.2.2 Basic collaborative filtering using co-occurrence counting
11.2.3 Tying user behavior information back to the search index
11.3 Basic methods for building concept search
11.3.1 Building concept signals
11.3.2 Augmenting content with synonyms
11.4 Building concept search using machine learning
11.4.1 The importance of phrases in concept search
11.5 The personalized search—concept search connection
11.6 Recommendation as a generalization of search
11.6.1 Replacing search with recommendation
11.7 Best wishes on your search relevance journey
11.8 Summary
Appendix A—Indexing directly from TMDB
A.1 Setting the TMDB key and loading the IPython notebook
A.2 Setting up for the TMDB API
A.3 Crawling the TMDB API
A.4 Indexing TMDB movies to Elasticsearch
Appendix B—Solr reader’s companion
B.1 Chapter 4: taming Solr’s terms
B.1.1 Summary of Solr analysis and mapping features
B.1.2 Building custom analyzers in Solr
B.1.3 Using field mappings in Solr
B.2 Chapters 5 and 6: multifield search in Solr
B.2.1 Summary of query feature mappings
B.2.2 Understanding query differences between Solr and Elasticsearch
B.2.3 Querying Solr: the ergonomics
B.2.4 Term-centric and field-centric search with the edismax query parser
B.2.5 All fields and cross_fields search
B.3 Chapter 7: shaping Solr’s ranking function
B.3.1 Summary of boosting feature mappings
B.3.2 Solr’s Boolean boosting
B.3.3 Solr’s function queries
B.3.4 Multiplicative boosting in Solr
B.4 Chapter 8: relevance feedback
B.4.1 Summary of relevance feedback feature mappings
B.4.2 Solr autocomplete: match phrase prefix
B.4.3 Faceted browsing in Solr
B.4.4 Field collapsing
B.4.5 Suggestion and highlighting components
index
Symbols
Numerics
A
B
C
D
E
F
G
H
I
J
K
L
M
N
O
P
Q
R
S
T
U
V
W
X
Y
Z
Back cover
Relevant Search
With applications for Solr and Elasticsearch
Doug Turnbull and John Berryman
Foreword by Trey Grainger
Manning Publications Co., Shelter Island
For online information and ordering of this and other Manning books, please visit www.manning.com. The publisher offers discounts on this book when ordered in quantity. For more information, please contact:

Special Sales Department
Manning Publications Co.
20 Baldwin Road, PO Box 761
Shelter Island, NY 11964
Email: orders@manning.com

©2016 by Manning Publications Co. All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by means electronic, mechanical, photocopying, or otherwise, without prior written permission of the publisher.

Many of the designations used by manufacturers and sellers to distinguish their products are claimed as trademarks. Where those designations appear in the book, and Manning Publications was aware of a trademark claim, the designations have been printed in initial caps or all caps.

Recognizing the importance of preserving what has been written, it is Manning’s policy to have the books we publish printed on acid-free paper, and we exert our best efforts to that end. Recognizing also our responsibility to conserve the resources of our planet, Manning books are printed on paper that is at least 15 percent recycled and processed without the use of elemental chlorine.

Development editor: Marina Michaels
Technical development editor: Aaron Colcord
Copy editor: Sharon Wilkey
Proofreader: Elizabeth Martin
Technical proofreader: Valentin Crettaz
Typesetter: Dennis Dalinnik
Cover designer: Marija Tudor

ISBN: 9781617292774
Printed in the United States of America