
Solr in Action (latest complete edition).pdf

Front cover
brief contents
contents
foreword
preface
acknowledgments
about this book
Roadmap
How to use this book
Code conventions and downloads
Author Online
About the cover illustration
Part 1—Meet Solr
1 Introduction to Solr
1.1 Why do I need a search engine?
1.1.1 Managing text-centric data
1.1.2 Common search-engine use cases
1.2 What is Solr?
1.2.1 Information retrieval engine
1.2.2 Flexible schema management
1.2.3 Java web application
1.2.4 Multiple indexes in one server
1.2.5 Extendable (plugins)
1.2.6 Scalable
1.2.7 Fault-tolerant
1.3 Why Solr?
1.3.1 Solr for the software architect
1.3.2 Solr for the system administrator
1.3.3 Solr for the CEO
1.4 Features overview
1.4.1 User-experience features
1.4.2 Data-modeling features
1.4.3 New features in Solr 4
1.5 Summary
2 Getting to know Solr
2.1 Getting started
2.1.1 Installing Solr
2.1.2 Starting the Solr example server
2.1.3 Understanding Solr home
2.1.4 Indexing the example documents
2.2 Searching is what it’s all about
2.2.1 Exploring Solr’s query form
2.2.2 What comes back from Solr when you search
2.2.3 Ranked retrieval
2.2.4 Paging and sorting
2.2.5 Expanded search features
2.3 Tour of the Solr administration console
2.4 Adapting the example to your needs
2.5 Summary
3 Key Solr concepts
3.1 Searching, matching, and finding content
3.1.1 What is a document?
3.1.2 The fundamental search problem
3.1.3 The inverted index
3.1.4 Terms, phrases, and Boolean logic
3.1.5 Finding sets of documents
3.1.6 Phrase queries and term positions
3.1.7 Fuzzy matching
3.1.8 Quick recap
3.2 Relevancy
3.2.1 Default similarity
3.2.2 Term frequency
3.2.3 Inverse document frequency
3.2.4 Boosting
3.2.5 Normalization factors
3.3 Precision and Recall
3.3.1 Precision
3.3.2 Recall
3.3.3 Striking the right balance
3.4 Searching at scale
3.4.1 The denormalized document
3.4.2 Distributed searching
3.4.3 Clusters vs. servers
3.4.4 The limits of Solr
3.5 Summary
4 Configuring Solr
4.1 Overview of solrconfig.xml
4.1.1 Common XML data-structure and type elements
4.1.2 Applying configuration changes
4.1.3 Miscellaneous settings
4.2 Query request handling
4.2.1 Request-handling overview
4.2.2 Search handler
4.2.3 Browse request handler for Solritas: an example
4.2.4 Extending query processing with search components
4.3 Managing searchers
4.3.1 New searcher overview
4.3.2 Warming a new searcher
4.4 Cache management
4.4.1 Cache fundamentals
4.4.2 Filter cache
4.4.3 Query result cache
4.4.4 Document cache
4.4.5 Field value cache
4.5 Remaining configuration options
4.6 Summary
5 Indexing
5.1 Example microblog search application
5.1.1 Representing content for searching
5.1.2 Overview of the Solr indexing process
5.2 Designing your schema
5.2.1 Document granularity
5.2.2 Unique key
5.2.3 Indexed fields
5.2.4 Stored fields
5.2.5 Preview of schema.xml
5.3 Defining fields in schema.xml
5.3.1 Required field attributes
5.3.2 Multivalued fields
5.3.3 Dynamic fields
5.3.4 Copy fields
5.3.5 Unique key field
5.4 Field types for structured nontext fields
5.4.1 String fields
5.4.2 Date fields
5.4.3 Numeric fields
5.4.4 Advanced field type attributes
5.5 Sending documents to Solr for indexing
5.5.1 Indexing documents using XML or JSON
5.5.2 Using the SolrJ client library to add documents from Java
5.5.3 Other tools for importing documents into Solr
5.6 Update handler
5.6.1 Committing documents to the index
5.6.2 Transaction log
5.6.3 Atomic updates
5.7 Index management
5.7.1 Index storage
5.7.2 Segment merging
5.8 Summary
6 Text analysis
6.1 Analyzing microblog text
6.2 Basic text analysis
6.2.1 Analyzer
6.2.2 Tokenizer
6.2.3 Token filter
6.2.4 StandardTokenizer
6.2.5 Removing stop words with StopFilterFactory
6.2.6 LowerCaseFilterFactory—lowercase letters in terms
6.2.7 Testing your analysis with Solr’s analysis form
6.3 Defining a custom field type for microblog text
6.3.1 Collapsing repeated letters with PatternReplaceCharFilterFactory
6.3.2 Preserving hashtags, mentions, and hyphenated terms
6.3.3 Removing diacritical marks using ASCIIFoldingFilterFactory
6.3.4 Stemming with KStemFilterFactory
6.3.5 Injecting synonyms at query time with SynonymFilterFactory
6.3.6 Putting it all together
6.4 Advanced text analysis
6.4.1 Advanced field attributes
6.4.2 Per-language text analysis
6.4.3 Extending text analysis using a Solr plugin
6.5 Summary
Part 2—Core Solr capabilities
7 Performing queries and handling results
7.1 The anatomy of a Solr request
7.1.1 Request handlers
7.1.2 Search components
7.1.3 Query parsers
7.2 Working with query parsers
7.2.1 Specifying a query parser
7.2.2 Local params
7.3 Queries and filters
7.3.1 The fq and q parameters
7.3.2 Handling expensive filters
7.4 The default query parser (Lucene query parser)
7.4.1 Lucene query parser syntax
7.5 Handling user queries (eDisMax query parser)
7.5.1 eDisMax query parser overview
7.5.2 eDisMax query parameters
7.5.3 Searching across multiple fields
7.5.4 Boosting queries and phrases
7.5.5 Field aliasing
7.5.6 User-accessible fields
7.5.7 Minimum match
7.5.8 eDisMax benefits and drawbacks
7.6 Other useful query parsers
7.6.1 Field query parser
7.6.2 Term and Raw query parsers
7.6.3 Function and Function Range query parsers
7.6.4 Nested queries and the Nested query parser
7.6.5 Boost query parser
7.6.6 Prefix query parser
7.6.7 Spatial query parsers
7.6.8 Join query parser
7.6.9 Switch query parser
7.6.10 Surround query parser
7.6.11 Max Score query parser
7.6.12 Collapsing query parser
7.7 Returning results
7.7.1 Choosing a response format
7.7.2 Choosing fields to return
7.7.3 Paging through results
7.8 Sorting results
7.8.1 Sorting by fields
7.8.2 Sorting by functions
7.8.3 Fuzzy sorting
7.9 Debugging query results
7.9.1 Returning debug information
7.10 Summary
8 Faceted search
8.1 Navigating your content at a glance
8.2 Setting up test data
8.3 Field faceting
8.4 Query faceting
8.5 Range faceting
8.6 Filtering upon faceted values
8.6.1 Applying filters to your facets
8.6.2 Safely filtering on faceted values
8.7 Multiselect faceting, keys, and tags
8.7.1 Keys
8.7.2 Tags, excludes, and multiselect faceting
8.8 Beyond the basics
8.9 Summary
9 Hit highlighting
9.1 Overview of hit highlighting
9.2 How highlighting works
9.2.1 Set up a new Solr core for UFO sightings
9.2.2 Preprocess UFO sightings before indexing
9.2.3 Exploring the UFO sightings dataset
9.2.4 Hit highlighting out of the box
9.2.5 Nuts and bolts
9.2.6 Refining highlighter results
9.3 Improving performance using FastVectorHighlighter
9.4 PostingsHighlighter
9.5 Summary
10 Query suggestions
10.1 Spell-check
10.1.1 Indexing Wikipedia articles
10.1.2 Spell-check example
10.1.3 Spell-check search component
10.2 Autosuggesting query terms
10.2.1 Autosuggest request handler
10.2.2 Autosuggest search component
10.3 Suggesting document field values
10.3.1 Using n-grams for suggestions
10.3.2 N-gram-driven request handler
10.4 Suggesting queries based on user activity
10.5 Summary
11 Result grouping/field collapsing
11.1 Result grouping vs. field collapsing
11.2 Skipping duplicate documents
11.3 Returning multiple documents per group
11.4 Grouping by functions and queries
11.4.1 Grouping by function
11.4.2 Grouping by query
11.5 Paging and sorting grouped results
11.6 Grouping gotchas
11.6.1 Faceting upon result groups
11.6.2 Distributed result grouping
11.6.3 Returning a flat list
11.6.4 Grouping on multivalued and tokenized fields
11.6.5 Grouping performance
11.7 Efficient field collapsing with the Collapsing query parser
11.8 Summary
12 Taking Solr to production
12.1 Developing a Solr distribution
12.2 Deploying Solr
12.2.1 Building your Solr distribution
12.2.2 Embedded Solr
12.3 Hardware and server configuration
12.3.1 RAM and SSDs
12.3.2 JVM settings
12.3.3 The index shuffle
12.3.4 Useful system tricks
12.4 Data acquisition strategies
12.5 Sharding and replication
12.5.1 Choosing to shard
12.5.2 Choosing to replicate
12.6 Solr core management
12.7 Managing clusters of servers
12.7.1 Load balancers and Solr health check
12.7.2 Generic vs. customized configuration
12.8 Querying and interacting with Solr
12.8.1 REST API
12.8.2 Available Solr client libraries
12.8.3 Using SolrJ from Java
12.9 Monitoring Solr’s performance
12.9.1 Solr’s Plugins / Stats page
12.9.2 Solr cache performance
12.9.3 Pulling stats from request handlers and MBeans
12.9.4 External monitoring options
12.9.5 Solr logs
12.9.6 Load testing
12.10 Upgrading between Solr versions
12.11 Summary
Part 3—Taking Solr to the next level
13 SolrCloud
13.1 Getting started with SolrCloud
13.1.1 Starting Solr in cloud mode
13.1.2 Motivation behind the SolrCloud architecture
13.2 Core concepts
13.2.1 Collections vs. cores
13.2.2 ZooKeeper
13.2.3 Choosing the number of shards and replicas
13.2.4 Cluster-state management
13.2.5 Shard-leader election
13.2.6 Important SolrCloud configuration settings
13.3 Distributed indexing
13.3.1 Document shard assignment
13.3.2 Adding documents
13.3.3 Near real-time search
13.3.4 Node recovery process
13.4 Distributed search
13.4.1 Multistage query process
13.4.2 Distributed search limitations
13.5 Collections API
13.5.1 Create a collection
13.5.2 Collection aliasing
13.6 Basic system-administration tasks
13.6.1 Configuration updates
13.6.2 Rolling restart
13.6.3 Restarting a failed node
13.6.4 Is node X active?
13.6.5 Adding a replica
13.6.6 Offsite backup
13.7 Advanced topics
13.7.1 Custom hashing
13.7.2 Shard splitting
13.8 Summary
14 Multilingual search
14.1 Why linguistic analysis matters
14.2 Stemming vs. lemmatization
14.3 Stemming in action
14.4 Handling edge cases
14.4.1 KeywordMarkerFilterFactory
14.4.2 StemmerOverrideFilterFactory
14.5 Available language libraries in Solr
14.5.1 Language-specific analyzer chains
14.5.2 Dictionary-based stemming (Hunspell)
14.6 Searching content in multiple languages
14.6.1 Separate field per language
14.6.2 Separate index per language
14.6.3 Multiple languages in one field
14.6.4 Creating a field type to handle multiple languages per field
14.7 Language identification
14.7.1 Update processors for language identification
14.7.2 Dynamically assigning detected language analyzers within a field
14.8 Summary
15 Complex query operations
15.1 Function queries
15.1.1 Function syntax
15.1.2 Searching on functions
15.1.3 Returning functions like fields
15.1.4 Sorting on functions
15.1.5 Available functions in Solr
15.1.6 Implementing a custom function
15.2 Geospatial search
15.2.1 Searching near a single point
15.2.2 Advanced geospatial search
15.3 Pivot faceting
15.4 Referencing external data
15.5 Cross-document and cross-index joins
15.6 Big data analytics with Solr
15.7 Summary
16 Mastering relevancy
16.1 The impact of relevancy tuning
16.2 Debugging the relevancy calculation
16.3 Relevancy boosting
16.3.1 Per-field boosting
16.3.2 Per-term boosting
16.3.3 Payload boosting
16.3.4 Function boosting
16.3.5 Term-proximity boosting
16.3.6 Elevating the relevancy of important documents
16.4 Pluggable Similarity class implementations
16.5 Personalized search and recommendations
16.5.1 Search vs. recommendations
16.5.2 Attribute-based matching
16.5.3 Hierarchical matching
16.5.4 More Like This
16.5.5 Concept-based matching
16.5.6 Geographical matching
16.5.7 Collaborative filtering
16.5.8 Hybrid approaches
16.6 Creating a personalized search experience
16.7 Running relevancy experiments
16.8 Summary
appendix A—Working with the Solr codebase
A.1 Pulling the right version of Solr
A.2 Setting up Solr in your IDE
A.3 Debugging Solr code
A.4 Downloading and applying Solr patches
A.5 Contributing patches
appendix B—Language-specific field type configurations
appendix C—Useful data import configurations
C.1 Indexing Wikipedia
C.2 Indexing Stack Exchange
index
Symbols
A
B
C
D
E
F
G
H
I
J
K
L
M
N
O
P
Q
R
S
T
U
V
W
X
Z
Back cover
Trey Grainger, Timothy Potter
Foreword by Yonik Seeley
MANNING
Solr in Action
TREY GRAINGER
TIMOTHY POTTER
MANNING
SHELTER ISLAND
Download from BookDL (http://bookdl.com)
For online information and ordering of this and other Manning books, please visit www.manning.com. The publisher offers discounts on this book when ordered in quantity. For more information, please contact

Special Sales Department
Manning Publications Co.
20 Baldwin Road
PO Box 261
Shelter Island, NY 11964
Email: orders@manning.com

©2014 by Manning Publications Co. All rights reserved.

No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by means electronic, mechanical, photocopying, or otherwise, without prior written permission of the publisher.

Photographs in this book were created by Martin Evans and Jordan Hochenbaum, unless otherwise noted. Illustrations were created by Martin Evans, Joshua Noble, and Jordan Hochenbaum. Fritzing (fritzing.org) was used to create some of the circuit diagrams.

Many of the designations used by manufacturers and sellers to distinguish their products are claimed as trademarks. Where those designations appear in the book, and Manning Publications was aware of a trademark claim, the designations have been printed in initial caps or all caps.

Recognizing the importance of preserving what has been written, it is Manning’s policy to have the books we publish printed on acid-free paper, and we exert our best efforts to that end. Recognizing also our responsibility to conserve the resources of our planet, Manning books are printed on paper that is at least 15 percent recycled and processed without the use of elemental chlorine.

Manning Publications Co.
20 Baldwin Road
PO Box 261
Shelter Island, NY 11964

Development editors: Elizabeth Lexleigh, Susan Conant
Copyeditor: Melinda Rankin
Proofreader: Elizabeth Martin
Typesetter: Dennis Dalinnik
Cover designer: Marija Tudor

ISBN: 9781617291029
Printed in the United States of America
1 2 3 4 5 6 7 8 9 10 – MAL – 19 18 17 16 15 14
brief contents

PART 1 MEET SOLR ..... 1
1 Introduction to Solr ..... 3
2 Getting to know Solr ..... 26
3 Key Solr concepts ..... 48
4 Configuring Solr ..... 82
5 Indexing ..... 116
6 Text analysis ..... 162

PART 2 CORE SOLR CAPABILITIES ..... 195
7 Performing queries and handling results ..... 197
8 Faceted search ..... 250
9 Hit highlighting ..... 281
10 Query suggestions ..... 306
11 Result grouping/field collapsing ..... 330
12 Taking Solr to production ..... 356

PART 3 TAKING SOLR TO THE NEXT LEVEL ..... 403
13 SolrCloud ..... 405
14 Multilingual search ..... 450
15 Complex query operations ..... 501
16 Mastering relevancy ..... 548
contents

foreword ..... xv
preface ..... xvii
acknowledgments ..... xix
about this book ..... xxi

PART 1 MEET SOLR ..... 1

1 Introduction to Solr ..... 3
1.1 Why do I need a search engine? ..... 4
1.1.1 Managing text-centric data ..... 4
1.1.2 Common search-engine use cases ..... 7
1.2 What is Solr? ..... 9
1.2.1 Information retrieval engine ..... 11
1.2.2 Flexible schema management ..... 13
1.2.3 Java web application ..... 13
1.2.4 Multiple indexes in one server ..... 15
1.2.5 Extendable (plugins) ..... 15
1.2.6 Scalable ..... 15
1.2.7 Fault-tolerant ..... 16
1.3 Why Solr? ..... 17
1.3.1 Solr for the software architect ..... 17
1.3.2 Solr for the system administrator ..... 18
1.3.3 Solr for the CEO ..... 19
1.4 Features overview ..... 19
1.4.1 User-experience features ..... 19
1.4.2 Data-modeling features ..... 21
1.4.3 New features in Solr 4 ..... 23
1.5 Summary ..... 24

2 Getting to know Solr ..... 26
2.1 Getting started ..... 27
2.1.1 Installing Solr ..... 27
2.1.2 Starting the Solr example server ..... 28
2.1.3 Understanding Solr home ..... 32
2.1.4 Indexing the example documents ..... 33
2.2 Searching is what it’s all about ..... 34
2.2.1 Exploring Solr’s query form ..... 34
2.2.2 What comes back from Solr when you search ..... 38
2.2.3 Ranked retrieval ..... 39
2.2.4 Paging and sorting ..... 40
2.2.5 Expanded search features ..... 41
2.3 Tour of the Solr administration console ..... 43
2.4 Adapting the example to your needs ..... 45
2.5 Summary ..... 46

3 Key Solr concepts ..... 48
3.1 Searching, matching, and finding content ..... 49
3.1.1 What is a document? ..... 49
3.1.2 The fundamental search problem ..... 50
3.1.3 The inverted index ..... 53
3.1.4 Terms, phrases, and Boolean logic ..... 54
3.1.5 Finding sets of documents ..... 56
3.1.6 Phrase queries and term positions ..... 59
3.1.7 Fuzzy matching ..... 60
3.1.8 Quick recap ..... 65
3.2 Relevancy ..... 65
3.2.1 Default similarity ..... 65
3.2.2 Term frequency ..... 67
3.2.3 Inverse document frequency ..... 68
3.2.4 Boosting ..... 69
3.2.5 Normalization factors ..... 69
3.3 Precision and Recall ..... 71
3.3.1 Precision ..... 72
3.3.2 Recall ..... 73
3.3.3 Striking the right balance ..... 73
3.4 Searching at scale ..... 74
3.4.1 The denormalized document ..... 75
3.4.2 Distributed searching ..... 77
3.4.3 Clusters vs. servers ..... 78
3.4.4 The limits of Solr ..... 79
3.5 Summary ..... 80

4 Configuring Solr ..... 82
4.1 Overview of solrconfig.xml ..... 85
4.1.1 Common XML data-structure and type elements ..... 87
4.1.2 Applying configuration changes ..... 87
4.1.3 Miscellaneous settings ..... 88
4.2 Query request handling ..... 90
4.2.1 Request-handling overview ..... 90
4.2.2 Search handler ..... 93
4.2.3 Browse request handler for Solritas: an example ..... 94
4.2.4 Extending query processing with search components ..... 98
4.3 Managing searchers ..... 103
4.3.1 New searcher overview ..... 103
4.3.2 Warming a new searcher ..... 104
4.4 Cache management ..... 107
4.4.1 Cache fundamentals ..... 107
4.4.2 Filter cache ..... 109
4.4.3 Query result cache ..... 112
4.4.4 Document cache ..... 113
4.4.5 Field value cache ..... 113
4.5 Remaining configuration options ..... 114
4.6 Summary ..... 114

5 Indexing ..... 116
5.1 Example microblog search application ..... 117
5.1.1 Representing content for searching ..... 117
5.1.2 Overview of the Solr indexing process ..... 119
5.2 Designing your schema ..... 121
5.2.1 Document granularity ..... 121
5.2.2 Unique key ..... 122
5.2.3 Indexed fields ..... 123
5.2.4 Stored fields ..... 123
5.2.5 Preview of schema.xml ..... 124
5.3 Defining fields in schema.xml ..... 125
5.3.1 Required field attributes ..... 126
5.3.2 Multivalued fields ..... 127
5.3.3 Dynamic fields ..... 128
5.3.4 Copy fields ..... 131
5.3.5 Unique key field ..... 133
5.4 Field types for structured nontext fields ..... 133
5.4.1 String fields ..... 134
5.4.2 Date fields ..... 135
5.4.3 Numeric fields ..... 137
5.4.4 Advanced field type attributes ..... 138
5.5 Sending documents to Solr for indexing ..... 141
5.5.1 Indexing documents using XML or JSON ..... 141
5.5.2 Using the SolrJ client library to add documents from Java ..... 144
5.5.3 Other tools for importing documents into Solr ..... 146
5.6 Update handler ..... 147
5.6.1 Committing documents to the index ..... 148
5.6.2 Transaction log ..... 151
5.6.3 Atomic updates ..... 152
5.7 Index management ..... 155
5.7.1 Index storage ..... 155
5.7.2 Segment merging ..... 158
5.8 Summary ..... 160