Taming Text
brief contents
contents
foreword
preface
acknowledgments
Grant Ingersoll
Tom Morton
Drew Farris
about this book
Who should read this book
Roadmap
Code conventions and downloads
Author Online
about the cover illustration
1 Getting started taming text
1.1 Why taming text is important
1.2 Preview: A fact-based question answering system
1.2.1 Hello, Dr. Frankenstein
1.3 Understanding text is hard
1.4 Text, tamed
1.5 Text and the intelligent app: search and beyond
1.5.1 Searching and matching
1.5.2 Extracting information
1.5.3 Grouping information
1.5.4 An intelligent application
1.6 Summary
1.7 Resources
2 Foundations of taming text
2.1 Foundations of language
2.1.1 Words and their categories
2.1.2 Phrases and clauses
2.1.3 Morphology
2.2 Common tools for text processing
2.2.1 String manipulation tools
2.2.2 Tokens and tokenization
2.2.3 Part of speech assignment
2.2.4 Stemming
2.2.5 Sentence detection
2.2.6 Parsing and grammar
2.2.7 Sequence modeling
2.3 Preprocessing and extracting content from common file formats
2.3.1 The importance of preprocessing
2.3.2 Extracting content using Apache Tika
2.4 Summary
2.5 Resources
3 Searching
3.1 Search and faceting example: Amazon.com
3.2 Introduction to search concepts
3.2.1 Indexing content
3.2.2 User input
3.2.3 Ranking documents with the vector space model
3.2.4 Results display
3.3 Introducing the Apache Solr search server
3.3.1 Running Solr for the first time
3.3.2 Understanding Solr concepts
3.4 Indexing content with Apache Solr
3.4.1 Indexing using XML
3.4.2 Extracting and indexing content using Solr and Apache Tika
3.5 Searching content with Apache Solr
3.5.1 Solr query input parameters
3.5.2 Faceting on extracted content
3.6 Understanding search performance factors
3.6.1 Judging quality
3.6.2 Judging quantity
3.7 Improving search performance
3.7.1 Hardware improvements
3.7.2 Analysis improvements
3.7.3 Query performance improvements
3.7.4 Alternative scoring models
3.7.5 Techniques for improving Solr performance
3.8 Search alternatives
3.9 Summary
3.10 Resources
4 Fuzzy string matching
4.1 Approaches to fuzzy string matching
4.1.1 Character overlap measures
4.1.2 Edit distance measures
4.1.3 N-gram edit distance
4.2 Finding fuzzy string matches
4.2.1 Using prefixes for matching with Solr
4.2.2 Using a trie for prefix matching
4.2.3 Using n-grams for matching
4.3 Building fuzzy string matching applications
4.3.1 Adding type-ahead to search
4.3.2 Query spell-checking for search
4.3.3 Record matching
4.4 Summary
4.5 Resources
5 Identifying people, places, and things
5.1 Approaches to named-entity recognition
5.1.1 Using rules to identify names
5.1.2 Using statistical classifiers to identify names
5.2 Basic entity identification with OpenNLP
5.2.1 Finding names with OpenNLP
5.2.2 Interpreting names identified by OpenNLP
5.2.3 Filtering names based on probability
5.3 In-depth entity identification with OpenNLP
5.3.1 Identifying multiple entity types with OpenNLP
5.3.2 Under the hood: how OpenNLP identifies names
5.4 Performance of OpenNLP
5.4.1 Quality of results
5.4.2 Runtime performance
5.4.3 Memory usage in OpenNLP
5.5 Customizing OpenNLP entity identification for a new domain
5.5.1 The whys and hows of training a model
5.5.2 Training an OpenNLP model
5.5.3 Altering modeling inputs
5.5.4 A new way to model names
5.6 Summary
5.7 Further reading
6 Clustering text
6.1 Google News document clustering
6.2 Clustering foundations
6.2.1 Three types of text to cluster
6.2.2 Choosing a clustering algorithm
6.2.3 Determining similarity
6.2.4 Labeling the results
6.2.5 How to evaluate clustering results
6.3 Setting up a simple clustering application
6.4 Clustering search results using Carrot2
6.4.1 Using the Carrot2 API
6.4.2 Clustering Solr search results using Carrot2
6.5 Clustering document collections with Apache Mahout
6.5.1 Preparing the data for clustering
6.5.2 K-Means clustering
6.6 Topic modeling using Apache Mahout
6.7 Examining clustering performance
6.7.1 Feature selection and reduction
6.7.2 Carrot2 performance and quality
6.7.3 Mahout clustering benchmarks
6.8 Acknowledgments
6.9 Summary
6.10 References
7 Classification, categorization, and tagging
7.1 Introduction to classification and categorization
7.2 The classification process
7.2.1 Choosing a classification scheme
7.2.2 Identifying features for text categorization
7.2.3 The importance of training data
7.2.4 Evaluating classifier performance
7.2.5 Deploying a classifier into production
7.3 Building document categorizers using Apache Lucene
7.3.1 Categorizing text with Lucene
7.3.2 Preparing the training data for the MoreLikeThis categorizer
7.3.3 Training the MoreLikeThis categorizer
7.3.4 Categorizing documents with the MoreLikeThis categorizer
7.3.5 Testing the MoreLikeThis categorizer
7.3.6 MoreLikeThis in production
7.4 Training a naive Bayes classifier using Apache Mahout
7.4.1 Categorizing text using naive Bayes classification
7.4.2 Preparing the training data
7.4.3 Withholding test data
7.4.4 Training the classifier
7.4.5 Testing the classifier
7.4.6 Improving the bootstrapping process
7.4.7 Integrating the Mahout Bayes classifier with Solr
7.5 Categorizing documents with OpenNLP
7.5.1 Regression models and maximum entropy document categorization
7.5.2 Preparing training data for the maximum entropy document categorizer
7.5.3 Training the maximum entropy document categorizer
7.5.4 Testing the maximum entropy document classifier
7.5.5 Maximum entropy document categorization in production
7.6 Building a tag recommender using Apache Solr
7.6.1 Collecting training data for tag recommendations
7.6.2 Preparing the training data
7.6.3 Training the Solr tag recommender
7.6.4 Creating tag recommendations
7.6.5 Evaluating the tag recommender
7.7 Summary
7.8 References
8 Building an example question answering system
8.1 Basics of a question answering system
8.2 Installing and running the QA code
8.3 A sample question answering architecture
8.4 Understanding questions and producing answers
8.4.1 Training the answer type classifier
8.4.2 Chunking the query
8.4.3 Computing the answer type
8.4.4 Generating the query
8.4.5 Ranking candidate passages
8.5 Steps to improve the system
8.6 Summary
8.7 Resources
9 Untamed text: exploring the next frontier
9.1 Semantics, discourse, and pragmatics: exploring higher levels of NLP
9.1.1 Semantics
9.1.2 Discourse
9.1.3 Pragmatics
9.2 Document and collection summarization
9.3 Relationship extraction
9.3.1 Overview of approaches
9.3.2 Evaluation
9.3.3 Tools for relationship extraction
9.4 Identifying important content and people
9.4.1 Global importance and authoritativeness
9.4.2 Personal importance
9.4.3 Resources and pointers on importance
9.5 Detecting emotions via sentiment analysis
9.5.1 History and review
9.5.2 Tools and data needs
9.5.3 A basic polarity algorithm
9.5.4 Advanced topics
9.5.5 Open source libraries for sentiment analysis
9.6 Cross-language information retrieval
9.7 Summary
9.8 References
index
Taming Text
How to Find, Organize, and Manipulate It

Grant S. Ingersoll
Thomas S. Morton
Andrew L. Farris

Manning
Shelter Island
For online information and ordering of this and other Manning books, please visit www.manning.com. The publisher offers discounts on this book when ordered in quantity. For more information, please contact:

Special Sales Department
Manning Publications Co.
20 Baldwin Road
PO Box 261
Shelter Island, NY 11964
Email: orders@manning.com

©2013 by Manning Publications Co. All rights reserved.

No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by means electronic, mechanical, photocopying, or otherwise, without prior written permission of the publisher.

Many of the designations used by manufacturers and sellers to distinguish their products are claimed as trademarks. Where those designations appear in the book, and Manning Publications was aware of a trademark claim, the designations have been printed in initial caps or all caps.

Recognizing the importance of preserving what has been written, it is Manning's policy to have the books we publish printed on acid-free paper, and we exert our best efforts to that end. Recognizing also our responsibility to conserve the resources of our planet, Manning books are printed on paper that is at least 15 percent recycled and processed without the use of elemental chlorine.

Manning Publications Co.
20 Baldwin Road
PO Box 261
Shelter Island, NY 11964

Development editor: Jeff Bleiel
Technical proofreader: Steven Rowe
Copyeditor: Benjamin Berg
Proofreader: Katie Tennant
Typesetter: Dottie Marsico
Cover designer: Marija Tudor

ISBN 9781933988382
Printed in the United States of America
1 2 3 4 5 6 7 8 9 10 – MAL – 18 17 16 15 14 13
brief contents

1 ■ Getting started taming text 1
2 ■ Foundations of taming text 16
3 ■ Searching 37
4 ■ Fuzzy string matching 84
5 ■ Identifying people, places, and things 115
6 ■ Clustering text 140
7 ■ Classification, categorization, and tagging 175
8 ■ Building an example question answering system 240
9 ■ Untamed text: exploring the next frontier 260
contents

foreword xiii
preface xiv
acknowledgments xvii
about this book xix
about the cover illustration xxii

1 Getting started taming text 1
1.1 Why taming text is important 2
1.2 Preview: A fact-based question answering system 4
    Hello, Dr. Frankenstein 5
1.3 Understanding text is hard 8
1.4 Text, tamed 10
1.5 Text and the intelligent app: search and beyond 11
    Searching and matching 12 ■ Extracting information 13
    Grouping information 13 ■ An intelligent application 14
1.6 Summary 14
1.7 Resources 14

2 Foundations of taming text 16
2.1 Foundations of language 17
    Words and their categories 18 ■ Phrases and clauses 19
    Morphology 20