Cover
Preface
Contents
1 Overview of Text Mining
1.1 What's Special About Text Mining?
1.1.1 Structured or Unstructured Data?
1.1.2 Is Text Different from Numbers?
1.2 What Types of Problems Can Be Solved?
1.3 Document Classification
1.4 Information Retrieval
1.5 Clustering and Organizing Documents
1.6 Information Extraction
1.7 Prediction and Evaluation
1.8 The Next Chapters
1.9 Summary
1.10 Historical and Bibliographical Remarks
1.11 Questions and Exercises
2 From Textual Information to Numerical Vectors
2.1 Collecting Documents
2.2 Document Standardization
2.3 Tokenization
2.4 Lemmatization
2.4.1 Inflectional Stemming
2.4.2 Stemming to a Root
2.5 Vector Generation for Prediction
2.5.1 Multiword Features
2.5.2 Labels for the Right Answers
2.5.3 Feature Selection by Attribute Ranking
2.6 Sentence Boundary Determination
2.7 Part-of-Speech Tagging
2.8 Word Sense Disambiguation
2.9 Phrase Recognition
2.10 Named Entity Recognition
2.11 Parsing
2.12 Feature Generation
2.13 Summary
2.14 Historical and Bibliographical Remarks
2.15 Questions and Exercises
3 Using Text for Prediction
3.1 Recognizing that Documents Fit a Pattern
3.2 How Many Documents Are Enough?
3.3 Document Classification
3.4 Learning to Predict from Text
3.4.1 Similarity and Nearest-Neighbor Methods
3.4.2 Document Similarity
3.4.3 Decision Rules
3.4.4 Decision Trees
3.4.5 Scoring by Probabilities
3.4.6 Linear Scoring Methods
3.5 Evaluation of Performance
3.5.1 Estimating Current and Future Performance
3.5.2 Getting the Most from a Learning Method
3.5.3 Errors and Pitfalls in Big Data Evaluation
3.6 Applications
3.7 Graph Models for Social Networks
3.8 Summary
3.9 Historical and Bibliographical Remarks
3.10 Questions and Exercises
4 Information Retrieval and Text Mining
4.1 Is Information Retrieval a Form of Text Mining?
4.2 Key Word Search
4.3 Nearest-Neighbor Methods
4.4 Measuring Similarity
4.4.1 Shared Word Count
4.4.2 Word Count and Bonus
4.4.3 Cosine Similarity
4.5 Web-Based Document Search
4.5.1 Link Analysis
4.6 Document Matching
4.7 Inverted Lists
4.8 Evaluation of Performance
4.9 Summary
4.10 Historical and Bibliographical Remarks
4.11 Questions and Exercises
5 Finding Structure in a Document Collection
5.1 Clustering Documents by Similarity
5.2 Similarity of Composite Documents
5.2.1 k-Means Clustering
5.2.2 Hierarchical Clustering
5.2.3 The EM Algorithm
5.3 What Do a Cluster's Labels Mean?
5.4 Applications
5.5 Evaluation of Performance
5.6 Summary
5.7 Historical and Bibliographical Remarks
5.8 Questions and Exercises
6 Looking for Information in Documents
6.1 Goals of Information Extraction
6.2 Finding Patterns and Entities from Text
6.2.1 Entity Extraction as Sequential Tagging
6.2.2 Tag Prediction as Classification
6.2.3 The Maximum Entropy Method
6.2.4 Linguistic Features and Encoding
6.2.5 Local Sequence Prediction Models
6.2.6 Global Sequence Prediction Models
6.3 Coreference and Relationship Extraction
6.3.1 Coreference Resolution
6.3.2 Relationship Extraction
6.4 Template Filling and Database Construction
6.5 Applications
6.5.1 Information Retrieval
6.5.2 Commercial Extraction Systems
6.5.3 Criminal Justice
6.5.4 Intelligence
6.6 Summary
6.7 Historical and Bibliographical Remarks
6.8 Questions and Exercises
7 Data Sources for Prediction: Databases, Hybrid Data and the Web
7.1 Ideal Models of Data
7.1.1 Ideal Data for Prediction
7.1.2 Ideal Data for Text and Unstructured Data
7.1.3 Hybrid and Mixed Data
7.2 Practical Data Sourcing
7.3 Prototypical Examples
7.3.1 Web-Based Spreadsheet Data
7.3.2 Web-Based XML Data
7.3.3 Opinion Data and Sentiment Analysis
7.4 Hybrid Example: Independent Sources of Numerical and Text Data
7.5 Mixed Data in Standard Table Format
7.6 Summary
7.7 Historical and Bibliographical Remarks
7.8 Questions and Exercises
8 Case Studies
8.1 Market Intelligence from the Web
8.1.1 The Problem
8.1.2 Solution Overview
8.1.3 Methods and Procedures
8.1.4 System Deployment
8.2 Lightweight Document Matching for Digital Libraries
8.2.1 The Problem
8.2.2 Solution Overview
8.2.3 Methods and Procedures
8.2.4 System Deployment
8.3 Generating Model Cases for Help Desk Applications
8.3.1 The Problem
8.3.2 Solution Overview
8.3.3 Methods and Procedures
8.3.4 System Deployment
8.4 Assigning Topics to News Articles
8.4.1 The Problem
8.4.2 Solution Overview
8.4.3 Methods and Procedures
8.4.4 System Deployment
8.5 E-mail Filtering
8.5.1 The Problem
8.5.2 Solution Overview
8.5.3 Methods and Procedures
8.5.4 System Deployment
8.6 Search Engines
8.6.1 The Problem
8.6.2 Solution Overview
8.6.3 Methods and Procedures
8.6.4 System Deployment
8.7 Extracting Named Entities from Documents
8.7.1 The Problem
8.7.2 Solution Overview
8.7.3 Methods and Procedures
8.7.4 System Deployment
8.8 Mining Social Media
8.8.1 The Problem
8.8.2 Solution Overview
8.8.3 Methods and Procedures
8.8.4 System Deployment
8.9 Customized Newspapers
8.9.1 The Problem
8.9.2 Solution Overview
8.9.3 Methods and Procedures
8.9.4 System Deployment
8.10 Summary
8.11 Historical and Bibliographical Remarks
8.12 Questions and Exercises
9 Emerging Directions
9.1 Summarization
9.2 Active Learning
9.3 Learning with Unlabeled Data
9.4 Different Ways of Collecting Samples
9.4.1 Ensembles and Voting Methods
9.4.2 Online Learning
9.4.3 Deep Learning
9.4.4 Cost-Sensitive Learning
9.4.5 Unbalanced Samples and Rare Events
9.5 Distributed Text Mining
9.6 Learning to Rank
9.7 Question Answering
9.8 Summary
9.9 Historical and Bibliographical Remarks
9.10 Questions and Exercises
References
Author Index
Subject Index