
Fundamentals of Predictive Text Mining (Second Edition)

Cover
Preface
Contents
1 Overview of Text Mining
1.1 What's Special About Text Mining?
1.1.1 Structured or Unstructured Data?
1.1.2 Is Text Different from Numbers?
1.2 What Types of Problems Can Be Solved?
1.3 Document Classification
1.4 Information Retrieval
1.5 Clustering and Organizing Documents
1.6 Information Extraction
1.7 Prediction and Evaluation
1.8 The Next Chapters
1.9 Summary
1.10 Historical and Bibliographical Remarks
1.11 Questions and Exercises
2 From Textual Information to Numerical Vectors
2.1 Collecting Documents
2.2 Document Standardization
2.3 Tokenization
2.4 Lemmatization
2.4.1 Inflectional Stemming
2.4.2 Stemming to a Root
2.5 Vector Generation for Prediction
2.5.1 Multiword Features
2.5.2 Labels for the Right Answers
2.5.3 Feature Selection by Attribute Ranking
2.6 Sentence Boundary Determination
2.7 Part-of-Speech Tagging
2.8 Word Sense Disambiguation
2.9 Phrase Recognition
2.10 Named Entity Recognition
2.11 Parsing
2.12 Feature Generation
2.13 Summary
2.14 Historical and Bibliographical Remarks
2.15 Questions and Exercises
3 Using Text for Prediction
3.1 Recognizing that Documents Fit a Pattern
3.2 How Many Documents Are Enough?
3.3 Document Classification
3.4 Learning to Predict from Text
3.4.1 Similarity and Nearest-Neighbor Methods
3.4.2 Document Similarity
3.4.3 Decision Rules
3.4.4 Decision Trees
3.4.5 Scoring by Probabilities
3.4.6 Linear Scoring Methods
3.5 Evaluation of Performance
3.5.1 Estimating Current and Future Performance
3.5.2 Getting the Most from a Learning Method
3.5.3 Errors and Pitfalls in Big Data Evaluation
3.6 Applications
3.7 Graph Models for Social Networks
3.8 Summary
3.9 Historical and Bibliographical Remarks
3.10 Questions and Exercises
4 Information Retrieval and Text Mining
4.1 Is Information Retrieval a Form of Text Mining?
4.2 Key Word Search
4.3 Nearest-Neighbor Methods
4.4 Measuring Similarity
4.4.1 Shared Word Count
4.4.2 Word Count and Bonus
4.4.3 Cosine Similarity
4.5 Web-Based Document Search
4.5.1 Link Analysis
4.6 Document Matching
4.7 Inverted Lists
4.8 Evaluation of Performance
4.9 Summary
4.10 Historical and Bibliographical Remarks
4.11 Questions and Exercises
5 Finding Structure in a Document Collection
5.1 Clustering Documents by Similarity
5.2 Similarity of Composite Documents
5.2.1 k-Means Clustering
5.2.2 Hierarchical Clustering
5.2.3 The EM Algorithm
5.3 What Do a Cluster's Labels Mean?
5.4 Applications
5.5 Evaluation of Performance
5.6 Summary
5.7 Historical and Bibliographical Remarks
5.8 Questions and Exercises
6 Looking for Information in Documents
6.1 Goals of Information Extraction
6.2 Finding Patterns and Entities from Text
6.2.1 Entity Extraction as Sequential Tagging
6.2.2 Tag Prediction as Classification
6.2.3 The Maximum Entropy Method
6.2.4 Linguistic Features and Encoding
6.2.5 Local Sequence Prediction Models
6.2.6 Global Sequence Prediction Models
6.3 Coreference and Relationship Extraction
6.3.1 Coreference Resolution
6.3.2 Relationship Extraction
6.4 Template Filling and Database Construction
6.5 Applications
6.5.1 Information Retrieval
6.5.2 Commercial Extraction Systems
6.5.3 Criminal Justice
6.5.4 Intelligence
6.6 Summary
6.7 Historical and Bibliographical Remarks
6.8 Questions and Exercises
7 Data Sources for Prediction: Databases, Hybrid Data and the Web
7.1 Ideal Models of Data
7.1.1 Ideal Data for Prediction
7.1.2 Ideal Data for Text and Unstructured Data
7.1.3 Hybrid and Mixed Data
7.2 Practical Data Sourcing
7.3 Prototypical Examples
7.3.1 Web-Based Spreadsheet Data
7.3.2 Web-Based XML Data
7.3.3 Opinion Data and Sentiment Analysis
7.4 Hybrid Example: Independent Sources of Numerical and Text Data
7.5 Mixed Data in Standard Table Format
7.6 Summary
7.7 Historical and Bibliographical Remarks
7.8 Questions and Exercises
8 Case Studies
8.1 Market Intelligence from the Web
8.1.1 The Problem
8.1.2 Solution Overview
8.1.3 Methods and Procedures
8.1.4 System Deployment
8.2 Lightweight Document Matching for Digital Libraries
8.2.1 The Problem
8.2.2 Solution Overview
8.2.3 Methods and Procedures
8.2.4 System Deployment
8.3 Generating Model Cases for Help Desk Applications
8.3.1 The Problem
8.3.2 Solution Overview
8.3.3 Methods and Procedures
8.3.4 System Deployment
8.4 Assigning Topics to News Articles
8.4.1 The Problem
8.4.2 Solution Overview
8.4.3 Methods and Procedures
8.4.4 System Deployment
8.5 E-mail Filtering
8.5.1 The Problem
8.5.2 Solution Overview
8.5.3 Methods and Procedures
8.5.4 System Deployment
8.6 Search Engines
8.6.1 The Problem
8.6.2 Solution Overview
8.6.3 Methods and Procedures
8.6.4 System Deployment
8.7 Extracting Named Entities from Documents
8.7.1 The Problem
8.7.2 Solution Overview
8.7.3 Methods and Procedures
8.7.4 System Deployment
8.8 Mining Social Media
8.8.1 The Problem
8.8.2 Solution Overview
8.8.3 Methods and Procedures
8.8.4 System Deployment
8.9 Customized Newspapers
8.9.1 The Problem
8.9.2 Solution Overview
8.9.3 Methods and Procedures
8.9.4 System Deployment
8.10 Summary
8.11 Historical and Bibliographical Remarks
8.12 Questions and Exercises
9 Emerging Directions
9.1 Summarization
9.2 Active Learning
9.3 Learning with Unlabeled Data
9.4 Different Ways of Collecting Samples
9.4.1 Ensembles and Voting Methods
9.4.2 Online Learning
9.4.3 Deep Learning
9.4.4 Cost-Sensitive Learning
9.4.5 Unbalanced Samples and Rare Events
9.5 Distributed Text Mining
9.6 Learning to Rank
9.7 Question Answering
9.8 Summary
9.9 Historical and Bibliographical Remarks
9.10 Questions and Exercises
References
Author Index
Subject Index
Texts in Computer Science
Series editors: David Gries, Fred B. Schneider

Sholom M. Weiss, Nitin Indurkhya, Tong Zhang
Fundamentals of Predictive Text Mining, Second Edition
More information about this series at http://www.springer.com/series/3191
Sholom M. Weiss, Department of Computer Science, Rutgers University, Piscataway, NJ, USA
Nitin Indurkhya, School of Computer Science and Engineering, University of New South Wales, Sydney, NSW, Australia
Tong Zhang, Department of Statistics, Hill Center, Rutgers University, Piscataway, NJ, USA

Series editors:
David Gries, Department of Computer Science, Cornell University, Ithaca, NY, USA
Fred B. Schneider, Department of Computer Science, Cornell University, Ithaca, NY, USA

Texts in Computer Science
ISSN 1868-0941        ISSN 1868-095X (electronic)
ISBN 978-1-4471-6749-5        ISBN 978-1-4471-6750-1 (eBook)
DOI 10.1007/978-1-4471-6750-1
Library of Congress Control Number: 2015946744

Springer London Heidelberg New York Dordrecht
© Springer-Verlag London 2010, 2015

This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed.

The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use.

The publisher, the authors and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, express or implied, with respect to the material contained herein or for any errors or omissions that may have been made.

Printed on acid-free paper

Springer-Verlag London Ltd. is part of Springer Science+Business Media (www.springer.com)
Preface

Previously we authored "Text Mining: Predictive Methods for Analyzing Unstructured Information." That book was geared mostly to professional practitioners, but was adaptable to course work with some effort by the instructor. Many topics were evolving, and this was one of the earliest efforts to collect material for predictive text mining. Since then, the book has seen extensive use in education, by ourselves and other instructors, with positive responses from students. With more data sourced from the Internet, the field has seen very rapid growth, with many new techniques that would interest practitioners. Given the amount of supplementary new material we had begun using, a new edition was clearly needed. A year ago, our publisher asked us to update the book and to add material that would extend its use as a textbook. We have revised many sections, adding new material to reflect the increased use of the web. Exercises and summaries are also provided.

The prediction problem, looking for predictive patterns in data, has been widely studied. Strong methods are available to the practitioner. These methods process structured numerical information, where uniform measurements are taken over a sample of data. Text is often described as unstructured information. So, it would seem, text and numerical data are different, requiring different methods. Or are they?

In our view, a prediction problem can be solved by the same methods, whether the data are structured numerical measurements or unstructured text. Text and documents can be transformed into measured values, such as the presence or absence of words, and the same methods that have proven successful for predictive data mining can be applied to text. Yet, there are key differences. Evaluation techniques must be adapted to the chronological order of publication and to alternative measures of error. Because the data are documents, more specialized analytical methods may be preferred for text. Moreover, the methods must be modified to accommodate very high dimensions: tens of thousands of words and documents. Still, the central themes are similar.
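The transformation just described, from raw documents to measured values such as the presence or absence of words, can be sketched in a few lines of code. This is a minimal illustration only, not the book's own tooling: the function names are invented here, and whitespace splitting stands in for the real tokenization discussed in Chapter 2.

```python
def build_vocabulary(docs):
    """Collect the sorted set of distinct lowercase tokens across all documents."""
    return sorted({word for doc in docs for word in doc.lower().split()})

def to_binary_vector(doc, vocab):
    """Map a document to a 0/1 vector: 1 if the vocabulary word occurs in it."""
    tokens = set(doc.lower().split())
    return [1 if word in tokens else 0 for word in vocab]

docs = ["the cat sat", "the dog barked"]
vocab = build_vocabulary(docs)          # ['barked', 'cat', 'dog', 'sat', 'the']
vectors = [to_binary_vector(d, vocab) for d in docs]
```

Once every document is a fixed-length numerical vector like this, any standard predictive method, such as the decision trees or linear scoring methods of Chapter 3, can be trained on text without modification, although at far higher dimension than typical numerical data.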
Our view of text mining allows us to unify the concepts of different fields. No longer is "natural language processing" the sole domain of linguists and their allied computer specialists. No longer is search engine technology distinct from other forms of machine learning. Ours is an open view. We welcome you to try your hand at learning from data, whether numerical or text. Large text collections, often readily available on the Internet, contain valuable information that can be mined with today's tools instead of waiting for tomorrow's linguistic techniques. While others search for the essence of language understanding, we can immediately look for recurring word patterns in large collections of digital documents.

Our main theme is a strictly empirical view of text mining and an application of well-known analytical methods. Our presentation has a pragmatic bent, with numerous references to the research literature for you to follow when so inclined. We want to be practical, yet inclusive of the wide community that might be interested in applications of text mining. We concentrate on predictive learning methods but also look at information retrieval and search engines, as well as clustering methods. We illustrate by examples and case studies. While some analytical methods may be highly developed, predictive text mining is an emerging area of application. We have tried to summarize our experiences and provide the tools and techniques for your own experiments.

Audience

Our book is aimed at IT professionals and managers as well as advanced undergraduate computer science students and beginning graduate students. Some background in data mining is beneficial but is not essential. A few sections discuss advanced concepts that require mathematical maturity for a proper understanding. In such sections, intuitive explanations are also provided that may suffice for the less advanced reader. Most parts of the book can be read and understood by anyone with sufficient analytic bent. If you are looking to do research in the area, the material in this book will provide direction in expanding your horizons. If you want to be a practitioner of text mining, you can read about our recommended methods and our descriptions of case studies.

For Instructors

The material in this book has been successfully used for education in a variety of ways, ranging from short intensive one-week courses to twelve-week full-semester courses. In short courses, the mathematical material can be skipped. The exercises have undergone class-testing over several years. Each chapter has the following accompanying material: a chapter summary and exercises. In addition, numerous examples and figures are interlaced throughout the book. Slides, sample solutions to selected exercises, and suggestions for using the book in courses are available from the publisher's companion site for this book.

Optional Software

AI Data-Miner LLC has provided a free software license for those who have purchased the book. The software, which implements many of the methods discussed in the book, can be downloaded from the data-miner.com Web site. Linux scripts for many examples are also available for download. The software requires familiarity with running command-line programs and editing configuration files. See http://www.data-miner.com for details.

Second Edition Updates

The book has been thoroughly revised and updated to reflect developments in the field since the first edition was published. The following new sections have been added to the second edition: Deep Learning; Graph Modeling; Mining Social Media; Errors and Pitfalls in Big Data Evaluation; Twitter Sentiment Analysis; Introduction to Dependency Parsing.

Acknowledgments

Fred Damerau, our colleague and mentor, was a co-author of our original book. He is no longer with us, and his contributions to our project, especially his expertise in linguistics, were immeasurable. Some of the case studies in Chap. 8 are based on our prior publications. In those projects, we acknowledge the participation of Chidanand Apté, Radu Florian, Abraham Ittycheriah, Vijay Iyengar, Hongyan Jing, David Johnson, Frank Oles, Naval Verma, and Brian White. Arindam Banerjee