Text Data Management and Analysis

Cover
Copyright
Contents
Preface
PART I. OVERVIEW AND BACKGROUND
1. Introduction
2. Background
3. Text Data Understanding
4. MeTA: A Unified Toolkit for Text Data Management and Analysis
PART II. TEXT DATA ACCESS
5. Overview of Text Data Access
6. Retrieval Models
7. Feedback
8. Search Engine Implementation
9. Search Engine Evaluation
10. Web Search
11. Recommender Systems
PART III. TEXT DATA ANALYSIS
12. Overview of Text Data Analysis
13. Word Association Mining
14. Text Clustering
15. Text Categorization
16. Text Summarization
17. Topic Analysis
18. Opinion Mining and Sentiment Analysis
19. Joint Analysis of Text and Structured Data
PART IV. UNIFIED TEXT DATA MANAGEMENT ANALYSIS SYSTEM
20. Toward a Unified System for Text Management and Analysis
App. A. Bayesian Statistics
App. B. Expectation-Maximization
App. C. KL-divergence and Dirichlet Prior Smoothing
References
Index
Authors' Biographies
Text Data Management and Analysis
A Practical Introduction to Information Retrieval and Text Mining
ChengXiang Zhai and Sean Massung

Recent years have seen a dramatic growth of natural language text data, including web pages, news articles, scientific literature, emails, enterprise documents, and social media such as blog articles, forum posts, product reviews, and tweets. This has led to an increasing demand for powerful software tools to help people manage and analyze vast amounts of text data effectively and efficiently. Unlike data generated by a computer system or sensors, text data are usually generated directly by humans, and capture semantically rich content. As such, text data are especially valuable for discovering knowledge about human opinions and preferences, in addition to many other kinds of knowledge that we encode in text. In contrast to structured data, which conform to well-defined schemas (thus are relatively easy for computers to handle), text has less explicit structure, requiring computer processing toward understanding of the content encoded in text. The current technology of natural language processing has not yet reached a point to enable a computer to precisely understand natural language text, but a wide range of statistical and heuristic approaches to management and analysis of text data have been developed over the past few decades. They are usually very robust and can be applied to analyze and manage text data in any natural language, and about any topic.

This book provides a systematic introduction to many of these approaches, with an emphasis on covering the most useful knowledge and skills required to build a variety of practically useful text information systems. Because humans can understand natural languages far better than computers can, effective involvement of humans in a text information system is generally needed and text information systems often serve as intelligent assistants for humans.

Depending on how a text information system collaborates with humans, we distinguish two kinds of text information systems. The first is information retrieval systems which include search engines and recommender systems; they assist users in finding from a large collection of text data the most relevant text data that are actually needed for solving a specific application problem, thus effectively turning big raw text data into much smaller relevant text data that can be more easily processed by humans. The second is text mining application systems; they can assist users in analyzing patterns in text data to extract and discover useful actionable knowledge directly useful for task completion or decision making, thus providing more direct task support for users.

ABOUT ACM BOOKS
ACM Books is a new series of high-quality books for the computer science community, published by ACM in collaboration with Morgan & Claypool Publishers. ACM Books publications are widely distributed in both print and digital formats through booksellers and to libraries (and library consortia) and individual ACM members via the ACM Digital Library platform.

ISBN: 978-1-97000-116-7
books.acm.org • www.morganclaypool.com
Text Data Management and Analysis
ACM Books

Editor in Chief
M. Tamer Özsu, University of Waterloo

ACM Books is a new series of high-quality books for the computer science community, published by ACM in collaboration with Morgan & Claypool Publishers. ACM Books publications are widely distributed in both print and digital formats through booksellers and to libraries (and library consortia) and individual ACM members via the ACM Digital Library platform.

Text Data Management and Analysis: A Practical Introduction to Information Retrieval and Text Mining
ChengXiang Zhai, University of Illinois at Urbana–Champaign
Sean Massung, University of Illinois at Urbana–Champaign
2016

An Architecture for Fast and General Data Processing on Large Clusters
Matei Zaharia, Massachusetts Institute of Technology
2016

Reactive Internet Programming: State Chart XML in Action
Franck Barbier, University of Pau, France
2016

Verified Functional Programming in Agda
Aaron Stump, The University of Iowa
2016

The VR Book: Human-Centered Design for Virtual Reality
Jason Jerald, NextGen Interactions
2016

Ada’s Legacy: Cultures of Computing from the Victorian to the Digital Age
Robin Hammerman, Stevens Institute of Technology
Andrew L. Russell, Stevens Institute of Technology
2016

Edmund Berkeley and the Social Responsibility of Computer Professionals
Bernadette Longo, New Jersey Institute of Technology
2015

Candidate Multilinear Maps
Sanjam Garg, University of California, Berkeley
2015

Smarter than Their Machines: Oral Histories of Pioneers in Interactive Computing
John Cullinane, Northeastern University; Mossavar-Rahmani Center for Business and Government, John F. Kennedy School of Government, Harvard University
2015

A Framework for Scientific Discovery through Video Games
Seth Cooper, University of Washington
2014

Trust Extension as a Mechanism for Secure Code Execution on Commodity Computers
Bryan Jeffrey Parno, Microsoft Research
2014

Embracing Interference in Wireless Systems
Shyamnath Gollakota, University of Washington
2014
Text Data Management and Analysis
A Practical Introduction to Information Retrieval and Text Mining

ChengXiang Zhai
University of Illinois at Urbana–Champaign

Sean Massung
University of Illinois at Urbana–Champaign

ACM Books #12
Copyright © 2016 by the Association for Computing Machinery and Morgan & Claypool Publishers

All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted in any form or by any means (electronic, mechanical, photocopy, recording, or any other, except for brief quotations in printed reviews) without the prior permission of the publisher.

Designations used by companies to distinguish their products are often claimed as trademarks or registered trademarks. In all instances in which Morgan & Claypool is aware of a claim, the product names appear in initial capital or all capital letters. Readers, however, should contact the appropriate companies for more complete information regarding trademarks and registration.

Text Data Management and Analysis
ChengXiang Zhai and Sean Massung

books.acm.org
www.morganclaypoolpublishers.com

ISBN: 978-1-97000-119-8 hardcover
ISBN: 978-1-97000-116-7 paperback
ISBN: 978-1-97000-117-4 ebook
ISBN: 978-1-97000-118-1 ePub

Series ISSN: 2374-6769 print, 2374-6777 electronic

DOIs:
10.1145/2915031 Book
10.1145/2915031.2915032 Preface
10.1145/2915031.2915033 Chapter 1
10.1145/2915031.2915034 Chapter 2
10.1145/2915031.2915035 Chapter 3
10.1145/2915031.2915036 Chapter 4
10.1145/2915031.2915037 Chapter 5
10.1145/2915031.2915038 Chapter 6
10.1145/2915031.2915039 Chapter 7
10.1145/2915031.2915040 Chapter 8
10.1145/2915031.2915041 Chapter 9
10.1145/2915031.2915042 Chapter 10
10.1145/2915031.2915043 Chapter 11
10.1145/2915031.2915044 Chapter 12
10.1145/2915031.2915045 Chapter 13
10.1145/2915031.2915046 Chapter 14
10.1145/2915031.2915047 Chapter 15
10.1145/2915031.2915048 Chapter 16
10.1145/2915031.2915049 Chapter 17
10.1145/2915031.2915050 Chapter 18
10.1145/2915031.2915051 Chapter 19
10.1145/2915031.2915052 Chapter 20
10.1145/2915031.2915053 Appendices
10.1145/2915031.2915054 References
10.1145/2915031.2915055 Index

A publication in the ACM Books series, #12
Editor in Chief: M. Tamer Özsu, University of Waterloo
Area Editor: Edward A. Fox, Virginia Tech

First Edition
10 9 8 7 6 5 4 3 2 1
To Mei and Alex

To Kai