Text Data Management and Analysis
A Practical Introduction to Information Retrieval and Text Mining
ChengXiang Zhai and Sean Massung
Recent years have seen a dramatic growth of natural language text data, including web pages,
news articles, scientific literature, emails, enterprise documents, and social media such as
blog articles, forum posts, product reviews, and tweets. This has led to an increasing demand
for powerful software tools to help people manage and analyze vast amounts of text data
effectively and efficiently. Unlike data generated by a computer system or sensors, text data are
usually generated directly by humans, and capture semantically rich content. As such, text
data are especially valuable for discovering knowledge about human opinions and preferences,
in addition to many other kinds of knowledge that we encode in text. In contrast to structured
data, which conform to well-defined schemas (and thus are relatively easy for computers to
handle), text has less explicit structure, so computers must process it to understand
the content it encodes. The current technology of natural language processing has
not yet reached the point where a computer can precisely understand natural language text,
but a wide range of statistical and heuristic approaches to management and analysis of text
data have been developed over the past few decades. They are usually very robust and can be
applied to analyze and manage text data in any natural language, and about any topic.
This book provides a systematic introduction to many of these approaches, with an emphasis
on covering the most useful knowledge and skills required to build a variety of practically
useful text information systems. Because humans can understand natural languages
far better than computers can, effective involvement of humans in a text information system
is generally needed, and text information systems often serve as intelligent assistants for
humans. Depending on how a text information system collaborates with humans, we distinguish
two kinds of text information systems. The first is information retrieval systems, which include
search engines and recommender systems; they assist users in finding from a large collection
of text data the most relevant text data that are actually needed for solving a specific
application problem, thus effectively turning big raw text data into much smaller relevant text data
that can be more easily processed by humans. The second is text mining application systems;
they can assist users in analyzing patterns in text data to extract and discover useful, actionable
knowledge directly useful for task completion or decision making, thus providing more
direct task support for users.
ABOUT ACM BOOKS
ACM Books is a new series of high-quality books for
the computer science community, published by ACM
in collaboration with Morgan & Claypool Publishers.
ACM Books publications are widely distributed in
both print and digital formats through booksellers
and to libraries (and library consortia) and individual ACM members via the ACM
Digital Library platform.
ISBN: 978-1-97000-116-7
books.acm.org • www.morganclaypool.com
ACM Books
Editor in Chief
M. Tamer Özsu, University of Waterloo
Text Data Management and Analysis: A Practical Introduction to Information
Retrieval and Text Mining
ChengXiang Zhai, University of Illinois at Urbana–Champaign
Sean Massung, University of Illinois at Urbana–Champaign
2016
An Architecture for Fast and General Data Processing on Large Clusters
Matei Zaharia, Massachusetts Institute of Technology
2016
Reactive Internet Programming: State Chart XML in Action
Franck Barbier, University of Pau, France
2016
Verified Functional Programming in Agda
Aaron Stump, The University of Iowa
2016
The VR Book: Human-Centered Design for Virtual Reality
Jason Jerald, NextGen Interactions
2016
Ada’s Legacy: Cultures of Computing from the Victorian to the Digital Age
Robin Hammerman, Stevens Institute of Technology
Andrew L. Russell, Stevens Institute of Technology
2016
Edmund Berkeley and the Social Responsibility of Computer Professionals
Bernadette Longo, New Jersey Institute of Technology
2015
Candidate Multilinear Maps
Sanjam Garg, University of California, Berkeley
2015
Smarter than Their Machines: Oral Histories of Pioneers in Interactive Computing
John Cullinane, Northeastern University; Mossavar-Rahmani Center for Business
and Government, John F. Kennedy School of Government, Harvard University
2015
A Framework for Scientific Discovery through Video Games
Seth Cooper, University of Washington
2014
Trust Extension as a Mechanism for Secure Code Execution on Commodity
Computers
Bryan Jeffrey Parno, Microsoft Research
2014
Embracing Interference in Wireless Systems
Shyamnath Gollakota, University of Washington
2014
ACM Books #12
Copyright © 2016 by the Association for Computing Machinery
and Morgan & Claypool Publishers
All rights reserved. No part of this publication may be reproduced, stored in a retrieval
system, or transmitted in any form or by any means—electronic, mechanical, photocopy,
recording, or any other except for brief quotations in printed reviews—without the prior
permission of the publisher.
Designations used by companies to distinguish their products are often claimed as
trademarks or registered trademarks. In all instances in which Morgan & Claypool is aware
of a claim, the product names appear in initial capital or all capital letters. Readers, however,
should contact the appropriate companies for more complete information regarding
trademarks and registration.
Text Data Management and Analysis
ChengXiang Zhai and Sean Massung
books.acm.org
www.morganclaypoolpublishers.com
ISBN: 978-1-97000-119-8 hardcover
ISBN: 978-1-97000-116-7 paperback
ISBN: 978-1-97000-117-4 ebook
ISBN: 978-1-97000-118-1 ePub
Series ISSN: 2374-6769 print 2374-6777 electronic
DOIs:
10.1145/2915031 Book
10.1145/2915031.2915032 Preface
10.1145/2915031.2915033 Chapter 1
10.1145/2915031.2915034 Chapter 2
10.1145/2915031.2915035 Chapter 3
10.1145/2915031.2915036 Chapter 4
10.1145/2915031.2915037 Chapter 5
10.1145/2915031.2915038 Chapter 6
10.1145/2915031.2915039 Chapter 7
10.1145/2915031.2915040 Chapter 8
10.1145/2915031.2915041 Chapter 9
10.1145/2915031.2915042 Chapter 10
10.1145/2915031.2915043 Chapter 11
10.1145/2915031.2915044 Chapter 12
10.1145/2915031.2915045 Chapter 13
10.1145/2915031.2915046 Chapter 14
10.1145/2915031.2915047 Chapter 15
10.1145/2915031.2915048 Chapter 16
10.1145/2915031.2915049 Chapter 17
10.1145/2915031.2915050 Chapter 18
10.1145/2915031.2915051 Chapter 19
10.1145/2915031.2915052 Chapter 20
10.1145/2915031.2915053 Appendices
10.1145/2915031.2915054 References
10.1145/2915031.2915055 Index
A publication in the ACM Books series, #12
Editor in Chief: M. Tamer Özsu, University of Waterloo
Area Editor: Edward A. Fox, Virginia Tech
First Edition
To Mei and Alex
To Kai