Chapman & Hall/CRC
Data Mining and Knowledge Discovery Series
DATA CLUSTERING
Algorithms and Applications
Research on the problem of clustering tends to be fragmented across the
pattern recognition, database, data mining, and machine learning communities.
Addressing this problem in a unified way, Data Clustering: Algorithms and
Applications provides complete coverage of the entire area of clustering, from
basic methods to more refined and complex data clustering approaches. It
pays special attention to recent issues in graphs, social networks, and other
domains.
The book focuses on three primary aspects of data clustering:
• Methods, describing key techniques commonly used for clustering, such
as feature selection, agglomerative clustering, partitional clustering,
density-based clustering, probabilistic clustering, grid-based clustering,
spectral clustering, and nonnegative matrix factorization
• Domains, covering methods used for different domains of data, such as
categorical data, text data, multimedia data, graph data, biological data,
stream data, uncertain data, time series clustering, high-dimensional
clustering, and big data
• Variations and Insights, discussing important variations of the clustering
process, such as semisupervised clustering, interactive clustering,
multiview clustering, cluster ensembles, and cluster validation
In this book, top researchers from around the world explore the characteristics
of clustering problems in a variety of application areas. They also explain how
to glean detailed insight from the clustering process—including how to verify
the quality of the underlying clusters—through supervision, human intervention,
or the automated generation of alternative clusters.
DATA CLUSTERING
Algorithms and Applications
© 2014 by Taylor & Francis Group, LLC
Chapman & Hall/CRC
Data Mining and Knowledge Discovery Series

SERIES EDITOR
Vipin Kumar
University of Minnesota
Department of Computer Science and Engineering
Minneapolis, Minnesota, U.S.A.

AIMS AND SCOPE
This series aims to capture new developments and applications in data mining and knowledge discovery, while summarizing the computational tools and techniques useful in data analysis. This series encourages the integration of mathematical, statistical, and computational methods and techniques through the publication of a broad range of textbooks, reference works, and handbooks. The inclusion of concrete examples and applications is highly encouraged. The scope of the series includes, but is not limited to, titles in the areas of data mining and knowledge discovery methods and applications, modeling, algorithms, theory and foundations, data and knowledge visualization, data mining systems and tools, and privacy and security issues.

PUBLISHED TITLES
ADVANCES IN MACHINE LEARNING AND DATA MINING FOR ASTRONOMY
Michael J. Way, Jeffrey D. Scargle, Kamal M. Ali, and Ashok N. Srivastava
BIOLOGICAL DATA MINING
Jake Y. Chen and Stefano Lonardi
COMPUTATIONAL INTELLIGENT DATA ANALYSIS FOR SUSTAINABLE DEVELOPMENT
Ting Yu, Nitesh V. Chawla, and Simeon Simoff
COMPUTATIONAL METHODS OF FEATURE SELECTION
Huan Liu and Hiroshi Motoda
CONSTRAINED CLUSTERING: ADVANCES IN ALGORITHMS, THEORY, AND APPLICATIONS
Sugato Basu, Ian Davidson, and Kiri L. Wagstaff
CONTRAST DATA MINING: CONCEPTS, ALGORITHMS, AND APPLICATIONS
Guozhu Dong and James Bailey
DATA CLUSTERING: ALGORITHMS AND APPLICATIONS
Charu C. Aggarwal and Chandan K. Reddy
DATA CLUSTERING IN C++: AN OBJECT-ORIENTED APPROACH
Guojun Gan
DATA MINING FOR DESIGN AND MARKETING
Yukio Ohsawa and Katsutoshi Yada
DATA MINING WITH R: LEARNING WITH CASE STUDIES
Luís Torgo
FOUNDATIONS OF PREDICTIVE ANALYTICS
James Wu and Stephen Coggeshall
GEOGRAPHIC DATA MINING AND KNOWLEDGE DISCOVERY, SECOND EDITION
Harvey J. Miller and Jiawei Han
HANDBOOK OF EDUCATIONAL DATA MINING
Cristóbal Romero, Sebastian Ventura, Mykola Pechenizkiy, and Ryan S.J.d. Baker
INFORMATION DISCOVERY ON ELECTRONIC HEALTH RECORDS
Vagelis Hristidis
INTELLIGENT TECHNOLOGIES FOR WEB APPLICATIONS
Priti Srinivas Sajja and Rajendra Akerkar
INTRODUCTION TO PRIVACY-PRESERVING DATA PUBLISHING: CONCEPTS AND TECHNIQUES
Benjamin C. M. Fung, Ke Wang, Ada Wai-Chee Fu, and Philip S. Yu
KNOWLEDGE DISCOVERY FOR COUNTERTERRORISM AND LAW ENFORCEMENT
David Skillicorn
KNOWLEDGE DISCOVERY FROM DATA STREAMS
João Gama
MACHINE LEARNING AND KNOWLEDGE DISCOVERY FOR ENGINEERING SYSTEMS HEALTH MANAGEMENT
Ashok N. Srivastava and Jiawei Han
MINING SOFTWARE SPECIFICATIONS: METHODOLOGIES AND APPLICATIONS
David Lo, Siau-Cheng Khoo, Jiawei Han, and Chao Liu
MULTIMEDIA DATA MINING: A SYSTEMATIC INTRODUCTION TO CONCEPTS AND THEORY
Zhongfei Zhang and Ruofei Zhang
MUSIC DATA MINING
Tao Li, Mitsunori Ogihara, and George Tzanetakis
NEXT GENERATION OF DATA MINING
Hillol Kargupta, Jiawei Han, Philip S. Yu, Rajeev Motwani, and Vipin Kumar
PRACTICAL GRAPH MINING WITH R
Nagiza F. Samatova, William Hendrix, John Jenkins, Kanchana Padmanabhan, and Arpan Chakraborty
RELATIONAL DATA CLUSTERING: MODELS, ALGORITHMS, AND APPLICATIONS
Bo Long, Zhongfei Zhang, and Philip S. Yu
SERVICE-ORIENTED DISTRIBUTED KNOWLEDGE DISCOVERY
Domenico Talia and Paolo Trunfio
SPECTRAL FEATURE SELECTION FOR DATA MINING
Zheng Alan Zhao and Huan Liu
STATISTICAL DATA MINING USING SAS APPLICATIONS, SECOND EDITION
George Fernandez
SUPPORT VECTOR MACHINES: OPTIMIZATION BASED THEORY, ALGORITHMS, AND EXTENSIONS
Naiyang Deng, Yingjie Tian, and Chunhua Zhang
TEMPORAL DATA MINING
Theophano Mitsa
TEXT MINING: CLASSIFICATION, CLUSTERING, AND APPLICATIONS
Ashok N. Srivastava and Mehran Sahami
THE TOP TEN ALGORITHMS IN DATA MINING
Xindong Wu and Vipin Kumar
UNDERSTANDING COMPLEX DATASETS: DATA MINING WITH MATRIX DECOMPOSITIONS
David Skillicorn
DATA CLUSTERING
Algorithms and Applications

Edited by
Charu C. Aggarwal
Chandan K. Reddy
CRC Press
Taylor & Francis Group
6000 Broken Sound Parkway NW, Suite 300
Boca Raton, FL 33487-2742
© 2014 by Taylor & Francis Group, LLC
CRC Press is an imprint of Taylor & Francis Group, an Informa business
No claim to original U.S. Government works
Version Date: 20130508
International Standard Book Number-13: 978-1-4665-5822-9 (eBook - PDF)
This book contains information obtained from authentic and highly regarded sources. Reasonable efforts have been
made to publish reliable data and information, but the author and publisher cannot assume responsibility for the
validity of all materials or the consequences of their use. The authors and publishers have attempted to trace the
copyright holders of all material reproduced in this publication and apologize to copyright holders if permission to
publish in this form has not been obtained. If any copyright material has not been acknowledged please write and let
us know so we may rectify in any future reprint.
Except as permitted under U.S. Copyright Law, no part of this book may be reprinted, reproduced, transmitted, or
utilized in any form by any electronic, mechanical, or other means, now known or hereafter invented, including
photocopying, microfilming, and recording, or in any information storage or retrieval system, without written
permission from the publishers.
For permission to photocopy or use material electronically from this work, please access www.copyright.com
(http://www.copyright.com/) or contact the Copyright Clearance Center, Inc. (CCC), 222 Rosewood Drive, Danvers,
MA 01923, 978-750-8400. CCC is a not-for-profit organization that provides licenses and registration for a variety
of users. For organizations that have been granted a photocopy license by the CCC, a separate system of payment has
been arranged.
Trademark Notice: Product or corporate names may be trademarks or registered trademarks, and are used only for
identification and explanation without intent to infringe.
Visit the Taylor & Francis Web site at
http://www.taylorandfrancis.com
and the CRC Press Web site at
http://www.crcpress.com
Contents

Preface
Editor Biographies
Contributors

1 An Introduction to Cluster Analysis
Charu C. Aggarwal
    1.1 Introduction
    1.2 Common Techniques Used in Cluster Analysis
        1.2.1 Feature Selection Methods
        1.2.2 Probabilistic and Generative Models
        1.2.3 Distance-Based Algorithms
        1.2.4 Density- and Grid-Based Methods
        1.2.5 Leveraging Dimensionality Reduction Methods
            1.2.5.1 Generative Models for Dimensionality Reduction
            1.2.5.2 Matrix Factorization and Co-Clustering
            1.2.5.3 Spectral Methods
        1.2.6 The High Dimensional Scenario
        1.2.7 Scalable Techniques for Cluster Analysis
            1.2.7.1 I/O Issues in Database Management
            1.2.7.2 Streaming Algorithms
            1.2.7.3 The Big Data Framework
    1.3 Data Types Studied in Cluster Analysis
        1.3.1 Clustering Categorical Data
        1.3.2 Clustering Text Data
        1.3.3 Clustering Multimedia Data
        1.3.4 Clustering Time-Series Data
        1.3.5 Clustering Discrete Sequences
        1.3.6 Clustering Network Data
        1.3.7 Clustering Uncertain Data
    1.4 Insights Gained from Different Variations of Cluster Analysis
        1.4.1 Visual Insights
        1.4.2 Supervised Insights
        1.4.3 Multiview and Ensemble-Based Insights
        1.4.4 Validation-Based Insights
    1.5 Discussion and Conclusions