SA20_GanMaWu fm 1.qxp 4/9/2007 9:57 AM Page i
Data Clustering
SA20_GanMaWu fm 1.qxp 4/9/2007 9:57 AM Page ii
ASA-SIAM Series on
Statistics and Applied Probability
The ASA-SIAM Series on Statistics and Applied Probability is published
jointly by the American Statistical Association and the Society for Industrial and Applied Mathematics.
The series consists of a broad spectrum of books on topics in statistics and applied probability. The
purpose of the series is to provide inexpensive, quality publications of interest to the intersecting
membership of the two societies.
Editorial Board
Martin T. Wells
Cornell University, Editor-in-Chief
H. T. Banks
North Carolina State University
Douglas M. Hawkins
University of Minnesota
Susan Holmes
Stanford University
Lisa LaVange
University of North Carolina
David Madigan
Rutgers University
Mark van der Laan
University of California, Berkeley
Gan, G., Ma, C., and Wu, J., Data Clustering: Theory, Algorithms, and Applications
Hubert, L., Arabie, P., and Meulman, J., The Structural Representation of Proximity Matrices with MATLAB
Nelson, P. R., Wludyka, P. S., and Copeland, K. A. F., The Analysis of Means: A Graphical Method for
Comparing Means, Rates, and Proportions
Burdick, R. K., Borror, C. M., and Montgomery, D. C., Design and Analysis of Gauge R&R Studies: Making
Decisions with Confidence Intervals in Random and Mixed ANOVA Models
Albert, J., Bennett, J., and Cochran, J. J., eds., Anthology of Statistics in Sports
Smith, W. F., Experimental Design for Formulation
Baglivo, J. A., Mathematica Laboratories for Mathematical Statistics: Emphasizing Simulation and
Computer Intensive Methods
Lee, H. K. H., Bayesian Nonparametrics via Neural Networks
O’Gorman, T. W., Applied Adaptive Statistical Methods: Tests of Significance and Confidence Intervals
Ross, T. J., Booker, J. M., and Parkinson, W. J., eds., Fuzzy Logic and Probability Applications: Bridging the Gap
Nelson, W. B., Recurrent Events Data Analysis for Product Repairs, Disease Recurrences, and Other
Applications
Mason, R. L. and Young, J. C., Multivariate Statistical Process Control with Industrial Applications
Smith, P. L., A Primer for Sampling Solids, Liquids, and Gases: Based on the Seven Sampling Errors of
Pierre Gy
Meyer, M. A. and Booker, J. M., Eliciting and Analyzing Expert Judgment: A Practical Guide
Latouche, G. and Ramaswami, V., Introduction to Matrix Analytic Methods in Stochastic Modeling
Peck, R., Haugh, L., and Goodman, A., Statistical Case Studies: A Collaboration Between Academe and
Industry, Student Edition
Peck, R., Haugh, L., and Goodman, A., Statistical Case Studies: A Collaboration Between Academe and
Industry
Barlow, R., Engineering Reliability
Czitrom, V. and Spagon, P. D., Statistical Case Studies for Industrial Process Improvement
SA20_GanMaWu fm 1.qxp 4/9/2007 9:57 AM Page iii
Data Clustering
Theory, Algorithms,
and Applications
Guojun Gan
York University
Toronto, Ontario, Canada
Chaoqun Ma
Hunan University
Changsha, Hunan, People’s Republic of China
Jianhong Wu
York University
Toronto, Ontario, Canada
Society for Industrial and Applied Mathematics
Philadelphia, Pennsylvania
American Statistical Association
Alexandria, Virginia
SA20_GanMaWu fm 1.qxp 4/9/2007 9:57 AM Page iv
The correct bibliographic citation for this book is as follows: Gan, Guojun, Chaoqun Ma, and Jianhong
Wu, Data Clustering: Theory, Algorithms, and Applications, ASA-SIAM Series on Statistics and Applied
Probability, SIAM, Philadelphia, ASA, Alexandria, VA, 2007.
Copyright © 2007 by the American Statistical Association and the Society for Industrial and Applied
Mathematics.
10 9 8 7 6 5 4 3 2 1
All rights reserved. Printed in the United States of America. No part of this book may be reproduced,
stored, or transmitted in any manner without the written permission of the publisher. For information,
write to the Society for Industrial and Applied Mathematics, 3600 University City Science Center,
Philadelphia, PA 19104-2688.
Trademarked names may be used in this book without the inclusion of a trademark symbol. These
names are intended in an editorial context only; no infringement of trademark is intended.
Library of Congress Cataloging-in-Publication Data
Gan, Guojun, 1979-
Data clustering : theory, algorithms, and applications / Guojun Gan, Chaoqun Ma,
Jianhong Wu.
p. cm. – (ASA-SIAM series on statistics and applied probability ; 20)
Includes bibliographical references and index.
ISBN: 978-0-898716-23-8 (alk. paper)
1. Cluster analysis. 2. Cluster analysis—Data processing. I. Ma, Chaoqun, Ph.D. II.
Wu, Jianhong. III. Title.
QA278.G355 2007
519.5’3—dc22
2007061713
is a registered trademark.
Contents
List of Figures
List of Tables
List of Algorithms
Preface
xiii
xv
xvii
xix
1
I
1
2
1.3
1.4
1.5
1.6
Clustering, Data, and Similarity Measures
Data Clustering
1.1
1.2
. . . . . .
. . . . . .
. . . . . .
Records and Attributes . . . . .
Distances and Similarities
Clusters, Centers, and Modes . . . . .
Hard Clustering and Fuzzy Clustering . . . . .
Validity Indices . . . . .
. . . . . .
. . . . . .
. . . . .
. . . . . .
. . . . . . . . . .
Definition of Data Clustering . . . . .
The Vocabulary of Clustering . . . . .
1.2.1
1.2.2
1.2.3
1.2.4
1.2.5
Clustering Processes . . . . .
Dealing with Missing Values
Resources for Clustering . . . . .
1.5.1
1.5.2
1.5.3
1.5.4
1.5.5
Summary .
. . . . . . . .
.
.
.
.
Surveys and Reviews on Clustering . . . .
Books on Clustering . . . . .
Journals . . . .
Conference Proceedings . . . . .
Data Sets . . . .
. . . . . . . . .
. . . . . .
. . . . .
. . . . . . . . .
. . . . . .
. . . . .
. . . . . . . . . .
. . . . . . . . . .
. . . . . .
.
.
. . . . . . . . .
. . . . . . .
. . . . . .
.
. . .
. .
3
3
5
5
5
6
7
8
8
. 10
. . . . 12
. . . . . . . . . 12
. . . . . . . . . . 12
. . . . . . . . . . 13
. . . . . . . . 15
. . . . . . . . . 17
. . . . 17
. . . .
. . . . . . . . . .
. . . . . .
. . . . .
. . . . . .
. . . . . . . . . .
. . . . . .
. . . . . . . . . .
. . . . .
. . . . . . . . . . .
Data Types
2.1
2.2
2.3
2.4
2.5
2.6
. . . .
Categorical Data . . . . .
Binary Data .
Transaction Data . . . . .
Symbolic Data . . . . .
Time Series
Summary .
. . . .
.
.
.
.
.
. . . . . .
. . . . . .
. . . . . . . . . .
. . . . . .
. . . . . . . . . .
. . . .
. . . . . .
. . . . . . . . . .
. . . . . .
. . . . . . . . . .
. . . .
. . . . . .
. . . . . . . . . .
. . . .
. . . . . . . . . .
. . . .
v
. . . .
19
. . . . 19
. . . . . . 21
. . . . 23
. . . . . 23
. . . . . . 24
. . . . . . . . 24
. . . .
vi
3
4
5
6
Scale Conversion
3.1
. . . . . . . . . . .
. . .
. . . . . . . . .
Interval to Ordinal
. . . .
Interval to Nominal . . . .
Ordinal to Nominal . . . .
Nominal to Ordinal . . . .
Ordinal to Interval
. . . .
Other Conversions . . . .
Introduction .
3.1.1
3.1.2
3.1.3
3.1.4
3.1.5
3.1.6
Categorization of Numerical Data . . . . .
3.2.1
3.2.2
3.2.3
Summary .
Direct Categorization . . . . .
Cluster-based Categorization . . . . .
Automatic Categorization . . . . .
. . . . .
. . . . . . . . .
. . . . . . . . .
. . . . . . . . .
. . . . . . . . .
. . . . . . . . .
. . . . . . . . .
. . . . . .
. . . . . . . . .
. . . . . .
. . . . .
.
.
.
. . . . . .
. . . . . .
. . . . . . . . . . .
3.2
3.3
Contents
25
. . 25
. . . . 25
. . . . .
. . . . 27
. . . . .
. . . . 28
. . . . .
. . . . 28
. . . . .
. . . . 29
. . . . .
. . . . .
. . . . 29
. . . . . . . . . 30
. . . . . . . . . . 30
. . . . . . 31
. . . . . . . 37
. . . . 41
. . . .
. . . .
43
. . 43
. . 46
. . . 46
. . . . 48
. 49
. . . . . . . . 51
. . . . . .
. . . .
Data Standardization and Transformation
4.1
4.2
. . . . . .
. . . . . .
Data Standardization . . . . .
Data Transformation . . . . .
4.2.1
4.2.2
4.2.3
Summary .
. . . . . .
. .
.
.
4.3
Principal Component Analysis . . . .
SVD . . . . .
The Karhunen-Loève Transformation . . . . .
. . . . . . . . . .
. . . . . .
. . . . . . . . .
. . . . . . . . . .
. . . .
. . . . . . . . . .
. . . . . . . . . .
.
.
.
.
.
.
.
.
. . . . .
. . . . .
. . . . . . . . .
. . . . . . . . . . .
. . . . . . . . . . .
Data Visualization
5.1
5.2
5.3
5.4
5.5
5.6
5.7
5.8
5.9
Sammon’s Mapping . . . .
MDS .
SOM .
Class-preserving Projections
Parallel Coordinates . . . .
Tree Maps .
Categorical Data Visualization . . . . .
Other Visualization Techniques . . . . .
Summary .
. . . . . . . . .
. . . . . . . . .
. . . . . . . . .
. . . . . .
. . . . .
. . . .
.
.
.
.
. . . . . . . . .
. . . . .
. . . .
. . . . . .
. . . . . .
53
. . . . . . . . . 53
. . . . . . . . . . 54
. . . . . . . . . . 56
. . . . .
. . . . 59
. . . . . . . . . 60
. . . . . . . . . . . 61
. . . . . . . . . . 62
. . . . . . . . . . 65
. . . . 65
. . . . .
. . . .
. . . .
. . . . . . . . . . .
Similarity and Dissimilarity Measures
6.1
. . . . . . .
.
. . . . . .
Preliminaries .
Proximity Matrix . . . . .
6.1.1
Proximity Graph . . . . .
6.1.2
Scatter Matrix . . . . .
6.1.3
6.1.4
Covariance Matrix . . . . .
Measures for Numerical Data . . . . .
Euclidean Distance . . . . .
6.2.1
6.2.2
Manhattan Distance
Maximum Distance . . . . .
6.2.3
Minkowski Distance . . . . .
6.2.4
6.2.5
Mahalanobis Distance . . . . .
. . . . .
. . . . . .
. . . . . .
. . . . . . . . . .
. . . .
. . . . . . . . . .
. . . . . . . . . .
67
. . 67
. . 68
. . 69
. . . . 69
. 70
. 71
. 71
. . . . . . . . . . 71
. 72
. . . . . . . . . . 72
. . . . . . . . . 72
. . . . . . . . . .
. . . . . . . . . .
. . . . . . . . . .
. . . . . . . . . .
. . . . . .
. . . . . .
. . . . . .
. . . . . .
. . . . . .
. . . . . .
. . . . . .
. . . . . .
. . . . . . . . . .
6.2
Contents
vii
. . . . . .
. . . . . . . . .
. . . . . .
. . . . . .
. . . . .
. . . . . .
. . . . .
. . . . . .
. . . . . .
. . . . .
. . . . . . . . .
. . . . .
. . . . . .
. . . . . . . . . .
. . . . . .
. . . . . . . . . .
. . . . . .
. . . . .
. . . . . . . . .
. . . . . .
Average Distance . . . . .
Other Distances . . . . .
The Simple Matching Distance . . . . .
Other Matching Coefficients . . . . .
A General Similarity Coefficient . . . .
A General Distance Coefficient . . . . .
A Generalized Minkowski Distance . . . . .
. . . . . . . . .
6.2.6
6.2.7
Measures for Categorical Data
6.3.1
6.3.2
Measures for Binary Data . . . .
Measures for Mixed-type Data
6.5.1
6.5.2
6.5.3
Measures for Time Series Data . . . .
6.6.1
6.6.2
6.6.3
6.6.4
6.6.5
6.6.6
6.6.7
Other Measures
6.7.1
6.7.2
6.7.3
Similarity and Dissimilarity Measures between Clusters . . . .
6.8.1
6.8.2
6.8.3
6.8.4
6.8.5
Similarity and Dissimilarity between Variables . . . . .
6.9.1
6.9.2
6.9.3
6.9.4
Summary .
. . 73
. . . 74
. . . . . . . . . . 74
. . . . . 76
. . . . . . 76
. . . . . . 77
. . . . . . . . . . 79
. . 79
. . . . . 80
. . 81
. . . 83
The Minkowski Distance . . . .
. 84
Time Series Preprocessing . . . . .
. . . . . . . 85
Dynamic Time Warping . . . . .
. . . . . . . . 87
Measures Based on Longest Common Subsequences . . . . 88
. . . . . 90
Measures Based on Probabilistic Models
. 91
Measures Based on Landmark Models . . . . .
Evaluation . . . . .
. 92
. . . . 92
. . . . 93
. . . 93
. . . 94
. . . . 94
. . . . . . . 94
. . . . 95
. . . . 95
. . . . 96
. 96
. . 98
. . . 98
. . . . 101
. . . 103
. . 105
. . . . . . . . 106
Pearson’s Correlation Coefficients . . . . .
Measures Based on the Chi-square Statistic . . . . .
Measures Based on Optimal Class Prediction . . . . .
Group-based Distance . . . .
The Mean-based Distance . . . . .
The Nearest Neighbor Distance . . . . .
The Farthest Neighbor Distance . . . . .
The Average Neighbor Distance . . . . .
Lance-Williams Formula . . . .
The Cosine Similarity Measure
A Link-based Similarity Measure
Support
. . . . . .
. . . . . .
. . . . . .
. . . .
. . . . . .
. . . . . .
. . . .
. . . . . .
. . . .
. . . . . .
. . . . . . . . . .
. . . . . .
. . . . . . . . . .
. . . . . .
. . . . . . . . . .
. . . . . . . . .
. . . . .
. . . . . . . . .
. . . . .
. . . . . .
. . . . . . . . . .
. . . . .
. . . . .
. . . . . .
. . . . . .
. . . . . .
. . . . .
. . . . .
.
.
.
.
. . . .
. . . . .
6.3
6.4
6.5
6.6
6.7
6.8
6.9
6.10
II Clustering Algorithms
107
7
Hierarchical Clustering Techniques
7.1
. . . . .
. . . . . . . . . .
. . . .
. . . . . . . . . .
. . . . . .
. . . . . .
Representations of Hierarchical Clusterings
7.1.1
7.1.2
7.1.3
7.1.4
7.1.5
7.1.6
7.1.7
n-tree . . . . .
Dendrogram . . . . .
Banner
Pointer Representation . . . .
Packed Representation . . . .
Icicle Plot . . . . .
. . . . . .
Other Representations
. . . . . .
. . . . .
. . . . .
. . . . . . . . . .
. . . . . . . . .
. . . . . . . . .
. . . . . . . . . .
. . . . . .
. . . . . .
109
. . . 109
. . . . 110
. . . . 110
. . . 112
. . 112
. . 114
. . 115
. . . . . . . . . 115
. . . .
. . . . .
. . . . .
. . . .
viii
7.2
7.3
7.4
7.5
Contents
. . . . . .
. . . . . .
. . . . . .
. . . . . .
. . . . . .
. . . . . .
. . . . . . . . .
. . . . . . . . .
. . . . .
. . . . .
. . . . . . . . . .
. . . . . . . . .
The Single-link Method . . . . .
The Complete Link Method . . . . .
The Group Average Method . . . . .
The Weighted Group Average Method . . . . .
The Centroid Method . . . .
The Median Method . . . .
Ward’s Method . . . . .
Other Agglomerative Methods . . . .
Agglomerative Hierarchical Methods . . . . .
7.2.1
7.2.2
7.2.3
7.2.4
7.2.5
7.2.6
7.2.7
7.2.8
Divisive Hierarchical Methods
. . . . .
Several Hierarchical Algorithms . . . . .
7.4.1
7.4.2
7.4.3
7.4.4
7.4.5
7.4.6
7.4.7
7.4.8
Summary .
. . . . . . . 116
. . . . . . . . 118
. . . . . . 120
. . . . . . 122
. 125
. . . 126
. . . 130
. . . 132
. . . 137
. . . . . . . . . . 137
. . . . . . . . . . 138
SLINK . . . .
. . . . . . . . . . 138
Single-link Algorithms Based on Minimum Spanning Trees 140
. . . . . . . . . . 141
CLINK . . . .
. . . . . . . . . . 144
BIRCH . . . .
. . . . . . . . . . . 144
CURE . . . .
. . . . . . . . . . 145
DIANA . . . .
DISMEA . . . .
. . . . . . . . . 147
. . . . . .
Edwards and Cavalli-Sforza Method . . . . .
. . 147
. . . .
. . . . . . . . 149
. . . . . . . . .
. . . . . . . . .
. . . . . .
. . . . . .
. . . . .
. . . . . . . . . .
. . . . .
. . . . .
. . . . . . . . .
. . . . . . . . .
. . . . . . . . .
. . . . . . . . .
. . . . . .
. . . . .
. . . . .
. . . . .
. .
.
.
.
. . . . . .
Fuzzy Clustering Algorithms
8.1
8.2
8.3
8.4
8.5
8.6
Fuzzy Sets .
Fuzzy Relations . . . .
Fuzzy k-means .
Fuzzy k-modes .
The c-means Method . . . .
Summary .
. . . . . .
. . . . . . .
. . . . . . . .
. .
.
.
. . . . . . . . .
. . . . . . . . .
. . . .
. . . . .
. . . . . . . . .
. . . .
. . . . . .
. . . . . . . . . .
. . . . . . . . .
. . . . . . . . . .
. . . . .
. . . .
8
9
151
. . . . . . . . . . . 151
. . . . . . . . . . . 153
. . . . . . . . . 154
. 156
. . . . . . . . 158
. . . . . . . . 159
. . . .
161
. . . . 161
. . . . . . . 164
. . . . . . . . . 165
. . 165
. . . . . . . 166
. . . . . . . . .
Center-based Clustering Algorithms
9.1
9.2
. . . . . . . . . .
. . . . . .
The k-means Algorithm . . . . .
Variations of the k-means Algorithm . . . . .
9.2.1
9.2.2
9.2.3
9.2.4
The Continuous k-means Algorithm . . . .
The Compare-means Algorithm . . . .
The Sort-means Algorithm . . . . .
. . . . . .
Acceleration of the k-means Algorithm with the
kd-tree
Other Acceleration Methods . . . . .
. . . . . . . . .
. . . . . .
. . . . .
. . . .
. . . . . .
. . . . . .
9.2.5
The Trimmed k-means Algorithm . . . . .
The x-means Algorithm . . . . .
The k-harmonic Means Algorithm . . . . .
The Mean Shift Algorithm . . . . .
MEC .
The k-modes Algorithm (Huang)
9.8.1
The k-modes Algorithm (Chaturvedi et al.) . . . .
. . . . .
Initial Modes Selection . . . . .
. . . . . . . . . .
. . . . . .
. . . .
. .
.
.
.
.
. . . . . .
. . . . . .
. . . . . .
. . . . . .
. . . . . . . . . . 167
. . . . . . 168
. . . . . . . . . 169
. . . . 170
. . . . . . . . . 171
. . . 173
. . . . 175
. . . . . . . . . 176
. . . . . . . . . 178
. . 178
. . . . . . . . .
. . . . . . . . . .
. . . . . . . . . .
. . . . . . . . . .
9.3
9.4
9.5
9.6
9.7
9.8
9.9