logo资料库

Data clustering theory ,algorithms and applications.pdf

第1页 / 共488页
第2页 / 共488页
第3页 / 共488页
第4页 / 共488页
第5页 / 共488页
第6页 / 共488页
第7页 / 共488页
第8页 / 共488页
资料共488页,剩余部分请下载后查看
SA20_GanMaWu fm 1.qxp 4/9/2007 9:57 AM Page i Data Clustering
SA20_GanMaWu fm 1.qxp 4/9/2007 9:57 AM Page ii ASA-SIAM Series on Statistics and Applied Probability The ASA-SIAM Series on Statistics and Applied Probability is published jointly by the American Statistical Association and the Society for Industrial and Applied Mathematics. The series consists of a broad spectrum of books on topics in statistics and applied probability. The purpose of the series is to provide inexpensive, quality publications of interest to the intersecting membership of the two societies. Editorial Board Martin T. Wells Cornell University, Editor-in-Chief H. T. Banks North Carolina State University Douglas M. Hawkins University of Minnesota Susan Holmes Stanford University Lisa LaVange University of North Carolina David Madigan Rutgers University Mark van der Laan University of California, Berkeley Gan, G., Ma, C., and Wu, J., Data Clustering: Theory, Algorithms, and Applications Hubert, L., Arabie, P., and Meulman, J., The Structural Representation of Proximity Matrices with MATLAB Nelson, P. R., Wludyka, P. S., and Copeland, K. A. F., The Analysis of Means: A Graphical Method for Comparing Means, Rates, and Proportions Burdick, R. K., Borror, C. M., and Montgomery, D. C., Design and Analysis of Gauge R&R Studies: Making Decisions with Confidence Intervals in Random and Mixed ANOVA Models Albert, J., Bennett, J., and Cochran, J. J., eds., Anthology of Statistics in Sports Smith, W. F., Experimental Design for Formulation Baglivo, J. A., Mathematica Laboratories for Mathematical Statistics: Emphasizing Simulation and Computer Intensive Methods Lee, H. K. H., Bayesian Nonparametrics via Neural Networks O’Gorman, T. W., Applied Adaptive Statistical Methods: Tests of Significance and Confidence Intervals Ross, T. J., Booker, J. M., and Parkinson, W. J., eds., Fuzzy Logic and Probability Applications: Bridging the Gap Nelson, W. B., Recurrent Events Data Analysis for Product Repairs, Disease Recurrences, and Other Applications Mason, R. L. and Young, J. C., Multivariate Statistical Process Control with Industrial Applications Smith, P. L., A Primer for Sampling Solids, Liquids, and Gases: Based on the Seven Sampling Errors of Pierre Gy Meyer, M. A. and Booker, J. M., Eliciting and Analyzing Expert Judgment: A Practical Guide Latouche, G. and Ramaswami, V., Introduction to Matrix Analytic Methods in Stochastic Modeling Peck, R., Haugh, L., and Goodman, A., Statistical Case Studies: A Collaboration Between Academe and Industry, Student Edition Peck, R., Haugh, L., and Goodman, A., Statistical Case Studies: A Collaboration Between Academe and Industry Barlow, R., Engineering Reliability Czitrom, V. and Spagon, P. D., Statistical Case Studies for Industrial Process Improvement
SA20_GanMaWu fm 1.qxp 4/9/2007 9:57 AM Page iii Data Clustering Theory, Algorithms, and Applications Guojun Gan York University Toronto, Ontario, Canada Chaoqun Ma Hunan University Changsha, Hunan, People’s Republic of China Jianhong Wu York University Toronto, Ontario, Canada Society for Industrial and Applied Mathematics Philadelphia, Pennsylvania American Statistical Association Alexandria, Virginia
SA20_GanMaWu fm 1.qxp 4/9/2007 9:57 AM Page iv The correct bibliographic citation for this book is as follows: Gan, Guojun, Chaoqun Ma, and Jianhong Wu, Data Clustering: Theory, Algorithms, and Applications, ASA-SIAM Series on Statistics and Applied Probability, SIAM, Philadelphia, ASA, Alexandria, VA, 2007. Copyright © 2007 by the American Statistical Association and the Society for Industrial and Applied Mathematics. 10 9 8 7 6 5 4 3 2 1 All rights reserved. Printed in the United States of America. No part of this book may be reproduced, stored, or transmitted in any manner without the written permission of the publisher. For information, write to the Society for Industrial and Applied Mathematics, 3600 University City Science Center, Philadelphia, PA 19104-2688. Trademarked names may be used in this book without the inclusion of a trademark symbol. These names are intended in an editorial context only; no infringement of trademark is intended. Library of Congress Cataloging-in-Publication Data Gan, Guojun, 1979- Data clustering : theory, algorithms, and applications / Guojun Gan, Chaoqun Ma, Jianhong Wu. p. cm. – (ASA-SIAM series on statistics and applied probability ; 20) Includes bibliographical references and index. ISBN: 978-0-898716-23-8 (alk. paper) 1. Cluster analysis. 2. Cluster analysis—Data processing. I. Ma, Chaoqun, Ph.D. II. Wu, Jianhong. III. Title. QA278.G355 2007 519.5’3—dc22 2007061713 is a registered trademark.
Contents List of Figures List of Tables List of Algorithms Preface xiii xv xvii xix 1 I 1 2 1.3 1.4 1.5 1.6 Clustering, Data, and Similarity Measures Data Clustering 1.1 1.2 . . . . . . . . . . . . . . . . . . Records and Attributes . . . . . Distances and Similarities Clusters, Centers, and Modes . . . . . Hard Clustering and Fuzzy Clustering . . . . . Validity Indices . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Definition of Data Clustering . . . . . The Vocabulary of Clustering . . . . . 1.2.1 1.2.2 1.2.3 1.2.4 1.2.5 Clustering Processes . . . . . Dealing with Missing Values Resources for Clustering . . . . . 1.5.1 1.5.2 1.5.3 1.5.4 1.5.5 Summary . . . . . . . . . . . . . Surveys and Reviews on Clustering . . . . Books on Clustering . . . . . Journals . . . . Conference Proceedings . . . . . Data Sets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3 3 5 5 5 6 7 8 8 . 10 . . . . 12 . . . . . . . . . 12 . . . . . . . . . . 12 . . . . . . . . . . 13 . . . . . . . . 15 . . . . . . . . . 17 . . . . 17 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Data Types 2.1 2.2 2.3 2.4 2.5 2.6 . . . . Categorical Data . . . . . Binary Data . Transaction Data . . . . . Symbolic Data . . . . . Time Series Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . v . . . . 19 . . . . 19 . . . . . . 21 . . . . 23 . . . . . 23 . . . . . . 24 . . . . . . . . 24 . . . .
vi 3 4 5 6 Scale Conversion 3.1 . . . . . . . . . . . . . . . . . . . . . . . Interval to Ordinal . . . . Interval to Nominal . . . . Ordinal to Nominal . . . . Nominal to Ordinal . . . . Ordinal to Interval . . . . Other Conversions . . . . Introduction . 3.1.1 3.1.2 3.1.3 3.1.4 3.1.5 3.1.6 Categorization of Numerical Data . . . . . 3.2.1 3.2.2 3.2.3 Summary . Direct Categorization . . . . . Cluster-based Categorization . . . . . Automatic Categorization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3.2 3.3 Contents 25 . . 25 . . . . 25 . . . . . . . . . 27 . . . . . . . . . 28 . . . . . . . . . 28 . . . . . . . . . 29 . . . . . . . . . . . . . . 29 . . . . . . . . . 30 . . . . . . . . . . 30 . . . . . . 31 . . . . . . . 37 . . . . 41 . . . . . . . . 43 . . 43 . . 46 . . . 46 . . . . 48 . 49 . . . . . . . . 51 . . . . . . . . . . Data Standardization and Transformation 4.1 4.2 . . . . . . . . . . . . Data Standardization . . . . . Data Transformation . . . . . 4.2.1 4.2.2 4.2.3 Summary . . . . . . . . . . . 4.3 Principal Component Analysis . . . . SVD . . . . . The Karhunen-Loève Transformation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Data Visualization 5.1 5.2 5.3 5.4 5.5 5.6 5.7 5.8 5.9 Sammon’s Mapping . . . . MDS . SOM . Class-preserving Projections Parallel Coordinates . . . . Tree Maps . Categorical Data Visualization . . . . . Other Visualization Techniques . . . . . Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53 . . . . . . . . . 53 . . . . . . . . . . 54 . . . . . . . . . . 56 . . . . . . . . . 59 . . . . . . . . . 60 . . . . . . . . . . . 61 . . . . . . . . . . 62 . . . . . . . . . . 65 . . . . 65 . . . . . . . . . . . . . . . . . . . . . . . . Similarity and Dissimilarity Measures 6.1 . . . . . . . . . . . . . . Preliminaries . Proximity Matrix . . . . . 6.1.1 Proximity Graph . . . . . 6.1.2 Scatter Matrix . . . . . 6.1.3 6.1.4 Covariance Matrix . . . . . Measures for Numerical Data . . . . . Euclidean Distance . . . . . 6.2.1 6.2.2 Manhattan Distance Maximum Distance . . . . . 6.2.3 Minkowski Distance . . . . . 6.2.4 6.2.5 Mahalanobis Distance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67 . . 67 . . 68 . . 69 . . . . 69 . 70 . 71 . 71 . . . . . . . . . . 71 . 72 . . . . . . . . . . 72 . . . . . . . . . 72 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6.2
Contents vii . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Average Distance . . . . . Other Distances . . . . . The Simple Matching Distance . . . . . Other Matching Coefficients . . . . . A General Similarity Coefficient . . . . A General Distance Coefficient . . . . . A Generalized Minkowski Distance . . . . . . . . . . . . . . 6.2.6 6.2.7 Measures for Categorical Data 6.3.1 6.3.2 Measures for Binary Data . . . . Measures for Mixed-type Data 6.5.1 6.5.2 6.5.3 Measures for Time Series Data . . . . 6.6.1 6.6.2 6.6.3 6.6.4 6.6.5 6.6.6 6.6.7 Other Measures 6.7.1 6.7.2 6.7.3 Similarity and Dissimilarity Measures between Clusters . . . . 6.8.1 6.8.2 6.8.3 6.8.4 6.8.5 Similarity and Dissimilarity between Variables . . . . . 6.9.1 6.9.2 6.9.3 6.9.4 Summary . . . 73 . . . 74 . . . . . . . . . . 74 . . . . . 76 . . . . . . 76 . . . . . . 77 . . . . . . . . . . 79 . . 79 . . . . . 80 . . 81 . . . 83 The Minkowski Distance . . . . . 84 Time Series Preprocessing . . . . . . . . . . . . 85 Dynamic Time Warping . . . . . . . . . . . . . 87 Measures Based on Longest Common Subsequences . . . . 88 . . . . . 90 Measures Based on Probabilistic Models . 91 Measures Based on Landmark Models . . . . . Evaluation . . . . . . 92 . . . . 92 . . . . 93 . . . 93 . . . 94 . . . . 94 . . . . . . . 94 . . . . 95 . . . . 95 . . . . 96 . 96 . . 98 . . . 98 . . . . 101 . . . 103 . . 105 . . . . . . . . 106 Pearson’s Correlation Coefficients . . . . . Measures Based on the Chi-square Statistic . . . . . Measures Based on Optimal Class Prediction . . . . . Group-based Distance . . . . The Mean-based Distance . . . . . The Nearest Neighbor Distance . . . . . The Farthest Neighbor Distance . . . . . The Average Neighbor Distance . . . . . Lance-Williams Formula . . . . The Cosine Similarity Measure A Link-based Similarity Measure Support . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6.3 6.4 6.5 6.6 6.7 6.8 6.9 6.10 II Clustering Algorithms 107 7 Hierarchical Clustering Techniques 7.1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Representations of Hierarchical Clusterings 7.1.1 7.1.2 7.1.3 7.1.4 7.1.5 7.1.6 7.1.7 n-tree . . . . . Dendrogram . . . . . Banner Pointer Representation . . . . Packed Representation . . . . Icicle Plot . . . . . . . . . . . Other Representations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109 . . . 109 . . . . 110 . . . . 110 . . . 112 . . 112 . . 114 . . 115 . . . . . . . . . 115 . . . . . . . . . . . . . . . . . .
viii 7.2 7.3 7.4 7.5 Contents . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . The Single-link Method . . . . . The Complete Link Method . . . . . The Group Average Method . . . . . The Weighted Group Average Method . . . . . The Centroid Method . . . . The Median Method . . . . Ward’s Method . . . . . Other Agglomerative Methods . . . . Agglomerative Hierarchical Methods . . . . . 7.2.1 7.2.2 7.2.3 7.2.4 7.2.5 7.2.6 7.2.7 7.2.8 Divisive Hierarchical Methods . . . . . Several Hierarchical Algorithms . . . . . 7.4.1 7.4.2 7.4.3 7.4.4 7.4.5 7.4.6 7.4.7 7.4.8 Summary . . . . . . . . 116 . . . . . . . . 118 . . . . . . 120 . . . . . . 122 . 125 . . . 126 . . . 130 . . . 132 . . . 137 . . . . . . . . . . 137 . . . . . . . . . . 138 SLINK . . . . . . . . . . . . . . 138 Single-link Algorithms Based on Minimum Spanning Trees 140 . . . . . . . . . . 141 CLINK . . . . . . . . . . . . . . 144 BIRCH . . . . . . . . . . . . . . . 144 CURE . . . . . . . . . . . . . . 145 DIANA . . . . DISMEA . . . . . . . . . . . . . 147 . . . . . . Edwards and Cavalli-Sforza Method . . . . . . . 147 . . . . . . . . . . . . 149 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Fuzzy Clustering Algorithms 8.1 8.2 8.3 8.4 8.5 8.6 Fuzzy Sets . Fuzzy Relations . . . . Fuzzy k-means . Fuzzy k-modes . The c-means Method . . . . Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8 9 151 . . . . . . . . . . . 151 . . . . . . . . . . . 153 . . . . . . . . . 154 . 156 . . . . . . . . 158 . . . . . . . . 159 . . . . 161 . . . . 161 . . . . . . . 164 . . . . . . . . . 165 . . 165 . . . . . . . 166 . . . . . . . . . Center-based Clustering Algorithms 9.1 9.2 . . . . . . . . . . . . . . . . The k-means Algorithm . . . . . Variations of the k-means Algorithm . . . . . 9.2.1 9.2.2 9.2.3 9.2.4 The Continuous k-means Algorithm . . . . The Compare-means Algorithm . . . . The Sort-means Algorithm . . . . . . . . . . . Acceleration of the k-means Algorithm with the kd-tree Other Acceleration Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9.2.5 The Trimmed k-means Algorithm . . . . . The x-means Algorithm . . . . . The k-harmonic Means Algorithm . . . . . The Mean Shift Algorithm . . . . . MEC . The k-modes Algorithm (Huang) 9.8.1 The k-modes Algorithm (Chaturvedi et al.) . . . . . . . . . Initial Modes Selection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 167 . . . . . . 168 . . . . . . . . . 169 . . . . 170 . . . . . . . . . 171 . . . 173 . . . . 175 . . . . . . . . . 176 . . . . . . . . . 178 . . 178 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9.3 9.4 9.5 9.6 9.7 9.8 9.9
分享到:
收藏