gCLUTO – An Interactive Clustering, Visualization, and Analysis System*

Matt Rasmussen
University of Minnesota, Department of Computer Science and Engineering
Minneapolis, MN 55455
mrasmus@cs.umn.edu

George Karypis
University of Minnesota, Department of Computer Science and Engineering,
Digital Technology Center, and Army HPC Research Center
Minneapolis, MN 55455
karypis@cs.umn.edu

CSE/UMN Technical Report: TR# 04-021

ABSTRACT

Clustering algorithms are exploratory data analysis tools that have proved to be essential for gaining valuable insights on various aspects and relationships of the underlying systems. In this paper we present gCLUTO, a stand-alone clustering software package which serves as an easy-to-use platform that combines clustering algorithms with a number of analysis, reporting, and visualization tools to aid in the interactive exploration and clustering-driven analysis of large datasets. gCLUTO provides a wide range of algorithms that operate either directly on the original feature-based representation of the objects or on object-to-object similarity graphs, and that are capable of analyzing different types of datasets and finding clusters with different characteristics. In addition, gCLUTO implements a project-oriented work-flow that eases the process of data analysis.

Keywords

Cluster analysis, interactive data exploration, information visualization

1. INTRODUCTION

Due to recent advances in information technology, there has been an enormous growth in the amount of data generated in fields ranging from business to the sciences. This data has the potential to greatly benefit its owners by increasing their understanding of their own work. However, the growing size and complexity of data has introduced new challenges in analyzing and extracting its meaning. As a result, data mining methods, which employ a variety of computational techniques, are becoming increasingly important as the only viable approach to efficiently and effectively extract new information from these large datasets.

One technique in particular, clustering, has been successful in a wide range of knowledge discovery applications. Clustering is an exploratory data analysis process [47, 7] that groups a set of data items such that the intra-cluster similarity is maximized and the inter-cluster similarity is minimized. The discovered clusters can be used to explain the characteristics of the underlying data distribution, provide insights on various aspects and relationships of the underlying systems, and serve as the foundation for other data mining and analysis tasks.

The topic of clustering has been extensively studied in many scientific disciplines, and over the years a variety of different algorithms have been developed [32, 6, 26, 20, 35, 4, 53, 12, 52, 15, 16, 23, 9, 25]. Two relatively recent surveys on the topic [21, 17] offer a comprehensive summary of the different applications and algorithms. This body of work provides a wide variety of techniques that are suited for clustering different types of datasets and finding clusters that satisfy different properties.

*This work was supported in part by NSF CCR-9972519, EIA-9986042, ACI-9982274, ACI-0133464, and ACI-0312828; the Digital Technology Center at the University of Minnesota; and by the Army High Performance Computing Research Center (AHPCRC) under the auspices of the Department of the Army, Army Research Laboratory (ARL) under Cooperative Agreement number DAAD19-01-2-0014. Its content does not necessarily reflect the position or the policy of the government, and no official endorsement should be inferred. Access to research and computing facilities was provided by the Digital Technology Center and the Minnesota Supercomputing Institute.
However, along with increasing choices of algorithms and parameters has come increasing complexity of the data analysis process. Without proper selection of a clustering algorithm or parameters, important structures within the data may be overlooked. Therefore, there exists a need for a cluster analysis workbench: a single interactive work space where many clustering algorithms, visualizations, and analysis tools are available to use in a way that allows users to efficiently explore the best method of analyzing their data.

In this paper, we introduce gCLUTO, a stand-alone clustering software package designed to ease the use of clustering algorithms and their results. gCLUTO offers improvements over existing tools by providing features that make clustering practical for a wide variety of applications. First, gCLUTO provides an array of clustering algorithms and options through the use of the CLUTO clustering library [25]. The CLUTO library provides highly optimized implementations of agglomerative, k-means, and graph clustering, especially in the context of sparse high-dimensional data. Second, gCLUTO helps the user sort through the algorithm options and resulting data files by providing an intuitive graphical interface. Following the paradigm of most development tools, gCLUTO uses the concept of a "project" to organize the user's various datasets, clustering solutions, and visualizations. Third, gCLUTO provides both standard statistics and unique visualizations for interpreting clustering results. Given the wide range of options and factors that are involved in clustering, users should carefully analyze their results and compare them with results generated under differing options. Therefore, additional effort has gone into visualizations that can facilitate such analysis and comparisons. Lastly, the software package has been designed to suit a wide range of users from different problem domains who may or may not be knowledgeable about the subtleties of clustering. Accordingly, reasonable defaults are available for users who want quick results, while power users have access to all configuration options involved in clustering. Most importantly, these options and analysis tools are tightly integrated into a single interactive workbench that can serve as the user's main work space for carrying out analysis.

The rest of this paper is organized as follows. Section 2 describes some of the existing work that is most closely related to the topic of this paper. Section 3 provides a detailed description of gCLUTO by briefly describing the various clustering algorithms that it incorporates, its project-oriented work-flow, and its various visualization and analysis capabilities. Finally, Section 4 provides some concluding remarks and describes our ongoing and future efforts on enhancing its capabilities and applications.

2. RELATED WORK

There are a number of commercial tools that integrate clustering methods in an easy-to-use graphical and interactive environment.
Many of these tools, which include SAS's Enterprise Miner, SPSS's Clementine, and IBM's Intelligent Miner, are available within the context of a larger application suite focused on providing an integrated environment for data mining and analysis. Even though these tools are highly integrated, have advanced data import/export capabilities, and provide a variety of information visualization techniques, they tend to offer only a small number of different clustering algorithms, limiting both the types of analysis that can be performed and the types of datasets that can be analyzed.

In recent years, within the context of research in the life sciences, cluster analysis has emerged as an essential tool for analyzing gene expression datasets derived from microarrays and DNA chips [13, 50, 41, 55]. This has led to the development of a number of commercial and publicly available integrated tools for performing cluster analysis on this type of dataset. Among these are SpotFire Decision Site [46], the Rosetta Resolver [37], Expressionist [14], Cluster and TreeView [11], GeneCluster [5], BioConductor [10], and TM4 [38], as well as others. Most of these tools are highly sophisticated, offer a variety of different clustering algorithms, and have extensive visualization capabilities. However, since they are designed to analyze gene expression datasets, they can only operate on datasets whose objects are represented with relatively low-dimensional dense feature vectors, and cannot operate on (or scale to) large, high-dimensional sparse datasets (e.g., datasets derived from customer transactions or document collections).

There are also a number of systems that are specifically designed for the interactive exploration and visualization of document collections and query results, and that use document clustering and cluster visualization as mechanisms to facilitate this type of analysis [3, 18, 8, 33, 49, 42, 34]. The primary goal of these systems is to aid the user in navigating through the document collection, and to that extent they provide a number of sophisticated visualizations that help the user quickly focus on the information that he or she wants. However, they are not designed to be general-purpose cluster analysis tools, as they only include a small number of different clustering algorithms and are focused on a specific task.
Finally, a number of different techniques have been developed for visualizing clustering solutions. This research falls within the broad area of information visualization, and its goal is to provide the user with an intuitive representation of the underlying dataset by showing/highlighting the inter- and intra-relationships between the objects across and within clusters. Examples of such techniques include various 2D and 3D dendrograms [43, 19], which are highly effective for representing hierarchical clustering solutions; 2D and 3D non-linear projections of clustering solutions such as Sammon maps [40], MDS [30], SOM [29], and various projections based on PCA and LSI [20, 4]; and hyperbolic visualizations [31, 19, 51].

3. gCLUTO: INTERACTIVE CLUSTERING, VISUALIZATION, AND ANALYSIS

gCLUTO's primary design goals are to provide an easy-to-use platform that combines a variety of different clustering algorithms, which are capable of analyzing different types of datasets and finding clusters with different characteristics, along with a number of analysis, reporting, and visualization tools to aid in the interactive exploration and clustering-driven analysis of large datasets. Toward these goals, gCLUTO provides a wide range of clustering algorithms that operate either directly on the original feature-based representation of the objects or on object-to-object similarity graphs, and a project-oriented work-flow that eases the process of data analysis.

3.1 Clustering Algorithms

gCLUTO supports agglomerative, partitional, and graph-partitional clustering algorithms, each of which has different advantages and disadvantages, is suited for datasets with different characteristics, and can be used to perform different types of analysis. In addition, each of these algorithms can be used within the context of a bootstrap clustering process that uses statistical techniques to determine the stability of the resulting clustering solutions.
3.1.1 Agglomerative Clustering

Agglomerative algorithms find the clusters by initially assigning each object to its own cluster and then repeatedly merging pairs of clusters until either the desired number of clusters has been obtained or all the objects have been merged into a single cluster, leading to a complete agglomerative tree. The key step in these algorithms is the method used to identify the pair of clusters to be merged next. gCLUTO supports ten different merging schemes. The first three are based on the classical single-link [45], complete-link [28], and group-average [22] approaches, whereas the other seven, called I1, I2, E1, H1, H2, G1, and G2, are based on some recently introduced schemes that were motivated by research on partitional clustering [54].

These schemes differ in how the similarities between the individual objects in the various clusters are combined to determine the similarity between the clusters themselves. The single-link criterion function measures the similarity of two clusters by the maximum similarity between any pair of objects from each cluster, whereas complete-link uses the minimum similarity. In general, both the single- and the complete-link approaches do not work very well, because they either base their decisions on a limited amount of information (single-link) or assume that all the objects in the cluster are very similar to each other (complete-link). On the other hand, the group-average approach measures the similarity of two clusters by the average of the pairwise similarities of the objects from each cluster, and does not suffer from the problems arising with single- and complete-link. The remaining seven schemes take an entirely different approach and treat the clustering process as an optimization problem, selecting the cluster pairs that optimize various aspects of intra-cluster similarity, inter-cluster dissimilarity, and their combinations. The advantage of these schemes is that they lead to more natural clusters and to agglomerative trees that are more balanced than those of the traditional schemes. A precise description of these schemes is beyond the scope of this paper; the reader should refer to [54, 56] for a detailed description and comparative evaluation.

3.1.2 Partitional Clustering

Partitional clustering algorithms find the clusters by partitioning the entire dataset into a predetermined number of disjoint sets, each corresponding to a single cluster. This partitioning is achieved by treating the clustering process as an optimization procedure that tries to create high-quality clusters according to a particular function that reflects the underlying definition of the "goodness" of the clusters. This function is referred to as the clustering criterion function, and gCLUTO implements seven such criterion functions (similar to the I1, I2, E1, H1, H2, G1, and G2 schemes used by agglomerative clustering) that have been shown to produce high-quality clusters in low- and high-dimensional datasets [56]. gCLUTO uses two different methods for computing the partitional clustering solution.
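The single-link, complete-link, and group-average merging criteria described in Section 3.1.1 can be sketched in a few lines. This is an illustration of the classical definitions only, not CLUTO's optimized implementation; `merge_step` is a hypothetical helper name.

```python
# Sketch of the classical linkage criteria used by agglomerative
# clustering to score a candidate pair of clusters (illustrative only).

def single_link(sim, a, b):
    # cluster similarity = MAXIMUM pairwise object similarity
    return max(sim[i][j] for i in a for j in b)

def complete_link(sim, a, b):
    # cluster similarity = MINIMUM pairwise object similarity
    return min(sim[i][j] for i in a for j in b)

def group_average(sim, a, b):
    # cluster similarity = AVERAGE pairwise object similarity
    return sum(sim[i][j] for i in a for j in b) / (len(a) * len(b))

def merge_step(sim, clusters, linkage):
    # Greedily merge the most similar pair of clusters under `linkage`;
    # repeating this step until one cluster remains yields the full tree.
    pairs = [(p, q) for p in range(len(clusters))
                    for q in range(p + 1, len(clusters))]
    p, q = max(pairs, key=lambda pq: linkage(sim, clusters[pq[0]], clusters[pq[1]]))
    merged = clusters[p] + clusters[q]
    return [c for k, c in enumerate(clusters) if k not in (p, q)] + [merged]
```

Repeatedly applying `merge_step` with one of the three criteria reproduces the corresponding classical agglomerative scheme; the I1–G2 criteria differ only in the scoring function plugged in.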
The first method computes a k-way clustering solution via a sequence of repeated bisections, whereas the second method computes the solution directly (in a fashion similar to traditional k-means-based algorithms). These methods are often referred to as repeated-bisecting and direct k-way clustering, respectively. In both cases, the clustering solution is computed using a randomized incremental optimization algorithm that is greedy in nature, has low computational requirements, and produces high-quality solutions [56].

3.1.3 Graph-Partitional Clustering

gCLUTO's graph-partitioning-based clustering algorithms use a sparse graph to model the affinity relations between the different objects, and then discover the desired clusters by partitioning this graph [23]. To some extent, this approach is similar in spirit to that used by the partitional clustering algorithms described earlier; however, unlike those algorithms, the graph-partitioning-based approaches consider only the affinity relations between an object and a small number of its most similar other objects. As will be discussed later, this enables them to find clusters that have inherently different characteristics than those discovered by partitional methods.

gCLUTO provides different methods for constructing this affinity graph, and various post-processing schemes that are designed to help in identifying the natural clusters in the dataset. The actual graph partitioning is computed using an efficient multilevel graph-partitioning algorithm [24] that leads to high-quality partitionings and clustering solutions.

3.1.4 Bootstrap Clustering

Although the clustering problem is a well-defined optimization problem, there can still be uncertainty associated with its result due to uncertainties in the input data. For example, in the case of clustering data generated from measurements, each measurement will have a margin of error.
Without additional analysis, a clustering algorithm will cluster the data under the assumption that the data is completely accurate. However, it would be more appropriate if the algorithm could incorporate the uncertainties associated with the data and produce a clustering result that portrays its level of statistical significance given those uncertainties.

Bootstrap clustering is a technique introduced by [27] that adds the statistical technique of bootstrapping to clustering algorithms. Bootstrapping simulates the repeated sampling of a distribution by randomly selecting values from a known sample with replacement. By sampling with replacement, new hypothetical datasets can be produced from the original dataset that exhibit the same distribution of values and uncertainties. This allows clustering algorithms to explore what results would occur if the same measurements were taken again.

gCLUTO implements two methods of bootstrapping data: resampling features and resampling residuals. To resample the features of a dataset, gCLUTO randomly selects columns from the dataset with replacement to produce a new set of features. This resampling tests to what extent the clustering algorithm may be relying on any particular feature. The resampling of residuals is performed by first supplying a residual matrix for the dataset. A residual matrix contains the errors associated with a dataset, which can be found by fitting the data to a linear model. In [27], residuals for microarray data are found by fitting the data to an ANOVA model. gCLUTO can accept residual matrices stored in character-delimited file formats. Given the residual matrix, gCLUTO performs bootstrapping to generate a new residual matrix, which is then added to the original dataset to produce a new hypothetical dataset.
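The two bootstrapping methods can be sketched as follows. This is a minimal illustration of column resampling and residual resampling; the function names are hypothetical and the exact resampling scheme of [27] and gCLUTO may differ in detail (e.g., how residual entries are drawn).

```python
import random

def resample_features(data, rng=random):
    # Bootstrap the feature set: draw columns with replacement to build
    # a hypothetical dataset with the same number of features.
    n_feats = len(data[0])
    cols = [rng.randrange(n_feats) for _ in range(n_feats)]
    return [[row[c] for c in cols] for row in data]

def resample_residuals(data, residuals, rng=random):
    # Bootstrap the residuals: draw error terms with replacement from the
    # supplied residual matrix and add them back to the original data,
    # producing a hypothetical re-measurement of the dataset.
    n, m = len(data), len(data[0])
    boot = [[residuals[rng.randrange(n)][rng.randrange(m)] for _ in range(m)]
            for _ in range(n)]
    return [[data[i][j] + boot[i][j] for j in range(m)] for i in range(n)]
```

Clustering many such hypothetical datasets and comparing the resulting solutions is what yields the stability statistics described next.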
Bootstrap clustering uses these hypothetical datasets to estimate the significance of a clustering solution by clustering each hypothetical dataset and comparing all of the resulting clustering solutions. gCLUTO provides three statistics for reporting a clustering solution's significance: solution stability, cluster stability, and object stability [36]. The term stability refers to the level of consistency observed between the various clustering results. These stability measurements range from zero (no consistency between solutions) to one (complete consistency between solutions). Solution stability represents the significance of the solution as a whole, whereas cluster and object stability portray a significance level on a per-cluster and per-object basis.

gCLUTO compares the various solutions generated by bootstrap clustering by computing a consensus solution to which a mapping is found from all other solutions. This star-like mapping arrangement allows comparisons to be made between any pair of solutions while also requiring the fewest mappings to be found. The consensus solution is found by generating a solution that is as similar as possible to the full set of bootstrap solutions. This is done by clustering the objects using a similarity graph built from information about the multiple bootstrap solutions. gCLUTO builds this similarity graph by defining the similarity of two objects as the percentage of bootstrap solutions that assign the two objects to the same cluster. The consensus solution is also the final solution that gCLUTO presents to the user in the solution report.
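The co-assignment similarity underlying the consensus graph follows directly from its definition above: the similarity of two objects is the fraction of bootstrap solutions that place them in the same cluster. A sketch (not gCLUTO's implementation):

```python
def coassignment_similarity(solutions, n_objects):
    # solutions: list of cluster-label lists, one per bootstrap run.
    # sim[i][j] = fraction of runs that assign objects i and j to the
    # same cluster; this square matrix defines the consensus graph.
    sim = [[0.0] * n_objects for _ in range(n_objects)]
    for labels in solutions:
        for i in range(n_objects):
            for j in range(n_objects):
                if labels[i] == labels[j]:
                    sim[i][j] += 1.0
    runs = len(solutions)
    return [[s / runs for s in row] for row in sim]
```

Clustering this matrix (with any of the similarity-based algorithms above) yields a consensus partition that agrees as much as possible with the full set of bootstrap runs.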
3.1.5 Characteristics of the Various Clustering Algorithms

The various clustering algorithms provided by gCLUTO have been designed, and are well-suited, for finding different types of clusters, allowing for different types of analysis. There are two general types of clusters that often arise in different application domains and under different analysis requirements. What differentiates them is the similarity relations among the objects assigned to the various clusters. The first type contains clusters in which the similarity between all pairs of objects assigned to the same cluster is high. In contrast, the second type contains clusters in which the direct pairwise similarity between the various objects of the same cluster may be quite low, but within each cluster there exists a sufficiently large number of other objects that eventually "connect" these low-similarity objects. That is, if we consider the object-to-object similarity graph, then these objects will be connected by many paths that stay within the cluster and traverse high-similarity edges. The names of these two cluster types have been inspired by this similarity-based view: they are referred to as globular and transitive clusters, respectively. gCLUTO's partitional and agglomerative algorithms are able to find clusters that are primarily globular, whereas its graph-partitioning and some of its agglomerative algorithms (e.g., single-link) are capable of finding transitive clusters.

3.1.6 Similarity Measures

gCLUTO's feature-based clustering algorithms treat the objects to be clustered as vectors in a multi-dimensional space and measure the degree of similarity between these objects using the cosine function, Pearson's correlation coefficient, the extended Jaccard coefficient [48], or a similarity measure derived from the Euclidean distance of these vectors.
The first two similarity measures can be used by all clustering algorithms, whereas the last two can be used only by the graph-partitioning-based algorithms. Under the cosine and correlation-coefficient measures, two objects are similar if their corresponding vectors point in the same direction (i.e., they have roughly the same set of features and in the same proportion), regardless of their actual length (in the case of Pearson's correlation coefficient, the vectors are obtained by first subtracting their average value). On the other hand, the Euclidean distance takes into account both direction and magnitude. Finally, similarity based on the extended Jaccard coefficient accounts for both angle and magnitude. These are some of the most widely used measures, and they have been used extensively to cluster a variety of different datasets.

3.1.7 Computational Requirements

gCLUTO's algorithms have been optimized for operating on very large datasets, both in terms of the number of objects and the number of dimensions. Nevertheless, the various clustering algorithms have different memory and computational scalability characteristics. The agglomerative schemes can cluster datasets containing 2,000–5,000 objects in under a minute, but due to their memory requirements they should not be used to cluster datasets with over 10,000 objects. The partitional algorithms are very scalable both in terms of memory and computational complexity, and can cluster datasets containing several tens of thousands of objects in a few minutes. Finally, the complexity of the graph-based schemes is usually between that of the agglomerative and partitional methods, and they maintain the low memory requirements of the partitional schemes.

[Figure 1: Overview of gCLUTO's work-flow with example screen-shots for each stage.]
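The vector-based similarity measures from Section 3.1.6 can be sketched as follows. These are the standard textbook formulas; the extended Jaccard form shown is the common Tanimoto variant, which may differ in detail from the formulation of [48].

```python
import math

def dot(x, y):
    return sum(a * b for a, b in zip(x, y))

def cosine(x, y):
    # angle only: invariant to vector length
    return dot(x, y) / (math.sqrt(dot(x, x)) * math.sqrt(dot(y, y)))

def pearson(x, y):
    # cosine of the mean-centered vectors (cf. the parenthetical note above)
    mx, my = sum(x) / len(x), sum(y) / len(y)
    return cosine([a - mx for a in x], [b - my for b in y])

def extended_jaccard(x, y):
    # Tanimoto form: sensitive to both angle and magnitude
    d = dot(x, y)
    return d / (dot(x, x) + dot(y, y) - d)
```

Note how `cosine([1, 0], [3, 0])` is 1.0 (direction only), while `extended_jaccard` penalizes the length mismatch between the same two vectors.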
3.2 Clustering Work-flow and Organization

The main strength of gCLUTO is its ability to organize the user's data and work-flow in a way that eases the process of data analysis. This work-flow often consists of a sequence of stages, such as importing and preparing the data, selecting clustering options, interpreting solution reports, and concluding with visualization. Each stage of the process demands decisions to be made by the user that can alter the course of data analysis. Consequently, the user may want to backtrack to previous stages and create a new branch of analysis. An overview of this work-flow, with examples of branching, is depicted in Figure 1.

gCLUTO supports these types of work-flows by introducing the concept of a project. A project manages the various data files, solution reports, and visualizations that the user generates by storing and presenting them in a single container. Figure 2 illustrates how gCLUTO uses a tree to represent a project as it progresses through the stages of data analysis.

The user's work-flow begins by creating a new project. gCLUTO will create a new directory to hold all project-related files, as well as a new, empty project tree. Next, the user imports one or more related datasets. These datasets are represented by icons that appear directly beneath the project tree's root. After importation, the user can cluster a dataset to produce a clustering solution. For each solution, a solution report is generated which contains statistics about the clustering. Clustering solutions are represented by an 'S' icon and are placed beneath the clustered dataset's icon in the project tree. As more clustering solutions are generated, the project tree will continue to organize them by their corresponding datasets. Lastly, the work-flow concludes with interpreting a solution using one or more visualizations.
Again, the project tree will reflect which solutions have generated visualizations by placing visualization icons beneath them.

3.2.1 Creating a New Project

The user begins their analysis by first creating a new project. A project is intended to hold one or more related datasets. The project tree provides an easy interface for switching between datasets and comparing their results.

[Figure 2: A screen-shot of the project tree displaying data items, solutions, and visualizations. On the right is an example of a Solution Report.]

When a project is saved, all of the project information is stored under a single project directory specified by the user. Within the project directory, directories and text files are used to capture the same tree structure seen in the gCLUTO project tree. This straightforward format is used so that third-party applications can access gCLUTO's project data. In addition, gCLUTO allows exporting of solutions and printing of visualizations to standard formats for external use.

3.2.2 Importing Data

Datasets can be imported into gCLUTO in a variety of formats. Currently the supported formats include (i) sparse and dense vectors, (ii) object-to-object similarity matrices, and (iii) character-delimited files.

The vector format contains a matrix written in either dense or sparse form. Each row of the matrix represents an object, whereas each column represents a feature. The value in the ith row and jth column represents the strength of feature j in object i. With this matrix, gCLUTO compares objects by comparing their vectors. This format can directly represent a wide range of datasets, including the document-term matrices commonly used in information retrieval, customer purchasing transactions, gene expression measurements, or any other dataset that is represented as a rectangular matrix whose rows are the objects and whose columns are the various dimensions.

If such vectors are not available, but information about pairwise object similarities is available, then gCLUTO's similarity format can be used. This format consists of a square matrix with the same number of rows and columns as the number of objects. The value in the ith row and jth column represents the similarity of the ith and jth objects.
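The relationship between the two matrix formats can be illustrated by converting a dense vector matrix into the square object-to-object matrix that the similarity format describes. Here cosine similarity stands in for whichever measure is appropriate; this is a sketch of the relationship, not gCLUTO's import code.

```python
import math

def to_similarity_matrix(vectors):
    # Convert an object-by-feature matrix (the vector format) into the
    # square object-to-object matrix of the similarity format, using
    # cosine similarity as an example measure.
    def cos(x, y):
        d = sum(a * b for a, b in zip(x, y))
        return d / (math.sqrt(sum(a * a for a in x)) *
                    math.sqrt(sum(b * b for b in y)))
    return [[cos(x, y) for y in vectors] for x in vectors]
```

The resulting matrix is symmetric with ones on the diagonal, which is exactly the shape of input the similarity format expects.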
Note that the user can specify either a dense or a sparse similarity matrix. Similarity entries that are not supplied are assumed to be zero.

Character-delimited files contain the same information as gCLUTO's vector format, except in a more common and flexible form. Most spreadsheet applications can export data in character-delimited formats. The format also allows labels to be present in the first row and column of the matrix.

[Figure 3: Several screen-shots of the clustering dialog.]

3.2.3 Clustering

Once a dataset is imported into gCLUTO, clustering (using the various algorithms described in Section 3.1) can be initiated by selecting the desired options from the clustering options dialog pictured in Figure 3. A full listing of the available options is shown in Table 1. These options have been organized into four sections: General, Preprocess, Bootstrap, and Miscellaneous. The most general options include specifying the number of desired clusters, the clustering method, and the similarity and criterion functions. The preprocess options allow the user to prepare their data before clustering. This is accomplished by using model and pruning functions. The models scale various portions of the dataset, whereas the pruning options generate a more descriptive subset of the dataset. These options are necessary for datasets whose value distributions may skew clustering algorithms. In addition, these preprocessing options can be used to implement a number of object and feature weighting schemes that are used within the context of document clustering, including tf-(no)idf, maxtf-(no)idf, and logtf-(no)idf [39].

3.3 Solution Reports

Solution reports are generated for each dataset that is clustered. Solution reports contain information about the clustering options used and statistics about the discovered clusters.
These statistics include the number of clusters, the cluster sizes, the average internal and external similarities (ISim and ESim), the average internal and external standard deviations of these similarities (ISdev and ESdev), and a list of the most discriminating and descriptive features for each cluster. For each of these features, gCLUTO also displays the percentage of the within-cluster similarity and of the across-cluster difference that the feature accounts for, respectively. If known classes are specified for the objects, then the entropy, purity, and class conservation statistics are also displayed.

In Figure 4, an example solution report is given for a dataset containing documents about sports. Each object is a document, represented by the words it contains. A class file has been specified for this dataset, which allows gCLUTO to compare its clustering to the known classes. With this information, gCLUTO can calculate the purity and entropy of a cluster by noting how many different classes are associated with the objects of the cluster. The class distribution matrix shows how many objects of each cluster belong to each class. From the class distribution, we can see that clusters 0 through 6 each associate strongly with a single class. Cluster 7, however, appears to contain objects from many classes.

Table 1: Clustering options available in gCLUTO. Some options are only available for certain clustering methods.

  General
    # of Clusters     Number of clusters that the algorithm should find
    Method            Clustering algorithm to use
    Similarity        Function to measure the similarity between two objects
    Criterion         Function to guide the algorithm by evaluating intermediate clusterings
  Preprocess
    Models
      Row             Scales the values of each row in the data matrix
      Column          Globally scales the values of each column across rows
      Graph           Determines when an edge will exist between two vertices
    Pruning
      Column          Removes columns that do not contribute to similarity
      Vertex          Removes vertices that tend to be outliers
      Edge            Removes edges that tend to connect clusters
  Bootstrap
    Perform           Whether to perform bootstrap clustering
    # of Iterations   Number of solutions to create
    Features          Whether to resample the features in each iteration
    Residuals         Whether to resample the data by adding residuals
  Miscellaneous
    Graph Options
      Components      Remove small connected components before clustering
      Neighbors       # of nearest neighbors used in graph-partitioning
    Partitioning
      # of Trials     # of clusterings to create in searching for the best clustering
      # of Iterations # of refinement iterations in partitioning
      Selection       Determines how to select the next cluster for bisection
      K-way refine    Whether to k-way refine a repeated-bisectioning solution

3.4 Visualization

gCLUTO can generate two different visualizations that can be used to gain insight into the relationships between the clusters and the relationships between the objects within each cluster. Both of these visualizations are entirely interactive and can be easily customized and modified by the user.
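The per-cluster purity and entropy statistics that appear in the solution report can be sketched from their standard definitions. The normalization shown here is the common one; gCLUTO's exact formulation may differ.

```python
import math

def cluster_purity(class_counts):
    # fraction of the cluster taken up by its dominant class
    return max(class_counts) / sum(class_counts)

def cluster_entropy(class_counts):
    # Normalized entropy of the class distribution within a cluster:
    # 0 = cluster drawn from a single class, 1 = classes evenly mixed.
    # (Standard definition; gCLUTO's normalization may differ.)
    total = sum(class_counts)
    probs = [c / total for c in class_counts if c > 0]
    h = -sum(p * math.log(p) for p in probs)
    return h / math.log(len(class_counts)) if len(class_counts) > 1 else 0.0
```

Under these definitions, a cluster like cluster 7 in the example report (objects spread over many classes) would show low purity and high entropy, while clusters 0 through 6 would show the opposite.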
3.4.1 Matrix Visualization
The Matrix visualization allows the user to interactively view, explore, and understand the clustering solution by zooming, querying, and averaging arbitrary rows and columns. It represents the original data matrix, with a few alterations. First, colors are used to represent the values of the matrix. For example, dark red represents large positive values, while lighter shades are used for values near zero. Conversely, shades of green are used for negative values. Second, the rows of the matrix are reordered to display the clusters found during clustering. Objects of the same cluster have their rows placed consecutively, and black horizontal lines separate rows belonging to different clusters.

This display allows the user to visually inspect their data for patterns. In an ideal clustering solution, rows belonging to the same cluster should have relatively similar patterns of red and green. The visualization emphasizes these patterns for the user by displaying them in contiguous blocks. If the features represent a sequence, for example measurements in a time-course experiment, then the user can identify trends that occur across the features.

Figure 4: Example solution report of a clustering of sports-related documents. The sections of this report, in order, are clustering options, cluster statistics, class distribution, and descriptive and discriminating features.

The user may
also be able to identify more questionable clusters by observing stark dissimilarities between rows within a cluster.

Figure 5: A screen-shot of the Matrix visualization.

In addition to the color matrix, the visualization also includes labels and hierarchical trees located at the edges of the matrix. If the user supplies labels with their data, then the rows of the matrix will be labeled with object names and the columns with feature names. If the user clusters their data with an agglomerative algorithm, then the agglomerative tree will be displayed on the left-hand side of the visualization. The user may also generate a hierarchical tree even if a partitional clustering algorithm was used. In such cases, gCLUTO performs additional agglomerative clustering within each partitional cluster and a single agglomerative clustering of the clusters themselves. Using the trees generated from these additional clusterings, gCLUTO constructs a single hierarchical tree that conforms to the same cluster structure found with the partitional algorithm. Lastly, the Matrix visualization can also display a hierarchical tree called the feature tree, which is generated by performing agglomerative clustering on the transpose of the data matrix.

Similar to visualizations in other clustering applications, the hierarchical tree depicts relationships between objects by displaying the order in which objects were merged in the agglomerative process. Since merging is performed by descending pair-wise similarity, objects that are near each other in the tree are more similar than objects placed in distant locations. However, if users want to draw conclusions about object similarities using the hierarchical tree, they must keep in mind that a two-dimensional drawing of a hierarchical tree is not unique.
That is, for every parent node in the tree, the two children nodes and their sub-trees can be drawn in one of two possible orientations: top or bottom (note that gCLUTO draws the hierarchical tree with children placed to the upper-right or lower-right of their parents). gCLUTO removes this ambiguity by explicitly ordering the visual position of each subtree, choosing the set of orientations that maximizes the similarity between objects placed in consecutive rows of the Matrix visualization.

Manipulating the Matrix Visualization. Once the Matrix visualization is generated, users can further explore their results by manipulating the visualization in several ways. First, the user may collapse any set of rows or columns in the matrix by collapsing the corresponding nodes in the hierarchical trees located above and to the left of the matrix. By collapsing a node of the tree, the user can hide all of the node's descendants. In the matrix, the rows that belong to the leaves of the collapsed sub-tree are replaced by a single representative row. The representative row contains the average vector of all of the hidden rows and thus summarizes the data in a condensed form. This feature is especially useful for large datasets that are difficult to fully display on a computer monitor. Columns can also be collapsed in a similar manner. Where a representative row crosses a representative column, the intersection is a representative cell, which contains the average value of the cells contained within the collapsed rectangular region.

A frequent use of row averaging is to view the cluster mid-point vectors. This can be done either by collapsing the appropriate nodes in the object hierarchical tree, or by selecting the "Show Only Clusters" option from the "Matrix" menu. The user may also quickly expand all collapsed nodes by choosing the "Show All Objects" option from the "Matrix" menu. The last manipulation available to the user is scaling.
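The representative rows and cells described above are simple averages over the hidden region of the matrix. A minimal sketch of that behavior, with hypothetical function names that are not part of gCLUTO's API:

```python
def representative_row(matrix, hidden_rows):
    """Average the rows hidden under a collapsed tree node into a
    single representative row (illustrative sketch, not gCLUTO's code)."""
    cols = len(matrix[0])
    return [sum(matrix[r][c] for r in hidden_rows) / len(hidden_rows)
            for c in range(cols)]

def representative_cell(matrix, hidden_rows, hidden_cols):
    """Average value of the rectangular region where a collapsed row
    group crosses a collapsed column group."""
    total = sum(matrix[r][c] for r in hidden_rows for c in hidden_cols)
    return total / (len(hidden_rows) * len(hidden_cols))

data = [[1.0, 2.0],
        [3.0, 4.0],
        [5.0, 6.0]]
row = representative_row(data, [0, 1])        # [2.0, 3.0]
cell = representative_cell(data, [0, 1], [0, 1])  # 2.5
```

Collapsing every object sub-tree of a partitional cluster in this way yields exactly the cluster mid-point vector, which is why the "Show Only Clusters" option reduces to row averaging.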
A common problem with viewing similar visualizations in other applications is that it is difficult to represent a large dataset on a relatively small display. One solution is to display only a portion of the visualization at any one time and allow the user to scroll to view other portions. The downside of this solution is that the user has a narrow view of their data, which makes it difficult to compare local details to global trends. Another solution is to shrink the graphics until they fit within the viewable area. In cases where the matrix has more rows and columns than the number of pixels available, it becomes difficult to appropriately represent the matrix without excessive distortion. gCLUTO implements a unique compromise by allowing the user to zoom in on portions of the matrix that are of interest, while zooming away from portions that are less important but are still needed for context.

Scaling is initiated by selecting a rectangular region of cells in the matrix by dragging the mouse. Once selected, the rectangular region can be scaled to a larger or smaller size by using the mouse to drag any of the region's edges. This action rescales the selected region while keeping the scaling of neighboring regions intact. The visualization also provides several menu options and controls for performing common scalings.

3.4.2 Mountain Visualization
The Mountain visualization is another visualization that gCLUTO provides for analyzing a clustering result. Its purpose is to visually aid the user in understanding the contents of a high-dimensional dataset. The dimension of a dataset is problem specific and is determined by the number of features present, which can be on the order of tens to thousands. Since it is not convenient to directly display this data on a two-dimensional screen, a function must be chosen that maps the high-dimensional data to an easily displayed lower-dimensional representation. For each cluster, the Mountain visualization provides the number of constituent objects, internal similarity, external similarity, and standard deviation. The visualization attempts to summarize all of this information in one graphical form.

When a user generates a Mountain visualization from a clustering result, a 3D OpenGL window opens displaying a colored mountain-like terrain. This terrain consists of a horizontal plane that rises in peaks in several locations. Each peak represents a single cluster in the clustering. Information about the corresponding cluster is represented by the peak's location on the plane, height, volume, and color.

The most informative attribute of a peak is its location on the plane with respect to other peaks. The distance between a pair of peaks on the plane represents the relative similarity of their clusters. Similarity is measured using the same similarity function chosen for clustering. The purpose of this representation is to illustrate the relationships between clusters using visual distance. In this manner, clusters that are similar will have peaks that lie close together, whereas more dissimilar clusters will be displayed with distant peaks.

In Figure 6, an example Mountain visualization is given of a clustered dataset. Although this clustering specifies ten clusters, the visualization has placed these peaks into two large groups. This indicates a more general structure in the dataset, namely two large dissimilar clusters with high internal similarity. Given this information, the user may draw conclusions about the meaning of their clusters or may choose to re-cluster their data with a different specified number of clusters.

Figure 6: A screen-shot of the Mountain visualization displaying ten clusters in two major groups.

The height of each peak on the plane is proportional to the internal similarity of the corresponding cluster. Internal similarity is calculated by finding the average pair-wise similarity between objects of the cluster. The volume of a peak is proportional to the number of objects within the cluster.
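The per-cluster statistics behind a peak can be sketched as a mapping from a cluster's pairwise similarity matrix to height, volume, and a color value. This is an illustrative reconstruction with hypothetical names, not gCLUTO's code; it assumes a cluster with at least two objects.

```python
import statistics

def peak_attributes(sim, n_total):
    """Derive Mountain-peak attributes for one cluster from its pairwise
    similarity matrix `sim` and the total object count `n_total`.
    Height tracks internal similarity, volume tracks cluster size, and
    the color value tracks the standard deviation of the pairwise
    similarities (low -> "hot" red, high -> "cool" blue)."""
    n = len(sim)
    # Upper-triangle pairwise similarities (excluding self-similarity).
    pairs = [sim[i][j] for i in range(n) for j in range(i + 1, n)]
    height = sum(pairs) / len(pairs)   # average pair-wise similarity
    volume = n / n_total               # fraction of all objects
    color = statistics.pstdev(pairs)   # internal standard deviation
    return height, volume, color

# A tight 3-object cluster out of 10 objects total:
sim = [[1.0, 0.8, 0.6],
       [0.8, 1.0, 0.7],
       [0.6, 0.7, 1.0]]
h, v, c = peak_attributes(sim, 10)
```

Under this mapping a tall, red peak signals a large average pairwise similarity with little spread, matching the description of highly structured clusters below.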
Lastly, the color of a peak represents the internal standard deviation of the cluster's objects. Red represents low deviation, whereas blue represents high deviation. The internal deviation is calculated by finding the average standard deviation of the pair-wise similarities between the cluster's objects.

The overall effect of this representation is to emphasize features of highly structured data. For example, the user will be able to quickly identify clusters with high similarity by finding tall peaks. Also, the user will be able to identify clusters with low standard deviation, another feature of structured data, by finding peaks with "hot" colors, such as red and orange. Clusters with high standard deviation are often "noisy" and so are given a cool blue color. Since the default color of the terrain is blue, noisy and less meaningful clusters appear to blend into the background.

Implementation. The core algorithm behind the Mountain visualization is the mapping function between the original high-dimensional dataset and the two-dimensional data that is displayed. The Mountain visualization uses Multidimensional Scaling (MDS) [30, 9] to find a mapping that minimizes the data's distortion as it is mapped to the plane.

MDS is an algorithm that takes as input a list of high-dimensional vertices and outputs a list of lower-dimensional vertices. In gCLUTO's implementation, the cluster midpoints are used as input and the output consists of two-dimensional points, which are used to place peaks on the plane of the visualization. MDS evaluates a particular mapping using a stress function that calculates the mapping's distortion using a sum-of-squared-error calculation. The optimal mapping is defined as the mapping with the least error, which is found by minimizing

Error = \sum_{i<j} (\delta_{ij} - d_{ij})^2

where \delta_{ij} is the distance between cluster midpoints i and j in the original space and d_{ij} is the distance between their images in the plane.
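The MDS step can be sketched as gradient descent on a sum-of-squared-error stress. This is a minimal illustration of the idea under that assumption; the function name, the step-size schedule, and the exact stress form are not taken from gCLUTO's implementation.

```python
import random

def mds_2d(dist, iters=2000, lr=0.01, seed=0):
    """Embed points with target pairwise distances `dist` into 2D by
    gradient descent on the stress sum_{i<j} (dist[i][j] - d_ij)^2,
    where d_ij is the distance between the 2D images of i and j.
    Illustrative sketch only."""
    rng = random.Random(seed)
    n = len(dist)
    pos = [[rng.uniform(-1.0, 1.0), rng.uniform(-1.0, 1.0)]
           for _ in range(n)]
    for _ in range(iters):
        for i in range(n):
            for j in range(i + 1, n):
                dx = pos[i][0] - pos[j][0]
                dy = pos[i][1] - pos[j][1]
                d = (dx * dx + dy * dy) ** 0.5 or 1e-9
                # d(stress)/d(pos[i]) = -2 * (dist_ij - d) / d * (dx, dy)
                g = -2.0 * (dist[i][j] - d) / d
                pos[i][0] -= lr * g * dx
                pos[i][1] -= lr * g * dy
                # pos[j] receives the opposite gradient.
                pos[j][0] += lr * g * dx
                pos[j][1] += lr * g * dy
    return pos

# Three midpoints whose distances form a 3-4-5 triangle embed into the
# plane with essentially no distortion:
layout = mds_2d([[0, 3, 4], [3, 0, 5], [4, 5, 0]])
```

In gCLUTO the input distances would be derived from the same similarity function chosen for clustering, applied to the cluster midpoints; production MDS implementations typically also anneal the step size rather than fixing it.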