COOLCAT: An entropy-based algorithm for categorical clustering
Daniel Barbara
Julia Couto
Yi Li
George Mason University
Information and Software Engineering Department
Fairfax, VA
{dbarbara,jics,yli}@gmu.edu
October ,
Abstract
Clustering of categorical attributes is a difficult problem that has not received as much at-
tention as its numerical counterpart. In this paper we explore the connection between clustering
and entropy: clusters of similar points have lower entropy than those of dissimilar ones. We use
this connection to design a heuristic algorithm, COOLCAT, which is capable of efficiently clustering
large data sets of records with categorical attributes. In contrast with other categorical clustering
algorithms published in the past, COOLCAT's clustering results are very stable for different sample
sizes and parameter settings. Also, the criterion for clustering is a very intuitive one, since it is
deeply rooted in the well-known notion of entropy. We demonstrate the efficiency and scalability
of COOLCAT by a series of experiments on real and synthetic data sets.
1 Introduction
Clustering is a widely used technique in which data points are partitioned into groups, in such a
way that points in the same group, or cluster, are more similar among themselves than to those in
other clusters. Clustering of categorical attributes (i.e., attributes whose domain is not numeric) is
a difficult, yet important task: many fields, from statistics to psychology, deal with categorical data.
In spite of its importance, the task of categorical clustering has received scant attention in the KDD
community as of late, with only a handful of publications addressing the problem ([, , ]).
Most of the published algorithms for clustering categorical data rely on the usage of a distance
metric that captures the separation between two vectors of categorical attributes, such as the Jaccard
coefficient []. In this paper, we present COOLCAT (the name comes from the fact that we reduce
the entropy of the clusters, thereby "cooling" them), a novel method which uses the notion of entropy
to group records. We argue that a classical notion such as entropy is a more natural and intuitive
way of relating records and, more importantly, does not rely on arbitrary distance metrics.
This paper is organized as follows. Section 2 offers the background and the relationship between entropy
and clustering, and formulates the problem. Section 3 reviews the related work. Section 4 describes
COOLCAT, our algorithm. Section 5 presents the experimental evidence that demonstrates the
advantages of COOLCAT. Finally, Section 6 presents conclusions and future work.
This work has been supported by NSF grant IIS-
2 Background and problem formulation
In this section, we present the background of entropy and clustering and formulate the problem.
2.1 Entropy and Clustering
Entropy is the measure of information and uncertainty of a random variable []. Formally, if X is
a random variable, S(X) the set of values that X can take, and p(x) the probability function of X,
the entropy E(X) is defined as shown in Equation 1.
E(X) = -\sum_{x \in S(X)} p(x) \log(p(x))    (1)
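As an illustration (ours, not part of the original presentation), the entropy of a single categorical attribute can be estimated from the observed value frequencies. The sketch below uses base-2 logarithms; the paper does not fix a base, and the choice only rescales the values.

import math
from collections import Counter

def entropy(values, base=2.0):
    # Empirical entropy E(X) = -sum_x p(x) log p(x) over the observed values.
    counts = Counter(values)
    n = len(values)
    return -sum((c / n) * math.log(c / n, base) for c in counts.values())

# A two-valued attribute with a 50/50 split carries one bit of entropy.
print(entropy(["red", "red", "blue", "blue"]))   # 1.0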
The entropy of a multivariate vector \hat{x} = \{X_1, \ldots, X_n\} can be computed as shown in Equation 2.

E(\hat{x}) = -\sum_{x_1 \in S(X_1)} \cdots \sum_{x_n \in S(X_n)} p(x_1, \ldots, x_n) \log p(x_1, \ldots, x_n)    (2)
Entropy is sometimes referred to as a measure of the amount of "disorder" in a system. A room
with socks strewn all over the floor has more entropy than a room in which socks are paired up,
neatly folded, and placed in one side of your sock and underwear drawer.
2.2 Problem formulation
The problem we are trying to solve can be formulated as follows. Given a data set D of N
points \hat{p}_1, \ldots, \hat{p}_N, where each point is a multidimensional vector of d categorical attributes, i.e.,
\hat{p}_j = (p_j^1, \ldots, p_j^d), and given an integer k, we would like to separate the points into k groups
C_1, \ldots, C_k, or clusters, in such a way that we minimize the entropy of the whole arrangement. Un-
fortunately, this problem is NP-Complete and, moreover, difficult to approximate []. In fact, the
problem is NP-Complete for any distance function d(x, y), defined over pairs of points x, y, such
that the function maps pairs of points to real numbers (and hence our entropic function qualifies);
therefore we need to resort to heuristics to solve it.
We first have to resolve the issue of what we mean by the "whole entropy of the system." In
other words, we have to make our objective function clear. We aim to minimize the expected entropy,
whose expression is shown in Equation 3, where E(P(C_1)), \ldots, E(P(C_k)) represent the entropies of
each cluster, P(C_i) denotes the points assigned to cluster C_i, P(C_i) \subseteq D, with the property that
P(C_i) \cap P(C_j) = \emptyset for all i, j = 1, \ldots, k, i \neq j. The symbol \bar{C} = \{C_1, \ldots, C_k\} represents the
clustering.
\bar{E}(\bar{C}) = \sum_{k} \left( \frac{|P(C_k)|}{|D|} \, E(P(C_k)) \right)    (3)
This function, as we will see later, allows us to implement an incremental algorithm that can
effectively deal with large datasets, since we do not need to look at the entire set of points to decide
about the entropy of an arrangement. Rather, we will be able to decide, for each point, how it would
affect the entropy of each of the existing clusters if placed in each one of them.
The solution we propose in this paper (and present in Section 4) is a heuristic based on finding
a set of initial clusters (using the entropic criterion), and then incrementally (greedily) adding points to
the clusters according to a criterion that minimizes Equation 3.
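To make the objective function concrete, the following sketch (ours; the function names are illustrative) computes the expected entropy of Equation 3 as a size-weighted sum of per-cluster entropies, leaving the per-cluster entropy function as a parameter so that either the joint form of Equation 2 or the simplified per-attribute form of Equation 4 (introduced below) can be plugged in; it assumes every cluster is non-empty.

def expected_entropy(clusters, cluster_entropy):
    # clusters: list of clusters, each a non-empty list of records.
    # cluster_entropy: function mapping one cluster to its entropy E(P(C_k)).
    total = sum(len(c) for c in clusters)
    return sum((len(c) / total) * cluster_entropy(c) for c in clusters)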
Clustering 1:  Cluster 1 = {"red", "heavy"}, {"red", "medium"}   Cluster 2 = {"blue", "light"}
Clustering 2:  Cluster 1 = {"red", "heavy"}, {"blue", "light"}   Cluster 2 = {"red", "medium"}
Clustering 3:  Cluster 1 = {"red", "heavy"}                      Cluster 2 = {"red", "medium"}, {"blue", "light"}

Figure 1: Three different clusterings for the set v_1, v_2, v_3. Clustering 1 minimizes the expected
entropy of the two clusters.
Furthermore, we make a simplification in the computation of the entropy of a set of records. We
assume independence of the attributes of the record, transforming Equation 2 into Equation 4. In
other words, the joint probability of the combined attribute values becomes the product of the
probabilities of each attribute, and hence the entropy can be calculated as the sum of entropies of
the attributes.
E(\hat{x}) = -\sum_{x_1 \in S(X_1)} \cdots \sum_{x_n \in S(X_n)} (p(x_1) \cdots p(x_n)) \log(p(x_1) \cdots p(x_n))
         = E(X_1) + E(X_2) + \cdots + E(X_n)    (4)
Assume that we have a set of three records, v_1 = {"red", "heavy"} and v_2 = {"blue", "light"} as before,
and v_3 = {"red", "medium"}, and we want to form two clusters with them. Figure 1 shows all the
possible arrangements, with the entropy of each cluster and the expected entropy of each arrangement.
As we can see, the minimum expected entropy is that of arrangement 1, which obviously is the correct
way of clustering the records (using two clusters).
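The sketch below (our own illustration) reproduces this example under the attribute-independence simplification of Equation 4, using base-2 logarithms; the ranking of the arrangements does not depend on the base.

import math
from collections import Counter

def attribute_entropy(column):
    n = len(column)
    return -sum((c / n) * math.log(c / n, 2) for c in Counter(column).values())

def cluster_entropy(cluster):
    # Equation 4: sum of per-attribute entropies, attributes assumed independent.
    return sum(attribute_entropy(col) for col in zip(*cluster))

def expected_entropy(clusters):
    total = sum(len(c) for c in clusters)
    return sum((len(c) / total) * cluster_entropy(c) for c in clusters)

v1, v2, v3 = ("red", "heavy"), ("blue", "light"), ("red", "medium")
arrangements = [
    ("clustering 1", [[v1, v3], [v2]]),
    ("clustering 2", [[v1, v2], [v3]]),
    ("clustering 3", [[v1], [v2, v3]]),
]
for name, clusters in arrangements:
    print(name, round(expected_entropy(clusters), 3))
# clustering 1 yields the smallest expected entropy, as in Figure 1.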
However, in the cases where we can demonstrate that there is a correlation between two or more
attributes of the data set, we will change the data points by creating attributes that reflect these
correlations and then apply Equation 4 to compute the joint entropy. For instance, if the data set
is composed of records with attributes A, B, C, D, E, F and we know that A, B, that A, C, and that
E, F are correlated, we will convert the data set into one having records with attributes AB, AC, D, EF
and compute the entropy assuming that these new attributes are independent. Notice that for the
grouped attributes, we are in effect computing their joint probabilities.
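As a small, hypothetical illustration of this preprocessing step (the grouping below is arbitrary), correlated attributes can be merged into compound attributes before the per-attribute entropies are computed.

def merge_attributes(record, groups):
    # groups lists the attribute positions to combine; e.g. for a record with
    # attributes (A, B, C, D, E, F) where A,B / A,C / E,F are correlated,
    # [(0, 1), (0, 2), (3,), (4, 5)] yields compound attributes (AB, AC, D, EF).
    return tuple("+".join(record[i] for i in g) for g in groups)

record = ("a1", "b2", "c1", "d3", "e2", "f1")
print(merge_attributes(record, [(0, 1), (0, 2), (3,), (4, 5)]))
# ('a1+b2', 'a1+c1', 'd3', 'e2+f1')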
2.3 Entropy and the similarity coefficients
As we shall prove shortly, as a measure of the similarity between two vectors, the use of entropy
is equivalent to that of other widely-used similarity coefficients []. Similarity coefficients are best
explained by the 2 x 2 association table shown in Figure 2, where 1 refers to the presence of a variable
and 0 to its absence.
The simple matching coefficient (SM) is defined as shown in Equation 5. Notice that this coeffi-
cient takes into account the joint absence of a variable (as indicated by d in the table).
SM = \frac{a + d}{a + b + c + d}    (5)
                 second vector = 1    second vector = 0
first vector = 1         a                    b
first vector = 0         c                    d

Figure 2: Association table: 1 refers to the presence of a variable, and 0 to its absence.
The Jaccard coefficient (J) is shown in Equation 6. It avoids the use of joint absences of a variable
in the calculation of similarity.

J = \frac{a}{a + b + c}    (6)

Lemma 1 For any vectors p, q, u, v in a data set, SM(\{p, q\}) = SM(\{u, v\}) iff E(\{p, q\}) = E(\{u, v\}).
Proof: Throughout the proof, we will use SM(\{p, q\}) = \frac{a + d}{a + b + c + d} and SM(\{u, v\}) = \frac{a' + d'}{a' + b' + c' + d'},
where a, b, c, d are the entries of the association table of p, q, and a', b', c', d'
those for the association table of u, v. Notice that since the vectors are drawn from the same data
set, they all have the same number of attributes; therefore, a + b + c + d = a' + b' + c' + d'.
We shall proceed with the proof of the two implications.
Part a) If SM(\{p, q\}) = SM(\{u, v\}), then E(\{p, q\}) = E(\{u, v\}):
Since SM(\{p, q\}) = SM(\{u, v\}) = k, then a + d = k(a + b + c + d) and a' + d' =
k(a' + b' + c' + d'). Also, since a + b + c + d = a' + b' + c' + d', it follows that a + d = a' + d',
and b + c = b' + c'. Now, since for matching attributes the entropy contribution will be 0,
and for non-matching attributes the entropy contribution will be -\log(1/2), it follows that
E(\{p, q\}) = -(b + c)\log(1/2) = -(b' + c')\log(1/2) = E(\{u, v\}).
Part b) If E(\{p, q\}) = E(\{u, v\}), then SM(\{p, q\}) = SM(\{u, v\}): Since for any pair of vectors
the matching attributes will not contribute to the total entropy, while the non-matching attributes
will contribute -\log(1/2) each (i.e., an equal amount per attribute), E(\{p, q\}) = E(\{u, v\}) implies that
b + c = b' + c'. Now, since a + b + c + d = a' + b' + c' + d', it follows that a + d = a' + d',
proving the implication.
Lemma 2 For any vectors p, q, u, v in a data set, SM(\{p, q\}) > SM(\{u, v\}) iff E(\{p, q\}) <
E(\{u, v\}).
Proof: Part a) If SM(\{p, q\}) > SM(\{u, v\}), then E(\{p, q\}) < E(\{u, v\}).
Since the first coefficient is larger than the second, more attribute values agree for
the two vectors (a + d > a' + d'). This obviously reduces the entropy of the pair, since agreeing
values contribute 0 to the final entropy count.
Part b) If E(\{p, q\}) < E(\{u, v\}), then SM(\{p, q\}) > SM(\{u, v\}).
Since the entropy of the first pair is less than that of the second, fewer attributes in
the vectors disagree (since those contribute the value -\log(1/2)), i.e., b + c < b' + c'. Therefore,
the coefficient will be larger (more values agree).
Lemma 3 For any vectors p, q, u, v in a data set, SM(\{p, q\}) < SM(\{u, v\}) iff E(\{p, q\}) >
E(\{u, v\}).
Proof: Similar to Lemma 2.
Theorem 1 Given two similarity matrices built over the same data set, one using SM and the
other entropy, if we order the first matrix in descending order of entries, and the second in ascending
order, the relative ordering of the pairs of points is the same in both cases.
Proof: By Lemmas 1, 2 and 3.

Figure 3: Association table for vectors v_1 = (1, 1, 0) and v_3 = (1, 0, 1).
As a consequence of Theorem 1, in any algorithm that uses the SM coefficient as a similarity
measure, we could use entropy as well, obtaining identical results. For instance, ROCK []
(which we shall review in Section 3) uses an "extended" similarity measure in which the coefficient
is computed as the number of attributes that agree divided by the total number of attribute val-
ues, calling this coefficient the Jaccard coefficient. (For instance, in the example shown in Section
2.2, the coefficient J(\{v_1, v_2\}) = 0, since no values are the same in v_1 = <red, heavy> and
v_2 = <blue, light>; but J(\{v_1, v_3\}) = 1/3, since one value out of three is the same in the pair
v_1 = <red, heavy>, v_3 = <red, medium>.)
In fact, it is easy to see that if we convert
the "red", "heavy", and "medium" values into binary variables, the vectors become v_1 = (1, 1, 0)
and v_3 = (1, 0, 1), and the association table for v_1, v_3 is the one shown in Figure 3. Then the SM
coefficient can be easily computed as 1/3. The Jaccard coefficient in this case would also be the same,
as d = 0. This fact is true for every case in which we convert the values into binary variables, since
we will never have a column in which the two binary values are 0. So, it is easy to see that Theorem 1
will always apply when the Jaccard coefficient is applied to nominal attributes by converting the
values to binary variables, since then the Jaccard coefficient and the SM coefficient would be equal.
Our algorithm, COOLCAT, however, uses the fact that entropy, unlike the SM or Jaccard coeffi-
cients, can serve as a measure of similarity among any set of vectors (not just two).
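The correspondence stated by Theorem 1 is easy to check numerically. The sketch below (our code, not from the paper) compares, over binary vectors, the simple matching coefficient with the entropy of each pair, and confirms that sorting pairs by decreasing SM and by increasing entropy gives the same order.

import math
from itertools import combinations

def simple_matching(p, q):
    agreements = sum(1 for x, y in zip(p, q) if x == y)   # a + d
    return agreements / len(p)                            # (a + d) / (a + b + c + d)

def pair_entropy(p, q):
    # Each disagreeing position contributes -log(1/2); matching positions contribute 0.
    disagreements = sum(1 for x, y in zip(p, q) if x != y)
    return -disagreements * math.log(0.5, 2)

vectors = [(1, 1, 0, 1), (1, 0, 0, 1), (0, 1, 1, 0), (1, 1, 1, 1)]
pairs = list(combinations(vectors, 2))
by_sm = sorted(pairs, key=lambda pq: simple_matching(*pq), reverse=True)
by_entropy = sorted(pairs, key=lambda pq: pair_entropy(*pq))
print(by_sm == by_entropy)   # True: the two orderings coincide (Theorem 1)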
2.4 Expected entropy and the Minimum Description Length principle
The Minimum Description Length principle (MDL) [, ] recommends choosing the model that
minimizes the number of bits needed to encode it. This principle is widely used to compare classifiers
(see []), but it has not been used much to deal with clustering. (An idea of how to do this by encoding
the cluster centers is sketched in [].)
An optimal encoding for a set of clusters can be realized by assigning codes to each attribute value of
the data based on the probability that the value appears in the cluster, using a Huffman code. For each
cluster C_k, this encoding has length -\sum_i \sum_j P(A_i = V_{ij}) \log P(A_i = V_{ij}), so it can be
minimized by minimizing the expected value of that function. But this is precisely the function we
have selected to minimize: the expected entropy of the clustering. So, our technique aims to find
the clustering that follows the MDL principle. This has important implications: the fact that the
encoding of the clusters is minimal implies that we can expect very concise representations of the
clusters we have found at any given point. This in turn makes possible the incremental processing
of further data points without having to keep all the previous data points in memory, but just the
concise representation of the clusters they form.
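A rough sketch of this connection (ours, using base-2 logarithms): the ideal per-record code length for a cluster, encoded attribute by attribute, is its per-attribute entropy, so the size-weighted average code length over the clusters matches the expected entropy of Equation 3.

import math
from collections import Counter

def code_length_per_record(cluster):
    # Ideal (entropy) code length, in bits, to encode one record of the cluster
    # attribute by attribute: -sum_i sum_j P(A_i = V_ij) log2 P(A_i = V_ij).
    bits = 0.0
    for column in zip(*cluster):
        n = len(column)
        bits -= sum((c / n) * math.log2(c / n) for c in Counter(column).values())
    return bits

clusters = [[("red", "heavy"), ("red", "medium")], [("blue", "light")]]
total = sum(len(c) for c in clusters)
print(sum((len(c) / total) * code_length_per_record(c) for c in clusters))
# Matches the expected entropy of clustering 1 in Figure 1 (about 0.667 bits per record).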
Figure 4: A market basket data set.
Figure 5: The clustering solution obtained by ROCK and by expected entropy for the data of Figure 4.
2.5 Evaluating clustering results
This section illustrates the difficulties in evaluating the results of clustering algorithms. A frequent
problem one encounters when applying clustering algorithms in practice is the difficulty in evaluating
the solutions. Different clustering algorithms (and sometimes multiple applications of the same
algorithm using slight variations of initial conditions or parameters) result in very different solutions,
all of them looking plausible. This stems from the fact that there is no unifying criterion to define
clusters, and more often than not, the final clusters found by an algorithm are in fact the ones that
correspond to the criterion used to drive that algorithm.
To illustrate this point, let us consider the following example, taken from [], a paper that
describes ROCK, a categorical clustering algorithm by Guha, Rastogi and Shim. (ROCK, as we shall
explain in Section 3, uses a combination of a Jaccard coefficient [] and the computation of common
neighbors among points to drive the clustering algorithm.) Figure 4 shows
a series of transactions that represent "market baskets" of products acquired by clients in a shop.
(Each product is represented by an integer.) The problem is to cluster these baskets into two clusters.
Figure 5 shows both the solution obtained by ROCK and the one obtained by finding the minimum
expected entropy (both solutions are identical).
How do we know this is a good solution? Authors have pondered about good ways to validate
clusters found by algorithms (e.g., see [, , ]). Two widely used methods are the following:

Significance tests on external variables This technique calls for the usage of significance tests
that compare the clusters on variables not used to generate them. One way of doing this is to
compute the entropy of the solution using a variable that did not participate in the clustering
(a class attribute). The entropy of an attribute C in a cluster C_k is computed as shown in
Equation 7, where V_i denotes one of the possible values that C can take. The evaluation is
performed by computing the expected entropy (taking the cluster sizes into consideration).

E(C_k) = -\sum_{i} P(C = V_i) \log P(C = V_i)    (7)
The category utility function The category utility (CU) function [] attempts to maximize
both the probability that two objects in the same cluster have attribute values in common and
the probability that objects from different clusters have different attributes. The expression to
calculate the expected value of the CU function is shown in Equation 8.

CU = \sum_{k} \frac{|C_k|}{|D|} \sum_{i} \sum_{j} \left[ P(A_i = V_{ij} \mid C_k)^2 - P(A_i = V_{ij})^2 \right]    (8)
We have used both techniques in validating our results, as shall be seen in the experimental
section.
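Both measures are straightforward to compute from the cluster assignments. The sketch below (our code; names are ours) evaluates the expected class entropy of Equation 7 against an external class attribute, and the category utility of Equation 8.

import math
from collections import Counter

def expected_class_entropy(clusters_labels):
    # clusters_labels: one list of external class labels per cluster (Equation 7),
    # weighted by cluster size as in the expected-entropy evaluation.
    total = sum(len(labels) for labels in clusters_labels)
    result = 0.0
    for labels in clusters_labels:
        n = len(labels)
        e = -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())
        result += (n / total) * e
    return result

def category_utility(clusters):
    # clusters: one list of categorical records per cluster (Equation 8).
    data = [record for cluster in clusters for record in cluster]
    total = len(data)
    overall = [Counter(column) for column in zip(*data)]
    cu = 0.0
    for cluster in clusters:
        n = len(cluster)
        within = [Counter(column) for column in zip(*cluster)]
        s = 0.0
        for i, value_counts in enumerate(overall):
            for value, count in value_counts.items():
                s += (within[i].get(value, 0) / n) ** 2 - (count / total) ** 2
        cu += (n / total) * s
    return cu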
3 Related Work
Clustering is an extensively researched area not only by data mining and database researchers [,
, , , , , ], but also by people in other disciplines [, ]. Most of the efforts, however, have
been focused on clustering of numerical data records.
Among the numerical clustering algorithms, ENCLUS [] uses entropy as a criterion to drive the
algorithm. However, ENCLUS follows a completely different approach from ours. ENCLUS
divides the hyperspace recursively, considering an additional dimension in each iteration. For each
subspace, ENCLUS estimates its density and entropy and determines if it satisfies the goodness
criterion: its entropy has to be lower than a threshold. The authors of ENCLUS prove the relationship
of their entropic measure with the definition of density. However, it is not possible to translate either
the algorithm or the relationships to the area of categorical clustering, since the notion of density
has no intuitive meaning when the attributes are categorical. The density estimation requires a
definition of the hypervolume of the subspace, which is impossible to calculate without setting an
arbitrary distance between two values of a categorical attribute (e.g., blue and red).
In the area of clustering categorical records, a few recent publications are worth mentioning. In
[], the authors address the problem of clustering transactions in a market basket database. To do
so, each frequent itemset is represented as a hyperedge in a weighted hypergraph. The weight of
a hyperedge is computed as the average of the confidences of all possible association rules that can be
generated from the itemset. Then, a hypergraph partitioning algorithm is employed to partition the
items, minimizing the weight of the cut hyperedges. The result is a clustering of items. However,
the algorithm does not produce a clustering of the transactions, and it is not obvious how to obtain
one from the item clusters. A related paper by Gibson et al. [] also treats categorical clustering as
hypergraph partitioning, but uses a less combinatorial approach to solving it, based on non-linear
dynamical systems.
CACTUS [] is an agglomerative algorithm that uses the authors' definitions of support, strong
connection and similarity to cluster categorical data. The support of an attribute value pair (a_i, a_j),
where a_i is in the domain of attribute A_i and a_j in the domain of attribute A_j, is defined as the
number of tuples that have these two values. The two attribute values a_i, a_j are strongly connected if
their support exceeds the value expected under attribute independence. This concept is then
extended to sets of attributes. A cluster is defined as a region of attributes that are pairwise strongly
connected, such that no sub-region has the property, and whose support exceeds the expected support
under the attribute-independence assumption. Similarity is used to connect attribute values (a_1, a_2)
of the same attribute A_i, and it measures how many "neighboring" values x belonging to other attributes
exist such that (a_1, x) and (a_2, x) have positive support. Using support and similarity, CACTUS defines
inter-attribute and intra-attribute summaries, which tell us how related values from different attributes
and from the same attribute are, respectively. CACTUS uses these summaries to compute the so-called
cluster projections on individual attributes and uses these projections to obtain candidate clusters on a
pair of attributes, extending them then to three and more attributes. CACTUS, obviously, bases its
clustering decisions on the "neighboring" concept of similarity, similar to the algorithms described in [, ].
Hierarchical agglomerative methods (e.g., [, , ]) cluster points by iteratively merging
clusters (at the start every point is a cluster by itself). For instance, the single linkage method []
begins searching for the two most similar points in a similarity matrix and puts them in the same
cluster. Then it searches for the most similar point to those in the cluster, making this point part
of the cluster. At every step, the algorithm looks for points to join into existing clusters (or among
themselves). The problem with agglomerative methods is that they do not scale, since they require
the calculation and storage of potentially large similarity matrices. They also lack stability when the
data is re-shuffled in the similarity matrix.
ROCK [] computes distances between records using the Jaccard coefficient (redefined as shown
in Section 2.3). Using a threshold, it determines, for each record, which other records are its neighbors.
For a given point p, a point q is a neighbor of p if the Jaccard coefficient J(p, q) exceeds the threshold
\theta. Then it computes the values of a matrix LINK, in which the entries link(p, q) are the number of
common neighbors of p and q. The algorithm then proceeds to cluster the records in an agglomerative
way, trying to maximize, for the k clusters (k is a predefined integer), the function

\sum_{i=1}^{k} n_i \sum_{p, q \in C_i} \frac{link(p, q)}{n_i^{1 + 2 f(\theta)}}

where \theta is the threshold and f(\theta) is a function selected by the user. The ROCK solution offered in
Figure 5 for the data of Figure 4 was found using \theta = 0.5 and f(\theta) = (1 - \theta)/(1 + \theta). The choice of f(\theta) is
critical in defining the fitness of the clusters formed by the ROCK algorithm, and, as the authors
point out, the function is dependent on the data set as well as on the kind of cluster the user is
interested in. We feel that choosing the function is a delicate and difficult task for users, which may
be a roadblock to using ROCK efficiently. In contrast, our algorithm offers a unique, well-defined
cluster fitness criterion that is based on a solid and well-understood property: entropy.
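For concreteness, here is a minimal sketch (ours, not taken from the ROCK paper) of the neighbor and link computation just described; the threshold theta and the nominal-attribute Jaccard coefficient are the ingredients the user must supply.

from itertools import combinations

def jaccard(p, q):
    # Matching attributes divided by the number of distinct attribute values in the pair.
    matches = sum(1 for x, y in zip(p, q) if x == y)
    return matches / (2 * len(p) - matches)

def link_counts(records, theta):
    # link(p, q) = number of common neighbors, where r is a neighbor of p iff J(p, r) >= theta.
    n = len(records)
    neighbors = [{j for j in range(n)
                  if j != i and jaccard(records[i], records[j]) >= theta}
                 for i in range(n)]
    return {(i, j): len(neighbors[i] & neighbors[j])
            for i, j in combinations(range(n), 2)}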
4 Our algorithm
Our entropy-based algorithm, COOLCAT, consists of two steps: initialization and incremental step.
4.1 Initialization
The initialization step "bootstraps" the algorithm, finding a suitable set of clusters out of a sample
S taken from the data set (|S| << N), where N is the size of the entire data set. We first find the
k most "dissimilar" records in the sample by maximizing the minimum pairwise entropy of the
chosen points. We start by finding the two points p_{s_1}, p_{s_2} that maximize E(\{p_{s_1}, p_{s_2}\}) and placing
them in two separate clusters (C_1, C_2), marking the records (this takes O(|S|^2)). From there, we
proceed incrementally, i.e., to find the record we will put in the j-th cluster, we choose an unmarked
point p_{s_j} that maximizes \min_{i=1,\ldots,j-1}(E(\{p_{s_i}, p_{s_j}\})).
The rest of the unmarked points in the sample (|S| - k), as well as the remaining points (outside the
sample), are placed in the clusters using the incremental step.
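A minimal sketch of this seed selection follows (our code, assuming the sample holds at least k records; the pairwise entropy follows Equation 4 with base-2 logarithms, so each disagreeing attribute contributes one bit).

def pair_entropy(p, q):
    # Two-record cluster: matching attributes contribute 0, disagreeing ones 1 bit each.
    return float(sum(1 for x, y in zip(p, q) if x != y))

def select_seeds(sample, k):
    # Start with the pair of maximum entropy, then greedily add the unmarked record
    # that maximizes the minimum pairwise entropy to the seeds chosen so far.
    n = len(sample)
    i0, j0 = max(((i, j) for i in range(n) for j in range(i + 1, n)),
                 key=lambda ij: pair_entropy(sample[ij[0]], sample[ij[1]]))
    seeds = [i0, j0]
    while len(seeds) < k:
        candidates = [i for i in range(n) if i not in seeds]
        seeds.append(max(candidates,
                         key=lambda c: min(pair_entropy(sample[c], sample[s]) for s in seeds)))
    return [sample[i] for i in seeds]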
4.2 Incremental Step
After the initialization, we process the remaining records of the data set (the rest of the sample and
the points outside the sample) incrementally, finding a suitable cluster for each record. This is done by
computing the expected entropy that results from placing the point in each of the clusters and selecting
the cluster for which that expected entropy is minimum. We proceed in the incremental step by
bringing a buffer of points into main memory and clustering them one by one. The first batch of points
is composed of those records in the sample that were not selected to seed the clusters initially.
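The assignment step can be sketched as follows (our code; it keeps, per cluster, only per-attribute value counts, which is all that is needed to recompute the expected entropy of Equation 3 under the simplification of Equation 4, and it assumes every cluster already holds at least one seed record).

import math
from collections import Counter

def cluster_entropy(counts, size):
    # Sum of per-attribute entropies (Equation 4) from per-attribute value counts.
    e = 0.0
    for attribute_counts in counts:
        e -= sum((c / size) * math.log2(c / size) for c in attribute_counts.values())
    return e

def assign(record, clusters):
    # clusters: list of (counts, size) pairs, one per non-empty cluster, where counts
    # is a list of Counters (one per attribute). Returns the index of the chosen cluster.
    total = sum(size for _, size in clusters) + 1
    best, best_expected = None, float("inf")
    for target in range(len(clusters)):
        expected = 0.0
        for j, (counts, size) in enumerate(clusters):
            if j == target:
                trial = [Counter(c) for c in counts]
                for i, value in enumerate(record):
                    trial[i][value] += 1
                expected += ((size + 1) / total) * cluster_entropy(trial, size + 1)
            else:
                expected += (size / total) * cluster_entropy(counts, size)
        if expected < best_expected:
            best, best_expected = target, expected
    counts, size = clusters[best]
    for i, value in enumerate(record):
        counts[i][value] += 1          # commit the record to the winning cluster
    clusters[best] = (counts, size + 1)
    return best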