CMAR基于多关联的分类算法.pdf

发布时间：2022-05-29 发布人：admin 分类：说明书资料大小：0.13M 资料格式：pdf 举报版权申诉

maxft-1294264-4744302543315457680.pdf-第1页.png

第1页 / 共8页

maxft-1294264-4744302543315457680.pdf-第2页.png

第2页 / 共8页

maxft-1294264-4744302543315457680.pdf-第3页.png

第3页 / 共8页

maxft-1294264-4744302543315457680.pdf-第4页.png

第4页 / 共8页

maxft-1294264-4744302543315457680.pdf-第5页.png

第5页 / 共8页

maxft-1294264-4744302543315457680.pdf-第6页.png

第6页 / 共8页

maxft-1294264-4744302543315457680.pdf-第7页.png

第7页 / 共8页

maxft-1294264-4744302543315457680.pdf-第8页.png

第8页 / 共8页

文本预览

CMAR: Accurate and Efﬁcient Classiﬁcation Based on Multiple Class-Association Rules Wenmin Li Jiawei Han Jian Pei Burnaby, B.C., Canada V5A 1S6 School of Computing Science, Simon Fraser University E-mail: wli, han, peijian @cs.sfu.ca Abstract Previous studies propose that associative classiﬁcation has high classiﬁcation accuracy and strong ﬂexibility at handling unstructured data. However, it still suffers from the huge set of mined rules and sometimes biased classi- ﬁcation or overﬁtting since the classiﬁcation is based on only single high-conﬁdence rule. In this study, we propose a new associative classi- i.e., Classiﬁcation based on ﬁcation method, CMAR, Multiple Association Rules. The method extends an ef- ﬁcient frequent pattern mining method, FP-growth, con- structs a class distribution-associated FP-tree, and mines large database efﬁciently. Moreover, it applies a CR-tree structure to store and retrieve mined association rules ef- ﬁciently, and prunes rules effectively based on conﬁdence, correlation and database coverage. The classiﬁcation is performed based on a weighted analysis using multiple strong association rules. Our extensive experiments on databases from UCI machine learning database repository show that CMAR is consistent, highly effective at classiﬁ- cation of various kinds of databases and has better aver- age classiﬁcation accuracy in comparison with CBA and C4.5. Moreover, our performance study shows that the method is highly efﬁcient and scalable in comparison with other reported associative classiﬁcation methods. 1 Introduction Building accurate and efﬁcient classiﬁers for large databases is one of the essential tasks of data mining and machine learning research. Given a set of cases with class labels as a training set, classiﬁcation is to build a model (called classiﬁer) to predict future data objects for which the class label is unknown. Previous studies have developed heuristic/greedy The work was supported in part by the Natural Sciences and En- Contact author. gineering Research Council of Canada (grant NSERC-A3723), and the Networks of Centres of Excellence of Canada (grant NCE/IRIS-3). search techniques for building classiﬁers, such as decision trees [10], rule learning [2], na¨ıve-Bayes classiﬁcation [4], and statistical approaches [8]. These techniques induces a representative subset of rules (e.g., a decision tree or a set of rules) from training data sets for quality prediction. Recent studies propose the extraction of a set of high quality association rules from the training data set which satisfy certain user-speciﬁed frequency and conﬁdence thresholds. Effective and efﬁcient classiﬁers have been built by careful selection of rules, e.g., CBA [9], CAEP [3], and ADT [11]. Such a method takes the most effec- tive rule(s) from among all the rules mined for classiﬁ- cation. Since association rules explore highly conﬁdent associations among multiple variables, it may overcome some constraints introduced by a decision-tree induction method which examines one variable at a time. Extensive performance studies [6, 9, 3, 11] show that association- based classiﬁcation may have better accuracy in general. However, this approach may also suffer some weakness as shown below. On one hand, it is not easy to identify the most effective rule at classifying a new case. Some method, such as [9, 3, 11], simply selects a rule with a maximal user-deﬁned measure, such as conﬁdence. As we will see later, such a selection may not always be the right choice in many cases. Such a simple pick may affect the classiﬁcation accuracy. On the other hand, a training data set often generates a huge set of rules. It is challenging to store, retrieve, prune, and sort a large number of rules efﬁciently for classiﬁca- tion. Many studies [1, 5] have indicated the inherent nature of a combinatorial explosive number of frequent patterns and hence association rules that could be generated when the support threshold is small (i.e., when rare cases are also be included in the consideration). To achieve high accu- racy, a classiﬁer may have to handle a large set of rules, including storing those generated by association mining methods, retrieving the related rules, and pruning and sort- ing a large number of rules. Can we solve the above two problems? To solve the ﬁrst problem, that is, to predict a new case accurately, in- stead of applying only a single rule, one may consider a small subset of the most related, highly conﬁdent rules and

make a collective and all-around decision. Intuitively, that would help us avoid bias, exceptions, and overﬁtting of too small data sets. To overcome the second problem, the efﬁ- ciency and scalability problem of association-based classi- ﬁcation, one needs to develop efﬁcient methods for storing and retrieving rules. This may help improve efﬁciency as well as the accuracy of classiﬁcation since more rules can be stored and considered. This is the motivation of this research. In this paper, we develop a new technique, CMAR, for accurate and efﬁcient classiﬁcation and make the follow- ing contributions. First, instead of relying on a single rule for classiﬁca- tion, CMAR determines the class label by a set of rules. Given a new case for prediction, CMAR selects a small set of high conﬁdence, highly related rules and analyzes the correlation among those rules. To avoid bias, we develop a new technique, called weighted , which derives a good measure on how strong the rule is under both conditional support and class distribution. An extensive performance study shows that CMAR in general has higher prediction accuracy than CBA [9] and C4.5 [10]. Second, to improve both accuracy and efﬁciency, CMAR employs a novel data structure, CR-tree, to com- pactly store and efﬁciently retrieve a large number of rules for classiﬁcation. CR-tree is a preﬁx tree structure to ex- plore the sharing among rules, which achieves substantial compactness. CR-tree itself is also an index structure for rules and serves rule retrieval efﬁciently. Third, to speed up the mining of complete set of rules, CMAR adopts a variant of recently devel- oped FP-growth method. FP-growth is much faster than Apriori-like methods used in previous association-based classiﬁcation, such as [9, 3, 11], especially when there ex- ist a huge number of rules, large training data sets, and long pattern rules. The remaining of the paper is arranged as follows. Sec- tion 2 revisits the general idea of associative classiﬁcation. Section 3 devotes to the generation of rules for classiﬁca- tion. Section 4 discusses how to classify a new data ob- ject using the generated rules. The experimental results on classiﬁcation accuracy and the performance study on efﬁ- ciency and scalability are reported in Section 5. The paper is concluded in Section 6. 2 Associative Classiﬁcation , . . . , , where follows the Suppose a data object schema are called at- tributes. Attributes can be categorical or continuous. For a categorical attribute, we assume that all the possible values are mapped to a set of consecutive positive integers. For a continuous attribute, we assume that its value range is dis- cretized into intervals, and the intervals are also mapped to consecutive positive integers. By doing so, all the at- tributes are treated uniformly in this study. be a ﬁnite set of class labels. A Let object with it. A classiﬁer ( . Given a data object , there exists a class label "!$#%'& is a function from , ( training data set is a set of data objects such that, for each associated to returns a class ) *& In general, given a training data set, the task of classi- ﬁcation is to build a classiﬁer from the training data set such that it can be used to predict class labels of unknown objects with high accuracy. label. Besides many different approach for classiﬁcation, such is a set of as decision tree approach, naive Bayesian approach, + - nearest neighbors approach, neural network approach, a new approach is to explore association relationships be- tween object conditions and class labels [9]. The idea is natural since it utilizes frequent patterns and association relationships between cases and class labels in training data set to do classiﬁcation. If strong associations among some frequent patterns and class labels can be observed in training data set, the future object of similar patterns can be classiﬁed. -/.01-2 -98 , -98 & , is said to match , if and only if for )3A4BC4 in attribute , let be a class label. For , the number of data objects in D match- and having class label is called the support . The ratio of the number of ob- and having class label versus In general, a pattern , attribute-values such that for 354674 %= for > %<; ? . A data object and : @-/.01-2 pattern , -98 . -98 has value Given a training data set D rule EGF,IH ing pattern , of E , denoted as JKML jects matching pattern , the total number of objects matching pattern , the conﬁdence of E NPO For example, if QMRS cannot get a credit limit more than TMUVMVMV , i.e., the conﬁ- dence of rule EWF no job H credit limit less than UMVVMV is QMRS to classify future data objects. To avoid noise, a rule is used for classiﬁcation only if it has enough support. Given a support thresh- old and a conﬁdence threshold, associative classiﬁcation method ﬁnds the complete set of class-association rules (CAR) passing the thresholds. When a new (unknown) object comes, the classiﬁer selects the rule which matches the data object and has the highest conﬁdence and uses it to predict the class label of the new object. , then we can use rule E of customers who have no job , denoted as is called . Recent studies show that associative classiﬁcation is in- tuitive and effective and has good classiﬁcation accuracy in many cases. In most existing associative classiﬁcation methods [9, 3, 11], the rule with the highest conﬁdence is used for classiﬁcation. However, such a decision may not always be the correct one. For example, suppose we want to determine the credit limit of a customer with attribute values (no job, invest- conﬁdent rules matching the customer are as follows. ment immigrant, oversea assetXYRVMV+ ). The top-U most Z Rule E credit limit UMVVMV\[ : no job H (support: + : ; + E E

); : UVMVMV , conﬁdence: QMRMS Z Rule credit limit UVMVMV QUMS Z Rule credit limit UVMVMV E : ); and ). investment immigrant (support: conﬁdence: RMVMVV , oversea asset (support: RMVVM+ conﬁdence: VMVV , So, given such a customer, what class label should we pre- dict? A conventional associative classiﬁcation method, like only, since it has the highest conﬁdence. However, a closer decision seriously. The three rules have very similar conﬁ- CBA, may predict credit limit UVMVV\[ according to rule E look at rules E and E may suggest that we reconsider the dence, but E and E have stronger support. The decision based on rules E and E seems to be more reliable. The above example indicates that to make a reliable and accurate prediction, the most conﬁdent rule may not al- ways be the best choice, and a thorough, detailed, and all- around measure analysis based on multiple rules may lead to better quality prediction. 3 Generating Rules for Classiﬁcation In this section, we develop a new associative- classiﬁcation method, called CMAR, which performs Classiﬁcation based on Multiple Association Rules. CMAR consists of two phases: rule generation and classiﬁcation. In the ﬁrst phase, rule generation, CMAR computes the complete set of rules in the form of E ,IH is a pattern in the training data set, and such that JKL , where , pass the given support and conﬁdence thresholds, respectively. Furthermore, CMAR prunes some rules and only selects a subset of high quality rules for classiﬁcation. is a class label and NPO In the second phase, classiﬁcation, for a given data ob- , CMAR extracts a subset of rules matching the ob- ject and predicts the class label of the object by analyzing this subset of rules. ject pattern mining algorithm which is faster than conventional Apriori-like methods, especially in the situations where there exist large data sets, low support threshold, and/or long patterns. The general idea of mining rules in CMAR is shown in the following example. Example 1 (Mining class-association rules) Given a as shown in Table 1. Let the support . CMAR threshold is mines class-association rules as follows. and conﬁdence threshold is RVMS training data set D Row-id Class label 1 2 3 4 5 Table 1. A training data set. First, CMAR scans the training data set D once, ﬁnd the set of attribute values happening at least twice in D and is called frequent item set. set is All other attribute values, which fail the support threshold, cannot play any role in the class-association rules, and thus can be pruned. G . The Then, CMAR sorts attribute values in scending order, i.e., F-list CMAR scans the training data set again to construct an FP-tree, as shown in Figure 1(a). . Then, 7 in support de- ? FP-tree is a preﬁx tree w.r.t. F-list. For each tuple in the training data set, attributes values appearing in F-list are extracted and sorted according to F-list. For example, for the ﬁrst tuple, / are extracted and inserted in the tree as the left-most branch in the tree. the class label is attached to the last node in the path. Tuples in the training data set share preﬁxes. For exam- in ple, the second tuple carries attribute values F-list and shares a common preﬁx So, it also shares the with the ﬁrst tuple. sub-path with the left-most branch. All nodes with same attribute value are linked together as a queue started from the header table. In this section, we develop methods to generate rules for classiﬁcation. The second phase, classiﬁcation, will be discussed in Section 4. Header Table Item Link 3.1 Mining Class-Association Rules Passing Sup- port and Conﬁdence Thresholds To ﬁnd rules for classiﬁcation, CMAR ﬁrst mines the training data set to ﬁnd the complete set of rules passing certain support and conﬁdence thresholds. This is a typi- cal frequent pattern or association rule mining task [1]. To make mining highly scalable and efﬁcient, CMAR adopts a variant of FP-growth method [5]. FP-growth is a frequent a1 b2 c1 d3 root a1 d3 (A:1) c1 (A:1) b2 d3 (C:1) c1 (B:1) d3 (C:1) Header Table Item Link a1 b2 c1 root a1 c1 (A:1) b2 (C:1) c1 (B:1, C:1) (a) FP-tree (b) FP-tree after merging nodes of d3 Figure 1. FP-tree in Example 1. E H X H Q 3 S F E E

. It (In nor Third, based on F-list, the set of class-association rules subsets without overlap: (1) the ones can be divided into ; (2) the ones having but no having ; (3) the ones ; and (4) the ones having only having but no . CMAR ﬁnds these subsets one by one. Fourth, to ﬁnd the subset of rules having , CMAR tra- verses nodes having attribute value and look “upward” to collect a -projected database, which contains three tu- ples: , / and contains all the tuples having . The problem of ﬁnding all frequent patterns having can be reduced to mine frequent patterns in -projected database. in the whole training set Recursively, in -projected database, are the frequent attribute values, i.e., they pass support thresh- old. ple and thus is trivially frequent. We do not count a local frequent attribute value.) We can mine the pro- jected database recursively by constructing FP-trees and projected databases. Please see [5] for more details. happens in every tu- as -projected database, and and -projected database, al- is a frequent pattern. and have same sup- . To avoid triviality, we only adopt fre- It happens that, in ways happen together and thus and are two subpatterns of port count as quent pattern information, we generate rule and conﬁdence 3 After search for rules having . Based on the class label distribution with support VVMS are node is regis- -projected merged into their parent nodes, respectively. That is, the class label information registered in a tered in its parent node. The FP-tree is shrunk as shown in Figure 1(b). Please note that this tree-shrinking opera- tion is done at the same scan of collecting the database. 5H , all nodes of . The remaining subsets of rules can be mined similarly. There are two major differences in the rule mining in CMAR and the standard FP-growth algorithm. On one hand, CMAR ﬁnds frequent patterns and gen- erates rules in one step. Conventionally, association rules must be mined in two steps [1]. This is also the case for traditional associative classiﬁcation methods [9]. First, all the frequent patterns (i.e., patterns passing support threshold) are found. Then, all the association rules satisfying the conﬁdence threshold are generated based on the mined frequent patterns. The difference of CMAR from other associative clas- siﬁcation methods is that for every pattern, CMAR main- tains the distribution of various class labels among data ob- jects matching the pattern. This is done without any over- head in the procedure of counting (conditional) databases. Thus, once a frequent pattern (i.e., pattern passing support threshold) is found, rules about the pattern can be gener- ated immediately. Therefore, CMAR has no separated rule generation step. On the other hand, CMAR uses class label distribution to prune. , let be the most dominant . If the number is less than the support threshold, there is no need to search any super- since any rule in the form of For any frequent pattern , class in the set of data objects matching , of objects having class label and matching , > of , pattern (superset) , 3.2 Storing Rules in CR-tree cannot satisfy the support threshold either. Once a rule is generated, it is stored in a CR-tree, which is a preﬁx tree structure. We demonstrate the general idea of CR-tree in the following example. Example 2 (CR-tree) After mining a training data set, four rules are found as shown in Table 2. Rule-id 1 2 3 4 Rule Support Conﬁdence Table 2. Rules found in a training data set. A CR-tree is built for the set of rules, as shown in Figure 2, while the construction process is explained as follows. Header Table root item head of node-links a b b c a b c d e c (A, 80, 80%) e (B, 36, 60%) d (C, 210, 70%) d (A, 63, 90%) Figure 2. A CR-tree for rules in Example 2. A CR-tree has a root node. All the attribute values ap- pearing at the left hand side of rules are sorted according to their frequency, i.e., the most frequently appearing at- tribute value goes ﬁrst. The ﬁrst rule, is inserted into the tree as a path from root node. The class label as well as the support " and conﬁdence of the rule, denoted as VS registered at the last node in the path, i.e., node , shares a preﬁx The second rule, rule. the ﬁrst rule. Thus, it is inserted into the tree by extending a new node to the path formed by the ﬁrst rule. Again, the class label, support and conﬁdence of the rule are reg- istered at the last node, i.e., . , are for this with F F F , > H H V H

The third and fourth rules can be inserted similarly. All the nodes with the same attribute value are linked together by node-link to a queue. The head of each queue is stored in a Header table. To store the original rule set, 3 U cells are needed for the left hand sides of the rules. Using CR-tree, only Q nodes are needed. As can be seen from the above example, the CR-tree structure has some advantages as follows. CR-tree is a compact structure. It explores potential sharing among rules and thus can save a lot of space on storing rules. Our experimental results show that, in many CR-tree itself is an index for rules. For example, if we cases, aboutRV VMS of space can be saved using CR-tree. want to retrieve all the rules having attribute value and in the set of rules in Example 2, we only need to traverse node-links of , which starts at the header table, and keep looking upward for . Once a CR-tree is built, rule retrieval becomes efﬁcient. That facilitates the pruning of rules and using rules for classiﬁcation dramatically. 3.3 Pruning Rules The number of rules generated by class-association rule mining can be huge. To make the classiﬁcation effective and also efﬁcient, we need to prune rules to delete redun- dant and noisy information. According to the facility of rules on classiﬁcation, a , E , denoted global order of rules is composed. Given two rules E is said having higher rank than E and E ; (2) , if and only if (1) as E NPO NPO ; or (3) but JKL NPO NPO JKL has but E , JKL NPO NPO JKL does. In fewer attribute values in its left hand side than E is said a general rule w.r.t. addition, a rule E FP, > , if and only if , > . is a subset of , rule E CMAR employs the following methods for rule pruning. First, using general and high-conﬁdence rule to prune more speciﬁc and lower conﬁdence ones. Given two rules . CMAR . The ratio- nale is that we only need to consider general rules with high conﬁdence, and thus more speciﬁc rules with low conﬁdence should be pruned. is a general rule w.r.t. E also has higher rank than E and E prunes E , where E if E FM, This pruning is pursued when the rule is inserted into the CR-tree. When a rule is inserted into the tree, retrieval over the tree is triggered to check if the rule can be pruned or it can prune other rules that are already inserted. Our experimental results show that this pruning is effective. each rule EFM,IH related with by Second, selecting only positively correlated rules. For is positively cor- testing. Only the rules that are posi- tively correlated, i.e., those with value passing a signif- icance level threshold, are used for later classiﬁcation. All , we test whether , Algorithm 1 (Selecting rules based on database coverage) Input: a set of rules and a coverage threshold Output: a subset of rules for classiﬁcation Method: 1. Sort rules in the rank descending order; 2. For each data object in the training data set, set its cover-count to ; 3. While both the training data set and rule set are not empty, for each rule in rank descending order, ﬁnd all data objects matching rule . If can correctly classify at least one object then select and increase . A the cover-count of those objects matching data object is removed if its cover-count passes cov- erage threshold by ; Figure 3. Selecting rules based on database coverage. the other rules are pruned. The rationale of this pruning is that we use the rules re- ﬂecting strong implications to do classiﬁcation. By remov- ing those rules not positively correlated, we prune noise. This pruning happens when a rule is found. Since the distribution of class labels w.r.t. frequent patterns is kept track during the rule mining, the testing is done almost for free. Third, pruning rules based on database coverage. CMAR selects a subset of high quality rules for classiﬁca- tion. That is achieved by pruning rules based on database coverage. CMAR uses a coverage threshold [7] to select database coverage, as shown in Figure 3. The database coverage method used in CMAR is simi- lar to that in CBA. The major difference is that, instead of removing one data object from the training data set imme- diately after it is covered by some selected rule, we let it rules. That allows stay there until it is covered by at least more selected rules. When classifying a new data object, it may have more rules to consult and may have better chance to be accurately predicted. This pruning is pursued when the rule mining process ﬁnishes. It is the last pruning of rules. 4 Classiﬁcation Based on Multiple Rules After a set of rules is selected for classiﬁcation, as dis- cussed in Section 3, CMAR is ready to classify new ob- jects. Given a new data object, CMAR collects the subset of rules matching the new object from the set of rules for classiﬁcation. In this section, we discuss how to determine the class label based on the subset of rules. Trivially, if all the rules matching the new object have the same class label, CMAR just simply assigns that label E E E E E E E E E E E H > H E

to the new object. If the rules are not consistent in class labels, CMAR di- vides the rules into groups according to class labels. All rules in a group share the same class label and each group has a distinct label. CMAR compares the effects of the groups and yields to the strongest group. To compare the strength of groups, we need to mea- sure the “combined effect” of each group. Intuitively, if the rules in a group are highly positively correlated and have good support, the group should have strong effect. There are many possible ways to measure the combined effect of a group of rules. For example, one can use the strongest rule as a representative. That is, the rule with highest value is selected. However, simply choosing the rule with highest value may be favorable to minority classes, as illustrated in the following example. Example 3 In a credit card application approval case, there are two rules. NPO: N F MVV , NPO Figure 4. NP , and NP: J: QMQ RS MVMS :/ N N JKLML LML JKL The observed and expected contingencies are shown in approved rejected total job=yes job=no total The observed contingency of rule approved rejected total ed=univ univ ed total The observed contingency of rule job=yes job=no total approved rejected total The expected contingency of rule approved rejected total ed=univ univ ed total The expected contingency of rule . . . . Figure 4. Observed and expected contingen- cies for rules. Based on the observed and expected values, the val- and E are and UU , respectively. For a customer having no job and with university education, we may predict her application would be rejected using rule ues for E , if the choice of rules is based on only However, rule E since E is intuitively much better than E has much higher support and conﬁdence. values. Another alternative is to use the compound of correla- tion of rules as measure. For example, we can sum up values in a group as the measure of the group. However, this method suffers from the same problem that it may fa- vors minority too much. A better way is to integrate both information of cor- relation and popularity into the measure. After empirical veriﬁcation, CMAR adopts a weighted measure [7] as follows. be the number of D the number of data objects in the training as follows. / ,IH data objects in training data set that associated with class For each rule EF label and data set. We deﬁne JKML <:/N where , let JKL for rule E " JKL " JKML / JKL D D ! "#$ &%'")( #+ ,-+ & "#"*#+ #+ ,/+ $ & "#"0$ &%'" computes the upper bound of value of the rule w.r.t. other settings are ﬁxed. Then, for each group of measure of the group is deﬁned as rules, the weighted & "*#+ ,-+ &%'"#" ,-+ !%'"#".( 2323 5476 3 . As can be seen, we use the ratio of value against ) to overcome the bias of its upper bound (i.e., value favoring minority class. Please note that, theoreti- cally, it is hard to verify the soundness or effect of mea- sures on strength of groups of rules. Instead, we explore the effect of measures empirically, and according to our value is the best from experimental results, weighted among a good set of candidate measure formulas that can be worked out by us. 5 Experimental Results and Performance Study To evaluate the accuracy, efﬁciency and scalability of CMAR, we have performed an extensive performance study. In this section, we report our experimental results on comparing CMAR against two popular classiﬁcation meth- ods: CBA [9] and C4.5 [10]. It shows that CMAR outper- forms both CBA and C4.5 in terms of average accuracy, efﬁciency and scalability. All the experiments are performed on a 600MHz Pen- tium PC with 128M main memory, running Microsoft Windows/NT. CBA and C4.5 were implemented by their E F H E K K H : E , , [ [ ( [ [ 1 2

authors, respectively. In the experiments, the parameters of the three methods are set as follows. All C4.5 parameters are default values. We test both C4.5 decision tree method and rule method. Since the rule method has better accuracy, we only report the accuracy for rule method. For CBA, we set support threshold to 3 dence threshold to RMVS rules. Other parameters remain default. and conﬁ- and disable the limit on number of For CMAR, the support and conﬁdence thresholds are set as same as CBA. The database coverage threshold is set to and the conﬁdence difference threshold to . All reports of the runtime include both CPU time and VMS I/O time. We test data sets from UCI Machine Learning Repository. We use C4.5’s shufﬂe utility to shufﬂe the data sets. Also, we adopt the same method used by CBA to discretize continuous attributes. Data set Anneal Austral Auto Breast Cleve Crx Diabetes German Glass Heart Hepatic Horse Hypo Iono Iris Labor Led7 Lymph Pima Sick Sonar Tic-tac Vehicle Waveform Wine Zoo Average # attr 38 14 25 10 13 15 8 20 9 13 19 22 25 34 4 16 7 18 8 29 60 9 18 21 13 16 # cls 6 2 7 2 2 2 2 2 7 2 2 2 2 2 3 2 10 4 2 2 2 2 4 3 3 7 # rec 898 690 205 699 303 690 768 1000 214 270 155 368 3163 351 150 57 3200 148 768 2800 208 958 846 5000 178 101 C4.5 94.8 84.7 80.1 95 78.2 84.9 74.2 72.3 68.7 80.8 80.6 82.6 99.2 90 95.3 79.3 73.5 73.5 75.5 98.5 70.2 99.4 72.6 78.1 92.7 92.2 83.34 CBA 97.9 84.9 78.3 96.3 82.8 84.7 74.5 73.4 73.9 81.9 81.8 82.1 98.9 92.3 94.7 86.3 71.9 77.8 72.9 97 77.5 99.6 68.7 80 95 96.8 84.69 CMAR 97.3 86.1 78.1 96.4 82.2 84.9 75.8 74.9 70.1 82.2 80.5 82.6 98.4 91.5 94 89.7 72.5 83.1 75.1 97.5 79.4 99.2 68.8 83.2 95 97.1 85.22 Table 3. The comparison of C4.5, CBA and CMAR on accuracy. C4.5 and CBA on accuracy. Furthermore, out of the As can be seen from the table, CMAR outperforms both U ones. of test data sets. In some data sets, e.g. Lymph, CMAR wins the second place data sets, CMAR achieves the best accuracy in 3 In another word, CMAR wins in RMVS over RMS in accuracy. There are two important parameters, database coverage threshold and conﬁdence difference threshold, in CMAR. As discussed before, these two thresholds control the num- ber of rules selected for classiﬁcation. In general, if the set of rules is too small, some effective rules may be missed. On the other hand, if the rule set is too large, the training data set may be overﬁt. Thus, we need to test the sensitivities of the two thresholds w.r.t. classiﬁcation accuracy. As an example, we test different database coverage threshold values on the Sonar data set from UCI Machine Learning Database Repository. The results are shown in Figure 5, where the conﬁdence difference threshold is set ence threshold values on the Sonar data set. The results are shown in Figure 6, where the database coverage threshold to V . On the other hand, we test different conﬁdence differ- is set to 3 . ) % ( y c a r u c c A 84 82 80 78 76 1 1.5 2 2.5 3 3.5 4 4.5 5 Database coverage threshold Figure 5. The effect of coverage threshold on accuracy. ) % ( y c a r u c c A 80 79 78 77 76 75 0 0.02 0.04 0.06 0.08 0.1 0.12 Confidence difference threshold Figure 6. The effect of conﬁdence difference threshold on accuracy. From the ﬁgures, one can see that the peaks of accu- racy are achieved at the middle of both curves. That is, there are optimal settings for both thresholds. However, according to our experimental results, there seems no way to pre-determine the best threshold values. Fortunately, both curves are quite plain. That means the accuracy is not very sensitive to the two thresholds values. CR-tree is a compact structure to store rules. To test S

the effectiveness of CR-tree, we compare the main mem- ory usage of CBA and CMAR on large test data sets. The results are shown in Table 4. Dataset # attr # cls # rec CBA CMAR mem (M) mem (M) saving Auto Hypo Iono Sick Sonar Vehicle Average 25 25 34 29 60 18 7 2 2 2 2 4 488 330 334 321 590 590 205 3163 351 2800 208 846 Table 4. The comparison of CBA and CMAR on main memory usage. 393.33 101 90 86 85 88 88 90 Please note that, in this experiment, we disable the lim- itation of number of rules in CBA. In such a setting, CBA and CMAR generate all the rules necessary for classiﬁca- tion and thus are compared in a fair base. From the table, one can see that, on average, CMAR achieves sav- ing on main memory usage. 93 The saving in main memory usage can be explained MS from two apsects. First, CMAR uses CR-tree. The compactness of CR-tree brings signiﬁcant gain in storing a large set of rules where many items in the rules can be shared. On the other hand, CR-tree is also an index structure of rules. Before a rule is inserted into a CR-tree, CMAR checks if there is a general rule or some more speciﬁc rules in the tree. If so, related pruning is pursued immediately. Such a pruning techique also contributes to the saving of main memory. To test the scalability of CMAR, we compare the run- time of CBA and CMAR on six data sets. The results are shown in Figure 5. Again, we disable the limit on number of rules in CBA. In the experiments, CBA spends a large portion of runtime on I/O. Dataset Auto Hypo Iono Sick Sonar Vehicle # attr 25 25 34 29 60 18 # cls 7 2 2 2 2 4 # rec 205 3163 351 2800 208 846 CBA runtime CMAR runtime 612s 92s 150s 74s 226s 284s 408s 19s 89s 13s 145s 192s Table 5. The runtime of CBA and CMAR. As can be seen from the table, CMAR is faster than CBA in many cases. Please be note that the machine we use for testing is with relatively small size of main memory (128M). Both CBA and CMAR can be expected running signiﬁcantly faster if more main memory is available. 6 Conclusions In this paper, we examined two major challenges in associative classiﬁcation: (1) efﬁciency at handling huge number of mined association rules, and (2) effectiveness at predicting new class labels with high classiﬁcation ac- curacy. We proposed a novel associative classiﬁcation i.e., Classiﬁcation based on Multiple method, CMAR, Association Rules. The method has several distinguished (1) its classiﬁcation is performed based on a features: weighted analysis enforced on multiple association rules, which leads to better overall classiﬁcation accu- racy, (2) it prunes rules effectively based on conﬁdence, correlation and database coverage, and (3) its efﬁciency is achieved by extension of an efﬁcient frequent pat- tern mining method, FP-growth, construction of a class distribution-associated FP-tree, and applying a CR-tree structure to store and retrieve mined association rules efﬁ- ciently. Our experiments on databases in UCI machine learning database repository show that CMAR is consis- tent, highly effective at classiﬁcation of various kinds of databases and has better average classiﬁcation accuracy in comparison with CBA and C4.5, and is more efﬁcient and scalable than other associative classiﬁcation methods. References [1] R. Agrawal and R. Srikant. Fast algorithms for mining as- sociation rules. In VLDB’94, Chile, Sept. 1994. [2] P. Clark and T. Niblett. The CN2 induction algorithm. Ma- chine Learning, 3:261–283, 1989. [3] G. Dong, X. Zhang, L. Wong, and J. Li. Caep: Classiﬁ- cation by aggregating emerging patterns. In DS’99 (LNCS 1721), Japan, Dec. 1999. [4] R. Duda and P. Hart. Pattern Classiﬁcation and Scene Anal- ysis. John Wiley & Sons, 1973. [5] J. Han, J. Pei, and Y. Yin. Mining frequent patterns without In SIGMOD’00, Dallas, TX, May candidate generation. 2000. [6] B. Lent, A. Swami, and J. Widom. Clustering association rules. In ICDE’97, England, April 1997. [7] W. Li. Classiﬁcation based on multiple association rules. M.Sc. Thesis, Simon Fraser University, April 2001. [8] T.-S. Lim, W.-Y. Loh, and Y.-S. Shih. A comparison of prediction accuracy, complexity, and training time of thirty-three old and new classiﬁcation algorithms. Machine Learning, 39, 2000. [9] B. Liu, W. Hsu, and Y. Ma. Integrating classiﬁcation and association rule mining. In KDD’98, New York, NY, Aug. 1998. [10] J. R. Quinlan. C4.5: Programs for Machine Learning. Mor- gan Kaufmann, 1993. [11] K. Wang, S. Zhou, and Y. He. Growing decision tree on In KDD’00, Boston, MA, support-less association rules. Aug. 2000.

分享到：

赞收藏

资料库

CMAR基于多关联的分类算法.pdf

相关推荐

开发技术

热门标签

最新资料