Machine Learning 3: 261-283, 1989
© 1989 Kluwer Academic Publishers - Manufactured in The Netherlands

The CN2 Induction Algorithm

PETER CLARK (PETE@TURING.AC.UK)
TIM NIBLETT (TIM@TURING.AC.UK)
The Turing Institute, 36 North Hanover Street, Glasgow, G1 2AD, U.K.

(Received: June 25, 1987)
(Revised: October 25, 1988)

Keywords: Concept learning, rule induction, noise, comprehensibility

Abstract. Systems for inducing concept descriptions from examples are valuable tools for assisting in the task of knowledge acquisition for expert systems. This paper presents a description and empirical evaluation of a new induction system, CN2, designed for the efficient induction of simple, comprehensible production rules in domains where problems of poor description language and/or noise may be present. Implementations of the CN2, ID3, and AQ algorithms are compared on three medical classification tasks.

1. Introduction

In the task of constructing expert systems, methods for inducing concept descriptions from examples have proved useful in easing the bottleneck of knowledge acquisition (Mowforth, 1986). Two families of systems, based on the ID3 (Quinlan, 1983) and AQ (Michalski, 1969) algorithms, have been especially successful. These basic algorithms assume no noise in the domain, searching for a concept description that classifies training data perfectly. However, application to real-world domains requires methods for handling noisy data. In particular, one needs mechanisms that do not overfit the induced concept description to the data, and this requires relaxing the constraint that the induced description must classify the training data perfectly.

Fortunately, the ID3 algorithm lends itself to such modification by the nature of its general-to-specific search. Tree-pruning techniques (e.g., Quinlan, 1987a; Niblett, 1987), used for example in C4 (Quinlan, Compton, Horn, & Lazarus, 1987) and ASSISTANT (Kononenko, Bratko, & Roskar, 1984), have proved effective against overfitting. However, the AQ algorithm's dependence on specific training examples during search makes it less easy to modify. Existing implementations, such as AQ11 (Michalski & Larson, 1983) and AQ15 (Michalski, Mozetic, Hong, & Lavrac, 1986), leave the basic AQ algorithm intact and handle noise with pre-processing and post-processing techniques. Our objective in designing CN2 was to modify the AQ algorithm itself in ways that removed this dependence on specific examples and increased the space of rules searched.
This lets one apply statistical techniques, analogous to those used for tree pruning, in the generation of if-then rules, leading to a simpler induction algorithm.

One can identify several requirements that learning systems should meet if they are to prove useful in a variety of real-world situations:

• Accurate classification. The induced rules should be able to classify new examples accurately, even in the presence of noise.

• Simple rules. For the sake of comprehensibility, the induced rules should be as short as possible. However, when noise is present, overfitting can lead to long rules. Thus, to induce short rules, one must usually relax the requirement that the induced rules be consistent with all the training data. The choice of how much to relax this requirement involves a trade-off between accuracy and simplicity (Iba, Wogulis, & Langley, 1988).

• Efficient rule generation. If one expects to use large example sets, it is important that the algorithm scales up to complex situations. In practice, the time taken for rule generation should be linear in the size of the example set.

With these requirements in mind, this paper presents a description and empirical evaluation of CN2, a new induction algorithm. This system combines the efficiency and ability to cope with noisy data of ID3 with the if-then rule form and flexible search strategy of the AQ family. The representation for rules output by CN2 is an ordered set of if-then rules, also known as a decision list (Rivest, 1987). CN2 uses a heuristic function to terminate search during rule construction, based on an estimate of the noise present in the data. This results in rules that may not classify all the training examples correctly, but that perform well on new data.

In the following section we describe CN2 and three other systems used for our comparative study. In Section 3 we consider the time complexity of the various algorithms and in Section 4 we compare their performance on three medical diagnosis tasks. We also compare the performance of ASSISTANT and CN2 on two synthetic tasks. In Section 5 we discuss the significance of our results, and we follow this with some suggestions for future work in Section 6.

2. CN2 and related algorithms

CN2 incorporates ideas from both Michalski's (1969) AQ and Quinlan's (1983) ID3 algorithms. Thus we begin by describing Kononenko et al.'s (1984) ASSISTANT, a variant of ID3, and AQR, the authors' reconstruction of Michalski's method. After this, we present CN2 and discuss its relationship to these systems. We also describe a simple Bayesian classifier, which provides a reference for the performance of the other algorithms. We characterize each system along three dimensions:

• the representation language for the induced knowledge;
• the performance engine for executing the rules; and
• the learning algorithm and its associated search heuristics.
In all of our experiments, the example description language consisted of attributes, attribute values, and user-specified classes. This language was the same for each algorithm.

2.1 ASSISTANT

The ASSISTANT algorithm (Kononenko et al., 1984) is a descendant of Quinlan's ID3 (1983), and incorporates a tree-pruning mechanism for handling noisy data.

2.1.1 Concept description and interpretation in ASSISTANT

ASSISTANT represents acquired knowledge in the form of decision trees. An internal node of a tree specifies a test of an attribute, with each outgoing branch corresponding to a possible result of this test. Leaf nodes represent the classification to be assigned to an example.

To classify a new example, a path from the root of the decision tree to a leaf node is traced. At each internal node reached, one follows the branch corresponding to the value of the attribute tested at that node. The class at the leaf node represents the class prediction for that example.

2.1.2 The ASSISTANT learning algorithm

ASSISTANT induces a decision tree by repeatedly specializing leaf nodes of an initially single-node tree. The specialization operation involves replacing a leaf node with an attribute test, and adding new leaves to that node corresponding to the possible results of that test. Heuristics determine the attribute on which to test and when to stop specialization. Table 1 summarizes this algorithm.

Table 1. The core of the ASSISTANT algorithm.

  Let E be a set of classified examples.
  Let A be a set of attributes for describing examples.
  Let TE(E) be a termination criterion.
  Let IDM(Ai, E) be an evaluation function, where Ai ∈ A.

  Procedure ASSISTANT(E)
    If E satisfies the termination criterion TE(E),
      Then return a leaf node for TREE, labelled with the most
        common class of examples in E.
    Else let Abest ∈ A be the attribute with the largest value
        of the function IDM(Abest, E).
      For each value Vj of attribute Abest,
        Generate subtrees using ASSISTANT(Ej), where Ej are those
          examples in E with value Vj for attribute Abest.
      Return a node labelled as a test on attribute Abest with
        these subtrees attached.

2.1.3 Heuristic functions in ASSISTANT

ASSISTANT uses an entropy measure to guide the growth of the decision tree, as described by Quinlan (1983). This corresponds to the function IDM in Table 1. In addition, the algorithm can apply a tree-cutoff method based on an estimate of maximal classification precision. This technique estimates whether additional branching would reduce classification accuracy and, if so, terminates search (there are no user-controlled parameters in this calculation). This cutoff criterion corresponds to the function TE in Table 1. If ASSISTANT is to generate an 'unpruned' tree, the termination criterion TE(E) is satisfied if all the examples in E have the same class value.
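To make the control structure in Table 1 concrete, the following is a minimal Python sketch of this style of recursive tree induction and of the classification procedure of Section 2.1.1. It uses class entropy as the evaluation function IDM and 'all remaining examples share a class' as the unpruned termination criterion TE; the data representation (examples as dictionaries of attribute values with a 'class' key) and all function names are our own illustration, not part of the original system.

```python
import math
from collections import Counter

def entropy(examples):
    """Class entropy of a set of examples (each example is a dict with a 'class' key)."""
    counts = Counter(e['class'] for e in examples)
    total = len(examples)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def information_gain(attribute, examples):
    """Entropy reduction obtained by splitting the examples on the given attribute."""
    total = len(examples)
    remainder = 0.0
    for value in set(e[attribute] for e in examples):
        subset = [e for e in examples if e[attribute] == value]
        remainder += len(subset) / total * entropy(subset)
    return entropy(examples) - remainder

def grow_tree(examples, attributes):
    """Recursive tree growth in the style of Table 1 (unpruned variant)."""
    classes = Counter(e['class'] for e in examples)
    majority = classes.most_common(1)[0][0]
    if len(classes) == 1 or not attributes:          # termination criterion TE(E)
        return {'leaf': majority}
    best = max(attributes, key=lambda a: information_gain(a, examples))   # IDM
    branches = {}
    for value in set(e[best] for e in examples):
        subset = [e for e in examples if e[best] == value]
        branches[value] = grow_tree(subset, [a for a in attributes if a != best])
    return {'test': best, 'branches': branches, 'default': majority}

def classify(tree, example):
    """Trace a path from the root to a leaf, as described in Section 2.1.1."""
    while 'leaf' not in tree:
        subtree = tree['branches'].get(example[tree['test']])
        if subtree is None:                           # unseen value: fall back to majority class
            return tree['default']
        tree = subtree
    return tree['leaf']
```

With weather-style examples such as {'Cloudy': 'yes', 'Temperature': 'high', 'class': 'wet'}, calling grow_tree(data, ['Cloudy', 'Temperature']) returns a nested dictionary that classify then traces exactly as described above; ASSISTANT's precision-based cutoff would replace the simple termination test in grow_tree.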
2.2 AQR

AQR is an induction system that uses the basic AQ algorithm (Michalski, 1969) to generate a set of classification rules. Many systems use this algorithm in a more sophisticated manner than AQR to improve predictive accuracy and rule simplicity; e.g., AQ11 (Michalski & Larson, 1983) uses a more complex method of rule interpretation that involves degrees of confirmation. AQR is a reconstruction of a straightforward AQ-based system.

2.2.1 Concept description and interpretation in AQR

AQR induces a set of decision rules, one for each class. Each rule is of the form 'if <cover> then predict <class>', where <cover> is a Boolean combination of attribute tests as we now describe. The basic test on an attribute is called a selector. For instance, (Cloudy = yes), (Weather = wet ∨ stormy), and (Temperature > 60) are all selectors. AQR allows tests in the set {=, <, >, ≠}. A conjunction of selectors is called a complex, and a disjunct of complexes is called a cover. We say that an expression (a selector, complex, or cover) covers an example if the expression is true of the example. Thus, the empty complex (a conjunct of zero attribute tests) covers all examples and the empty cover (a disjunct of zero complexes) covers no examples. A cover is stored along with an associated class value, representing the most common class of training examples that it covers.

In AQR, a new example is classified by finding which of the induced rules have their conditions satisfied by the example. If the example satisfies only one rule, then one assigns the class predicted by that rule to the example. If the example satisfies more than one rule, then one predicts the most common class of training examples that were covered by those rules. If the example is not covered by any rule, then it is assigned by default to the class that occurred most frequently in the training examples.
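As a concrete reading of these definitions, here is a minimal Python sketch that represents selectors, complexes, and covers as plain data and tests whether they cover an example. The tuple encoding and function names are our own, purely for exposition.

```python
# A selector is (attribute, test, value); a complex is a list of selectors
# (a conjunction); a cover is a list of complexes (a disjunction).

def selector_covers(selector, example):
    attribute, test, value = selector
    x = example[attribute]
    if test == '=':                      # equality against a set of allowed values
        return x in value
    if test == '>':
        return x > value
    if test == '<':
        return x < value
    if test == '!=':
        return x not in value
    raise ValueError(f'unknown test {test}')

def complex_covers(cpx, example):
    """The empty complex covers every example."""
    return all(selector_covers(s, example) for s in cpx)

def cover_covers(cover, example):
    """The empty cover covers no examples."""
    return any(complex_covers(c, example) for c in cover)

# Example: (Cloudy = yes) AND (Temperature > 60)
cpx = [('Cloudy', '=', {'yes'}), ('Temperature', '>', 60)]
print(complex_covers(cpx, {'Cloudy': 'yes', 'Temperature': 72}))   # True
print(complex_covers(cpx, {'Cloudy': 'no', 'Temperature': 72}))    # False
```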
2.2.2 The AQR learning algorithm

The AQ rule-generation algorithm has been described elsewhere (Michalski & Larson, 1983; Michalski & Chilausky, 1980; O'Rorke, 1982), and the AQR system is an instance of this general algorithm. AQR generates a decision rule for each class in turn. Having chosen a class on which to focus, it forms a disjunct of complexes (the cover) to serve as the condition of the rule for that class. This process occurs in stages; each stage generates a single complex, and then removes the examples it covers from the training set. This step is repeated until enough complexes have been found to cover all the examples of the chosen class. The entire process is repeated for each class in turn. Table 2 summarizes the AQR algorithm.

Table 2. The AQR algorithm for generating a class cover.

  Let POS be a set of positive examples of class C.
  Let NEG be a set of negative examples of class C.

  Procedure AQR(POS, NEG)
    Let COVER be the empty cover.
    While COVER does not cover all examples in POS,
      Select a SEED (a positive example not covered by COVER).
      Let STAR be STAR(SEED, NEG) (a set of complexes that cover
        SEED but that cover no examples in NEG).
      Let BEST be the best complex in STAR according to
        user-defined criteria.
      Add BEST as an extra disjunct to COVER.
    Return COVER.

  Procedure STAR(SEED, NEG)
    Let STAR be the set containing the empty complex.
    While any complex in STAR covers some negative examples in NEG,
      Select a negative example Eneg covered by a complex in STAR.
      Specialize complexes in STAR to exclude Eneg by:
        Let EXTENSION be all selectors that cover SEED but not Eneg.
        Let STAR be the set {x ∧ y | x ∈ STAR, y ∈ EXTENSION}.
        Remove all complexes in STAR subsumed by other complexes.
      Repeat until size of STAR < maxstar (a user-defined maximum):
        Remove the worst complex from STAR.
    Return STAR.

2.2.3 Heuristic functions in AQR

The particular heuristic functions used by the AQ algorithm are implementation dependent. The heuristic used by AQR to choose the best complex is "maximize the number of positive examples covered." The heuristic used to trim the partial star during generation of a complex is "maximize the sum of positive examples covered and negative examples excluded." In the case of a tie for either heuristic, the system prefers complexes with fewer selectors.

Seeds are chosen at random and negative examples nearest to the seed are picked first, where distance is the number of attributes with different values in the seed and negative example. In the case of contradictions (i.e., if the seed and negative example have identical attribute values), the negative example is ignored and a different one is chosen, since the complex cannot be specialized to exclude it but still include the seed.
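To connect Table 2 with these heuristics, the following Python sketch shows the covering loop, reusing the complex_covers helper from the earlier sketch. It is a deliberately simplified illustration, not a faithful reimplementation of AQR: only equality selectors are generated, seeds are taken in order rather than at random, and the star is trimmed by negatives excluded alone.

```python
def star_for_seed(seed, negatives, max_star=5):
    """Grow complexes that cover the seed and exclude all negatives (cf. procedure STAR)."""
    star = [[]]                                   # start from the empty complex
    while any(any(complex_covers(c, n) for n in negatives) for c in star):
        # pick a still-covered negative example and specialize against it
        eneg = next(n for n in negatives if any(complex_covers(c, n) for c in star))
        extension = [(a, '=', {seed[a]}) for a in seed
                     if a != 'class' and seed[a] != eneg.get(a)]
        star = [c + [s] for c in star for s in extension if s not in c]
        # trim the partial star, keeping complexes that exclude the most negatives
        star.sort(key=lambda c: sum(not complex_covers(c, n) for n in negatives),
                  reverse=True)
        star = star[:max_star]
        if not star:      # no usable specialization (e.g., the contradiction case above)
            break
    return star

def aqr_cover(positives, negatives):
    """Build a cover for one class (cf. procedure AQR)."""
    cover = []
    uncovered = list(positives)
    while uncovered:
        seed = uncovered[0]
        star = star_for_seed(seed, negatives)
        if not star:
            break
        # best complex: maximize the number of positive examples covered
        best = max(star, key=lambda c: sum(complex_covers(c, p) for p in positives))
        cover.append(best)
        uncovered = [p for p in uncovered if not complex_covers(best, p)]
    return cover
```

Each pass through aqr_cover adds one disjunct to the cover and removes the positives it covers, mirroring the stage-by-stage construction described above.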
2.3 The CN2 algorithm

Now that we have reviewed ASSISTANT and AQR, we can turn to CN2, a new algorithm that combines aspects of both methods. We begin by describing how the general approach arises naturally from consideration of the decision-tree and AQ algorithms, and then consider its details.

2.3.1 Relation to ID3 and AQ

ID3 can be easily adapted to handle noisy data by virtue of its top-down approach to tree generation. During induction, all possible attribute tests are considered when 'growing' a leaf node in the tree, and entropy is used to select the best one to place at that node. Overfitting of decision trees can thus be avoided by halting tree growth when no more significant information can be gained. We wish to apply a similar method to the induction of if-then rules.

The AQ algorithm, when generating a complex, also performs a general-to-specific search for the best complex. However, the method only considers specializations that exclude some particular covered negative example from the complex while ensuring some particular 'seed' positive example remains covered, iterating until all negative examples are excluded. As a result, AQ searches only the space of complexes that are completely consistent with the training data. The basic algorithm employs a beam search, which can be viewed as several hill-climbing searches in parallel.

For the CN2 algorithm, we have retained the beam search method of the AQ algorithm but removed its dependence on specific examples during search and extended its search space to include rules that do not perform perfectly on the training data. This is achieved by broadening the specialization process to examine all specializations of a complex, in much the same way that ID3 considers all attribute tests when growing a node in the tree. Indeed, with a beam width of one the CN2 algorithm behaves equivalently to ID3 growing a single tree branch. This top-down search for complexes lets one apply a cutoff method similar to decision-tree pruning to halt specialization when no further specializations are statistically significant.

Finally, we note that CN2 produces an ordered list of if-then rules, rather than an unordered set like that generated by AQ-based systems. Both representations have their respective advantages and disadvantages for comprehensibility. Order-independent rules require some additional mechanism to resolve any rule conflicts that may occur, thus detracting from a strict logical interpretation of the rules. Ordered rules also sacrifice some comprehensibility, in that the interpretation of a single rule is dependent on the other rules that precede it in the list.¹

¹ One can make CN2 produce unordered if-then rules by appropriately changing the evaluation function; e.g., one can use the same evaluation function as AQR, then generate a rule set for each class in turn.
2.3.2 Concept description and interpretation in CN2

Rules induced by CN2 each have the form 'if <complex> then predict <class>', where <complex> has the same definition as for AQR, namely a conjunction of attribute tests. This ordered rule representation is a version of what Rivest (1987) has termed decision lists. The last rule in CN2's list is a 'default rule', which simply predicts the most commonly occurring class in the training data for all new examples.

To use the induced rules to classify new examples, CN2 tries each rule in order until one is found whose conditions are satisfied by the example being classified. The class prediction of this rule is then assigned as the class of the example. Thus, the ordering of the rules is important. If no induced rules are satisfied, the final default rule assigns the most common class to the new example.

2.3.3 The CN2 learning algorithm

Table 3 presents a summary of the CN2 algorithm. This works in an iterative fashion, each iteration searching for a complex that covers a large number of examples of a single class C and few of other classes. The complex must be both predictive and reliable, as determined by CN2's evaluation functions. Having found a good complex, the algorithm removes those examples it covers from the training set and adds the rule 'if <complex> then predict C' to the end of the rule list. This process iterates until no more satisfactory complexes can be found.

The system searches for complexes by carrying out a pruned general-to-specific search. At each stage in the search, CN2 retains a size-limited set or star S of 'best complexes found so far'. The system examines only specializations of this set, carrying out a beam search of the space of complexes. A complex is specialized by either adding a new conjunctive term or removing a disjunctive element in one of its selectors. Each complex can be specialized in several ways, and CN2 generates and evaluates all such specializations. The star is trimmed after completion of this step by removing its lowest ranking elements as measured by an evaluation function that we will describe shortly. Our implementation of the specialization step is to repeatedly intersect² the set of all possible selectors with the current star, eliminating all the null and unchanged elements in the resulting set of complexes. (A null complex is one that contains a pair of incompatible selectors, e.g., big = y ∧ big = n.)

CN2 deals with continuous attributes in a manner similar to ASSISTANT, by dividing the range of values of each attribute into discrete subranges. Tests on such attributes examine whether a value is greater or less than (or equal to) the values at subrange boundaries. The complete range of values and the size of each subrange are provided by the user.

² The intersection of set A with set B is the set {x ∧ y | x ∈ A, y ∈ B}. For example, {a∧b, a∧c, b∧d} intersected with {a, b, c, d} is {a∧b, a∧b∧c, a∧b∧d, a∧c, a∧c∧d, b∧d, b∧c∧d}. If we now remove unchanged elements in this set, we obtain {a∧b∧c, a∧b∧d, a∧c∧d, b∧c∧d}.
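Before turning to the algorithm in Table 3, here is a minimal Python sketch of the decision-list execution just described in Section 2.3.2, reusing the complex_covers helper from the earlier AQR sketch. The rule encoding (a list of (complex, class) pairs followed by a default class) and the example rules are our own illustration.

```python
def classify_with_rule_list(rule_list, default_class, example):
    """Try each (complex, class) rule in order; fall back to the default rule."""
    for cpx, predicted_class in rule_list:
        if complex_covers(cpx, example):
            return predicted_class
    return default_class

# A tiny hypothetical decision list; because rules are tried in order,
# the more specific rule must precede the more general one.
rules = [
    ([('Cloudy', '=', {'yes'}), ('Temperature', '<', 60)], 'wet'),
    ([('Cloudy', '=', {'yes'})], 'overcast'),
]
print(classify_with_rule_list(rules, 'dry', {'Cloudy': 'yes', 'Temperature': 50}))  # 'wet'
print(classify_with_rule_list(rules, 'dry', {'Cloudy': 'no', 'Temperature': 80}))   # 'dry'
```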
Table 3. The CN2 induction algorithm.

  Let E be a set of classified examples.
  Let SELECTORS be the set of all possible selectors.

  Procedure CN2(E)
    Let RULE_LIST be the empty list.
    Repeat until BEST_CPX is nil or E is empty:
      Let BEST_CPX be Find_Best_Complex(E).
      If BEST_CPX is not nil,
        Then let E' be the examples covered by BEST_CPX.
          Remove from E the examples E' covered by BEST_CPX.
          Let C be the most common class of examples in E'.
          Add the rule 'If BEST_CPX then the class is C'
            to the end of RULE_LIST.
    Return RULE_LIST.

  Procedure Find_Best_Complex(E)
    Let STAR be the set containing the empty complex.
    Let BEST_CPX be nil.
    While STAR is not empty,
      Specialize all complexes in STAR as follows:
        Let NEWSTAR be the set {x ∧ y | x ∈ STAR, y ∈ SELECTORS}.
        Remove all complexes in NEWSTAR that are either in STAR
          (i.e., the unspecialized ones) or null (e.g., big = y ∧ big = n).
      For every complex Ci in NEWSTAR:
        If Ci is statistically significant and better than BEST_CPX
          by user-defined criteria when tested on E,
        Then replace the current value of BEST_CPX by Ci.
      Repeat until size of NEWSTAR < user-defined maximum:
        Remove the worst complex from NEWSTAR.
      Let STAR be NEWSTAR.
    Return BEST_CPX.

For dealing with unknown attribute values, CN2 uses the simple method of replacing unknown values with the most commonly occurring value for that attribute in the training data. In the case of numeric attributes, it uses the midvalue of the most commonly occurring subrange.

2.3.4 Heuristics in CN2

The CN2 algorithm must make two heuristic decisions during the learning process, and it employs two evaluation functions to aid in these decisions. First it must assess the quality of complexes, determining if a new complex should replace the 'best complex' found so far and also which complexes in the star S to discard if the maximum size is exceeded. Computing this involves first finding the set E' of examples which a complex covers (i.e., which satisfy all of