logo资料库

层级分类综述.pdf

第1页 / 共42页
第2页 / 共42页
第3页 / 共42页
第4页 / 共42页
第5页 / 共42页
第6页 / 共42页
第7页 / 共42页
第8页 / 共42页
资料共42页,剩余部分请下载后查看
A survey of hierarchical classification across different application domains
Abstract
1 Introduction
2 What is hierarchical classification?
3 Flat classification approach
4 Local classifier approaches
4.1 Local classifier per node approach
4.2 Local classifier per parent node approach
4.3 Local classifier per level approach
4.4 Non-mandatory leaf node prediction and the blocking problem
5 Global classifier (or big-bang) approach
6 A unifying framework for hierarchical classification
6.1 Categorization of the different types of hierarchical classification problems
6.2 Categorization of different types of hierarchical classification algorithms
7 Conceptual and empirical comparison between different hierarchical classification approaches
8 Major applications of hierarchical classification
8.1 Text categorization
8.2 Protein function prediction
8.3 Music genre classification
8.4 Other applications
9 Concluding remarks
Acknowledgements
References
Data Min Knowl Disc (2011) 22:31–72 DOI 10.1007/s10618-010-0175-9 A survey of hierarchical classification across different application domains Carlos N. Silla Jr. · Alex A. Freitas Received: 24 February 2009 / Accepted: 11 March 2010 / Published online: 7 April 2010 © The Author(s) 2010 Abstract In this survey we discuss the task of hierarchical classification. The literature about this field is scattered across very different application domains and for that reason research in one domain is often done unaware of methods developed in other domains. We define what is the task of hierarchical classification and discuss why some related tasks should not be considered hierarchical classification. We also present a new perspective about some existing hierarchical classification approaches, and based on that perspective we propose a new unifying framework to classify the existing approaches. We also present a review of empirical comparisons of the existing methods reported in the literature as well as a conceptual comparison of those methods at a high level of abstraction, discussing their advantages and disadvantages. Keywords Hierarchical classification · Tree-structured class hierarchies · DAG-structured class hierarchies 1 Introduction A very large amount of research in the data mining, machine learning, statistical pattern recognition and related research communities has focused on flat classification prob- lems. By flat classification problem we are referring to standard binary or multi-class classification problems. On the other hand, many important real-world classification problems are naturally cast as hierarchical classification problems, where the clas- ses to be predicted are organized into a class hierarchy—typically a tree or a DAG C. N. Silla Jr. (B) · A. A. Freitas School of Computing, University of Kent, Canterbury, UK e-mail: cns2@kent.ac.uk A. A. Freitas e-mail: A.A.Freitas@kent.ac.uk 123
32 C. N. Silla Jr., A. A. Freitas (Direct Acyclic Graph). The task of hierarchical classification, however, needs to be better defined, as it can be overlooked or confused with other tasks, which are often wrongly referred to by the same name. Moreover, the existing literature that deals with hierarchical classification problems is usually scattered across different applica- tion domains which are not strongly connected with each other. As a result, researchers in one application domain are often unaware of methods developed by researchers in another domain. Also, there seems to be no standards on how to evaluate hierarchical classification systems or even how to setup the experiments in a standard way. The contributions of this paper are: – To clarify what the task of hierarchical classification is, and what it is not. – To propose a unifying framework to classify existing and novel hierarchical classi- fication methods, as well as different types of hierarchical classification problems. – To perform a cross-domain critical survey, in order to create a taxonomy of hierar- chical classification systems, by identifying important similarities and differences between the different approaches, which are currently scattered across different application domains. – To suggest some experimental protocols to be undertaken when performing hierarchical classification experiments, in order to have a better understanding of the results. For instance, many authors claim that some hierarchical classifica- tion methods are better than others, but they often use standard flat classification evaluation measures instead of using hierarchical evaluation measures. Also, in some cases, it is possible to overlook what would be interesting to compare, and authors often compare their hierarchical classification methods only against flat classification methods, although the use of a baseline hierarchical method is not hard to implement and would offer a more interesting experimental comparison. This survey seems timely as different fields of research are more and more using an automated approach to deal with hierarchical information, as hierarchies (or taxono- mies) are a good way to help organize vast amounts of information. The first issue that will be discussed in this paper (Sect. 2) is precisely the definition of the hierarchical classification task. After clearly defining the task, we classify the existing approaches in the literature according to three different broad types of approach, based on the underlying methods. These approaches can be classified as: flat, i.e., ignoring the class hierarchy (Sect. 3); local (Sect. 4) or global (Sect. 5). Based on the new understanding about these approaches we present a unifying framework to classify hierarchical clas- sification methods and problems (Sect. 6). A summary, a conceptual comparison and a review of empirical comparisons reported in the literature about these three approaches is presented in Sect. 7. Section 8 presents some major applications of hierarchical clas- sification methods; and finally in Sect. 9 we present the conclusions of this work. 2 What is hierarchical classification? In order to learn about hierarchical classification, one might start searching for papers with the keywords “hierarchical” and “classification”; however, this might be mis- leading. One of the reasons for this is that, due to the popularity of SVM (Support Vector Machine) methods in the machine learning community (which were originally 123
A survey of hierarchical classification 33 developed for binary classification problems), different researchers have developed different methods to deal with multi-class classification problems. The most common are the One-Against-One and the One-Against-All schemes (Lorena and Carvalho 2004). A less known approach consists of dividing the problem in a hierarchical way where classes which are more similar to one another are grouped together into meta- classes, resulting in a Binary Hierarchical Classifier (BHC) (Kumar et al. 2002). For instance, in Chen et al. (2004) the authors modified the standard SVM, creating what they called a H-SVM (Hierarchical SVM), based on this hierarchical problem decom- position approach. When we consider the use of meta-classes in the pattern recognition field, they are usually manually assigned, like in Koerich and Kalva (2005), where handwritten letters with the same curves in uppercase and lowercase format (e.g. “o” and “O” will be represented by the same meta-class). An automated method for the generation of meta-classes was recently proposed by Freitas et al. (2008). At first glance the use of meta-classes (and their automatic generation) seems to be related to the hierarchical problem decomposition approach, as one can view the use of meta-classes as a two- level hierarchy where leaf classes are grouped together by similarity into intermediate classes (the meta-classes). This issue is interesting and deserves further investigation, but is beyond the scope of this paper. In this paper we take the perspective that this kind of approach is not considered to be a hierarchical classification approach, because it creates new (meta-)classes on the fly, instead of using a pre-established taxonomy. In principle a classification algorithm is not supposed to create new classes, which is related to clustering. In this paper we are interested in approaches that cope with a pre-defined class hierarchy, instead of creating one from the similarity of classes within data (which would lead to higher-level classes that could be meaningless to the user). Let us elabo- rate on this point. There are application domains where the internal (non-leaf) nodes of the class hierarchy can be chosen based on data (usually in the text mining application domain), like in Sasaki and Kita (1998), Punera et al. (2005), Li et al. (2007), Hao et al. (2007), where they build the hierarchy during training by using some sort of hierar- chical clustering method, and then classify new test examples by using a hierarchical approach. However, in other domains, like protein function prediction in bioinformat- ics, just knowing that classes A and B are similar can be misleading, as proteins with similar characteristics (sequences of amino acids) can have very different functions and vice-versa (Gerlt and Babbitt 2000). Therefore, in this work, we are interested only in hierarchical classification (a type of supervised learning). Hierarchical clustering (a type of unsupervised learning) is out of the scope of the paper. Hierarchical classification can also appear under the name of Structured Classifi- cation (Seeger 2008; Astikainen et al. 2008). However, the research field of structured classification involves many different types of problems which are not hierarchical classification problems, e.g. Label Sequence Learning (Altun and Hofmann 2003; Tsochantaridis et al. 2005). Therefore, hierarchical classification can be seen as a par- ticular type of structured classification problem, where the output of the classification algorithm is defined over a class taxonomy; whilst the term structured classification is broader and denotes a classification problem where there is some structure (hierar- chical or not) among the classes. 123
34 C. N. Silla Jr., A. A. Freitas It is important then to define what exactly is a class taxonomy. Wu et al. (2005) have defined a class taxonomy as a tree structured regular concept hierarchy defined over a partially order set (C,≺), where C is a finite set that enumerates all class concepts in the application domain, and the relation ≺ represents the “IS-A” relationship. Wu et al. (2005) define the “IS-A” relationship as both anti-reflexive and transitive. However, we prefer to define the “IS-A” relationship as asymmetric, anti-reflexive and transitive: – The only one greatest element “R” is the root of the tree. – ∀ci , c j ∈ C, i f ci ≺ c j then c j – ∀ci ∈ C, ci ≺ ci . – ∀ci , c j , ck ∈ C, ci ≺ c j and c j ≺ ck imply ci ≺ ck. ≺ ci . This definition, although originally proposed for tree structured class taxonomies, can be used to define DAG structured class taxonomies as well. Ruiz and Srinivasan (2002) give a good example of the asymmetric and transitive relations: The “IS-A” relation is asymmetric (e.g. all dogs are animals, but not all animals are dogs) and tran- sitive (e.g., all pines are evergreens, and all evergreens are trees; therefore all pines are trees). Note that, for the purposes of this survey, any classification problem with a class structure satisfying the aforementioned four properties of the IS-A hierarchy can be considered as a hierarchical classification problem, and in general the hierarchical classification methods surveyed in this work assume (explicitly or implicitly) the underlying class structure satisfies those problems. In the vast majority of works on hierarchical classification, the actual class hierarchy in the underlying problem domain can indeed be called a IS-A hierarchy from a semantical point of view. However, in a few cases the semantics of the underlying class hierarchy might be different, but as long as the aforementioned four properties are satisfied, we would consider the target problem as a hierarchical classification one. For instance, the class taxonomy asso- ciated with cellular localization in the Gene Ontology (an ontology which is briefly discussed in Sect. 8.2) is essentially, from a semantical point of view, a PART-OF class hierarchy, but it still satisfies the four properties of the aforementioned definition of a IS-A hierarchy, so we consider the prediction of cellular location classes according to that class hierarchy as a hierarchical classification problem. Whether the taxonomy is organized into a tree or a DAG influences the degree of difficulty of the underlying hierarchical classification problem. Notably, as it will be seen in Sect. 7, most of the current literature focus on working with trees as it is an easier problem. One of the main contributions of this survey is to organize the existing hierarchical classification approaches into a taxonomy, based on their essen- tial properties, regardless of the application domain. One of the main problems, in order to do this, is to deal with all the different terminology that has already been pro- posed, which is often inconsistent across different works. In order to understand these essential properties, is important to clarify a few aspects of hierarchical classification methods. Let us consider initially two types of conventional classification methods that cannot directly cope with hierarchical classes: binary and multi-class classifiers. First, the main difference between a binary classifier and a multi-class classifier is that the binary classifier can only handle two-class problems, whilst a multi-class clas- 123
A survey of hierarchical classification 35 sifier can handle in principle any number of classes. Secondly, there are multi-class classifiers that can also be multi-label, i.e. the answer from the classifier can be more than one class assigned to a given example. Thirdly, since these types of classifi- ers were not designed to deal with hierarchical classification problems, they will be referred to as flat classification algorithms. Fourthly, in the context of hierarchical classification most approaches could be called multi-label. For instance, considering the hierarchical class structure presented in Fig. 1 (where R denotes the root node), if the output of a classifier is class 2.1.1, it is natural to say that it also belongs to classes 2 and 2.1, therefore having three classes as the output of the classifier. In Tikk et al. (2004) this notion of multi-label is used and they call this a particular type of multi-label classification problem. However, since this definition is trivial, as any hierarchical approach could be considered multi-label in this sense, in this work we will only consider a hierarchical classifier to be hierarchically multi-label if it can assign more than one class at any given level of the hierarchy to a given example. This distinction is particularly important, as a hierarchically multi-label classification algorithm is more challenging to design than a hierarchically single-label one. Also, recall that in hierarchical classification we assume that the relation between a node and its parent in the class hierarchy is a “IS-A” relationship. According to Freitas and de Carvalho (2007) and Sun and Lim (2001) hierarchical classification methods differ in a number of criteria. The first criterion is the type of hierarchical structure used. This structure is based on the problem structure and it typically is either a tree or a DAG. Figure 2 illustrates these two types of structures. Fig. 1 An example of a tree-based hierarchical class structure Fig. 2 A simple example of a tree structure (left) and a DAG structure (right) 123
36 C. N. Silla Jr., A. A. Freitas The main difference between them is that in the DAG a node can have more than one parent node. The second criterion is related to how deep the classification in the hierarchy is performed. That is, the hierarchical classification method can be implemented in a way that will always classify a leaf node [which Freitas and de Carvalho (2007) refer to as mandatory leaf-node prediction (MLNP) and Sun and Lim (2001) refer to as virtual category tree] or the method can consider stopping the classification at any node in any level of the hierarchy [which Freitas and de Carvalho (2007) refer to as non-mandatory leaf node prediction and Sun and Lim (2001) refer to as category tree]. In this paper we will use the term (non-)mandatory leaf node prediction, which can be naturally used for both tree-structured and DAG-structured class taxonomies. The third criterion is related to how the hierarchical structure is explored. The current literature often refers to top-down (or local) classifiers, when the system employs a set of local classifiers; big-bang (or global) classifiers, when a single classi- fier coping with the entire class hierarchy is used; or flat classifiers, which ignore the class relationships, typically predicting only the leaf nodes. However, a closer look at the existing hierarchical classification methods reveals that: 1. The top-down approach is not a full hierarchical classification approach by itself, but rather a method for avoiding or correcting inconsistencies in class prediction at different levels, during the testing (rather than training) phase; 2. There are different ways of using local information to create local classifiers, and although most of them are referred to as top-down in the literature, they are very different during the training phase and slightly different in the test phase; 3. Big-bang (or global) classifiers are trained by considering the entire class hierar- chy at once, and hence they lack the kind of modularity for local training of the classifier that is a core characteristic of the local classifier approach. These are the main points which will be discussed in detail in the next four sections. 3 Flat classification approach The flat classification approach, which is the simplest one to deal with hierarchical classification problems, consists of completely ignoring the class hierarchy, typically predicting only classes at the leaf nodes. This approach behaves like a traditional classi- fication algorithm during training and testing. However, it provides an indirect solution to the problem of hierarchical classification, because, when a leaf class is assigned to an example, one can consider that all its ancestor classes are also implicitly assigned to that instance (recall that we assume a “IS-A” class hierarchy). However, this very simple approach has the serious disadvantage of having to build a classifier to discriminate among a large number of classes (all leaf classes), without exploring information about parent-child class relationships present in the class hierar- chy. Figure 3 illustrates this approach. We use here the term flat classification approach, as it seems to be the most commonly used term in the existing literature, although in Burred and Lerch (2003) the authors refer to this approach as “the direct approach”, while in Xiao et al. (2007) this approach is referred to as a “global classifier”—which 123
A survey of hierarchical classification 37 Fig. 3 Flat classification approach using a flat multi-class classification algorithm to always predict the leaf nodes is misleading as they are referring to this naïve flat classification algorithm, and the term global classifier is often used to refer to the “big-bang” approach (Sect. 5). In Barbedo and Lopes (2007) the authors refer to this approach as a “bottom-up” approach. They justify this term as follows: “The signal is firstly classified according to the basic genres, and the corresponding upper classes are consequences of this first classification (bottom-up approach).” In this paper, however, we prefer to use the term flat classification to be consistent with the majority of the literature. Considering the different types of class taxonomies (tree or DAG), this approach can cope with both of them as long as the problem is a mandatory-leaf node pre- diction problem, as it is incapable of handling non-mandatory leaf node prediction problems. In this approach training and testing proceed in the same way as in standard (non-hierarchical) classification algorithms. 4 Local classifier approaches In the seminal work of Koller and Sahami (1997), the first type of local classifier approach (also known as top-down approach in the literature) was proposed. From this work onwards, many different authors used augmented versions of this approach to deal with hierarchical classification problems. However, the important aspect here is not that the approach is top-down (as it is commonly called), but rather that the hier- archy is taken into account by using a local information perspective. The idea behind this reasoning is that in the literature there are several papers that employ this local information in different ways. These approaches, therefore, can be grouped based on how they use this local information and how they build their classifiers around it. More precisely, there seems to exist three standard ways of using the local information: a local classifier per node (LCN), a local classifier per parent node (LCPN) and a local classifier per level (LCL). In the following subsections we discuss each one of them in detail. Also note that unless specified otherwise, the discussion will assume a single label tree-structured class hierarchy and mandatory leaf node prediction. 123
38 C. N. Silla Jr., A. A. Freitas It should be noted that, although the three types of local hierarchical classification algorithms discussed in the next three sub-sections differ significantly in their training phase, they share a very similar top-down approach in their testing phase. In essence, in this top-down approach, for each new example in the test set, the system first predicts its first-level (most generic) class, then it uses that predicted class to narrow the choices of classes to be predicted at the second level (the only valid candidate second-level classes are the children of the class predicted at the first level), and so on, recursively, until the most specific prediction is made. As a result, a disadvantage of the top-down class-prediction approach (which is shared by all the three types of local classifiers discussed next) is that an error at a certain class level is going to be propagated downwards the hierarchy, unless some procedure for avoiding this problem is used. If the problem is non-mandatory leaf node prediction, a blocking approach (where an example is passed down to the next lower level only if the confidence on the prediction at the current level is greater than a threshold) can avoid that misclassifications are propagated downwards, at the expense of providing the user with less specific (less useful) class predictions. Some authors use methods to give better estimates of class probabilities, like shrinkage (McCallum et al. 1998) and isotonic smoothing (Punera and Ghosh 2008). The issues of non-mandatory leaf node prediction and blocking are discussed in Sect. 4.4. 4.1 Local classifier per node approach This is by far the most used approach in the literature. It often appears under the name of a top-down approach, but as we mentioned earlier, we shall see why this is not a good name as the top-down approach is essentially a method to avoid inconsistencies in class predictions at different levels in the class hierarchy. The LCN approach con- sists of training one binary classifier for each node of the class hierarchy (except the root node). Figure 4 illustrates this approach. Fig. 4 Local classifier per node approach (circles represent classes and dashed squares with round corners represent binary classifiers) 123
分享到:
收藏