Data Min Knowl Disc (2011) 22:31–72
DOI 10.1007/s10618-010-0175-9
A survey of hierarchical classification across different
application domains
Carlos N. Silla Jr. · Alex A. Freitas
Received: 24 February 2009 / Accepted: 11 March 2010 / Published online: 7 April 2010
© The Author(s) 2010
Abstract
In this survey we discuss the task of hierarchical classification. The
literature about this field is scattered across very different application domains and
for that reason research in one domain is often done unaware of methods developed
in other domains. We define what the task of hierarchical classification is and discuss
why some related tasks should not be considered hierarchical classification. We also
present a new perspective on some existing hierarchical classification approaches,
and based on that perspective we propose a new unifying framework to classify the
existing approaches. We also present a review of empirical comparisons of the existing
methods reported in the literature as well as a conceptual comparison of those methods
at a high level of abstraction, discussing their advantages and disadvantages.
Keywords Hierarchical classification · Tree-structured class hierarchies ·
DAG-structured class hierarchies
1 Introduction
A very large amount of research in the data mining, machine learning, statistical pattern
recognition and related research communities has focused on flat classification prob-
lems. By flat classification problem we are referring to standard binary or multi-class
classification problems. On the other hand, many important real-world classification
problems are naturally cast as hierarchical classification problems, where the clas-
ses to be predicted are organized into a class hierarchy—typically a tree or a DAG
C. N. Silla Jr. (B) · A. A. Freitas
School of Computing, University of Kent, Canterbury, UK
e-mail: cns2@kent.ac.uk
A. A. Freitas
e-mail: A.A.Freitas@kent.ac.uk
(Directed Acyclic Graph). The task of hierarchical classification, however, needs to be
better defined, as it can be overlooked or confused with other tasks, which are often
wrongly referred to by the same name. Moreover, the existing literature that deals
with hierarchical classification problems is usually scattered across different applica-
tion domains which are not strongly connected with each other. As a result, researchers
in one application domain are often unaware of methods developed by researchers in
another domain. Also, there seem to be no standards on how to evaluate hierarchical
classification systems, or even on how to set up the experiments in a standard way.
The contributions of this paper are:
– To clarify what the task of hierarchical classification is, and what it is not.
– To propose a unifying framework to classify existing and novel hierarchical classi-
fication methods, as well as different types of hierarchical classification problems.
– To perform a cross-domain critical survey, in order to create a taxonomy of hierar-
chical classification systems, by identifying important similarities and differences
between the different approaches, which are currently scattered across different
application domains.
– To suggest some experimental protocols to be undertaken when performing
hierarchical classification experiments, in order to have a better understanding
of the results. For instance, many authors claim that some hierarchical classifica-
tion methods are better than others, but they often use standard flat classification
evaluation measures instead of using hierarchical evaluation measures. Also, in
some cases, it is possible to overlook what would be interesting to compare, and
authors often compare their hierarchical classification methods only against flat
classification methods, although a baseline hierarchical method is not
hard to implement and would offer a more interesting experimental comparison.
This survey seems timely, as different fields of research increasingly use automated
approaches to deal with hierarchical information, since hierarchies (or taxonomies)
are a good way of organizing vast amounts of information. The first issue that
will be discussed in this paper (Sect. 2) is precisely the definition of the hierarchical
classification task. After clearly defining the task, we classify the existing approaches
in the literature according to three different broad types of approach, based on the
underlying methods. These approaches can be classified as: flat, i.e., ignoring the class
hierarchy (Sect. 3); local (Sect. 4) or global (Sect. 5). Based on the new understanding
about these approaches we present a unifying framework to classify hierarchical clas-
sification methods and problems (Sect. 6). A summary, a conceptual comparison and a
review of empirical comparisons of these three approaches reported in the literature
are presented in Sect. 7. Section 8 presents some major applications of hierarchical clas-
sification methods; and finally in Sect. 9 we present the conclusions of this work.
2 What is hierarchical classification?
In order to learn about hierarchical classification, one might start searching for papers
with the keywords “hierarchical” and “classification”; however, this might be mis-
leading. One of the reasons for this is that, due to the popularity of SVM (Support
Vector Machine) methods in the machine learning community (which were originally
developed for binary classification problems), different researchers have developed
different methods to deal with multi-class classification problems. The most common
are the One-Against-One and the One-Against-All schemes (Lorena and Carvalho
2004). A lesser-known approach consists of dividing the problem in a hierarchical way,
where classes that are more similar to one another are grouped together into meta-
classes, resulting in a Binary Hierarchical Classifier (BHC) (Kumar et al. 2002). For
instance, in Chen et al. (2004) the authors modified the standard SVM, creating what
they called a H-SVM (Hierarchical SVM), based on this hierarchical problem decom-
position approach.
instance, in Chen et al. (2004) the authors modified the standard SVM, creating what
they called a H-SVM (Hierarchical SVM), based on this hierarchical problem decom-
position approach.
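For concreteness, the following minimal sketch (our own illustration, not taken from the works cited above) shows the One-Against-One and One-Against-All decompositions using scikit-learn's built-in wrappers around a binary SVM; the dataset and parameters are arbitrary and purely illustrative.

# Illustrative sketch: One-Against-One vs. One-Against-All (one-vs-rest)
# decompositions of a flat multi-class problem into binary SVMs.
# Assumes scikit-learn is installed; dataset and parameters are arbitrary.
from sklearn.datasets import load_iris
from sklearn.multiclass import OneVsOneClassifier, OneVsRestClassifier
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# One-Against-One: one binary SVM per pair of classes.
ovo = OneVsOneClassifier(SVC(kernel="linear")).fit(X, y)

# One-Against-All: one binary SVM per class (that class vs. all others).
ova = OneVsRestClassifier(SVC(kernel="linear")).fit(X, y)

print(ovo.predict(X[:5]), ova.predict(X[:5]))

Note that neither scheme involves a class hierarchy; they merely decompose a flat multi-class problem into binary sub-problems.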
When we consider the use of meta-classes in the pattern recognition field, they
are usually manually assigned, as in Koerich and Kalva (2005), where handwritten
letters with the same curves in uppercase and lowercase format (e.g. "o" and "O")
are represented by the same meta-class. An automated method for the generation of
meta-classes was recently proposed by Freitas et al. (2008). At first glance the use of
meta-classes (and their automatic generation) seems to be related to the hierarchical
problem decomposition approach, as one can view the use of meta-classes as a two-
level hierarchy where leaf classes are grouped together by similarity into intermediate
classes (the meta-classes). This issue is interesting and deserves further investigation,
but is beyond the scope of this paper. In this paper we take the perspective that this
kind of approach is not considered to be a hierarchical classification approach, because
it creates new (meta-)classes on the fly, instead of using a pre-established taxonomy.
In principle a classification algorithm is not supposed to create new classes, a task
which is more closely related to clustering.
In this paper we are interested in approaches that cope with a pre-defined class
hierarchy, instead of creating one from the similarity of classes within data (which
would lead to higher-level classes that could be meaningless to the user). Let us elabo-
rate on this point. There are application domains where the internal (non-leaf) nodes of
the class hierarchy can be chosen based on data (usually in the text mining application
domain), as in Sasaki and Kita (1998), Punera et al. (2005), Li et al. (2007), Hao et al.
(2007), where they build the hierarchy during training by using some sort of hierar-
chical clustering method, and then classify new test examples by using a hierarchical
approach. However, in other domains, like protein function prediction in bioinformat-
ics, just knowing that classes A and B are similar can be misleading, as proteins with
similar characteristics (sequences of amino acids) can have very different functions
and vice-versa (Gerlt and Babbitt 2000). Therefore, in this work, we are interested only
in hierarchical classification (a type of supervised learning). Hierarchical clustering
(a type of unsupervised learning) is out of the scope of the paper.
Hierarchical classification can also appear under the name of Structured Classifi-
cation (Seeger 2008; Astikainen et al. 2008). However, the research field of structured
classification involves many different types of problems which are not hierarchical
classification problems, e.g. Label Sequence Learning (Altun and Hofmann 2003;
Tsochantaridis et al. 2005). Therefore, hierarchical classification can be seen as a par-
ticular type of structured classification problem, where the output of the classification
algorithm is defined over a class taxonomy; whilst the term structured classification
is broader and denotes a classification problem where there is some structure (hierar-
chical or not) among the classes.
It is important, then, to define exactly what a class taxonomy is. Wu et al. (2005) have
defined a class taxonomy as a tree-structured regular concept hierarchy defined over a
partially ordered set (C, ≺), where C is a finite set that enumerates all class concepts in
the application domain, and the relation ≺ represents the “IS-A” relationship. Wu et al.
(2005) define the “IS-A” relationship as both anti-reflexive and transitive. However,
we prefer to define the “IS-A” relationship as asymmetric, anti-reflexive and transitive:
– The only one greatest element "R" is the root of the tree.
– ∀ ci, cj ∈ C, if ci ≺ cj then cj ⊀ ci (asymmetry).
– ∀ ci ∈ C, ci ⊀ ci (anti-reflexivity).
– ∀ ci, cj, ck ∈ C, ci ≺ cj and cj ≺ ck imply ci ≺ ck (transitivity).
This definition, although originally proposed for tree-structured class taxonomies,
can be used to define DAG-structured class taxonomies as well. Ruiz and Srinivasan
(2002) give a good example of the asymmetric and transitive relations: The “IS-A”
relation is asymmetric (e.g. all dogs are animals, but not all animals are dogs) and tran-
sitive (e.g., all pines are evergreens, and all evergreens are trees; therefore all pines
are trees).
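To make this definition concrete, the sketch below (our own illustration, with a hypothetical set of classes inspired by the example above) encodes a class taxonomy by parent links and implements the "≺" (IS-A) relation as the transitive closure of those links; since a node may list more than one parent, the same encoding covers both trees and DAGs.

# Illustrative sketch: a class taxonomy encoded by parent links, with the
# IS-A relation (written '≺' in the text) obtained by transitive closure.
# The class names are hypothetical; "R" denotes the root.
parents = {
    "animal": ["R"],
    "dog": ["animal"],
    "plant": ["R"],
    "evergreen": ["plant"],
    "pine": ["evergreen"],
}

def is_a(ci, cj):
    """True if ci ≺ cj, i.e. ci is a proper descendant of cj."""
    frontier = list(parents.get(ci, []))
    while frontier:
        p = frontier.pop()
        if p == cj:
            return True
        frontier.extend(parents.get(p, []))
    return False

assert is_a("dog", "animal") and not is_a("animal", "dog")  # asymmetry
assert not is_a("dog", "dog")                               # anti-reflexivity
assert is_a("pine", "plant")                                # transitivity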
Note that, for the purposes of this survey, any classification problem with a class
structure satisfying the aforementioned four properties of the IS-A hierarchy can be
considered as a hierarchical classification problem, and in general the hierarchical
classification methods surveyed in this work assume (explicitly or implicitly) that the
underlying class structure satisfies those properties. In the vast majority of works on
hierarchical classification, the actual class hierarchy in the underlying problem domain
can indeed be called an IS-A hierarchy from a semantic point of view. However, in
a few cases the semantics of the underlying class hierarchy might be different, but as
long as the aforementioned four properties are satisfied, we would consider the target
problem as a hierarchical classification one. For instance, the class taxonomy asso-
ciated with cellular localization in the Gene Ontology (an ontology which is briefly
discussed in Sect. 8.2) is essentially, from a semantic point of view, a PART-OF class
hierarchy, but it still satisfies the four properties of the aforementioned definition of an
IS-A hierarchy, so we consider the prediction of cellular location classes according to
that class hierarchy as a hierarchical classification problem.
Whether the taxonomy is organized into a tree or a DAG influences the degree
of difficulty of the underlying hierarchical classification problem. Notably, as will be
seen in Sect. 7, most of the current literature focuses on working with trees, as they pose
an easier problem. One of the main contributions of this survey is to organize the
existing hierarchical classification approaches into a taxonomy, based on their essen-
tial properties, regardless of the application domain. One of the main challenges in
doing so is dealing with all the different terminology that has already been pro-
posed, which is often inconsistent across different works. In order to understand these
essential properties, it is important to clarify a few aspects of hierarchical classification
methods.
Let us consider initially two types of conventional classification methods that
cannot directly cope with hierarchical classes: binary and multi-class classifiers. First,
the main difference between a binary classifier and a multi-class classifier is that
the binary classifier can only handle two-class problems, whilst a multi-class clas-
sifier can handle in principle any number of classes. Secondly, there are multi-class
classifiers that can also be multi-label, i.e. the classifier can assign more than one
class to a given example. Thirdly, since these types of classifi-
ers were not designed to deal with hierarchical classification problems, they will be
referred to as flat classification algorithms. Fourthly, in the context of hierarchical
classification most approaches could be called multi-label. For instance, considering
the hierarchical class structure presented in Fig. 1 (where R denotes the root node),
if the output of a classifier is class 2.1.1, it is natural to say that it also belongs to
classes 2 and 2.1, therefore having three classes as the output of the classifier. In
Tikk et al. (2004) this notion of multi-label is used and they call this a particular type
of multi-label classification problem. However, this definition is trivial, since any
hierarchical approach could be considered multi-label in this sense; therefore, in this
work we will only consider a hierarchical classifier to be hierarchically multi-label if it can
assign more than one class at any given level of the hierarchy to a given example.
This distinction is particularly important, as a hierarchically multi-label classification
algorithm is more challenging to design than a hierarchically single-label one. Also,
recall that in hierarchical classification we assume that the relation between a node
and its parent in the class hierarchy is an "IS-A" relationship.
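The ancestor expansion just mentioned can be made explicit with a small sketch (our own illustration), assuming the dotted class labels of Fig. 1, where "2.1.1" is a child of "2.1", which in turn is a child of "2". Predicting a single node trivially yields its whole ancestor path; under the definition adopted here, a classifier is hierarchically multi-label only if it can additionally output more than one class at the same level (e.g. both "2.1" and "2.2").

# Illustrative sketch: expanding one predicted node into the set of classes it
# implies under an IS-A hierarchy, using dotted class labels as in Fig. 1.
def ancestor_expansion(label):
    """'2.1.1' -> ['2', '2.1', '2.1.1'] (the node plus all of its ancestors)."""
    parts = label.split(".")
    return [".".join(parts[:i]) for i in range(1, len(parts) + 1)]

print(ancestor_expansion("2.1.1"))  # ['2', '2.1', '2.1.1']
# A hierarchically multi-label output would instead contain more than one class
# at the same level, e.g. both '2.1' and '2.2' (plus their ancestors).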
According to Freitas and de Carvalho (2007) and Sun and Lim (2001) hierarchical
classification methods differ in a number of criteria. The first criterion is the type of
hierarchical structure used. This structure is based on the problem structure and is
typically either a tree or a DAG. Figure 2 illustrates these two types of structures.
Fig. 1 An example of a tree-based hierarchical class structure
Fig. 2 A simple example of a tree structure (left) and a DAG structure (right)
The main difference between them is that in the DAG a node can have more than one
parent node.
The second criterion is related to how deep the classification in the hierarchy is
performed. That is, the hierarchical classification method can be implemented in a
way that will always classify a leaf node [which Freitas and de Carvalho (2007) refer
to as mandatory leaf-node prediction (MLNP) and Sun and Lim (2001) refer to as
virtual category tree] or the method can consider stopping the classification at any
node in any level of the hierarchy [which Freitas and de Carvalho (2007) refer to as
non-mandatory leaf node prediction and Sun and Lim (2001) refer to as category tree].
In this paper we will use the term (non-)mandatory leaf node prediction, which can
be naturally used for both tree-structured and DAG-structured class taxonomies.
The third criterion is related to how the hierarchical structure is explored. The
current literature often refers to top-down (or local) classifiers, when the system
employs a set of local classifiers; big-bang (or global) classifiers, when a single classi-
fier coping with the entire class hierarchy is used; or flat classifiers, which ignore the
class relationships, typically predicting only the leaf nodes. However, a closer look at
the existing hierarchical classification methods reveals that:
1. The top-down approach is not a full hierarchical classification approach by itself,
but rather a method for avoiding or correcting inconsistencies in class prediction
at different levels, during the testing (rather than training) phase;
2. There are different ways of using local information to create local classifiers, and
although most of them are referred to as top-down in the literature, they are very
different during the training phase and slightly different in the test phase;
3. Big-bang (or global) classifiers are trained by considering the entire class hierar-
chy at once, and hence they lack the kind of modularity for local training of the
classifier that is a core characteristic of the local classifier approach.
These are the main points which will be discussed in detail in the next four sections.
3 Flat classification approach
The flat classification approach, which is the simplest one to deal with hierarchical
classification problems, consists of completely ignoring the class hierarchy, typically
predicting only classes at the leaf nodes. This approach behaves like a traditional classi-
fication algorithm during training and testing. However, it provides an indirect solution
to the problem of hierarchical classification, because, when a leaf class is assigned to
an example, one can consider that all its ancestor classes are also implicitly assigned
to that instance (recall that we assume an "IS-A" class hierarchy).
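As an illustration of this baseline (a sketch under our own assumptions, not a specific method from the literature), any standard multi-class classifier can be trained on the leaf labels only, with the implicit ancestor labels recovered afterwards; here we reuse the dotted-label convention of Fig. 1 and scikit-learn's logistic regression on synthetic data purely as an example.

# Illustrative sketch of the flat classification baseline: a standard multi-class
# classifier is trained on leaf classes only, and ancestor classes are recovered
# afterwards from the IS-A hierarchy (dotted labels). Data is synthetic.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
leaf_classes = ["1.1", "1.2", "2.1.1", "2.1.2"]
X = rng.normal(size=(200, 5))                    # synthetic features
y = rng.choice(leaf_classes, size=200)           # synthetic leaf labels

flat_clf = LogisticRegression(max_iter=1000).fit(X, y)

def with_ancestors(leaf_label):
    parts = leaf_label.split(".")
    return [".".join(parts[:i]) for i in range(1, len(parts) + 1)]

for leaf in flat_clf.predict(X[:3]):
    print(leaf, "->", with_ancestors(leaf))      # leaf plus implied ancestors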
However, this very simple approach has the serious disadvantage of having to build
a classifier to discriminate among a large number of classes (all leaf classes), without
exploring information about parent-child class relationships present in the class hierar-
chy. Figure 3 illustrates this approach. We use here the term flat classification approach,
as it seems to be the most commonly used term in the existing literature, although in
Burred and Lerch (2003) the authors refer to this approach as “the direct approach”,
while in Xiao et al. (2007) this approach is referred to as a “global classifier”—which
is misleading as they are referring to this naïve flat classification algorithm, and the
term global classifier is often used to refer to the "big-bang" approach (Sect. 5).

Fig. 3 Flat classification approach using a flat multi-class classification algorithm to always predict the
leaf nodes
In Barbedo and Lopes (2007) the authors refer to this approach as a “bottom-up”
approach. They justify this term as follows: “The signal is firstly classified according
to the basic genres, and the corresponding upper classes are consequences of this first
classification (bottom-up approach).” In this paper, however, we prefer to use the term
flat classification to be consistent with the majority of the literature.
Considering the different types of class taxonomies (tree or DAG), this approach
can cope with both of them as long as the problem is a mandatory-leaf node pre-
diction problem, as it is incapable of handling non-mandatory leaf node prediction
problems. In this approach training and testing proceed in the same way as in standard
(non-hierarchical) classification algorithms.
4 Local classifier approaches
In the seminal work of Koller and Sahami (1997), the first type of local classifier
approach (also known as top-down approach in the literature) was proposed. From
this work onwards, many different authors used augmented versions of this approach
to deal with hierarchical classification problems. However, the important aspect here
is not that the approach is top-down (as it is commonly called), but rather that the hier-
archy is taken into account by using a local information perspective. The reason for
this distinction is that in the literature there are several papers that employ this local
information in different ways. These approaches, therefore, can be grouped based on
how they use this local information and how they build their classifiers around it. More
precisely, there seem to exist three standard ways of using the local information: a
local classifier per node (LCN), a local classifier per parent node (LCPN) and a local
classifier per level (LCL). In the following subsections we discuss each one of them in
detail. Also note that, unless specified otherwise, the discussion will assume a single-
label tree-structured class hierarchy and mandatory leaf node prediction.
It should be noted that, although the three types of local hierarchical classification
algorithms discussed in the next three sub-sections differ significantly in their training
phase, they share a very similar top-down approach in their testing phase. In essence, in
this top-down approach, for each new example in the test set, the system first predicts
its first-level (most generic) class, then it uses that predicted class to narrow the choices
of classes to be predicted at the second level (the only valid candidate second-level
classes are the children of the class predicted at the first level), and so on, recursively,
until the most specific prediction is made.
As a result, a disadvantage of the top-down class-prediction approach (which is
shared by all three types of local classifiers discussed next) is that an error at a
certain class level is going to be propagated down the hierarchy, unless some
procedure for avoiding this problem is used. If the problem is non-mandatory leaf
node prediction, a blocking approach (where an example is passed down to the next
lower level only if the confidence in the prediction at the current level is greater than a
threshold) can prevent misclassifications from being propagated downwards, at the expense
of providing the user with less specific (less useful) class predictions. Some authors use
methods to give better estimates of class probabilities, like shrinkage (McCallum et al.
1998) and isotonic smoothing (Punera and Ghosh 2008). The issues of non-mandatory
leaf node prediction and blocking are discussed in Sect. 4.4.
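To make the shared top-down testing phase and the blocking mechanism concrete, the sketch below (our own illustration) assumes one already-trained local classifier per parent node, represented here by a dummy function returning uniform class probabilities; the class hierarchy uses the dotted labels of Fig. 1. Prediction recursively selects the most probable child at each level and, if a confidence threshold is given, stops early when the confidence falls below it (blocking); with a threshold of zero it reduces to mandatory leaf-node prediction.

# Illustrative sketch of the top-down testing phase shared by the local
# classifier approaches, with an optional confidence threshold for blocking
# (non-mandatory leaf-node prediction). The hierarchy, the probabilities and
# the classifier interface are hypothetical placeholders.
children = {"R": ["1", "2"], "1": [], "2": ["2.1", "2.2"],
            "2.1": ["2.1.1", "2.1.2"], "2.2": [], "2.1.1": [], "2.1.2": []}

def local_predict_proba(parent_node, example):
    """Stand-in for a trained local classifier attached to parent_node:
    returns a probability for each of its child classes."""
    kids = children[parent_node]
    return {kid: 1.0 / len(kids) for kid in kids}   # dummy uniform scores

def top_down_predict(example, threshold=0.0):
    path, node = [], "R"
    while children[node]:                            # stop when a leaf is reached
        probs = local_predict_proba(node, example)
        best, conf = max(probs.items(), key=lambda kv: kv[1])
        if conf < threshold:                         # blocking: stop early
            break
        path.append(best)
        node = best
    return path                                      # predicted class path

print(top_down_predict(example=None))                 # mandatory leaf-node prediction
print(top_down_predict(example=None, threshold=0.6))  # blocked at the root (empty path)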
4.1 Local classifier per node approach
This is by far the most commonly used approach in the literature. It often appears under the name
of a top-down approach but, as we mentioned earlier, this is not a
good name, since the top-down approach is essentially a method to avoid inconsistencies
in class predictions at different levels of the class hierarchy. The LCN approach con-
sists of training one binary classifier for each node of the class hierarchy (except the
root node). Figure 4 illustrates this approach.
Fig. 4 Local classifier per node approach (circles represent classes and dashed squares with round corners
represent binary classifiers)
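A minimal training sketch for this approach is given below (our own illustration on synthetic data), in which one binary classifier is trained per class node, excluding the root; an example is a positive instance for a node if its class path contains that node, and here all remaining examples are simply treated as negatives, which is only one of several possible choices.

# Illustrative sketch of the local classifier per node (LCN) approach: one
# binary classifier per class node (except the root), trained on synthetic data.
# Positives are examples whose class path includes the node; here all other
# examples are used as negatives (just one possible policy).
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(42)
X = rng.normal(size=(200, 5))                       # synthetic features
leaves = rng.choice(["1.1", "1.2", "2.1", "2.2"], size=200)

def path_of(leaf):                                   # '2.1' -> {'2', '2.1'}
    parts = leaf.split(".")
    return {".".join(parts[:i]) for i in range(1, len(parts) + 1)}

nodes = sorted(set().union(*(path_of(l) for l in leaves)))   # all non-root nodes

binary_classifiers = {}
for node in nodes:
    # Positive (1) if the example's class path contains this node, else negative (0).
    targets = np.array([node in path_of(l) for l in leaves], dtype=int)
    binary_classifiers[node] = LogisticRegression(max_iter=1000).fit(X, targets)

# At test time each binary classifier votes on membership of its own class node.
example = X[:1]
print({node: int(clf.predict(example)[0]) for node, clf in binary_classifiers.items()})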