Data Min Knowl Disc (2011) 22:31–72
DOI 10.1007/s10618-010-0175-9
A survey of hierarchical classification across different
application domains
Carlos N. Silla Jr. · Alex A. Freitas
Received: 24 February 2009 / Accepted: 11 March 2010 / Published online: 7 April 2010
© The Author(s) 2010
Abstract
In this survey we discuss the task of hierarchical classification. The
literature about this field is scattered across very different application domains and
for that reason research in one domain is often done unaware of methods developed
in other domains. We define what the task of hierarchical classification is and discuss
why some related tasks should not be considered hierarchical classification. We also
present a new perspective on some existing hierarchical classification approaches,
and based on that perspective we propose a new unifying framework to classify the
existing approaches. We also present a review of empirical comparisons of the existing
methods reported in the literature as well as a conceptual comparison of those methods
at a high level of abstraction, discussing their advantages and disadvantages.
Keywords Hierarchical classification · Tree-structured class hierarchies ·
DAG-structured class hierarchies
1 Introduction
A very large amount of research in the data mining, machine learning, statistical pattern
recognition and related research communities has focused on flat classification prob-
lems. By flat classification problem we are referring to standard binary or multi-class
classification problems. On the other hand, many important real-world classification
problems are naturally cast as hierarchical classification problems, where the clas-
ses to be predicted are organized into a class hierarchy—typically a tree or a DAG
C. N. Silla Jr. (B) · A. A. Freitas
School of Computing, University of Kent, Canterbury, UK
e-mail: cns2@kent.ac.uk
A. A. Freitas
e-mail: A.A.Freitas@kent.ac.uk
(Directed Acyclic Graph). The task of hierarchical classification, however, needs to be
better defined, as it can be overlooked or confused with other tasks, which are often
wrongly referred to by the same name. Moreover, the existing literature that deals
with hierarchical classification problems is usually scattered across different applica-
tion domains which are not strongly connected with each other. As a result, researchers
in one application domain are often unaware of methods developed by researchers in
another domain. Also, there seem to be no standards on how to evaluate hierarchical
classification systems, or even on how to set up the experiments in a standard way.
The contributions of this paper are:
– To clarify what the task of hierarchical classification is, and what it is not.
– To propose a unifying framework to classify existing and novel hierarchical classi-
fication methods, as well as different types of hierarchical classification problems.
– To perform a cross-domain critical survey, in order to create a taxonomy of hierar-
chical classification systems, by identifying important similarities and differences
between the different approaches, which are currently scattered across different
application domains.
– To suggest some experimental protocols to be undertaken when performing
hierarchical classification experiments, in order to have a better understanding
of the results. For instance, many authors claim that some hierarchical classifica-
tion methods are better than others, but they often use standard flat classification
evaluation measures instead of using hierarchical evaluation measures. Also, in
some cases, it is possible to overlook what would be interesting to compare, and
authors often compare their hierarchical classification methods only against flat
classification methods, although a baseline hierarchical method is not
hard to implement and would offer a more interesting experimental comparison.
This survey seems timely, as different fields of research increasingly use automated
approaches to deal with hierarchical information, since hierarchies (or taxonomies)
are a good way of organizing vast amounts of information. The first issue that
will be discussed in this paper (Sect. 2) is precisely the definition of the hierarchical
classification task. After clearly defining the task, we classify the existing approaches
in the literature according to three different broad types of approach, based on the
underlying methods. These approaches can be classified as: flat, i.e., ignoring the class
hierarchy (Sect. 3); local (Sect. 4) or global (Sect. 5). Based on the new understanding
about these approaches we present a unifying framework to classify hierarchical clas-
sification methods and problems (Sect. 6). A summary, a conceptual comparison and a
review of empirical comparisons of these three approaches reported in the literature
are presented in Sect. 7. Section 8 presents some major applications of hierarchical clas-
sification methods; and finally in Sect. 9 we present the conclusions of this work.
2 What is hierarchical classification?
In order to learn about hierarchical classification, one might start searching for papers
with the keywords “hierarchical” and “classification”; however, this might be mis-
leading. One of the reasons for this is that, due to the popularity of SVM (Support
Vector Machine) methods in the machine learning community (which were originally
developed for binary classification problems), different researchers have developed
different methods to deal with multi-class classification problems. The most common
are the One-Against-One and the One-Against-All schemes (Lorena and Carvalho
2004). A lesser-known approach consists of dividing the problem in a hierarchical way,
where classes that are more similar to one another are grouped together into meta-
classes, resulting in a Binary Hierarchical Classifier (BHC) (Kumar et al. 2002). For
instance, in Chen et al. (2004) the authors modified the standard SVM, creating what
they called a H-SVM (Hierarchical SVM), based on this hierarchical problem decom-
position approach.
instance, in Chen et al. (2004) the authors modified the standard SVM, creating what
they called a H-SVM (Hierarchical SVM), based on this hierarchical problem decom-
position approach.
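For concreteness, the following minimal sketch (our own illustration, not taken from the works cited above) shows the One-Against-One and One-Against-All decompositions using scikit-learn's built-in wrappers around a binary SVM; the dataset and parameters are arbitrary and purely illustrative.

# Illustrative sketch: One-Against-One vs. One-Against-All (one-vs-rest)
# decompositions of a flat multi-class problem into binary SVMs.
# Assumes scikit-learn is installed; dataset and parameters are arbitrary.
from sklearn.datasets import load_iris
from sklearn.multiclass import OneVsOneClassifier, OneVsRestClassifier
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# One-Against-One: one binary SVM per pair of classes.
ovo = OneVsOneClassifier(SVC(kernel="linear")).fit(X, y)

# One-Against-All: one binary SVM per class (that class vs. all others).
ova = OneVsRestClassifier(SVC(kernel="linear")).fit(X, y)

print(ovo.predict(X[:5]), ova.predict(X[:5]))

Note that neither scheme involves a class hierarchy; they merely decompose a flat multi-class problem into binary sub-problems.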
When we consider the use of meta-classes in the pattern recognition field, they
are usually manually assigned, as in Koerich and Kalva (2005), where handwritten
letters with the same curves in uppercase and lowercase format (e.g. "o" and "O")
are represented by the same meta-class. An automated method for the generation of
meta-classes was recently proposed by Freitas et al. (2008). At first glance the use of
meta-classes (and their automatic generation) seems to be related to the hierarchical
problem decomposition approach, as one can view the use of meta-classes as a two-
level hierarchy where leaf classes are grouped together by similarity into intermediate
classes (the meta-classes). This issue is interesting and deserves further investigation,
but is beyond the scope of this paper. In this paper we take the perspective that this
kind of approach is not considered to be a hierarchical classification approach, because
it creates new (meta-)classes on the fly, instead of using a pre-established taxonomy.
In principle a classification algorithm is not supposed to create new classes, a task
which is more closely related to clustering.
In this paper we are interested in approaches that cope with a pre-defined class
hierarchy, instead of creating one from the similarity of classes within data (which
would lead to higher-level classes that could be meaningless to the user). Let us elabo-
rate on this point. There are application domains where the internal (non-leaf) nodes of
the class hierarchy can be chosen based on data (usually in the text mining application
domain), as in Sasaki and Kita (1998), Punera et al. (2005), Li et al. (2007), Hao et al.
(2007), where they build the hierarchy during training by using some sort of hierar-
chical clustering method, and then classify new test examples by using a hierarchical
approach. However, in other domains, like protein function prediction in bioinformat-
ics, just knowing that classes A and B are similar can be misleading, as proteins with
similar characteristics (sequences of amino acids) can have very different functions
and vice-versa (Gerlt and Babbitt 2000). Therefore, in this work, we are interested only
in hierarchical classification (a type of supervised learning). Hierarchical clustering
(a type of unsupervised learning) is out of the scope of the paper.
Hierarchical classification can also appear under the name of Structured Classifi-
cation (Seeger 2008; Astikainen et al. 2008). However, the research field of structured
classification involves many different types of problems which are not hierarchical
classification problems, e.g. Label Sequence Learning (Altun and Hofmann 2003;
Tsochantaridis et al. 2005). Therefore, hierarchical classification can be seen as a par-
ticular type of structured classification problem, where the output of the classification
algorithm is defined over a class taxonomy; whilst the term structured classification
is broader and denotes a classification problem where there is some structure (hierar-
chical or not) among the classes.
It is important, then, to define exactly what a class taxonomy is. Wu et al. (2005) have
defined a class taxonomy as a tree-structured regular concept hierarchy defined over a
partially ordered set (C, ≺), where C is a finite set that enumerates all class concepts in
the application domain, and the relation ≺ represents the “IS-A” relationship. Wu et al.
(2005) define the “IS-A” relationship as both anti-reflexive and transitive. However,
we prefer to define the “IS-A” relationship as asymmetric, anti-reflexive and transitive:
– The only one greatest element "R" is the root of the tree.
– ∀ ci, cj ∈ C, if ci ≺ cj then cj ⊀ ci (asymmetry).
– ∀ ci ∈ C, ci ⊀ ci (anti-reflexivity).
– ∀ ci, cj, ck ∈ C, ci ≺ cj and cj ≺ ck imply ci ≺ ck (transitivity).
This definition, although originally proposed for tree-structured class taxonomies,
can be used to define DAG-structured class taxonomies as well. Ruiz and Srinivasan
(2002) give a good example of the asymmetric and transitive relations: The “IS-A”
relation is asymmetric (e.g. all dogs are animals, but not all animals are dogs) and tran-
sitive (e.g., all pines are evergreens, and all evergreens are trees; therefore all pines
are trees).
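To make this definition concrete, the sketch below (our own illustration, with a hypothetical set of classes inspired by the example above) encodes a class taxonomy by parent links and implements the "≺" (IS-A) relation as the transitive closure of those links; since a node may list more than one parent, the same encoding covers both trees and DAGs.

# Illustrative sketch: a class taxonomy encoded by parent links, with the
# IS-A relation (written '≺' in the text) obtained by transitive closure.
# The class names are hypothetical; "R" denotes the root.
parents = {
    "animal": ["R"],
    "dog": ["animal"],
    "plant": ["R"],
    "evergreen": ["plant"],
    "pine": ["evergreen"],
}

def is_a(ci, cj):
    """True if ci ≺ cj, i.e. ci is a proper descendant of cj."""
    frontier = list(parents.get(ci, []))
    while frontier:
        p = frontier.pop()
        if p == cj:
            return True
        frontier.extend(parents.get(p, []))
    return False

assert is_a("dog", "animal") and not is_a("animal", "dog")  # asymmetry
assert not is_a("dog", "dog")                               # anti-reflexivity
assert is_a("pine", "plant")                                # transitivity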
Note that, for the purposes of this survey, any classification problem with a class
structure satisfying the aforementioned four properties of the IS-A hierarchy can be
considered as a hierarchical classification problem, and in general the hierarchical
classification methods surveyed in this work assume (explicitly or implicitly) that the
underlying class structure satisfies those properties. In the vast majority of works on
hierarchical classification, the actual class hierarchy in the underlying problem domain
can indeed be called an IS-A hierarchy from a semantic point of view. However, in
a few cases the semantics of the underlying class hierarchy might be different, but as
long as the aforementioned four properties are satisfied, we would consider the target
problem as a hierarchical classification one. For instance, the class taxonomy asso-
ciated with cellular localization in the Gene Ontology (an ontology which is briefly
discussed in Sect. 8.2) is essentially, from a semantic point of view, a PART-OF class
hierarchy, but it still satisfies the four properties of the aforementioned definition of an
IS-A hierarchy, so we consider the prediction of cellular location classes according to
that class hierarchy as a hierarchical classification problem.
Whether the taxonomy is organized into a tree or a DAG influences the degree
of difficulty of the underlying hierarchical classification problem. Notably, as will be
seen in Sect. 7, most of the current literature focuses on working with trees, as they pose
an easier problem. One of the main contributions of this survey is to organize the
existing hierarchical classification approaches into a taxonomy, based on their essen-
tial properties, regardless of the application domain. One of the main challenges in
doing so is dealing with all the different terminology that has already been pro-
posed, which is often inconsistent across different works. In order to understand these
essential properties, it is important to clarify a few aspects of hierarchical classification
methods.
Let us consider initially two types of conventional classification methods that
cannot directly cope with hierarchical classes: binary and multi-class classifiers. First,
the main difference between a binary classifier and a multi-class classifier is that
the binary classifier can only handle two-class problems, whilst a multi-class clas-
sifier can handle in principle any number of classes. Secondly, there are multi-class
classifiers that can also be multi-label, i.e. the classifier can assign more than one
class to a given example. Thirdly, since these types of classifi-
ers were not designed to deal with hierarchical classification problems, they will be
referred to as flat classification algorithms. Fourthly, in the context of hierarchical
classification most approaches could be called multi-label. For instance, considering
the hierarchical class structure presented in Fig. 1 (where R denotes the root node),
if the output of a classifier is class 2.1.1, it is natural to say that it also belongs to
classes 2 and 2.1, therefore having three classes as the output of the classifier. In
Tikk et al. (2004) this notion of multi-label is used and they call this a particular type
of multi-label classification problem. However, this definition is trivial, since any
hierarchical approach could be considered multi-label in this sense; therefore, in this
work we will only consider a hierarchical classifier to be hierarchically multi-label if it can
assign more than one class at any given level of the hierarchy to a given example.
This distinction is particularly important, as a hierarchically multi-label classification
algorithm is more challenging to design than a hierarchically single-label one. Also,
recall that in hierarchical classification we assume that the relation between a node
and its parent in the class hierarchy is an "IS-A" relationship.
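The ancestor expansion just mentioned can be made explicit with a small sketch (our own illustration), assuming the dotted class labels of Fig. 1, where "2.1.1" is a child of "2.1", which in turn is a child of "2". Predicting a single node trivially yields its whole ancestor path; under the definition adopted here, a classifier is hierarchically multi-label only if it can additionally output more than one class at the same level (e.g. both "2.1" and "2.2").

# Illustrative sketch: expanding one predicted node into the set of classes it
# implies under an IS-A hierarchy, using dotted class labels as in Fig. 1.
def ancestor_expansion(label):
    """'2.1.1' -> ['2', '2.1', '2.1.1'] (the node plus all of its ancestors)."""
    parts = label.split(".")
    return [".".join(parts[:i]) for i in range(1, len(parts) + 1)]

print(ancestor_expansion("2.1.1"))  # ['2', '2.1', '2.1.1']
# A hierarchically multi-label output would instead contain more than one class
# at the same level, e.g. both '2.1' and '2.2' (plus their ancestors).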
According to Freitas and de Carvalho (2007) and Sun and Lim (2001) hierarchical
classification methods differ in a number of criteria. The first criterion is the type of
hierarchical structure used. This structure is based on the problem structure and is
typically either a tree or a DAG. Figure 2 illustrates these two types of structures.
Fig. 1 An example of a tree-based hierarchical class structure
Fig. 2 A simple example of a tree structure (left) and a DAG structure (right)
The main difference between them is that in the DAG a node can have more than one
parent node.
The second criterion is related to how deep the classification in the hierarchy is
performed. That is, the hierarchical classification method can be implemented in a
way that will always classify a leaf node [which Freitas and de Carvalho (2007) refer
to as mandatory leaf-node prediction (MLNP) and Sun and Lim (2001) refer to as
virtual category tree] or the method can consider stopping the classification at any
node in any level of the hierarchy [which Freitas and de Carvalho (2007) refer to as
non-mandatory leaf node prediction and Sun and Lim (2001) refer to as category tree].
In this paper we will use the term (non-)mandatory leaf node prediction, which can
be naturally used for both tree-structured and DAG-structured class taxonomies.
The third criterion is related to how the hierarchical structure is explored. The
current literature often refers to top-down (or local) classifiers, when the system
employs a set of local classifiers; big-bang (or global) classifiers, when a single classi-
fier coping with the entire class hierarchy is used; or flat classifiers, which ignore the
class relationships, typically predicting only the leaf nodes. However, a closer look at
the existing hierarchical classification methods reveals that:
1. The top-down approach is not a full hierarchical classification approach by itself,
but rather a method for avoiding or correcting inconsistencies in class prediction
at different levels, during the testing (rather than training) phase;
2. There are different ways of using local information to create local classifiers, and
although most of them are referred to as top-down in the literature, they are very
different during the training phase and slightly different in the test phase;
3. Big-bang (or global) classifiers are trained by considering the entire class hierar-
chy at once, and hence they lack the kind of modularity for local training of the
classifier that is a core characteristic of the local classifier approach.
These are the main points which will be discussed in detail in the next four sections.
3 Flat classification approach
The flat classification approach, which is the simplest one to deal with hierarchical
classification problems, consists of completely ignoring the class hierarchy, typically
predicting only classes at the leaf nodes. This approach behaves like a traditional classi-
fication algorithm during training and testing. However, it provides an indirect solution
to the problem of hierarchical classification, because, when a leaf class is assigned to
an example, one can consider that all its ancestor classes are also implicitly assigned
to that instance (recall that we assume an "IS-A" class hierarchy).
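As an illustration of this baseline (a sketch under our own assumptions, not a specific method from the literature), any standard multi-class classifier can be trained on the leaf labels only, with the implicit ancestor labels recovered afterwards; here we reuse the dotted-label convention of Fig. 1 and scikit-learn's logistic regression on synthetic data purely as an example.

# Illustrative sketch of the flat classification baseline: a standard multi-class
# classifier is trained on leaf classes only, and ancestor classes are recovered
# afterwards from the IS-A hierarchy (dotted labels). Data is synthetic.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
leaf_classes = ["1.1", "1.2", "2.1.1", "2.1.2"]
X = rng.normal(size=(200, 5))                    # synthetic features
y = rng.choice(leaf_classes, size=200)           # synthetic leaf labels

flat_clf = LogisticRegression(max_iter=1000).fit(X, y)

def with_ancestors(leaf_label):
    parts = leaf_label.split(".")
    return [".".join(parts[:i]) for i in range(1, len(parts) + 1)]

for leaf in flat_clf.predict(X[:3]):
    print(leaf, "->", with_ancestors(leaf))      # leaf plus implied ancestors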
However, this very simple approach has the serious disadvantage of having to build
a classifier to discriminate among a large number of classes (all leaf classes), without
exploring information about parent-child class relationships present in the class hierar-
chy. Figure 3 illustrates this approach. We use here the term flat classification approach,
as it seems to be the most commonly used term in the existing literature, although in
Burred and Lerch (2003) the authors refer to this approach as “the direct approach”,
while in Xiao et al. (2007) this approach is referred to as a “global classifier”—which
is misleading as they are referring to this naïve flat classification algorithm, and the
term global classifier is often used to refer to the "big-bang" approach (Sect. 5).

Fig. 3 Flat classification approach using a flat multi-class classification algorithm to always predict the
leaf nodes
In Barbedo and Lopes (2007) the authors refer to this approach as a “bottom-up”
approach. They justify this term as follows: “The signal is firstly classified according
to the basic genres, and the corresponding upper classes are consequences of this first
classification (bottom-up approach).” In this paper, however, we prefer to use the term
flat classification to be consistent with the majority of the literature.
Considering the different types of class taxonomies (tree or DAG), this approach
can cope with both of them as long as the problem is a mandatory-leaf node pre-
diction problem, as it is incapable of handling non-mandatory leaf node prediction
problems. In this approach training and testing proceed in the same way as in standard
(non-hierarchical) classification algorithms.
4 Local classifier approaches
In the seminal work of Koller and Sahami (1997), the first type of local classifier
approach (also known as top-down approach in the literature) was proposed. From
this work onwards, many different authors used augmented versions of this approach
to deal with hierarchical classification problems. However, the important aspect here
is not that the approach is top-down (as it is commonly called), but rather that the hier-
archy is taken into account by using a local information perspective. The reason for
this distinction is that in the literature there are several papers that employ this local
information in different ways. These approaches, therefore, can be grouped based on
how they use this local information and how they build their classifiers around it. More
precisely, there seem to exist three standard ways of using the local information: a
local classifier per node (LCN), a local classifier per parent node (LCPN) and a local
classifier per level (LCL). In the following subsections we discuss each one of them in
detail. Also note that, unless specified otherwise, the discussion will assume a single-
label tree-structured class hierarchy and mandatory leaf node prediction.
It should be noted that, although the three types of local hierarchical classification
algorithms discussed in the next three sub-sections differ significantly in their training
phase, they share a very similar top-down approach in their testing phase. In essence, in
this top-down approach, for each new example in the test set, the system first predicts
its first-level (most generic) class, then it uses that predicted class to narrow the choices
of classes to be predicted at the second level (the only valid candidate second-level
classes are the children of the class predicted at the first level), and so on, recursively,
until the most specific prediction is made.
As a result, a disadvantage of the top-down class-prediction approach (which is
shared by all three types of local classifiers discussed next) is that an error at a
certain class level is going to be propagated down the hierarchy, unless some
procedure for avoiding this problem is used. If the problem is non-mandatory leaf
node prediction, a blocking approach (where an example is passed down to the next
lower level only if the confidence in the prediction at the current level is greater than a
threshold) can prevent misclassifications from being propagated downwards, at the expense
of providing the user with less specific (less useful) class predictions. Some authors use
methods to give better estimates of class probabilities, like shrinkage (McCallum et al.
1998) and isotonic smoothing (Punera and Ghosh 2008). The issues of non-mandatory
leaf node prediction and blocking are discussed in Sect. 4.4.
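To make the shared top-down testing phase and the blocking mechanism concrete, the sketch below (our own illustration) assumes one already-trained local classifier per parent node, represented here by a dummy function returning uniform class probabilities; the class hierarchy uses the dotted labels of Fig. 1. Prediction recursively selects the most probable child at each level and, if a confidence threshold is given, stops early when the confidence falls below it (blocking); with a threshold of zero it reduces to mandatory leaf-node prediction.

# Illustrative sketch of the top-down testing phase shared by the local
# classifier approaches, with an optional confidence threshold for blocking
# (non-mandatory leaf-node prediction). The hierarchy, the probabilities and
# the classifier interface are hypothetical placeholders.
children = {"R": ["1", "2"], "1": [], "2": ["2.1", "2.2"],
            "2.1": ["2.1.1", "2.1.2"], "2.2": [], "2.1.1": [], "2.1.2": []}

def local_predict_proba(parent_node, example):
    """Stand-in for a trained local classifier attached to parent_node:
    returns a probability for each of its child classes."""
    kids = children[parent_node]
    return {kid: 1.0 / len(kids) for kid in kids}   # dummy uniform scores

def top_down_predict(example, threshold=0.0):
    path, node = [], "R"
    while children[node]:                            # stop when a leaf is reached
        probs = local_predict_proba(node, example)
        best, conf = max(probs.items(), key=lambda kv: kv[1])
        if conf < threshold:                         # blocking: stop early
            break
        path.append(best)
        node = best
    return path                                      # predicted class path

print(top_down_predict(example=None))                 # mandatory leaf-node prediction
print(top_down_predict(example=None, threshold=0.6))  # blocked at the root (empty path)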
4.1 Local classifier per node approach
This is by far the most commonly used approach in the literature. It often appears under the name
of a top-down approach but, as we mentioned earlier, this is not a
good name, since the top-down approach is essentially a method to avoid inconsistencies
in class predictions at different levels of the class hierarchy. The LCN approach con-
sists of training one binary classifier for each node of the class hierarchy (except the
root node). Figure 4 illustrates this approach.
Fig. 4 Local classifier per node approach (circles represent classes and dashed squares with round corners
represent binary classifiers)
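A minimal training sketch for this approach is given below (our own illustration on synthetic data), in which one binary classifier is trained per class node, excluding the root; an example is a positive instance for a node if its class path contains that node, and here all remaining examples are simply treated as negatives, which is only one of several possible choices.

# Illustrative sketch of the local classifier per node (LCN) approach: one
# binary classifier per class node (except the root), trained on synthetic data.
# Positives are examples whose class path includes the node; here all other
# examples are used as negatives (just one possible policy).
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(42)
X = rng.normal(size=(200, 5))                       # synthetic features
leaves = rng.choice(["1.1", "1.2", "2.1", "2.2"], size=200)

def path_of(leaf):                                   # '2.1' -> {'2', '2.1'}
    parts = leaf.split(".")
    return {".".join(parts[:i]) for i in range(1, len(parts) + 1)}

nodes = sorted(set().union(*(path_of(l) for l in leaves)))   # all non-root nodes

binary_classifiers = {}
for node in nodes:
    # Positive (1) if the example's class path contains this node, else negative (0).
    targets = np.array([node in path_of(l) for l in leaves], dtype=int)
    binary_classifiers[node] = LogisticRegression(max_iter=1000).fit(X, targets)

# At test time each binary classifier votes on membership of its own class node.
example = X[:1]
print({node: int(clf.predict(example)[0]) for node, clf in binary_classifiers.items()})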