Cascade Object Detection with Deformable Part Models∗

Pedro F. Felzenszwalb, University of Chicago, pff@cs.uchicago.edu
Ross B. Girshick, University of Chicago, rbg@cs.uchicago.edu
David McAllester, TTI at Chicago, mcallester@ttic.edu

∗ This research has been supported by NSF grant IIS-0746569.

Abstract

We describe a general method for building cascade classifiers from part-based deformable models such as pictorial structures. We focus primarily on the case of star-structured models and show how a simple algorithm based on partial hypothesis pruning can speed up object detection by more than one order of magnitude without sacrificing detection accuracy. In our algorithm, partial hypotheses are pruned with a sequence of thresholds. In analogy to probably approximately correct (PAC) learning, we introduce the notion of probably approximately admissible (PAA) thresholds. Such thresholds provide theoretical guarantees on the performance of the cascade method and can be computed from a small sample of positive examples. Finally, we outline a cascade detection algorithm for a general class of models defined by a grammar formalism. This class includes not only tree-structured pictorial structures but also richer models that can represent each part recursively as a mixture of other parts.

1. Introduction

A popular approach for object detection involves reducing the problem to binary classification. The simplest and most common example of this approach is the sliding window method. In this method a classifier is applied at all positions, scales, and, in some cases, orientations of an image. However, testing all points in the search space with a non-trivial classifier can be very slow. An effective method for addressing this problem involves applying a cascade of simple tests to each hypothesized object location to eliminate most of them very quickly [16, 12, 4, 15, 2, 13].

Another line of research, separate from cascade classifiers, uses part-based deformable models for detection. In this case an object hypothesis specifies a configuration of parts, which leads to a very large (exponential) hypothesis space. There has been significant success in algorithmic methods for searching over these large hypothesis spaces, including methods that are “asymptotically optimal” for tree-structured models [9]. However, these methods are still relatively slow when compared to simple classifiers defined by cascades. In this paper we describe a method for building cascades for part-based deformable models such as pictorial structures. In the most general setting, this method leads to a cascade version of top-down dynamic programming for a general class of grammar-based models.

We focus primarily on the case of star-structured models due to their recent strong performance on difficult benchmarks such as the PASCAL datasets [11, 8, 5, 6, 7]. For star models, we obtain a fairly simple algorithm for early hypothesis pruning. This algorithm leads to a detection method over 20 times faster than the standard detection algorithm, which is based on dynamic programming and generalized distance transforms, without sacrificing detection accuracy. Figure 1 illustrates the amount of work done by our algorithm in different areas of an image using two different models.

Figure 1. Visualization of the amount of work performed by our algorithm over different regions of an image (top) using a car model (a) and a person model (b).
As described in [11, 8], detection with a deformable part model can be done by considering all possible locations of a distinguished “root” part and, for each of those, finding the
best configuration of the remaining parts. In this case we need to compute an optimal configuration for each location of the root. These problems are not independent because the possible locations of the remaining parts are shared among different root locations. For tree-structured models one can use dynamic programming to account for this sharing [9].

In practice one is only interested in root locations that lead to high scoring configurations. The basic idea of our algorithm is to use a hierarchy of models defined by an ordering of the original model’s parts. For a model with n + 1 parts, including the root, we obtain a sequence of n + 1 models. The i-th model in this sequence is defined by the first i parts from the original model. Using this hierarchy, we can prune low scoring hypotheses after looking at the best configuration of a subset of the parts. Hypotheses that score high under a weak model are evaluated further using a richer model. This process is analogous to a classical cascade and is similar to the cascades of [2, 15] in that the score of a weaker model is reused when computing the score of a richer one. However, when using deformable part models individual applications of the cascade are not independent, so, in analogy to classical dynamic programming, work done evaluating one hypothesis is also reused when evaluating others.

Our sequential search for parts is related to [1], where the authors propose a sequential search for semi-local features that fit a global arrangement. The work in [1] also considered the problem of selecting parameters that lead to fast search with a low false negative rate, by making some assumptions on the form of the distribution of local features and analyzing statistics of training data. We use an alternative approach (see below) that makes fewer assumptions and relies more heavily on the training data.

The time it takes to evaluate a hypothesis for a part-based model is highly dependent on the complexity of the individual part models. Besides simplifying a model by removing some of its parts, we also consider simplifications that arise from replacing the original part appearance models with simpler ones that are much faster to compute. In this case, for a model with n + 1 parts we get a hierarchy of 2(n + 1) models. The first n + 1 models are obtained by sequentially adding parts with simplified appearance models. The second n + 1 models are obtained by sequentially replacing each simplified appearance model with its full one.

Our algorithm prunes partial hypotheses using thresholds on their scores. Admissible thresholds would not prune any partial hypothesis that leads to a complete detection scoring above a global threshold. We define the error of a set of thresholds to be the fraction of full hypotheses scoring above the global threshold that are incorrectly pruned. To select pruning thresholds, we introduce the notion of probably approximately admissible (PAA) thresholds. PAA thresholds have a low error with high probability. We show that PAA thresholds can be obtained by looking at statistics of partial hypothesis scores over positive examples. This leads to a simple method for picking safe and effective thresholds. The thresholds are safe because they have low error with high probability. They are effective because they lead to a fast cascade with significant pruning.
[9] notes that by using dynamic programming and distance transforms the relationships among parts in a tree-structured model can be taken into account “for free.” That is, it takes very little additional time to detect whole object configurations as opposed to individually detecting parts on their own. Our results push this idea further. In practice we find that it is possible to detect whole object configurations much faster than detecting each individual part.

2. Object Detection with Star Models

We start by defining a general framework for object detection with star-structured deformable part models that includes the setting in [11, 8].

Let M be a model with a root part v0 and n additional parts v1, . . . , vn. Let Ω be a space of locations for each part within an image. For example, ω ∈ Ω could specify a position and scale. Let mi(ω) be the score for placing vi in location ω. This score depends on the image data, but we assume the image is implicitly defined to simplify notation.

For a non-root part, let ai(ω) specify the ideal location for vi as a function of the root location. Let ∆ be a space of displacements, and let ⊕ : Ω × ∆ → Ω be a binary operation taking a location and a displacement to a “displaced location.” Let di(δ) specify a deformation cost for a displacement of vi from its ideal location relative to the root.

An object configuration specifies a location for the root and a displacement for each additional part from its ideal location relative to the root. The score of a configuration is the sum of the scores of the parts at their locations minus deformation costs associated with each displacement,

    score(ω, δ1, . . . , δn) = m0(ω) + Σ_{i=1}^{n} [ mi(ai(ω) ⊕ δi) − di(δi) ].    (1)

We can define an overall score for a root location based on the maximum score of a configuration rooted at that location. In a star model each part is only attached to the root, so the score can be factored as follows,

    score(ω) = m0(ω) + Σ_{i=1}^{n} scorei(ai(ω)),    (2)

    scorei(η) = max_{δi ∈ ∆} ( mi(η ⊕ δi) − di(δi) ).    (3)

Here scorei(η) is the maximum, over displacements of the part from its ideal location, of the part score minus the deformation cost associated with the displacement.
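Equations (2) and (3) can be evaluated directly by brute force, which is a useful reference point for the algorithms that follow. The sketch below is our own illustration rather than code from the paper: it assumes dense 2D score arrays indexed by grid locations, a small displacement set ∆ given as integer offsets, and hypothetical names (m, anchors, dcost, deltas).

```python
import numpy as np

def star_score(m, anchors, dcost, deltas):
    """Brute-force star-model scoring (equations 1-3); a hypothetical sketch.

    m[i]       : 2D array of appearance scores m_i(w) for part i (i = 0 is the root)
    anchors[i] : (dy, dx) ideal offset a_i of part i relative to the root (index 0 unused)
    dcost[i]   : function mapping a displacement (dy, dx) to the cost d_i (index 0 unused)
    deltas     : the displacement set Delta, e.g. [(-1, -1), (-1, 0), ..., (1, 1)]
    Returns a 2D array with score(w) for every root location w.
    """
    H, W = m[0].shape
    total = m[0].copy()                          # m_0(w)
    for i in range(1, len(m)):
        part = np.full((H, W), -np.inf)
        for y in range(H):
            for x in range(W):
                ay, ax = y + anchors[i][0], x + anchors[i][1]   # ideal location a_i(w)
                best = -np.inf
                for dy, dx in deltas:            # max over displacements (equation 3)
                    py, px = ay + dy, ax + dx
                    if 0 <= py < H and 0 <= px < W:
                        best = max(best, m[i][py, px] - dcost[i]((dy, dx)))
                part[y, x] = best
        total += part                            # sum of part contributions (equation 2)
    return total
```

Computed this way the cost is O(n|Ω||∆|); the distance-transform method mentioned next removes the |∆| factor for quadratic deformation costs.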
For the models in [11, 8], mi(ω) is the response of a filter in a dense feature pyramid, and di(δ) is a (separable) quadratic function of δ. To detect objects, [11, 8] look for root locations with an overall score above some threshold, score(ω) ≥ T. A dynamic programming algorithm is used to compute score(ω) for every location ω ∈ Ω. Using the fast distance transforms method from [9], the detection algorithm runs in O(n|Ω|) time if we assume that evaluating the appearance model for a part at a specific location takes O(1) time. In practice, evaluating the appearance models is the bottleneck of the method.

3. Star-Cascade Detection

Here we describe a cascade algorithm for star models that uses a sequence of thresholds to prune detections using subsets of parts. Note that we are only interested in root locations where score(ω) ≥ T. By evaluating parts in a sequential order we can avoid evaluating the appearance model for most parts almost everywhere. For example, when detecting people we might evaluate the score of the head part at each possible location and decide that we do not need to evaluate the score of the torso part for most locations in the image. Figure 2 shows an example run of the algorithm.

We use m̃i(ω) to denote a memoized version of mi(ω). In a memoized function, whenever a value is computed we store it to avoid recomputing it later. Memoization can be implemented by maintaining an ω-indexed array of already-computed values and checking in this array first whenever m̃i(ω) is called, to avoid computing the appearance model at the same location more than once.

The cascade algorithm (Algorithm 1) for a star-structured model with n + 1 parts takes a global threshold T and a sequence of 2n intermediate thresholds. To simplify the presentation we assume the root appearance model is evaluated first, even though this ordering is not a requirement. For each root location ω ∈ Ω, we evaluate score(ω) in n stages. The variable s accumulates the score over stages. In the i-th stage we compute scorei(η), the contribution of part vi, using the variable p. During evaluation of score(ω) there are two opportunities for pruning.

Hypothesis pruning: If the score at ω with the first i parts is below ti, then the hypothesis at ω is pruned without evaluating parts vi through vn (line 5). Intuitively, placing the remaining parts will not make score(ω) go above T.

Deformation pruning: To compute vi’s contribution we need to search over deformations δ ∈ ∆. The algorithm will skip a particular δ if the score of the first i parts minus di(δ) is below t′i (line 8). Intuitively, displacing vi by δ costs too much to allow score(ω) to go above T.

Note that memoizing the appearance models is important because several root locations might want to evaluate mi(ω) at the same location.

Data: Thresholds ((t1, t′1), . . . , (tn, t′n)) and T
Result: Set of detections D
1   D ← ∅
2   for ω ∈ Ω do
3       s ← m̃0(ω)
4       for i = 1 to n do
5           if s < ti then skip ω
6           p ← −∞
7           for δ ∈ ∆ do
8               if s − di(δ) < t′i then skip δ
9               p ← max(p, m̃i(ai(ω) ⊕ δ) − di(δ))
10          end
11          s ← s + p
12      end
13      if s ≥ T then D ← D ∪ {ω}
14  end
15  return D
Algorithm 1: star-cascade

For a fixed global threshold T, any input to Algorithm 1 that correctly computes score(ω) whenever it is above T is called a T-admissible set of thresholds. If given T-admissible thresholds, Algorithm 1 returns exactly the same set of detections as the standard dynamic programming algorithm. In the next section we investigate the case of good inadmissible thresholds that produce a cascade with a low error rate but still allow aggressive pruning.

The worst-case time of Algorithm 1 is O(n|Ω||∆|) if mi(ω) is taken to cost O(1), which is slower than the standard dynamic programming algorithm with distance transforms. However, in practice, for typical models ∆ can be safely made relatively small (cf. Section 6). Moreover, searching over ∆ is usually no more expensive than evaluating mi(ω) at a single location, because the spatial extent of a part is of similar size as its range of displacement. The worst-case time of both methods is the same, O(n|Ω||∆|), if we assume evaluating mi(ω) takes O(|∆|) time.

Figure 2 shows how, in practice, the cascade algorithm avoids evaluating mi(ω) for most locations ω ∈ Ω except for one or two parts.

Figure 2. A sample image with a bicycle detection (left). Each image on the right shows (in white) positions where a particular part appearance model was evaluated at the scale of the bicycle detection. The images are shown in “cascade order” from left to right and top to bottom. The image for the root part, which was evaluated first, is not shown because its appearance model is evaluated at all locations. After evaluating the first non-root part, nearly all locations were pruned by the cascade.
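A runnable rendering of Algorithm 1 is sketched below under simplifying assumptions of our own (it is not the authors' implementation): locations are integer grid coordinates, each appearance model is an expensive callable that we memoize with a dictionary, and the thresholds are passed as 0-indexed lists t and tp holding ti and t′i.

```python
def star_cascade(eval_m, anchors, dcost, deltas, locations, t, tp, T):
    """Sketch of the star-cascade (Algorithm 1) under assumed inputs.

    eval_m[i]  : callable giving the appearance score m_i(w) at a location w = (y, x)
    anchors[i] : ideal offset a_i of part i relative to the root (index 0 unused)
    dcost[i]   : deformation cost d_i as a function of a displacement (index 0 unused)
    deltas     : the displacement set Delta
    locations  : iterable of candidate root locations (a discretization of Omega)
    t, tp      : 0-indexed lists of length n with the thresholds t_i and t'_i
    T          : global detection threshold
    """
    n = len(eval_m) - 1
    memo = [dict() for _ in range(n + 1)]          # memoized appearance scores

    def m(i, w):
        if w not in memo[i]:
            memo[i][w] = eval_m[i](w)
        return memo[i][w]

    detections = []
    for w in locations:
        s = m(0, w)                                # root score
        pruned = False
        for i in range(1, n + 1):
            if s < t[i - 1]:                       # hypothesis pruning (line 5)
                pruned = True
                break
            ay, ax = w[0] + anchors[i][0], w[1] + anchors[i][1]
            p = float("-inf")
            for dy, dx in deltas:
                if s - dcost[i]((dy, dx)) < tp[i - 1]:   # deformation pruning (line 8)
                    continue
                p = max(p, m(i, (ay + dy, ax + dx)) - dcost[i]((dy, dx)))
            s += p
        if not pruned and s >= T:
            detections.append((w, s))              # keep the location and its score
    return detections
```

Because the memo dictionaries are shared across root locations, work spent evaluating one hypothesis is reused by its neighbours, which is the dynamic-programming flavour of the cascade described in the introduction.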
4. Pruning Thresholds

Suppose we have a model M and a detection threshold T. Let x = (ω, I) be an example of a location within an image I where score(ω) ≥ T. Let D be a distribution over such examples. For a sequence of thresholds t = ((t1, t′1), . . . , (tn, t′n)) let csc-score(t, ω) be the score computed for ω by the cascade algorithm using t. If the cascade prunes ω we say csc-score(t, ω) = −∞. We define the error of t on D as the probability that the cascade algorithm will incorrectly compute score(ω) on a random example from D,

    error(t) = P_{x∼D}( csc-score(t, ω) ≠ score(ω) ).    (4)

We would like to find a sequence of thresholds that has a small error. Note that in practice we are only interested in having a small error on positive examples. In particular we do not care if the cascade incorrectly prunes a negative example that scores above T. Thus we can take D to be a distribution over high-scoring positive examples. Below we show that we can learn a good sequence of thresholds by looking at a small number of examples from D.

In analogy to PAC learning [14] we would like to select thresholds that have a small error with high probability. We say a sequence of thresholds t is (ε, δ) probably approximately admissible (PAA) if the probability that the error of t is greater than ε is bounded by δ,

    P( error(t) > ε ) ≤ δ.    (5)

Let δ1, . . . , δn be the optimal displacements for the non-root parts of M on an example x = (ω, I). We can define partial scores that take into account the first i parts, and the first i parts minus the i-th deformation cost,

    xi = m0(ω) + Σ_{j=1}^{i−1} [ mj(aj(ω) ⊕ δj) − dj(δj) ],    (6)

    x′i = xi − di(δi).    (7)

The star-cascade algorithm will find an optimal configuration and score for x if and only if xi ≥ ti and x′i ≥ t′i for all 1 ≤ i ≤ n. Let X be m independent samples from D. We can select thresholds by picking

    ti = min_{x∈X} xi ,    t′i = min_{x∈X} x′i.    (8)

These are the tightest thresholds that make no mistakes on X. The following theorem shows they also have low error with high probability provided that m is sufficiently large.

Theorem 1. If we select t according to equation (8) using m ≥ (2n/ε) ln(2n/δ) samples from D then t is (ε, δ) probably approximately admissible with respect to D.

Proof. By a union bound, error(t) ≤ ε if P_{x∼D}(xi < ti) and P_{x∼D}(x′i < t′i) are ≤ ε/(2n) for all i. Thus error(t) ≤ ε unless for some i all m samples of the xi or x′i are above the ε/(2n)-th percentile of their distribution. The probability of that event is at most 2n(1 − ε/(2n))^m. To bound this by δ we only need m ≥ (2n/ε) ln(2n/δ) samples.
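Equation (8) amounts to taking per-stage minima of partial scores over a sample of positive examples. A minimal sketch of that selection step is given below; the data layout is our own assumption.

```python
def select_paa_thresholds(partial_scores):
    """Pick the tightest thresholds that make no mistakes on the sample X (equation 8).

    partial_scores : one entry per positive example x in X; each entry is a list of
                     (x_i, x'_i) pairs for i = 1..n, obtained by running the model on x
                     without pruning and recording the accumulated score before adding
                     part i (equation 6) and that score minus the i-th deformation cost
                     at the optimal displacement (equation 7).
    Returns (t, tp) with t[i-1] = min_x x_i and tp[i-1] = min_x x'_i.
    """
    n = len(partial_scores[0])
    t  = [min(example[i][0] for example in partial_scores) for i in range(n)]
    tp = [min(example[i][1] for example in partial_scores) for i in range(n)]
    return t, tp
```

As a rough illustration of Theorem 1, with n = 8 parts and ε = δ = 0.05 the bound asks for m ≥ (2n/ε) ln(2n/δ) = 320 · ln 320 ≈ 1850 positive examples; the specific numbers are only an example of how the bound scales.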
5. Simplified Part Appearance Models

So far we have considered a cascade for star models that is defined by a hierarchy of n + 1 models. Given a predefined part order, the i-th model is formed by adding the i-th part to the (i − 1)-st model. The goal of the cascade is to detect objects while making as few appearance model evaluations as possible. The star-cascade algorithm achieves this goal by pruning hypotheses using intermediate scores.

A complementary approach is to consider a simplified appearance model for each part, m̂i(ω), that computes an inexpensive approximation of mi(ω). A hierarchy of 2(n + 1) models can be defined with the first n + 1 models constructed as before, but using the simplified appearance models, and the second n + 1 models defined by sequentially removing one of the remaining simplified appearance models m̂i and replacing it with the original one mi.

With simplified parts, the star-cascade operates as before for stages 1 through n + 1, except that pruning decisions are based on the (memoized) evaluations of the simplified appearance models m̂i(ω). During the remaining stages, full appearance models replace simplified ones. When replacing an appearance model, the algorithm must redo the search over the deformation space because the optimal placement may change. But now we will often prune a hypothesis before evaluating any of the expensive appearance models.
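One plausible realization of a single replacement stage is sketched below; this is our own bookkeeping for illustration, not necessarily the exact update used in the paper. The running score s drops part i's approximate contribution and the search over ∆ is redone with the full appearance model.

```python
def upgrade_part(s, approx_contrib, eval_full_mi, dcost_i, anchor_i, deltas, w, tp_i):
    """One 'replacement' stage from the second half of the hierarchy (our own sketch):
    remove part i's simplified contribution from the running score s for root location
    w, then redo the search over displacements using the full appearance model."""
    base = s - approx_contrib                  # score without part i's approximate term
    best = float("-inf")
    ay, ax = w[0] + anchor_i[0], w[1] + anchor_i[1]
    for dy, dx in deltas:
        if base - dcost_i((dy, dx)) < tp_i:    # deformation pruning still applies
            continue
        best = max(best, eval_full_mi((ay + dy, ax + dx)) - dcost_i((dy, dx)))
    return base + best
```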
This version of the algorithm requires 4n + 1 intermediate thresholds. Just as before, these thresholds can be selected using the method from the previous section.

For the models in [11, 8], simplified appearance models can be defined by projecting the HOG features and the weight vectors in the part filters to a low dimensional space. We did PCA on a large sample of HOG features from training images in the PASCAL datasets. A simplified appearance model can be specified by the projection into the top k principal components. For the 31-dimensional HOG features used in [8], a setting of k = 5 leads to appearance models that are approximately 6 times faster to evaluate than the original ones. This approach is simple and only introduces a small amount of overhead — the cost of projecting each feature vector in the feature pyramid onto the top k principal components.
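The projection idea can be sketched as follows; this is our own minimal illustration (array shapes, centering, and the naive correlation loop are assumptions), whereas the system of [8] would precompute and cache the projected feature pyramid.

```python
import numpy as np

def pca_basis(hog_samples, k=5):
    """Top-k principal components of a sample of 31-dimensional HOG feature vectors.
    hog_samples: array of shape (num_samples, 31)."""
    centered = hog_samples - hog_samples.mean(axis=0)
    _, _, Vt = np.linalg.svd(centered, full_matrices=False)
    return Vt[:k]                                   # shape (k, 31)

def simplified_filter_response(feat_map, weights, basis):
    """Approximate a part filter response using k-dimensional projected features.

    feat_map : (H, W, 31) HOG feature map from one level of the pyramid
    weights  : (h, w, 31) part filter
    basis    : (k, 31) principal components from pca_basis
    """
    proj_map = feat_map @ basis.T                   # (H, W, k) projected features
    proj_w   = weights @ basis.T                    # (h, w, k) projected filter
    h, w, _ = proj_w.shape
    H, W, _ = proj_map.shape
    out = np.zeros((H - h + 1, W - w + 1))
    for y in range(out.shape[0]):                   # naive cross-correlation
        for x in range(out.shape[1]):
            out[y, x] = np.sum(proj_map[y:y + h, x:x + w] * proj_w)
    return out
```

Each dot product now involves k = 5 numbers instead of 31, which is roughly the 6x saving quoted above; projecting the pyramid once per image is the small overhead the paper mentions.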
6. General Grammar Models

Here we consider a fairly general class of grammar models for representing objects in terms of parts. It includes tree-structured pictorial structure models as well as more general models that have variable structure. For example, we can define a person model in which the face part is composed of eyes, a nose, and either a smiling or frowning mouth.

We follow the framework and notation in [10]. Let N be a set of nonterminal symbols and T be a set of terminal symbols. Let Ω be a set of possible locations for a symbol within an image. For ω ∈ Ω we use X(ω) to denote the placement of a symbol at a location in the image.

Appearance models for the terminals are defined by a function score(A, ω) that specifies a score for A(ω). The appearance of nonterminals is defined in terms of expansions into other symbols. Possible expansions are defined by a set of scored production rules of the form

    X(ω0) --s--> Y1(ω1), . . . , Yn(ωn),    (9)

where X ∈ N, Yi ∈ N ∪ T, and s ∈ R is a score. To avoid enumerating production rules that differ only by symbol placement, we define grammar models using a set of parameterized production schemas of the form

    X(ω0(z)) --α(z)--> Y1(ω1(z)), . . . , Yn(ωn(z)).    (10)

Each schema defines a collection of productions consisting of one production for each value of a parameter z. Given a fixed value of z, the functions ω0(z), . . . , ωn(z), and α(z) yield a single production of the form in (9).

Star models can be represented using a nonterminal Xi and a terminal Ai for each part. We have score(Ai, ω) = mi(ω). A placement of the root nonterminal X0 defines ideal locations for the remaining parts. This is captured by an instance of the following production for each ω,

    X0(ω) --0--> A0(ω), X1(a1(ω)), . . . , Xn(an(ω)).    (11)

We can encode these productions using a schema where z ranges over Ω. These rules are called structural rules. A part can be displaced from its ideal location at the expense of a deformation cost. This is captured by an instance of the following production for each ω and δ,

    Xi(ω) --−di(δ)--> Ai(ω ⊕ δ).    (12)

We can encode these productions using a schema where z ranges over Ω × ∆. These rules are called deformation rules.

We restrict our attention to acyclic grammars. We also require that no symbol may appear in the right hand side of multiple schemas. We call this class no-sharing acyclic grammars. It includes pictorial structures defined by arbitrary trees as well as models where each part can be one of several subtypes. But it does not include models where a single part is used multiple times in one object instance, such as a car model where one wheel part is used for both the front and rear wheels.

For acyclic grammars, we can extend the scores of terminals to scores for nonterminals by the recursive equation

    score(X, ω) = max_{X(ω) --s--> Y1(ω1), . . . , Yn(ωn)} [ s + Σ_{i=1}^{n} score(Yi, ωi) ],    (13)

where the max is over rules with X(ω) in the left hand side. Since the grammar is acyclic, the symbols can be ordered such that a bottom-up dynamic programming algorithm can compute score tables VX[ω] = score(X, ω).

Scores can also be computed by a recursive top-down procedure. To compute score(X, ω) we consider every rule with X(ω) in the left hand side and sequentially compute the scores of placed symbols in the right hand side using recursive calls. Computed scores should be memoized to avoid recomputing them in the future.

For object detection we have a root symbol S and we would like to find all locations ω where score(S, ω) ≥ T. It is natural to introduce pruning into the top-down algorithm in analogy to the star-cascade method (Algorithm 1). As the top-down method traverses derivations in depth-first left-right order, we can keep track of a “prefix score” for the current derivation. Upon reaching X(ω) we can compare the current prefix score to a threshold t(X). If the prefix score is below t(X), then we could pretend score(X, ω) = −∞ without computing it. This is a form of pruning. However, generally there will be multiple requests for score(X, ω) and pruning may be problematic when memoized scores are reused. The value memoized for X(ω) depends on the prefix score of the first request for score(X, ω). Due to pruning, the memoized value might be different than what a later request would compute. In particular, if a later request has a higher prefix score, the associated derivation should undergo less pruning.
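As a reference point before the fix, the sketch below (ours, with an assumed rules/terminal_score interface) implements the unpruned top-down computation of equation (13) with memoization; the pscore bookkeeping introduced next is what makes pruning safe to combine with this kind of caching.

```python
from functools import lru_cache

def make_scorer(rules, terminal_score):
    """Memoized top-down scoring for an acyclic grammar (equation 13), with no pruning.
    Our own sketch: rules(X, w) enumerates (s, [(Y1, w1), ..., (Yn, wn)]) expansions of
    X placed at w (an empty list marks a terminal), terminal_score(A, w) = score(A, w),
    and symbols/locations are assumed hashable so results can be cached."""

    @lru_cache(maxsize=None)
    def score(X, w):
        expansions = rules(X, w)
        if not expansions:                     # terminal symbol
            return terminal_score(X, w)
        best = float("-inf")
        for s, rhs in expansions:              # max over rules with X(w) on the left
            best = max(best, s + sum(score(Y, wi) for Y, wi in rhs))
        return best

    return score
```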
To address this issue, we define pscore(Y, ω) to be the maximum prefix score over all requests for Y(ω). In a grammar with no sharing there is a single schema with Y in the right hand side. For each schema of the form in (10) we have

    pscore(Yi, ω) = max_{z ∈ ωi⁻¹(ω)} [ pscore(X, ω0(z)) + α(z) + Σ_{j=1}^{i−1} score(Yj, ωj(z)) ].    (14)

Thus pscores for Yi can be computed once we have pscores for X and scores for Y1, . . . , Yi−1. The set of parameter values z yielding ω = ωi(z) is denoted by ωi⁻¹(ω).

The grammar-cascade algorithm goes over schemas in a depth-first left-right order. It computes pscore(X, ω) before computing score(X, ω), and it prunes computation by comparing pscore(X, ω) to a threshold t(X). The procedure compute takes a symbol X and a table of prefix scores PX[ω] ≈ pscore(X, ω). It returns a table of values VX[ω] ≈ score(X, ω). These tables are not exact due to pruning. As in the star-cascade, we can pick thresholds using a sample of positive examples. For each symbol X we can pick the highest threshold t(X) that does not prune optimal configurations on the positive examples.

Data: X, PX
Result: VX
1    VX[ω] ← −∞
2    if X ∈ T then
3        if PX[ω] ≥ t(X) then VX[ω] ← score(X, ω)
4        return VX
5    end
6    foreach X(ω0(z)) --α(z)--> Y1(ω1(z)), . . . , Yn(ωn(z)) do
7        V0[ω] ← −∞
8        if PX[ω] ≥ t(X) then V0[ω] ← PX[ω]
9        for i = 1 to n do
10           Pi[ω] ← max_{z ∈ ωi⁻¹(ω)} ( α(z) + Σ_{j=0}^{i−1} Vj[ωj(z)] )
11           Vi ← compute(Yi, Pi)
12       end
13       V[ω] ← max_{z ∈ ω0⁻¹(ω)} ( α(z) + Σ_{j=1}^{n} Vj[ωj(z)] )
14       VX[ω] ← max(VX[ω], V[ω])
15   end
16   return VX
Procedure compute

For a terminal X, compute evaluates score(X, ω) at locations ω with high pscores. For a nonterminal, compute loops over schemas with X in the left hand side. Line 10 computes pscores for Yi before calling compute recursively to obtain scores for Yi. Line 13 computes scores for X under a particular schema. The result VX is the running max of scores under different schemas.

To understand the worst case runtime of this algorithm we need to consider the max over z that appears in lines 10 and 13. Suppose each schema is a structural rule similar to (11) with a bounded number of symbols in the right hand side or a deformation rule similar to (12). For a structural rule (11), ω0⁻¹(ω) = {ω}. In this case, line 13 simply sums the scores of the right hand side symbols after shifting each by its ideal displacement ai(ω). For a deformation rule (12), ω0⁻¹(ω) = {ω} × ∆. In this case, line 13 takes a max over ∆. The situations are similar for line 10. Thus, assuming it takes O(1) time to evaluate score(A, ω) for a terminal, the runtime of the algorithm is O(|Ω||∆|) per schema. When scores over deformation rules can be computed via fast distance transforms the runtime becomes O(|Ω|) per schema.

Detection is performed by calling compute on a root symbol S with PS[ω] = 0. In the case of tree-structured pictorial structure models where fast distance transforms can be used, the time of compute(S, PS) is O(n|Ω|) for a model with n parts. This is the same as bottom-up dynamic programming with fast distance transforms. In this case the grammar-cascade algorithm has better worst-case time than the star-cascade algorithm, but by implementing both methods we found that the specialized star-cascade algorithm outperformed compute by about a factor of two. This empirical result comes from a confluence of two factors: the restricted structure of the star model avoids the need to maintain prefix score tables, and for the models we consider |∆| is small enough that brute force search over it, for a small number of locations, outperforms fast distance transforms over the full space Ω.
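To connect the grammar formalism back to Section 2, the sketch below (again our own illustration, reusing the hypothetical rules(X, w) interface from the earlier sketch) writes out the structural rule (11) and the deformation rules (12) for a star model with a root and n parts.

```python
def star_grammar_rules(n, anchors, dcost, deltas):
    """Encode a star model as a no-sharing acyclic grammar, cf. rules (11) and (12).
    Symbols are strings: X0 is the root nonterminal, Xi/Ai are the nonterminal and
    terminal for part i; terminal scores score(Ai, w) = m_i(w) are supplied separately."""
    def rules(X, w):
        y, x = w
        if X == "X0":
            # structural rule (11): placing the root fixes the ideal part locations
            rhs = [("A0", w)] + [("X%d" % i, (y + anchors[i][0], x + anchors[i][1]))
                                 for i in range(1, n + 1)]
            return [(0.0, rhs)]
        if X.startswith("X"):
            i = int(X[1:])
            # deformation rules (12): one production per displacement, scored -d_i
            return [(-dcost[i]((dy, dx)), [("A%d" % i, (y + dy, x + dx))])
                    for dy, dx in deltas]
        return []                                  # terminals A_i have no expansions
    return rules
```

Feeding these rules to the memoized scorer sketched earlier reproduces the star-model score(ω) of Section 2, though without the pruning that the compute procedure adds.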
7. Experimental Results

To evaluate our algorithm for star-structured models we compared it to the baseline detection method based on dynamic programming and distance transforms. We used the publicly available system from [8] as a testbed. We note that [8] provides an implementation of the baseline detection algorithm that is already quite efficient.

We evaluated our algorithm by looking at the detection time speedup and average precision (AP) score with respect to the baseline. The evaluation was done over all 20 classes of the PASCAL 2007 dataset [5] as well as on the INRIA Person dataset [3]. For the PASCAL experiments we obtained the six-component models used by the UoC-TTI entry in the 2009 PASCAL VOC Challenge [7]. For the INRIA experiments we obtained the one-component model from [8]. These models achieve state-of-the-art detection results. Our experiments show that the cascade algorithm achieves a significant speedup, of more than 20 times on average, with negligible decrease in detection accuracy.

The PASCAL models were trained on the 2009 training and validation data, which includes the 2008 data as a subset.
We wanted “fresh” positive training examples for selecting thresholds, separate from the examples used to train the models, so we conducted our evaluation on the PASCAL 2007 dataset. Testing on the 2007 dataset ensured that the statistics for the threshold training and test data were the same. Note that testing on the 2007 dataset using models trained on the 2009 dataset might not lead to the best possible detection accuracy, but we are only interested in the relative performance of the cascade and the baseline method.

In the case of the INRIA Person dataset we did not have access to fresh positive examples, so we used the same examples with which the model was trained. Even though the PAA threshold theory does not apply in this setting, the cascade achieved exactly the same AP scores as the baseline.

Our implementation of the cascade algorithm has a single parameter controlling the number of components used for the PCA approximation of the low-level features. This was set to 5 in advance based on the magnitude of the eigenvalues from the PCA of HOG features.

We compared the runtime of the cascade algorithm versus the baseline for two global detection threshold settings. A higher global threshold allows for more pruning in the cascade at the cost of obtaining a lower recall rate. The first setting was selected so that the resulting precision-recall curve should reach the precision-equals-recall point. Empirically we found that this setting results in a detector with typical AP scores within 5 points of the maximum score. This setting is tuned for speed without sacrificing too much recall. The second setting results in the maximum possible AP score with less emphasis on speed. This configuration requires picking a global threshold so the detector achieves its full recall range. We approximated this goal by selecting a global threshold such that the detector would return results down to the precision equals 5% level. For each global detection threshold we picked pruning thresholds using the procedure outlined in Section 4.

Figure 3 illustrates precision-recall curves obtained with the cascade and baseline methods. The performance of the cascade algorithm follows the performance of the baseline very closely. The complete experimental results are summarized in Tables 1 and 2. We see that the cascade method achieves AP scores that are essentially identical to the baseline for both global threshold settings. Sometimes the cascade achieves a slightly higher AP score due to pruning of false positives. The maximum recall obtained with the cascade is only slightly below the baseline, indicating that very few true positives were incorrectly pruned. The difference in recall rates is reported as the recall gap.

For the purpose of timing the algorithms, we ignored the time it takes to compute the low-level feature pyramid from an image as that is the same for both methods (and can be shared among different detectors). Feature pyramid generation took an average of 459ms per image on the PASCAL dataset and 730ms on the INRIA dataset. With the precision-equals-recall threshold, the cascade detector ran 22 times faster than the baseline on average. As an example, the mean detection time per image for the motorbike model was 10.1s for the baseline versus 313ms for the cascade, and the mean time per image for the person model was 8.5s for the baseline versus 682ms for the cascade. We also tested the star-cascade algorithm without PCA filters.
In this mode, the mean speedup dropped to 8.7 over the baseline at the precision-equals-recall level.

Note that [8] reports detection times of around 2 seconds per image because it uses a parallel implementation of the baseline algorithm. We turned off that feature to facilitate comparison. Both the baseline and the cascade are equally easy to parallelize. For example, one could search over different scales at the same time. All experiments were conducted using single-threaded implementations on a computer with a 2.67GHz Intel Core i7 920 CPU running Linux.

8. Conclusion

The results of this paper are both theoretical and practical. At a theoretical level we have shown how to construct a cascade variant of a dynamic programming (DP) algorithm. From an abstract viewpoint, a DP algorithm fills values in DP tables. In the cascade version the tables are partial — not all values are computed. Partial DP tables are also used in A* search algorithms. However, the cascade variant of DP runs without the overhead of priority queue operations and with better cache coherence. A second theoretical contribution is a training algorithm for the thresholds used in the cascade and an associated high-confidence bound on the error rate of those thresholds — the number of desired detections that are missed because of the pruning of the intermediate DP tables.

At a practical level, this paper describes a frame-rate implementation of a state-of-the-art object detector. The detector can easily be made to run at several frames per second on a multicore processor. This should open up new applications for this class of detectors in areas such as robotics and HCI. It should also facilitate future research by making richer models computationally feasible. For example, the techniques described in this paper should make it practical to extend the deformable model paradigm for object detection to include search over orientation or other pose parameters. We believe the performance of deformable model detectors can still be greatly improved.

References

[1] Y. Amit and D. Geman. A computational model for visual selection. Neural Computation, 11(7):1691–1715, 1999.
[2] L. Bourdev and J. Brandt. Robust object detection via soft cascade. In CVPR, 2005.
Figure 3. Sample precision-recall curves for bicycle, car, person, and INRIA person with the global threshold set to hit the precision-equals-recall point (top row) and the precision = 0.05 level (bottom row).

Class    Speedup factor   Baseline AP   Cascade AP   Recall gap
aero          22.7            21.1          21.1         0.7
bike          22.1            43.1          42.9         3.9
bird          16.5            10.6          10.4         1.1
boat          11.6            12.2          12.4         3.0
bottle        22.1            24.0          24.1         0.4
bus           36.0            42.2          42.5         4.7
car           13.3            48.0          48.1         2.0
cat           25.6            15.9          15.5         1.8
chair         23.4            13.4          13.4         1.5
cow           23.2            19.0          19.0         2.0
table         29.8             7.1           8.0         1.3
dog           15.2            10.7          10.7         1.2
horse         16.2            31.3          31.3         0.9
mbike         32.6            32.9          33.0         1.5
person        12.7            34.4          34.8         0.0
plant         23.3            12.0          12.0         1.0
sheep         32.8            20.3          20.3         3.7
sofa          18.1            20.8          20.2         1.7
train         23.3            29.3          28.8         0.3
tv            27.2            36.3          36.5         4.6
inria         13.5            80.1          80.1         0.3

Table 1. Results for the global threshold set so each PR curve would reach the precision = recall point. Mean speedup for all classes = 22.0.

Class    Speedup factor   Baseline AP   Cascade AP   Recall gap
aero          13.6            22.8          22.7         0.4
bike          18.4            49.4          49.3         1.2
bird          16.9            10.6          10.6         0.2
boat           9.9            12.9          13.0         0.8
bottle        12.3            27.1          26.6         1.5
bus           19.0            47.4          47.4         2.3
car           10.1            50.2          50.2         1.0
cat           13.6            18.8          18.8         0.8
chair         13.4            15.7          15.7         0.0
cow           15.1            23.6          23.1         0.7
table         19.1            10.3          11.3         0.1
dog           12.0            12.1          12.3         1.6
horse         13.1            36.4          36.0         1.4
mbike         21.5            37.1          37.1         1.8
person         5.6            37.2          37.6         0.0
plant         11.6            13.2          13.6         2.3
sheep         23.9            22.6          22.7         3.7
sofa           9.8            22.9          23.1         1.7
train         15.9            34.7          34.2         0.7
tv            14.3            40.0          40.0         2.1
inria         11.1            85.6          85.6         0.0

Table 2. Results for the global threshold set so each PR curve would reach precision = 0.05. Mean speedup for all classes = 14.3.

[3] N. Dalal and B. Triggs. Histograms of oriented gradients for human detection. In CVPR, 2005.
[4] M. Elad, Y. Hel-Or, and R. Keshet. Pattern detection using a maximal rejection classifier. PRL, 23(12):1459–1471, 2002.
[5] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman. The PASCAL VOC 2007 Results.
[6] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman. The PASCAL VOC 2008 Results.
[7] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman. The PASCAL VOC 2009 Results.
[8] P. Felzenszwalb, R. Girshick, D. McAllester, and D. Ramanan. Object detection with discriminatively trained part based models. PAMI, 2009.
[9] P. Felzenszwalb and D. Huttenlocher. Pictorial structures for object recognition. IJCV, 61(1):55–79, 2005.
[10] P. Felzenszwalb and D. McAllester. Object detection grammars. University of Chicago, CS Dept., Tech. Rep. 2010-02.
[11] P. Felzenszwalb, D. McAllester, and D. Ramanan. A discriminatively trained, multiscale, deformable part model. In CVPR, 2008.
[12] F. Fleuret and D. Geman. Coarse-to-fine face detection. IJCV, 41(1):85–107, 2001.
[13] S. Gangaputra and D. Geman. A design principle for coarse-to-fine classification. In CVPR, 2006.
[14] M. Kearns and U. Vazirani. An Introduction to Computational Learning Theory. MIT Press, 1994.
[15] J. Šochman and J. Matas. WaldBoost – learning for time constrained sequential detection. In CVPR, 2005.
[16] P. Viola and M. Jones. Rapid object detection using a boosted cascade of simple features. In CVPR, 2001.