IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, VOL. 34, NO. 4, APRIL 2012

Pedestrian Detection: An Evaluation of the State of the Art

Piotr Dollár, Christian Wojek, Bernt Schiele, and Pietro Perona

Abstract—Pedestrian detection is a key problem in computer vision, with several applications that have the potential to positively impact quality of life. In recent years, the number of approaches to detecting pedestrians in monocular images has grown steadily. However, multiple data sets and widely varying evaluation protocols are used, making direct comparisons difficult. To address these shortcomings, we perform an extensive evaluation of the state of the art in a unified framework. We make three primary contributions: 1) We put together a large, well-annotated, and realistic monocular pedestrian detection data set and study the statistics of the size, position, and occlusion patterns of pedestrians in urban scenes, 2) we propose a refined per-frame evaluation methodology that allows us to carry out probing and informative comparisons, including measuring performance in relation to scale and occlusion, and 3) we evaluate the performance of sixteen pretrained state-of-the-art detectors across six data sets. Our study allows us to assess the state of the art and provides a framework for gauging future efforts. Our experiments show that despite significant progress, performance still has much room for improvement. In particular, detection is disappointing at low resolutions and for partially occluded pedestrians.

Index Terms—Pedestrian detection, object detection, benchmark, evaluation, data set, Caltech Pedestrian data set.

1 INTRODUCTION

PEOPLE are among the most important components of a machine's environment, and endowing machines with the ability to interact with people is one of the most interesting and potentially useful challenges for modern engineering. Detecting and tracking people is thus an important area of research, and machine vision is bound to play a key role. Applications include robotics, entertainment, surveillance, care for the elderly and disabled, and content-based indexing. Just in the US, nearly 5,000 of the 35,000 annual traffic crash fatalities involve pedestrians [1]; hence the considerable interest in building automated vision systems for detecting pedestrians [2].

While there is much ongoing research in machine vision approaches for detecting pedestrians, varying evaluation protocols and the use of different data sets make direct comparisons difficult. Basic questions such as "Do current detectors work well?" "What is the best approach?" "What are the main failure modes?" and "What are the most productive research directions?" are not easily answered.

Our study aims to address these questions. We focus on methods for detecting pedestrians in individual monocular images; for an overview of how detectors are incorporated into full systems we refer readers to [2]. Our approach is three-pronged: We collect, annotate, and study a large data set of pedestrian images collected from a vehicle navigating in urban traffic; we develop informative evaluation methodologies and point out pitfalls in previous experimental procedures; finally, we compare the performance of 16 pretrained pedestrian detectors on six publicly available data sets, including our own. Our study allows us to assess the state of the art and suggests directions for future research. All results of this study, and the data and tools for reproducing them, are posted on the project website: www.vision.caltech.edu/Image_Datasets/CaltechPedestrians/.

(P. Dollár and P. Perona are with the Department of Electrical Engineering, California Institute of Technology, MC 136-93, 1200 E. California Blvd., Pasadena, CA 91125. E-mail: {pdollar, perona}@caltech.edu. C. Wojek and B. Schiele are with the Max Planck Institute for Informatics, Campus E1 4, Saarbrücken 66123, Germany. E-mail: {cwojek, schiele}@mpi-inf.mpg.de. Manuscript received 2 Nov. 2010; revised 17 June 2011; accepted 3 July 2011; published online 28 July 2011. Recommended for acceptance by G. Mori. Digital Object Identifier no. 10.1109/TPAMI.2011.155.)

1.1 Contributions

Data set. In earlier work [3], we introduced the Caltech Pedestrian Data Set, which includes 350,000 pedestrian bounding boxes (BBs) labeled in 250,000 frames and remains the largest such data set to date. Occlusions and temporal correspondences are also annotated. Using the extensive ground truth, we analyze the statistics of pedestrian scale, occlusion, and location, and help establish the conditions under which detection systems must operate.

Evaluation methodology. We aim to quantify and rank detector performance in a realistic and unbiased manner. To this effect, we explore a number of choices in the evaluation protocol and their effect on reported performance. Overall, the methodology has changed substantially since [3], resulting in a more accurate and informative benchmark.

Evaluation. We evaluate 16 representative state-of-the-art pedestrian detectors (previously we evaluated seven [3]). Our goal was to choose diverse detectors that were most promising in terms of originally reported performance. We avoid retraining or modifying the detectors to ensure each method was optimized by its authors. In addition to overall performance, we explore detection rates under varying levels of scale and occlusion and on clearly visible pedestrians. Moreover, we measure localization accuracy and analyze runtime.

To increase the scope of our analysis, we also benchmark the 16 detectors using a unified evaluation framework on
six additional pedestrian detection data sets, including the ETH [4], TUD-Brussels [5], Daimler [6], and INRIA [7] data sets and two variants of the Caltech data set (see Fig. 1). By evaluating across multiple data sets, we can rank detector performance, analyze the statistical significance of the results, and, more generally, draw conclusions both about the detectors and the data sets themselves.

Two groups have recently published surveys which are complementary to our own. Geronimo et al. [2] performed a comprehensive survey of pedestrian detection for advanced driver assistance systems, with a clear focus on full systems. Enzweiler and Gavrila [6] published the Daimler detection data set and an accompanying evaluation of three detectors, performing additional experiments integrating the detectors into full systems. We instead focus on a more thorough and detailed evaluation of state-of-the-art detectors.

This paper is organized as follows: We introduce the Caltech Pedestrian Data Set and analyze its statistics in Section 2; a comparison of existing data sets is given in Section 2.4. In Section 3, we discuss evaluation methodology in detail. A survey of pedestrian detectors is given in Section 4.1, and in Section 4.2 we discuss the 16 representative state-of-the-art detectors used in our evaluation. In Section 5, we report the results of the performance evaluation, both under varying conditions using the Caltech data set and on six additional data sets. We conclude with a discussion of the state of the art in pedestrian detection in Section 6.

Fig. 1. Example images (cropped) and annotations from six pedestrian detection data sets. We perform an extensive evaluation of pedestrian detection, benchmarking 16 detectors on each of these six data sets. By using multiple data sets and a unified evaluation framework we can draw broad conclusions about the state of the art and suggest future research directions.

2 THE CALTECH PEDESTRIAN DATA SET

Challenging data sets are catalysts for progress in computer vision. The Barron et al. [8] and Middlebury [9] optical flow data sets, the Berkeley Segmentation Data Set [10], the Middlebury Stereo Data Set [11], and the Caltech 101 [12], Caltech 256 [13], and PASCAL [14] object recognition data sets all improved performance evaluation, added challenge, and helped drive innovation in their respective fields. Much in the same way, our goal in introducing the Caltech Pedestrian Data Set is to provide a better benchmark, to help identify conditions under which current detectors fail, and thus to focus research effort on these difficult cases.

Fig. 2. Overview of the Caltech Pedestrian Data Set. (a) Camera setup. (b) Summary of data set statistics (1k = 10^3). The data set is large, realistic, and well annotated, allowing us to study statistics of the size, position, and occlusion of pedestrians in urban scenes and also to accurately evaluate the state of the art in pedestrian detection.

2.1 Data Collection and Ground Truthing

We collected approximately 10 hours of 30 Hz video (≈10^6 frames) taken from a vehicle driving through regular traffic in an urban environment (camera setup shown in Fig. 2a). The CCD video resolution is 640 × 480, and, not unexpectedly, the overall image quality is lower than that of still images of comparable resolution. There are minor variations in the camera position due to repeated mountings of the camera. The driver was independent from the authors of this study and had instructions to drive normally through neighborhoods in the greater Los Angeles metropolitan area chosen for their relatively high concentration of pedestrians, including LAX, Santa Monica, Hollywood, Pasadena, and Little Tokyo. In order to remove the effects of vehicle pitching and thus simplify annotation, the video was stabilized using the inverse compositional algorithm for image alignment by Baker and Matthews [15].

After video stabilization, 250,000 frames (in 137 approximately minute-long segments extracted from the 10 hours of video) were annotated with a total of 350,000 bounding boxes around ≈2,300 unique pedestrians. To make such a large-scale labeling effort feasible we created a user-friendly labeling tool, shown in Fig. 3. Its most salient aspect is an interactive procedure in which the annotator labels a sparse set of frames and the system automatically predicts pedestrian positions in intermediate frames. Specifically, after an annotator labels a bounding box around the same pedestrian in at least two frames, BBs in intermediate frames are interpolated using cubic interpolation (applied independently to each coordinate of the BBs). Thereafter, every time an annotator alters a BB, BBs in all the unlabeled frames are reinterpolated. The annotator continues until satisfied with the result. We experimented with more sophisticated interpolation schemes, including relying on tracking; however, cubic interpolation proved best. Labeling the ≈2.3 hours of video, including verification, took ≈400 hours total (spread across multiple annotators).
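To make the keyframe-interpolation scheme concrete, here is a minimal Python sketch of per-coordinate cubic interpolation between annotated frames. The function name and keyframe data layout are illustrative assumptions, not the authors' actual tool (which is not described at code level); it uses NumPy and SciPy and falls back to linear interpolation when fewer than four keyframes exist, since SciPy's cubic mode requires at least four points.

```python
import numpy as np
from scipy.interpolate import interp1d

def interpolate_track(keyframes, boxes, query_frames):
    """Interpolate bounding boxes between sparsely annotated frames.

    keyframes:    sorted 1D sequence of annotated frame indices (len >= 2)
    boxes:        (len(keyframes), 4) array of [x, y, w, h] annotations
    query_frames: frame indices at which to predict boxes

    Each of the four BB coordinates is interpolated independently,
    mirroring the per-coordinate cubic interpolation described above.
    """
    keyframes = np.asarray(keyframes, dtype=float)
    boxes = np.asarray(boxes, dtype=float)
    # interp1d(kind='cubic') needs at least 4 points; otherwise use linear.
    kind = "cubic" if len(keyframes) >= 4 else "linear"
    out = np.empty((len(query_frames), 4))
    for c in range(4):  # x, y, w, h handled independently
        f = interp1d(keyframes, boxes[:, c], kind=kind)
        out[:, c] = f(query_frames)
    return out

# Example: two keyframes 30 frames apart; in the tool, this would be
# re-run every time the annotator alters a BB.
pred = interpolate_track([0, 30],
                         [[100, 80, 24, 60], [130, 82, 26, 64]],
                         query_frames=np.arange(0, 31))
```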
Fig. 3. The annotation tool allows annotators to efficiently navigate and annotate a video in a minimum amount of time. Its most salient aspect is an interactive procedure where the annotator labels only a sparse set of frames and the system automatically predicts pedestrian positions in intermediate frames. The annotation tool is available on the project website.

For every frame in which a given pedestrian is visible, annotators mark a BB that indicates the full extent of the entire pedestrian (BB-full); for occluded pedestrians this involves estimating the location of hidden parts. In addition, a second BB is used to delineate the visible region (BB-vis); see Fig. 5a. During an occlusion event, the estimated full BB stays relatively constant while the visible BB may change rapidly. For comparison, in the PASCAL labeling scheme [14] only the visible BB is labeled and occluded objects are marked as "truncated."

Each sequence of BBs belonging to a single object was assigned one of three labels. Individual pedestrians were labeled "Person" (≈1,900 instances). Large groups for which it would have been tedious or impossible to label individuals were delineated using a single BB and labeled as "People" (≈300). In addition, the label "Person?" was assigned when clear identification of a pedestrian was ambiguous or easily mistaken (≈110).

2.2 Data Set Statistics

A summary of the data set is given in Fig. 2b. About 50 percent of the frames have no pedestrians, while 30 percent have two or more, and pedestrians are visible for ≈5 s on average. Below, we analyze the distribution of pedestrian scale, occlusion, and location. This serves to establish the requirements of a real-world system and to help identify constraints that can be used to improve automatic pedestrian detection systems.

2.2.1 Scale Statistics

We group pedestrians by their image size (height in pixels) into three scales: near (80 or more pixels), medium (30-80 pixels), and far (30 pixels or less). This division into three scales is motivated by the distribution of sizes in the data set, human performance, and automotive system requirements.

Fig. 4. (a) Distribution of pedestrian pixel heights. We define the near scale to include pedestrians over 80 pixels, the medium scale as 30-80 pixels, and the far scale as under 30 pixels. Most observed pedestrians (≈69%) are at the medium scale. (b) Distribution of BB aspect ratio; on average w ≈ 0.41h. (c) Using the pinhole camera model, a pedestrian's pixel height h is inversely proportional to the distance to the camera d: h/f ≈ H/d. (d) Pixel height h as a function of distance d. Assuming an urban speed of 55 km/h, an 80 pixel person is just 1.5 s away, while a 30 pixel person is 4 s away. Thus, for automotive settings, detection is most important at medium scales (see Section 2.2.1 for details).

In Fig. 4a, we histogram the heights of the 350,000 BBs using logarithmically sized bins. The heights are roughly lognormally distributed with a median of 48 pixels and a log-average of 50 pixels (the log-average is equivalent to the geometric mean and is more representative of typical values for lognormally distributed data than the arithmetic mean, which is 60 pixels in this case). Cutoffs for the near/far scales are marked. Note that ≈69% of the pedestrians lie in the medium scale, and that the cutoffs for the near/far scales correspond to about one standard deviation (in log space) from the log-average height of 50 pixels. Below 30 pixels, annotators have difficulty identifying pedestrians reliably.

Pedestrian width is likewise lognormally distributed and, moreover, so is the joint distribution of width and height (not shown). As any linear combination of the components of a multivariate normal distribution is also normally distributed, so should the BB aspect ratio be (defined as w/h), since log(w/h) = log(w) − log(h). A histogram of the aspect ratios, using logarithmic bins, is shown in Fig. 4b, and indeed the distribution is lognormal. The log-average aspect ratio is 0.41, meaning that typically w ≈ 0.41h. However, while BB height does not vary considerably given a constant distance to the camera, the BB width can change with the pedestrian's pose (especially arm positions and relative angle). Thus, although we could have defined the near, medium, and far scales using the width, the consistency of the height makes it better suited.

Detection in the medium scale is essential for automotive applications. We chose a camera setup that mirrors expected automotive settings: 640 × 480 resolution, 27 degree vertical field of view, and focal length fixed at 7.5 mm. The focal length in pixels is f ≈ 1,000 (obtained from 480/2/f = tan(27°/2), or using the camera's pixel size of 7.5 μm). Using a pinhole camera model (see Fig. 4c), an object's observed pixel height h is inversely proportional to the distance d to the camera: h ≈ Hf/d, where H is the true object height. Assuming H ≈ 1.8 m tall pedestrians, we obtain d ≈ 1,800/h m. With the vehicle traveling at an urban speed of 55 km/h (≈15 m/s), an 80 pixel person is just 1.5 s away, while a 30 pixel person is 4 s away (see Fig. 4d). Thus, detecting near scale pedestrians may leave insufficient time to alert the driver, while far scale pedestrians are less relevant.
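As a worked example of the pinhole-model reasoning above, the short Python sketch below converts pedestrian pixel height to distance and time-to-headway under the stated assumptions (f ≈ 1,000 pixels, H ≈ 1.8 m, 55 km/h); the function names are illustrative.

```python
def distance_m(pixel_height, f_px=1000.0, true_height_m=1.8):
    """Pinhole model (Fig. 4c): h = H * f / d  =>  d = H * f / h."""
    return true_height_m * f_px / pixel_height

def time_to_headway_s(pixel_height, speed_kmh=55.0):
    """Seconds until the vehicle reaches the pedestrian at constant speed."""
    return distance_m(pixel_height) / (speed_kmh / 3.6)

# Reproduces the numbers in the text at 55 km/h (~15 m/s):
print(time_to_headway_s(80))  # ~1.5 s (an 80 px person is 22.5 m away)
print(time_to_headway_s(30))  # ~4 s   (a 30 px person is 60 m away)
```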
We shall use the near, medium, and far scale definitions throughout this work. Most pedestrians are observed at the medium scale and, for safety systems, detection must occur in this scale as well. Human performance is also quite good in the near and medium scales but degrades noticeably at the far scale. However, most current detectors are designed for the near scale and perform poorly even at the medium scale (see Section 5). Thus, there is an important mismatch between current research efforts and the requirements of real systems. Using higher resolution cameras would help; nevertheless, given the good human performance and lower cost, we believe that accurate detection in the medium scale is an important and reasonable goal.

Fig. 5. Occlusion statistics. (a) For all occluded pedestrians annotators labeled both the full extent of the pedestrian (BB-full) and the visible region (BB-vis). (b) Most pedestrians (70 percent) are occluded in at least one frame, underscoring the importance of detecting occluded people. (c) Fraction of occlusion can vary significantly (0 percent occlusion indicates that a BB could not represent the extent of the visible region). (d) Occlusion is far from uniform, with pedestrians typically occluded from below. (e) To observe further structure in the types of occlusions that actually occur, we quantize occlusion into a fixed number of types. (f) Over 97 percent of occluded pedestrians belong to just a small subset of the hundreds of possible occlusion types. Details in Section 2.2.2.

2.2.2 Occlusion Statistics

Occluded pedestrians were annotated with two BBs that denote the visible and full pedestrian extent (see Fig. 5a). We plot the frequency of occlusion in Fig. 5b, i.e., for each pedestrian we measure the fraction of frames in which the pedestrian was at least somewhat occluded. The distribution has three distinct regions: pedestrians that are never occluded (29 percent), occluded in some frames (53 percent), and occluded in all frames (19 percent). Over 70 percent of pedestrians are occluded in at least one frame, underscoring the importance of detecting occluded people. Nevertheless, little previous work has been done to quantify occlusion or detection performance in the presence of occlusion (using real data).

For each occluded pedestrian, we can compute the fraction of occlusion as 1 minus the visible pedestrian area divided by the total pedestrian area (calculated from the visible and full BBs). Aggregating, we obtain the histogram in Fig. 5c. Over 80 percent occlusion typically indicates full occlusion, while 0 percent is used to indicate that a BB could not represent the extent of the visible region (e.g., due to a diagonal occluder). We further subdivide the cases in between into partial occlusion (1-35 percent area occluded) and heavy occlusion (35-80 percent occluded).

We investigated which regions of a pedestrian were most likely to be occluded. For each frame in which a pedestrian was partially to heavily occluded (1-80 percent fraction of occlusion), we created a binary 50 × 100 pixel occlusion mask using the visible and full BBs. By averaging the resulting ≈54k occlusion masks, we computed the probability of occlusion for each pixel (conditioned on the person being partially occluded); the resulting heat map is shown in Fig. 5d. Observe the strong bias for the lower portion of the pedestrian to be occluded, particularly the feet, and for the top portion, especially the head, to be visible. An intuitive explanation is that most occluding objects are supported from below as opposed to hanging from above (another, but less likely, possibility is that it is difficult for annotators to detect pedestrians if only the feet are visible). Overall, occlusion is far from uniform, and exploiting this finding could help improve the performance of pedestrian detectors.

Not only is occlusion highly nonuniform, there is significant additional structure in the types of occlusions that actually occur. Below, we show that after quantizing occlusion masks into a large number of possible types, nearly all occluded pedestrians belong to just a handful of the resulting types. To quantize the occlusions, each BB-full is registered to a common reference BB that has been partitioned into q_x by q_y regularly spaced cells; each BB-vis can then be assigned a type according to the smallest set of cells that fully encompass it. Fig. 5e shows three example types for q_x = 3, q_y = 6 (with two BB-vis per type). There are a total of Σ_{i=1}^{q_x} Σ_{j=1}^{q_y} ij = q_x q_y (q_x + 1)(q_y + 1)/4 possible types. For each type, we compute the percentage of the ≈54k occlusions assigned to it and produce a heat map using the corresponding occlusion masks. The top seven of 126 types for q_x = 3, q_y = 6 are shown in Fig. 5f. Together, these seven types account for nearly 97 percent of all occlusions in the data set. As can be seen, pedestrians are almost always occluded from either below or the side; more complex occlusions are rare. We repeated the same analysis with a finer partitioning of q_x = 4, q_y = 8 (not shown). Of the resulting 360 possible types, the top 14 accounted for nearly 95 percent of occlusions. The knowledge that very few occlusion patterns are common should prove useful in detector design.
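The following Python sketch illustrates the quantization just described; the cell-grid representation and helper names are my own illustrative choices, not the authors' code. It maps a visible-region BB (in coordinates normalized to the full BB) to the smallest encompassing cell range and verifies the type-count formula.

```python
import math

def occlusion_type(vis_box, qx=3, qy=6):
    """Assign a BB-vis to an occlusion type.

    vis_box: (x0, y0, x1, y1) of the visible region, normalized to [0, 1]
             within the registered BB-full.
    Returns the smallest axis-aligned range of grid cells (inclusive)
    that fully encompasses the visible region.
    """
    x0, y0, x1, y1 = vis_box
    c0 = math.floor(x0 * qx)        # leftmost covered column
    c1 = math.ceil(x1 * qx) - 1     # rightmost covered column
    r0 = math.floor(y0 * qy)        # topmost covered row
    r1 = math.ceil(y1 * qy) - 1     # bottommost covered row
    return (c0, c1, r0, r1)

def num_types(qx, qy):
    """Number of possible types: sum_{i,j} i*j = qx*qy*(qx+1)*(qy+1)/4."""
    return qx * qy * (qx + 1) * (qy + 1) // 4

assert num_types(3, 6) == 126   # matches the 126 types quoted in the text
assert num_types(4, 8) == 360   # and the finer 360-type partitioning

# A pedestrian occluded from below: only the top half is visible,
# so the type spans all 3 columns and the top 3 of 6 rows.
print(occlusion_type((0.0, 0.0, 1.0, 0.5)))  # -> (0, 2, 0, 2)
```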
2.2.3 Position Statistics

Viewpoint and ground plane geometry (Fig. 4c) constrain pedestrians to appear only in certain regions of the image. We compute the expected center position and plot the resulting heat map, log-normalized, in Fig. 6a. As can be seen, pedestrians are typically located in a narrow band running horizontally across the center of the image (the y-coordinate varies somewhat with distance/height). Note that the same constraints are not valid when photographing a scene from arbitrary viewpoints, e.g., in the INRIA data set.

In the collected data, many objects, not just pedestrians, tend to be concentrated in this same region. In Fig. 6b, we show a heat map obtained by using BBs generated by the HOG [7] pedestrian detector with a low threshold. About half of the detections, including both true and false positives, occur in the same band as the ground truth. Thus, incorporating this constraint could considerably speed up detection, but it would only moderately reduce false positives.

Fig. 6. Expected center location of pedestrian BBs for (a) ground truth and (b) HOG detections. The heat maps are log-normalized, meaning pedestrian location is even more concentrated than immediately apparent.

2.3 Training and Testing Data

We split the data set into training/testing sets and specify a precise evaluation methodology, allowing different research groups to compare detectors directly. We urge authors to adhere to one of the four training/testing scenarios described below.

The data were captured over 11 sessions, each filmed in one of five city neighborhoods as described. We divide the data roughly in half, setting aside six sessions for training (S0-S5) and five sessions for testing (S6-S10). For detailed statistics about the amount of data see the bottom row of Table 1. Images from all sessions (S0-S10) have been publicly available, as have annotations for the training sessions (S0-S5). At this time we are also releasing annotations for the testing sessions (S6-S10). Detectors can be trained using either the Caltech training data (S0-S5) or any "external" data, and tested on either the Caltech training data (S0-S5) or testing data (S6-S10). This results in four evaluation scenarios (a sketch of the cal0 fold construction follows the list):

- Scenario ext0: Train on any external data, test on S0-S5.
- Scenario ext1: Train on any external data, test on S6-S10.
- Scenario cal0: Perform 6-fold cross validation using S0-S5. In each phase use five sessions for training and the sixth for testing, then merge and report results over S0-S5.
- Scenario cal1: Train using S0-S5, test on S6-S10.
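As referenced above, here is a minimal Python sketch of the cal0 cross-validation split; the session names are taken from the text, while the function itself is an illustrative assumption rather than released benchmark code.

```python
def cal0_folds(sessions=("S0", "S1", "S2", "S3", "S4", "S5")):
    """Scenario cal0: 6-fold cross validation over the training sessions.

    Each fold holds out one session for testing and trains on the other
    five; results are then merged and reported over all of S0-S5.
    """
    for held_out in sessions:
        train = [s for s in sessions if s != held_out]
        yield train, held_out

for train, test in cal0_folds():
    print("train:", train, "| test:", test)
```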
TABLE 1. Comparison of Pedestrian Detection Data Sets (See Section 2.4 for Details)

Scenarios ext0/ext1 allow for evaluation of existing, pretrained pedestrian detectors, while cal0/cal1 involve training using the Caltech training data (S0-S5). The results reported here use the ext0/ext1 scenarios, thus allowing for a broad survey of existing pretrained pedestrian detectors. Authors are encouraged to retrain their systems on our large training set and evaluate under scenarios cal0/cal1. Authors should use ext0/cal0 during detector development, and only evaluate after finalizing all parameters under scenarios ext1/cal1.

2.4 Comparison of Pedestrian Data Sets

Existing data sets may be grouped into two types: 1) "person" data sets containing people in unconstrained pose in a wide range of domains and 2) "pedestrian" data sets containing upright, possibly moving people. The most widely used "person" data sets include subsets of the MIT LabelMe data [23] and the PASCAL VOC data sets [14]. In this work, we focus on pedestrian detection, which is more relevant to automotive safety.

Table 1 provides an overview of existing pedestrian data sets. The data sets are organized into three groups. The first includes older or more limited data sets. The second includes more comprehensive data sets, including the INRIA [7], ETH [4], and TUD-Brussels [5] pedestrian data sets and the Daimler detection benchmark (Daimler-DB) [6]. The final row contains information about the Caltech Pedestrian Data Set. Details follow below.

Imaging setup. Pedestrians can be labeled in photographs [7], [16], surveillance video [17], [24], and images taken from a mobile recording setup, such as a robot or vehicle [4], [5], [6]. Data sets gathered from photographs suffer from selection bias, as photographs are often manually selected, while surveillance videos have restricted backgrounds and thus rarely serve as a basis for detection data sets. Data sets collected by continuously filming from a mobile recording setup, such as the Caltech Pedestrian Data Set, largely eliminate selection bias (unless some scenes are staged by actors, as in [6]) while having moderately diverse scenes.

Data set size. The amount and type of data in each data set is given in the next six columns. The columns are: number of pedestrian windows (not counting reflections, shifts, etc.), number of images with no pedestrians (a † indicates cropped negative windows only), and number of uncropped images containing at least one pedestrian. The Caltech Pedestrian Data Set is two orders of magnitude larger than most existing data sets.

Data set type. Older data sets, including the MIT [16], CVC [19], and NICTA [22] pedestrian data sets and the Daimler classification benchmark (Daimler-CB) [21], tend to contain cropped pedestrian windows only. These are known as "classification" data sets, as their primary use is to train and test binary classification algorithms. In contrast, data sets that contain pedestrians in their original context are known as "detection" data sets and allow for the design and testing of full-image detection systems. The Caltech data set, along with all the data sets in the second group (INRIA, ETH, TUD-Brussels, and Daimler-DB), can serve as a "detection" data set.

Pedestrian scale. Table 1 additionally lists the 10th percentile, median, and 90th percentile pedestrian pixel heights for each data set.
While the INRIA data set has fairly high-resolution pedestrians, most data sets gathered from mobile platforms have median heights ranging from 50-100 pixels. This emphasizes the importance of detecting low-resolution pedestrians, especially for applications on mobile platforms.

Data set properties. The final columns summarize additional data set features, including the availability of color images, video data, temporal correspondence between BBs, and occlusion labels, and whether "per-image" evaluation and unbiased selection criteria were used.

As mentioned, in our performance evaluation we additionally use the INRIA [7], ETH [4], TUD-Brussels [5], and Daimler-DB [6] data sets. The INRIA data set helped drive recent advances in pedestrian detection and remains one of the most widely used despite its limitations. Much like the Caltech data set, the ETH, TUD-Brussels, and Daimler-DB data sets are all captured in urban settings using a camera mounted to a vehicle (or a stroller in the case of ETH). While annotated in less detail than the Caltech data set (see Table 1), each can serve as a "detection" data set and is thus suitable for use in our evaluation.

We conclude by summarizing the most important and novel aspects of the Caltech Pedestrian Data Set. The data set includes O(10^5) pedestrian BBs labeled in O(10^5) frames and remains the largest such data set to date. It contains color video sequences and includes pedestrians with a large range of scales and more scene variability than typical pedestrian data sets. Finally, it is the only data set with detailed occlusion labels and one of the few to provide temporal correspondence between BBs.

3 EVALUATION METHODOLOGY

Proper evaluation methodology is a crucial and surprisingly tricky topic. In general, there is no single "correct" evaluation protocol. Instead, we have aimed to make our evaluation protocol quantify and rank detector performance in a realistic, unbiased, and informative manner.

To allow for exact comparisons, we have posted the evaluation code, ground truth annotations, and detection results for all detectors on all data sets on the project website. Use of the exact same evaluation code (as opposed to a reimplementation) ensures consistent and reproducible comparisons. Additionally, given all the detector outputs, practitioners can define novel performance metrics with which to reevaluate the detectors. This flexibility is important because, while we make every effort to define realistic and informative protocols, performance evaluation is ultimately task dependent.

Overall, the evaluation protocol has changed substantially since our initial version described in [3], resulting in a more accurate and informative evaluation of the state of the art. We begin with an overview of full image evaluation in Section 3.1. Next, we discuss evaluation using subsets of the ground truth and detections in Sections 3.2 and 3.3, respectively. In Section 3.4, we propose and motivate standardizing BB aspect ratio. Finally, in Section 3.5, we examine the alternative per-window (PW) evaluation methodology.
3.1 Full Image Evaluation

We perform single frame evaluation using a modified version of the scheme laid out in the PASCAL object detection challenges [14]. A detection system needs to take an image and return a BB and a score or confidence for each detection. The system should perform multiscale detection and any necessary nonmaximal suppression (NMS) for merging nearby detections. Evaluation is performed on the final output: the list of detected BBs.

A detected BB (BB_dt) and a ground truth BB (BB_gt) form a potential match if they overlap sufficiently. Specifically, we employ the PASCAL measure, which states that their area of overlap must exceed 50 percent:

a_o = area(BB_dt ∩ BB_gt) / area(BB_dt ∪ BB_gt) > 0.5.    (1)

The evaluation is insensitive to the exact threshold as long as it is below about 0.6; see Fig. 7. For larger values, performance degrades rapidly as improved localization accuracy is necessary; thus, to focus on detection accuracy, we use the standard threshold of 0.5 throughout.

Each BB_dt and BB_gt may be matched at most once. We resolve any assignment ambiguity by performing the matching greedily. Detections with the highest confidence are matched first; if a detected BB matches multiple ground truth BBs, the match with the highest overlap is used (ties are broken arbitrarily). In rare cases this assignment may be suboptimal, e.g., in crowded scenes [25], but in practice the effect is minor. Unmatched BB_dt count as false positives and unmatched BB_gt as false negatives.

To compare detectors, we plot miss rate against false positives per image (FPPI) on log-log plots by varying the threshold on detection confidence (e.g., see Figs. 11 and 13). This is preferred to precision-recall curves for certain tasks, e.g., automotive applications, as typically there is an upper limit on the acceptable false positives per image rate independent of pedestrian density. We use the log-average miss rate to summarize detector performance, computed by averaging miss rate at nine FPPI rates evenly spaced in log-space in the range 10^-2 to 10^0 (for curves that end before reaching a given FPPI rate, the minimum miss rate achieved is used). Conceptually, the log-average miss rate is similar to the average precision [26] reported for the PASCAL challenge [14] in that it represents the entire curve by a single reference value. As curves are somewhat linear in this range (e.g., see Fig. 13), the log-average miss rate is similar to the performance at 10^-1 FPPI but in general gives a more stable and informative assessment of performance. A similar performance measure was used in [27].

We conclude by listing additional details. Some detectors output BBs with padding around the pedestrian (e.g., HOG outputs 128 × 64 BBs around 96 pixel tall people); such padding is cropped (see also Section 3.4). Methods usually detect pedestrians at some minimum size; to coax smaller detections, we upscale the input images. For ground truth, the full BB is always used for matching, not the visible BB, even for partially occluded pedestrians. Finally, all reported results on the Caltech data set are computed using every 30th frame (starting with the 30th frame) due to the high computational demands of some of the detectors evaluated (see Fig. 15).

Fig. 7. Log-average miss rates for 50 pixel or taller pedestrians as a function of the threshold on overlap area (see (1)). Decreasing the threshold below 0.5 has little effect on reported performance. However, increasing it over about 0.6 results in rapidly increasing log-average miss rates as improved localization accuracy is necessary.

3.2 Filtering Ground Truth

Often we wish to exclude portions of a data set during evaluation. This serves two purposes: 1) excluding ambiguous regions, e.g., crowds annotated as "People" where the locations of individuals are unknown, and 2) evaluating performance on various subsets of a data set, e.g., on pedestrians in a given scale range. However, we cannot simply discard a subset of ground truth labels, as this would cause overreporting of false positives.

Instead, to exclude portions of a data set, we introduce the notion of ignore regions. Ground truth BBs selected to be ignored, denoted BB_ig, need not be matched; however, matches to them are not considered mistakes either. For example, to evaluate performance on unoccluded pedestrians, we set all occluded pedestrian BBs to ignore. Evaluation is purposely lenient: Multiple detections can match a single BB_ig; moreover, a detection may match any subregion of a BB_ig. This is useful when the number or location of pedestrians within a single BB_ig is unknown, as in the case of groups labeled as "People." In the proposed criterion, a BB_dt can match any subregion of a BB_ig. The subregion that maximizes the area of overlap (1) with BB_dt is BB_dt ∩ BB_ig, and the resulting maximum area of overlap is

a_o = area(BB_dt ∩ BB_ig) / area(BB_dt).    (2)

Matching proceeds as before, except BB_dt matched to BB_ig do not count as true positives and unmatched BB_ig do not count as false negatives. Matches to BB_gt are preferred, meaning a BB_dt can only match a BB_ig if it does not match any BB_gt, and multiple matches to a single BB_ig are allowed. As discussed, setting a BB_gt to ignore is not the same as discarding it; in the latter case detections in the ignore regions would count as false positives.

Four types of BBs are always set to ignore: any BB under 20 pixels high, any BB truncated by image boundaries, and any BB containing a "Person?" (ambiguous cases) or containing "People." Detections within these regions do not affect performance.
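To make the protocol concrete, below is a compact Python sketch of the matching rules and the log-average miss rate summary. It is an illustrative reimplementation based on the description above, not the authors' released evaluation code (available on the project website), and the helper names are mine. Boxes are (x, y, w, h) tuples; the log average is computed as a geometric mean, following the text's note that "log-average" is equivalent to the geometric mean.

```python
import numpy as np

def iou(a, b):
    """PASCAL overlap (1): intersection over union of two (x, y, w, h) boxes."""
    x0, y0 = max(a[0], b[0]), max(a[1], b[1])
    x1 = min(a[0] + a[2], b[0] + b[2])
    y1 = min(a[1] + a[3], b[1] + b[3])
    inter = max(0.0, x1 - x0) * max(0.0, y1 - y0)
    return inter / (a[2] * a[3] + b[2] * b[3] - inter)

def ignore_overlap(dt, ig):
    """Criterion (2): intersection over detection area, for ignore regions."""
    x0, y0 = max(dt[0], ig[0]), max(dt[1], ig[1])
    x1 = min(dt[0] + dt[2], ig[0] + ig[2])
    y1 = min(dt[1] + dt[3], ig[1] + ig[3])
    inter = max(0.0, x1 - x0) * max(0.0, y1 - y0)
    return inter / (dt[2] * dt[3])

def match_frame(dts, scores, gts, igs, thr=0.5):
    """Greedy matching: highest-confidence detections first; each BB_gt is
    matched at most once; matches to BB_gt are preferred over BB_ig; and
    detections matching a BB_ig count as neither true nor false positives."""
    order = np.argsort(-np.asarray(scores))
    gt_used = [False] * len(gts)
    tp = fp = 0
    for i in order:
        overlaps = [iou(dts[i], g) if not gt_used[j] else 0.0
                    for j, g in enumerate(gts)]
        j = int(np.argmax(overlaps)) if overlaps else -1
        if j >= 0 and overlaps[j] > thr:
            gt_used[j] = True
            tp += 1
        elif any(ignore_overlap(dts[i], ig) > thr for ig in igs):
            pass  # matched an ignore region: not a mistake
        else:
            fp += 1
    fn = gt_used.count(False)
    return tp, fp, fn

def log_average_miss_rate(fppi, miss_rate):
    """Geometric mean of miss rates sampled at nine FPPI points evenly
    spaced in log space over [1e-2, 1e0]. fppi/miss_rate are arrays traced
    out by sweeping the confidence threshold, sorted by increasing FPPI;
    for curves ending before a reference rate, the minimum miss rate is used."""
    mrs = []
    for r in np.logspace(-2, 0, 9):
        below = miss_rate[fppi <= r]
        mrs.append(below[-1] if below.size else float(miss_rate.min()))
    return float(np.exp(np.mean(np.log(np.maximum(mrs, 1e-10)))))
```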
3.3 Filtering Detections

In order to evaluate on only a subset of the data set, we must filter detector responses outside the considered evaluation range (in addition to filtering ground truth labels). For example, when evaluating performance in a fixed scale range, detections far outside the scale range under consideration should not influence the evaluation. The filtering strategy used in our previous work [3] was too stringent and resulted in underreporting of detector performance (this was also independently observed by Walk et al. [28]). Here, we consider three possible filtering strategies: strict filtering (used in our previous work), postfiltering, and expanded filtering, which we believe most accurately reflects true performance. In all cases, matches to BB_gt outside the selected evaluation range count as neither true nor false positives.

Strict filtering. All detections outside the selected range are removed prior to matching. If a BB_gt inside the range was matched only by a BB_dt outside the range, then after strict filtering it would become a false negative. Thus, performance is underreported.

Postfiltering. Detections outside the selected evaluation range are allowed to match BB_gt inside the range. After matching, any unmatched BB_dt outside the range is removed and does not count as a false positive. Thus, performance is overreported.

Expanded filtering. Similar to strict filtering, except all detections outside an expanded evaluation range are removed prior to evaluation. For example, when evaluating in a scale range from S0 to S1 pixels, all detections outside a range S0/r to S1·r are removed. This can result in slightly more false positives than postfiltering, but also fewer missed detections than strict filtering.

Fig. 8. Comparison of detection filtering strategies used for evaluating performance in a fixed range of scales. Left: Strict filtering, used in our previous work [3], undercounts true positives, thus underreporting results. Right: Postfiltering undercounts false positives, thus overreporting results. Middle: Expanded filtering as a function of r. Expanded filtering with r = 1.25 offers a good compromise between strict and postfiltering for measuring both true and false positives accurately.

Fig. 8 shows the log-average miss rate on 50 pixel and taller pedestrians under the three filtering strategies (see Section 4 for detector details) and for various choices of r (for expanded filtering). Expanded filtering offers a good compromise(1) between strict filtering (which underreports performance) and postfiltering (which overreports performance). Moreover, detector ranking is robust to the exact value of r. Thus, throughout this work, we use expanded filtering (with r = 1.25); a sketch of this filtering step follows the footnote below.

(1) Additionally, strict and postfiltering are flawed as they can be easily exploited (either purposefully or inadvertently). Under postfiltering, generating large numbers of detections just outside the evaluation range can increase detection rate. Under strict filtering, running a detector in the exact evaluation range ensures all detections fall within that range, which can also artificially increase detection rate. To demonstrate the latter exploit, in Fig. 8 we plot the performance of CHNFTRS50, which is CHNFTRS [29] applied to detect pedestrians over 50 pixels. Its performance is identical under each strategy; however, its relative performance is significantly inflated under strict filtering. Expanded filtering cannot be exploited in either manner.
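As a concrete illustration of expanded filtering, this small Python function removes detections outside the widened range [S0/r, S1·r] prior to matching; the interface is an illustrative assumption, not the released evaluation code.

```python
def expanded_filter(detections, s0, s1, r=1.25):
    """Keep only detections whose pixel height lies in [s0/r, s1*r].

    detections: iterable of (x, y, w, h, score) tuples
    s0, s1:     evaluation scale range in pixels
    r:          expansion factor; r = 1.25 is used throughout the paper
    """
    lo, hi = s0 / r, s1 * r
    return [d for d in detections if lo <= d[3] <= hi]

# Example: evaluating the 50-80 pixel range with r = 1.25 keeps
# detections between 40 and 100 pixels tall; the 30 px one is dropped.
kept = expanded_filter([(0, 0, 20, 45, 0.9),
                        (0, 0, 30, 70, 0.8),
                        (0, 0, 12, 30, 0.7)], 50, 80)
```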
Fig. 9. Standardizing aspect ratios. Shown are profile views of two pedestrians. The original annotations are displayed in green (best viewed in color); these were used to crop fixed size windows centered on each pedestrian. Observe that while BB height changes gradually, BB width oscillates significantly as it depends on the positions of the limbs. To remove any effect pose may have on the evaluation of detection, during benchmarking width is standardized to be a fixed fraction of the height (see Section 3.4). The resulting BBs are shown in yellow.

3.4 Standardizing Aspect Ratios

Significant variability in both ground truth and detector BB width can have an undesirable effect on evaluation. We discuss the sources of this variability and propose to standardize the aspect ratio of both the ground truth and detected BBs to a fixed value. Doing so removes an extraneous and arbitrary choice from detector design and facilitates performance comparisons.

The height of annotated pedestrians is an accurate reflection of their scale, while the width also depends on pose. Shown in Fig. 9 are consecutive, independently annotated frames from the Daimler detection benchmark [6]. Observe that while BB height changes gradually, the width oscillates substantially. BB height depends on a person's actual height and distance from the camera, but the width additionally depends on the positions of the limbs, especially in profile views. Moreover, the typical width of annotated BBs tends to vary across data sets. For example, although the log-mean aspect ratio (see Section 2.2.1) in the Caltech and Daimler data sets is 0.41 and 0.38, respectively, in the INRIA data set [7] it is just 0.33 (possibly due to the predominance of stationary people).

Various detectors likewise return BBs of different widths. The aspect ratio of detections ranges from a narrow 0.34 for PLS to a wide 0.5 for MULTIFTR, while LATSVM attempts to estimate the width (see Section 4 for detector references). For older detectors that output uncropped BBs, we must choose the target width ourselves. In general, a detector's aspect ratio depends on the data set used during development and is often chosen after training.

To summarize, the width of both ground truth and detected BBs is more variable and arbitrary than the height. To remove any effects this may have on performance evaluation, we propose to standardize all BBs to an aspect ratio of 0.41 (the log-mean aspect ratio in the Caltech data set). We keep BB height and center fixed while adjusting the width accordingly.
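The standardization itself is a one-line transformation; here is a Python sketch under the paper's convention that height and center stay fixed while width is set to 0.41 of the height (the helper name is illustrative).

```python
def standardize_aspect(box, ratio=0.41):
    """Set BB width to ratio * height, keeping height and center fixed.

    box: (x, y, w, h) with (x, y) the top-left corner.
    0.41 is the log-mean aspect ratio of the Caltech data set.
    """
    x, y, w, h = box
    new_w = ratio * h
    return (x + (w - new_w) / 2.0, y, new_w, h)

# A wide 'arms out' annotation and a narrow one map to the same width:
print(standardize_aspect((100, 50, 40, 80)))  # -> (103.6, 50, 32.8, 80)
print(standardize_aspect((104, 50, 28, 80)))  # -> (101.6, 50, 32.8, 80)
```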