"Deep Learning for Crowd Counting": 2020 survey paper (released by Beihang University).pdf



I Introduction

I-A Related Works and Scope

I-B Related previous reviews and surveys

I-C Contributions of this paper

II Taxonomy for crowd counting

II-A Representative network architectures for crowd counting

II-A1 Basic CNN

II-A2 Multi-column

II-A3 Single column

II-B Learning paradigm

II-B1 Single-task based methods

II-B2 Multi-task based methods

II-C Inference manner

II-C1 Patch-based methods

II-C2 Whole image-based methods

II-D Supervision form

II-D1 Fully-supervised methods

II-D2 Un/semi/weakly/self-supervised methods

II-E Domain adaptation

II-F Instance-/image-based supervision

II-F1 Instance-level supervision

II-F2 Image-level supervision

III Datasets

III-A Most frequently-used datasets

III-B More recently datasets

III-C Some special crowd counting datasets

III-D Representing object Counting datasets in other fields

IV Evaluation metrics

IV-A Image-level metrics

IV-B Pixel-level metrics

IV-C point-level metrics

V Benchmarking and analysis

V-A Overall benchmarking results evaluation

V-B Properties-based evaluation

V-C Attributes-based analysis

VI Discussion

VI-A Model design

VI-B Dataset construction

VI-C The quality of density maps

VI-D Domain adaption or transfer learning

VI-E Robustness for background

VI-F Universality or generalization

VI-G Lightweight network

VI-H Combination of image and video

VI-I Wider-view crowd counting

VI-J Localization, classification and tracking beyond object counting

VI-K Small or tiny object counting

VII Conclusion

References

Biographies

Guangshuai Gao

Junyu Gao

Qingjie Liu

Qi Wang

Yunhong Wang

CNN-based Density Estimation and Crowd Counting: A Survey

Guangshuai Gao1,2, Junyu Gao3, Student Member, IEEE, Qingjie Liu1,2∗, Member, IEEE, Qi Wang3, Senior Member, IEEE, and Yunhong Wang1,2, Fellow, IEEE

arXiv:2003.12783v1 [cs.CV] 28 Mar 2020

Abstract: Accurately estimating the number of objects in a single image is a challenging yet meaningful task that has been applied in many areas such as urban planning and public safety. Among the various object counting tasks, crowd counting is particularly prominent because of its specific significance to social security and development. Fortunately, the techniques developed for crowd counting can be generalized to other related fields such as vehicle counting and environment survey, provided the specific characteristics of those fields are not taken into account. Many researchers have therefore devoted themselves to crowd counting, and many excellent works have emerged. These works have certainly helped the development of crowd counting; the question we should consider, however, is why they are effective for this task. Limited by the cost of time and energy, we cannot analyze all the algorithms. In this paper, we survey over 220 works to comprehensively and systematically study crowd counting models, mainly CNN-based density map estimation methods. Finally, according to the evaluation metrics, we select the top three performers on the crowd counting datasets and analyze their merits and drawbacks. Through our analysis, we expect to make reasonable inferences and predictions about the future development of crowd counting, and to provide feasible solutions for the problem of object counting in other fields. We provide the density maps and prediction results of some mainstream algorithms on the validation set of the NWPU dataset for comparison and testing. Density map generation and evaluation tools are also provided.
All the codes and evaluation results are made publicly available at https://github.com/gaoguangshuai/survey-for-crowd-counting.

Index Terms: Object counting, crowd counting, density estimation, CNNs.

Guangshuai Gao, Qingjie Liu and Yunhong Wang are with the State Key Laboratory of Virtual Reality Technology and Systems, Beihang University, Xueyuan Road, Haidian District, Beijing, 100191, China, and with the Hangzhou Innovation Institute, Beihang University, Hangzhou 310051, China (email: gaoguangshuai1990@buaa.edu.cn; qingjie.liu@buaa.edu.cn; yhwang@buaa.edu.cn).
Junyu Gao and Qi Wang are with the School of Computer Science and with the Center for Optical Imagery Analysis and Learning (OPTIMAL), Northwestern Polytechnical University, Xi'an 710072, Shaanxi, China (email: gjy3035@gmail.com; crabwq@gmail.com).
∗ Corresponding author: Qingjie Liu

I. INTRODUCTION

Over the past few decades, an increasing number of research communities have taken object counting as their main research direction. As a consequence, many works have been published to count the number of objects in images or videos across a wide variety of domains, such as crowd counting [1]–[13], cell microscopy [14]–[16], animals [17], vehicles [2], [18]–[20], leaves [21], [22] and environment survey [23], [24]. Among all these domains, crowd counting is of paramount importance, and it is crucial to building higher-level cognitive abilities in crowd scenarios, such as crowd analysis [25], [26] and video surveillance [27]. The growth of the world's population and the subsequent urbanization result in rapid crowd gathering in many scenarios such as parades, concerts and stadiums. In these scenarios, crowd counting plays an indispensable role in social safety and control management.

Considering the importance of crowd counting described above, more and more researchers have attempted to design various sophisticated approaches to address the problem of crowd counting.
Especially in the last half decade, with the advent of deep learning, Convolutional Neural Network (CNN) based models have overwhelmingly dominated various computer vision tasks, including crowd counting.

Although different tasks have their unique attributes, they share common features such as structural features and distribution patterns. Fortunately, the techniques for crowd counting can be extended to other fields with specific tools. Therefore, in this paper, we expect to provide a reasonable solution for other tasks through a deep excavation of the crowd counting task, especially of CNN-based density estimation and crowd counting models. Our survey covers various parts, ranging from algorithm taxonomy to some interesting under-explored research directions. Beyond taxonomically reviewing existing CNN-based crowd counting and density estimation models, representative datasets and evaluation metrics, we also investigate some factors and attributes that largely affect the performance of the designed model, such as distractors and negative samples. We provide the density maps and prediction results of some mainstream algorithms on the validation set of the NWPU dataset [28] for comparison and testing. Density map generation and evaluation tools are also provided. All the codes and evaluation results are made publicly available at https://github.com/gaoguangshuai/survey-for-crowd-counting.

A. Related Works and Scope

The various approaches for crowd counting are mainly divided into four categories: detection-based, regression-based, density estimation, and more recently CNN-based density estimation approaches. We focus on CNN-based density estimation and crowd counting models in this survey. For the sake of completeness, we first review some other related works in this subsection.

Early works [29]–[32] on crowd counting use detection-based approaches. These approaches usually apply a person or head detector via a sliding window on an image. Recently, many extraordinary object detectors such as R-CNN [33]–[35], YOLO [36], and SSD [37] have been presented, which may achieve remarkable detection accuracy in sparse scenes. However, they give unsatisfactory results when they encounter occlusion and background clutter in extremely dense crowds.

To reduce the above problems, some works [27], [38], [39] introduce regression-based methods, which directly learn the mapping from an image patch to the count. They usually first extract global features [40] (texture, gradient, edge features) or local features [41] (SIFT [42], LBP [43], HOG [44], GLCM [45]). Then regression techniques such as linear regression [46] and Gaussian mixture regression [47] are used to learn a mapping function to the crowd count.

These methods are successful in dealing with occlusion and background clutter, but they always ignore spatial information. Therefore, Lempitsky et al. [16] first adopt a density estimation based method by learning a linear mapping between local features and the corresponding density maps. To reduce the difficulty of learning a linear mapping, [48] proposes a non-linear mapping, random forest regression, which obtains satisfactory performance by introducing a crowdedness prior and using it to train two different forests. Besides, this method needs less memory to store the forest. These methods consider spatial information, but they only use traditional hand-crafted features to extract low-level information, which cannot guide the generation of high-quality density maps for more accurate counting.

Recently, benefiting from the powerful feature representation of CNNs, more researchers utilize them to improve density estimation.
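To make the density-map formulation discussed above concrete, the following is a minimal sketch in plain Python, not the authors' released tool. It assumes a common convention that is not spelled out in this section: each annotated head is replaced by a Gaussian bump normalised to sum to one, so that summing the map over the whole image (or over any region) gives the count in that area. The function names, the `(row, col)` annotation format and the fixed `sigma` are illustrative assumptions.

```python
import math


def gaussian_density_map(points, height, width, sigma=2.0):
    """Build a density map from head annotations.

    points: list of (row, col) head positions (hypothetical annotation format).
    Each head contributes a Gaussian bump that is normalised to sum to 1
    inside the image, so the full map sums to the number of heads.
    """
    density = [[0.0] * width for _ in range(height)]
    for pr, pc in points:
        # Unnormalised Gaussian weights of this head over the whole image.
        weights = [
            [math.exp(-((r - pr) ** 2 + (c - pc) ** 2) / (2.0 * sigma ** 2))
             for c in range(width)]
            for r in range(height)
        ]
        total = sum(sum(row) for row in weights)
        for r in range(height):
            for c in range(width):
                density[r][c] += weights[r][c] / total
    return density


def region_count(density, r0, r1, c0, c1):
    """Sum the density over rows r0..r1-1 and columns c0..c1-1."""
    return sum(sum(row[c0:c1]) for row in density[r0:r1])


# Three annotated heads: one isolated, two close together.
heads = [(3, 4), (10, 12), (10, 13)]
dmap = gaussian_density_map(heads, height=16, width=20)

# Integrating the whole map recovers the number of annotated heads (about 3.0).
estimated = region_count(dmap, 0, 16, 0, 20)

# Integrating a window around the two nearby heads gives a local count.
local = region_count(dmap, 6, 15, 8, 18)
```

Unlike a single regressed number, the map keeps the spatial distribution of the crowd: the same array answers both "how many people in the image" and "how many people in this window".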
Earlier heuristic models typically leverage basic CNNs to predict the density of the crowds [15], [49]–[51], and obtain significant improvement over traditional hand-crafted features. Lately, more effective and efficient models have been built on the Fully Convolutional Network (FCN), which has become the mainstream network architecture for density estimation and crowd counting. Different models adopt different supervision levels and learning paradigms, and some models are designed for cross-scene and multi-domain settings. A brief chronology is shown in Fig. 1, which illustrates the main advancements and milestones of crowd counting techniques. This survey focuses on modern CNN-based density estimation and crowd counting; Fig. 2 depicts a taxonomy of the crucial methodologies covered in the survey.

Scope of the survey. Considering that reviewing all state-of-the-art methods is impractical (and fortunately unnecessary), this paper sorts out some mainstream algorithms, all of which are influential or essential papers published in, but not limited to, prestigious journals and conferences. The survey focuses on the modern CNN-based density estimation methods of recent years, and some early works are also included for the sake of completeness. We classify existing methods into several categories in terms of network architecture, supervision form, influence of cross-scene or multi-domain, etc. Such comprehensive and systematic taxonomies can help readers understand in depth the progress of crowd counting in the past years.

B. Related previous reviews and surveys

Table I lists the existing reviews or surveys that are related to our paper. Notably, Zhan et al. [24] and Junior et al. [58] are the first surveys of crowd analysis. Li et al. [62] review the task of crowded scene analysis with different methods, while Zitouni et al. [65] evaluate different methods with different criteria. Loy et al.
[60] make detailed comparisons of state-of-the-art methods for crowd counting based on video imagery with the same protocol. Ryan et al. [63] present an evaluation across multiple datasets to compare various image features and regression models, and Saleh et al. [64] survey two main approaches, direct and indirect. Grant et al. [66] explore two kinds of crowd analysis. While these surveys analyze crowd counting and scene analysis in detail, they only cover traditional methods with hand-crafted features. In more recent work, Sindagi et al. [67] provide a survey of state-of-the-art CNN-based approaches for single-image crowd counting and density estimation. However, it only roughly introduces the advancement of CNN-based methods, and only up to the year 2017. Tripathi et al. [68] put forward a review on crowd analysis using CNNs, which is not specific to crowd counting and is therefore not sufficiently comprehensive and in-depth. As the techniques advance month by month, there is an urgent need to document the development of crowd counting in the past half-decade.

Different from previous surveys that focus on hand-crafted features or primitive CNNs, our work systematically and comprehensively reviews CNN-based density estimation and crowd counting approaches. Specifically, we summarize the existing crowd counting models from various aspects and list the results of some representative mainstream algorithms in terms of evaluation metrics on several typical benchmark crowd counting datasets. Finally, we select the top three performers and carefully and thoroughly analyze the properties of these models. We also offer insights into essential open issues, challenges, and future directions.
Through this survey, we expect to make reasonable inferences and predictions about the future development of crowd counting, and to provide feasible solutions and guidance for the problem of object counting in other domains.

C. Contributions of this paper

In summary, the contributions of this paper are mainly the following:

1) Comprehensive and systematic overview from various aspects. We categorize the CNN-based models according to several taxonomies, including network architecture, supervision form, learning paradigm, etc. The taxonomies can give researchers a deep understanding of the critical techniques of CNN-based methods.

2) Attribute-based performance analysis. Based on the performance of the SOTA methods, we analyze the reasons why they perform well and the techniques they utilize. Besides, we discuss the various challenge factors that prompt researchers to design more effective algorithms.

3) Open questions and future directions. We look through some important issues in model design, dataset collection, and generalization to other domains with domain adaptation or transfer learning, and explore some promising research directions for the future.

These contributions provide a detailed and in-depth review, which differs from previous reviews and surveys to a large extent.

The remainder of the paper is organized as follows. Section II conducts a comprehensive literature review of mainstream CNN-based density estimation and crowd counting models according to the proposed taxonomies. Section III examines the most notable datasets for crowd counting and some datasets for other object counting tasks, while Section IV describes several widely used evaluation metrics. Section V benchmarks some representative models and makes an in-depth analysis. Section VI presents a discussion and puts forward some open issues and possible future directions. Finally, Section VII concludes the paper.

Fig. 1: A brief chronology of crowd counting. The first incorporation of deep learning techniques for crowd counting is from 2015. See Section I for a more detailed description. Milestone models in this figure: MLR [52], KRR [53], Chan et al. [27], Lempitsky et al. [16], RR [40], CA-RR [54], Count Forest [48], Wang et al. [49], Fu et al. [50], Cross scene [51], MCNN [1], Hydra-CNN [2], CP-CNN [6], CMTL [55], Switching CNN [5], CSRNet [12], SANet [11], PSSDN [56] and LSF-CNN [57]. The trend in the past few years has been to design crowd counting models based on multi-column (in green) and single-column (in red) network architectures, and to build object localization or tracking on counting techniques (in crimson); these are either contemporary or potential future directions. Traditional heuristic methods are highlighted with the blue-shaded area, and modern CNN-based density estimation and crowd counting models with the red-shaded background.
[Fig. 1 timeline, 2006 to 2019: detection-based, regression-based and density estimation based heuristic models, followed by modern deep models.]

Fig. 2: The overall architecture of this work. We concentrate on modern density map-based approaches, mainly CNN-based, for crowd counting.
[Fig. 2 diagram labels. Crowd counting: traditional approaches (object level: detection-based; image/patch level: regression-based; grid/pixel level: density estimation) and modern approaches. Challenges: scale problem, occlusion, non-uniform distribution, illumination variation, background noise. Network architecture: basic, multi-column, single-column. Reference manner: patch-based, whole image-based. Supervision form: supervision-based, un/semi/self-supervision-based. Domains: domain-specific, multi-domain. Supervision level: instance-level, image-level.]

II. TAXONOMY FOR CROWD COUNTING

In this section, we review CNN-based crowd counting algorithms according to the following taxonomies. First come the representative network architectures for crowd counting (II-A). Next is the learning paradigm of the methods (II-B), and then the inference manner of the networks (II-C). Additionally, the supervision forms of networks are introduced in II-D. Meanwhile, to evaluate the generalization ability of the algorithms, we classify existing works into domain-specific and multi-domain ones (II-E). Finally, based on the supervision level, we classify the CNN-based models into instance-level and image-level ones (II-F). We group the important models and describe them roughly in chronological order. A summary of the state-of-the-art is presented in Table II.

TABLE I: Summary of previous reviews.

| # | Title | Year | Venue | Brief description |
|---|---|---|---|---|
| 1 | Crowd analysis: a survey [24] | 2008 | MVA | This paper presents a survey on crowd analysis methods employed in computer vision research and discusses perspectives from other research disciplines and how they can contribute to the computer vision approach. |
| 2 | Crowd analysis using computer vision techniques [58] | 2010 | ISPM | A survey on crowd analysis by using computer vision techniques, including different aspects such as people tracking, crowd density estimation, event detection, validation and simulation. |
| 3 | A Survey of Human-Sensing: Methods for Detecting Presence, Count, Location, Track, and Identity [59] | 2010 | ACM Computing Surveys | A survey of the inherently multidisciplinary literature of human-sensing, focusing mainly on the extraction of five commonly needed spatio-temporal properties: namely presence, count, location, track and identity. |
| 4 | Crowd counting and profiling: Methodology and evaluation [60] | 2013 | MSVAC | This study describes and compares the state-of-the-art methods for video imagery based crowd counting, and provides a systematic evaluation of different methods using the same protocol. |
| 5 | Performance evaluation of crowd image analysis using the PETS2009 dataset [61] | 2014 | PRL | This paper presents the PETS2009 crowd analysis dataset and highlights detection and tracking performance on it. |
| 6 | Crowded scene analysis: A survey [62] | 2015 | TCSVT | This paper surveys the state-of-the-art techniques on crowded scene analysis with different methods such as crowd motion pattern learning, crowd behavior, activity analysis and anomaly detection in crowds. |
| 7 | An evaluation of crowd counting methods, features and regression models [63] | 2015 | CVIU | This paper presents an evaluation across multiple datasets to compare holistic, local and histogram based methods, and to compare various image features and regression models. |
| 8 | Recent survey on crowd density estimation and counting for visual surveillance [64] | 2015 | EAAI | This paper presents a survey on crowd density estimation and counting methods employed for visual surveillance in the perspective of computer vision research. |
| 9 | Advances and trends in visual crowd analysis: A systematic survey and evaluation of crowd modelling techniques [65] | 2016 | Neurocomputing | This paper aims to give an account of such issues by deducing key statistical evidence from the existing literature and providing recommendations towards focusing on the general aspects of techniques rather than any specific algorithm. |
| 10 | Crowd scene understanding from video: a survey [66] | 2017 | TOMM | This survey explores crowd analysis as it relates to two primary research areas: crowd statistics and behavior understanding. |
| 11 | A survey of recent advances in CNN-based single image crowd counting and density estimation [67] | 2018 | PRL | A review of various single image crowd counting and density estimation methods with a specific focus on recent CNN-based approaches. |
| 12 | Convolutional neural networks for crowd behaviour analysis: a survey [68] | 2019 | VC | A survey for crowd analysis using CNN. |

A. Representative network architectures for crowd counting

In view of the different types of network architectures, we divide crowd counting models into three categories: basic CNN based methods, multi-column based methods, and single-column based methods. The categories of network architectures are illustrated in Fig. 3.

Fig. 3: Comparison of the structure of existing density map-based networks.
[Fig. 3 panels: Basic Networks (CNN followed by a fully-connected layer), Multi-Column Networks, Single-Column Networks; each maps an input image to a density map.]

1) Basic CNN: This network architecture adopts only the basic CNN layers, i.e., convolutional layers, pooling layers, and fully connected layers, and requires no additional feature information. Such networks are generally found in the initial works using CNNs for density estimation and crowd counting.

• Fu et al. [50] put forward the first CNN-based model for crowd counting, which improves the speed and accuracy of the model by removing some redundant, similar network connections in the feature maps and cascading two ConvNet classifiers.
• Wang et al. [49] propose a deep network based on the AlexNet architecture [102] for extremely dense crowd counting, and adopt expanded negative samples, whose ground-truth counts are zero, to reduce interference.
• CNN-boosting [15] employs basic CNNs in a layer-wise manner, and leverages layered boosting and selective sampling to improve the counting accuracy and reduce training time.

Since no additional feature information is provided, basic CNNs are simple and easy to implement, yet usually achieve low accuracy.

2) Multi-column: These network architectures usually adopt different columns to capture multi-scale information corresponding to different receptive fields, which has brought about excellent performance for crowd counting.

• MCNN [1] is a pioneering work explicitly focusing on the multi-scale problem. MCNN is a multi-column architecture with three branches that use different kernel sizes (large, medium, small). However, the three branches have similar or even identical depth and structure, which makes the network look like a simple assembly of several weak regressors.
• Hydra-CNN [2] uses a pyramid of image patches corresponding to different scales to learn a multi-scale non-linear regression model for the final density map estimation.
• CrowdNet [3] combines shallow and deep networks in different columns, of which the shallow one captures the low-level features corresponding to large scale variation and the deep one captures the high-level semantic information.
• Switching CNN [5] trains several independent CNN crowd density regressors on image patches; the regressors have the same structure as MCNN [1]. In addition, a switch classifier is trained alternately with the regressors to select the best one for the density estimation.
• CP-CNN [6] is a contextual pyramid CNN that combines global and local contextual information to generate high-quality density maps. Moreover, adversarial learning [103] is utilized to fuse the features from different levels.
• TDF-CNN [104] delivers top-down information to the bottom-up network to amend the density estimation.
• DRSAN [71] handles scale variation and rotation variation by taking advantage of the Spatial Transformer Network (STN) [105].
• SAAN [8] is similar in idea to MoC-CNN [106] and CP-CNN [6], but utilizes a visual attention mechanism to automatically select the particular scale both at the global image level and at the local image patch level.
• RANet [99] provides local self-attention (LSA) and global self-attention (GSA) to capture short-range and long-range interdependence information respectively; furthermore, a relation module is introduced to merge LSA and GSA to obtain more informative aggregated feature representations.
• McML [100] incorporates a statistical network into the multi-column network to estimate the mutual information between different columns. The proposed mutual learning scheme optimizes each column alternately while keeping the other columns fixed on each mini-batch of training data.
• DADNet [107] takes dilated CNNs with different dilation rates as a front-end to capture more contextual information, and adaptive deformable convolution as a back-end to locate the positions of the objects accurately.

TABLE II: Summary of state-of-the-art methods. See Section II for a more detailed description. STL: single-task learning; MTL: multi-task learning.

| Methods | Year & Venue | Network architecture | Inference manner | Supervision form | Learning paradigm | Supervision level |
|---|---|---|---|---|---|---|
| Fu et al. [50] | 2015 EAAI | Basic | Patch-based | Fully-Sup. | STL | Instance level |
| Wang et al. [49] | 2015 ACMMM | Basic | Patch-based | Fully-Sup. | STL | Instance level |
| Cross scene [51] | 2015 CVPR | Basic | Patch-based | Fully-Sup. | MTL | Instance level |
| MCNN [1] | 2016 CVPR | Multi-column | Whole image-based | Fully-Sup. | STL | Instance level |
| CrowdNet [3] | 2016 ACMMM | Multi-column | Patch-based | Fully-Sup. | STL | Instance level |
| CNN-Boosting [15] | 2016 ECCV | Basic | Patch-based | Fully-Sup. | STL | Instance level |
| Hydra-CNN [2] | 2016 ECCV | Multi-column | Patch-based | Fully-Sup. | MTL | Instance level |
| Shang et al. [69] | 2016 ECCV | Multi-column | Whole image-based | Fully-Sup. | STL | Instance level |
| CMTL [55] | 2017 AVSS | Multi-column | Whole image-based | Fully-Sup. | MTL | Instance level |
| Switching CNN [5] | 2017 CVPR | Multi-column | Patch-based | Fully-Sup. | MTL | Instance level |
| CP-CNN [6] | 2017 ICCV | Multi-column | Whole image-based | Fully-Sup. | MTL | Instance level |
| D-ConvNet [70] | 2018 CVPR | Single-column | Whole image-based | Fully-Sup. | STL | Instance level |
| CSRNet [12] | 2018 CVPR | Single-column | Whole image-based | Fully-Sup. | STL | Instance level |
| DRSAN [71] | 2018 IJCAI | Multi-column | Whole image-based | Fully-Sup. | STL | Instance level |
| DecideNet [7] | 2018 CVPR | Multi-column | Patch-based | Fully-Sup. | MTL | Instance level |
| SaCNN [9] | 2018 WACV | Single-column | Whole image-based | Fully-Sup. | MTL | Instance level |
| SANet [11] | 2018 ECCV | Single-column | Patch-based | Fully-Sup. | MTL | Instance level |
| IG-CNN [72] | 2018 CVPR | Multi-column | Patch-based | Fully-Sup. | MTL | Instance level |
| ic-CNN [73] | 2018 ECCV | Multi-column | Whole image-based | Fully-Sup. | MTL | Instance level |
| ACSCP [74] | 2018 CVPR | Multi-column | Patch-based | Fully-Sup. | MTL | Instance level |
| NetVLAD [75] | 2018 TII | Single-column | Whole image-based | Fully-Sup. | MTL | Instance level |
| CL [76] | 2018 ECCV | Single-column | Patch-based | Fully-Sup. | MTL | Instance level |
| L2R [77] | 2018 CVPR | Basic | Whole image-based | Self-Sup. | MTL | – |
| GAN-MTR [78] | 2018 WACV | Basic | Whole image-based | Semi-Sup. | MTL | – |
| PaDNet [79] | 2019 TIP | Single-column | Patch-based | Fully-Sup. | STL | Instance level |
| ASD [80] | 2019 ICASSP | Multi-column | Whole image-based | Fully-Sup. | MTL | Instance level |
| SPN [81] | 2019 WACV | Single-column | Whole image-based | Fully-Sup. | STL | Instance level |
| SR-GAN [82] | 2019 CVIU | Basic | Whole image-based | Semi-Sup. | MTL | – |
| ADCrowdNet [83] | 2019 CVPR | Single-column | Whole image-based | Fully-Sup. | STL | Instance level |
| SAAN [8] | 2019 WACV | Multi-column | Whole image-based | Fully-Sup. | MTL | Instance level |
| SAA-Net [13] | 2019 CVPR | Single-column | Whole image-based | Fully-Sup. | MTL | Instance level |
| SFCN†2 [84] | 2019 CVPR | Single-column | Whole image-based | Fully-Sup. | STL | Instance level |
| SE Cycle GAN [84] | 2019 CVPR | Single-column | Whole image-based | Fully-Sup. | STL | Instance level |
| PACNN [85] | 2019 CVPR | Single-column | Whole image-based | Fully-Sup. | STL | Instance level |
| CAN&ECAN [86] | 2019 CVPR | Single-column | Whole image-based | Fully-Sup. | STL | Instance level |
| CFF [87] | 2019 ICCV | Single-column | Whole image-based | Fully-Sup. | MTL | Instance level |
| PCC Net [88] | 2019 TCSVT | Multi-column | Whole image-based | Fully-Sup. | MTL | Instance level |
| SFANet [89] | 2019 CVPR | Single-column | Whole image-based | Fully-Sup. | MTL | Instance level |
| W-Net [90] | 2019 CVPR | Single-column | Whole image-based | Fully-Sup. | STL | Instance level |
| SL2R [91] | 2019 CVPR | Basic | Whole image-based | Self-Sup. | MTL | Instance level |
| TEDnet [92] | 2019 CVPR | Single-column | Whole image-based | Fully-Sup. | STL | Instance level |
| RReg [93] | 2019 CVPR | Multi-column | Whole image-based | Fully-Sup. | STL | Instance level |
| RAZNet [94] | 2019 CVPR | Multi-column | Whole image-based | Fully-Sup. | MTL | Instance level |
| AT-CNN [95] | 2019 CVPR | Single-column | Whole image-based | Fully-Sup. | MTL | – |
| GWTA-CCNN [96] | 2019 AAAI | Single-column | Patch-based | Un-Sup. | STL | – |
| HA-CCN [97] | 2019 TIP | Single-column | Patch-based | Fully-Sup. | STL | Instance level |
| L2SM [98] | 2019 ICCV | Single-column | Whole image-based | Fully-Sup. | STL | Instance level |
| RANet [99] | 2019 ICCV | Multi-column | Whole image-based | Fully-Sup. | STL | Instance level |
| McML [100] | 2019 ACM MM | Multi-column | Whole image-based | Fully-Sup. | STL | Image level |
| ILC [101] | 2019 CVPR | Multi-column | Whole image-based | Fully-Sup./Weak-Sup. | MTL | Instance/Image level |

Although great progress has been achieved by these multi-column networks, they still suffer from several significant disadvantages, which Li et al. [12] have demonstrated experimentally. First of all, it is difficult to train the multi-column networks since it requires more time and

a more bloated structure. Next, using different branches with almost the same network structure inevitably leads to a lot of information redundancy. Moreover, multi-column networks always require density-level classifiers before sending images into the networks. However, because the number of people varies greatly in congested real-world scenes, it is difficult to define the granularity of the density level. Meanwhile, more fine-grained classifiers also mean that more columns and more sophisticated structures must be designed, thereby causing more redundancy. Finally, these networks spend a large number of parameters on density-level classifiers rather than on generating the final density maps. The resulting lack of parameters for density map generation degrades the quality.

Given all the disadvantages mentioned above, multi-column network architectures may be ineffective in a narrow sense. This motivates many researchers to exploit simpler yet effective and efficient networks. Therefore, single-column network architectures have emerged to cater to the demands of more challenging situations in crowd counting.

3) Single column: The single-column network architectures usually deploy a single, deeper CNN rather than the bloated structure of the multi-column architecture, on the premise of not increasing the complexity of the network.

• W-VLAD [108] takes account of semantic features and spatial cues; additionally, a novel locality-aware feature (LAF) is introduced to represent the spatial information.
• SaCNN [9] is a scale-adaptive CNN that takes an FCN with fixed small receptive fields as its backbone, adapts the feature maps extracted from multiple layers to the same size, and then combines them to generate the final density map.
• D-ConvNet [70], the De-correlated ConvNet, uses negative correlation learning (NCL) to improve the generalization capability of an ensemble of weak regressors built on convolutional feature maps.
• CSRNet [12] adopts dilated convolution layers as the back-end network to expand the receptive field while maintaining resolution.
• SANet [11] builds on the Inception architecture [109] in the encoder to extract multi-scale features and uses transposed convolution layers in the decoder to upsample the extracted feature maps.
• SPN [81] leverages a shared deep single-column structure and extracts multi-scale features in the high layers with a Scale Pyramid Module (SPM), which deploys four parallel dilated convolutions with different dilation rates.
• ADCrowdNet [83] combines a visual attention mechanism and a multi-scale deformable convolution scheme in a cascaded framework.
• SAA-Net [13] mimics multiple branches within a single column by learning a set of soft gate attention masks on the intermediate feature maps, exploiting the hierarchical structure of CNNs. The idea behind it is somewhat similar to SaCNN [9], but it adds attention masks on the corresponding feature maps.
• W-Net [90] is inspired by U-Net [110]. It adds an auxiliary Reinforcement branch to accelerate convergence and retain local pattern consistency, and uses the Structural Similarity Index (SSIM) to estimate the final density maps.
• TEDnet [92] is a trellis encoder-decoder network architecture that integrates multiple decoding paths to capture multi-scale features and exploits dense skip connections to obtain the supervised information. In addition, to alleviate the gradient vanishing problem and improve back-propagation, a combinatorial loss comprising a local coherence loss and a spatial correlation loss is also presented.
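
Several of the single-column designs above (CSRNet, SPN) rely on dilated convolutions to enlarge the receptive field without reducing resolution. The following is a minimal sketch of that arithmetic in plain Python, not code from any of the cited papers: the helper names and the six-layer stacks are illustrative assumptions, and only stride-1 convolutions are considered.

```python
def effective_kernel(kernel_size: int, dilation: int) -> int:
    """Span covered by one dilated kernel: k + (k - 1) * (d - 1)."""
    return kernel_size + (kernel_size - 1) * (dilation - 1)


def receptive_field(layers) -> int:
    """Receptive field of a stack of stride-1 convolutions.

    `layers` is a sequence of (kernel_size, dilation) pairs.
    """
    rf = 1
    for kernel_size, dilation in layers:
        # Each stride-1 layer widens the field by (effective kernel - 1).
        rf += effective_kernel(kernel_size, dilation) - 1
    return rf


def same_padding(kernel_size: int, dilation: int) -> int:
    """Padding per side that keeps the spatial size unchanged at stride 1."""
    return dilation * (kernel_size - 1) // 2


# Hypothetical stacks: six 3x3 layers, without and with dilation rate 2.
plain = [(3, 1)] * 6
dilated = [(3, 2)] * 6

print(receptive_field(plain))    # receptive field of the plain stack
print(receptive_field(dilated))  # receptive field of the dilated stack
print(same_padding(3, 2))        # padding that preserves resolution
```

With the same number of layers and parameters, the dilated stack sees a wider context, and suitable padding keeps the output at the input resolution, which is the property the CSRNet bullet describes.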
Due to their architectural simplicity and training efficiency, single-column network architectures have received more and more attention in recent years.

B. Learning paradigm

From the viewpoint of learning paradigm, crowd counting networks can be divided into single-task and multi-task based methods.

1) Single-task based methods: The classical methodology is to learn one task at a time, i.e., single-task learning [111]. Most CNN-based crowd counting methods belong to this paradigm: they generally either generate a density map and sum all its pixels to obtain the total count, or regress the count directly.

2) Multi-task based methods: More recently, inspired by the success of multi-task learning in various computer vision tasks, methods that combine density estimation with other tasks such as classification, detection and segmentation have shown better performance. Multi-task based methods are generally designed with multiple subnets; moreover, in contrast to a pure single-column architecture, there may be additional branches corresponding to the different tasks. In summary, multi-task architectures can be regarded as a cross-fertilization of multi-column and single-column architectures, yet different from either one.

• CMTL [55] combines crowd count classification and density map estimation in an end-to-end cascaded framework. It divides the crowd count into groups and integrates this as a high-level prior into the density map estimation network.
• Decidenet [7] predicts the crowd count by generating detection-based and regression-based density maps separately. To decide adaptively which is appropriate, an attention module guides the network to allocate relative weights and select the suitable mode, so that it can switch automatically between detection and regression. However, it may suffer from a huge number of parameters because it uses a multi-column structure.
• IG-CNN [72] is a hierarchical clustering model that generates image groups in the dataset together with a set of networks, each specialized in its respective group. It can adapt and grow with the complexity of the dataset.
• ic-CNN [73] puts forward a two-branch network: one branch generates low-resolution density maps, and the other refines the low-resolution maps and the feature maps extracted from previous layers to produce higher-resolution density maps.
• ACSCP [74] introduces an adversarial loss to sharpen blurry density maps. Moreover, a scale-consistency regularizer is designed to guarantee the calibration of the cross-scale model and the collaboration between different scale paths.
• CL [76] simultaneously addresses three tasks, namely crowd counting, density map estimation and localization in dense crowds, based on the observation that they are related to each other, which makes the loss function in the optimization of a deep CNN decomposable.
• CFF [87] argues that point annotations serve more than the construction of density maps, and repurposes them for free in two ways: supervised focus from segmentation and supervised focus from global density. Focus for free can be regarded as a complement to other approaches that benefits counting regardless of the base network.
• PCC Net [88] takes perspective change into account and is composed of three components: Density Map Estimation for learning local features, Random High-level Density Classification for predicting density labels of image patches, and Fore-/Background Segmentation (FBS) for segmenting the foreground and background.
• RAZ-Net [94] observes that the density map is not consistent with the correct person density, which implies that crowd localization cannot depend on the density map. It proposes a recurrent attentive zooming network to increase the resolution for localization, together with an adaptive fusion strategy to enhance the mutual benefit between counting and localization.
• ATCNN [95] fuses three heterogeneous attributes, i.e., geometric, semantic and numeric attributes, and takes them as auxiliary tasks to assist the crowd counting task.
• CDT [112] not only makes an overall comparison of density maps for counting, but also extends it to detection and tracking.
• NetVLAD [75], [113] is a multi-scale, multi-task framework that assembles multi-scale features captured from the input image into a compact feature vector by means of the "Vector of Locally Aggregated Descriptors" (VLAD). Additionally, "deeply supervised" operations are applied to the bottom layers to provide additional information and boost performance.

C. Inference manner

Based on the different training manners, CNN-based crowd counting approaches can be classified into patch-based inference and whole image-based inference.

1) Patch-based methods: This inference manner requires training on patches randomly cropped from the image. In the test phase, a sliding window is moved over the whole test image, an estimate is obtained for each window, and the estimates are assembled to obtain the final total count of the image.

• Cross-scene [51] randomly selects overlapping patches from the training images to serve as training samples, and the density maps of the corresponding image patches are treated as the ground truth. The total count of a selected training patch is computed by integrating over its density map, so the count is a decimal rather than an integer.
• CCNN [2] primarily learns a regression function that projects the appearance of image patches onto their corresponding object density maps. The model uses the same size for all patches and the same covariance value of the Gaussian function when generating the ground-truth density maps, which limits its accuracy in scenarios with large scale variation.
• DML [114] integrates metric learning into a deep regression network, which can simultaneously extract density-level features and learn a better distance measure.
• PaDNet [79] presents a novel Density-Aware Network (DAN) module to discriminate the variable density of crowds, and a Feature Enhancement Layer (FEL) module to boost global and local recognition performance.
• L2SM [98], [115] attempts to address the density pattern shift issue, which results from the nonuniform density between sparse and dense regions, by providing two modules: a Scale Preserving Network (SPN) to obtain patch-level density maps and a learn-to-scale module (L2SM) to compute scale ratios for dense regions.
• GSP [116] devises a global sum pooling operation to replace global average pooling (GAP) or fully connected (FC) layers, treating the counting task as a simple linear mapping problem and avoiding patchwise cancellation and overfitting when training on small datasets of large images.

2) Whole image-based methods: Patch-based methods always neglect global information and incur a heavy computation cost due to the sliding window operation. Whole image-based methods therefore take the whole image as input and output the corresponding density map or the total number of people, which tends to converge better but may sometimes lose local information.

• JLLG [69] feeds the whole image into a pre-trained CNN to obtain high-level features, then maps these features to local counts. It takes advantage of contextual information in both the global and the local counts.
• Weighted VLAD [117] integrates semantic information into learning locality-aware feature (LAF) sets for crowd counting. It first maps the original pixel space onto a dense attribute feature map, then uses the LAF to capture more spatial context and local information.

D. Supervision form

According to whether human-labeled annotations are used for training, crowd counting methods can be classified into two categories: fully-supervised methods and un-/self-/semi-supervised methods.

1) Fully-supervised methods: The vast majority of CNN-based crowd counting methods rely on large-scale, accurately hand-annotated and diversified data. However, acquiring such data is time-consuming and imposes a heavier labeling burden than usual. Beyond that, because labeled data is scarce, the methods may suffer from over-fitting, which leads to a significant degradation in performance when they are transferred to the wild or to other domains. Therefore, training with fewer or even no labeled annotations is a promising research topic for the future.
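
To make the role of hand-labeled point annotations concrete, the sketch below turns a few head positions into a ground-truth density map with a fixed Gaussian (the same covariance for every point, as described for CCNN above) and then integrates it per patch, as in the patch-based methods. It is an illustration in plain Python, not the code of any cited method: the function names, `sigma`, `radius`, the example points and the renormalization of kernels clipped at the image border are all assumptions.

```python
import math


def gaussian_kernel(sigma, radius):
    """Normalized 2-D Gaussian kernel as a list of rows (weights sum to 1)."""
    size = 2 * radius + 1
    weights = [
        [math.exp(-((x - radius) ** 2 + (y - radius) ** 2) / (2.0 * sigma ** 2))
         for x in range(size)]
        for y in range(size)
    ]
    total = sum(sum(row) for row in weights)
    return [[w / total for w in row] for row in weights]


def density_map(points, height, width, sigma=2.0, radius=6):
    """Place one unit-mass Gaussian per annotated head position (row, col)."""
    kernel = gaussian_kernel(sigma, radius)
    density = [[0.0] * width for _ in range(height)]
    for row, col in points:
        # Keep only the kernel cells that fall inside the image.
        cells = [
            (row + dy, col + dx, kernel[dy + radius][dx + radius])
            for dy in range(-radius, radius + 1)
            for dx in range(-radius, radius + 1)
            if 0 <= row + dy < height and 0 <= col + dx < width
        ]
        # Assumption: renormalize so every person still contributes exactly 1.
        mass = sum(w for _, _, w in cells)
        for y, x, w in cells:
            density[y][x] += w / mass
    return density


def patch_count(density, top, left, size):
    """Integrate the density map over one square patch."""
    return sum(sum(r[left:left + size]) for r in density[top:top + size])


# Hypothetical annotations on a 32 x 40 image; points must lie inside the image.
points = [(5, 14), (20, 30), (21, 31), (0, 39)]
density = density_map(points, height=32, width=40)
total = sum(sum(r) for r in density)
patches = [patch_count(density, t, l, 16)
           for t in range(0, 32, 16) for l in range(0, 40, 16)]
print(total, patches)
```

Summing the whole map recovers the number of annotated people, while the count of an individual patch is generally fractional because a Gaussian can straddle patch borders; the counts of non-overlapping patches still add up to the image-level total.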
2) Un/semi/weakly/self-supervised methods: Un/semi-supervised learning denotes learning without or with only a few ground-truth labels, while self-supervised learning adds an auxiliary task that is different from but related to the supervised task. Some methods that exploit unlabeled data for training have achieved performance comparable to that of supervised methods.
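
One auxiliary task of this kind is count ranking on unlabeled images: a crop contained in a larger crop cannot hold more people than the larger one. The sketch below generates nested crops and penalizes predicted counts that violate this ordering. It is a hedged illustration in plain Python; the hinge-style penalty, the centred cropping scheme and all names and default values are assumptions, not the formulation of any cited method.

```python
def nested_crops(height, width, num_crops=3, scale=0.75):
    """Return centred boxes (top, left, h, w), each contained in the previous one."""
    boxes = []
    h, w = height, width
    for _ in range(num_crops):
        top, left = (height - h) // 2, (width - w) // 2
        boxes.append((top, left, h, w))
        # Shrink the next crop; never go below one pixel.
        h, w = max(1, int(h * scale)), max(1, int(w * scale))
    return boxes


def ranking_loss(counts, margin=0.0):
    """Hinge penalty on predicted counts ordered from largest crop to smallest.

    A pair contributes only when the smaller crop is predicted to hold
    more people than the crop that contains it (plus an optional margin).
    """
    loss = 0.0
    for larger, smaller in zip(counts, counts[1:]):
        loss += max(0.0, smaller - larger + margin)
    return loss


boxes = nested_crops(64, 48)
consistent = ranking_loss([30.0, 22.5, 10.0])  # ordering respected
violated = ranking_loss([30.0, 35.0, 10.0])    # middle crop over-counted
print(boxes, consistent, violated)
```

No count labels are needed to evaluate this penalty, which is why such a task can be trained on unlabeled crowd images alongside a supervised counting loss.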

• GWTA-CCNN [96] presents a stacked convolutional autoencoder based on the Grid Winner-Take-All [118] paradigm for unsupervised feature learning, in which 99% of the parameters can be trained without any labeled data.
• SR-GAN [82] generalizes semi-supervised GANs from classification problems to regression problems by introducing a feature contrasting loss function.
• GAN-MTR [78] applies semi-supervised GAN objectives to the multiple object regression problem, training the same basic network as [51] with the use of unlabeled data.
• DG-GAN [119] presents a semi-supervised dual-goal GAN framework that both estimates the number of individuals in the crowd scene and discriminates between real and fake images.
• CCLL [120] puts forward a semi-supervised method that uses submodular selection to choose the most representative frames from the sequences, circumventing redundancy while retaining densities; graph Laplacian regularization and spatiotemporal constraints are also incorporated into the model.
• L2R [77], [91] exploits unlabeled crowd data for pre-training CNNs in a multi-task framework. It is inspired by self-supervised learning and based on the observation that the crowd count of a patch must be less than or equal to that of any larger patch that contains it. The method is fully supervised in essence, with an additional count ranking task trained in a self-supervised manner.
• HA-CNN [97] makes the first attempt to fine-tune the network to new scenes in a weakly supervised manner, by leveraging image-level labels that assign crowd images to density levels.
• CCWld [84] provides a data collector and labeler for crowd counting, where the data comes from an electronic game. With the collector and labeler, data can be collected and annotated automatically, and the first large-scale synthetic crowd counting dataset is constructed.
• CODA [121] presents a novel scale-aware adversarial density adaptation approach for object counting, which can be used to generalize a trained model to unseen scenes in an unsupervised manner.
• OSSS [122] designs a one-shot scene-specific crowd counting model by taking advantage of fine-tuning.

E. Domain adaptation

Almost all existing counting methods are designed for a specific domain; therefore, designing a counting model that can count objects in any domain is a challenging yet meaningful task. Domain adaptation may be a powerful tool to tackle this problem.

• CAC [123] formulates counting as a matching problem and presents a Generic Matching Network (GMN) that works in a class-agnostic manner. Because counting is treated as matching, GMN can be trained on video data labeled for tracking. In a few-shot learning setting, it can use an adapter module to adapt to different domains.
• PPPD [124] provides a patch-based, multi-domain object counting network by leveraging a set of domain-specific scaling and normalization layers, which use only a few parameters. It can also be extended to perform visual domain classification, even in an unseen observed domain.
• SE CycleGAN [84] takes advantage of domain adaptation, incorporating the Structural Similarity Index (SSIM) [125] into the traditional CycleGAN framework to bridge the domain gap between synthetic data and real-world data.
• MFA+SDA [126] draws on the idea of SE CycleGAN and is also a GAN-based adaptation model. The authors propose a Multi-level Feature-aware Adaptation to reduce the domain gap and a Structured Density map Alignment to handle unseen crowd scenes.
• DACC [127] is composed of two modules: Inter-domain Features Segregation (IFS) and Gaussian-prior Reconstruction (GPR). IFS is designed to translate synthetic data into realistic images, and GPR is used to generate higher-fidelity density maps with pseudo labels.
• FSC [128] extracts semantic domain-invariant features via crowd masks generated by a pre-trained crowd segmentation model. The estimation errors in background regions are reduced significantly.

F. Instance-/image-based supervision

The aim of object counting is to estimate the number of objects. If the ground truth is labeled with points or bounding boxes, the method belongs to instance-level supervision. In contrast, image-level supervision only needs the count of the different object instances.

1) Instance-level supervision: Most crowd density estimation methods are based on instance-level (point-level or bounding box) supervision, which needs hand-labeled annotations for each instance location.

2) Image-level supervision: Methods based on image-level supervision count the number of instances within or beyond the subitizing range and do not require location information. This can be regarded as estimating the count at one shot or at a glance [129].

• ILC [101] generates a density map of object categories, from which the total object count and the spatial distribution of object instances are obtained simultaneously.

III. DATASETS

With the rapid development of crowd counting, numerous datasets have been introduced, which motivate many more algorithms to address various challenges such as scale variation and background clutter in surveillance video, and changeable environments and illumination variation in the wild. In this section, we review almost all crowd counting datasets from the beginning up to now. Table III summarizes some representative datasets, including crowd counting datasets with real-world data and one with synthetic data. For the sake of completeness, we also survey several datasets applied in other domains, to evaluate the generalization ability of the designed algorithms. The datasets are sorted chronologically and their specific statistics are listed in Table III. Some samples from the representative datasets are depicted in Fig. 4.
