CNN-based Density Estimation and Crowd
Counting: A Survey
Guangshuai Gao1,2, Junyu Gao3, Student Member, IEEE, Qingjie Liu1,2∗, Member, IEEE, Qi Wang3, Senior
Member, IEEE, and Yunhong Wang1,2, Fellow, IEEE
arXiv:2003.12783v1 [cs.CV] 28 Mar 2020
Abstract—Accurately estimating the number of objects in
a single image is a challenging yet meaningful task and has
been applied in many applications such as urban planning
and public safety. Among the various object counting tasks, crowd
counting is particularly prominent due to its specific significance
to social security and development. Fortunately, the techniques
developed for crowd counting can be generalized to other
related fields such as vehicle counting and environment survey,
if their specific characteristics are set aside. Therefore,
many researchers are devoting themselves to crowd counting, and
many excellent works have emerged. These works have certainly
advanced the development of crowd counting. However, the question
we should consider is why they are effective for this task.
Limited by time and energy, we cannot analyze all the algorithms.
In this paper, we survey over 220 works to comprehensively and
systematically study crowd counting models, mainly CNN-based
density map estimation methods. Finally, according to the
evaluation metrics, we select the top three performers on the
crowd counting datasets and analyze their merits and drawbacks.
Through our analysis, we expect to make reasonable inferences and
predictions about the future development of crowd counting, and
meanwhile to provide feasible solutions for the problem of object
counting in other fields. We provide the density maps and
prediction results of some mainstream algorithms on the validation
set of the NWPU dataset for comparison and testing. Meanwhile,
density map generation and evaluation tools are also provided.
All the code and evaluation results are made publicly available
at https://github.com/gaoguangshuai/survey-for-crowd-counting.
Index Terms—Object counting, crowd counting, density
estimation, CNNs.
I. INTRODUCTION
OVER the past few decades, an increasing number of
research communities have considered the problem of
object counting as their main research direction. As a
consequence, many works have been published to count the number
of objects in images or videos across a wide variety of domains
such as crowd counting [1]–[13], cell microscopy [14]–[16],
animals [17], vehicles [2], [18]–[20], leaves [21], [22]
and environment survey [23], [24]. Among all these domains, crowd
counting is of paramount importance, and it is crucial to
building higher-level cognitive abilities in some crowd
scenarios, such as crowd analysis [25], [26] and video
surveillance [27]. The growth of the world’s population
and the subsequent urbanization result in rapid crowd gathering
in many scenarios such as parades, concerts and stadiums. In
these scenarios, crowd counting plays an indispensable role
in social safety and control management.
Guangshuai Gao, Qingjie Liu and Yunhong Wang are with the
State Key Laboratory of Virtual Reality Technology and Systems,
Beihang University, Xueyuan Road, Haidian District, Beijing, 100191,
China, and Hangzhou Innovation Institute, Beihang University,
Hangzhou, 310051, China (email: gaoguangshuai1990@buaa.edu.cn;
qingjie.liu@buaa.edu.cn; yhwang@buaa.edu.cn).
Junyu Gao and Qi Wang are with the School of Computer Science and
with the Center for Optical Imagery Analysis and Learning (OPTIMAL),
Northwestern Polytechnical University, Xi’an 710072, Shaanxi, China
(email: gjy3035@gmail.com; crabwq@gmail.com).
* Corresponding author: Qingjie Liu
Considering the aforementioned importance of crowd counting,
more and more researchers have attempted to design various
sophisticated methods to address the problem. Especially in the
last half-decade, with the advent of deep learning, Convolutional
Neural Network (CNN) based models have overwhelmingly dominated
various computer vision tasks, including crowd counting.
Although different tasks have their unique attributes, they
share common features such as structural features and distribution
patterns. Fortunately, the techniques for crowd counting can
be extended to some other fields with specific tools. Therefore,
in this paper, we expect to provide reasonable solutions for
other tasks through an in-depth study of the crowd counting
task, especially CNN-based density estimation and crowd
counting models. Our survey covers various parts, ranging from
algorithm taxonomy to some interesting under-explored research
directions. Beyond taxonomically reviewing existing CNN-based
crowd counting and density estimation models, representative
datasets and evaluation metrics, we also investigate some factors
and attributes that largely affect the performance of the
designed model, such as distractors and negative samples. We
provide the density maps and prediction results of some
mainstream algorithms on the validation set of the NWPU dataset [28]
for comparison and testing. Meanwhile, density map generation and
evaluation tools are also provided. All the code and evaluation
results are made publicly available at
https://github.com/gaoguangshuai/survey-for-crowd-counting.
A. Related Works and Scope
The various approaches for crowd counting are mainly
divided into four categories: detection-based, regression-based,
density estimation-based, and more recently CNN-based density
estimation approaches. We focus on CNN-based density
estimation and crowd counting models in this survey. For the
sake of completeness, we also review some other
related works in this subsection.
Early works [29]–[32] on crowd counting use detection-based
approaches. These approaches usually apply a person
or head detector via a sliding window on an image. Recently,
many powerful object detectors such as R-CNN [33]–[35],
YOLO [36], and SSD [37] have been presented, which
may achieve remarkable detection accuracy in sparse scenes.
However, they yield unsatisfactory results when encountering
occlusion and background clutter in extremely dense crowds.
To alleviate the above problems, some works [27], [38],
[39] introduce regression-based methods which directly learn
the mapping from an image patch to the count. They usually
first extract global features [40] (texture, gradient, edge
features) or local features [41] (SIFT [42], LBP [43], HOG [44],
GLCM [45]). Then regression techniques such as linear
regression [46] and Gaussian mixture regression [47] are used
to learn a mapping function to the crowd count.
These methods are successful in dealing with the problems
of occlusion and background clutter, but they always
ignore spatial information. Therefore, Lempitsky et al. [16]
first adopt a density estimation based method by learning
a linear mapping between local features and corresponding
density maps. To reduce the difficulty of learning a linear
mapping, [48] proposes a non-linear mapping, random forest
regression, which obtains satisfactory performance by
introducing a crowdedness prior and using it to train two different
forests. Besides, this method needs less memory to store the
forest. These methods consider the spatial information, but
they only use traditional hand-crafted features to extract
low-level information, which cannot guide the generation of
high-quality density maps for more accurate counting.
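As a minimal sketch of the regression-based pipeline described above (features extracted from an image patch, followed by a regressor that maps them to a count), the example below fits an ordinary least-squares linear regressor with NumPy. The two features ("foreground area" and "edge pixels") and all numbers are synthetic stand-ins chosen for illustration, not data or settings taken from the cited works.

```python
import numpy as np

def fit_count_regressor(features, counts):
    """Fit a linear map (with bias) from feature vectors to counts by least squares."""
    X = np.hstack([features, np.ones((features.shape[0], 1))])  # append bias column
    weights, *_ = np.linalg.lstsq(X, counts, rcond=None)
    return weights

def predict_count(weights, features):
    """Apply the fitted linear map to new feature vectors."""
    X = np.hstack([features, np.ones((features.shape[0], 1))])
    return X @ weights

# Synthetic data: features grow roughly linearly with the true count (assumption).
rng = np.random.default_rng(0)
true_counts = rng.integers(5, 60, size=200).astype(float)
features = np.stack([
    12.0 * true_counts + rng.normal(0, 5, 200),  # hypothetical "foreground area"
    3.5 * true_counts + rng.normal(0, 3, 200),   # hypothetical "edge pixels"
], axis=1)

w = fit_count_regressor(features[:150], true_counts[:150])   # train on 150 patches
pred = predict_count(w, features[150:])                      # test on the remaining 50
mae = np.abs(pred - true_counts[150:]).mean()
print(f"MAE on held-out patches: {mae:.2f}")
```

Note that the regressor outputs a single number per patch, which is why such methods lose the spatial distribution of the crowd.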
Recently, benefiting from the powerful feature representation
of CNNs, more researchers utilize them to improve
density estimation. Early heuristic models typically leverage
basic CNNs to predict the density of crowds [15], [49]–[51],
obtaining significant improvement compared with
traditional hand-crafted features. Later, more effective and
efficient models based on the Fully Convolutional Network (FCN)
were proposed, and the FCN has become the mainstream network
architecture for density estimation and crowd counting. Different
models adopt different supervision levels and learning paradigms,
and some models are designed for cross-scene and multi-domain
settings. A brief chronology is shown in Fig. 1, which illustrates
the main advancements and milestones of crowd counting techniques.
This survey focuses on modern CNN-based methods for density
estimation and crowd counting; Fig. 2 depicts a taxonomy of the
crucial methodologies covered in the survey.
Scope of the survey. Considering that reviewing all state-
of-the-art methods is impractical (and fortunately unneces-
sary), this paper sorts out some mainstream algorithms, which
are all influential or essential papers published in, but not
limited to, prestigious journals and conferences. The survey
focuses on the modern CNN-based density estimation methods
in recent years, and some early works are also included for
the sake of completeness. We classify existing methods into
several categories, in terms of network architecture, supervi-
sion form, influence of cross-scene or multi-domain, etc. Such
comprehensive and systematic taxonomies can help readers
gain an in-depth understanding of the progress of crowd
counting in the past years.
B. Related previous reviews and surveys
Table I lists the existing reviews or surveys which are
related to our paper. Notably, Zhan et al. [24] and Junior et
al. [58] are the first surveys on crowd analysis. Li et al. [62]
review the task of crowded scene analysis with different
methods, while Zitouni et al. [65] evaluate different methods
with different criteria. Loy et al. [60] make detailed
comparisons of state-of-the-art methods for crowd counting based
on video imagery with the same protocol. Ryan et al. [63] present
an evaluation across multiple datasets to compare various image
features and regression models, and Saleh et al. [64] survey
two main approaches, direct and indirect. Grant
et al. [66] explore two kinds of crowd analysis: crowd statistics
and behavior understanding. While these surveys make detailed
analyses of crowd counting and scene analysis, they only cover
traditional methods with hand-crafted features. In more recent
work, Sindagi et al. [67] provide a survey of state-of-the-art
CNN-based approaches for crowd counting and density estimation
from a single image. However, it only roughly introduces the
advancements of CNN-based methods up to the year 2017.
Tripathi et al. [68] put forward a review on crowd analysis
using CNNs, which is not dedicated to crowd counting and is
therefore not sufficiently comprehensive and in-depth on this
topic. As the techniques advance month by month, there is
an urgent need to document the development of crowd
counting in the past half-decade.
Different from previous surveys that focus on hand-crafted
features or primitive CNNs, our work systematically
and comprehensively reviews CNN-based density estimation
and crowd counting approaches. Specifically, we summarize the
existing crowd counting models from various aspects and
list the results of some representative mainstream algorithms
in terms of evaluation metrics on several typical benchmark
crowd counting datasets. Finally, we select the top three
performers and thoroughly analyze the properties
of these models. We also offer insights into essential open
issues, challenges, and future directions. Through this survey,
we expect to make reasonable inferences and predictions about
the future development of crowd counting, and meanwhile to
provide feasible solutions and guidance for the
problem of object counting in other domains.
C. Contributions of this paper
In summary, the contributions of this paper are mainly
threefold:
1) Comprehensive and systematic overview from various
aspects. We categorize the CNN-based models according
to several taxonomies, including network architecture,
supervision form, learning paradigm, etc. The taxonomies
can give researchers a deep understanding of the
critical techniques of CNN-based methods.
2) Attribute-based performance analysis. Based on the
performance of the SOTA methods, we analyze the reasons
why they perform well and the techniques they utilize.
Besides, we discuss the various challenging factors that
prompt researchers to design more effective algorithms.
3) Open questions and future directions. We look into
some important issues in model design, dataset
collection, and generalization to other domains with
domain adaptation or transfer learning, and explore some
promising research directions for the future.
These contributions provide a detailed and in-depth review,
which differs from previous review or survey works to a
large extent.
Fig. 1: A brief chronology of crowd counting. The first incorporation of deep learning techniques for crowd counting dates from 2015. See
Section I for a more detailed description. Milestone models in this figure: MLR [52], KRR [53], Chan et al. [27], Lempitsky et al. [16],
RR [40], CA-RR [54], Count Forest [48], Wang et al. [49], Fu et al. [50], Cross scene [51], MCNN [1], Hydra-CNN [2], CP-CNN [6],
CMTL [55], switching CNN [5], CSRNet [12], SANet [11], PSSDN [56] and LSF-CNN [57]. The trend in the past few years has been
designing crowd counting models based on multi-column (in green) and single-column (in red) network architectures, and on object
localization or tracking depending on counting techniques (in crimson), which are either contemporary or potential future directions.
Traditional heuristic methods are highlighted with the blue-shaded area, and modern CNN-based density estimation and crowd counting
models with the red-shaded background.
Fig. 2: The overall architecture of this work. We concentrate on modern density map-based approaches, mainly CNN-based, for crowd
counting.
Fig. 3: Comparison of the structure of existing density map-based networks.
The remainder of the paper is organized as follows.
Section II conducts a comprehensive literature review of
mainstream CNN-based density estimation and crowd counting
models according to the proposed taxonomies. Section III
examines the most notable datasets for crowd counting and
some datasets for other object counting tasks, while Section IV
describes several widely used evaluation metrics. Section V
benchmarks some representative models and makes an in-depth
analysis. Section VI presents a discussion and puts forward
some open issues and possible future directions. Finally,
conclusions are drawn in Section VII.
II. TAXONOMY FOR CROWD COUNTING
In this section, we review CNN-based crowd counting
algorithms under the following taxonomies. First are the
representative network architectures for crowd counting (II-A). Next
TABLE I: Summary of previous reviews.
| # | Title | Year | Venue | Brief description |
|---|---|---|---|---|
| 1 | Crowd analysis: a survey [24] | 2008 | MVA | This paper presents a survey on crowd analysis methods employed in computer vision research and discusses perspectives from other research disciplines and how they can contribute to the computer vision approach. |
| 2 | Crowd analysis using computer vision techniques [58] | 2010 | ISPM | A survey on crowd analysis by using computer vision techniques, including different aspects such as people tracking, crowd density estimation, event detection, validation and simulation. |
| 3 | A Survey of Human-Sensing: Methods for Detecting Presence, Count, Location, Track, and Identity [59] | 2010 | ACM Computing Surveys | A survey of the inherently multidisciplinary literature of human-sensing, focusing mainly on the extraction of five commonly needed spatio-temporal properties: namely presence, count, location, track and identity. |
| 4 | Crowd counting and profiling: Methodology and evaluation [60] | 2013 | MSVAC | This study describes and compares the state-of-the-art methods for video imagery based crowd counting, and provides a systematic evaluation of different methods using the same protocol. |
| 5 | Performance evaluation of crowd image analysis using the PETS2009 dataset [61] | 2014 | PRL | This paper presents the PETS2009 crowd analysis dataset and highlights detection and tracking performance on it. |
| 6 | Crowded scene analysis: A survey [62] | 2015 | TCSVT | This paper surveys the state-of-the-art techniques on crowded scene analysis with different methods such as crowd motion pattern learning, crowd behavior, activity analysis and anomaly detection in crowds. |
| 7 | An evaluation of crowd counting methods, features and regression models [63] | 2015 | CVIU | This paper presents an evaluation across multiple datasets to compare holistic, local and histogram based methods, and to compare various image features and regression models. |
| 8 | Recent survey on crowd density estimation and counting for visual surveillance [64] | 2015 | EAAI | This paper presents a survey on crowd density estimation and counting methods employed for visual surveillance in the perspective of computer vision research. |
| 9 | Advances and trends in visual crowd analysis: A systematic survey and evaluation of crowd modelling techniques [65] | 2016 | Neurocomputing | This paper aims to give an account of such issues by deducing key statistical evidence from the existing literature and providing recommendations towards focusing on the general aspects of techniques rather than any specific algorithm. |
| 10 | Crowd scene understanding from video: a survey [66] | 2017 | TOMM | This survey explores crowd analysis as it relates to two primary research areas: crowd statistics and behavior understanding. |
| 11 | A survey of recent advances in cnn-based single image crowd counting and density estimation [67] | 2018 | PRL | A review of various single image crowd counting and density estimation methods with a specific focus on recent CNN-based approaches. |
| 12 | Convolutional neural networks for crowd behaviour analysis: a survey [68] | 2019 | VC | A survey of crowd analysis using CNNs. |
is the learning paradigm of the methods (II-B), and then
the inference manner of the networks (II-C). Additionally,
the supervision forms of the networks are introduced in
II-D. Meanwhile, to evaluate the generalization ability of the
algorithms, we classify existing works into domain-specific
and multi-domain ones (II-E). Finally, based on the supervision
level, we classify the CNN-based models into instance-level
and image-level ones (II-F). We group the important models
and describe them roughly in chronological order. A summary
of the state-of-the-art methods is presented in Table II.
A. Representative network architectures for crowd counting
According to the type of network architecture, we
divide crowd counting models into three categories: basic
CNN based methods, multi-column based methods, and
single-column based methods. The categories of network architectures
are illustrated in Fig. 3.
1) Basic CNN: This network architecture adopts only the
basic CNN layers, i.e., convolutional layers, pooling layers,
and fully connected layers, without requiring additional feature
information. Such networks generally appear in the initial
works using CNNs for density estimation and crowd counting.
• Fu et al. [50] put forward the first CNN-based model for
crowd counting, which improves the speed and accuracy
of the model by removing some similar network connections
existing in feature maps and cascading two ConvNet classifiers.
• Wang et al. [49] propose a deep network based on the AlexNet
architecture [102] for extremely dense crowd counting, which
adopts expanded negative samples, whose ground-truth
counts are zero, to reduce interference.
• CNN-boosting [15] employs basic CNNs in a layer-wise
manner, and leverages layered boosting and selective sampling
to improve the counting accuracy and reduce training time.
Since no additional feature information is provided,
basic CNNs are simple and easy to implement yet usually
achieve low accuracy.
2) Multi-column: These network architectures usually
adopt different columns to capture multi-scale information
corresponding to different receptive fields, which has brought
about excellent performance for crowd counting.
• MCNN [1] is a pioneering work explicitly focusing on the
multi-scale problem. MCNN is a multi-column architecture
with three branches that use different kernel sizes (large,
medium, small). However, the three branches have similar or even
identical depth and structure, which makes the network
look like a simple ensemble of several weak regressors.
• Hydra-CNN [2] uses a pyramid of image patches
corresponding to different scales to learn a multi-scale non-linear
regression model for the final density map estimation.
• CrowdNet [3] combines shallow and deep networks in
different columns, of which the shallow one captures the
low-level features corresponding to large scale variation and the
deep one captures the high-level semantic information.
• Switching CNN [5] trains several independent CNN crowd
density regressors on image patches; the regressors have
the same structure as MCNN [1]. In addition, a switch
TABLE II: Summary of state-of-the-art methods. See Section II for a more detailed description.
| Methods | Year & Venue | Network architecture | Inference manner | Supervision form | Learning paradigm | Supervision level |
|---|---|---|---|---|---|---|
| Fu et al. [50] | 2015 EAAI | Basic | Patch-based | Fully-Sup. | STL | Instance level |
| Wang et al. [49] | 2015 ACMMM | Basic | Patch-based | Fully-Sup. | STL | Instance level |
| Cross scene [51] | 2015 CVPR | Basic | Patch-based | Fully-Sup. | MTL | Instance level |
| MCNN [1] | 2016 CVPR | Multi-column | Whole image-based | Fully-Sup. | STL | Instance level |
| Crowdnet [3] | 2016 ACMMM | Multi-column | Patch-based | Fully-Sup. | STL | Instance level |
| CNN-Boosting [15] | 2016 ECCV | Basic | Patch-based | Fully-Sup. | STL | Instance level |
| Hydra-CNN [2] | 2016 ECCV | Multi-column | Patch-based | Fully-Sup. | MTL | Instance level |
| Shang et al. [69] | 2016 ECCV | Multi-column | Whole image-based | Fully-Sup. | STL | Instance level |
| CMTL [55] | 2017 AVSS | Multi-column | Whole image-based | Fully-Sup. | MTL | Instance level |
| Switching CNN [5] | 2017 CVPR | Multi-column | Patch-based | Fully-Sup. | MTL | Instance level |
| CP-CNN [6] | 2017 ICCV | Multi-column | Whole image-based | Fully-Sup. | MTL | Instance level |
| D-ConvNet [70] | 2018 CVPR | Single-column | Whole image-based | Fully-Sup. | STL | Instance level |
| CSRNet [12] | 2018 CVPR | Single-column | Whole image-based | Fully-Sup. | STL | Instance level |
| DRSAN [71] | 2018 IJCAI | Multi-column | Whole image-based | Fully-Sup. | STL | Instance level |
| DecideNet [7] | 2018 CVPR | Multi-column | Patch-based | Fully-Sup. | MTL | Instance level |
| SaCNN [9] | 2018 WACV | Single-column | Whole image-based | Fully-Sup. | MTL | Instance level |
| SANet [11] | 2018 ECCV | Single-column | Patch-based | Fully-Sup. | MTL | Instance level |
| IG-CNN [72] | 2018 CVPR | Multi-column | Patch-based | Fully-Sup. | MTL | Instance level |
| ic-CNN [73] | 2018 ECCV | Multi-column | Whole image-based | Fully-Sup. | MTL | Instance level |
| ACSCP [74] | 2018 CVPR | Multi-column | Patch-based | Fully-Sup. | MTL | Instance level |
| NetVLAD [75] | 2018 TII | Single-column | Whole image-based | Fully-Sup. | MTL | Instance level |
| CL [76] | 2018 ECCV | Single-column | Patch-based | Fully-Sup. | MTL | Instance level |
| L2R [77] | 2018 CVPR | Basic | Whole image-based | Self-Sup. | MTL | – |
| GAN-MTR [78] | 2018 WACV | Basic | Whole image-based | Semi-Sup. | MTL | – |
| PaDNet [79] | 2019 TIP | Single-column | Patch-based | Fully-Sup. | STL | Instance level |
| ASD [80] | 2019 ICASSP | Multi-column | Whole image-based | Fully-Sup. | MTL | Instance level |
| SPN [81] | 2019 WACV | Single-column | Whole image-based | Fully-Sup. | STL | Instance level |
| SR-GAN [82] | 2019 CVIU | Basic | Whole image-based | Semi-Sup. | MTL | – |
| ADCrowdnet [83] | 2019 CVPR | Single-column | Whole image-based | Fully-Sup. | STL | Instance level |
| SAAN [8] | 2019 WACV | Multi-column | Whole image-based | Fully-Sup. | MTL | Instance level |
| SAA-Net [13] | 2019 CVPR | Single-column | Whole image-based | Fully-Sup. | MTL | Instance level |
| SFCN† [84] | 2019 CVPR | Single-column | Whole image-based | Fully-Sup. | STL | Instance level |
| SE Cycle GAN [84] | 2019 CVPR | Single-column | Whole image-based | Fully-Sup. | STL | Instance level |
| PACNN [85] | 2019 CVPR | Single-column | Whole image-based | Fully-Sup. | STL | Instance level |
| CAN&ECAN [86] | 2019 CVPR | Single-column | Whole image-based | Fully-Sup. | STL | Instance level |
| CFF [87] | 2019 ICCV | Single-column | Whole image-based | Fully-Sup. | MTL | Instance level |
| PCC Net [88] | 2019 TCSVT | Multi-column | Whole image-based | Fully-Sup. | MTL | Instance level |
| SFANet [89] | 2019 CVPR | Single-column | Whole image-based | Fully-Sup. | MTL | Instance level |
| W-Net [90] | 2019 CVPR | Single-column | Whole image-based | Fully-Sup. | STL | Instance level |
| SL2R [91] | 2019 CVPR | Basic | Whole image-based | Self-Sup. | MTL | Instance level |
| TEDnet [92] | 2019 CVPR | Single-column | Whole image-based | Fully-Sup. | STL | Instance level |
| RReg [93] | 2019 CVPR | Multi-column | Whole image-based | Fully-Sup. | STL | Instance level |
| RAZNet [94] | 2019 CVPR | Multi-column | Whole image-based | Fully-Sup. | MTL | Instance level |
| AT-CNN [95] | 2019 CVPR | Single-column | Whole image-based | Fully-Sup. | MTL | – |
| GWTA-CCNN [96] | 2019 AAAI | Single-column | Patch-based | Un-Sup. | STL | – |
| HA-CCN [97] | 2019 TIP | Single-column | Patch-based | Fully-Sup. | STL | Instance level |
| L2SM [98] | 2019 ICCV | Single-column | Whole image-based | Fully-Sup. | STL | Instance level |
| RANet [99] | 2019 ICCV | Multi-column | Whole image-based | Fully-Sup. | STL | Instance level |
| McML [100] | 2019 ACM MM | Multi-column | Whole image-based | Fully-Sup. | STL | Image level |
| ILC [101] | 2019 CVPR | Multi-column | Whole image-based | Fully-Sup./Weak-Sup. | MTL | Instance/Image level |
classifier is also trained alternately with the regressors to
select the best one for density estimation.
• CP-CNN [6] is a contextual pyramid CNN that combines
global and local contextual information to generate
high-quality density maps. Moreover, adversarial learning [103]
is utilized to fuse the features from different levels.
• TDF-CNN [104] delivers top-down information to the
bottom-up network to amend the density estimation.
• DRSAN [71] handles the issues of scale variation and
rotation variation by taking advantage of the Spatial Transformer
Network (STN) [105].
• SAAN [8] is similar in idea to MoC-CNN [106]
and CP-CNN [6], but utilizes a visual attention mechanism to
automatically select the particular scale at both the global
image level and the local image patch level.
• RANet [99] provides local self-attention (LSA) and global
self-attention (GSA) to capture short-range and long-range
interdependence information respectively; furthermore, a relation
module is introduced to merge LSA and GSA to obtain more
informative aggregated feature representations.
• McML [100] incorporates a statistical network into the
multi-column network to estimate the mutual information
between different columns; the proposed mutual learning scheme
optimizes each column alternately while keeping the
other columns fixed on each mini-batch of training data.
• DADNet [107] takes dilated CNNs with different dilation
rates as a front-end to capture more contextual information, and
adaptive deformable convolution as a back-end to locate the
positions of the objects accurately.
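To illustrate the general multi-column idea shared by the methods above, the following toy sketch runs one input through three parallel columns with different kernel sizes and fuses the column outputs with a per-pixel weighted sum (the equivalent of a 1×1 convolution). It is not the architecture of MCNN or any other cited model: each column is a single untrained convolution with random weights, and the kernel sizes (9, 7, 5) and fusion weights are assumptions chosen only for illustration.

```python
import numpy as np

def conv2d_same(image, kernel):
    """Naive single-channel 2D correlation with zero padding; output keeps the input size."""
    kh, kw = kernel.shape
    padded = np.pad(image, ((kh // 2, kh // 2), (kw // 2, kw // 2)))
    out = np.zeros(image.shape, dtype=np.float64)
    for dy in range(kh):
        for dx in range(kw):
            out += kernel[dy, dx] * padded[dy:dy + image.shape[0], dx:dx + image.shape[1]]
    return out

def multi_column_forward(image, column_kernels, fusion_weights):
    """Run each column (conv + ReLU) and fuse the results with a 1x1-style weighted sum."""
    features = [np.maximum(conv2d_same(image, k), 0.0) for k in column_kernels]
    stacked = np.stack(features, axis=0)                  # shape: (columns, H, W)
    return np.tensordot(fusion_weights, stacked, axes=1)  # shape: (H, W)

rng = np.random.default_rng(0)
image = rng.random((32, 48))
# Large, medium and small kernels give each column a different receptive field.
kernels = [rng.normal(0.0, 0.1, (k, k)) for k in (9, 7, 5)]
fusion = np.array([0.5, 0.3, 0.2])  # hypothetical fusion weights

density = multi_column_forward(image, kernels, fusion)
print(density.shape)
```

In a trained network the column kernels and fusion weights would be learned jointly, and each column would contain several layers rather than one.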
Although great progress has been achieved by these
multi-column networks, they still suffer from several significant
disadvantages, which have been demonstrated experimentally
by Li et al. [12]. First of all, multi-column networks are
difficult to train since they require more time and have
a more bloated structure. Next, using different branches with
almost the same network structure inevitably leads to a lot
of information redundancy. Moreover, multi-column networks
always require density-level classifiers before sending images
into the networks. However, since the number of people
varies greatly in congested real-world scenes,
it is difficult to define the granularity of the density levels.
Meanwhile, more fine-grained classifiers also mean that more
columns and more sophisticated structures need to
be designed, thereby causing more redundancy. Finally, these
networks consume a large number of parameters for
density-level classifiers rather than devoting them to the generation
of the final density maps. The resulting lack of parameters for
density map generation degrades the quality.
Given all the disadvantages mentioned above, multi-column
network architectures may be ineffective in a narrow sense.
This motivates many researchers to exploit simpler yet
effective and efficient networks. Therefore, single-column
network architectures have emerged to meet the demands
of more challenging situations in crowd counting.
3) Single column: The single-column network architec-
tures usually deploy single and deeper CNNs rather than the
bloated structure of multi-column network architecture, and
the premise is not to increase the complexity of the network.
• W-VLAD [108] takes account of semantic features and
spatial cues, additionally, a novel locality-aware feature (LAF)
is introduced to represent the spatial information.
• SaCNN [9] is a scale-adaptive CNN that takes an FCN with
fixed small receptive fields as backbone and adapts the feature
maps extracted from multiple layers to the same sizes and then
combines them to generate the final density map.
• D-ConvNet [70] called as De-correlated ConvNet, takes
advantage of negative correlation learning (NCL) to improve
the generalization capability of the ensemble models with a
set of weak regressors with convolutional feature maps.
• CSRNet [12] adopts dilated convolution layers to expand
the receptive field while maintaining the resolution as back-
end network.
• SANet [11] is built on the shoulder of Inception architec-
ture [109] in the encoder to extract multi-scale features and
using Transposed convolution layers in the decoder to up-
sampling the extracted feature maps.
• SPN [81] leverages a shared deep single-column structure
and extracts the multi-scale features in the high-layers by Scale
Pyramid Module (SPM), which deploys four parallel dilated
convolution with different dilation rates.
• ADCrowdNet [83] combines visual attention mechanism
and multi-scale deformable convolutional scheme into a cas-
cading framework.
• SAA-Net [13] mimics multi-branches but single column by
learning a set of soft gate attention mask on the intermediate
feature maps, which uses the hierarchical structure of CNNs.
The ides behind it is somewhat similar to SaCNN [9] but
adding attention mask on corresponding feature maps.
• W-Net [90] is inspired by U-Net [110], adding an auxiliary
Reinforcement branch to accelerate the convergence and retain
local pattern consistency, and using Structural Similarity Index
(SSIM) to estimate the final density maps.
6
• TEDnet [92] is a trellis encoder-decoder network archi-
tecture, which integrates multiple decoding paths to capture
multi-scale features and exploits dense skip connections to
obtain the supervised information. In addition, to alleviate the
gradient vanishing problem and improve the back-propagation
ability, a combinational loss comprising local coherence and
spatial correlation loss is also presented.
Owing to their architectural simplicity and training
efficiency, single-column network architectures have received
increasing attention in recent years.
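To make the receptive-field argument behind dilated convolutions (as used by CSRNet and SPN above) concrete, the following minimal sketch computes the receptive field of a stack of stride-1 convolutions. It relies only on the standard relation that a kernel of size k with dilation rate d spans k + (k - 1)(d - 1) pixels. The function names and the layer configurations are illustrative assumptions, not taken from any of the cited networks.

```python
def effective_kernel(kernel_size: int, dilation: int) -> int:
    """Span covered by one dilated kernel: k + (k - 1) * (d - 1)."""
    return kernel_size + (kernel_size - 1) * (dilation - 1)


def receptive_field(layers):
    """Receptive field of a stack of stride-1 convolutions.

    `layers` is a list of (kernel_size, dilation) pairs (hypothetical format).
    """
    rf = 1
    for kernel_size, dilation in layers:
        # Each stride-1 layer widens the field by (effective kernel - 1).
        rf += effective_kernel(kernel_size, dilation) - 1
    return rf


plain = [(3, 1)] * 3    # three ordinary 3x3 convolutions
dilated = [(3, 2)] * 3  # same depth and weight count, dilation rate 2

print(receptive_field(plain))    # 7
print(receptive_field(dilated))  # 13
```

With the same number of weights and no pooling, the dilated stack sees a 13-pixel-wide context instead of 7, which is why the output resolution can be preserved while the receptive field grows.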
B. Learning paradigm
In terms of the learning paradigm, crowd counting
networks can be divided into single-task and multi-task based
methods.
1) Single-task based methods: The classical methodology
is to learn one task at a time, i.e., single-task
learning [111]. Most CNN-based crowd counting methods belong
to this paradigm: they either generate a density map and
then sum all its pixels to obtain the total count, or regress
the count directly.
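The density-map route described above can be sketched as follows: each annotated head position contributes one unit-mass Gaussian, so summing all pixels recovers the count. This is a minimal illustration with a fixed kernel; the values of `sigma` and `radius`, and the choice to renormalise kernels that are truncated at the image border, are assumptions made for the example rather than the settings of any particular method.

```python
import math


def gaussian_kernel(radius: int, sigma: float):
    """Return a (2r+1) x (2r+1) Gaussian kernel normalised to sum to 1."""
    size = 2 * radius + 1
    kernel = [[math.exp(-((x - radius) ** 2 + (y - radius) ** 2) / (2 * sigma ** 2))
               for x in range(size)] for y in range(size)]
    total = sum(map(sum, kernel))
    return [[v / total for v in row] for row in kernel]


def density_map(points, height, width, sigma=1.5, radius=4):
    """Place one unit-mass Gaussian per annotated point (row, col) inside the image."""
    kernel = gaussian_kernel(radius, sigma)
    density = [[0.0] * width for _ in range(height)]
    for row, col in points:
        # Keep only the kernel cells that fall inside the image.
        cells = []
        for dy in range(-radius, radius + 1):
            for dx in range(-radius, radius + 1):
                y, x = row + dy, col + dx
                if 0 <= y < height and 0 <= x < width:
                    cells.append((y, x, kernel[dy + radius][dx + radius]))
        # Assumed design choice: renormalise so each person still sums to 1.
        mass = sum(w for _, _, w in cells)
        for y, x, w in cells:
            density[y][x] += w / mass
    return density


def count_from_density(density):
    """Total count is the sum (discrete integral) over all pixels."""
    return sum(map(sum, density))


heads = [(5, 5), (0, 0), (10, 20)]  # hypothetical annotations
dmap = density_map(heads, height=16, width=24)
print(round(count_from_density(dmap), 6))
```

The sum equals the number of annotated points up to floating-point rounding, which is the property that lets a network trained to regress the map also produce a count.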
2) Multi-task based methods: More recently, inspired by
the success of multi-task learning in various computer vision
tasks, methods that combine density estimation with other
tasks such as classification, detection, and segmentation have
shown better performance. Multi-task based methods are
generally designed with multiple subnets; in contrast to a pure
single-column architecture, they may contain additional
branches corresponding to the different tasks. In summary,
multi-task architectures can be regarded as a cross between
multi-column and single-column architectures, yet they differ
from both.
• CMTL [55] combines crowd count classification and density
map estimation into an end-to-end cascaded framework. It
classifies the crowd count into groups and integrates this
classification as a high-level prior into the density map
estimation network.
• Decidenet [7] predicts the crowd count by generating
detection-based and regression-based density maps separately.
To decide adaptively which one is appropriate, an attention
module guides the network to allocate relative weights to the
two, so that it can automatically switch between the detection
and regression modes. However, it may suffer from a large
number of parameters because of its multi-column structure.
• IG-CNN [72] is a hierarchical clustering model, which
generates image groups in the dataset together with a set of
networks, each specialized in its respective group. It can adapt
and grow with the complexity of the dataset.
• ic-CNN [73] puts forward a two-branch network: one branch
generates low-resolution density maps, and the other refines
the low-resolution maps together with feature maps extracted
from previous layers to produce higher-resolution density
maps.
• ACSCP [74] introduces an adversarial loss to sharpen
blurry density maps. Moreover, a scale-consistency
regularizer is designed to guarantee the calibration of the
cross-scale model and the collaboration between different scale paths.
• CL [76] simultaneously addresses three tasks, namely
crowd counting, density map estimation, and localization in
dense crowds, based on the observation that these tasks are
related to each other, which makes the loss function used to
optimize the deep CNN decomposable.
• CFF [87] argues that point annotations are useful for more
than constructing density maps and repurposes them for
free in two ways: one is supervised focus from segmentation,
and the other is supervised focus from global density. Focus
for free can be regarded as a complement to other approaches,
and it benefits counting regardless of the base network.
• PCC Net [88] takes perspective change into account and
is composed of three components: Density Map Estimation
for learning local features, Random High-level Density
Classification for predicting density labels of image patches, and
Fore-/Background Segmentation (FBS) for segmenting the
foreground and background.
• RAZ-Net [94] observes that the density map is not consistent
with the correct person density, which implies that crowd
localization cannot depend on the density map. A recurrent
attentive zooming network is proposed to increase the
resolution for localization, and an adaptive fusion strategy is
used to let counting and localization benefit each other.
• ATCNN [95] fuses three heterogeneous attributes, i.e.,
geometric, semantic, and numeric attributes, taking them as
auxiliary tasks to assist the crowd counting task.
• CDT [112] not only makes an overall comparison of density
maps on counting, but also extends to detection and tracking.
• NetVLAD [75], [113] is a multi-scale and multi-task
framework which assembles multi-scale features captured from the
input image into a compact feature vector by means of the
"Vector of Locally Aggregated Descriptors" (VLAD).
Additionally, "deeply supervised" operations are exploited on the
bottom layers to provide additional information to boost the
performance.
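Many of the multi-task methods listed above share a common training pattern: a main density regression loss is combined with one or more auxiliary task losses. The sketch below illustrates that pattern for a density map plus a count-level classification branch, in the spirit of CMTL. It is not the loss of any specific paper; the weighting factor `aux_weight` and all function names are assumptions made for illustration.

```python
import math


def density_mse(pred, target):
    """Mean squared error between two density maps given as nested lists."""
    diffs = [(p - t) ** 2
             for pred_row, target_row in zip(pred, target)
             for p, t in zip(pred_row, target_row)]
    return sum(diffs) / len(diffs)


def cross_entropy(logits, label):
    """Softmax cross-entropy for one sample (auxiliary classification task)."""
    peak = max(logits)  # subtract the maximum for numerical stability
    log_sum = peak + math.log(sum(math.exp(z - peak) for z in logits))
    return log_sum - logits[label]


def multi_task_loss(pred_density, gt_density, class_logits, gt_class, aux_weight=0.1):
    """Main density loss plus a weighted auxiliary loss (aux_weight is hypothetical)."""
    main = density_mse(pred_density, gt_density)
    auxiliary = cross_entropy(class_logits, gt_class)
    return main + aux_weight * auxiliary


pred = [[0.1, 0.0], [0.0, 0.4]]
gt = [[0.0, 0.0], [0.0, 0.5]]
print(multi_task_loss(pred, gt, class_logits=[2.0, 0.5, -1.0], gt_class=0))
```

In practice the auxiliary weight is a tuning choice: a weight that is too large lets the auxiliary task dominate, while a weight of zero reduces the model to single-task training.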
C. Inference manner
Depending on the inference manner, CNN-based
crowd counting approaches can be classified into patch-based
and whole image-based methods.
1) Patch-based methods: This inference manner requires
training on patches randomly cropped from the
image. In the test phase, a sliding window is moved over
the whole test image, an estimate is obtained for each
window, and the estimates are assembled to obtain the final
total count of the image.
• Cross-scene [51] randomly selects overlapping patches from
the training images to serve as training samples, and the
density maps of corresponding image patches are treated as
the ground truth. The total count of the selected training patch
is computed by integrating over the density map, so the
count is a decimal rather than an integer.
• CCNN [2] primarily learns a regression function that
projects the appearance of image patches onto their
corresponding object density maps. The model uses the same
size for all patches and the same covariance value of the
Gaussian function in the ground-truth density map generation
process, which limits its accuracy in scenarios with large
scale variation.
• DML [114] integrates metric learning into a deep regres-
sion network, which can simultaneously extract density-level
features and learn better distance measurement.
• PaDNet [79] presents a novel Density-Aware Network (DAN)
module to discriminate the variable density of crowds, and
a Feature Enhancement Layer (FEL) module to boost the
global and local recognition performance.
• L2SM [98], [115] attempts to address the density pattern
shift issue, which results from nonuniform density between
sparse and dense regions, by providing two modules,
i.e., a Scale Preserving Network (SPN) to obtain patch-level
density maps and a learning to scale module (L2SM) to compute
scale ratios for dense regions.
• GSP [116] devises a global sum pooling operation to replace
global average pooling (GAP) or fully connected layers (FC),
considering the counting task as a simple linear mapping
problem and avoiding patchwise cancellation and overfitting
in the training phase with small datasets of large images.
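The test-time procedure of patch-based methods can be sketched as follows. For simplicity the sketch tiles the image with non-overlapping windows, so that per-window counts can simply be added; with overlapping windows the overlapping regions would have to be averaged or otherwise de-duplicated. The estimator passed in is a stand-in for a trained patch-level network, and all names here are hypothetical.

```python
def split_into_patches(height, width, patch):
    """Yield (top, left, bottom, right) boxes of non-overlapping tiles covering the image."""
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            # Tiles on the bottom and right borders may be smaller than `patch`.
            yield top, left, min(top + patch, height), min(left + patch, width)


def patch_based_count(image, patch, estimate_patch_count):
    """Slide a non-overlapping window over the image and sum the per-window estimates."""
    height, width = len(image), len(image[0])
    total = 0.0
    for top, left, bottom, right in split_into_patches(height, width, patch):
        crop = [row[left:right] for row in image[top:bottom]]
        total += estimate_patch_count(crop)
    return total


def oracle_estimator(crop):
    """Stand-in for a trained model: treats the crop as a ground-truth density map."""
    return sum(map(sum, crop))


# A toy 5x7 "image" whose pixel values already encode density.
toy = [[0.0] * 7 for _ in range(5)]
toy[0][0], toy[2][3], toy[4][6] = 1.0, 0.5, 1.0
print(patch_based_count(toy, patch=3, estimate_patch_count=oracle_estimator))
```

Because every pixel belongs to exactly one tile, the total equals the sum over the whole image; the price is one forward pass per tile and no context beyond the tile borders.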
2) Whole image-based methods: Patch-based methods
tend to neglect global information and incur a high
computational cost due to the sliding window operation. Whole
image-based methods therefore take the whole image as input
and output the corresponding density map or the total crowd
count, which makes training converge more easily but may
sometimes lose local information.
• JLLG [69] feeds the whole image into a pre-trained CNN
to obtain high-level features, then maps these features to local
counting numbers. It takes advantage of contextual information
in both the global and local counts.
• Weighted VLAD [117] integrates semantic information into
learning locality-aware feature (LAF) sets for crowd counting.
It first maps the original pixel space onto a dense attribute
feature map and then uses the LAF to capture more spatial
context and local information.
D. Supervision form
According to whether human-labeled annotations are
used for training, crowd counting methods can be classified
into two categories: fully-supervised methods and un-/self-
/semi-supervised methods.
1) Fully-supervised methods: The vast majority of
CNN-based crowd counting methods rely on large-scale,
diversified, and accurately hand-annotated data. However,
acquiring such data is time-consuming and imposes a heavier
labeling burden than usual. Moreover, because labeled data
are scarce, these methods may suffer from over-fitting,
which leads to a significant degradation in
performance when they are transferred to the wild or to other
domains. Therefore, training with less labeled data, or even
without it, is a promising research topic for the future.
2) Un/semi/weakly/self-supervised methods: Un-/semi-supervised
learning denotes learning without or with only a few
ground-truth labels, while self-supervised learning adds
an auxiliary task that is different from but related
to the supervised task. Some methods that exploit unlabeled
data for training have achieved performance comparable to
that of supervised methods.
• GWTA-CCNN [96] presents a stacked convolutional
autoencoder based on the Grid Winner-Take-All [118] paradigm for
unsupervised feature learning, in which 99% of the parameters
can be trained without any labeled data.
• SR-GAN [82] generalizes semi-supervised GANs from
classification problems to regression problems by introducing
a loss function of feature contrasting.
• GAN-MTR [78] applies semi-supervised GAN
objectives to the multiple object regression problem and trains
the same basic network as [51] with the use of unlabeled
data.
• DG-GAN [119] presents a semi-supervised dual-goal GAN
framework that both estimates the number of individuals in the
crowd scene and discriminates between real and fake images.
• CCLL [120] puts forward a semi-supervised method that
uses a submodular method to choose the most representative
frames from the sequences, circumventing redundancy and
retaining densities; graph Laplacian regularization and
spatiotemporal constraints are also incorporated into the model.
• L2R [77], [91] exploits unlabeled crowd data for pre-training
CNNs in a multi-task framework, which is inspired by
self-supervised learning and based on the observation that the
crowd count of a patch must be less than or equal to that of
the larger patch that contains it. The method is fully
supervised in essence but adds an auxiliary count-ranking task
trained in a self-supervised manner.
• HA-CNN [97] offers the first attempt to fine-tune the
network to new scenes in a weakly supervised manner, by
leveraging image-level density-level labels of crowd
images.
• CCWld [84] provides a data collector and labeler for
crowd counting, where the data is from an electronic game.
With the collector and labeler, it can collect and annotate
data automatically, and the first large-scale synthetic crowd
counting dataset is constructed.
• CODA [121] presents a novel scale-aware adversarial den-
sity adaption approach for object counting, which can be
used to generalize the trained model to unseen scenes in an
unsupervised manner.
• OSSS [122] designs a one-shot scene-specific crowd
counting model by taking advantage of fine-tuning.
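The ranking constraint exploited by L2R above (a sub-patch can never contain more people than the patch that contains it) can be turned into a training signal on unlabeled images. The sketch below generates nested crops and applies a margin-based hinge penalty to each ordered pair of predicted counts. The hinge form, the `margin` parameter, and the crop scheme are assumptions chosen for illustration and are not claimed to match the original implementation.

```python
def nested_crops(height, width, steps, border):
    """Centered boxes (top, left, bottom, right), each contained in the previous one."""
    boxes = []
    for k in range(steps):
        top, left = k * border, k * border
        bottom, right = height - k * border, width - k * border
        if bottom <= top or right <= left:
            break  # the crop would be empty, so stop shrinking
        boxes.append((top, left, bottom, right))
    return boxes


def ranking_hinge_loss(count_large, count_small, margin=0.0):
    """Zero when the containing patch has at least `margin` more than the sub-patch."""
    return max(0.0, count_small - count_large + margin)


def total_ranking_loss(predicted_counts, margin=0.0):
    """Sum the hinge penalty over consecutive (larger, smaller) crop pairs."""
    pairs = zip(predicted_counts, predicted_counts[1:])
    return sum(ranking_hinge_loss(large, small, margin) for large, small in pairs)


print(nested_crops(10, 10, steps=3, border=2))
# Hypothetical predictions for three nested crops, largest first.
print(total_ranking_loss([12.0, 9.5, 10.0]))
```

In the second call only the last pair violates the ordering (10.0 predicted inside a crop for which 9.5 was predicted), so only that pair contributes to the loss. No count labels are needed, which is what makes the task self-supervised.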
E. Domain adaptation
Almost all the existing counting methods are designed
for a specific domain; therefore, designing a counting model
that can count objects in any domain is a challenging yet
meaningful task. The domain adaptation technique may be a
powerful tool to tackle this problem.
• CAC [123] formulates counting as a matching problem
and presents a Generic Matching Network (GMN) that works in a
class-agnostic manner. Because counting is cast as a matching
problem, GMN can be trained on the large amount of
video data labeled for tracking. In a few-shot learning setting,
it can use an adapter module to adapt to different domains.
• PPPD [124] provides a patch-based, multi-domain object
counting network by leveraging a set of domain-specific
scaling and normalization layers, which use only a few
parameters. It can also be extended to perform visual domain
classification, even in an unseen domain.
• SE CycleGAN [84] takes advantage of the domain adaptation
technique, incorporating the Structural Similarity Index
(SSIM) [125] into the traditional CycleGAN framework to bridge
the domain gap between synthetic data and real-world data.
• MFA+SDA [126] draws on the idea of SE CycleGAN
and is also a GAN-based adaptation model. The authors
propose a Multi-level Feature-aware Adaptation to reduce the
domain gap and present a Structured Density map Alignment
for handling unseen crowd scenes.
• DACC [127] is composed of two modules: Inter-domain
Features Segregation (IFS) and Gaussian-prior Reconstruction
(GPR). IFS is designed to translate the synthetic data to
realistic images, and GPR is used to generate higher-fidelity
density maps with pseudo labels.
• FSC [128] extracts semantic domain-invariant features via
crowd masks generated by a pre-trained crowd segmentation
model. Estimation errors in the background regions are
reduced significantly.
F. Instance-/image-based supervision
The aim of object counting is to estimate the number of
objects. If the ground truth is labeled with points or bounding
boxes, the method belongs to instance-level supervision. In
contrast, image-level supervision only requires the number
of object instances in the image.
1) Instance-level supervision: Most crowd density
estimation methods are based on instance-level (point-level or
bounding box) supervision, which needs hand-labeled
annotations for each instance location.
2) Image-level supervision: Image-level supervision-based
methods need to count the number of instances within
or beyond the subitizing range and do not require location
information. This can be regarded as estimating the count at
one shot or glance [129].
• ILC [101] generates a density map of object categories,
from which the total object count and the spatial
distribution of object instances are obtained simultaneously.
III. DATASETS
With the rapid development of crowd counting, numerous
datasets have been introduced, which motivate
many more algorithms to address various challenges such as
scale variation, background clutter in surveillance video,
and changeable environments and illumination variation in the
wild. In this section, we review almost all the crowd counting
datasets from the earliest to the present. Table III summarizes some
representative datasets, including crowd counting datasets with
real-world data and one with synthetic data. For the sake of
completeness, we also survey several datasets from other
domains to evaluate the generalization ability of the designed
algorithms. The datasets are sorted chronologically, and their
specific statistics are listed in Table III. Some samples
from the representative datasets are depicted in Fig. 4.