Survey on Reliable Deep Learning-Based Person Re-Identification
Models: Are We There Yet?
Bahram Lavi∗1, Ihsan Ullah2, Mehdi Fatan3, and Anderson Rocha1
1Institute of Computing, University of Campinas (UNICAMP), Campinas, São Paulo, Brazil.
2Data Mining & Machine Learning Group, Discipline of IT, National University of Ireland Galway, Ireland.
3Department of Computer Engineering and Mathematics, University Rovira i Virgili, Tarragona, Spain
Abstract
Intelligent video-surveillance (IVS) is currently an active research field in computer vision and
machine learning and provides useful tools for surveillance operators and forensic video investiga-
tors. Person re-identification (PReID) is one of the most critical problems in IVS; it consists
of recognizing whether or not an individual has already been observed by a camera in a network.
Solutions to PReID have myriad applications, including retrieval of video sequences showing an
individual of interest and pedestrian tracking over multiple camera views. Different techniques
have been proposed in the literature to increase the performance of PReID, and more recently
researchers have utilized deep neural networks (DNNs) given their compelling performance on similar
vision problems and fast execution at test time. Given the importance and wide range of applica-
tions of re-identification solutions, our objective herein is to discuss the work carried out in the area
and come up with a survey of state-of-the-art DNN models being used for this task. We present
descriptions of each model along with their evaluation on a set of benchmark datasets. Finally, we
show a detailed comparison among these models, followed by a discussion of their limitations that
can serve as guidelines for future research.
1 Introduction
The importance of security and safety of people in society at large is continuously growing. Gov-
ernmental and private organizations are seriously concerned with the security of public areas such as
airports and shopping malls. It requires significant effort and financial expense to provide security to
the public. To optimize such efforts, video surveillance systems are playing a pivotal role. Nowadays,
the growing panoply of video cameras serves as a useful tool for addressing various kinds of security
issues such as forensic investigations, crime prevention, and safeguarding restricted areas.
The daily, continuous recording of videos from camera networks results in daunting amounts of
footage to analyze in a manual video-surveillance setting. Surveillance operators need to scan all of
it, often simultaneously, for specific incidents or anomalies, which is a challenging and tiresome task.
Intelligent video surveillance systems (IVSS) aim to automate the monitoring and analysis of videos
from camera networks to help surveillance operators handle and understand the acquired videos.
This makes IVSS one of the most active and challenging research areas in computer engineering and
computer science, in which computer vision (CV) and machine-learning (ML) techniques play a key
role. This field of research enables various tools, such as online applications for people/object detection
and tracking and for recognizing suspicious actions/behaviors in the camera network, as well as off-line
applications that support operators and forensic investigators in retrieving images of an individual of
interest from video frames acquired from different camera views.
Person re-identification is one of the problems of interest in IVSS. It consists of recognizing an individ-
ual over a network of video surveillance cameras with possibly non-overlapping fields of view [1, 2, 3].
In general, the application of PReID is to support surveillance operators and forensic investigators in
retrieving videos showing an individual of interest, given an image as a query (a.k.a. probe). Therefore,
video frames or tracks of all the individuals (a.k.a. template gallery) recorded by the camera network
are sorted in descending order of similarity to the probe. This allows the user to find occurrences (if
any) of the individual of interest in the top positions.

∗Corresponding author: bahram.lavi@ic.unicamp.br

Figure 1: Standard person re-identification system. Given a probe image and a set of template images,
the goal is to generate a robust image signature from each, compute the similarity between them, and
finally present the results as a sorted, ranked list.
Person re-identification is a challenging task due to low image resolution, unconstrained poses,
illumination changes, and occlusions, which hinder the use of robust biometric features such as the
face. Instead, cues like gait and anthropometric measures have been used in some existing PReID
systems. Most of the existing techniques rely on defining a specific descriptor of clothing appearance
(typically including color and texture) and a specific similarity measure between a pair of descriptors
(evaluated as a matching score), which can be either manually defined or learned directly from
data [1, 4, 5, 6, 7].
Standard PReID Methodology: For a given image of an individual (a.k.a. probe), a PReID
system aims to find the corresponding images of that person within the gallery of templates. Note
that the construction of the template gallery depends on the re-identification setup, which can be
categorized as: (i) single-shot, with only one template frame per individual, and (ii) multiple-shot,
with more than one template frame per individual. In the latter case, a continuous PReID system can
be employed in real time, whereby the individual of interest is continuously matched against the
template images in the gallery set, using the currently seen frame as a probe. Figure 1 demonstrates a
basic PReID framework. After an image description is generated for the probe and for the template
images of the gallery set, matching scores between the probe and each template are computed; finally,
the ranked list is generated by sorting the matching scores in decreasing order.
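As a concrete illustration of this matching-and-ranking step, consider the minimal sketch below. It assumes feature vectors have already been extracted by some descriptor (a placeholder for any of the methods surveyed here) and uses cosine similarity as the matching score; both are illustrative choices rather than a prescribed design.

```python
import numpy as np

def rank_gallery(probe_feat, gallery_feats, gallery_ids):
    """Sort gallery templates by decreasing similarity to the probe.

    probe_feat:    (d,) feature vector of the probe image.
    gallery_feats: (n, d) matrix of gallery template feature vectors.
    gallery_ids:   length-n list of identity labels for the gallery.
    """
    # Cosine similarity between the probe and every gallery template.
    probe = probe_feat / np.linalg.norm(probe_feat)
    gallery = gallery_feats / np.linalg.norm(gallery_feats, axis=1, keepdims=True)
    scores = gallery @ probe
    # Ranked list: highest matching score first; a correct match should
    # ideally appear in the top positions (e.g., rank-1).
    order = np.argsort(-scores)
    return [(gallery_ids[i], float(scores[i])) for i in order]
```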
Many existing descriptors rely on hand-crafted features. Deep-learning (DL) models – e.g.,
convolutional neural networks (CNNs) [8, 9, 10] – have been particularly used to solve the problem
of PReID by learning from data. A CNN-based model generates a set of feature maps, whereby each
pixel of a given image corresponds to a specific feature representation. The desired output is produced at
the top of the employed model. There are different approaches to train a deep neural network (DNN)
model: it can be trained in a supervised, semi-supervised, or unsupervised manner depending on
the problem scenario and the availability of labelled data. In the task of PReID, only a small set of
training data is available. Thus, developing a learning model in a semi- or un-supervised manner is
usually a challenging task, and the model might result in failure or poor performance. Most of the
papers discussed in this survey engage with supervised learning techniques, and only a few of them
consider semi- or un-supervised approaches. Further, we consider the models used for PReID in three
categories: single, pairwise, and triplet feature-learning strategies. Details are presented
and discussed in Section 3. This paper presents the state-of-the-art PReID techniques based on
DNNs and provides detailed information about them. The literature review covers papers published
from 2014 to date. We provide a taxonomy of deep feature-learning methods for PReID, including
comparisons, limitations, future research directions, and opportunities for research on the horizon.
Unlike [11, 12], we provide a comprehensive and detailed review of the existing techniques, particularly
the more modern ones that rely upon DNN feature-learning strategies. We stress that, in this paper,
we only consider recent DNN techniques directly involved in the PReID task. For each technique, we
analyze its experimental results and further compare the achieved performances from different
perspectives, for instance contrasting DNN performance under different learning strategies (e.g.,
single, pairwise, and triplet learning).
The structure of this paper is organized as follows: Section 2 briefly explains the benchmark datasets
employed for PReID. Section 3 describes DNN methods by highlighting the impact of their important
components, such as objective functions, loss functions, and data augmentation, among others. Section
4 discusses performance measures, results and their comparisons, and limitations and future
directions. Finally, Section 5 concludes the paper and gives final remarks about PReID and the
paper.
2 Person Re-identification Benchmark Data sets
Data is one of the important factors for current DNN models.
Some factors must be taken into account to reach a reliable recognition rate when evaluating person
re-identification solutions. Each dataset is collected to specially target one or more of these factors.
The factors that create issues for PReID task includes occlusion (apparent in i-LIDS dataset) and
illumination variation (common in most of them). On the other hand, background and foreground
segmentation to distinguish the person’s body is a challenging task. Some of the datasets provide
the segmented region of a person’s body (e.g., on VIPeR, ETHZ, and CAVIAR datasets). While
other datasets have been prepared to evaluate the re-identification task. The most widely datasets
are VIPeR, CUHK01, and CUHK03. VIPeR, CAVIAR, and PRID datasets are used when only two
fixed camera views are given to evaluate the performance of person re-identification methods. Table. 1
gives a summary of each dataset. Below we briefly discuss each of them.
VIPeR [4]: VIPeR is a challenging dataset due to its small number of images for each individual.
It is made up of two images each of 632 individuals from two camera views and exhibits pose and
illumination variations. The images are cropped and scaled to 128 × 48 pixels. This is one of the most
widely used datasets for PReID and a good starting point for new researchers in PReID. Enhancing
rank-1 performance on this dataset is still an open challenge.
i-LIDS [13]: It contains 476 images of 119 pedestrians taken at an airport hall from non-
overlapping cameras with pose and lighting variations and strong occlusions. A minimum of two
images and an average of four images exist for each pedestrian.
ETHZ [14]: It contains three video sequences of a crowded street from two moving cameras;
images exhibit considerable illumination changes, scale variations, and occlusions. The images are of
different sizes, and the dataset provides multiple images per individual in each sequence. Sequences
1, 2, and 3 have 83, 35, and 28 pedestrians, respectively.
CAVIAR [15]: It contains 72 persons and two views in which 50 persons appear in both views
while 22 persons appear only in one view. Each person has five images per view, with different
appearance variations due to resolution changes, light conditions, occlusions, and different poses.
CUHK: This dataset is divided into three distinct partitions with specific setups. CUHK01 [16]
includes 1,942 images of 971 pedestrians. It consists of image pairs captured in two disjoint camera
views: camera (A) shows several variations of viewpoint and pose, while camera (B) mainly includes
images of the frontal and back views. CUHK02 [17] contains 1,816 individuals captured by five pairs
of camera views (P1-P5, ten camera views in total). The pairs include 971, 306, 107, 193, and 239
individuals, respectively, and each individual has two images in each camera view. This dataset is
employed to evaluate performance when the camera views in the test set differ from those in
training. Finally, CUHK03 [18] includes 13,164 images of 1,360 pedestrians. This data set has been
captured with six surveillance cameras. Each identity is observed by two disjoint camera views and
has an average of 4.8 images in each view; all manually cropped pedestrian images exhibit illumination
changes, misalignment, occlusions, and missing body parts.
PRID [19]: This dataset is specially designed for PReID, focusing on a single-shot scenario.
It contains two image sets of 385 and 749 persons captured by camera A and camera B, respectively.
The two subsets share 200 persons in common.
WARD [20]: This dataset has 4,786 images of 70 persons acquired in a real surveillance scenario
with three non-overlapping cameras, exhibiting large illumination, resolution, and pose changes.
Re-identification Across indoor-outdoor Dataset (RAiD) [21]: It comprises 6,920 bound-
ing boxes of 43 identities captured by four cameras, of which the first two are indoors while the
remaining two are outdoors. Images show considerable illumination variations because of the indoor/
outdoor changes.
Market-1501 [22]: A total of six cameras are used, including five high-resolution cameras and one
low-resolution camera; overlap exists among different cameras. Overall, this dataset contains 32,668
annotated bounding boxes of 1,501 identities. Among them, 12,936 images of 751 identities are
used for training, and 19,732 images of 750 identities plus distractors are used for the gallery set.
MARS [23]: This dataset comprises 1,261 identities with each identity captured by at least two
cameras. It consists of 20,478 tracklets and 1,191,003 bounding boxes.
DukeMTMC [24]: This dataset contains 36,441 manually-cropped images of 1,812 persons cap-
tured by eight outdoor cameras. The data set gives access to some additional information such as full
frames, frame-level ground-truth, and calibration details.
MSMT [25]: It consists of 126,441 images of 4,101 individuals acquired from 12 outdoor and three
indoor cameras, with different illumination changes, poses, and scale variations.
RPIfield [26]: This dataset was constructed using 12 synchronized cameras recording 112 explic-
itly time-stamped actor pedestrians moving along specific paths among roughly 4,000 distractor
pedestrians.
Indoor Train Station Dataset (ITSD) [27]: This dataset contains images of people captured
by a real-world surveillance camera at a railway station. Images are 64 × 128 pixels, and the dataset
comprises 5,607 images of 443 identities with different viewpoints.
Dataset        Year   Crop image size
VIPeR          2007   128 × 48
ETHZ           2007   vary
PRID           2011   128 × 64
CAVIAR         2011   vary
WARD           2012   128 × 48
CUHK01         2012   160 × 60
CUHK02         2013   160 × 60
CUHK03         2014   vary
i-LIDS         2014   vary
RAiD           2014   128 × 64
Market-1501    2015   128 × 64
MARS           2016   256 × 128
DukeMTMC       2017   vary
MSMT           2018   vary
RPIfield       2018   vary
ITSD           2019   64 × 128

Table 1: Summary of benchmark PReID datasets (release year and cropped image size). The datasets
further differ in whether they provide multiple images and multiple cameras per identity, and in their
illumination, pose, partial-occlusion, and scale variations, as discussed in the text.
3 Deep Neural Networks for PReID
Deep learning techniques have been widely applied to several CV problems, owing to the dis-
criminative and generalization power of the learned models, which results in promising performance.
PReID is one of the challenging tasks in CV for which DL models are currently among the best
choices in the research community. In the following, we provide an overview of recent DL works for
the task of PReID. Several interesting DL models have been proposed to improve PReID performance.
These state-of-the-art DL approaches can be categorized by the learning methodology of the models
utilized in the PReID systems. Some works treat PReID as a standard classification problem. Others
address the lack of training data samples in the PReID task and propose learning models that learn
more discriminative features from pair or triplet units. Figure 2 shows the taxonomy of the types of
models being used for PReID that will be discussed in the coming subsections of this paper.
Figure 2: Taxonomy of deep feature-learning methods for PReID
3.1 Single Feature-Learning Based Methods
A model based on a single feature-learning model or single deep model can be developed similarly
to other multi-class classification problems.
In a PReID system, a classification model is designed
to determine the probability of identity of an individual that it belongs to [28]. Figure 3 shows an
example of a DL based model for a single feature-learning PReID model. This single stream deep
model can be further divided in following categories as being shown in Figure 2.
Figure 3: Single feature-learning model in a PReID system: the model takes the raw image of an
individual as input and computes the probability of the corresponding class of the individual.
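As a rough sketch of such a single-stream identity classifier, consider the PyTorch example below; the tiny backbone, layer sizes, and feature dimension are illustrative placeholders and not the architecture of any particular paper surveyed here.

```python
import torch.nn as nn

class SingleStreamReID(nn.Module):
    """Single feature-learning model: raw image in, identity logits out."""

    def __init__(self, num_identities, feat_dim=256):
        super().__init__()
        # Small illustrative convolutional backbone.
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.embed = nn.Linear(64, feat_dim)          # image signature
        self.classifier = nn.Linear(feat_dim, num_identities)

    def forward(self, x):
        feat = self.embed(self.backbone(x))
        logits = self.classifier(feat)
        # The signature is used for ranking at test time; the logits are
        # trained with a softmax/cross-entropy loss over identities.
        return feat, logits
```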
Deep model features fusion with hand-crafted features: A number of papers have been pub-
lished that boost the performance of PReID by generating deep features. Among them, some works
additionally involve hand-crafted features as complementary features fused alongside the
DL features. These features are further reduced using traditional dimensionality-reduction tech-
niques – e.g., principal component analysis (PCA).
Wu et al. [29] proposed a feature-fusion DNN that regularizes CNN features jointly with hand-crafted
features. The network takes a single image of size 224 × 224 × 3 pixels as input, together with
hand-crafted features extracted using a state-of-the-art PReID descriptor (the best performance was
obtained with the ensemble of local features (ELF) descriptor [30]). Both extracted features are then
fed to a buffer layer and a fully-connected layer, which together act as a fusion layer. The
buffer layer is essential for the fusion since it bridges the gap between two features from different
domains (i.e., hand-crafted features and deep features). A softmax loss layer then takes
the output vector of the fully-connected layer to minimize the cross-entropy loss, and the network
outputs the deep feature representation. The whole network is trained by applying mini-batch stochastic
gradient descent for back-propagation. In [31], two low-level descriptors, SIFT and color histograms,
are extracted from the LAB color space over a set of 14 overlapping patches of size 32 × 32 pixels
with a stride of 16 pixels. Then, a dimensionality-reduction method (PCA) is applied to the scale-
invariant feature transform (SIFT) and color-histogram features to reduce the dimensionality of the
feature space. Those features are further embedded to produce linearly separable feature representations
using Fisher vector encoding: one Fisher vector is computed on the SIFT features and another one
on the color-histogram features, and finally the two Fisher vectors are concatenated into a single feature
vector. A hybrid network builds fully-connected layers on top of the Fisher vectors and employs
linear discriminant analysis (LDA) as an objective function in order to maximize the margin between
classes.
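A minimal sketch of this buffer-layer fusion idea is shown below, assuming pre-extracted deep and hand-crafted feature vectors; the layer dimensions are hypothetical and the exact architecture in [29] differs.

```python
import torch
import torch.nn as nn

class FusionReID(nn.Module):
    """Fuse CNN features with a hand-crafted descriptor (e.g., ELF)."""

    def __init__(self, deep_dim, hand_dim, fused_dim, num_identities):
        super().__init__()
        # Buffer layer bridging the gap between the two feature domains.
        self.buffer = nn.Linear(deep_dim + hand_dim, fused_dim)
        self.fc = nn.Linear(fused_dim, fused_dim)
        self.classifier = nn.Linear(fused_dim, num_identities)

    def forward(self, deep_feat, hand_feat):
        x = torch.cat([deep_feat, hand_feat], dim=1)
        x = torch.relu(self.buffer(x))
        fused = torch.relu(self.fc(x))        # final deep representation
        return fused, self.classifier(fused)  # logits for the softmax loss
```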
A structured graph Laplacian algorithm was utilized in a CNN-based model in [32]. Different from
traditional contrastive and triplet losses in terms of joint learning, the structured graph Laplacian
algorithm is additionally embedded at the top of the network. The authors, indeed, reformulate the
triplet network into a single feature-learning method, and further use the generated deep features for
joint learning on the training samples. The softmax function is used to maximize the inter-class
variations of different individuals, while the structured graph Laplacian algorithm is employed to
minimize the intra-class variations. As the authors point out, the designed network needs no additional
network branch, which makes the training process more efficient. Later on, the same authors proposed
a structured graph Laplacian embedding approach [33], where joint CNNs are leveraged by reformulating
structured Euclidean distance relationships into the graph Laplacian form. A triplet embedding method
was proposed to generate high-level features by taking into account inter-personal dispersion and intra-
personal compactness.
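The shared recipe in these works, a softmax term that maximizes inter-class separation plus a penalty that minimizes intra-class scatter, can be sketched with the simplified joint loss below; the compactness term is a stand-in for, not the exact form of, the structured graph Laplacian regularizer of [32, 33].

```python
import torch.nn.functional as F

def joint_loss(feats, logits, labels, lam=0.1):
    """Cross-entropy (inter-class separation) plus a simplified
    intra-class compactness penalty on the deep features."""
    ce = F.cross_entropy(logits, labels)
    classes = labels.unique()
    compact = feats.new_zeros(())
    for c in classes:
        class_feats = feats[labels == c]
        center = class_feats.mean(dim=0, keepdim=True)
        # Mean squared distance of each sample to its identity center.
        compact = compact + (class_feats - center).pow(2).sum(dim=1).mean()
    return ce + lam * compact / classes.numel()
```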
Part-based & Body-based features: Some works have attempted to generate more discrim-
inative features by extracting features from specific body parts as well as from the whole person's
body, and fusing them with the features produced by the deep learning model. In [34], a deep-
convolutional model was proposed to handle misalignments and pose variations of pedestrian images.
The overall multi-class person re-identification network is composed of two sub-networks: first, a
convolutional model is adopted to learn global features from the original images; then, a part-based
network is used to learn local features from six different parts of the pedestrian's body. Finally, both
sub-networks are combined in a fusion layer as the output of the network, with weight parameters
shared during training. The output of the network is further used as an image signature to evaluate
the performance of their person re-identification approach under the Euclidean distance. The proposed
deep architecture explicitly enables learning effective feature representations of the person's body
parts and adaptive similarity measurements. Li et al. [35] designed a multi-scale context-aware network
to learn powerful features over the whole body and different body parts, which can capture knowledge
of the local context by stacking convolutions of multiple scales in each layer. In addition, instead of
using predefined rigid parts, the proposed model learns and locates deformable pedestrian parts through
spatial transformer networks with novel spatial constraints. Because pose variations and background
clutter create difficulties for body-part-based representations, the learning of the full-body
representation is integrated with that of the body parts for multi-class identification. Chen et al. [36]
proposed a Deep Pyramidal Feature Learning (DPFL) CNN architecture for explicitly learning multi-scale
deep features from a single input image. In addition, a fusion branch over m scales was devised for
learning complementary combinations of multi-scale features.
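A rough sketch of such a two-branch, global-plus-parts design follows; splitting the feature map into horizontal stripes and fusing with a single linear layer are illustrative simplifications of the architectures in [34, 35, 36].

```python
import torch
import torch.nn as nn

class GlobalPartReID(nn.Module):
    """Two branches: a global feature plus per-part local features."""

    def __init__(self, backbone, channels, num_parts=6):
        super().__init__()
        self.backbone = backbone  # shared conv feature extractor
        self.global_pool = nn.AdaptiveAvgPool2d(1)
        # One pooled cell per horizontal body stripe (head, torso, ...).
        self.part_pool = nn.AdaptiveAvgPool2d((num_parts, 1))
        # `channels` is the channel count of the backbone's output map.
        self.fuse = nn.Linear(channels * (num_parts + 1), channels)

    def forward(self, x):
        fmap = self.backbone(x)                     # (B, C, H, W)
        g = self.global_pool(fmap).flatten(1)       # global branch
        p = self.part_pool(fmap).flatten(1)         # part branch
        return self.fuse(torch.cat([g, p], dim=1))  # fused signature
```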
Embedding Learning: Embedding- and attribute-learning approaches have also been considered
as complementary by some researchers, who proposed models that jointly learn additional mid-level
features obtained by joint learning of high- and low-level features. In [37], a matching strategy is
proposed to compute the similarity between the feature maps of an individual and a corresponding
text embedding. The method is learned by optimizing the global and local associations between local
visual and linguistic features, computing attention weights for each sample. The attention weights are
further used by a long short-term memory (LSTM) network to enrich the final prediction. The results
show that learning based on visual information can be more robust. Similarly, Chi et al. [38] proposed
a multi-task learning model that learns from embedded attributes. The attribute embedding is employed
as a low-rank attribute embedding integrated with low- and mid-level features to describe the person's
appearance. On the other hand, deep features are obtained by utilizing a DL framework as a high-level
feature extractor. All the features are then learned simultaneously by exploiting significant correlations
among tasks.
Attribute-based Learner: A joint DL network is proposed in [39], which consists of two
branches; in the first branch, the network aims to learn identity information from the person's
appearance under a triplet Siamese network (see Section 3.2.3 for more details), while an attribute-based
classifier is utilized in the second branch to learn a hierarchical loss-guided structure that extracts
meaningful features. The feature vectors obtained from both branches are then concatenated into a
single feature vector. Finally, the person images in the gallery set are ranked according to their feature
distances to the final representations. A method of attention-mask-based feature learning is proposed
in [40]; the authors proposed a CNN-based hybrid architecture that enables the network to focus on
the more discriminative parts of a person's image. It is a multi-task solution in which the model
predicts an attention mask from an input image and further imposes it on the low-level features in
order to re-weight local features in the feature space.
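The mask-based re-weighting can be sketched as below, assuming a single-channel spatial mask predicted by a small convolutional head; this is a simplified illustration rather than the exact architecture of [40].

```python
import torch.nn as nn

class AttentionMask(nn.Module):
    """Re-weight low-level feature maps with a predicted attention mask."""

    def __init__(self, channels):
        super().__init__()
        # 1x1 convolution predicting one spatial mask in [0, 1].
        self.mask_head = nn.Sequential(nn.Conv2d(channels, 1, 1), nn.Sigmoid())

    def forward(self, fmap):
        mask = self.mask_head(fmap)  # (B, 1, H, W)
        return fmap * mask           # emphasize discriminative regions
```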
Semi- and un-supervised learning: A few works have also used semi- and un-supervised
learning methods to predict a person's identity (i.e., the probability of the corresponding class
label for an individual). Li et al. [41] proposed a novel unsupervised learning method that attempts
to avoid manual labelling of data. The method jointly optimizes unlabelled person data within each
camera view and across camera views under an end-to-end classification strategy. It utilizes deep
features generated by a CNN model as the input of the unsupervised learning model. Wang et
al. [42] proposed a heterogeneous multi-task model using domain transfer learning and addressed
scalable unsupervised learning for the PReID problem. Two branches of CNNs were employed to
capture and learn identity and attributes from a person's image simultaneously. The outputs of both
branches are fused in another branch composed of a shallow NN for joint learning, and the information
from both branches is inferred into a single attribute space. The model showed promising results
when trained on a source data set and tested on an unlabeled target data set.
The approach in [43] addressed issues such as misalignment and occlusion in PReID. It aims to
extract features from different predefined body parts and considers them as pose features and
attention-aware features. Yu et al. [44] proposed a novel unsupervised loss function, in which the
model learns an asymmetric metric and further embeds it into an end-to-end deep feature-learning
network. Moreover, Huang et al. [45] addressed the issue of lack of training data by introducing
a multi-pseudo regularized label. The proposed method generates images with an adversarial ML
technique, and the corresponding class labels are estimated by semi-supervised learning on a small
training set. This could be one possible way of creating synthetic data to train recent, deeper NN
models.
Data Driven: To address the lack of training data samples, data-driven techniques have also
been considered for the task of PReID. Xiao et al. [46] proposed learning deep feature representations
from multiple data sets by using CNNs to discover effective neurons for each training set. They first
produced a strong baseline model that works on multiple data sets simultaneously by combining the
data and labels from several re-id data sets and training the CNN with a softmax loss. Next, for each
data set, they performed the forward pass on all its samples and computed, for each neuron, its average
impact on the objective function. Then, they replaced the standard dropout with a deterministic
'domain-guided dropout' to improve generalization by dropping certain neurons during training, and
continued to train the CNN model for several epochs. Some neurons are effective only for specific
datasets and might be useless for others due to dataset biases. For instance, i-LIDS is the
only dataset that contains pedestrians with luggage, so the neurons that capture luggage features
will be useless for recognizing people in another data set. As another way to overcome the lack of
training data samples, data augmentation techniques have been proposed. These techniques include
flipping, rotating, shearing, etc., applied to the original image. Beyond those generic techniques,
a novel data augmentation technique was proposed for PReID in [47], in which a camera-style model
was developed to generate training data samples via style transfer learning.
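A typical augmentation pipeline of this kind might look as follows, here written with torchvision; the specific transforms and parameters vary from paper to paper and are illustrative only.

```python
import torchvision.transforms as T

# Illustrative training-time augmentation for pedestrian crops.
train_transform = T.Compose([
    T.Resize((256, 128)),                  # common pedestrian crop size
    T.RandomHorizontalFlip(p=0.5),         # flipping
    T.RandomRotation(degrees=10),          # small rotations
    T.RandomAffine(degrees=0, shear=10),   # shearing
    T.ToTensor(),
])
```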
3.2 Multi-Stream Network Structure: Pairwise and Triplet Feature-Learning Methods
DL models for the PReID problem still suffer from the lack of training data samples, because some
PReID data sets provide only a few images per individual (e.g., the VIPeR dataset [4] contains only a
pair of images per person), which makes models prone to overfitting. Siamese networks have been
developed to address this issue [18].
Siamese network models have been widely employed in PReID due to the lack of training instances
in this research area. A Siamese neural network (SNN) is a type of NN architecture that contains two or
more identical sub-networks (identical meaning that the sub-networks share the same network
architecture, parameters, and weights – a.k.a. shared weight parameters). A Siamese network can be
employed as a pairwise model (when two sub-networks are included, e.g., [48, 3]) or a triplet model
(when three sub-networks are present [49, 50]). The output of a Siamese model is a similarity score,
which is computed at the top of the network. For instance, a model based on pairwise feature learning
takes two images as its input and outputs a similarity score between them. Employing such a Siamese
model can be an excellent solution for training on existing PReID data sets [51] when few training
samples are available. These models can be divided along the same lines as the single-stream models
discussed in Section 3.1 and shown in Figure 2.

Figure 4: Pairwise-loss feature-learning model.

The rest of this section is organized in three subsections. First, we give a brief explanation of the
similarity functions used in DL-based PReID methods; these are essential for computing the similarity
distance between the outputs of the two or three sub-networks given multiple input images during
training. In the second subsection, we describe the published DL-based work on pairwise methods,
followed by triplet methods in the third subsection. Both the pairwise and triplet methods follow the
single-stream feature-learning categorization.
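A minimal sketch of a pairwise Siamese model trained with a contrastive loss is given below; the encoder can be any single-stream network of the kind discussed in Section 3.1, and the margin value is an illustrative choice (concrete pairwise and triplet losses are discussed in the following subsections).

```python
import torch.nn as nn
import torch.nn.functional as F

class SiamesePReID(nn.Module):
    """Pairwise model: two images in, a distance between signatures out."""

    def __init__(self, encoder):
        super().__init__()
        self.encoder = encoder  # one sub-network; weights shared across inputs

    def forward(self, img_a, img_b):
        feat_a = self.encoder(img_a)
        feat_b = self.encoder(img_b)
        return F.pairwise_distance(feat_a, feat_b)

def contrastive_loss(dist, same_id, margin=1.0):
    """Pull matched pairs together; push mismatched pairs beyond the margin.

    same_id: float tensor of 1s (same person) and 0s (different persons).
    """
    pos = same_id * dist.pow(2)
    neg = (1.0 - same_id) * F.relu(margin - dist).pow(2)
    return (pos + neg).mean()
```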