Contents

1 Introduction
2 Person Re-identification Benchmark Data sets
3 Deep Neural Networks for PReID
3.1 Single Feature-Learning Based Methods
3.2 Multi-Stream Network Structure: Pairwise and Triplet Feature-Learning Methods
3.2.1 Similarity functions
3.2.2 Pairwise-loss methods
3.2.3 Triplet-loss methods
4 Results and Open Issues
5 Final Remarks
Survey on Reliable Deep Learning-Based Person Re-Identification Models: Are We There Yet?

Bahram Lavi*1, Ihsan Ullah2, Mehdi Fatan3, and Anderson Rocha1

1 Institute of Computing, University of Campinas (UNICAMP), Campinas, São Paulo, Brazil
2 Data Mining & Machine Learning Group, Discipline of IT, National University of Ireland Galway, Ireland
3 Department of Computer Engineering and Mathematics, University Rovira i Virgili, Tarragona, Spain

*Corresponding author: bahram.lavi@ic.unicamp.br

arXiv:2005.00355v1 [cs.CV] 30 Apr 2020

Abstract

Intelligent video-surveillance (IVS) is currently an active research field in computer vision and machine learning and provides useful tools for surveillance operators and forensic video investigators. Person re-identification (PReID) is one of the most critical problems in IVS; it consists of recognizing whether or not an individual has already been observed over a camera network. Solutions to PReID have myriad applications, including retrieval of video sequences showing an individual of interest and pedestrian tracking over multiple camera views. Different techniques have been proposed in the literature to increase the performance of PReID, and more recently researchers have utilized deep neural networks (DNNs), given their compelling performance on similar vision problems and fast execution at test time. Given the importance and wide range of applications of re-identification solutions, our objective herein is to discuss the work carried out in the area and come up with a survey of the state-of-the-art DNN models being used for this task. We present descriptions of each model along with their evaluation on a set of benchmark datasets. Finally, we show a detailed comparison among these models, followed by a discussion of their limitations that can serve as guidelines for future research.

1 Introduction

The importance of the security and safety of people in society at large is continuously growing. Governmental and private organizations are seriously concerned with the security of public areas such as airports and shopping malls, and providing security to the public requires significant effort and financial expense. To optimize such efforts, video-surveillance systems play a pivotal role. Nowadays, networks of video cameras are an increasingly useful tool for addressing various kinds of security issues such as forensic investigation, crime prevention, and safeguarding restricted areas. Continuous daily recording from camera networks results in daunting amounts of video for analysis; in a manual video-surveillance system, operators must examine all of it for specific incidents or anomalies, which is a challenging and tiresome task. Intelligent video-surveillance systems (IVSS) aim to automate the monitoring and analysis of videos from camera networks, helping surveillance operators handle and understand the acquired videos. This makes IVSS one of the most active and challenging research areas in computer engineering and computer science, in which computer-vision (CV) and machine-learning (ML) techniques play a key role. This field of research enables various tools, such as online applications for people/object detection and tracking and for recognizing suspicious actions/behavior across the camera network, and off-line applications that support operators and forensic investigators in retrieving images of an individual of interest from video frames acquired from different camera views.
Person re-identification is one of the problems of interest in IVSS. It consists of recognizing an individual over a network of video-surveillance cameras with possibly non-overlapping fields of view [1, 2, 3]. In general, the purpose of PReID is to support surveillance operators and forensic investigators in retrieving videos showing an individual of interest, given an image as a query (a.k.a. probe).
Figure 1: Standard person re-identification system. Given a probe image and a set of template images, the goal is to generate a robust image signature for each, compute the similarities between them, and present the result as a sorted, ranked list.

To this end, the video frames or tracks of all the individuals (a.k.a. the template gallery) recorded by the camera network are sorted in descending order of similarity to the probe, allowing the user to find occurrences (if any) of the individual of interest in the top positions. Person re-identification is a challenging task due to low image resolution, unconstrained poses, illumination changes, and occlusions, which hinder the use of robust biometric features such as the face; for this reason, cues such as gait and anthropometric measures have been used in some existing PReID systems. Most existing techniques rely on defining a specific descriptor of clothing (typically including color and texture) and a specific similarity measure between a pair of descriptors (evaluated as a matching score), which can be either manually defined or learned directly from data [1, 4, 5, 6, 7].

Standard PReID Methodology: Given an image of an individual (a.k.a. probe), a PReID system seeks the corresponding images of that person within the gallery of templates. Note that the construction of the template gallery depends on the re-identification setup, which can be categorized as: (i) single-shot, with only one template frame per individual, and (ii) multiple-shot, with more than one template frame per individual. In the latter case, a continuous PReID system can be employed in real time, whereby the individual of interest is continuously matched against the template images of the gallery set, using the currently seen frame as a probe. Figure 1 illustrates a basic PReID framework: after an image description is generated for the probe and for the template images of the gallery set, matching scores between the probe and each template are computed; finally, the ranked list is generated by sorting the matching scores in decreasing order.
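To make the matching stage concrete, the following minimal sketch ranks a gallery against a probe by decreasing matching score. It assumes features have already been extracted by some descriptor; the function and variable names are hypothetical, and cosine similarity is just one common choice of score.

```python
import numpy as np

def rank_gallery(probe_feat: np.ndarray, gallery_feats: np.ndarray):
    """Return gallery indices sorted by decreasing similarity to the probe."""
    # L2-normalize so that the dot product equals cosine similarity.
    probe = probe_feat / np.linalg.norm(probe_feat)
    gallery = gallery_feats / np.linalg.norm(gallery_feats, axis=1, keepdims=True)
    scores = gallery @ probe              # one matching score per template
    ranked = np.argsort(-scores)          # decreasing order of matching score
    return ranked, scores[ranked]
```

A rank-1 hit then means that the top entry of the ranked list shows the same identity as the probe.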
Many existing descriptors rely on hand-crafted features. Deep-learning (DL) models, e.g., convolutional neural networks (CNNs) [8, 9, 10], have been increasingly used to solve the PReID problem by learning from data. A CNN-based model generates a set of feature maps, whereby each pixel of a given image corresponds to a specific feature representation, and the desired output is produced at the top of the employed model. A deep neural network (DNN) can be trained in a supervised, semi-supervised, or unsupervised manner, depending on the problem scenario and the availability of labelled data. In PReID, only a small set of training data is usually available; thus, developing a learning model in a semi- or unsupervised manner is usually challenging, and the resulting model may fail or perform poorly. Most of the papers discussed in this survey rely on supervised learning techniques, and only a few consider semi- or unsupervised approaches. Further, we organize the models used for PReID into three categories of feature-learning strategies: single, pairwise, and triplet. Details are presented and discussed in Section 3.

This paper presents the state-of-the-art PReID techniques based on DNNs and provides detailed information about them. The literature review covers papers published from 2014 to date. We provide a taxonomy of deep feature-learning methods for PReID, including comparisons, limitations, and future research directions and opportunities on the horizon. Unlike [11, 12], we provide a comprehensive and detailed review of the existing techniques, particularly the more modern ones that rely upon DNN feature-learning strategies. We stress that, in this paper, we only consider recent DNN techniques that are directly involved in the PReID task. For each technique, we analyze its experimental results and compare the achieved performances from different perspectives, such as the learning strategy adopted to solve the problem (e.g., single, pairwise, or triplet learning).

The structure of this paper is organized as follows: Section 2 briefly explains the benchmark datasets employed for PReID. Section 3 describes DNN methods, highlighting important aspects such as objective functions, loss functions, and data augmentation, among others. Section 4 discusses performance measures, results and their comparison, limitations, and future directions. Finally, Section 5 concludes the paper with final remarks on PReID.

2 Person Re-identification Benchmark Data sets

Data is one of the most important factors for current DNN models, and several factors must be taken into account to reach a reliable recognition rate when evaluating person re-identification solutions. Each dataset is collected to specifically target one or more of these factors. Factors that create issues for the PReID task include occlusion (apparent in the i-LIDS dataset) and illumination variation (common in most datasets). Moreover, background/foreground segmentation to distinguish the person's body is a challenging task in itself; some datasets provide the segmented region of the person's body (e.g., VIPeR, ETHZ, and CAVIAR), while other datasets have been prepared to evaluate the re-identification task directly. The most widely used datasets are VIPeR, CUHK01, and CUHK03; VIPeR, CAVIAR, and PRID are used to evaluate person re-identification methods when only two fixed camera views are given. Table 1 gives a summary of each dataset. Below we briefly discuss each of them.

VIPeR [4]: VIPeR is a challenging dataset due to its small number of images per individual. It is made up of two images of each of 632 individuals from two camera views, and it contains pose and illumination variations. The images are cropped and scaled to 128 × 48 pixels. This is one of the most widely used datasets for PReID and a good starting point for new researchers in the field; enhancing rank-1 performance on this dataset is still an open challenge.

i-LIDS [13]: It contains 476 images of 119 pedestrians taken at an airport hall by non-overlapping cameras, with pose and lighting variations and strong occlusions. A minimum of two and an average of four images exist for each pedestrian.

ETHZ [14]: It contains three video sequences of a crowded street from two moving cameras; images exhibit considerable illumination changes, scale variations, and occlusions, and they come in different sizes.
Each of the three sequences provides multiple images per individual; sequences 1, 2, and 3 have 83, 35, and 28 pedestrians, respectively.

CAVIAR [15]: It contains 72 persons and two views, in which 50 persons appear in both views while 22 appear in only one. Each person has five images per view, with appearance variations due to resolution changes, lighting conditions, occlusions, and different poses.

CUHK: This dataset is divided into three distinct partitions with specific setups. CUHK01 [16] includes 1,942 images of 971 pedestrians, with two images per person captured in two disjoint camera views: camera A exhibits several variations of viewpoint and pose, while camera B mainly includes frontal and back views. CUHK02 [17] contains 1,816 individuals captured by five pairs of camera views (P1-P5, ten camera views in total) comprising 971, 306, 107, 193, and 239 individuals, respectively; each individual has two images in each camera view. This dataset is employed to evaluate performance when the camera views used at test time differ from those used in training.
Finally, CUHK03 [18] includes 13,164 images of 1,360 pedestrians, captured with six surveillance cameras. Each identity is observed by two disjoint camera views and has an average of 4.8 images in each view; all manually cropped pedestrian images exhibit illumination changes, misalignment, occlusions, and missing body parts.

PRID [19]: This dataset is specially designed for PReID, focusing on a single-shot scenario. It contains two image sets of 385 and 749 persons captured by camera A and camera B, respectively; the two subsets share 200 persons in common.

WARD [20]: This dataset has 4,786 images of 70 persons acquired in a real surveillance scenario with three non-overlapping cameras, with large illumination, resolution, and pose changes.

Re-identification Across indoor-outdoor Dataset (RAiD) [21]: It comprises 6,920 bounding boxes of 43 identities captured by four cameras, the first two indoors and the remaining two outdoors. Images show considerable illumination variations because of the indoor/outdoor changes.

Market-1501 [22]: A total of six cameras are used, including five high-resolution cameras and one low-resolution camera, with overlap among different cameras. Overall, this dataset contains 32,668 annotated bounding boxes of 1,501 identities. Among them, 12,936 images of 751 identities are used for training, and 19,732 images of 750 identities plus distractors form the gallery set.

MARS [23]: This dataset comprises 1,261 identities, each captured by at least two cameras. It consists of 20,478 tracklets and 1,191,003 bounding boxes.

DukeMTMC [24]: This dataset contains 36,441 manually cropped images of 1,812 persons captured by eight outdoor cameras. It also gives access to additional information such as full frames, frame-level ground truth, and calibration details.

MSMT [25]: It consists of 126,441 images of 4,101 individuals acquired from 12 indoor and three outdoor cameras, with different illumination changes, poses, and scale variations.

RPIfield [26]: This dataset is constructed from 12 synchronized cameras recording 112 explicitly time-stamped actor pedestrians walking along specific paths among about 4,000 distractor pedestrians.

Indoor Train Station Dataset (ITSD) [27]: This dataset contains images of people captured by a real-world surveillance camera at a railway station. It contains 5,607 images of 443 identities with different viewpoints, at an image size of 64 × 128 pixels.

Year   Dataset       Crop image size
2007   VIPeR         128 × 48
2007   ETHZ          vary
2011   PRID          128 × 64
2011   CAVIAR        vary
2012   WARD          128 × 48
2012   CUHK01        160 × 60
2013   CUHK02        160 × 60
2014   CUHK03        vary
2014   i-LIDS        vary
2014   RAiD          128 × 64
2015   Market-1501   128 × 64
2016   MARS          256 × 128
2017   DukeMTMC      vary
2018   MSMT          vary
2018   RPIfield      vary
2019   ITSD          64 × 128

Table 1: Summary of benchmark PReID datasets; the variation factors each dataset exhibits (multiple images, multiple cameras, illumination, pose, partial occlusions, and scale) are described in the text above.

3 Deep Neural Networks for PReID

Deep-learning techniques have been widely applied to several CV problems, owing to the discriminative and generalization power of the learned models, which yields promising performance. PReID is one of the challenging tasks in CV for which DL models are currently among the best choices in the research community.
In the following sections, we provide an overview of recent DL works for the task of PReID. Several interesting DL models have been proposed to improve PReID performance. These state-of-the-art DL approaches can be categorized by the learning methodology of the models utilized in PReID systems: some works consider PReID a standard classification problem, while others address the lack of training samples in PReID by proposing models that learn more discriminative features from pairs or triplets of images. Figure 2 shows the taxonomy of the types of models being used for PReID that will be discussed in the coming subsections of this paper.

Figure 2: Taxonomy of deep feature-learning methods for PReID.

3.1 Single Feature-Learning Based Methods

A model based on a single feature-learning model, or single deep model, can be developed similarly to other multi-class classification problems: in a PReID system, a classification model is designed to determine the probability of the identity an individual belongs to [28]. Figure 3 shows an example of a DL-based single feature-learning PReID model. This single-stream deep model can be further divided into the categories shown in Figure 2.

Figure 3: Single feature-learning model in a PReID system: the model takes the raw image of an individual as input and computes the probability of the corresponding class of the individual.
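As a minimal sketch of this classification setup, the following PyTorch snippet produces identity logits on top of a backbone; the ResNet-50 backbone and the layer sizes are illustrative assumptions, not a specific published model.

```python
import torch.nn as nn
from torchvision import models

class IdentityClassifier(nn.Module):
    def __init__(self, num_identities: int, feat_dim: int = 512):
        super().__init__()
        backbone = models.resnet50(weights=None)
        backbone.fc = nn.Identity()             # keep the 2048-d pooled feature
        self.backbone = backbone
        self.embed = nn.Linear(2048, feat_dim)  # image signature used at test time
        self.classifier = nn.Linear(feat_dim, num_identities)

    def forward(self, x):
        feat = self.embed(self.backbone(x))
        logits = self.classifier(feat)          # one logit per known identity
        return feat, logits
```

Training minimizes a softmax cross-entropy loss (nn.CrossEntropyLoss) on the logits; at test time the classifier head is discarded and the signature feat is compared with a distance or similarity measure, as in Figure 1.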
Deep model features fusion with hand-crafted features: A number of papers boost the performance of PReID by generating deep features, and some of them additionally involve hand-crafted features as complementary features to be fused with the DL features. These features are further reduced by traditional dimensionality-reduction techniques, e.g., principal component analysis (PCA). Wu et al. [29] proposed a feature-fusion DNN that regularizes CNN features jointly with hand-crafted features. The network takes a single image of 224 × 224 × 3 pixels as input, together with hand-crafted features extracted by a state-of-the-art PReID descriptor (the best performance was obtained with the ensemble of local features (ELF) descriptor [30]). Both kinds of features are then passed through a buffer layer and a fully-connected layer, which together act as a fusion layer. The buffer layer is essential for the fusion, since it bridges the gap between features from two different domains (hand-crafted features and deep features). A softmax loss layer then takes the output vector of the fully-connected layer to minimize the cross-entropy loss and outputs the deep feature representation. The whole network is trained with mini-batch stochastic gradient descent and back-propagation.

In [31], two low-level descriptors, SIFT and color histograms, are extracted from the LAB color space over a set of 14 overlapping patches of 32 × 32 pixels with a stride of 16 pixels. A dimensionality-reduction method such as PCA is then applied to the scale-invariant feature transform (SIFT) and color-histogram features to reduce the dimensionality of the feature space. These features are further embedded into linearly separable feature representations using Fisher-vector encoding: one Fisher vector is computed on the SIFT features and another on the color-histogram features, and the two Fisher vectors are finally concatenated into a single feature vector. A hybrid network builds fully-connected layers on top of the Fisher vectors and employs linear discriminant analysis (LDA) as an objective function in order to maximize the margin between classes.

A structured graph Laplacian algorithm was utilized in a CNN-based model in [32]. Differently from traditional contrastive and triplet losses for joint learning, the structured graph Laplacian term is embedded at the top of the network. The authors reformulate the triplet network as a single feature-learning method and use the generated deep features for joint learning on the training samples. The softmax function is used to maximize the inter-class variations between different individuals, while the structured graph Laplacian algorithm minimizes the intra-class variations. As the authors point out, the designed network needs no additional network branch, which makes training more efficient. Later, the same authors proposed a structured graph Laplacian embedding approach [33], where joint CNNs are leveraged by reformulating structured Euclidean distance relationships into the graph Laplacian form; a triplet embedding method generates high-level features by taking into account inter-personal dispersion and intra-personal compactness.
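Returning to the fusion scheme of Wu et al. [29], the snippet below sketches the general idea of bridging deep and hand-crafted features through a buffer layer before a fully-connected fusion layer and a softmax head. All dimensions are illustrative assumptions, and the exact layer arrangement of the published network may differ.

```python
import torch
import torch.nn as nn

class FusionNet(nn.Module):
    def __init__(self, deep_dim=4096, elf_dim=8000, buf_dim=4096, num_ids=632):
        super().__init__()
        # Buffer layer: bridges the gap between the two feature domains.
        self.buffer = nn.Linear(deep_dim + elf_dim, buf_dim)
        # Fully-connected fusion layer followed by a softmax (cross-entropy) head.
        self.fusion = nn.Linear(buf_dim, buf_dim)
        self.classifier = nn.Linear(buf_dim, num_ids)

    def forward(self, deep_feat, elf_feat):
        x = torch.cat([deep_feat, elf_feat], dim=1)  # deep + hand-crafted (ELF)
        x = torch.relu(self.buffer(x))
        fused = torch.relu(self.fusion(x))           # deep feature representation
        return fused, self.classifier(fused)
```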
Part-based & Body-based features: Some works generate more discriminative features by extracting features from specific body parts as well as from the whole body, and fusing them with the features produced by the deep model. In [34], a deep-convolutional model was proposed to handle misalignments and pose variations in pedestrian images. The overall multi-class person re-identification network is composed of two sub-networks: first, a convolutional model learns global features from the original images; then, a part-based network learns local features from six different parts of the pedestrian body. Finally, both sub-networks are combined in a fusion layer at the output of the network, with weight parameters shared during training. The output of the network is further used as an image signature, and the re-identification performance is evaluated with the Euclidean distance. The proposed deep architecture explicitly enables learning effective feature representations of the person's body parts along with adaptive similarity measurements. Li et al. [35] designed a multi-scale context-aware network to learn powerful features over the whole body and different body parts, capturing local context by stacking convolutions of multiple scales in each layer. In addition, instead of using predefined rigid parts, the proposed model learns and locates deformable pedestrian parts through spatial transformer networks with novel spatial constraints. Because pose variations and background clutter make purely part-based representations difficult, the learning of the full-body representation is integrated with the body parts for multi-class identification. Chen et al. [36] proposed a Deep Pyramidal Feature Learning (DPFL) CNN architecture for explicitly learning multi-scale deep features from a single input image, with a fusion branch over m scales devised to learn a complementary combination of the multi-scale features.
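The two-sub-network design described for [34] can be sketched roughly as a shared convolutional trunk followed by a global branch and a part branch; the six horizontal stripes, the pooling scheme, and the dimensions here are assumptions for illustration.

```python
import torch
import torch.nn as nn

class GlobalPartNet(nn.Module):
    def __init__(self, trunk: nn.Module, feat_ch=2048, n_parts=6, out_dim=256):
        super().__init__()
        self.trunk = trunk                                   # shared conv layers
        self.global_pool = nn.AdaptiveAvgPool2d(1)           # whole-body feature
        self.part_pool = nn.AdaptiveAvgPool2d((n_parts, 1))  # one cell per body part
        self.global_fc = nn.Linear(feat_ch, out_dim)
        self.part_fc = nn.Linear(feat_ch * n_parts, out_dim)
        self.fuse = nn.Linear(2 * out_dim, out_dim)          # fusion layer

    def forward(self, x):
        fmap = self.trunk(x)                                 # B x C x H x W
        g = self.global_fc(self.global_pool(fmap).flatten(1))
        p = self.part_fc(self.part_pool(fmap).flatten(1))
        return self.fuse(torch.cat([g, p], dim=1))           # fused image signature
```

The fused signature can then be compared with the Euclidean distance, as in the evaluation described above.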
Embedding Learning: Embedding- and attribute-learning approaches have also been considered as sources of complementary features, with models designed to jointly learn additional mid-level features from high- and low-level features. In [37], a matching strategy is proposed to compute the similarity between the feature maps of an individual and the corresponding embedded text. The method is trained by optimizing the global and local association between local visual and linguistic features, computing attention weights for each sample; the attention weights are further used by a long short-term memory (LSTM) network to enrich the final prediction. The work shows that learning grounded in visual information can be more robust. Similarly, Chi et al. [38] proposed a multi-task learning model that learns from embedded attributes: a low-rank attribute embedding is integrated with low- and mid-level features to describe the person's appearance, while deep features are obtained from a DL framework acting as a high-level feature extractor. All features are then learned simultaneously by exploiting significant correlations among tasks.

Attribute-based Learner: A joint DL network is proposed in [39], which consists of two branches: the first learns identity information from the person's appearance with a triplet Siamese network (see Section 3.2.3 for details), while the second performs attribute-based classification with a hierarchical loss-guided structure to extract meaningful features. The feature vectors of both branches are then concatenated into a single feature vector, and the person images in the gallery set are ranked according to their feature distances to this final representation. A method for attention-mask-based feature learning is proposed in [40]: the authors present a CNN-based hybrid architecture that lets the network focus on the more discriminative parts of a person's image. It is a multi-task solution in which the model predicts an attention mask from the input image and imposes it on the low-level features in order to re-weight local features in the feature space.
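The mask-based re-weighting described for [40] can be illustrated with a minimal sketch: a predicted mask multiplicatively re-weights low-level feature maps. The single 1 × 1 convolution used as the mask head here is an illustrative simplification, not the published architecture.

```python
import torch
import torch.nn as nn

class AttentionReweight(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        self.mask_head = nn.Conv2d(channels, 1, kernel_size=1)

    def forward(self, low_level_feat):
        # Predict a spatial attention mask in [0, 1] from the features...
        mask = torch.sigmoid(self.mask_head(low_level_feat))  # B x 1 x H x W
        # ...and impose it on the low-level features to re-weight local features.
        return low_level_feat * mask
```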
Semi- and un-supervised learning: A few works employ semi- or unsupervised learning methods to predict a person's identity (i.e., the probability of the corresponding class label). Li et al. [41] proposed a novel unsupervised learning method that attempts to remove the need for manual data labelling: it jointly optimizes unlabelled person data within each camera view and across camera views in an end-to-end classification setting, using deep features generated by a CNN model as the input of the unsupervised learning model. Wang et al. [42] proposed a heterogeneous multi-task model based on domain-transfer learning and addressed scalable unsupervised learning for the PReID problem. Two CNN branches capture and learn identity and attributes from a person's image simultaneously; their outputs are fused by a third branch, composed of a shallow NN, for joint learning, and the information from both branches is inferred into a single attribute space. The approach showed promising results when the model was trained on a source dataset and tested on an unlabeled target dataset. The approach in [43] addresses issues such as misalignment and occlusion in PReID by extracting features from different predefined body parts and treating them as pose features and attention-aware features. Yu et al. [44] proposed a novel unsupervised loss function with which the model learns an asymmetric metric and embeds it into an end-to-end deep feature-learning network. Moreover, Huang et al. [45] addressed the lack of training data by introducing a multi-pseudo regularized label: images are generated with adversarial ML techniques, and the corresponding class labels are estimated by semi-supervised learning on a small training set. This is one possible way of creating synthetic data to train recent, deeper NN models.
Data Driven: To address the lack of training samples, data-driven techniques have also been considered for the task of PReID. Xiao et al. [46] proposed learning deep feature representations from multiple datasets with CNNs, discovering effective neurons for each training set. They first produce a strong baseline model that works on multiple datasets simultaneously by combining the data and labels from several re-id datasets and training the CNN with a softmax loss. Next, for each dataset, they perform a forward pass on all of its samples and compute, for each neuron, its average impact on the objective function. They then replace standard dropout with a deterministic 'domain-guided dropout' that drops certain neurons during training to improve generalization, and continue training the CNN model for several epochs. Some neurons are effective only for specific datasets and may be useless for others because of dataset biases; for instance, i-LIDS is the only dataset that contains pedestrians with luggage, so neurons that capture luggage features will be useless for recognizing people in other datasets. Data augmentation is another family of techniques proposed to overcome the lack of training samples; it includes methods such as flipping, rotation, and shearing applied to the original images. Beyond these generic techniques, [47] proposed a novel data augmentation technique for PReID in which a camera-style model generates training samples via style-transfer learning.

3.2 Multi-Stream Network Structure: Pairwise and Triplet Feature-Learning Methods

DL models for PReID still suffer from the lack of training samples: some PReID datasets provide only a few images per individual (e.g., the VIPeR dataset [4], which contains only a pair of images per person), which can make a model overfit and fail at evaluation time. Siamese networks have been developed to address this [18] and have been widely employed in PReID because of the scarcity of training instances in this research area. A Siamese neural network (SNN) is a type of NN architecture that contains two or more identical sub-networks (identical meaning that the sub-networks share the same architecture, parameters, and weights, a.k.a. shared weight parameters). A Siamese network can be employed as a pairwise model (with two sub-networks, e.g., [48, 3]) or as a triplet model (with three sub-networks [49, 50]). The output of a Siamese model, produced at the top of the network, is a similarity score; for instance, a pairwise feature-learning model takes two images as input and outputs a similarity score between them. Employing such a Siamese model can be an excellent solution for training on existing PReID datasets [51] when few training samples are available. These models can be divided in the same way as the single-stream models discussed in Section 3.1 and shown in Figure 2.

Figure 4: Pairwise-loss feature-learning model.

The rest of this section is organized in three subsections. First, we briefly explain the similarity functions used in DL-based PReID methods; these are essential for computing the similarity between the outputs of the two or three streams given the input images during training. In the second subsection we describe the published DL-based work on pairwise methods, followed by triplet methods in the third subsection; both the pairwise and the triplet methods follow the single-stream feature-learning approaches.
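As an illustration of the pairwise setting in Figure 4, the sketch below shows a two-stream Siamese model with shared weights and a contrastive-style pairwise loss; the embedding network and the margin value are assumptions rather than a specific published configuration.

```python
import torch.nn as nn
import torch.nn.functional as F

class SiamesePair(nn.Module):
    """Two identical streams realized as one network applied twice (shared weights)."""
    def __init__(self, embed_net: nn.Module):
        super().__init__()
        self.embed = embed_net

    def forward(self, img_a, img_b):
        return self.embed(img_a), self.embed(img_b)

def contrastive_loss(fa, fb, same_id, margin: float = 1.0):
    """same_id = 1 for a matching pair of images, 0 for a non-matching pair."""
    d = F.pairwise_distance(fa, fb)
    pos = same_id * d.pow(2)                         # pull matching pairs together
    neg = (1 - same_id) * F.relu(margin - d).pow(2)  # push non-matching pairs apart
    return (pos + neg).mean()
```

A triplet model extends the same idea to three streams, with a loss that keeps the anchor-positive distance smaller than the anchor-negative distance.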