IEEE International Conference on Computer Vision (ICCV) 2019

EGNet: Edge Guidance Network for Salient Object Detection

Jia-Xing Zhao, Jiang-Jiang Liu, Deng-Ping Fan, Yang Cao, Ju-Feng Yang, Ming-Ming Cheng*
TKLNDST, CS, Nankai University
http://mmcheng.net/egnet/
*M.-M. Cheng (cmm@nankai.edu.cn) is the corresponding author.

Abstract

Fully convolutional neural networks (FCNs) have shown their advantages in the salient object detection task. However, most existing FCN-based methods still suffer from coarse object boundaries. In this paper, to solve this problem, we focus on the complementarity between salient edge information and salient object information. Accordingly, we present an edge guidance network (EGNet) for salient object detection with three steps to simultaneously model these two kinds of complementary information in a single network. In the first step, we extract the salient object features in a progressive fusion manner. In the second step, we integrate the local edge information and the global location information to obtain the salient edge features. Finally, to sufficiently leverage these complementary features, we couple the same salient edge features with the salient object features at various resolutions. Benefiting from the rich edge and location information in the salient edge features, the fused features help locate salient objects, especially their boundaries, more accurately. Experimental results demonstrate that the proposed method performs favorably against the state-of-the-art methods on six widely used datasets without any pre-processing or post-processing. The source code is available at http://mmcheng.net/egnet/.

1. Introduction

The goal of salient object detection (SOD) is to find the most visually distinctive objects in an image. It has received widespread attention recently and has been widely used in many vision and image processing related areas, such as content-aware image editing [6], object recognition [42], photosynth [4], non-photorealistic rendering [41], weakly supervised semantic segmentation [19] and image retrieval [15]. Besides, there are many works focusing on video salient object detection [12, 54] and RGB-D salient object detection [11, 66].

Figure 1. Visual examples of our method (columns: source, baseline, ours, GT). After we model and fuse the salient edge information, the salient object boundaries become clearer.

Inspired by cognitive studies of visual attention [7, 21, 39], early works are mainly based on the fact that contrast plays the most important role in saliency detection. These methods benefit mostly from either global or local contrast cues and their learned fusion weights. Unfortunately, although these hand-crafted features can sometimes locate the most salient objects, the produced saliency maps have irregular shapes because of the unreliable underlying segmentation, and they become undependable when the contrast between the foreground and the background is inadequate.

Recently, convolutional neural networks (CNNs) [25] have successfully broken the limits of traditional hand-crafted features, especially after the emergence of fully convolutional networks (FCNs) [34]. These CNN-based methods have greatly refreshed the leaderboards on almost all the widely used benchmarks and are gradually replacing conventional salient object detection methods because of their efficiency and high performance.
Among the CNN-based SOD approaches, most methods that take image patches [64, 65] as input use multi-scale or multi-context information to obtain the final saliency map. Since the fully convolutional network was proposed for pixel labeling problems, several end-to-end deep architectures [17, 18, 23, 28, 31, 50, 60, 67] for salient object detection have appeared, and the basic unit of the output saliency map changes from an image region to a single pixel. On the one hand, the results highlight details because each pixel has its own saliency value. On the other hand, however, the structural information that is important for SOD is ignored.
With the increase of the network receptive field, the positioning of salient objects becomes more and more accurate. However, at the same time, spatial coherence is also ignored. Recently, to obtain fine edge details, some U-Net [40] based SOD works [32, 33, 59, 61] used a bi-directional or recursive way to refine the high-level features with local information. However, the boundaries of salient objects are still not explicitly modeled, and the complementarity between salient edge information and salient object information has not been noticed. Besides, some methods use pre-processing (superpixels) [20] or post-processing (CRF) [17, 28, 33] to preserve the object boundaries. The main inconvenience of these approaches is their low inference speed.

In this paper, we focus on the complementarity between salient edge information and salient object information. We aim to leverage the salient edge features to help the salient object features locate objects, especially their boundaries, more accurately. In summary, this paper makes three major contributions:

• We propose an EGNet to explicitly model complementary salient object information and salient edge information within the network to preserve the salient object boundaries. At the same time, the salient edge features are also helpful for localization.

• Our model jointly optimizes these two complementary tasks by allowing them to mutually help each other, which significantly improves the predicted saliency maps.

• We compare the proposed method with 15 state-of-the-art approaches on six widely used datasets. Without bells and whistles, our method achieves the best performance under three evaluation metrics.

2. Related Works

Over the past years, many methods have been proposed to detect salient objects in an image. Early methods predicted the saliency map in a bottom-up manner using hand-crafted features, such as contrast [5], boundary background [57, 68], center prior [24, 44] and others [22, 44, 51]. More details are introduced in [1, 2, 9].

Recently, convolutional neural networks (CNNs) have shown their advantages and refreshed the state-of-the-art records in many fields of computer vision. Li et al. [27] resized image regions to three different scales to extract multi-scale features and then aggregated the resulting saliency maps to obtain the final prediction map. Wang et al. [45] designed a neural network to extract a local estimation for the input patches and integrated these features with global contrast and geometric information to describe the image patches. However, in these methods the result is limited by the performance of the image patches. In [34], Long et al. first proposed a fully convolutional network (FCN) to predict the semantic label of each pixel. Inspired by FCN, more and more pixel-wise saliency detection methods were proposed. Wang et al. [47] proposed a recurrent FCN architecture for salient object detection. Hou et al. [17, 18] proposed short connections based on HED [55] to integrate low-level and high-level features and solve the scale-space problem. In [62], Zhang et al. introduced a reformulated dropout and an effective hybrid upsampling to learn deep uncertain convolutional features that encourage robustness and accuracy. In [61], Zhang et al. explicitly aggregated multi-level features into multiple resolutions and then combined these feature maps by a bidirectional aggregation method.
Zhang et al. [59] proposed a bi-directional message-passing model to integrate multi-level features for salient object detection. Wang et al. [53] leveraged fixation maps to help the model locate salient objects more accurately. In [35], Luo et al. proposed a U-Net based architecture which contains an IoU edge loss to leverage edge cues for detecting salient objects. In other saliency-related tasks, some methods using edge cues have appeared. In [26], Li et al. generated object contours to obtain salient instance segmentation results. In [29], Li et al. leveraged well-trained contour detection models to generate saliency masks and overcome the limitations caused by manual annotations.

Compared with most U-Net based SOD methods [32, 33, 59, 61], we explicitly model edge information within the network to leverage the edge cues. Compared with the methods which use edge cues [14, 58, 69], the major differences are that we use a single base network and jointly optimize salient edge detection and salient object detection, allowing them to help each other mutually, which results in better performance. In NLDF [35], a loss function inspired by the Mumford-Shah functional [38] is implemented to penalize errors on the edges. Since the salient edges are derived from the salient objects through a fixed Sobel operator, this penalty essentially only affects the gradients in the neighborhood of salient edges on the feature maps. In this way, the edge details are optimized to some extent, but the complementarity between salient edge detection and salient object detection is not sufficiently utilized. In our method, we design two modules to extract these two kinds of features independently and then fuse these complementary features by a one-to-one guidance module. In this way, the salient edge information can not only improve the quality of edges but also make the localization more accurate. The experimental section verifies this statement.

3. Salient Edge Guidance Network

The overall architecture is shown in Fig. 2. In this section, we begin by describing the motivation in Sec. 3.1, then introduce the adopted salient object feature extraction module and the proposed non-local salient edge feature extraction module in Sec. 3.2, and finally introduce the proposed one-to-one guidance module in Sec. 3.3.
Figure 2. The pipeline of the proposed approach. Brown thick lines represent information flows between the scales. PSFEM: progressive salient object feature extraction module. NLSEM: non-local salient edge feature extraction module. O2OGM: one-to-one guidance module. FF: feature fusion. Spv.: supervision.

3.1. Motivation

Pixel-wise salient object detection methods have shown their advantages over region-based methods. However, they ignore the spatial coherence in the images, resulting in unsatisfactory salient object boundaries. Most methods [17, 18, 31, 33, 59, 61] hope to solve this problem by fusing multi-scale information. Some methods [17, 28, 33] use post-processing such as CRF to refine the salient object boundaries. In NLDF [35], an IoU loss is proposed to affect the gradients of the locations around the edges. None of them pays attention to the complementarity between salient edge detection and salient object detection. A good salient edge detection result can help the salient object detection task in both segmentation and localization, and vice versa. Based on this idea, we propose an EGNet to model and fuse the complementary salient edge information and salient object information within a single network in an end-to-end manner.

3.2. Complementary information modeling

Our proposed network is independent of the backbone network. Here we use the VGG network, adopted by other deep learning based methods [17, 35], to describe the proposed method. First, we truncate the last three fully connected layers. Following DSS [17, 18], we connect another side path to the last pooling layer in VGG. Thus from the backbone network we obtain six side features Conv1-2, Conv2-2, Conv3-3, Conv4-3, Conv5-3, Conv6-3. Because Conv1-2 is too close to the input and its receptive field is too small, we discard this side path $S^{(1)}$. Five side paths $S^{(2)}$, $S^{(3)}$, $S^{(4)}$, $S^{(5)}$, $S^{(6)}$ remain in our method. For simplicity, these five features are denoted by a backbone feature set $C$:

$C = \{C^{(2)}, C^{(3)}, C^{(4)}, C^{(5)}, C^{(6)}\}$,   (1)

where $C^{(2)}$ denotes the Conv2-2 features and so on. Conv2-2 preserves better edge information [61]. Thus we leverage $S^{(2)}$ to extract the edge features and the other side paths to extract the salient object features.

3.2.1 Progressive salient object features extraction

As shown in the PSFEM of Fig. 2, to obtain richer context features, we leverage the widely used U-Net [40] architecture to generate the multi-resolution features. Different from the original U-Net, in order to obtain more robust salient object features, we add three convolutional layers (Conv in Fig. 2) on each side path, and after each convolutional layer a ReLU layer is added to ensure nonlinearity. For simplicity, we use T (Tab. 1) to denote these convolutional and ReLU layers. Besides, deep supervision is used on each side path: we adopt a convolutional layer to convert the feature maps into a single-channel prediction mask and use D (Tab. 1) to denote it. The details of the convolutional layers can be found in Tab. 1.

S    T1           T2           T3           D
2    3, 1, 128    3, 1, 128    3, 1, 128    3, 1, 1
3    3, 1, 256    3, 1, 256    3, 1, 256    3, 1, 1
4    5, 2, 512    5, 2, 512    5, 2, 512    3, 1, 1
5    5, 2, 512    5, 2, 512    5, 2, 512    3, 1, 1
6    7, 3, 512    7, 3, 512    7, 3, 512    3, 1, 1

Table 1. Details of each side output. T denotes the feature enhancement module (Conv in Fig. 2); each T contains three convolutional layers T1, T2, T3, each followed by a ReLU layer. Each entry gives the kernel size, padding and channel number of a convolutional layer; for example, 3, 1, 128 denotes a convolutional layer whose kernel size is 3, padding is 1 and channel number is 128. D denotes the transition layer which converts the multi-channel feature map into a one-channel activation map. S denotes the side path.
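To make the side-path configuration concrete, the following PyTorch-style sketch builds one feature-enhancement block T (three Conv+ReLU layers) and its transition layer D according to the settings of Tab. 1. This is not the authors' released code; the module and variable names, and the channel numbers taken from the VGG side outputs, are our own assumptions.

```python
import torch
import torch.nn as nn

class SidePathBlock(nn.Module):
    """Feature enhancement T (three Conv+ReLU) plus transition D (Tab. 1).

    A sketch of one PSFEM side path; names are ours, not the authors'.
    """
    def __init__(self, in_ch, mid_ch, kernel):
        super().__init__()
        pad = kernel // 2  # padding in Tab. 1 keeps the spatial size unchanged
        layers = []
        for i in range(3):  # T1, T2, T3
            layers += [nn.Conv2d(in_ch if i == 0 else mid_ch, mid_ch,
                                 kernel_size=kernel, padding=pad),
                       nn.ReLU(inplace=True)]
        self.enhance = nn.Sequential(*layers)              # T
        self.predict = nn.Conv2d(mid_ch, 1, 3, padding=1)  # D: 1-channel map

    def forward(self, c):
        f = self.enhance(c)          # enhanced features F_hat^(i)
        return f, self.predict(f)    # features + side prediction for deep supervision

# Side paths S(2)..S(6); kernel/channel settings follow Tab. 1,
# input channels follow the VGG side outputs (assumed).
side_cfg = {2: (128, 128, 3), 3: (256, 256, 3),
            4: (512, 512, 5), 5: (512, 512, 5), 6: (512, 512, 7)}
blocks = nn.ModuleDict({str(s): SidePathBlock(*cfg) for s, cfg in side_cfg.items()})
```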
3.2.2 Non-local salient edge features extraction

In this module, we aim to model the salient edge information and extract the salient edge features. As mentioned above, Conv2-2 preserves better edge information, hence we extract local edge information from Conv2-2. However, local information alone is not enough to obtain salient edge features; high-level semantic information or location information is also needed.
When information is progressively returned from the top level to the low level, as in the U-Net architecture, the high-level location information is gradually diluted. Besides, the receptive field of the top level is the largest and its location information is the most accurate. Thus we design a top-down location propagation to propagate the top-level location information to side path $S^{(2)}$ and restrain the non-salient edges. The fused features $\bar{C}^{(2)}$ can be denoted as:

$\bar{C}^{(2)} = C^{(2)} + \mathrm{Up}(\phi(\mathrm{Trans}(\hat{F}^{(6)}; \theta)); C^{(2)})$,   (2)

where $\mathrm{Trans}(*; \theta)$ is a convolutional layer with parameter $\theta$ that changes the number of channels of the feature, and $\phi(\cdot)$ denotes a ReLU activation function. $\mathrm{Up}(*; C^{(2)})$ is a bilinear interpolation operation which up-samples $*$ to the same size as $C^{(2)}$. The second term on the right of the equation denotes the features coming from the higher side path. For clarity, we use $\mathrm{UpT}(\hat{F}^{(i)}; \theta, C^{(j)})$ to represent $\mathrm{Up}(\phi(\mathrm{Trans}(\hat{F}^{(i)}; \theta)); C^{(j)})$. $\hat{F}^{(6)}$ denotes the enhanced features in side path $S^{(6)}$ and can be represented as $f(C^{(6)}; W_T^{(6)})$, and the enhanced features in $S^{(3)}$, $S^{(4)}$, $S^{(5)}$ are computed as:

$\hat{F}^{(i)} = f(C^{(i)} + \mathrm{UpT}(\hat{F}^{(i+1)}; \theta, C^{(i)}); W_T^{(i)})$,   (3)

where $W_T^{(i)}$ denotes the parameters in $T^{(i)}$ and $f(*; W_T^{(i)})$ denotes a series of convolutional and non-linear operations with parameters $W_T^{(i)}$.

After obtaining the guided features $\bar{C}^{(2)}$, similar to the other side paths, we add a series of convolutional layers to enhance them; the final salient edge features $F_E$ in $S^{(2)}$ are computed as $f(\bar{C}^{(2)}; W_T^{(2)})$. The configuration details can be found in Tab. 1. To model the salient edge features explicitly, we add an extra salient edge supervision on them. We use the cross-entropy loss, which can be defined as:

$L^{(2)}(F_E; W_D^{(2)}) = -\sum_{j \in Z_+} \log \Pr(y_j = 1 \mid F_E; W_D^{(2)}) - \sum_{j \in Z_-} \log \Pr(y_j = 0 \mid F_E; W_D^{(2)})$,   (4)

where $Z_+$ and $Z_-$ denote the salient edge pixel set and the background pixel set, respectively. $W_D$ denotes the parameters of the transition layer, as shown in Tab. 1. $\Pr(y_j = 1 \mid F_E; W_D^{(2)})$ is the prediction map in which each value denotes the salient edge confidence of the corresponding pixel. In addition, the supervision added on the salient object detection side paths can be represented as:

$L^{(i)}(\hat{F}^{(i)}; W_D^{(i)}) = -\sum_{j \in Y_+} \log \Pr(y_j = 1 \mid \hat{F}^{(i)}; W_D^{(i)}) - \sum_{j \in Y_-} \log \Pr(y_j = 0 \mid \hat{F}^{(i)}; W_D^{(i)}), \quad i \in [3, 6]$,   (5)

where $Y_+$ and $Y_-$ denote the salient region pixel set and the non-salient pixel set, respectively. Thus the total loss $L$ in the complementary information modeling can be denoted as:

$L = L^{(2)}(F_E; W_D^{(2)}) + \sum_{i=3}^{6} L^{(i)}(\hat{F}^{(i)}; W_D^{(i)})$.   (6)
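As a minimal illustration of Eqs. (2)-(3), the sketch below (our own, not the authors' implementation) realizes UpT as a 1x1 convolution followed by ReLU and bilinear up-sampling; the channel numbers are assumptions based on the VGG side outputs.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def up_t(feat_high, trans, feat_ref):
    """UpT(F_hat^(i); theta, C^(j)): channel projection, ReLU, then bilinear
    up-sampling to the reference resolution (Eq. 2)."""
    x = F.relu(trans(feat_high))
    return F.interpolate(x, size=feat_ref.shape[2:], mode='bilinear',
                         align_corners=False)

# Hypothetical Trans(*; theta) for Eq. (2): 512 -> 128 channels.
trans_6_to_2 = nn.Conv2d(512, 128, kernel_size=1)

def fuse_edge_side(c2, f6_hat):
    """Eq. (2): inject top-level location information into side path S(2)."""
    return c2 + up_t(f6_hat, trans_6_to_2, c2)

def fuse_object_side(c_i, f_next_hat, trans_i):
    """Eq. (3), inner part: C^(i) + UpT(F_hat^(i+1); theta, C^(i));
    the enhancement f(*; W_T^(i)) (three Conv+ReLU layers) is applied afterwards."""
    return c_i + up_t(f_next_hat, trans_i, c_i)
```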
3.3. One-to-one guidance module

After obtaining the complementary salient edge features and salient object features, we aim to leverage the salient edge features to guide the salient object features to perform better in both segmentation and localization. A simple way is to fuse $F_E$ with $\hat{F}^{(3)}$ only, but it is better to sufficiently leverage the multi-resolution salient object features. However, the disadvantage of fusing the salient edge features and the multi-resolution salient object features progressively from bottom to top is that the salient edge features are diluted while the salient object features are fused. Besides, the goal is to fuse salient object features and salient edge features so as to utilize their complementary information and obtain better prediction results. Hence, we propose a one-to-one guidance module; the experimental section validates this choice.

Specifically, we add sub-side paths for $S^{(3)}$, $S^{(4)}$, $S^{(5)}$, $S^{(6)}$. In each sub-side path, by fusing the salient edge features into the enhanced salient object features, we make the location of the high-level predictions more accurate and, more importantly, the segmentation details better. The salient edge guidance features (s-features) can be denoted as:

$G^{(i)} = \mathrm{UpT}(\hat{F}^{(i)}; \theta, F_E) + F_E, \quad i \in [3, 6]$.   (7)

Then, similar to the PSFEM, we adopt a series of convolutional layers T in each sub-side path to further enhance the s-features, and a transition layer D to convert the multi-channel feature map into a one-channel prediction map. For clarity, we denote the T and D in this module as T' and D'. By Eq. (3), we obtain the enhanced s-features $\hat{G}^{(i)}$.
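A sketch of the one-to-one guidance of Eq. (7), assuming (as above) that Trans is a 1x1 convolution and that the salient edge features F_E have 128 channels; this is our illustration, not the released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class OneToOneGuidance(nn.Module):
    """Sketch of the O2OGM fusion in Eq. (7): every enhanced salient object
    feature F_hat^(i), i in [3, 6], is guided by the same salient edge
    features F_E. Channel numbers are assumptions based on Tab. 1."""
    def __init__(self, obj_channels=(256, 512, 512, 512), edge_channels=128):
        super().__init__()
        # Trans(*; theta): match each side path's channels to those of F_E
        self.trans = nn.ModuleList(
            [nn.Conv2d(c, edge_channels, kernel_size=1) for c in obj_channels])

    def forward(self, f_hats, f_e):
        guided = []
        for f, trans in zip(f_hats, self.trans):  # f_hats = [F_hat^(3), ..., F_hat^(6)]
            x = F.relu(trans(f))
            x = F.interpolate(x, size=f_e.shape[2:], mode='bilinear',
                              align_corners=False)
            guided.append(x + f_e)                # G^(i) = UpT(F_hat^(i)) + F_E
        return guided                             # fed to T' and D' afterwards
```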
Here we also add deep supervision for these enhanced s-features. For each sub-side output prediction map, the loss is calculated as:

$L'^{(i)}(\hat{G}^{(i)}; W_{D'}^{(i)}) = -\sum_{j \in Y_+} \log \Pr(y_j = 1 \mid \hat{G}^{(i)}; W_{D'}^{(i)}) - \sum_{j \in Y_-} \log \Pr(y_j = 0 \mid \hat{G}^{(i)}; W_{D'}^{(i)}), \quad i \in [3, 6]$.   (8)

Then we fuse the multi-scale refined prediction maps to obtain a fused map. The loss function for the fused map can be denoted as:

$L'_f(\hat{G}; W_{D'}) = \sigma\Big(Y, \sum_{i=3}^{6} \beta_i f(\hat{G}^{(i)}; W_{D'}^{(i)})\Big)$,   (9)

where $\sigma(*, *)$ represents the cross-entropy loss between the prediction map and the saliency ground truth, which has the same form as Eq. (5). Thus the loss for this part and the total loss for the proposed network can be expressed as:

$L' = L'_f(\hat{G}; W_{D'}) + \sum_{i=3}^{6} L'^{(i)}(\hat{G}^{(i)}; W_{D'}^{(i)})$,   $L_t = L + L'$.   (10)
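Putting Eqs. (4)-(10) together, the total objective is a sum of per-pixel cross-entropy terms over the edge side output, the salient object side outputs, the guided sub-side outputs and the fused map. The sketch below is our own formulation; the fusion weights β_i are assumed to be folded into the fused logits, and since this excerpt does not specify how the salient edge ground truth is produced, it is simply passed in as a given.

```python
import torch
import torch.nn.functional as F

def bce(pred_logit, target):
    """Per-pixel cross-entropy of Eqs. (4), (5), (8); logits are the 1-channel
    outputs of the transition layers D / D'."""
    return F.binary_cross_entropy_with_logits(pred_logit, target, reduction='sum')

def total_loss(edge_logit, obj_logits, guided_logits, fused_logit,
               sal_gt, edge_gt):
    """L_t = L + L' (Eq. 10). `edge_gt` is the salient-edge ground truth
    (e.g. derived from the boundary of `sal_gt`; how it is produced is not
    stated in this excerpt, so this is an assumption)."""
    l_edge = bce(edge_logit, edge_gt)                          # Eq. (4)
    l_obj = sum(bce(p, sal_gt) for p in obj_logits)            # Eqs. (5)-(6)
    l_guided = sum(bce(p, sal_gt) for p in guided_logits)      # Eq. (8)
    l_fused = bce(fused_logit, sal_gt)                         # Eq. (9), beta_i folded in
    return (l_edge + l_obj) + (l_guided + l_fused)             # L + L'
```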
4. Experiments

4.1. Implementation Details

We train our model on the DUTS [46] dataset, following [33, 49, 59, 63]. For a fair comparison, we use VGG [43] and ResNet [16] as backbone networks, respectively. Our model is implemented in PyTorch. All the weights of the newly added convolutional layers are initialized randomly with a truncated normal distribution (σ = 0.01), and the biases are initialized to 0. The hyper-parameters are set as follows: learning rate 5e-5, weight decay 0.0005, momentum 0.9, and the loss weight of each side output equal to 1. A back-propagation step is performed for every ten images. We do not use the validation set during training. We train our model for 24 epochs and divide the learning rate by 10 after 15 epochs. During inference, we obtain a predicted salient edge map and a set of saliency maps; we directly use the fused prediction map as the final saliency map.
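The following is a runnable sketch consistent with the hyper-parameters listed above, using a stand-in model. The choice of SGD (suggested by the momentum setting) and the reading of "a back-propagation step for every ten images" as gradient accumulation over ten single-image steps are our assumptions, not details confirmed by the paper.

```python
import torch
import torch.nn as nn

# Stand-in for the full EGNet model; only the optimizer and schedule
# settings from Sec. 4.1 are illustrated here.
model = nn.Conv2d(3, 1, kernel_size=3, padding=1)

# Momentum and weight decay as stated; SGD itself is our assumption.
optimizer = torch.optim.SGD(model.parameters(), lr=5e-5,
                            momentum=0.9, weight_decay=5e-4)
# Train 24 epochs; divide the learning rate by 10 after 15 epochs.
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[15], gamma=0.1)

accum_steps = 10  # one parameter update per ten images (our reading)
for epoch in range(24):
    for step in range(accum_steps * 3):                      # dummy iterations
        x = torch.randn(1, 3, 64, 64)                        # placeholder input
        loss = model(x).sigmoid().mean() / accum_steps       # placeholder loss
        loss.backward()
        if (step + 1) % accum_steps == 0:
            optimizer.step()
            optimizer.zero_grad()
    scheduler.step()
```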
4.2. Datasets and Evaluation Metrics

We evaluate the proposed architecture on six widely used public benchmark datasets: ECSSD [56], PASCAL-S [30], DUT-OMRON [57], SOD [36, 44], HKU-IS [27], and DUTS [46]. ECSSD [56] contains 1000 semantically meaningful images with various complex scenes. PASCAL-S [30] contains 850 images chosen from the validation set of the PASCAL VOC segmentation dataset [8]. DUT-OMRON [57] contains 5168 high-quality but challenging images; images in this dataset contain one or more salient objects with relatively complex backgrounds. SOD [36] contains 300 images and was originally proposed for image segmentation; pixel-wise annotations of salient objects were generated by [44], and it is one of the most challenging datasets currently. HKU-IS [27] contains 4447 images with high-quality annotations, many of which have multiple disconnected salient objects; it is split into 2500 training images, 500 validation images and 2000 test images. DUTS [46] is the largest salient object detection benchmark; it contains 10553 images for training and 5019 images for testing, and most images are challenging, with various locations and scales. Following most recent works [33, 49, 52], we use the DUTS dataset to train the proposed model.

We use three widely used and standard metrics, F-measure, mean absolute error (MAE) [2], and a recently proposed structure-based metric, S-measure [10], to evaluate our model and other state-of-the-art models. F-measure is a weighted harmonic mean of precision and recall, formulated as:

$F_\beta = \frac{(1 + \beta^2)\,\mathrm{Precision} \times \mathrm{Recall}}{\beta^2 \times \mathrm{Precision} + \mathrm{Recall}}$,   (11)

where we set $\beta^2 = 0.3$ to weigh precision more than recall, as suggested in [5]. Precision denotes the ratio of correctly detected salient pixels among the pixels predicted as salient, and recall denotes the ratio of correctly detected salient pixels among the ground-truth salient pixels. Precision and recall are computed on binary images, so we first threshold the prediction map into a binary map; different thresholds yield different precision-recall pairs, from which the precision-recall curve can be plotted. Here we use the evaluation code provided by [17, 18]. Following most salient object detection methods [17, 18, 32, 59], we report the maximum F-measure over all precision-recall pairs.

MAE evaluates the average difference between the prediction map and the ground-truth map. Let $P$ and $Y$ denote the saliency map and the ground truth, both normalized to $[0, 1]$. We compute the MAE score as:

$\varepsilon = \frac{1}{W \times H} \sum_{x=1}^{W} \sum_{y=1}^{H} |P(x, y) - Y(x, y)|$,   (12)

where $W$ and $H$ are the width and height of the image, respectively.

S-measure focuses on evaluating the structural information of saliency maps and is closer to the human visual system than F-measure, so we include it for a more comprehensive evaluation. S-measure is computed as:

$S = \gamma S_o + (1 - \gamma) S_r$,   (13)

where $S_o$ and $S_r$ denote the object-aware and region-aware structural similarity, respectively, and $\gamma$ is set to 0.5 by default. More details can be found in [10].

Table 2. Quantitative comparison in terms of max F-measure (MaxF, larger is better), MAE (smaller is better) and S-measure (larger is better) over six widely used datasets: ECSSD [56], PASCAL-S [30], DUT-O [57], HKU-IS [27], SOD [36, 37] and DUTS-TE [46]. VGG-based methods: DCL* [28], DSS* [17, 18], MSR [26], NLDF [35], RAS [3], ELD* [13], DHS [32], RFCN* [48], UCF [62], Amulet [61], C2S [29], PAGR [63] and ours; ResNet-based methods: SRM* [49], DGRL [52], PiCANet* [33] and ours. '-' denotes that the corresponding method is trained on that dataset; * marks methods using pre-processing or post-processing. Our method achieves the state-of-the-art on these six widely used datasets under the three evaluation metrics.

Table 3. Ablation analyses on SOD [36] and DUTS-TE [46] in terms of MaxF, MAE and S-measure. B denotes the baseline model; edge PROG, edge TDLP, edge NLDF, MRF PROG and MRF OTO are introduced in Sec. 4.3.
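For concreteness, a minimal sketch of the MAE and maximum F-measure computations defined in Eqs. (11)-(12); the threshold sweep is our choice and need not match the evaluation code of [17, 18].

```python
import numpy as np

def mae(pred, gt):
    """Eq. (12): mean absolute error between maps normalized to [0, 1]."""
    return np.abs(pred - gt).mean()

def max_f_measure(pred, gt, beta2=0.3, steps=255):
    """Eq. (11) evaluated over a sweep of binarization thresholds;
    the maximum value over all thresholds is reported."""
    gt_bin = gt > 0.5
    best = 0.0
    for t in np.linspace(0, 1, steps, endpoint=False):
        pred_bin = pred > t
        tp = np.logical_and(pred_bin, gt_bin).sum()
        precision = tp / (pred_bin.sum() + 1e-8)
        recall = tp / (gt_bin.sum() + 1e-8)
        f = (1 + beta2) * precision * recall / (beta2 * precision + recall + 1e-8)
        best = max(best, f)
    return best
```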
4.3. Ablation Experiments and Analyses

In this section, with DUTS-TR [46] as the training set, we explore the effect of different components of the proposed network on the relatively difficult SOD dataset [36] and the recently proposed large dataset DUTS-TE [46].

4.3.1 The complementary information modeling

In this subsection, we explore the role of salient edge information, which is also our basic idea. The baseline is the U-Net architecture which integrates the multi-scale features (from Conv2-2 to Conv6-3) in the same way as the PSFEM (Fig. 2). We remove the side path $S^{(2)}$ from the baseline and then fuse the final saliency features $\hat{F}^{(3)}$ (the side path from Conv3-3) with the local Conv2-2 features to obtain the salient edge features. Finally, we integrate the salient edge features with the salient object features $\hat{F}^{(3)}$ to get the prediction mask. We denote this strategy of using edges as edge PROG. The result, shown in the second row of Tab. 3, proves that salient edge information is very useful for the salient object detection task.

4.3.2 Top-down location propagation

In this subsection, we explore the role of top-down location propagation. Compared with edge PROG in Sec. 4.3.1, we leverage the top-down location propagation to extract more accurate location information from the top level instead of from side path $S^{(3)}$. We call this strategy of using edges edge TDLP. Comparing the second and third rows of Tab. 3 confirms the effect of top-down location propagation. Besides, comparing the first and third rows of Tab. 3, we find that by explicitly modeling these two kinds of complementary information within the network, the performance is greatly improved on both datasets (3.1% and 2.4% under F-measure) without additional time and space consumption.

4.3.3 Mechanism of using edge cues

To demonstrate the advantages over NLDF [35], in which an IoU loss is added to the end of the network to penalize errors on edges, we add the same IoU loss to the baseline. This strategy is called edge NLDF. The performance is shown in the fourth row of Tab. 3. Compared with the baseline model, the improvement is limited, which also demonstrates that the proposed way of using edge information is more effective.
The visualization results are shown in Fig. 4. Compared with the baseline model without edge constraints, after we add the edge penalty used in NLDF [35], the edge information only helps refine the boundaries. In particular, this penalty can neither remove the redundant parts in the saliency prediction mask nor make up for the missing parts. In contrast, the proposed complementary information modeling method considers the complementarity between salient edge information and salient object information, and performs better in both segmentation and localization.

Besides, to further prove that salient edge detection and salient object detection are mutually helpful and complementary, we compare the salient edges generated by NLDF with the salient edges generated by our method; the pre-trained model and code are both provided by the authors. As shown in Tab. 4, the salient edges generated by our method are much better, especially under the recall and F-measure metrics, which proves that the edges are more accurate in our method.

Figure 4. Visual examples before and after adding edge cues (columns: source, B, B+edge NLDF, B+edge TDLP, GT). B denotes the baseline model; edge NLDF and edge TDLP represent the edge penalty used in NLDF [35] and the edge modeling method proposed in this paper. The details are introduced in Sec. 4.3.

Table 4. Comparison of the salient edges generated by NLDF and by our method, in terms of recall, precision and max F-measure on SOD and DUTS.

4.3.4 The complementary features fusion

After obtaining the salient edge features and the multi-resolution salient object features, we aim to fuse these complementary features. Here we compare three fusion methods. The first is the default way, which integrates the salient edge features $F_E$ with the salient object features $\hat{F}^{(3)}$ at the top of the U-Net architecture. The second fuses the multi-resolution features $\hat{F}^{(3)}$, $\hat{F}^{(4)}$, $\hat{F}^{(5)}$, $\hat{F}^{(6)}$ progressively, which is called MRF PROG. The third is the proposed one-to-one guidance, denoted MRF OTO; here MRF denotes multi-resolution fusion. The results are shown in the third, fifth and sixth rows of Tab. 3, respectively. It can be seen that the proposed one-to-one guidance method is the most suitable for our whole architecture.

4.4. Comparison with the State-of-the-art

In this section, we compare our proposed EGNet with 15 previous state-of-the-art methods, including DCL [28], DSS [17, 18], NLDF [35], MSR [26], ELD [13], DHS [32], RFCN [48], UCF [62], Amulet [61], PAGR [63], PiCANet [33], SRM [49], DGRL [52], RAS [3] and C2S [29]. Note that all the saliency maps of the above methods are produced by running the source codes or are pre-computed by the authors. The evaluation codes are provided in [10, 17, 18].

F-measure, MAE, and S-measure. We evaluate and compare our proposed method with other salient object detection methods in terms of F-measure, MAE and S-measure, as shown in Tab. 2.
We could see that different methods 00.20.40.60.810.60.70.80.91DCLRFCNMSRDSSAmuletSRMPAGRDGRLPiCANetOurs00.20.40.60.810.60.70.80.91DCLRFCNMSRDSSAmuletSRMPAGRDGRLPiCANetOurs00.20.40.60.810.60.70.80.91DCLRFCNMSRDSSAmuletSRMPAGRDGRLPiCANetOurs
We can see that different methods may use different backbone networks. Here, for a fair comparison, we train our model on VGG [43] and ResNet [16], respectively. Our model performs favorably against the state-of-the-art methods under all evaluation metrics on all compared datasets, especially on the relatively challenging SOD dataset [36, 44] (2.9% and 1.7% improvements in F-measure and S-measure) and the largest dataset DUTS [46] (3.0% and 2.5%). Specifically, compared with the current best approach [33], the average F-measure improvement over the six datasets is 1.9%. Note that this is achieved without any pre-processing or post-processing.

Precision-recall curves. Besides the numerical comparisons shown in Tab. 2, we plot the precision-recall curves of all compared methods over three datasets in Fig. 3. As can be seen, the solid red line, which denotes the proposed method, outperforms all other methods at most thresholds. Due to the help of the complementary salient edge information, the results yield sharp edges and accurate localization, which leads to a better PR curve.

Figure 3. Precision (vertical axis) versus recall (horizontal axis) curves on three popular salient object detection datasets: (a) DUTS-TE [46], (b) DUT-OMRON [57], (c) HKU-IS [27]. The proposed method performs favorably against the state-of-the-arts.

Visual comparison. In Fig. 5, we show some visualization results. It can be seen that our method performs better in salient object segmentation and localization. It is worth mentioning that, thanks to the salient edge features, our results not only highlight the salient regions but also produce coherent edges. For instance, for the first sample, due to the influence of the complex scene, other methods are not capable of localizing and segmenting the salient objects accurately; benefiting from the complementary salient edge features, our method performs better. For the second sample, in which the salient object is relatively small, our result is still very close to the ground truth.

Figure 5. Qualitative comparisons with state-of-the-arts (columns: image, GT, ours, PiCANet [33], PAGR [63], DGRL [52], SRM [49], UCF [62], Amulet [61], DSS [17, 18], DHS [32]).

5. Conclusion

In this paper, we aim to preserve salient object boundaries well. Different from other methods which integrate multi-scale features or leverage post-processing, we focus on the complementarity between salient edge information and salient object information. Based on this idea, we propose EGNet to model these complementary features within a single network. First, we extract the multi-resolution salient object features based on U-Net. Then, we propose a non-local salient edge feature extraction module which integrates local edge information and global location information to obtain the salient edge features. Finally, we adopt a one-to-one guidance module to fuse these complementary features. The salient object boundaries and localization are improved with the help of the salient edge features. Our model performs favorably against the state-of-the-art methods on six widely used datasets without any pre-processing or post-processing. We also provide analyses of the effectiveness of EGNet.

Acknowledgments. This research was supported by NSFC (61572264), the National Youth Talent Support Program, and the Tianjin Natural Science Foundation (17JCJQJC43700, 18ZXZNGX00110).