Feature Denoising for Improving Adversarial Robustness

Cihang Xie¹,²* Yuxin Wu² Laurens van der Maaten² Alan Yuille¹ Kaiming He²
¹Johns Hopkins University ²Facebook AI Research

arXiv:1812.03411v2 [cs.CV] 25 Mar 2019

*Work done during an internship at Facebook AI Research.

Abstract

Adversarial attacks to image classification systems present challenges to convolutional networks and opportunities for understanding them. This study suggests that adversarial perturbations on images lead to noise in the features constructed by these networks. Motivated by this observation, we develop new network architectures that increase adversarial robustness by performing feature denoising. Specifically, our networks contain blocks that denoise the features using non-local means or other filters; the entire networks are trained end-to-end. When combined with adversarial training, our feature denoising networks substantially improve the state-of-the-art in adversarial robustness in both white-box and black-box attack settings. On ImageNet, under 10-iteration PGD white-box attacks where prior art has 27.9% accuracy, our method achieves 55.7%; even under extreme 2000-iteration PGD white-box attacks, our method secures 42.6% accuracy. Our method was ranked first in the Competition on Adversarial Attacks and Defenses (CAAD) 2018, where it achieved 50.6% classification accuracy on a secret, ImageNet-like test dataset against 48 unknown attackers, surpassing the runner-up approach by ∼10%. Code is available at https://github.com/facebookresearch/ImageNet-Adversarial-Training.

1. Introduction

Adversarial attacks to image classification systems [20] add small perturbations to images that lead these systems into making incorrect predictions. While the perturbations are often imperceptible or perceived as small "noise" in the image, these attacks are highly effective against even the most successful convolutional network based systems [11, 14]. The success of adversarial attacks leads to security threats in real-world applications of convolutional networks; equally importantly, it demonstrates that these networks perform computations that are dramatically different from those in human brains.

Figure 1 shows a randomly selected feature map of a ResNet [9] applied on a clean image (top) and on its adversarially perturbed counterpart (bottom). The figure suggests that adversarial perturbations, while small in the pixel space, lead to very substantial "noise" in the feature maps of the network. Whereas the features for the clean image appear to focus primarily on semantically informative content in the image, the feature maps for the adversarial image are activated across semantically irrelevant regions as well. Figure 2 displays more examples with the same pattern.

Figure 1. Feature map in the res3 block of an ImageNet-trained ResNet-50 [9] applied on a clean image (top) and on its adversarially perturbed counterpart (bottom). The adversarial perturbation was produced using PGD [16] with maximum perturbation ε = 16 (out of 256). In this example, the adversarial image is incorrectly recognized as "space heater"; the true label is "digital clock".

Motivated by this observation, we explore feature denoising approaches to improve the robustness of convolutional networks against adversarial attacks. We develop new convolutional network architectures equipped with building blocks designed to denoise feature maps.
Our networks are trained end-to-end on adversarially generated samples, allowing them to learn to reduce feature-map perturbations.

Empirically, we find that the best performance is achieved by networks using non-local means [2] for feature denoising, leading to models that are related to self-attention [23] and non-local networks [24]. Our ablation studies show that using mean filters, median filters, and bilateral filters [21] for feature denoising also improves adversarial robustness, suggesting that feature denoising is a good design principle.
Figure 2. More examples similar to Figure 1. We show feature maps corresponding to clean images (top) and to their adversarially perturbed versions (bottom). The feature maps for each pair of examples are from the same channel of a res3 block in the same ResNet-50 trained on clean images. The attacker has a maximum perturbation ε = 16 in the pixel domain.

Our models outperform the state-of-the-art in adversarial robustness against highly challenging white-box attacks on ImageNet [18]. Under 10-iteration PGD attacks [16], we report 55.7% classification accuracy on ImageNet, largely surpassing the prior art's 27.9% [10] with the same attack protocol. Even when faced with extremely challenging 2000-iteration PGD attacks that have not been explored in other literature, our model achieves 42.6% accuracy. Our ablation experiments also demonstrate that feature denoising consistently improves adversarial defense results in white-box settings.

Our networks are also highly effective under the black-box attack setting. The network based on our method won the defense track in the recent Competition on Adversarial Attacks and Defenses (CAAD) 2018, achieving 50.6% accuracy against 48 unknown attackers under a strict "all-or-nothing" criterion. This is a 10% absolute (20% relative) accuracy increase compared to the CAAD 2018 runner-up model. We also conduct ablation experiments in which we defend against the five strongest attackers from CAAD 2017 [13], demonstrating the potential of feature denoising.

2. Related Work

Adversarial training [6, 10, 16] defends against adversarial perturbations by training networks on adversarial images that are generated on-the-fly during training. Adversarial training constitutes the current state-of-the-art in adversarial robustness against white-box attacks; we use it to train our networks. Adversarial logit pairing (ALP) [10] is a type of adversarial training that encourages the logit predictions of a network for a clean image and its adversarial counterpart to be similar. ALP can be interpreted as "denoising" the logit predictions for the adversarial image, using the logits for the clean image as the "noise-free" reference.

Other approaches to increase adversarial robustness include pixel denoising. Liao et al. [15] propose to use high-level features to guide the pixel denoiser; in contrast, our denoising is applied directly on features. Guo et al. [8] transform the images via non-differentiable image preprocessing, such as image quilting [4], total variance minimization [17], and quantization. While these defenses may be effective in black-box settings, they can be circumvented in white-box settings because the attacker can approximate the gradients of their non-differentiable computations [1]. In contrast to [8], our feature denoising models are differentiable, yet they still improve adversarial robustness against very strong white-box attacks.

3. Feature Noise

Adversarial images are created by adding perturbations to images, constraining the magnitude of the perturbations to be small in terms of a certain norm (e.g., L∞ or L2). The perturbations are assumed to be either imperceptible to humans, or perceived as noise that does not impede human recognition of the visual content. Although the perturbations are constrained to be small at the pixel level, no such constraints are imposed at the feature level in convolutional networks.
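To make this asymmetry concrete, the following is a minimal sketch (not from the paper; it assumes a recent torchvision ResNet-50 and a pair of preprocessed clean/adversarial batches) that contrasts the bounded L∞ size of a pixel perturbation with the unbounded relative change it induces in a res3 feature map.

```python
# Sketch (assumption, not the authors' code): compare the bounded
# pixel-space perturbation with the unbounded change it induces in an
# intermediate ("res3") feature map of an ImageNet-trained ResNet-50.
import torch
from torchvision.models import resnet50

model = resnet50(weights="IMAGENET1K_V1").eval()

features = {}
def save_res3(_module, _inputs, output):
    # torchvision's `layer2` corresponds to the res3 stage of ResNet-50.
    features["res3"] = output.detach()

model.layer2.register_forward_hook(save_res3)

@torch.no_grad()
def pixel_vs_feature_gap(x_clean, x_adv):
    """x_clean, x_adv: preprocessed (N, 3, H, W) batches of the same images."""
    model(x_clean)
    f_clean = features["res3"]
    model(x_adv)
    f_adv = features["res3"]
    pixel_linf = (x_adv - x_clean).abs().max()               # constrained by the attacker's eps
    feature_rel = (f_adv - f_clean).norm() / f_clean.norm()  # nothing constrains this quantity
    return pixel_linf.item(), feature_rel.item()
```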
Indeed, the perturbation of the features induced by an adversarial image gradually increases as the image is propagated through the network [15, 8], and non-existing activations in the feature maps are hallucinated. In other words, the transformations performed by the layers in the network exacerbate the perturbation, and the hallucinated activations can overwhelm the activations due to the true signal, which leads the network to make wrong predictions.

We qualitatively demonstrate these characteristics of adversarial images by visualizing the feature maps they give rise to. Given a clean image and its adversarially perturbed counterpart, we use the same network (here, a ResNet-50 [9]) to compute its activations in the hidden layers. Figures 1 and 2 show typical examples of the same feature map on clean and adversarial images, extracted from the middle of the network (in particular, from a res3 block). These figures reveal that the feature maps corresponding to adversarial images have activations in regions without relevant visual content that resemble feature noise. Assuming that strong activations indicate the presence of semantic information about the image content (as often hypothesized [27]), the activations that are hallucinated by adversarial images reveal why the model predictions are altered.

In this study, we attempt to address this problem by feature denoising. In Figure 3, we visualize the feature maps of adversarial images, right before and right after a feature denoising operation (see the next section for details). The figure shows that feature denoising operations can successfully suppress much of the noise in the feature maps and make the responses focus on visually meaningful content. In the next sections, we present empirical evidence showing that models that perform feature denoising operations indeed improve adversarial robustness.

Figure 3. Adversarial images and their feature maps before (left) and after (right) the denoising operation (blue box in Figure 4). Each pair of feature maps is from the same channel of a res3 block in the same adversarially trained ResNet-50 equipped with (Gaussian) non-local means denoising blocks. The attacker has a maximum perturbation ε = 16 for each pixel.

Before we move on to describing our methods, we note that although the feature noise can easily be observed qualitatively, it is difficult to measure quantitatively. We found it nontrivial to compare feature noise levels between different models, in particular when the network architecture and/or training methods (standard or adversarial) change. For example, adding a denoising block to an end-to-end trained network tends to change the magnitudes and distributions of all features. Nevertheless, we believe the observed noisy appearance of features reflects a real phenomenon associated with adversarial images.

4. Denoising Feature Maps

Motivated by the empirical observations above, we propose to improve adversarial robustness by adding denoising blocks at intermediate layers of convolutional networks. The denoising blocks are trained jointly with all layers of the network in an end-to-end manner using adversarial training. The end-to-end adversarial training allows the resulting networks to (partly) eliminate feature-map noise that is data-dependent, i.e., noise that is generated by the attacker. It also naturally handles noise across multiple layers by considering how changes in earlier layers may impact the feature/noise distributions of later layers.

Empirically, we find that the best-performing denoising blocks are inspired by self-attention transformers [23] that are commonly used in machine translation and by non-local networks [24] that are used for video classification. In this study, we focus on the design of denoising blocks and study their denoising effects. Besides non-local means, we also experiment with simpler denoising operations such as bilateral filtering, mean filtering, and median filtering inside our convolutional networks.

4.1. Denoising Block

Figure 4 shows the generic form of our denoising block. The input to the block can be any feature layer in the convolutional neural network. The denoising block processes the input features by a denoising operation, such as non-local means or other variants. The denoised representation is first processed by a 1×1 convolutional layer, and then added to the block's input via a residual connection [9].¹

Figure 4. A generic denoising block. It wraps the denoising operation (e.g., non-local means, bilateral, mean, or median filtering) with a 1×1 convolution and an identity skip connection [9].

¹In our terminology, a "denoising operation" refers to the computation that only performs denoising (blue box in Figure 4), and a "denoising block" refers to the entire block (all of Figure 4).

The design in Figure 4 is inspired by self-attention [23] and non-local blocks [24]. However, only the non-local means [2] operation in the denoising block is actually doing the denoising; the 1×1 convolutions and the residual connection are mainly for feature combination. While various operations can suppress noise, they can also impact signal. The usage of the residual connection can help the network
to retain signals, and the tradeoff between removing noise and retaining signal is adjusted by the 1×1 convolution, which is learned end-to-end with the entire network. We will present ablation studies showing that both the residual connection and the 1×1 convolution contribute to the effectiveness of the denoising block. The generic form of the denoising block allows us to explore various denoising operations, as introduced next.

4.2. Denoising Operations

We experiment with four different instantiations of the denoising operation in our denoising blocks.

Non-local means. Non-local means [2] compute a denoised feature map y of an input feature map x by taking a weighted mean of features in all spatial locations L:

y_i = \frac{1}{C(x)} \sum_{\forall j \in L} f(x_i, x_j) \cdot x_j,   (1)

where f(x_i, x_j) is a feature-dependent weighting function and C(x) is a normalization function. We note that the weighted average in Eqn. (1) is over x_j, rather than another embedding of x_j, unlike [23, 24]: denoising is performed directly on the input feature x, and the correspondence between the feature channels in y and x is kept. Following [24], we consider two forms:

• Gaussian (softmax) sets f(x_i, x_j) = e^{\frac{1}{\sqrt{d}} \theta(x_i)^T \phi(x_j)}, where θ(x) and φ(x) are two embedded versions of x (obtained by two 1×1 convolutions), d is the number of channels, and C(x) = \sum_{\forall j \in L} f(x_i, x_j). By noticing that f/C is the softmax function, this version is shown in [24] to be equivalent to the softmax-based, self-attention computation of [23].

• Dot product sets f(x_i, x_j) = x_i^T x_j and C(x) = N, where N is the number of pixels in x. Unlike the Gaussian non-local means, the weights of the weighted mean do not sum up to 1 in dot-product non-local means. However, qualitative evaluations suggest it does suppress noise in the features. Experiments also show that this version improves adversarial robustness. Interestingly, we find that it is unnecessary to embed x in the dot-product version of non-local means for the model to work well. This is unlike the Gaussian non-local means, in which embedding is essential. The dot-product version provides a denoising operation with no extra parameters (blue box in Figure 5).

Figure 5, adapted from [24], shows the implementation of the denoising block based on non-local means.

Figure 5. A block with non-local means as its denoising operation. The blue part illustrates the implementation of non-local means in Eqn. (1). The shapes of the feature tensors are noted, with corresponding reshaping/transposing performed: here, H and W are the height and width of the feature maps, and we use 256 channels as an example. If softmax is used, it is the Gaussian version (with appropriate 1×1 convolution embeddings used; omitted in this figure); if softmax is not used, it is the dot-product version.

Bilateral filter. It is easy to turn the non-local means in Eqn. (1) into a "local mean". Doing so leads to the classical bilateral filter [21] that is popular for edge-preserving denoising. Formally, it is defined as:

y_i = \frac{1}{C(x)} \sum_{\forall j \in \Omega(i)} f(x_i, x_j) \cdot x_j.   (2)

This equation only differs from Eqn. (1) in the neighborhood, Ω(i), which is a local region (e.g., a 3×3 patch) around pixel i. In Eqn. (2), we consider the Gaussian and dot-product implementations of the weights as before.

Mean filter. Perhaps the simplest form of denoising is the mean filter (average pooling with a stride of 1). Mean filters reduce noise but also smooth structures, so it is reasonable to expect them to perform worse than the above weighted means.
However, somewhat surprisingly, experiments show that denoising blocks using mean filters as the denoising operation can still improve adversarial robustness.

Median filter. Lastly, we consider an interesting denoising filter that has rarely been used in deep networks: median filtering. The median filter is defined as:

y_i = \mathrm{median}\{\forall j \in \Omega(i) : x_j\},   (3)

where the median is taken over a local region, Ω(i), and is performed separately for each channel. Median filters are known to be good at removing salt-and-pepper noise and outliers of similar kind. Training convolutional networks that contain median filters is an open problem, but we find experimentally that using median filters as a denoising operation can also improve adversarial robustness.

In summary, our study explores a rich collection of denoising operations. Sec. 6 reports the results for all the denoising operations described above.
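As a concrete illustration of Sec. 4.1 and the operations above, here is a minimal sketch (an assumption on our part, written in PyTorch; the authors' released code is TensorFlow-based) of a denoising block that wraps either the parameter-free dot-product non-local means of Eqn. (1) or a 3×3 mean filter with a 1×1 convolution and an identity skip connection, as in Figure 4.

```python
# Sketch (assumption: PyTorch, not the released TensorFlow code) of the
# generic denoising block in Figure 4 with two of the denoising
# operations from Sec. 4.2.
import torch
import torch.nn as nn
import torch.nn.functional as F

def nonlocal_means_dot_product(x):
    """Eqn. (1) with f(x_i, x_j) = x_i^T x_j and C(x) = N (number of pixels).

    Parameter-free; the weighted mean is taken directly over the input
    features x_j, so the channel correspondence between y and x is kept.
    """
    n, c, h, w = x.shape
    flat = x.view(n, c, h * w)                               # (N, C, HW)
    weights = torch.bmm(flat.transpose(1, 2), flat)          # (N, HW, HW): x_i^T x_j
    y = torch.bmm(flat, weights.transpose(1, 2)) / (h * w)   # (1/N) sum_j f(x_i, x_j) x_j
    return y.view(n, c, h, w)

def mean_filter_3x3(x):
    """Simplest denoising operation: average pooling with stride 1."""
    return F.avg_pool2d(x, kernel_size=3, stride=1, padding=1)

class DenoisingBlock(nn.Module):
    """Denoising operation wrapped by a 1x1 convolution and a residual
    connection (Figure 4); the 1x1 conv adjusts the noise-removal vs.
    signal-retention tradeoff, and the identity skip retains the signal."""

    def __init__(self, channels, denoise_op=nonlocal_means_dot_product):
        super().__init__()
        self.denoise_op = denoise_op
        self.conv1x1 = nn.Conv2d(channels, channels, kernel_size=1)

    def forward(self, x):
        return x + self.conv1x1(self.denoise_op(x))
```

In the paper's default configuration, four such blocks are appended after the last residual blocks of res2 through res5. The Gaussian (softmax) variant would additionally embed x with two 1×1 convolutions, scale by 1/√d, and apply a softmax over the HW dimension before taking the weighted mean.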
5. Adversarial Training

We show the effectiveness of feature denoising on top of very strong baselines. Our strong experimental results are partly driven by a successful implementation of adversarial training [6, 16]. In this section, we describe our implementation of adversarial training, which is used for training both baseline models and our feature denoising models.

The basic idea of adversarial training [6, 16] is to train the network on adversarially perturbed images. The adversarially perturbed images can be generated by a given white-box attacker based on the current parameters of the models. We use Projected Gradient Descent (PGD)² [16] as the white-box attacker for adversarial training.

PGD attacker. PGD is an iterative attacker. At each iteration, it performs a gradient descent step in the loss function w.r.t. the image pixel values, based on an adversarially selected output target. Next, it projects the resulting perturbed images back into the feasible solution space: within a maximum per-pixel perturbation of ε of the clean image (that is, subject to an L∞ constraint). The hyper-parameters of the PGD attacker during adversarial training are: the maximum perturbation for each pixel ε = 16, the attack step size α = 1, and the number of attack iterations n = 30. For this PGD in adversarial training, we can initialize the adversarial image by the clean image, or randomly within the allowed ε cube [16]. We randomly choose from both initializations in the PGD attacker during adversarial training: 20% of training batches use clean images to initialize PGD, and 80% use random points within the allowed ε cube.

Distributed training with adversarial images. For each mini-batch, we use PGD to generate adversarial images for that mini-batch. Then we perform a one-step SGD update on these perturbed images and update the model weights. Our SGD update is based exclusively on adversarial images; the mini-batch contains no clean images.

Because a single SGD update is preceded by n-step PGD (with n = 30), the total amount of computation in adversarial training is ∼n× bigger than standard (clean) training. To make adversarial training practical, we perform distributed training using synchronized SGD on 128 GPUs. Each mini-batch contains 32 images per GPU (i.e., the total mini-batch size is 128×32 = 4096). We follow the training recipe of [7]³ to train models with such large mini-batches. On ImageNet, our models are trained for a total of 110 epochs; we decrease the learning rate by 10× at the 35-th, 70-th, and 95-th epoch. A label smoothing [19] of 0.1 is used. The total time needed for adversarial training on 128 Nvidia V100 GPUs is approximately 38 hours for the baseline ResNet-101 model, and approximately 52 hours for the baseline ResNet-152 model.

²Publicly available: https://github.com/MadryLab/cifar10_challenge
³Implemented using the publicly available Tensorpack framework [25].

6. Experiments

We evaluate feature denoising on the ImageNet classification dataset [18], which has ∼1.28 million images in 1000 classes. Following common protocols [1, 10] for adversarial images on ImageNet, we consider targeted attacks when evaluating under the white-box settings, where the targeted class is selected uniformly at random; targeted attacks are also used in our adversarial training. We evaluate top-1 classification accuracy on the 50k ImageNet validation images that are adversarially perturbed by the attacker (regardless of its targets), also following [1, 10].
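Both adversarial training (Section 5) and the white-box evaluation below rely on a targeted L∞ PGD attacker. The following is a minimal sketch under the stated hyper-parameters (ε = 16, α = 1, n = 30 during training), assuming PyTorch, pixel values in [0, 255], and a model that returns logits; it is an illustration, not the authors' released implementation.

```python
# Minimal targeted L-infinity PGD sketch (assumption: PyTorch; pixels in
# [0, 255]; `model` returns logits). Hyper-parameters follow Section 5.
import torch
import torch.nn.functional as F

def pgd_targeted(model, x, target, eps=16.0, alpha=1.0, iters=30,
                 random_start=True):
    """Push images toward an adversarially chosen `target` class while
    staying within an eps cube around the clean images x."""
    x_adv = x.clone()
    if random_start:
        # 80% of training batches start from a random point in the eps cube;
        # the remaining 20% start from the clean image (random_start=False).
        x_adv = (x_adv + torch.empty_like(x_adv).uniform_(-eps, eps)).clamp(0, 255)
    for _ in range(iters):
        x_adv = x_adv.detach().requires_grad_(True)
        loss = F.cross_entropy(model(x_adv), target)
        grad = torch.autograd.grad(loss, x_adv)[0]
        with torch.no_grad():
            x_adv = x_adv - alpha * grad.sign()                    # descend toward the target class
            x_adv = torch.min(torch.max(x_adv, x - eps), x + eps)  # project into the eps cube
            x_adv = x_adv.clamp(0, 255)                            # keep valid pixel values
    return x_adv.detach()
```

During adversarial training, each synchronized SGD step would then be taken on `pgd_targeted(model, x, random_targets)` instead of on clean images, which accounts for the ∼n× training cost noted above.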
In this paper, adversarial perturbation is considered under the L∞ norm (i.e., maximum difference for each pixel), with an allowed maximum value of ε. The value of ε is relative to the pixel intensity scale of 256.

Our baselines are ResNet-101/152 [9]. By default, we add 4 denoising blocks to a ResNet: one is added after the last residual block of res2, res3, res4, and res5, respectively.

6.1. Against White-box Attacks

Following the protocol of ALP [10], we report defense results against PGD as the white-box attacker.⁴ We evaluate with ε = 16, a challenging case for defenders on ImageNet. Following [16], the PGD white-box attacker initializes the adversarial perturbation from a random point within the allowed ε cube. We set its step size α = 1, except for 10-iteration attacks where α is set to ε/10 = 1.6. We consider numbers of PGD attack iterations ranging from 10 to 2000.

Main results. Figure 6 shows the main results. We first compare with ALP [10], the previous state-of-the-art. ALP was evaluated under 10-iteration PGD attacks in [10], on Inception-v3 [19]. It achieves 27.9% accuracy on ImageNet validation images (Figure 6, purple triangle).

ResNet-101 and ResNet-152 in Figure 6 are our baseline models (without any denoising blocks) trained using our adversarial training implementation. Even with the lower-capacity model of R-101, our baseline is very strong: it has 49.7% accuracy under 10-iteration PGD attacks, considerably better than the ALP result. This shows that our adversarial training system is solid; we note that the comparison with ALP is at the system level, as the two differ in other aspects (backbone networks, implementations, etc.).

"R-152, denoise" in Figure 6 is our ResNet-152 model with four denoising blocks added. Here we show the best-performing version (non-local with Gaussian), which we ablate next. There is a consistent performance improvement introduced by the denoising blocks. Under the 10-iteration PGD attack, the denoising blocks improve the accuracy of the ResNet-152 baseline by 3.2%, from 52.5% to 55.7% (Figure 6, right).

⁴We have also evaluated other attackers, including FGSM [6], iterative FGSM [12], and its momentum variant [3]. Similar to [10], we found that PGD is the strongest white-box attacker among them.
Figure 6. Defense against white-box attacks on ImageNet. The left plot shows results against a white-box PGD attacker with 10 to 2000 attack iterations. The right plot zooms in on the results with 10 to 100 attack iterations. The maximum perturbation is ε = 16.

Our results are robust even under 2000-iteration PGD attacks. To our knowledge, such a strong attack has not been previously explored on ImageNet. ALP [10] was only evaluated against 10-iteration PGD attacks (Figure 6), and its claimed robustness is subject to controversy [5]. Against 2000-iteration PGD attacks, our ResNet-152 baseline has 39.2% accuracy, and its denoising counterpart is 3.4% better, achieving 42.6%. We also observe that the attacker's gains diminish between 1000 and 2000 attack iterations.

We note that in this white-box setting, the attackers can iteratively back-propagate through the denoising blocks and create adversarial perturbations that are tailored to the denoisers. Recent work [1] reports that pixel denoising methods can be circumvented by attackers in white-box settings. By contrast, feature denoising leads to consistent improvements in white-box settings, suggesting that feature denoising blocks make it more difficult to fool networks.

Variants of denoising operations. Next, we evaluate variants of the denoising operations in Sec. 4. In these ablations, we add blocks of different kinds to the baseline ResNet-152. We consider the following denoising operations: 3×3 mean filtering, 3×3 median filtering, 3×3 bilateral filtering (Eqn. (2)), and non-local filtering. In our ablation study, we further consider a "null" version of the denoising block: the block in Figure 4 trivially becomes a residual block with a single 1×1 convolution. Further, we also compare with adding 4 standard bottleneck [9] blocks, which essentially yields ResNet-164. All models are trained by adversarial training. Figure 7 shows the white-box attack results; for simplicity, we show the PGD attacker with up to 100 attack iterations in this ablation.

Figure 7. Ablation: denoising variants for defending against white-box attacks on ImageNet. All models other than the ResNet-152 baseline add 4 blocks to it. The attacker is PGD under different attack iterations, with ε = 16. All denoising models are better than the R-152 baseline and the "null" version.

All of these denoising operations have better accuracy than: (i) the ResNet-152 baseline, (ii) adding 4 standard bottleneck blocks, and (iii) adding 4 "null" denoising blocks. It is worth noticing that the 1×1 null version has exactly the same number of extra parameters as the mean filtering, median filtering, and the dot-product versions of bilateral/non-local filtering (which have no embedding). The null version is worse than all of them (Figure 7). Also, while adding standard bottleneck blocks is helpful, adding denoising blocks of any version is more accurate. These results suggest that the extra parameters are not the main reason for our accuracy improvements; feature denoising appears to be a general approach particularly useful for adversarial robustness.

Our best-performing model is given by the non-local (Gaussian) version, which we use by default in other parts of the paper unless noted. Interestingly, this Gaussian version is just marginally better than the dot-product version.
attack iterations    non-local, Gaussian    removing 1×1    removing residual
10                   55.7                   52.1            NaN
100                  45.5                   36.8            NaN

Table 1. Ablation: denoising block design for defending against white-box attacks on ImageNet. Our networks have four (Gaussian) non-local means denoising blocks. We indicate the performance of models we were unable to train by "NaN".

Design decisions of the denoising block. The denoising block in Figure 4 has a 1×1 convolution and a residual connection. Although neither component performs denoising, both are important for the denoising blocks to work well. Next, we ablate the behavior of the 1×1 convolution and the residual connection.

This ablation is in Table 1. We investigate ResNet-152 with four non-local, Gaussian denoising blocks; all models are trained by adversarial training. When removing the 1×1 convolution from the denoising block, the accuracy drops considerably: for example, it decreases from 45.5% to 36.8% under 100-iteration PGD attacks. On the other hand, removing the residual connection makes training unstable, and its loss does not decrease during our adversarial training.

These results suggest that denoising features in itself is not sufficient. As suppressing noise may also remove useful signal, it appears essential to properly combine the denoised features with the input features in denoising blocks.

6.2. Against Black-Box Attacks

Next, we evaluate defending against black-box attacks. To have an unbiased yet challenging set of attackers, we study the 5 best attackers of the NIPS 2017 CAAD competition [13], for which code is publicly available. We use the latest CAAD 2018 evaluation criterion, which we call "all-or-nothing": an image is considered correctly classified only if the model correctly classifies all adversarial versions of this image created by all attackers. This is a challenging evaluation scenario for the defender. Following the CAAD black-box setting, the maximum perturbation for each pixel is ε = 32, which also makes defense more difficult. Note that our models are trained with ε = 16.

Table 2 shows the results of defending against black-box attacks on ImageNet validation images. To highlight the difficulty of the new "all-or-nothing" criterion, we find that the CAAD 2017 winner [15] has only 0.04% accuracy under this criterion. We find that it is mainly vulnerable to two of the five attackers.⁵ ⁶ If we remove these two attackers, [15] has 13.4% accuracy in the "all-or-nothing" setting.

With the "all-or-nothing" criterion, our ResNet-152 baseline has 43.1% accuracy against all five attackers. This number suggests that a successful implementation of adversarial training is critical for adversarial robustness.

⁵https://github.com/pfnet-research/nips17-adversarial-attack
⁶https://github.com/toshi-k/kaggle-nips-2017-adversarial-attack

model                                  accuracy (%)
CAAD 2017 winner                       0.04
CAAD 2017 winner, under 3 attackers    13.4
ours, R-152 baseline                   43.1
+4 denoise: null (1×1 only)            44.1
+4 denoise: non-local, dot product     46.2
+4 denoise: non-local, Gaussian        46.4
+all denoise: non-local, Gaussian      49.5

Table 2. Defense against black-box attacks on ImageNet. We show top-1 classification accuracy on the ImageNet validation set. The attackers are the 5 best attackers in CAAD 2017. We adopt the CAAD 2018 "all-or-nothing" criterion for defenders. The 2017 winner has 0.04% accuracy under this strict criterion; if we remove the 2 attackers it is most vulnerable to, it has 13.4% accuracy under the 3 remaining attackers.
On top of our strong ResNet-152 baseline, adding four non-local denoising blocks improves the accuracy to 46.4% (Table 2). Interestingly, both the Gaussian and dot-product versions perform similarly (46.4% vs. 46.2%), although the Gaussian version has more parameters due to its embedding. Furthermore, the null version has 44.1% accuracy; this is worse than the non-local, dot-product version even though they have the same number of parameters, and this 1×1 null version is 1.0% better than the ResNet-152 baseline.

We have also studied the local variants of denoising blocks, including mean, median, and bilateral filters. They have 43.6% ∼ 44.4% accuracy in this black-box setting. Their results are not convincingly better than the null version's results. This suggests that non-local denoising is more important than local denoising for robustness against these black-box attackers.

Pushing the envelope. To examine the potential of our model, we add denoising blocks to all residual blocks (one denoising block after each residual block) in ResNet-152. We only study the non-local Gaussian version here. To make training feasible, we use the sub-sampling trick in [24]: the feature map of x_j in Eqn. (1) is sub-sampled (by a 2×2 max pooling) when performing the non-local means, while the feature map of x_i remains full-sized. We only use sub-sampling in this case. This model achieves 49.5% accuracy, which is 6.4% better than the ResNet-152 baseline's 43.1% under the black-box setting (Table 2).

CAAD 2018 challenge results. Finally, we report the results from the latest CAAD 2018 competition. The 2018 defense track adopts the "all-or-nothing" criterion mentioned above; in this case, every defense entry needs to defend against 48 unknown attackers submitted to the same challenge (in contrast to 5 attackers in our black-box ablation above). The test data is a secret, ImageNet-like dataset. The maximum perturbation for each pixel is ε = 32.
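The sub-sampling trick described under "Pushing the envelope" can be sketched as a small modification of the earlier dot-product non-local means sketch (again an assumption in PyTorch, not the released code): only the x_j side of Eqn. (1) is 2×2 max-pooled, so the output stays at the input resolution.

```python
# Sketch (assumption: PyTorch): dot-product non-local means with the
# sub-sampling trick of [24]; x_j is 2x2 max-pooled, x_i stays full-sized.
import torch
import torch.nn.functional as F

def nonlocal_means_dot_product_subsampled(x):
    n, c, h, w = x.shape
    x_i = x.view(n, c, h * w)                          # full-resolution positions
    x_j = F.max_pool2d(x, kernel_size=2, stride=2)     # sub-sampled map for the weighted mean
    x_j = x_j.view(n, c, -1)                           # (N, C, HW/4)
    weights = torch.bmm(x_i.transpose(1, 2), x_j)      # (N, HW, HW/4): x_i^T x_j
    y = torch.bmm(x_j, weights.transpose(1, 2)) / x_j.shape[-1]
    return y.view(n, c, h, w)                          # output keeps the input resolution
```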
Figure 8. CAAD 2018 results of the adversarial defense track. The first-place entry is based on our method. We only show the 5 winning submissions here, out of more than 20 submissions.

Figure 8 shows the accuracy of the 5 best entries in the CAAD 2018 defense track. The winning entry, shown in the blue bar, was based on our method, using a ResNeXt-101-32×8 backbone [26] with non-local denoising blocks added to all residual blocks. This entry uses only single-crop, single-model testing. It achieves 50.6% accuracy against 48 unknown attackers, which is ∼10% absolute (20% relative) better than the second place's 40.8% accuracy.

We also report the white-box performance of this winning entry on ImageNet. Under 10-iteration and 100-iteration PGD attacks, it achieves 56.0% and 40.4% accuracy, respectively. These results are slightly worse than the robustness of the ResNet-152 based models reported in Section 6.1. We note that this white-box robustness comparison is at the system level, as the winning entry was trained with a slightly different parameter setting.

We emphasize that the CAAD 2018 defense task is very challenging because of the "all-or-nothing" criterion and the many unknown (potentially new state-of-the-art) attackers. In fact, except for the two leading teams, all other entries have <10% accuracy and many of them have <1% accuracy. This highlights the significance of our 50.6% accuracy.

6.3. Denoising Blocks in Non-Adversarial Settings

Thus far we have focused on denoising blocks for improving adversarial defense. Because our denoising blocks are components of the convolutional networks, these networks can also be trained without adversarial training for the classification of "clean" images (i.e., the original ImageNet dataset task). We believe that studying the non-adversarial setting can help us better understand the behavior of denoising blocks.

Table 3 presents the clean-image performance of models that were not adversarially trained. We compare the baseline R-152, adding standard bottleneck blocks, adding "null" (1×1) denoising blocks, and adding denoising blocks of various types. In the clean setting, these denoising blocks have no obvious advantage over the baseline R-152, adding standard bottleneck blocks, or adding "null" denoising blocks. In fact, all results are within about ±0.2% of the baseline R-152's result, which is not a significant difference if we also consider the natural variance between separate training runs of the same model (see the baseline R-152 rows in Table 3).

model                              accuracy (%)
R-152 baseline                     78.91
R-152 baseline, run 2              +0.05
R-152 baseline, run 3              -0.04
+4 bottleneck (R-164)              +0.13
+4 denoise: null (1×1 only)        +0.15
+4 denoise: 3×3 mean filter        +0.01
+4 denoise: 3×3 median filter      -0.12
+4 denoise: bilateral, Gaussian    +0.15
+4 denoise: non-local, Gaussian    +0.17

Table 3. Accuracy on clean images in the ImageNet validation set when trained on clean images. All numbers except the first row are reported as the accuracy difference with respect to the first R-152 baseline result. For R-152, we run training 3 times independently to show the natural random variation of the same architecture. All denoising models show no significant difference and are within ±0.2% of the baseline R-152's result.

We also find that adding non-local denoising blocks to the shallower ResNet-50 can moderately improve clean accuracy by 0.7% in the non-adversarial setting, but doing so on ResNet-152 has diminishing gain.
This, however, is not the case for adversarial images. These results suggest that the denoising blocks could have special advantages in settings that require adversarial robustness. This observation matches our intuition that denoising blocks are designed to reduce feature noise, which only appears when classifying adversarial images.

Finally, we report that our ResNet-152 baseline with adversarial training has 62.32% accuracy when tested on clean images, whereas its counterpart with "clean" training obtains 78.91%. For the denoising version (non-local, Gaussian), the accuracy of the adversarially trained network is 65.30% on clean images, whereas its cleanly trained counterpart obtains 79.08%. This tradeoff between adversarial and clean training has been observed before (e.g., in [22]); we expect it to be the subject of future research.

7. Conclusion

Motivated by the noisy appearance of feature maps from adversarial images, we have demonstrated the potential of feature denoising for improving the adversarial robustness of convolutional networks. Interestingly, our study suggests that there are certain architecture designs (viz., denoising blocks) that are particularly good for adversarial robustness, even though they do not lead to accuracy improvements over baseline models in "clean" training and testing scenarios. When combined with adversarial training, these particular architecture designs may be more appropriate for modeling the underlying distribution of adversarial images. We hope our work will encourage researchers to start designing convolutional network architectures that have "innate" adversarial robustness.