Extended Feature Pyramid Network for Small Object Detection
arXiv:2003.07021v1 [cs.CV] 16 Mar 2020

Chunfang Deng, Mengmeng Wang, Liang Liu, and Yong Liu†
{dengcf, mengmengwang, leonliuz}@zju.edu.cn, yongliu@iipc.zju.edu.cn
Zhejiang University

Abstract. Small object detection remains an unsolved challenge because it is hard to extract the information of small objects that cover only a few pixels. While scale-level corresponding detection in feature pyramid networks alleviates this problem, we find that feature coupling across scales still impairs the performance on small objects. In this paper, we propose the extended feature pyramid network (EFPN), which has an extra high-resolution pyramid level specialized for small object detection. Specifically, we design a novel module, named feature texture transfer (FTT), which super-resolves features and extracts credible regional details simultaneously. Moreover, we design a foreground-background-balanced loss function to alleviate the area imbalance between foreground and background. In our experiments, the proposed EFPN is efficient in both computation and memory, and yields state-of-the-art results on the small traffic-sign dataset Tsinghua-Tencent 100K and on the small category of the general object detection dataset MS COCO.

Keywords: Small Object Detection, Feature Pyramid Network, Feature Super-Resolution

1 Introduction

Object detection is a fundamental task underlying many advanced computer vision problems such as segmentation, image captioning, and video understanding. Over the past few years, the rapid development of deep learning has boosted the popularity of CNN-based detectors, which mainly include two-stage pipelines [8,7,28,5] and one-stage pipelines [24,27,20]. Although these general object detectors have improved accuracy and efficiency substantially, they still perform poorly when detecting small objects with only a few pixels. Since CNNs use pooling layers repeatedly to extract high-level semantics, the pixels of small objects can be filtered out during the downsampling process.

Utilizing low-level features is one way to pick up information about small objects. Feature pyramid network (FPN) [19] is the first method to enhance features by fusing features from different levels and constructing feature pyramids, where upper feature maps are responsible for larger object detection and lower feature maps are responsible for smaller object detection.
Fig. 1. The drawback of small object detection in vanilla FPN detectors. (a) Mapping between pyramid level and proposal size in vanilla FPN detectors; feature coupling: both small and medium objects are detected on the lowest level (P2) of FPN. (b) Detection performance of the original P2 and our P′2 on Tsinghua-Tencent 100K; poor performance of small objects on P2: the detection performance of P2 varies with scale, and the average precision (AP) and average recall (AR) decline sharply as instances turn small. The extended pyramid level P′2 in our EFPN mitigates this performance drop. [Figure: panel (a) plots proposal size (px) against FPN feature level for small, medium, and large objects; panel (b) plots AP/AR (%) of P2 and P′2 against object size (px).]

Although FPN improves multi-scale detection performance, the heuristic mapping mechanism between pyramid level and proposal size in FPN detectors may confuse small object detection. As shown in Fig. 1(a), small-sized objects must share the same feature map with medium-sized objects and some large-sized objects, while easy cases like large-sized objects can pick features from a suitable level. Besides, as shown in Fig. 1(b), the detection accuracy and recall of the FPN bottom layer fall dramatically as the object scale decreases. Fig. 1 suggests that feature coupling across scales in vanilla FPN detectors still degrades the ability to detect small objects.

Intuitively, another way of compensating for the information loss of small objects is to increase the feature resolution. Thus some super-resolution (SR) methods have been introduced to object detection. Early practices [11,3] directly super-resolve the input image, but the computational cost of feature extraction in the following network would be expensive. Li et al. [14] introduce GAN [10] to lift features of small objects to higher resolution. Noh et al. [25] use high-resolution target features to supervise SR of the whole feature map containing context information. These feature SR methods avoid adding to the burden of the CNN backbone, but they imagine the absent details only on the basis of the low-resolution feature map and neglect credible details encoded in other features of the backbone. Hence, they are inclined to fabricate fake textures and artifacts on CNN features, causing false positives.

In this paper, we propose the extended feature pyramid network (EFPN), which employs large-scale SR features with abundant regional details to decouple small and medium object detection. EFPN extends the original FPN with a high-resolution level specialized for small-sized object detection. To avoid the expensive computation that direct high-resolution image input would cause, the
extended high-resolution feature maps of our method are generated by an FPN-like framework with an embedded feature SR module. After construction of the vanilla feature pyramid, the proposed feature texture transfer (FTT) module first combines deep semantics from low-resolution features and shallow regional textures from the high-resolution feature reference. Then, the subsequent FPN-like lateral connection further enriches the regional characteristics with tailor-made intermediate CNN feature maps. One advantage of EFPN is that the generation of the high-resolution feature maps depends on original real features produced by the CNN and FPN, rather than on the unreliable imagination used in other similar methods. As shown in Fig. 1(b), the extended pyramid level with credible details in EFPN improves detection performance on small objects significantly.

Moreover, we introduce features generated from large-scale input images as supervision to optimize EFPN, and design a foreground-background-balanced loss function. We argue that a general reconstruction loss will lead to insufficient learning of positive pixels, as small instances merely cover a fractional area of the whole feature map. In light of the importance of foreground-background balance [20], we add the loss of object areas to the global loss function, drawing attention to the feature quality of positive pixels.

We evaluate our method on the challenging small traffic-sign dataset Tsinghua-Tencent 100K and the general object detection dataset MS COCO. The results demonstrate that the proposed EFPN outperforms other state-of-the-art methods on both datasets. Besides, compared with multi-scale testing, single-scale EFPN achieves similar performance with fewer computing resources. For clarity, the main contributions of our work can be summarized as:

(1) We propose the extended feature pyramid network (EFPN), which improves the performance of small object detection.
(2) We design a pivotal feature reference-based SR module named feature texture transfer (FTT), to endow the extended feature pyramid with credible details for more accurate small object detection.
(3) We introduce a foreground-background-balanced loss function to draw attention to positive pixels, alleviating the area imbalance between foreground and background.
(4) Our efficient approach significantly improves the performance of detectors, and achieves state-of-the-art results on Tsinghua-Tencent 100K and the small category of MS COCO.

2 Related Work

2.1 Deep Object Detectors

Deep learning based detectors have ruled general object detection due to their high performance. The successful two-stage methods [8,7,28,5] first generate Regions of Interest (RoIs), and then refine the RoIs with a classifier and a regressor. One-stage detectors [24,27,20], another kind of prevalent detector, directly conduct classification and localization on CNN feature maps with the help of
pre-defined anchor boxes. Recently, anchor-free frameworks [13,38,31,39] have also become increasingly popular. Despite the development of deep object detectors, small object detection remains an unsolved challenge. Dilated convolution [34] is introduced in [23,17,16] to augment receptive fields for multi-scale detection. However, general detectors tend to focus more on improving the performance on easier large instances, since the metric of general object detection is the average precision over all scales. Detectors specialized for small objects still need more exploration.

2.2 Cross-Scale Features

Utilizing cross-scale features is an effective way to alleviate the problems arising from object scale variation. Building image pyramids is a traditional approach to generating cross-scale features. Using features from different layers of a network is another kind of cross-scale practice. SSD [24] and MS-CNN [4] detect objects of different scales on different layers of the CNN backbone. FPN [19] constructs feature pyramids by merging features from lower layers and higher layers via a top-down pathway. Following FPN, FPN variants explore more information pathways in feature pyramids. PANet [22] adds an extra bottom-up pathway to pass shallow localization information upward. G-FRNet [1] introduces gate units on the pathway, which pass crucial information and block ambiguous information. NAS-FPN [6] delves into optimal pathway configuration using AutoML. Though these FPN variants improve the performance of multi-scale object detection, they continue to use the same number of layers as the original FPN. These layers are not tailored to small object detection, so performance on small objects remains poor.

2.3 Super-Resolution in Object Detection

Some studies introduce SR to object detection, since small object detection always benefits from larger scales. Image-level SR is adopted in specific situations where extremely small objects exist, such as satellite images [15] and images with crowded tiny faces [2]. But large-scale images are burdensome for the subsequent networks. Instead of super-resolving the whole image, SOD-MTGAN [3] only super-resolves the areas of RoIs, but large quantities of RoIs still require considerable computation. The other way of SR is to directly super-resolve features. Li et al. [14] use Perceptual GAN to enhance features of small objects with the characteristics of large objects. STDN [37] employs sub-pixel convolution on the top layers of DenseNet [12] to detect small objects and meanwhile reduce network parameters. Noh et al. [25] super-resolve the whole feature map and introduce a supervision signal into the training process. Nevertheless, the above-mentioned feature SR methods are all based on restricted information from a single feature map. Recent reference-based SR methods [35,36] have the capacity to enhance SR images with textures or contents from reference images. Enlightened by reference-based SR, we design a novel module to super-resolve features under the reference of
shallow features with credible details, thus generating features more suitable for small object detection.

Fig. 2. The framework of the extended feature pyramid network (EFPN). Here Ci denotes the feature map from stage i of the CNN backbone, and Pi denotes the corresponding pyramid level in EFPN. The top 4 layers of EFPN are vanilla FPN layers. The feature texture transfer (FTT) module integrates semantic contents from P3 and regional textures from P2. Then, an FPN-like top-down pathway passes the FTT module output P′3 down to form the final extended pyramid level P′2. The extended feature pyramid (P′2, P2, P3, P4, P5) will be fed to the following detector for further object localization and classification.

3 Our Approach

In this section, we introduce the proposed extended feature pyramid network (EFPN) in detail. First, we construct an extended feature pyramid, which is specialized for small objects, with a high-resolution feature map at the bottom. Specifically, we design a novel module named feature texture transfer (FTT) to generate intermediate features for the extended feature pyramid. Moreover, we employ a new foreground-background-balanced loss function to further enforce learning of positive pixels. The pipelines of the EFPN network and the FTT module are explained in Sec. 3.1 and Sec. 3.2, and Sec. 3.3 elaborates our loss function design.

3.1 Extended Feature Pyramid Network

Vanilla FPN constructs a 4-layer feature pyramid by upsampling high-level CNN feature maps and fusing them with lower features via lateral connections.
Table 1. Generation of C′2 in ResNet/ResNeXt backbones. A new branch without max-pooling in stage2 is added to generate C′2, simulating the semantics and resolution of C2 from a 2× input image. In EFPN, C2 and C′2 are generated simultaneously from the 1× input. The branches of C2 and C′2 share the same weights.

Layer Name | Input          | Stage1                                        | Stage2             | Output
C2         | 800 × 800 (1×) | 7 × 7, 64, stride 2; 3 × 3 max pool, stride 2 | residual blocks ×3 | C2: 200 × 200
C′2        | 800 × 800 (1×) | 7 × 7, 64, stride 2                           | residual blocks ×3 | C′2: 400 × 400

Although features on different pyramid levels are responsible for objects of different sizes, small object detection and medium object detection are still coupled on the same bottom layer P2 of FPN, as shown in Fig. 1. To relieve this issue, we propose EFPN, which extends the vanilla feature pyramid with a new level that accounts for small object detection with more regional details.

We implement the extended feature pyramid with an FPN-like framework embedded with a feature SR module. This pipeline directly generates high-resolution features from low-resolution images to support small object detection, while staying at a low computational cost. The overview of EFPN is shown in Fig. 2. The top 4 pyramid layers are constructed by top-down pathways for medium and large object detection. The bottom extension in EFPN, which contains an FTT module, a top-down pathway, and a purple pyramid layer in Fig. 2, aims to capture regional details for small objects. More specifically, in the extension, the 3rd and 4th pyramid layers of EFPN, denoted by the green and yellow layers respectively in Fig. 2, are mixed in the feature SR module FTT to produce the intermediate feature P′3 with selected regional information, denoted by a blue diamond in Fig. 2. Then, the top-down pathway merges P′3 with a tailor-made high-resolution CNN feature map C′2, producing the final extended pyramid layer P′2. We remove a max-pooling layer in ResNet/ResNeXt stage2 and take C′2 as the output of stage2, as shown in Table 1. C′2 shares the same representation level with the original C2 but contains more regional details due to its higher resolution. The smaller receptive field of C′2 also helps better locate small objects. Mathematically, the operations of the extension in the proposed EFPN can be described as

$$ P'_2 = (P'_3)\uparrow_{2\times} + C'_2 \tag{1} $$

where $\uparrow_{2\times}$ denotes 2× upscaling by nearest-neighbor interpolation.

In EFPN detectors, the mapping between proposal size and pyramid level still follows the fashion of [19]:

$$ l = l_0 + \log_2\!\left(\sqrt{wh}/224\right) \tag{2} $$

Here $l$ represents the pyramid level, $w$ and $h$ are the width and height of a box proposal, 224 is the canonical ImageNet pre-training size, and $l_0$ is the target level onto which a box proposal with $w \times h = 224^2$ should be mapped. Since the detector that follows EFPN fits various receptive fields adaptively, the receptive field drift mentioned in [25] can be ignored.
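To make Eqs. (1) and (2) concrete, here is a minimal PyTorch sketch of the extension step and the proposal-to-level mapping. This is a sketch under our own assumptions, not the authors' released code: we assume 256-channel pyramid features (with C′2 coming from the weight-shared stage2 branch of Table 1 that skips the 3 × 3 max pool), and the level indexing and clamping bounds in map_proposal_to_level are our guesses at how the extended pyramid is addressed.

```python
import math
import torch
import torch.nn.functional as F

def extend_pyramid(p3_prime: torch.Tensor, c2_prime: torch.Tensor) -> torch.Tensor:
    """Eq. (1): build P'2 by merging the FTT output P'3 with the stage2 feature C'2.

    p3_prime: FTT output, shape (B, 256, H, W)
    c2_prime: stage2 output without max-pooling, shape (B, 256, 2H, 2W)
    """
    # Nearest-neighbor 2x upscaling followed by an element-wise lateral addition.
    return F.interpolate(p3_prime, scale_factor=2, mode="nearest") + c2_prime

def map_proposal_to_level(w: float, h: float, l0: int = 4) -> int:
    """Eq. (2): assign a w x h box proposal to a pyramid level.

    l0 = 4 follows the FPN convention that a 224 x 224 proposal maps to P4.
    """
    l = l0 + math.log2(math.sqrt(w * h) / 224.0)
    # Flooring and clamping follow common FPN practice; indexing the extended
    # level P'2 as 1 and the top level P5 as 5 is our hypothetical convention.
    return int(max(1, min(5, math.floor(l))))
```

Note that the addition in extend_pyramid requires the spatial sizes to line up exactly, e.g., P′3 at 200 × 200 merged with C′2 at 400 × 400 for an 800 × 800 input, matching the resolutions in Table 1.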
Fig. 3. The framework of the FTT module. The main semantic contents of the input feature P3 are first extracted by a content extractor. Then we double the resolution of the content features by sub-pixel convolution. The texture extractor selects credible regional textures for small object detection from the wrap of the mainstream features and the reference features. Finally, a residual connection helps fuse the textures with the super-resolved content features to produce P′3 for the extended feature pyramid.

3.2 Feature Texture Transfer

Enlightened by image reference-based SR [35], we design the FTT module to super-resolve features and extract regional textures from reference features simultaneously. Without FTT, noise in the 4th level P2 of EFPN would directly pass down to the extended pyramid level and overwhelm the meaningful semantics. In contrast, the proposed FTT output synthesizes the strong semantics of the upper low-resolution features and the critical local details of the lower high-resolution reference features, while discarding the disturbing noise in the reference.

As shown in Fig. 3, the main input of the FTT module is the feature map P3 from the 3rd layer of EFPN, and the reference is the feature map P2 from the 4th layer of EFPN. The output P′3 can be defined as

$$ P'_3 = E_t\big(P_2 \,\|\, E_c(P_3)\uparrow_{2\times}\big) + E_c(P_3)\uparrow_{2\times} \tag{3} $$

where $E_t(\cdot)$ denotes the texture extractor component, $E_c(\cdot)$ denotes the content extractor component, $\uparrow_{2\times}$ here denotes 2× upscaling by sub-pixel convolution [29], and $\|$ denotes feature concatenation. The content extractor and the texture extractor are both composed of residual blocks.

In the main stream, we apply sub-pixel convolution to upscale the spatial resolution of the content features from the main input P3, considering its efficiency. Sub-pixel convolution augments pixels along the width and height dimensions by diverting pixels from the channel dimension. Denote the feature generated by convolution layers as $F \in \mathbb{R}^{H \times W \times C \cdot r^2}$. The pixel shuffle operator in sub-pixel convolution rearranges the feature to a map of shape $rH \times rW \times C$. This operation can be mathematically defined as

$$ PS(F)_{x,y,c} = F_{\lfloor x/r \rfloor,\, \lfloor y/r \rfloor,\; C \cdot r \cdot \mathrm{mod}(y,r) + C \cdot \mathrm{mod}(x,r) + c} \tag{4} $$

where $PS(F)_{x,y,c}$ denotes the output feature pixel at coordinates $(x, y, c)$ after the pixel shuffle operation $PS(\cdot)$, and $r$ denotes the upscaling factor. In our FTT module, we adopt $r = 2$ in order to double the spatial scale.

In the reference stream, the wrap of the reference feature P2 and the super-resolved content feature of P3 is fed to the texture extractor. The texture extractor aims to pick up credible textures that help small object detection and block useless noise from the wrap. The final element-wise addition of textures and contents ensures that the output integrates both semantic and regional information from the input and the reference. Hence, the feature map P′3 possesses selected reliable textures from the shallow feature reference P2, as well as similar semantics from the deeper level P3.
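To illustrate the data flow of Eqs. (3) and (4), below is a minimal PyTorch sketch of an FTT-style module. The 256-channel width, the number of residual blocks, and the 1 × 1 convolution that reduces the concatenated wrap back to the pyramid width are our assumptions; the paper's text does not fix these hyperparameters.

```python
import torch
import torch.nn as nn

class ResBlock(nn.Module):
    """conv-BN-ReLU-conv-BN with an identity shortcut, as drawn in Fig. 3."""
    def __init__(self, ch: int):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(ch, ch, 3, padding=1), nn.BatchNorm2d(ch), nn.ReLU(inplace=True),
            nn.Conv2d(ch, ch, 3, padding=1), nn.BatchNorm2d(ch),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.body(x)

class FTT(nn.Module):
    """Feature texture transfer, Eq. (3): P'3 = Et(P2 || Ec(P3)^2x) + Ec(P3)^2x."""
    def __init__(self, ch: int = 256, n_blocks: int = 2):
        super().__init__()
        self.content = nn.Sequential(*[ResBlock(ch) for _ in range(n_blocks)])
        # Sub-pixel convolution: expand channels by r^2 = 4, then pixel-shuffle (Eq. 4).
        self.expand = nn.Conv2d(ch, ch * 4, 1)
        self.shuffle = nn.PixelShuffle(2)  # (B, 4C, H, W) -> (B, C, 2H, 2W)
        self.texture = nn.Sequential(
            nn.Conv2d(2 * ch, ch, 1),      # reduce the concatenated "wrap" to C channels
            *[ResBlock(ch) for _ in range(n_blocks)],
        )

    def forward(self, p3: torch.Tensor, p2: torch.Tensor) -> torch.Tensor:
        content = self.shuffle(self.expand(self.content(p3)))  # Ec(P3), upscaled 2x
        wrap = torch.cat([p2, content], dim=1)                 # reference || content
        return self.texture(wrap) + content                    # residual fusion
```

Here nn.PixelShuffle(2) implements exactly the pixel rearrangement of Eq. (4) with r = 2: a (B, 4C, H, W) tensor becomes (B, C, 2H, 2W), so the super-resolved content lines up spatially with the reference P2 before concatenation.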
3.3 Training Loss

Foreground-Background-Balanced Loss. The foreground-background-balanced loss is designed to improve the comprehensive quality of EFPN features. A common global loss leads to insufficient learning of small object areas, because small objects only make up a fractional part of the whole image. The foreground-background-balanced loss function improves the feature quality of both background and foreground through two parts: 1) a global reconstruction loss and 2) a positive patch loss.

The global reconstruction loss mainly enforces resemblance to the real background features, since background pixels constitute most of an image. Here we adopt the $l_1$ loss commonly used in SR as the global reconstruction loss $L_{glob}$:

$$ L_{glob}(F, F^t) = \lVert F^t - F \rVert_1 \tag{5} $$

where $F$ denotes the generated feature map and $F^t$ denotes the target feature map.

The positive patch loss is used to draw attention to positive pixels, because severe foreground-background imbalance impedes detector performance [20]. We employ the $l_1$ loss on foreground areas as the positive patch loss $L_{pos}$:

$$ L_{pos}(F, F^t) = \frac{1}{N} \sum_{(x,y) \in P_{pos}} \lVert F^t_{x,y} - F_{x,y} \rVert_1 \tag{6} $$

where $P_{pos}$ denotes the patches of ground-truth objects, $N$ denotes the total number of positive pixels, and $(x, y)$ denotes the coordinates of pixels on the feature maps. The positive patch loss plays the role of a stronger constraint on the areas where objects are located, enforcing learning of the true representation of these areas.

The foreground-background-balanced loss function $L_{fbb}$ is then defined as

$$ L_{fbb}(F, F^t) = L_{glob}(F, F^t) + \lambda L_{pos}(F, F^t) \tag{7} $$

where $\lambda$ is a weight balancing factor. The balanced loss function mines true positives by improving the feature quality of foreground areas, and suppresses false positives by improving the feature quality of background areas.
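A minimal sketch of how Eqs. (5)-(7) might be computed in PyTorch, assuming the ground-truth patches are rasterized into a binary foreground mask; the mean reduction of the global term and the default value of lambda are our placeholders, not values stated in the paper.

```python
import torch

def balanced_loss(feat: torch.Tensor, target: torch.Tensor,
                  fg_mask: torch.Tensor, lam: float = 1.0) -> torch.Tensor:
    """Foreground-background-balanced loss of Eqs. (5)-(7).

    feat, target: generated and target feature maps, shape (B, C, H, W)
    fg_mask: binary mask of ground-truth object patches, shape (B, 1, H, W)
    lam: the weight balancing factor lambda (placeholder default)
    """
    diff = (target - feat).abs()
    # Eq. (5): global l1 reconstruction loss (mean-reduced here for stable magnitudes).
    l_glob = diff.mean()
    # Eq. (6): per-pixel l1 norm over channels, averaged over the N positive pixels.
    n_pos = fg_mask.sum().clamp(min=1.0)
    l_pos = (diff * fg_mask).sum() / n_pos
    # Eq. (7): weighted combination of the two terms.
    return l_glob + lam * l_pos
```

The mask broadcasts across the channel dimension, so only foreground locations contribute to the second term, mirroring the stronger constraint the paper places on object areas.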