POLY-YOLO: HIGHER SPEED, MORE PRECISE DETECTION AND
INSTANCE SEGMENTATION FOR YOLOV3
A PREPRINT
Petr Hurtik*, Vojtech Molek*, Jan Hula*, Marek Vajgl*, Pavel Vlasanek∗, and Tomas Nejezchleba†
May 28, 2020
ABSTRACT
We present a new version of YOLO with better performance and extended with instance segmentation
called Poly-YOLO. Poly-YOLO builds on the original ideas of YOLOv3 and removes two of its
weaknesses: a large amount of rewritten labels and inefficient distribution of anchors. Poly-YOLO
reduces the issues by aggregating features from a light SE-Darknet-53 backbone with a hypercolumn
technique, using stairstep upsampling, and produces a single scale output with high resolution. In
comparison with YOLOv3, Poly-YOLO has only 60% of its trainable parameters but improves
mAP by a relative 40%. We also present Poly-YOLO lite with fewer parameters and a lower
output resolution. It has the same precision as YOLOv3, but it is three times smaller and twice
as fast, thus suitable for embedded devices. Finally, Poly-YOLO performs instance segmentation
using bounding polygons. The network is trained to detect size-independent polygons defined on
a polar grid. Vertices of each polygon are predicted together with their confidence, and therefore
Poly-YOLO produces polygons with a varying number of vertices. Source code is available at
https://gitlab.com/irafm-ai/poly-yolo.
Keywords Object detection · Instance segmentation · YOLOv3 · Bounding box · Bounding polygon · Realtime
detection
arXiv:2005.13243v1 [cs.CV] 27 May 2020
Figure 1: The figure shows the instance segmentation performance of the proposed Poly-YOLO algorithm applied to the
Cityscapes dataset, running at 26 FPS on a mid-tier graphics card. The image was cropped for visibility.
∗University of Ostrava, Centre of Excellence IT4Innovations, Institute for Research and Applications of Fuzzy Modeling, 30. dubna 22, Ostrava, Czech Republic
†Varroc Lighting Systems, Suvorovova 195, Šenov u Nového Jičína, Czech Republic.
1 Problem statement
Object detection is a process where all important areas containing objects of interest are bounded while the background
is ignored. Usually, the object is bounded by a box that is expressed in terms of spatial coordinates of its top-left corner
and its width and height. The disadvantage of this approach is that for the objects of complex shapes, the bounding
box also includes background, which can occupy a significant part of the area as the bounding box does not wrap the
object tightly. Such behavior can decrease the performance of a classifier applied over the bounding box [1] or may
not fulfill requirements of precise detection [2]. To avoid the problem, classical detectors such as Faster R-CNN [3]
or RetinaNet [4] were modified into a version of Mask R-CNN [5] or RetinaMask [6]. These methods also infer the
instance segmentation, i.e., each pixel in the bounding box is classified into object/background classes. The limitation
of these methods is their computation speed: they are unable to reach real-time performance on non-high-tier
hardware. The problem we focus on is to create a precise detector with instance segmentation that is capable of real-time
processing on mid-tier graphics cards.
In this study, we start with YOLOv3 [7], which excels in processing speed, and therefore it is a good candidate for
real-time applications running on computers [8] or mobile devices [9]. On the other hand, the precision of YOLOv3 lags
behind detectors such as RetinaNet [4], EfficientDet [10], or CornerNet [11]. We analyze YOLO’s performance and
identify its two drawbacks. The first drawback is the low precision of detection for big boxes [7], caused by inappropriate
handling of anchors in the output layers. The second one is the rewriting of labels by each other due to the coarse output resolution. To
solve these issues, we design a new approach, dubbed Poly-YOLO, that significantly pushes forward original YOLOv3
abilities. To tackle the problem of instance segmentation, we propose a way to detect tight polygon-based contour. Our
contributions and benefits of our approach are as follows:
• We propose Poly-YOLO, which increases the detection accuracy of the previous version, YOLOv3. Poly-YOLO
has a brand-new feature decoder with a single output tensor that goes to a head with higher resolution, which solves
two principal issues of YOLO: rewriting of labels and incorrect distribution of anchors.
• We produce a single output tensor by a hypercolumn composition of multi-resolution feature maps produced
by a feature extractor. To unify the resolutions of the feature maps, we utilize stairstep upscaling, which allows
us to obtain a slightly lower loss in comparison with direct upscaling while the computation speed is preserved.
• We design an extension that realizes instance segmentation using a bounding polygon representation. The
maximal number of polygon vertices can be adjusted according to the required precision.
• The bounding polygon is detected within a polar grid with relative coordinates that allow the network to learn
general, size-independent shapes. The network produces a dynamic number of vertices per bounding polygon.
Figure 2: Examples of Poly-YOLO inference on the Cityscapes testing dataset.
Figure 3: Examples of Poly-YOLO inference on the India driving testing dataset.
2 Current state and related work
2.1 Object detection
Models for object detection can be divided into two groups: two-stage and one-stage detectors. Two-stage detectors
split the process as follows. In the first phase, regions of interest (RoI) are proposed, and in the subsequent stage,
bounding box regression and classification are performed inside these proposed regions. One-stage detectors predict the
bounding boxes and their classes at once. Two-stage detectors are usually more precise in terms of localization and
classification accuracy, but in terms of processing they are slower than one-stage detectors. Both of these types contain a
backbone network for feature extraction and head networks for classification and regression. Typically, the backbone is
some SOTA network such as ResNet [5] or ResNeXt [12], pre-trained on ImageNet or OpenImages, even though
some approaches [13, 14] also experiment with training from scratch.
2.1.1 Two-stage detectors
The prototypical example of two-stage architecture is Faster R-CNN [3], which is an improvement of its predecessor
Fast R-CNN [15]. The main improvement lies in the use of Region Proposal Network (RPN), which replaced a much
slower selective search of RoIs. It also introduced the usage of multi-scale anchors to detect objects of different sizes.
Faster R-CNN is, in a way, a meta-algorithm that can have many different incarnations depending on the type of
backbone and its heads. One of the frequently used backbones, called Feature Pyramid Network (FPN) [16], allows
RoIs to be predicted from multiple feature maps, each with a different resolution. This is beneficial for the recognition of
objects at different scales.
2.1.2 One-stage detectors
The two best-known examples of one-stage detectors are YOLO [7] and SSD [17]. The architecture of YOLO will be
thoroughly described in Section 3. Usually, one-stage detectors divide the image into a grid and predict bounding boxes
and their classes inside them, all at once. Most of them also use the concept of anchors, which are predefined typical
dimensions of bounding boxes that serve as a priori knowledge. One of the major improvements in the area of one-stage
detectors was a novel loss function called Focal Loss [4]. Because two-stage detectors produce a sparse
set of region proposals in the first step, most of the negative locations are filtered out for the second stage. One-stage
detectors, on the other hand, produce a dense set of region proposals which they need to classify as containing objects or
not. This creates a problem with the non-proportional frequency of negative examples. Focal Loss solves this problem
by adjusting the importance of negative and positive examples within the loss function. Another interesting idea was
proposed in an architecture called RefineDet [18], which performs a two-step regression of the bounding boxes. The
second step refines the bounding boxes proposed in the first step, which produces more accurate detection, especially
for small objects. Recently, there has been a surge of interest in approaches that do not use anchor boxes. The main
representative of this trend is the FCOS framework [19], which works by predicting four coordinates of a bounding
box for every foreground pixel. These four coordinates represent the distances to the four boundary edges of the bounding
box in which the pixel is enclosed. The predicted bounding boxes of every pixel are subsequently filtered by NMS.
A similar anchor-free approach was proposed in CornerNet [11], where objects are detected as a pair of top-left and
bottom-right corners of a bounding box.
2.2 Instance Segmentation
In many applications, a boundary given by a rectangle may be too crude, and we may instead require a boundary
framing the object tightly. In the literature, this task is called Instance Segmentation, and the main approaches also
fit into the one-stage/two-stage taxonomy. The prototypical example of a two-stage method is an architecture called
Mask R-CNN [5], which extended Faster R-CNN by adding a separate fully-convolutional head that predicts masks of
objects. Note that the same principle was also applied to RetinaNet; the improved network is called RetinaMask [6]. One of
Mask R-CNN's innovations is a novel way of extracting features from RoIs using the RoIAlign layer, which avoids
the problem of misalignments of the RoI due to its quantization to the grid of the feature map. One-stage methods
for instance segmentation can be further divided into top-down methods, bottom-up methods, and direct methods.
Top-down methods [20, 21] work by first detecting an object and then segmenting this object within a bounding box.
Prediction of bounding boxes either uses anchors or is anchor free following the FCOS framework [19]. Bottom-up
methods [22, 23], on the other hand, work by first embedding each pixel into a metric space in which these pixels
are subsequently clustered. As the name suggests, direct methods work by directly predicting the segmentation mask
without bounding boxes or pixel embedding [24]. We also mention that, independently of our instance segmentation,
PolarMask [25] introduces instance segmentation using polygons, which are also predicted in polar coordinates. In
comparison with PolarMask, Poly-YOLO learns general, size-independent shapes due to the use of the relative
size of a bounding polygon according to the particular bounding box. The second difference is that Poly-YOLO
produces a dynamic number of vertices per polygon, according to the shape-complexity of various objects.
3 Fast and precise object detection with Poly-YOLO
Here, we first recall YOLOv3's fundamental ideas, describe the issues that block it from reaching higher performance, and propose
our solution that removes them.
3.1 YOLO history
The first version of YOLO (You Only Look Once) was introduced in 2016 [26]. The motivation behind YOLO is to create a
fast object detector with an emphasis on speed. The detector is made of two essential parts: the convolutional neural
network (CNN) and a specially designed loss function. The CNN backbone is inspired by GoogLeNet [27] and has 24
convolutional layers followed by 2 fully connected layers. The network output is reshaped into a two-dimensional grid
with the shape Gh × Gw, where Gh is the number of cells along the vertical side and Gw along the horizontal side. Each grid
cell occupies a part of the image, as depicted in Fig. 4. Every object in the image has its center in one of the cells, and
Figure 4: The left image illustrates the YOLO grid over the input image, and yellow dots represent centers of detected
objects. The right image illustrates detections.
that particular cell is responsible for detecting and classifying said object. More precisely, the responsible cell outputs
NB bounding boxes. Each box is given as a tuple (x, y, w, h) and a confidence measure. Here, (x, y) is the center of
the predicted box relative to the cell boundary, and (w, h) is the width and height of the bounding box relative to the
image size. The confidence measures how confident the cell is that it contains an object. Finally, each cell outputs
Nc conditional class probabilities, i.e., probabilities that the detected object belongs to certain class(es). In other words,
the cell confidence tells us that there is an object in the predicted box, and the conditional class probabilities tell us what the box
contains, e.g., vehicle – car. The final output of the model is a tensor with dimensions Gh × Gw × (5NB + Nc), where the
constant five is used because of (x, y, w, h) and a confidence.
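As a sanity check of the dimensions above, the output tensor shape can be computed directly. The following sketch is illustrative only (not the authors' code); the grid size, box count, and class count are example values:

```python
def yolo_output_shape(grid_h, grid_w, num_boxes, num_classes):
    """Each cell predicts num_boxes tuples (x, y, w, h, confidence),
    i.e. 5 values per box, plus num_classes conditional class
    probabilities shared by the cell."""
    return (grid_h, grid_w, 5 * num_boxes + num_classes)

# e.g. a 7x7 grid, 2 boxes per cell, 20 classes (a VOC-like setting)
print(yolo_output_shape(7, 7, 2, 20))  # (7, 7, 30)
```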
YOLOv2 [28] brought a couple of improvements. Firstly, the architecture of the convolutional neural network was
updated to Darknet-19 – a fully convolutional network with 19 convolutional layers containing batch normalization and
five max-pooling layers. The cells no longer predict plain (x, y, w, h) directly; instead, they scale and translate
anchor boxes. The parameters (aw, ah), i.e., the width and height of an anchor box, are extracted for all anchor boxes
from a training dataset with the k-means algorithm, using IoU as the clustering criterion. Lastly, YOLOv2 uses skip
connections to concatenate features from different parts of the CNN to create the final tensor of feature maps, including
features across different scales and levels of abstraction.
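The IoU-driven anchor extraction described above can be sketched as follows. This is a minimal illustration, not the original darknet implementation; `iou_wh` compares boxes by width and height only (as if aligned at a common corner), which is how anchor clustering treats them:

```python
import random

def iou_wh(box, anchor):
    # IoU of two boxes aligned at a common top-left corner,
    # so only widths and heights matter
    inter = min(box[0], anchor[0]) * min(box[1], anchor[1])
    union = box[0] * box[1] + anchor[0] * anchor[1] - inter
    return inter / union

def kmeans_anchors(boxes, k, iters=50, seed=0):
    """k-means on (w, h) pairs with IoU as the similarity criterion,
    a sketch of how YOLOv2/v3 anchor boxes are extracted."""
    random.seed(seed)
    anchors = random.sample(boxes, k)
    for _ in range(iters):
        # assign each box to the anchor it overlaps most
        clusters = [[] for _ in range(k)]
        for b in boxes:
            j = max(range(k), key=lambda i: iou_wh(b, anchors[i]))
            clusters[j].append(b)
        # move each anchor to the mean (w, h) of its cluster
        for i, c in enumerate(clusters):
            if c:
                anchors[i] = (sum(b[0] for b in c) / len(c),
                              sum(b[1] for b in c) / len(c))
    return sorted(anchors, key=lambda a: a[0] * a[1])
```

For two well-separated groups of box sizes, the returned anchors settle near the group means, ordered by area.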
The most recent version of YOLO [7] mainly introduces three output scales and a deeper architecture – Darknet-53.
Each output scale/feature map has its own set of anchors – three per output scale. Compared with v2, YOLOv3 reaches
higher accuracy, but due to the heavier backbone, its inference speed is decreased.
3.2 YOLOv3 issues blocking better performance
YOLOv3, as it is designed, suffers from two issues that we discovered and that are not described in the original papers:
rewriting of labels and imbalanced distribution of anchors across output scales. Solving these issues is crucial for
improvement of the YOLO performance.
Figure 5: The image illustrates the label rewriting problem for the detection of cars. A label is rewritten by another if the
centers of two boxes (with the same anchor box) belong to the same cell. In this illustrative example, blue denotes the grid,
red the rewritten labels, and green the preserved labels. Here, 10 labels out of 27 are rewritten, and the detector is not trained to
detect them.
Table 1: Amount of rewritten labels for various datasets

Dataset         Resolution    Rewritten labels [%]
                              YOLOv3    Poly-YOLO    Poly-YOLO lite
Simulator       416×416       16.36     0.22         2.31
Simulator       608×800       12.55     0.00         0.61
Cityscapes      416×416       9.51      2.79         9.50
Cityscapes      608×832       3.92      0.97         2.75
Cityscapes      640×1280      2.56      0.59         1.44
India Driving   416×416       23.07     5.80         13.78
India Driving   448×800       13.54     1.92         4.96
India Driving   704×1280      9.16      1.12         2.44
3.2.1 Label rewriting problem
Here, we discuss the situation when a bounding box given by its label in the ground truth dataset can be rewritten
by another box, and therefore the network is not trained to detect it. For the sake of simplicity of explanation, we
avoid the usage of the anchors notation in the text below. Let us suppose an input image with a resolution of
r × r pixels. Furthermore, let sk be the scale ratio of the k-th output to the input, where YOLOv3 uses the following
ratios: s1 = 1/8, s2 = 1/16, s3 = 1/32. These scales are given by the YOLOv3 architecture, namely by strided
convolutions. Finally, let B = {b1, . . . , bn} be a set of boxes presented in an image. Each box bi is represented
as a tuple (b_i^{x1}, b_i^{y1}, b_i^{x2}, b_i^{y2}) that defines its top-left and bottom-right corners. For simplicity, we also derive centers
C = {c_1, . . . , c_n}, where c_i = (c_i^x, c_i^y) is defined as c_i^x = 0.5(b_i^{x1} + b_i^{x2}) and analogously for c_i^y. With this notation, a label
is rewritten if the following holds:

∃(c_i, c_j ∈ C) : ξ(c_i^x, c_j^x, s_k) + ξ(c_i^y, c_j^y, s_k) = 2,    (1)

where

ξ(x, y, z) = 1 if ⌊xz⌋ = ⌊yz⌋, and 0 otherwise,    (2)

and ⌊·⌋ denotes the lowest integer of the term. The purpose of the function ξ is to check whether both boxes are assigned to the
same cell of the grid at scale sk. In simple words, if two boxes on the same scale are assigned to the same cell, then
one of them will be rewritten. When anchors are introduced, both boxes must additionally be matched with the same anchor. As a consequence, the network
is trained to ignore some objects, which leads to a low number of positive detections. According to Equations (1)
and (2), there is a crucial role of sk that directly affects the number and the resolution of cells. Considering standard
resolution of YOLO r = 416, then, for s3 (the coarsest scale) we obtain a grid of 13 × 13 cells with size of 32 × 32
pixels each. Also, the absolute size of boxes does not affect the label rewriting problem; the important indicator is the
box center. The practical illustration for such a setting and its consequence for the labels is shown in Figure 5. The ratio
of rewritten labels in the datasets used in the benchmark is shown in Table 1.
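The cell-collision test from Equations (1) and (2) can be sketched in a few lines. This is an illustrative counter (not the authors' evaluation code) that ignores anchors, as the text above does, and counts all colliding labels but one per cell as rewritten:

```python
import math

def rewritten_fraction(centers, s):
    """Fraction of ground-truth boxes whose label is rewritten at scale
    ratio s: two centers collide when both floor(c_x * s) and
    floor(c_y * s) match, i.e. both boxes fall into the same grid cell."""
    cells = {}
    for (cx, cy) in centers:
        cell = (math.floor(cx * s), math.floor(cy * s))
        cells[cell] = cells.get(cell, 0) + 1
    # in each colliding cell, all labels but one are rewritten
    rewritten = sum(n - 1 for n in cells.values() if n > 1)
    return rewritten / len(centers)

# two cars whose centers are ~20 px apart collide at s = 1/32
# (32x32 px cells) but not at s = 1/4 (4x4 px cells)
centers = [(100, 100), (120, 110)]
print(rewritten_fraction(centers, 1 / 32))  # 0.5
print(rewritten_fraction(centers, 1 / 4))   # 0.0
```

This mirrors why a finer output grid (larger s) suppresses the problem: the same centers land in distinct cells.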
3.2.2 Anchors distribution problem
The second YOLO issue comes from the fact that YOLO is anchor-based (i.e., it needs prototypical anchor boxes for
training/detection), and the anchors are distributed among output scales. Namely, YOLOv3 uses nine anchors, three
per output scale. A particular ground truth box is matched with the best matching anchor that assigns it to a certain
output layer scale. Here, let us suppose a set of box sizes M = {m_1, . . . , m_n}, where m_i = (m_i^w, m_i^h) is given
by m_i^w = b_i^{x2} − b_i^{x1} for width and analogously for height. The k-means algorithm [29] is applied to M in order to
determine centroids in 2D space, which then represent the nine anchors. The anchors are split into triplets and connected
with small, medium, and large boxes detected in the output layers. Unfortunately, such a principle of splitting anchors
according to three sizes is generally reasonable only if

M ∼ U(0, r)

holds. By U(0, r) we denote a uniform distribution between bounds given by 0 and r. But such a condition cannot be
guaranteed for various applications in general. Note that M ∼ N(0.5r, r), where N(0.5r, r) is a normal distribution with
mean µ = 0.5r and variance σ² = r, is a more realistic case, which causes most of the boxes to be
captured by the middle output layer (for the medium size) while the two other layers are underused.
Figure 6: A comparison of the YOLOv3 and Poly-YOLO architectures. Poly-YOLO uses fewer convolutional filters per
layer in the feature extractor part and extends it with squeeze-and-excitation blocks. The heavy neck is replaced by a
lightweight block with a hypercolumn that utilizes stairstep upsampling. The head now uses a single output instead of three
and has a higher resolution. In summary, Poly-YOLO has 40% fewer parameters than YOLOv3 while producing
more precise predictions.
To illustrate the problem, let us suppose two sets of box sizes, M1 and M2; the former connected with the task of car
plate detection from a camera placed over the highway and the latter connected with a person detection from a camera
placed in front of the door. For such tasks, we can obtain roughly M1 ∼ N (0.3r, 0.2r) because the plates will cover
small areas and M2 ∼ N (0.7r, 0.2r) because the people will cover large areas. For both sets, anchors are computed
separately. The first case leads to the problem that output scales for medium and large will also include small anchors
because the dataset does not include big objects. Here, the problem of label rewriting will escalate because small
objects will need to be detected in a coarse grid. The second case works vice-versa. Large objects will be detected in
small and medium output layers. Here, detection will not be precise because small and medium output layers have
limited receptive fields. The receptive field for the three used scales is {85 × 85, 181 × 181, 365 × 365}. The practical
impact of the two cases is the same: performance will be sub-optimal. In the paper that introduced YOLOv3 [7], the
author says "YOLOv3 has relatively high APsmall performance. However, it has comparatively worse performance on
medium and larger size objects. More investigation is needed to get to the bottom of this". We believe that the reason
why YOLOv3 has these problems is explained in the paragraph above.
3.3 Poly-YOLO architecture
Before we describe the architecture itself, let us mention the motivation and the justification for it. As we described in
the previous section, YOLO’s performance suffers from the problem of label rewriting and the problematic distribution
of anchors among output scales.
The first issue can be suppressed by high values of sk, i.e., a scale multiplicator that expresses the ratio of output
resolution with respect to the input resolution r. The ideal case would happen when r = rsk, i.e., sk = 1, which means
that output and input resolutions are equal. In this case, no label rewriting may occur. Such a condition is generally held
in many encoder-decoder-based segmentation NNs such as U-Net [30]. As we are focusing on computational speed, we
have to omit such a scheme and find a solution where sk < 1 is a reasonable trade-off. Let us recall that YOLOv3
uses s1 = 1/8, s2 = 1/16, s3 = 1/32.
The second issue can be solved in one of two ways. The first way is to define receptive fields for the three output
scales and two thresholds that split them. Then, k-means computes the centroid triplets (used as anchors)
according to these thresholds. This would change the data-driven anchors to problem-driven (receptive-field-driven) anchors.
For example, data M ∼ N(r/5, r/10) would be detected only on the scale detecting small objects and not on all scales as
is currently realized in YOLOv3. The drawback of this way is that we would not use the full capacity of the network.
The second way is to create an architecture with a single output that will aggregate information from various scales.
Such an aggregated output will also handle all the anchors at once. So, in contrast to the first way, the estimation of
anchor sizes will be again data-driven.
We propose to use a single output layer with a high s1 scale ratio connected with all the anchors, which solves both
issues mentioned above. Namely, we use s1 = 1/4. An illustration of a comparison between the original and the new
architecture is shown in Figure 6. For the composition of the single output scale from multiple partial scales, we use
the hypercolumn technique [31]. Formally, let O be a feature map, u(·, ω) a function upscaling an input by a factor ω,
and m(·) a function transforming a feature map with dimensions a × b × c × · into a feature map with dimensions
a × b × c × δ, where δ is a constant. Furthermore, we consider g(O_1, . . . , O_n) to be an n-ary composition/aggregation
function. With that, the output feature map using the hypercolumn is given as

O = g(m(O_1), u(m(O_2), 2^1), . . . , u(m(O_n), 2^{n−1})).

Selecting addition as the aggregation function, the formula can be rewritten as

O = Σ_{i=1}^{n} u(m(O_i), 2^{i−1}).
As is evident from the formula, there is a high imbalance: a single value of O_1 projects into just a single value of O,
while a single value of O_n is projected directly into 2^{n−1} × 2^{n−1} values. To break the imbalance, we propose to use the
stairstep approach known from computer graphics, see Figure 7. Stairstep interpolation increases (or, for downscaling,
decreases) an image resolution by at most 10% per step until the desired resolution is reached. In comparison with a
direct upscale, the output is smoother and does not include, e.g., the step artifacts that direct upsampling produces. Here, we
use the lowest available upscale factor, two. Formally, the stairstep output feature map Ô is defined as

Ô = u(· · · u(u(m(O_n), 2) + m(O_{n−1}), 2) · · · , 2) + m(O_1).

If we consider nearest-neighbor upsampling, Ô = O holds. For bilinear interpolation (and others), Ô ≠ O is
reached for non-homogeneous inputs. The critical fact is that the computational complexity is equal for both direct
upscaling and stairstep upscaling. Although the stairstep approach realizes more additions, they are computed over feature
maps with a lower resolution, so the number of added elements is identical.
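The two aggregation schemes, and their equality under nearest-neighbor upsampling, can be sketched numerically. This is an illustrative NumPy toy (not the network code): channels are omitted, m(·) is taken as the identity, and the feature maps are random arrays at three scales:

```python
import numpy as np

def up(x, f):
    # nearest-neighbor upsampling by an integer factor f
    return x.repeat(f, axis=0).repeat(f, axis=1)

def hypercolumn_direct(maps):
    """Direct aggregation: upscale the i-th map by 2**i and sum.
    maps[0] has the finest resolution."""
    return sum(up(m, 2 ** i) for i, m in enumerate(maps))

def hypercolumn_stairstep(maps):
    """Stairstep aggregation: repeatedly upscale the running sum by 2
    and add the next finer map."""
    acc = maps[-1]
    for m in reversed(maps[:-1]):
        acc = up(acc, 2) + m
    return acc

rng = np.random.default_rng(0)
maps = [rng.standard_normal((8 // 2 ** i, 8 // 2 ** i)) for i in range(3)]

# with nearest-neighbor upsampling both variants coincide, as stated
# in the text; bilinear interpolation would break this equality
assert np.allclose(hypercolumn_direct(maps), hypercolumn_stairstep(maps))
```

The equality holds because nearest-neighbor `repeat` is linear and composing two ×2 upscales equals one ×4 upscale, so the stairstep sum expands term by term into the direct sum.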
Figure 7: Illustration of HC scheme (left) and HC with stairstep (right).
To understand the practical impact, we conducted the following experiment. We trained Poly-YOLO on 200 training
and 100 validation images from the Cityscapes dataset [32], once with direct upscaling and once with stairstep upscaling
in the hypercolumn. We ran the training process five times for each version and plotted the training
progress in the graph in Figure 8. The graph shows that the difference is small, but it is evident that stairstep interpolation
Figure 8: The graph shows the difference between the usage of the standard hypercolumn technique and the hypercolumn
with stairstep in terms of loss. The thin lines denote the particular learning runs, and the thick lines are the means of the
runs.
in the hypercolumn yields slightly lower training and validation loss. The improvement is obtained at identical
computation time.
The last modification we propose for YOLO's architecture is the usage of squeeze-and-excitation (SE) blocks [33]
in the backbone. Darknet-53, like many other neural networks, uses repetitive blocks, where each block consists of
coupled convolutions with a residual connection. The squeeze-and-excitation blocks allow the usage of both spatial and
channel-wise information, which leads to an accuracy improvement. The addition of squeeze-and-excitation blocks and
the higher output resolution decrease computation speed. Because speed is the main advantage of
YOLO, we reduced the number of convolutional filters in the feature extraction phase, namely to 75% of the
original number. Also, the neck and head are lighter, together having 37.1M parameters, which is significantly fewer than
YOLOv3's 61.5M. Still, Poly-YOLO achieves higher precision than YOLOv3 – see Section 5.3. We also propose
Poly-YOLO lite, which is aimed at higher processing speed. In the feature extractor and the head, this version has only
66% of the filters of Poly-YOLO. Finally, s1 is reduced to 1/8. The number of parameters of Poly-YOLO lite is 16.5M.
We want to highlight that for feature extraction, an arbitrary SOTA backbone such as (SE)ResNeXt [12] or EfficientNet [10]
can be used, which would probably increase the overall accuracy. Such an approach can also be seen in the
paper YOLOv4 [34], where the authors use a different backbone and several other tricks (that can also be applied in our
approach), but the head of the original YOLOv3 is left unchanged. The issues we described and removed in Poly-YOLO
actually arise from the design of YOLOv3's head, and a simple swap of the backbone will not solve them: the
model would still suffer from label rewriting and improper anchor distribution. In our work, we have focused on performance
improvement achieved by conceptual changes and not brute force. Such improvements are then widely applicable, and
a modern backbone can be easily integrated.
4 Instance segmentation with Poly-YOLO
The last sentence in YOLOv3 paper [7] says "Boxes are stupid anyway though, I’m probably a true believer in masks
except I can’t get YOLO to learn them." Here, we show how to extend YOLO with masking functionality (instance
segmentation) without a big negative impact on its speed. In our previous work [1], we focused on more precise
detection with YOLO by means of irregular quadrangular detection. We proved that the extension for quadrangular
detection converges faster. We also demonstrated that classification from the quadrangular approximation yields higher
accuracy than from the rectangular approximation. The limitation of that approach lies in the fixed number of detected
vertices, namely four. Here, we introduce a polygon representation that is able to detect objects with a varying number
of vertices without the usage of a recurrent neural network that would slow down the processing speed. To see a
practical difference between the quality of bounding-box detection and polygon-based detection, see Figure 10, where
we show results from Poly-YOLO trained to detect various geometric primitives including random polygons.
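The size-independent polar representation mentioned above can be sketched as follows. This is a hypothetical illustration of the idea only, not the authors' implementation (the precise polar-grid encoding is defined later in the paper): vertices are expressed as (angle, distance) pairs measured from the bounding-box center, with the distance normalized by the box half-diagonal so that the same shape at different sizes yields the same relative description:

```python
import math

def polygon_to_relative_polar(vertices):
    """Convert polygon vertices to (angle, relative distance) pairs
    measured from the bounding-box center; the distance is normalized
    by the box half-diagonal, making the shape size-independent.
    A hypothetical sketch, not the authors' encoding."""
    xs = [x for x, _ in vertices]
    ys = [y for _, y in vertices]
    cx, cy = (min(xs) + max(xs)) / 2, (min(ys) + max(ys)) / 2
    half_diag = math.hypot(max(xs) - min(xs), max(ys) - min(ys)) / 2
    polar = []
    for x, y in vertices:
        angle = math.atan2(y - cy, x - cx) % (2 * math.pi)
        polar.append((angle, math.hypot(x - cx, y - cy) / half_diag))
    return polar

# the same square at two different sizes yields identical relative shapes
small = polygon_to_relative_polar([(0, 0), (2, 0), (2, 2), (0, 2)])
large = polygon_to_relative_polar([(10, 10), (90, 10), (90, 90), (10, 90)])
assert all(abs(a1 - a2) < 1e-9 and abs(d1 - d2) < 1e-9
           for (a1, d1), (a2, d2) in zip(small, large))
```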