arXiv:1202.2745v1 [cs.CV] 13 Feb 2012
Multi-column Deep Neural Networks for Image Classification
Dan Cireşan
Ueli Meier
Jürgen Schmidhuber
Technical Report No. IDSIA-04-12
February 2012
IDSIA / USI-SUPSI
Dalle Molle Institute for Artificial Intelligence
Galleria 2, 6928 Manno, Switzerland
IDSIA is a joint institute of both University of Lugano (USI) and University of Applied Sciences of Southern Switzerland (SUPSI), and was founded
in 1988 by the Dalle Molle Foundation which promoted quality of life.
Multi-column Deep Neural Networks for Image
Classification
Dan Cireşan
Ueli Meier
Jürgen Schmidhuber
February 2012
Abstract
Traditional methods of computer vision and machine learning cannot match human performance on
tasks such as the recognition of handwritten digits or traffic signs. Our biologically plausible deep
artificial neural network architectures can. Small (often minimal) receptive fields of convolutional
winner-take-all neurons yield large network depth, resulting in roughly as many sparsely connected neural
layers as found in mammals between retina and visual cortex. Only winner neurons are trained. Several deep
neural columns become experts on inputs preprocessed in different ways; their predictions are averaged.
Graphics cards allow for fast training. On the very competitive MNIST handwriting benchmark, our
method is the first to achieve near-human performance. On a traffic sign recognition benchmark it
outperforms humans by a factor of two. We also improve the state-of-the-art on a plethora of common image
classification benchmarks.
1 Introduction
Recent publications suggest that unsupervised pre-training of deep, hierarchical neural networks improves
supervised pattern classification [2, 10]. Here we train such nets by simple online back-propagation, setting
new, greatly improved records on MNIST [20], Latin letters [13], Chinese characters [23], traffic signs [36],
NORB (jittered, cluttered) [21] and CIFAR10 [18] benchmarks.
We focus on deep convolutional neural networks (DNN), introduced by [11], improved by [20], refined
and simplified by [1, 35, 7]. Lately, DNN proved their mettle on data sets ranging from handwritten digits
(MNIST) [5, 7] and handwritten characters [6] to 3D toys (NORB) and faces [37]. DNNs fully unfold their
potential when they are big and deep [7]. But training them requires weeks, months, even years on CPUs.
High data transfer latency prevents multi-threading and multi-CPU code from saving the situation. In
recent years, however, fast parallel neural net code for graphics cards (GPUs) has overcome this problem.
Carefully designed GPU code for image classification can be up to two orders of magnitude faster than
its CPU counterpart [38, 37]. Hence, to train huge DNN in hours or days, we implement them on GPU,
building upon the work of [5, 7]. The training algorithm is fully online, i.e. weight updates occur after each
error back-propagation step. We will show that properly trained big and deep DNNs can outperform all
previous methods, and demonstrate that unsupervised initialization/pretraining is not necessary (although
we don’t deny that it might help sometimes, especially for small datasets). We also show how combining
several DNN columns into a Multi-column DNN (MCDNN) further decreases the error rate by 30-40%.
2 Architecture
The initially random weights of the DNN are iteratively trained to minimize the classification error on a set
of labeled training images; generalization performance is then tested on a separate set of test images. Our
architecture does this by combining several techniques in a novel way:
(1) Unlike the shallow NN used in many 1990s applications, ours are deep, inspired by the Neocognitron [11],
with many (6-10) layers of non-linear neurons stacked on top of each other, comparable to the
number of layers found between retina and visual cortex of macaque monkeys [3].
(2) It was shown [14] that such multi-layered DNN are hard to train by standard gradient descent
[39, 19, 30], the method of choice from a mathematical/algorithmic point of view. Today’s computers,
however, are fast enough for this, more than 60000 times faster than those of the early 90s (1991: 486DX-33 MHz vs. 2011: i7-990X 3.46 GHz). Carefully
designed code for massively parallel graphics processing units (GPUs normally used for video games)
allows for gaining an additional speedup factor of 50-100 over serial code for standard computers. Given
enough labeled data, our networks do not need additional heuristics such as unsupervised pre-training
[31, 26, 2, 10] or carefully prewired synapses [29, 34].
(3) The DNN of this paper (Fig. 1a) have 2-dimensional layers of winner-take-all neurons [17, 41] with
overlapping receptive fields whose weights are shared [20, 1, 35, 7]. Given some input pattern, a simple
max pooling technique [29] determines winning neurons by partitioning layers into quadratic regions of
local inhibition, selecting the most active neuron of each region. The winners of some layer represent a
smaller, down-sampled layer with lower resolution, feeding the next layer in the hierarchy. The approach
is inspired by Hubel and Wiesel’s seminal work on the cat’s primary visual cortex [40], which identified
orientation-selective simple cells with overlapping local receptive fields and complex cells performing
down-sampling-like operations [15].
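To make this pooling step concrete, the following minimal Python/NumPy sketch (not the GPU implementation used in this paper) down-samples a 2-D feature map by keeping only the most active neuron of each non-overlapping 2x2 region; the function name and the toy input are illustrative only.

    import numpy as np

    def max_pool_2x2(feature_map):
        # Partition the map into non-overlapping 2x2 regions and keep only
        # the most active neuron (the "winner") of each region.
        h, w = feature_map.shape
        assert h % 2 == 0 and w % 2 == 0
        blocks = feature_map.reshape(h // 2, 2, w // 2, 2)
        return blocks.max(axis=(1, 3))

    # A 4x4 map is down-sampled to a 2x2 map of winners.
    fm = np.arange(16, dtype=float).reshape(4, 4)
    print(max_pool_2x2(fm))  # [[ 5.  7.] [13. 15.]]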
(4) Note that at some point down-sampling automatically leads to the first 1-dimensional layer. From
then on, only trivial 1-dimensional winner-take-all regions are possible, that is, the top part of the hierarchy
becomes a standard multi-layer perceptron (MLP) [39, 19, 30]. Receptive fields and winner-take-all regions
of our DNN often are (near-)minimal, e.g., only 2x2 or 3x3 neurons. This results in (near-)maximal depth
of layers with non-trivial (2-dimensional) winner-take-all regions. In fact, insisting on minimal 2x2 fields
automatically defines the entire deep architecture, apart from the number of different convolutional kernels
per layer [20, 1, 35, 7] and the depth of the plain MLP on top.
(5) Only winner neurons are trained, that is, other neurons cannot forget what they learnt so far, although
they may be affected by weight changes in more peripheral layers. The resulting decrease of synaptic
changes per time interval corresponds to biologically plausible reduction of energy consumption. Our
training algorithm is fully online, i.e. weight updates occur after each gradient computation step.
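A minimal sketch of such fully online training (one weight update per pattern) is given below; the `net.gradient` and `net.weights` interfaces are hypothetical placeholders standing in for whatever network implementation is used, and the weights are assumed to be NumPy arrays.

    import random

    def train_online(net, samples, epochs, lr_schedule):
        # Fully online gradient descent: weights are updated immediately after
        # back-propagating the error of each single training pattern.
        for epoch in range(epochs):
            lr = lr_schedule(epoch)
            random.shuffle(samples)
            for image, label in samples:
                grads = net.gradient(image, label)   # one back-propagation step
                for w, g in zip(net.weights, grads):
                    w -= lr * g                      # in-place update of a NumPy weight array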
(6) Inspired by microcolumns of neurons in the cerebral cortex, we combine several DNN columns
to form a Multi-column DNN (MCDNN). Given some input pattern, the predictions of all columns are
democratically averaged. Before training, the weights (synapses) of all columns are randomly initialized.
Various columns can be trained on the same inputs, or on inputs preprocessed in different ways. The latter
helps to reduce both error rate and number of columns required to reach a given accuracy. The MCDNN
architecture and its training and testing procedures are illustrated in Figure 1.
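As a rough illustration of point (6), the Python sketch below averages the class-probability outputs of several columns; the column objects and their forward/preprocessing interfaces are hypothetical placeholders, not the actual GPU code.

    import numpy as np

    def mcdnn_predict(columns, preprocessors, image):
        # Each column sees the image through its own preprocessor and returns a
        # vector of class probabilities; the MCDNN prediction is the democratic
        # average of these vectors.
        outputs = [col.forward(prep(image))
                   for col, prep in zip(columns, preprocessors)]
        avg_probs = np.mean(outputs, axis=0)
        return int(np.argmax(avg_probs)), avg_probs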
3 Experiments
In the following we give a detailed description of all the experiments we performed. We evaluate our
architecture on various commonly used object recognition benchmarks and improve the state-of-the-art
on all of them. The description of the DNN architecture used for the various experiments is given in the
following way: 2x48x48-100C5-MP2-100C5-MP2-100C4-MP2-300N-100N-6N represents a net with 2
input images of size 48x48, a convolutional layer with 100 maps and 5x5 filters, a max-pooling layer over
non-overlapping regions of size 2x2, a convolutional layer with 100 maps and 4x4 filters, a max-pooling
layer over non-overlapping regions of size 2x2, a fully connected layer with 300 hidden units, a fully
connected layer with 100 hidden units and a fully connected output layer with 6 neurons (one per class).
We use a scaled hyperbolic tangent activation function for convolutional and fully connected layers, a
linear activation function for max-pooling layers and a softmax activation function for the output layer. All
DNN are trained using on-line gradient descent with an annealed learning rate. During training, images are
continually translated, scaled and rotated (and, in the case of characters, also elastically distorted), whereas
only the original images are used for validation. Training ends once the validation error is zero or when the
learning rate reaches its predetermined minimum. Initial weights are drawn from a uniform random
distribution in the range [−0.05, 0.05].

Figure 1: (a) DNN architecture. (b) MCDNN architecture. The input image can be preprocessed by
P0 − Pn−1 blocks. An arbitrary number of columns can be trained on inputs preprocessed in different
ways. The final predictions are obtained by averaging the individual predictions of each DNN. (c) Training
a DNN. The dataset is preprocessed before training; then, at the beginning of every epoch, the images are
distorted (D block). See text for more explanations.
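For readers who want to experiment with the shorthand network notation introduced above (e.g. 2x48x48-100C5-MP2-...-6N), a small Python parser is sketched below; it is our own convenience helper, not part of the original implementation.

    import re

    def parse_net_string(spec):
        # '1x29x29-20C4-MP2-40C5-MP3-150N-10N' -> list of layer descriptions.
        tokens = spec.split('-')
        maps, height, width = (int(v) for v in tokens[0].split('x'))
        layers = [('input', maps, height, width)]
        for tok in tokens[1:]:
            if m := re.fullmatch(r'(\d+)C(\d+)', tok):    # maps, filter size
                layers.append(('conv', int(m[1]), int(m[2])))
            elif m := re.fullmatch(r'MP(\d+)', tok):      # pooling region size
                layers.append(('maxpool', int(m[1])))
            elif m := re.fullmatch(r'(\d+)N', tok):       # fully connected units
                layers.append(('full', int(m[1])))
            else:
                raise ValueError('unknown token: ' + tok)
        return layers

    print(parse_net_string('1x29x29-20C4-MP2-40C5-MP3-150N-10N'))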
3.1 MNIST
The original MNIST digits [20] are normalized such that the width or height of the bounding box equals
20 pixels. Aspect ratios for various digits vary strongly and we therefore create six additional datasets by
normalizing digit width to 10, 12, 14, 16, 18, 20 pixels. This is like seeing the data from different angles.
We train five DNN columns per normalization, resulting in a total of 35 columns for the entire MCDNN.
All 1x29x29-20C4-MP2-40C5-MP3-150N-10N DNN are trained for around 800 epochs with an annealed
learning rate (i.e. initialized with 0.001, multiplied by a factor of 0.993/epoch until it reaches 0.00003).
Training a DNN takes almost 14 hours and after 500 training epochs little additional improvement is
observed. During training the digits are randomly distorted before each epoch (see Fig. 2a for representative
characters and their distorted versions [7]). The internal state of a single DNN is depicted in Figure 2c,
where a particular digit is forward propagated through a trained network and all activations together with
the network weights are plotted.
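The annealed learning rate described above can be written as a one-line schedule; this is a sketch of the stated recipe (start at 0.001, multiply by 0.993 per epoch, floor at 0.00003), not code from the paper.

    def annealed_learning_rate(epoch, eta0=0.001, factor=0.993, eta_min=0.00003):
        # Multiplicative annealing with a fixed lower bound.
        return max(eta0 * factor ** epoch, eta_min)

    # The floor is reached after roughly 500 epochs (0.001 * 0.993**500 ~ 3e-5),
    # consistent with the observation that little improvement is seen later.
    for epoch in (0, 100, 300, 500):
        print(epoch, annealed_learning_rate(epoch))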
Figure 2: (a) Handwritten digits from the training set (top row) and their distorted versions after each epoch
(second to fifth row). (b) The 23 errors of the MCDNN, with the correct label (top right) and the first and
second best predictions (bottom left and right). (c) DNN architecture for MNIST. Output layer not drawn to
scale; weights of fully connected layers not displayed.
Results of all individual nets and various MCDNN are summarized in Table 1. MCDNN of 5 nets
trained with the same preprocessor achieve better results than their constituent DNNs, except for original
images (Tab. 1). The MCDNN has a very low 0.23% error rate, improving state of the art by at least 34%
[5, 7, 27] (Tab. 2). This is the first time an artificial method comes close to the ≈0.2% error rate of humans
on this task [22]. Many of the wrongly classified digits either contain broken or strange strokes, or have
wrong labels. The 23 errors (Fig. 2b) are associated with 20 correct second guesses.
We also trained a single DNN on all 7 datasets simultaneously, which yielded a worse result (0.52%) than
both MCDNN and their individual DNN. This shows that the improvements come from the MCDNN and
not from using more preprocessed data.
Table 1: Test error rate [%] of the 35 NNs trained on MNIST. Wxx - width of the character is normalized
to xx pixels.

Trial            W10        W12        W14        W16        W18        W20        ORIGINAL
1                0.49       0.39       0.40       0.40       0.39       0.36       0.52
2                0.48       0.45       0.45       0.39       0.50       0.41       0.44
3                0.59       0.51       0.41       0.41       0.38       0.43       0.40
4                0.55       0.44       0.42       0.43       0.39       0.50       0.53
5                0.51       0.39       0.48       0.40       0.36       0.29       0.46
avg.             0.52±0.05  0.44±0.05  0.43±0.03  0.40±0.02  0.40±0.06  0.39±0.08  0.47±0.05
5-column MCDNN   0.37       0.26       0.32       0.33       0.31       0.26       0.46

35-net average error: 0.44±0.06
35-net MCDNN error: 0.23%
Table 2: Results on MNIST dataset.

Method          Paper   Error rate [%]
CNN             [35]    0.40
CNN             [28]    0.39
MLP             [5]     0.35
CNN committee   [6]     0.27
MCDNN           this    0.23
How are the MCDNN errors affected by the number of preprocessors? We train 5 DNNs on each of the 7
datasets. An MCDNN 'y out-of-7' (y from 1 to 7) averages the 5y nets trained on y of the datasets. Table 3
shows that more preprocessing results in lower MCDNN error.
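The '# MCDNN' column of Table 3 simply counts the ways of choosing y of the 7 preprocessed datasets; the short check below (plain Python, illustrative only) reproduces those counts.

    from math import comb

    # Number of distinct 'y out-of-7' MCDNNs and the nets each one averages.
    for y in range(1, 8):
        print(f'y={y}: {comb(7, y)} MCDNNs of {5 * y} nets each')
    # -> 7, 21, 35, 35, 21, 7, 1 MCDNNs, matching Table 3.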
We also train 5 DNN for each odd normalization, i.e. W11, W13, W15, W17 and W19. The 60-net
MCDNN performs (0.24%) similarly to the 35-net MCDNN, indicating that additional preprocessing does
not further improve recognition.
We conclude that MCDNN outperform DNN trained on the same data, and that different preprocessors
further decrease the error rate.
Table 3: Average test error rate [%] of MCDNN trained on y preprocessed datasets.

y   # MCDNN   Average error [%]
1    7        0.33±0.07
2   21        0.27±0.02
3   35        0.27±0.02
4   35        0.26±0.02
5   21        0.25±0.01
6    7        0.24±0.01
7    1        0.23

3.2 NIST SD 19

The 35-column MCDNN architecture and preprocessing used for MNIST are also applied to Latin characters
from NIST SD 19 [13]. For all tasks our MCDNN achieves recognition rates 1.5-5 times better
than any published result (Tab. 4). In total there are 82000 characters in the test set, but there are many
more easy to classify digits (58000) than hard to classify letters (24000). This explains the lower overall
error rate of the 62-class problem compared to the 52-class letters problem. In the 62-class problem, 3% of
the 58000 digits and 33% of the 24000 letters are misclassified. Letters are in general more difficult to
classify, but there is also considerable confusion between similar lower- and upper-case letters such as i/I
and o/O. Indeed, the error rate for the case-insensitive letters task drops from 21% to 7.37%. If the confused
upper- and lower-case classes are merged, resulting in 37 different classes, the error rate is only slightly
higher (7.99%). Upper-case letters are far easier to classify (1.83% error rate) than lower-case letters
(7.47%) due to the smaller writer-dependent in-class variability.
For a detailed analysis of all the errors and confusions between different classes, the confusion matrix is
most informative (Supplementary material Fig. S1).
Table 4: Average error rates of MCDNN for all experiments, plus results from the literature. * case insensitive

Data (task)      MCDNN error [%]   Published results: Error [%] and paper
all (62)         11.63             -
digits (10)      0.77              3.71 [12], 1.88 [25]
letters (52)     21.01             30.91 [16]
letters* (26)    7.37              13.00 [4], 13.66 [16]
merged (37)      7.99              -
uppercase (26)   1.83              10.00 [4], 6.44 [9]
lowercase (26)   7.47              16.00 [4], 13.27 [16]
3.3 Chinese characters
Compared to Latin character recognition, isolated Chinese character recognition is a much harder problem,
mainly because of the much larger category set, but also because of the wide variability of writing styles, and
the confusion between similar characters. We use a dataset from the Institute of Automation of Chinese
Academy of Sciences (CASIA [23]), which contains 300 samples for each of 3755 characters (in GB1
set). This resulted in a data set with more than 1 million characters (3 GB of data), which posed a major
computational challenge even to our system. Without our fast GPU implementation the nets on this task
would train for more than one year. Only the forward propagation of the training set takes 27h on a
normal CPU, and training a single epoch would consequently have lasted several days. On our fast GPU
implementation on the other hand, training a single epoch takes 3.4h, which makes it feasible to train a net
within a few days instead of many months.
We train the following DNN, 1x48x48-100C3-MP2-200C2-MP2-300C2-MP2-400C2-MP2-500N-3755N,
on offline as well as on online characters. For the offline character recognition task, we resize all characters
to 40x40 pixels and place them in the center of a 48x48 image. The contrast of each image is normalized
independently. As suggested by the organizers, the first 240 writers from the database CASIA-HWDB1.1
are used for training and the remaining 60 writers are used for testing. The total numbers of training and
test characters are 938679 and 234228, respectively.
For the online dataset, we draw each character from its list of coordinates, resize the resulting images
to 40x40 pixels and place them in the center of a 48x48 image. Additionally, we smooth out the resulting
images with a Gaussian blur filter over a 3x3 pixel neighborhood and uniform standard deviation of 0.75.
As suggested by the organizers, the characters of 240 writers from database CASIA-OLHWDB1.1 are used
for training the classifier and the characters of the remaining 60 writers are used for testing. The resulting
numbers of training and test characters are 939564 and 234800, respectively.
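A rough Python sketch of this preprocessing is given below, assuming SciPy/NumPy for resizing and filtering; the exact interpolation and blur implementation used by the authors is not specified, so the details here are our assumptions.

    import numpy as np
    from scipy import ndimage

    def preprocess_character(img):
        # Resize the rendered character to 40x40 (bilinear interpolation).
        zoomed = ndimage.zoom(img.astype(float),
                              (40 / img.shape[0], 40 / img.shape[1]), order=1)
        # Centre it on a 48x48 canvas (4-pixel border on each side).
        canvas = np.zeros((48, 48))
        canvas[4:44, 4:44] = zoomed
        # Smooth with a 3x3 Gaussian kernel of standard deviation 0.75.
        ax = np.array([-1.0, 0.0, 1.0])
        g = np.exp(-ax ** 2 / (2 * 0.75 ** 2))
        kernel = np.outer(g, g)
        kernel /= kernel.sum()
        return ndimage.convolve(canvas, kernel, mode='constant')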
All methods previously applied to this dataset perform some feature extraction followed by a dimensionality
reduction, whereas our method directly works on raw pixel intensities and learns the feature
extraction and dimensionality reduction in a supervised way. On the offline task we obtain an error rate of
6.5% compared to 10.01% of the best method [23]. Even though much information is lost when drawing a
character from its coordinate sequence, we obtain an error rate of 5.61% on the online task compared
to 7.61% of the best method [23].
We conclude that on this very hard classification problem, with many classes (3755) and relatively few
samples per class (240), our fully supervised DNN beats the current state-of-the-art methods by a large
margin.
3.4 Traffic signs
Recognizing traffic signs is essential for the automotive industry’s efforts in the field of driver assistance,
and for many other traffic-related applications. We use the GTSRB traffic sign dataset [36].
The original color images contain one traffic sign each, with a border of 10% around the sign. They
vary in size from 15 × 15 to 250 × 250 pixels and are not necessarily square. The actual traffic sign is not
always centered within the image; its bounding box is part of the annotations. The training set consists of
26640 images; the test set of 12569 images. We crop all images and process only within the bounding box.
Our DNN implementation requires all training images to be of equal size. After visual inspection of the
image size distribution we resize all images to 48× 48 pixels. As a consequence, scaling factors along both
axes are different for traffic signs with rectangular bounding boxes. Resizing forces them to have square
bounding boxes.
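A rough sketch of this step is shown below, assuming NumPy/SciPy, an H x W x 3 colour array and a bounding box given as (x0, y0, x1, y1) pixel coordinates; the annotation format used here is our assumption.

    from scipy import ndimage

    def crop_and_resize(image, bbox, size=48):
        # Crop the annotated bounding box and rescale it to size x size pixels;
        # rectangular boxes get different scale factors along the two axes.
        x0, y0, x1, y1 = bbox
        crop = image[y0:y1, x0:x1]
        zoom_y = size / crop.shape[0]
        zoom_x = size / crop.shape[1]
        return ndimage.zoom(crop, (zoom_y, zoom_x, 1), order=1)  # keep colour channels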
Our MCDNN is the only artificial method to outperform humans, who produced twice as many errors.
Since traffic signs greatly vary in illumination and contrast, standard image preprocessing methods are
used to enhance/normalize them (Fig. 3a and supplementary material). For each dataset five DNN are
trained (architecture: 3x48x48-100C7-MP2-150C4-MP2-250C4-MP2-300N-43N), resulting in a
MCDNN with 25 columns, achieving an error rate of 0.54% on the test set. Figure 3b depicts all errors,
plus ground truth and first and second predictions. Over 80% of the 68 errors are associated with correct
second predictions. Erroneously predicted class probabilities tend to be very low—here the MCDNN is
quite unsure about its classifications. In general, however, it is very confident—most of its predicted class
probabilities are close to one or zero. Rejecting only 1% of all images (confidence below 0.51)
results in an even lower error rate of 0.24%. To reach an error rate of 0.01% (a single misclassification),
only 6.67% of the images have to be rejected (confidence below 0.94). Our method outperforms the second
best algorithm by a factor of 3. It takes 37 hours to train the MCDNN with 25 columns on four GPUs. The
trained MCDNN can check 87 images per second on one GPU (and 2175 images/s/DNN).
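The rejection mechanism mentioned above can be sketched as a simple threshold on the averaged class probabilities (illustrative Python; the thresholds 0.51 and 0.94 are the values quoted in the text):

    import numpy as np

    def classify_with_rejection(avg_probs, threshold=0.51):
        # Reject the image (return None) when the MCDNN's highest averaged
        # class probability falls below the chosen confidence threshold.
        cls = int(np.argmax(avg_probs))
        confidence = float(avg_probs[cls])
        return (cls if confidence >= threshold else None), confidence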
3.5 CIFAR 10
CIFAR10 is a set of natural color images of 32x32 pixels [18]. It contains 10 classes, each with 5000
training samples and 1000 test samples. Images vary greatly within each class. They are not necessarily