Deep Learning in Neural Networks: An Overview

Technical Report IDSIA-03-14 / arXiv:1404.7828 v4 [cs.NE] (88 pages, 888 references)

Jürgen Schmidhuber
The Swiss AI Lab IDSIA
Istituto Dalle Molle di Studi sull'Intelligenza Artificiale
University of Lugano & SUPSI
Galleria 2, 6928 Manno-Lugano
Switzerland

8 October 2014

Abstract

In recent years, deep artificial neural networks (including recurrent ones) have won numerous contests in pattern recognition and machine learning. This historical survey compactly summarises relevant work, much of it from the previous millennium. Shallow and deep learners are distinguished by the depth of their credit assignment paths, which are chains of possibly learnable, causal links between actions and effects. I review deep supervised learning (also recapitulating the history of backpropagation), unsupervised learning, reinforcement learning & evolutionary computation, and indirect search for short programs encoding deep and large networks.

LaTeX source: http://www.idsia.ch/~juergen/DeepLearning8Oct2014.tex
Complete BibTeX file (888 kB): http://www.idsia.ch/~juergen/deep.bib

Preface

This is the preprint of an invited Deep Learning (DL) overview. One of its goals is to assign credit to those who contributed to the present state of the art. I acknowledge the limitations of attempting to achieve this goal. The DL research community itself may be viewed as a continually evolving, deep network of scientists who have influenced each other in complex ways. Starting from recent DL results, I tried to trace back the origins of relevant ideas through the past half century and beyond, sometimes using "local search" to follow citations of citations backwards in time. Since not all DL publications properly acknowledge earlier relevant work, additional global search strategies were employed, aided by consulting numerous neural network experts. As a result, the present preprint mostly consists of references. Nevertheless, through an expert selection bias I may have missed important work. A related bias was surely introduced by my special familiarity with the work of my own DL research group in the past quarter-century. For these reasons, this work should be viewed as merely a snapshot of an ongoing credit assignment process. To help improve it, please do not hesitate to send corrections and suggestions to juergen@idsia.ch.
Contents

1 Introduction to Deep Learning (DL) in Neural Networks (NNs)
2 Event-Oriented Notation for Activation Spreading in NNs
3 Depth of Credit Assignment Paths (CAPs) and of Problems
4 Recurring Themes of Deep Learning
  4.1 Dynamic Programming for Supervised/Reinforcement Learning (SL/RL)
  4.2 Unsupervised Learning (UL) Facilitating SL and RL
  4.3 Learning Hierarchical Representations Through Deep SL, UL, RL
  4.4 Occam's Razor: Compression and Minimum Description Length (MDL)
  4.5 Fast Graphics Processing Units (GPUs) for DL in NNs
5 Supervised NNs, Some Helped by Unsupervised NNs
  5.1 Early NNs Since the 1940s (and the 1800s)
  5.2 Around 1960: Visual Cortex Provides Inspiration for DL (Sec. 5.4, 5.11)
  5.3 1965: Deep Networks Based on the Group Method of Data Handling
  5.4 1979: Convolution + Weight Replication + Subsampling (Neocognitron)
  5.5 1960-1981 and Beyond: Development of Backpropagation (BP) for NNs
    5.5.1 BP for Weight-Sharing Feedforward NNs (FNNs) and Recurrent NNs (RNNs)
  5.6 Late 1980s-2000 and Beyond: Numerous Improvements of NNs
    5.6.1 Ideas for Dealing with Long Time Lags and Deep CAPs
    5.6.2 Better BP Through Advanced Gradient Descent (Compare Sec. 5.24)
    5.6.3 Searching For Simple, Low-Complexity, Problem-Solving NNs (Sec. 5.24)
    5.6.4 Potential Benefits of UL for SL (Compare Sec. 5.7, 5.10, 5.15)
  5.7 1987: UL Through Autoencoder (AE) Hierarchies (Compare Sec. 5.15)
  5.8 1989: BP for Convolutional NNs (CNNs, Sec. 5.4)
  5.9 1991: Fundamental Deep Learning Problem of Gradient Descent
  5.10 1991: UL-Based History Compression Through a Deep Stack of RNNs
  5.11 1992: Max-Pooling (MP): Towards MPCNNs (Compare Sec. 5.16, 5.19)
  5.12 1994: Early Contest-Winning NNs
  5.13 1995: Supervised Recurrent Very Deep Learner (LSTM RNN)
  5.14 2003: More Contest-Winning/Record-Setting NNs; Successful Deep NNs
  5.15 2006/7: UL For Deep Belief Networks / AE Stacks Fine-Tuned by BP
  5.16 2006/7: Improved CNNs / GPU-CNNs / BP for MPCNNs / LSTM Stacks
  5.17 2009: First Official Competitions Won by RNNs, and with MPCNNs
  5.18 2010: Plain Backprop (+ Distortions) on GPU Breaks MNIST Record
  5.19 2011: MPCNNs on GPU Achieve Superhuman Vision Performance
  5.20 2011: Hessian-Free Optimization for RNNs
  5.21 2012: First Contests Won on ImageNet, Object Detection, Segmentation
  5.22 2013-: More Contests and Benchmark Records
  5.23 Currently Successful Techniques: LSTM RNNs and GPU-MPCNNs
  5.24 Recent Tricks for Improving SL Deep NNs (Compare Sec. 5.6.2, 5.6.3)
  5.25 Consequences for Neuroscience
  5.26 DL with Spiking Neurons?
6 DL in FNNs and RNNs for Reinforcement Learning (RL)
  6.1 RL Through NN World Models Yields RNNs With Deep CAPs
  6.2 Deep FNNs for Traditional RL and Markov Decision Processes (MDPs)
  6.3 Deep RL RNNs for Partially Observable MDPs (POMDPs)
  6.4 RL Facilitated by Deep UL in FNNs and RNNs
  6.5 Deep Hierarchical RL (HRL) and Subgoal Learning with FNNs and RNNs
  6.6 Deep RL by Direct NN Search / Policy Gradients / Evolution
  6.7 Deep RL by Indirect Policy Search / Compressed NN Search
  6.8 Universal RL
7 Conclusion and Outlook
8 Acknowledgments

Abbreviations in Alphabetical Order

AE: Autoencoder
AI: Artificial Intelligence
ANN: Artificial Neural Network
BFGS: Broyden-Fletcher-Goldfarb-Shanno
BNN: Biological Neural Network
BM: Boltzmann Machine
BP: Backpropagation
BRNN: Bi-directional Recurrent Neural Network
CAP: Credit Assignment Path
CEC: Constant Error Carousel
CFL: Context Free Language
CMA-ES: Covariance Matrix Adaptation ES
CNN: Convolutional Neural Network
CoSyNE: Cooperative Synapse NeuroEvolution
CSL: Context Sensitive Language
CTC: Connectionist Temporal Classification
DBN: Deep Belief Network
DCT: Discrete Cosine Transform
DL: Deep Learning
DP: Dynamic Programming
DS: Direct Policy Search
EA: Evolutionary Algorithm
EM: Expectation Maximization
ES: Evolution Strategy
FMS: Flat Minimum Search
FNN: Feedforward Neural Network
FSA: Finite State Automaton
GMDH: Group Method of Data Handling
GOFAI: Good Old-Fashioned AI
GP: Genetic Programming
GPU: Graphics Processing Unit
GPU-MPCNN: GPU-Based MPCNN
HMM: Hidden Markov Model
HRL: Hierarchical Reinforcement Learning
HTM: Hierarchical Temporal Memory
HMAX: Hierarchical Model "and X"
LSTM: Long Short-Term Memory (RNN)
MDL: Minimum Description Length
MDP: Markov Decision Process
MNIST: Mixed National Institute of Standards and Technology Database
MP: Max-Pooling
MPCNN: Max-Pooling CNN
NE: NeuroEvolution
NEAT: NE of Augmenting Topologies
NES: Natural Evolution Strategies
NFQ: Neural Fitted Q-Learning
NN: Neural Network
OCR: Optical Character Recognition
PCC: Potential Causal Connection
PDCC: Potential Direct Causal Connection
PM: Predictability Minimization
POMDP: Partially Observable MDP
RAAM: Recursive Auto-Associative Memory
RBM: Restricted Boltzmann Machine
ReLU: Rectified Linear Unit
RL: Reinforcement Learning
RNN: Recurrent Neural Network
R-prop: Resilient Backpropagation
SL: Supervised Learning
SLIM NN: Self-Delimiting Neural Network
SOTA: Self-Organising Tree Algorithm
SVM: Support Vector Machine
TDNN: Time-Delay Neural Network
TIMIT: TI/SRI/MIT Acoustic-Phonetic Continuous Speech Corpus
UL: Unsupervised Learning
WTA: Winner-Take-All
1 Introduction to Deep Learning (DL) in Neural Networks (NNs)

Which modifiable components of a learning system are responsible for its success or failure? What changes to them improve performance? This has been called the fundamental credit assignment problem (Minsky, 1963). There are general credit assignment methods for universal problem solvers that are time-optimal in various theoretical senses (Sec. 6.8). The present survey, however, will focus on the narrower, but now commercially important, subfield of Deep Learning (DL) in Artificial Neural Networks (NNs).

A standard neural network (NN) consists of many simple, connected processors called neurons, each producing a sequence of real-valued activations. Input neurons get activated through sensors perceiving the environment; other neurons get activated through weighted connections from previously active neurons (details in Sec. 2). Some neurons may influence the environment by triggering actions. Learning or credit assignment is about finding weights that make the NN exhibit desired behavior, such as driving a car. Depending on the problem and how the neurons are connected, such behavior may require long causal chains of computational stages (Sec. 3), where each stage transforms (often in a non-linear way) the aggregate activation of the network. Deep Learning is about accurately assigning credit across many such stages.

Shallow NN-like models with few such stages have been around for many decades if not centuries (Sec. 5.1). Models with several successive nonlinear layers of neurons date back at least to the 1960s (Sec. 5.3) and 1970s (Sec. 5.5). An efficient gradient descent method for teacher-based Supervised Learning (SL) in discrete, differentiable networks of arbitrary depth called backpropagation (BP) was developed in the 1960s and 1970s, and applied to NNs in 1981 (Sec. 5.5). BP-based training of deep NNs with many layers, however, had been found to be difficult in practice by the late 1980s (Sec. 5.6), and had become an explicit research subject by the early 1990s (Sec. 5.9). DL became practically feasible to some extent through the help of Unsupervised Learning (UL), e.g., Sec. 5.10 (1991), Sec. 5.15 (2006). The 1990s and 2000s also saw many improvements of purely supervised DL (Sec. 5). In the new millennium, deep NNs have finally attracted widespread attention, mainly by outperforming alternative machine learning methods such as kernel machines (Vapnik, 1995; Schölkopf et al., 1998) in numerous important applications. In fact, since 2009, supervised deep NNs have won many official international pattern recognition competitions (e.g., Sec. 5.17, 5.19, 5.21, 5.22), achieving the first superhuman visual pattern recognition results in limited domains (Sec. 5.19, 2011). Deep NNs also have become relevant for the more general field of Reinforcement Learning (RL) where there is no supervising teacher (Sec. 6).

Both feedforward (acyclic) NNs (FNNs) and recurrent (cyclic) NNs (RNNs) have won contests (Sec. 5.12, 5.14, 5.17, 5.19, 5.21, 5.22). In a sense, RNNs are the deepest of all NNs (Sec. 3)—they are general computers more powerful than FNNs, and can in principle create and process memories of arbitrary sequences of input patterns (e.g., Siegelmann and Sontag, 1991; Schmidhuber, 1990a).
Unlike traditional methods for automatic sequential program synthesis (e.g., Waldinger and Lee, 1969; Balzer, 1985; Soloway, 1986; Deville and Lau, 1994), RNNs can learn programs that mix sequential and parallel information processing in a natural and efficient way, exploiting the massive parallelism viewed as crucial for sustaining the rapid decline of computation cost observed over the past 75 years.

The rest of this paper is structured as follows. Sec. 2 introduces a compact, event-oriented notation that is simple yet general enough to accommodate both FNNs and RNNs. Sec. 3 introduces the concept of Credit Assignment Paths (CAPs) to measure whether learning in a given NN application is of the deep or shallow type. Sec. 4 lists recurring themes of DL in SL, UL, and RL. Sec. 5 focuses on SL and UL, and on how UL can facilitate SL, although pure SL has become dominant in recent competitions (Sec. 5.17–5.23). Sec. 5 is arranged in a historical timeline format with subsections on important inspirations and technical contributions. Sec. 6 on deep RL discusses traditional Dynamic Programming (DP)-based RL combined with gradient-based search techniques for SL or UL in deep NNs, as well as general methods for direct and indirect search in the weight space of deep FNNs and RNNs, including successful policy gradient and evolutionary methods.
2 Event-Oriented Notation for Activation Spreading in NNs

Throughout this paper, let $i, j, k, t, p, q, r$ denote positive integer variables assuming ranges implicit in the given contexts. Let $n, m, T$ denote positive integer constants.

An NN's topology may change over time (e.g., Sec. 5.3, 5.6.3). At any given moment, it can be described as a finite subset of units (or nodes or neurons) $N = \{u_1, u_2, \ldots\}$ and a finite set $H \subseteq N \times N$ of directed edges or connections between nodes. FNNs are acyclic graphs, RNNs cyclic. The first (input) layer is the set of input units, a subset of $N$. In FNNs, the $k$-th layer ($k > 1$) is the set of all nodes $u \in N$ such that there is an edge path of length $k-1$ (but no longer path) between some input unit and $u$. There may be shortcut connections between distant layers. In sequence-processing, fully connected RNNs, all units have connections to all non-input units.

The NN's behavior or program is determined by a set of real-valued, possibly modifiable, parameters or weights $w_i$ ($i = 1, \ldots, n$). We now focus on a single finite episode or epoch of information processing and activation spreading, without learning through weight changes. The following slightly unconventional notation is designed to compactly describe what is happening during the runtime of the system.

During an episode, there is a partially causal sequence $x_t$ ($t = 1, \ldots, T$) of real values that I call events. Each $x_t$ is either an input set by the environment, or the activation of a unit that may directly depend on other $x_k$ ($k < t$) through a current NN topology-dependent set $in_t$ of indices $k$ representing incoming causal connections or links. Let the function $v$ encode topology information and map such event index pairs $(k, t)$ to weight indices. For example, in the non-input case we may have $x_t = f_t(net_t)$ with real-valued $net_t = \sum_{k \in in_t} x_k w_{v(k,t)}$ (additive case) or $net_t = \prod_{k \in in_t} x_k w_{v(k,t)}$ (multiplicative case), where $f_t$ is a typically nonlinear real-valued activation function such as $\tanh$. In many recent competition-winning NNs (Sec. 5.19, 5.21, 5.22) there also are events of the type $x_t = \max_{k \in in_t}(x_k)$; some network types may also use complex polynomial activation functions (Sec. 5.3). $x_t$ may directly affect certain $x_k$ ($k > t$) through outgoing connections or links represented through a current set $out_t$ of indices $k$ with $t \in in_k$. Some of the non-input events are called output events.

Note that many of the $x_t$ may refer to different, time-varying activations of the same unit in sequence-processing RNNs (e.g., Williams, 1989, "unfolding in time"), or also in FNNs sequentially exposed to time-varying input patterns of a large training set encoded as input events. During an episode, the same weight may get reused over and over again in topology-dependent ways, e.g., in RNNs, or in convolutional NNs (Sec. 5.4, 5.8). I call this weight sharing across space and/or time. Weight sharing may greatly reduce the NN's descriptive complexity, which is the number of bits of information required to describe the NN (Sec. 4.4).

In Supervised Learning (SL), certain NN output events $x_t$ may be associated with teacher-given, real-valued labels or targets $d_t$ yielding errors $e_t$, e.g., $e_t = \frac{1}{2}(x_t - d_t)^2$. A typical goal of supervised NN training is to find weights that yield episodes with small total error $E$, the sum of all such $e_t$.
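To make this concrete, consider the following purely illustrative Python sketch (hypothetical code, not from the survey; run_episode and all parameter names are inventions for this example). It spreads activation through one episode in the additive case and computes the error $e_t$ of an output event:

import math

def run_episode(T, inputs, in_sets, weights, f=math.tanh):
    # inputs:  {t: value} for input events set by the environment
    # in_sets: {t: [k, ...]} encoding the incoming index sets in_t
    # weights: {(k, t): w} standing in for w_{v(k,t)}
    x = {}
    for t in range(1, T + 1):
        if t in inputs:
            x[t] = inputs[t]                 # input event
        else:                                # x_t = f_t(net_t), additive case
            x[t] = f(sum(x[k] * weights[(k, t)] for k in in_sets[t]))
    return x

# Toy episode: two input events feed one output event.
x = run_episode(T=3, inputs={1: 0.5, 2: -1.0},
                in_sets={3: [1, 2]}, weights={(1, 3): 0.8, (2, 3): 0.3})
d3 = 0.1                                     # teacher-given target d_3
e3 = 0.5 * (x[3] - d3) ** 2                  # e_t = 1/2 (x_t - d_t)^2

A max-pooling event of the kind mentioned above would simply replace the weighted sum by max(x[k] for k in in_sets[t]).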
The hope is that the NN will generalize well in later episodes, causing only small errors on previously unseen sequences of input events. Many alternative error functions for SL and UL are possible.

SL assumes that input events are independent of earlier output events (which may affect the environment through actions causing subsequent perceptions). This assumption does not hold in the broader fields of Sequential Decision Making and Reinforcement Learning (RL) (Kaelbling et al., 1996; Sutton and Barto, 1998; Hutter, 2005; Wiering and van Otterlo, 2012) (Sec. 6). In RL, some of the input events may encode real-valued reward signals given by the environment, and a typical goal is to find weights that yield episodes with a high sum of reward signals, through sequences of appropriate output actions.
Sec. 5.5 will use the notation above to compactly describe a central algorithm of DL, namely, backpropagation (BP) for supervised weight-sharing FNNs and RNNs. (FNNs may be viewed as RNNs with certain fixed zero weights.) Sec. 6 will address the more general RL case.

3 Depth of Credit Assignment Paths (CAPs) and of Problems

To measure whether credit assignment in a given NN application is of the deep or shallow type, I introduce the concept of Credit Assignment Paths or CAPs, which are chains of possibly causal links between the events of Sec. 2, e.g., from input through hidden to output layers in FNNs, or through transformations over time in RNNs.

Let us first focus on SL. Consider two events $x_p$ and $x_q$ ($1 \le p < q \le T$). Depending on the application, they may have a Potential Direct Causal Connection (PDCC) expressed by the Boolean predicate $pdcc(p, q)$, which is true if and only if $p \in in_q$. Then the 2-element list $(p, q)$ is defined to be a CAP (a minimal one) from $p$ to $q$. A learning algorithm may be allowed to change $w_{v(p,q)}$ to improve performance in future episodes.

More general, possibly indirect, Potential Causal Connections (PCC) are expressed by the recursively defined Boolean predicate $pcc(p, q)$, which in the SL case is true only if $pdcc(p, q)$, or if $pcc(p, k)$ for some $k$ and $pdcc(k, q)$. In the latter case, appending $q$ to any CAP from $p$ to $k$ yields a CAP from $p$ to $q$ (this is a recursive definition, too). The set of such CAPs may be large but is finite. Note that the same weight may affect many different PDCCs between successive events listed by a given CAP, e.g., in the case of RNNs, or weight-sharing FNNs.

Suppose a CAP has the form $(\ldots, k, t, \ldots, q)$, where $k$ and $t$ (possibly $t = q$) are the first successive elements with modifiable $w_{v(k,t)}$. Then the length of the suffix list $(t, \ldots, q)$ is called the CAP's depth (which is 0 if there are no modifiable links at all). This depth limits how far backwards credit assignment can move down the causal chain to find a modifiable weight.¹

Suppose an episode and its event sequence $x_1, \ldots, x_T$ satisfy a computable criterion used to decide whether a given problem has been solved (e.g., total error $E$ below some threshold). Then the set of used weights is called a solution to the problem, and the depth of the deepest CAP within the sequence is called the solution depth. There may be other solutions (yielding different event sequences) with different depths. Given some fixed NN topology, the smallest depth of any solution is called the problem depth.

Sometimes we also speak of the depth of an architecture: SL FNNs with fixed topology imply a problem-independent maximal problem depth bounded by the number of non-input layers. Certain SL RNNs with fixed weights for all connections except those to output units (Jaeger, 2001; Maass et al., 2002; Jaeger, 2004; Schrauwen et al., 2007) have a maximal problem depth of 1, because only the final links in the corresponding CAPs are modifiable. In general, however, RNNs may learn to solve problems of potentially unlimited depth.

Note that the definitions above are solely based on the depths of causal chains, and agnostic to the temporal distance between events.
For example, shallow FNNs perceiving large "time windows" of input events may correctly classify long input sequences through appropriate output events, and thus solve shallow problems involving long time lags between relevant events.

At which problem depth does Shallow Learning end, and Deep Learning begin? Discussions with DL experts have not yet yielded a conclusive response to this question. Instead of committing myself to a precise answer, let me just define for the purposes of this overview: problems of depth > 10 require Very Deep Learning.

¹ An alternative would be to count only modifiable links when measuring depth. In many typical NN applications this would not make a difference, but in some it would, e.g., Sec. 6.1.
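The CAP definitions are directly computable. The following hypothetical Python sketch (not from the survey; all names are illustrative) enumerates CAPs from the incoming index sets $in_q$ of Sec. 2 and measures their depth:

def caps(p, q, in_sets):
    # Yield all CAPs (event-index lists) from p to q.
    # pdcc(k, t) holds iff k is in in_sets[t]; pcc is its recursive closure.
    # Terminates because events are causally ordered (k < t).
    if p in in_sets.get(q, ()):              # minimal CAP (p, q)
        yield [p, q]
    for k in in_sets.get(q, ()):
        for cap in caps(p, k, in_sets):      # appending q to a CAP from p to k
            yield cap + [q]                  # yields a CAP from p to q

def cap_depth(cap, modifiable):
    # Depth = length of the suffix (t, ..., q) starting at the first
    # successive pair (k, t) with modifiable w_{v(k,t)}; 0 if none is modifiable.
    for i in range(len(cap) - 1):
        if (cap[i], cap[i + 1]) in modifiable:
            return len(cap) - i - 1
    return 0

# Toy check: in a chain 1 -> 2 -> 3 where only the final link is modifiable
# (as in the reservoir-like RNNs above), the single CAP has depth 1.
assert [cap_depth(c, {(2, 3)}) for c in caps(1, 3, {2: [1], 3: [2]})] == [1]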
The difficulty of a problem may have little to do with its depth. Some NNs can quickly learn to solve certain deep problems, e.g., through random weight guessing (Sec. 5.9) or other types of direct search (Sec. 6.6) or indirect search (Sec. 6.7) in weight space, or through training an NN first on shallow problems whose solutions may then generalize to deep problems, or through collapsing sequences of (non)linear operations into a single (non)linear operation (but see an analysis of non-trivial aspects of deep linear networks, Baldi and Hornik, 1994, Section B). In general, however, finding an NN that precisely models a given training set is an NP-complete problem (Judd, 1990; Blum and Rivest, 1992), also in the case of deep NNs (Šíma, 1994; de Souto et al., 1999; Windisch, 2005); compare a survey of negative results (Šíma, 2002, Section 1).

Above we have focused on SL. In the more general case of RL in unknown environments, $pcc(p, q)$ is also true if $x_p$ is an output event and $x_q$ any later input event—any action may affect the environment and thus any later perception. (In the real world, the environment may even influence non-input events computed on a physical hardware entangled with the entire universe, but this is ignored here.) It is possible to model and replace such unmodifiable environmental PCCs through a part of the NN that has already learned to predict (through some of its units) input events (including reward signals) from former input events and actions (Sec. 6.1). Its weights are frozen, but can help to assign credit to other, still modifiable weights used to compute actions (Sec. 6.1). This approach may lead to very deep CAPs though.

Some DL research is about automatically rephrasing problems such that their depth is reduced (Sec. 4). In particular, sometimes UL is used to make SL problems less deep, e.g., Sec. 5.10. Often Dynamic Programming (Sec. 4.1) is used to facilitate certain traditional RL problems, e.g., Sec. 6.2. Sec. 5 focuses on CAPs for SL, Sec. 6 on the more complex case of RL.

4 Recurring Themes of Deep Learning

4.1 Dynamic Programming for Supervised/Reinforcement Learning (SL/RL)

One recurring theme of DL is Dynamic Programming (DP) (Bellman, 1957), which can help to facilitate credit assignment under certain assumptions. For example, in SL NNs, backpropagation itself can be viewed as a DP-derived method (Sec. 5.5). In traditional RL based on strong Markovian assumptions, DP-derived methods can help to greatly reduce problem depth (Sec. 6.2). DP algorithms are also essential for systems that combine concepts of NNs and graphical models, such as Hidden Markov Models (HMMs) (Stratonovich, 1960; Baum and Petrie, 1966) and Expectation Maximization (EM) (Dempster et al., 1977; Friedman et al., 2001), e.g., (Bottou, 1991; Bengio, 1991; Bourlard and Morgan, 1994; Baldi and Chauvin, 1996; Jordan and Sejnowski, 2001; Bishop, 2006; Hastie et al., 2009; Poon and Domingos, 2011; Dahl et al., 2012; Hinton et al., 2012a; Wu and Shao, 2014).
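To illustrate the Markovian case, here is a generic value-iteration sketch (standard textbook DP, not an algorithm given in this survey; all names are illustrative). Each Bellman backup assigns credit across a single step, so long-term consequences are handled without deep chains of modifiable links:

def value_iteration(states, actions, P, R, gamma=0.9, tol=1e-6):
    # P[(s, a)]: list of (probability, next_state); R[(s, a)]: expected reward.
    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            v = max(R[(s, a)] + gamma * sum(p * V[s2] for p, s2 in P[(s, a)])
                    for a in actions)
            delta = max(delta, abs(v - V[s]))
            V[s] = v
        if delta < tol:
            return V

# Two-state toy MDP: "go" from s0 yields reward 1 and moves to s1.
S, A = ["s0", "s1"], ["stay", "go"]
P = {("s0", "stay"): [(1.0, "s0")], ("s0", "go"): [(1.0, "s1")],
     ("s1", "stay"): [(1.0, "s1")], ("s1", "go"): [(1.0, "s0")]}
R = {("s0", "stay"): 0.0, ("s0", "go"): 1.0,
     ("s1", "stay"): 0.0, ("s1", "go"): 0.0}
V = value_iteration(S, A, P, R)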
4.2 Unsupervised Learning (UL) Facilitating SL and RL

Another recurring theme is how UL can facilitate both SL (Sec. 5) and RL (Sec. 6). UL (Sec. 5.6.4) is normally used to encode raw incoming data such as video or speech streams in a form that is more convenient for subsequent goal-directed learning. In particular, codes that describe the original data in a less redundant or more compact way can be fed into SL (Sec. 5.10, 5.15) or RL machines (Sec. 6.4), whose search spaces may thus become smaller (and whose CAPs shallower) than those necessary for dealing with the raw data. UL is closely connected to the topics of regularization and compression (Sec. 4.4, 5.6.3).
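As a minimal illustration of such UL-based re-coding (a sketch under simplifying assumptions, not a method prescribed by the survey), a linear autoencoder compresses raw inputs into a narrower code that a subsequent SL or RL machine can consume:

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 8))            # 200 raw 8-dimensional inputs
W = rng.normal(scale=0.1, size=(8, 3))   # encoder: 8 -> 3 code units
V = rng.normal(scale=0.1, size=(3, 8))   # decoder: 3 -> 8

lr = 0.01
for _ in range(1000):                    # gradient descent on the mean squared
    H = X @ W                            # reconstruction error ||X W V - X||^2
    E = H @ V - X                        # (constant factors absorbed into lr)
    W -= lr * (X.T @ E @ V.T) / len(X)
    V -= lr * (H.T @ E) / len(X)

codes = X @ W                            # compact codes: a smaller search space
                                         # (and shallower CAPs) for SL or RL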
4.3 Learning Hierarchical Representations Through Deep SL, UL, RL

Many methods of Good Old-Fashioned Artificial Intelligence (GOFAI) (Nilsson, 1980) as well as more recent approaches to AI (Russell et al., 1995) and Machine Learning (Mitchell, 1997) learn hierarchies of more and more abstract data representations. For example, certain methods of syntactic pattern recognition (Fu, 1977) such as grammar induction discover hierarchies of formal rules to model observations. The partially (un)supervised Automated Mathematician / EURISKO (Lenat, 1983; Lenat and Brown, 1984) continually learns concepts by combining previously learnt concepts. Such hierarchical representation learning (Ring, 1994; Bengio et al., 2013; Deng and Yu, 2014) is also a recurring theme of DL NNs for SL (Sec. 5), UL-aided SL (Sec. 5.7, 5.10, 5.15), and hierarchical RL (Sec. 6.5). Often, abstract hierarchical representations are natural by-products of data compression (Sec. 4.4), e.g., Sec. 5.10.

4.4 Occam's Razor: Compression and Minimum Description Length (MDL)

Occam's razor favors simple solutions over complex ones. Given some programming language, the principle of Minimum Description Length (MDL) can be used to measure the complexity of a solution candidate by the length of the shortest program that computes it (e.g., Solomonoff, 1964; Kolmogorov, 1965b; Chaitin, 1966; Wallace and Boulton, 1968; Levin, 1973a; Solomonoff, 1978; Rissanen, 1986; Blumer et al., 1987; Li and Vitányi, 1997; Grünwald et al., 2005). Some methods explicitly take into account program runtime (Allender, 1992; Watanabe, 1992; Schmidhuber, 1997, 2002); many consider only programs with constant runtime, written in non-universal programming languages (e.g., Rissanen, 1986; Hinton and van Camp, 1993). In the NN case, the MDL principle suggests that low NN weight complexity corresponds to high NN probability in the Bayesian view (e.g., MacKay, 1992; Buntine and Weigend, 1991; Neal, 1995; De Freitas, 2003), and to high generalization performance (e.g., Baum and Haussler, 1989), without overfitting the training data. Many methods have been proposed for regularizing NNs, that is, searching for solution-computing but simple, low-complexity SL NNs (Sec. 5.6.3) and RL NNs (Sec. 6.7). This is closely related to certain UL methods (Sec. 4.2, 5.6.4).
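As general background (a standard two-part MDL formulation from the textbook literature, not an equation appearing in this survey), the description length of data $D$ under a hypothesis $H$ (here: an NN with its weights) decomposes as

\[
L(D, H) = L(H) + L(D \mid H),
\]

where $L(H)$ counts the bits needed to encode the weights and $L(D \mid H)$ the bits needed to encode the training data (e.g., the residual errors) given the NN. Regularizers that search for simple, low-complexity NNs can be read as trading a smaller first term against a possibly larger second one.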
4.5 Fast Graphics Processing Units (GPUs) for DL in NNs

While the previous millennium saw several attempts at creating fast NN-specific hardware (e.g., Jackel et al., 1990; Faggin, 1992; Ramacher et al., 1993; Widrow et al., 1994; Heemskerk, 1995; Korkin et al., 1997; Urlbe, 1999), and at exploiting standard hardware (e.g., Anguita et al., 1994; Muller et al., 1995; Anguita and Gomes, 1996), the new millennium brought a DL breakthrough in form of cheap, multi-processor graphics cards or GPUs. GPUs are widely used for video games, a huge and competitive market that has driven down hardware prices. GPUs excel at the fast matrix and vector multiplications required not only for convincing virtual realities but also for NN training, where they can speed up learning by a factor of 50 and more. Some of the GPU-based FNN implementations (Sec. 5.16–5.19) have greatly contributed to recent successes in contests for pattern recognition (Sec. 5.19–5.22), image segmentation (Sec. 5.21), and object detection (Sec. 5.21–5.22).

5 Supervised NNs, Some Helped by Unsupervised NNs

The main focus of current practical applications is on Supervised Learning (SL), which has dominated recent pattern recognition contests (Sec. 5.17–5.23). Several methods, however, use additional Unsupervised Learning (UL) to facilitate SL (Sec. 5.7, 5.10, 5.15). It does make sense to treat SL and