Deep Learning

Contents

Acknowledgments
Notation

1 Introduction
  1.1 Who Should Read This Book?
  1.2 Historical Trends in Deep Learning

I Applied Math and Machine Learning Basics

2 Linear Algebra
  2.1 Scalars, Vectors, Matrices and Tensors
  2.2 Multiplying Matrices and Vectors
  2.3 Identity and Inverse Matrices
  2.4 Linear Dependence, Span and Rank
  2.5 Norms
  2.6 Special Kinds of Matrices and Vectors
  2.7 Eigendecomposition
  2.8 Singular Value Decomposition
  2.9 The Moore-Penrose Pseudoinverse
  2.10 The Trace Operator
  2.11 Determinant
  2.12 Example: Principal Components Analysis

3 Probability and Information Theory
  3.1 Why Probability?
  3.2 Random Variables
  3.3 Probability Distributions
  3.4 Marginal Probability
  3.5 Conditional Probability
  3.6 The Chain Rule of Conditional Probabilities
  3.7 Independence and Conditional Independence
  3.8 Expectation, Variance and Covariance
  3.9 Information Theory
  3.10 Common Probability Distributions
  3.11 Useful Properties of Common Functions
  3.12 Bayes' Rule
  3.13 Technical Details of Continuous Variables
  3.14 Structured Probabilistic Models
  3.15 Example: Naive Bayes

4 Numerical Computation
  4.1 Overflow and Underflow
  4.2 Poor Conditioning
  4.3 Gradient-Based Optimization
  4.4 Constrained Optimization
  4.5 Example: Linear Least Squares

5 Machine Learning Basics
  5.1 Learning Algorithms
  5.2 Example: Linear Regression
  5.3 Generalization, Capacity, Overfitting and Underfitting
  5.4 Hyperparameters and Validation Sets
  5.5 Estimators, Bias and Variance
  5.6 Maximum Likelihood Estimation
  5.7 Bayesian Statistics
  5.8 Supervised Learning Algorithms
  5.9 Unsupervised Learning Algorithms
  5.10 Weakly Supervised Learning
  5.11 Building a Machine Learning Algorithm
  5.12 The Curse of Dimensionality and Statistical Limitations of Local Generalization

II Modern Practical Deep Networks

6 Feedforward Deep Networks
  6.1 Vanilla MLPs
  6.2 Estimating Conditional Statistics
  6.3 Parametrizing a Learned Predictor
  6.4 Flow Graphs and Back-Propagation
  6.5 Universal Approximation Properties and Depth
  6.6 Feature / Representation Learning
  6.7 Piecewise Linear Hidden Units
  6.8 Historical Notes

7 Regularization of Deep or Distributed Models
  7.1 Regularization from a Bayesian Perspective
  7.2 Classical Regularization: Parameter Norm Penalty
  7.3 Classical Regularization as Constrained Optimization
  7.4 Regularization and Under-Constrained Problems
  7.5 Dataset Augmentation
  7.6 Classical Regularization as Noise Robustness
  7.7 Early Stopping as a Form of Regularization
  7.8 Parameter Tying and Parameter Sharing
  7.9 Sparse Representations
  7.10 Bagging and Other Ensemble Methods
  7.11 Dropout
  7.12 Multi-Task Learning
  7.13 Adversarial Training

8 Optimization for Training Deep Models
  8.1 Optimization for Model Training
  8.2 Challenges in Optimization
  8.3 Optimization Algorithms I: Basic Algorithms
  8.4 Optimization Algorithms II: Adaptive Learning Rates
  8.5 Optimization Algorithms III: Approximate Second-Order Methods
  8.6 Optimization Algorithms IV: Natural Gradient Methods
  8.7 Optimization Strategies and Meta-Algorithms
  8.8 Hints, Global Optimization and Curriculum Learning

9 Convolutional Networks
  9.1 The Convolution Operation
  9.2 Motivation
  9.3 Pooling
  9.4 Convolution and Pooling as an Infinitely Strong Prior
  9.5 Variants of the Basic Convolution Function
  9.6 Structured Outputs
  9.7 Convolutional Modules
  9.8 Data Types
  9.9 Efficient Convolution Algorithms
  9.10 Random or Unsupervised Features
  9.11 The Neuroscientific Basis for Convolutional Networks
  9.12 Convolutional Networks and the History of Deep Learning

10 Sequence Modeling: Recurrent and Recursive Nets
  10.1 Unfolding Flow Graphs and Sharing Parameters
  10.2 Recurrent Neural Networks
  10.3 Bidirectional RNNs
  10.4 Encoder-Decoder Sequence-to-Sequence Architectures
  10.5 Deep Recurrent Networks
  10.6 Recursive Neural Networks
  10.7 Auto-Regressive Networks
  10.8 Facing the Challenge of Long-Term Dependencies
  10.9 Handling Temporal Dependencies with n-grams, HMMs, CRFs and Other Graphical Models
  10.10 Combining Neural Networks and Search

11 Practical Methodology
  11.1 Basic Machine Learning Methodology
  11.2 Selecting Hyperparameters
  11.3 Debugging Strategies

12 Applications
  12.1 Large Scale Deep Learning
  12.2 Computer Vision
  12.3 Speech Recognition
  12.4 Natural Language Processing and Neural Language Models
  12.5 Structured Outputs
  12.6 Other Applications

III Deep Learning Research

13 Structured Probabilistic Models for Deep Learning
  13.1 The Challenge of Unstructured Modeling
  13.2 Using Graphs to Describe Model Structure
  13.3 Advantages of Structured Modeling
  13.4 Learning About Dependencies
  13.5 Inference and Approximate Inference Over Latent Variables
  13.6 The Deep Learning Approach to Structured Probabilistic Models

14 Monte Carlo Methods
  14.1 Markov Chain Monte Carlo Methods
  14.2 The Difficulty of Mixing Between Well-Separated Modes

15 Linear Factor Models and Auto-Encoders
  15.1 Regularized Auto-Encoders
  15.2 Denoising Auto-Encoders
  15.3 Representational Power, Layer Size and Depth
  15.4 Reconstruction Distribution
  15.5 Linear Factor Models
  15.6 Probabilistic PCA and Factor Analysis
  15.7 Reconstruction Error as Log-Likelihood
  15.8 Sparse Representations
  15.9 Denoising Auto-Encoders
  15.10 Contractive Auto-Encoders

16 Representation Learning
  16.1 Greedy Layerwise Unsupervised Pre-Training
  16.2 Transfer Learning and Domain Adaptation
  16.3 Semi-Supervised Learning
  16.4 Semi-Supervised Learning and Disentangling Underlying Causal Factors
  16.5 Assumption of Underlying Factors and Distributed Representation
  16.6 Exponential Gain in Representational Efficiency from Distributed Representations
  16.7 Exponential Gain in Representational Efficiency from Depth
  16.8 Priors Regarding The Underlying Factors

17 The Manifold Perspective on Representation Learning
  17.1 Manifold Interpretation of PCA and Linear Auto-Encoders
  17.2 Manifold Interpretation of Sparse Coding
  17.3 The Entropy Bias from Maximum Likelihood
  17.4 Manifold Learning via Regularized Auto-Encoders
  17.5 Tangent Distance, Tangent-Prop, and Manifold Tangent Classifier

18 Confronting the Partition Function
  18.1 The Log-Likelihood Gradient of Energy-Based Models
  18.2 Stochastic Maximum Likelihood and Contrastive Divergence
  18.3 Pseudolikelihood
  18.4 Score Matching and Ratio Matching
  18.5 Denoising Score Matching
  18.6 Noise-Contrastive Estimation
  18.7 Estimating the Partition Function

19 Approximate Inference
  19.1 Inference as Optimization
  19.2 Expectation Maximization
  19.3 MAP Inference: Sparse Coding as a Probabilistic Model
  19.4 Variational Inference and Learning
  19.5 Stochastic Inference
  19.6 Learned Approximate Inference

20 Deep Generative Models
  20.1 Boltzmann Machines
  20.2 Restricted Boltzmann Machines
  20.3 Training Restricted Boltzmann Machines
  20.4 Deep Belief Networks
  20.5 Deep Boltzmann Machines
  20.6 Boltzmann Machines for Real-Valued Data
  20.7 Convolutional Boltzmann Machines
  20.8 Other Boltzmann Machines
  20.9 Directed Generative Nets
  20.10 A Generative View of Autoencoders
  20.11 Generative Stochastic Networks
  20.12 Methodological Notes

Bibliography

Index
Acknowledgments

This book would not have been possible without the contributions of many people.

We would like to thank those who commented on our proposal for the book and helped plan its contents and organization: Hugo Larochelle, Guillaume Alain, Kyunghyun Cho, Çağlar Gülçehre, Razvan Pascanu, David Krueger and Thomas Rohée.

We would like to thank the people who offered feedback on the content of the book itself. Some offered feedback on many chapters: Julian Serban, Laurent Dinh, Guillaume Alain, Kelvin Xu, Meire Fortunato, Ilya Sutskever, Vincent Vanhoucke, David Warde-Farley, Augustus Q. Odena, Matko Bošnjak, Stephan Dreseitl, Jurgen Van Gael, Dustin Webb, Johannes Roith, Ion Androutsopoulos, Stephan Dreiseitl, Karl Pichotta, Pawel Chilinski, Halis Sak, Frédéric Francis, Jonathan Hunt, and Grigory Sapunov.

We would also like to thank those who provided us with useful feedback on individual chapters:

• Chapter 1, Introduction: Johannes Roith, Eric Morris, Samira Ebrahimi, Ozan Çağlayan, Martín Abadi, and Sebastien Bratieres.

• Chapter 2, Linear Algebra: Pierre Luc Carrier, Li Yao, Thomas Rohée, Colby Toland, Amjad Almahairi, Sergey Oreshkov, István Petrás, Dennis Prangle, and Alessandro Vitale.

• Chapter 3, Probability and Information Theory: Rasmus Antti, Stephan Gouws, Vincent Dumoulin, Artem Oboturov, Li Yao, John Philip Anderson, and Kai Arulkumaran.

• Chapter 4, Numerical Computation: Tran Lam An.

• Chapter 5, Machine Learning Basics: Dzmitry Bahdanau and Zheng Sun.

• Chapter 6, Feedforward Deep Networks: David Krueger.

• Chapter 8, Optimization for Training Deep Models: James Martens and Marcel Ackermann.
• Chapter 9, Convolutional Networks: Mehdi Mirza, Çağlar Gülçehre, and Martín Arjovsky.

• Chapter 10, Sequence Modeling: Recurrent and Recursive Nets: Mihaela Rosca, Razvan Pascanu, Dmitriy Serdyuk, and Dongyu Shi.

• Chapter 18, Confronting the Partition Function: Sam Bowman and Ozan Çağlayan.

• Chapter 20, Deep Generative Models: Fady Medhat.

• Bibliography: Leslie N. Smith.

We also want to thank those who allowed us to reproduce images, figures or data from their publications: David Warde-Farley, Matthew D. Zeiler, Rob Fergus, Chris Olah, Jason Yosinski, Nicolas Chapados and James Bergstra. We indicate their contributions in the figure captions throughout the text.

Finally, we would like to thank Google for allowing Ian Goodfellow to work on the book as his 20% project while working at Google. In particular, we would like to thank Ian's former manager, Greg Corrado, and his subsequent manager, Samy Bengio, for their support of this effort.