Deep Learning
Ian Goodfellow
Yoshua Bengio
Aaron Courville
Contents
Website . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . vii
Acknowledgments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . viii
Notation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xi
1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.1 Who Should Read This Book? . . . . . . . . . . . . . . . . . . . . 8
1.2 Historical Trends in Deep Learning . . . . . . . . . . . . . . . . . 11
I Applied Math and Machine Learning Basics . . . . . . . . . . . . . . 29
2 Linear Algebra . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
2.1 Scalars, Vectors, Matrices and Tensors . . . . . . . . . . . . . . . . 31
2.2 Multiplying Matrices and Vectors . . . . . . . . . . . . . . . . . . 34
2.3 Identity and Inverse Matrices . . . . . . . . . . . . . . . . . . . . . 36
2.4 Linear Dependence and Span . . . . . . . . . . . . . . . . . . . . . 37
2.5 Norms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
2.6 Special Kinds of Matrices and Vectors . . . . . . . . . . . . . . . . 40
2.7 Eigendecomposition . . . . . . . . . . . . . . . . . . . . . . . . . . 42
2.8 Singular Value Decomposition . . . . . . . . . . . . . . . . . . . . 44
2.9 The Moore-Penrose Pseudoinverse . . . . . . . . . . . . . . . . . . 45
2.10 The Trace Operator . . . . . . . . . . . . . . . . . . . . . . . . . . 46
2.11 The Determinant . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
2.12 Example: Principal Components Analysis . . . . . . . . . . . . . 48
3 Probability and Information Theory . . . . . . . . . . . . . . . . . . . 53
3.1 Why Probability? . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
3.2 Random Variables . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
3.3 Probability Distributions . . . . . . . . . . . . . . . . . . . . . . . 56
3.4 Marginal Probability . . . . . . . . . . . . . . . . . . . . . . . . . 58
3.5 Conditional Probability . . . . . . . . . . . . . . . . . . . . . . . . 59
3.6 The Chain Rule of Conditional Probabilities . . . . . . . . . . . . 59
3.7 Independence and Conditional Independence . . . . . . . . . . . . 60
3.8 Expectation, Variance and Covariance . . . . . . . . . . . . . . . . 60
3.9 Common Probability Distributions . . . . . . . . . . . . . . . . . . 62
3.10 Useful Properties of Common Functions . . . . . . . . . . . . . . 67
3.11 Bayes’ Rule . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
3.12 Technical Details of Continuous Variables . . . . . . . . . . . . . 71
3.13 Information Theory . . . . . . . . . . . . . . . . . . . . . . . . . . 72
3.14 Structured Probabilistic Models . . . . . . . . . . . . . . . . . . . 75
4 Numerical Computation . . . . . . . . . . . . . . . . . . . . . . . . . 80
4.1 Overflow and Underflow . . . . . . . . . . . . . . . . . . . . . . . . 80
4.2 Poor Conditioning . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
4.3 Gradient-Based Optimization . . . . . . . . . . . . . . . . . . . . . 82
4.4 Constrained Optimization . . . . . . . . . . . . . . . . . . . . . . . 93
4.5 Example: Linear Least Squares . . . . . . . . . . . . . . . . . . . . 96
5 Machine Learning Basics . . . . . . . . . . . . . . . . . . . . . . . . . 98
5.1 Learning Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . 99
5.2 Capacity, Overfitting and Underfitting . . . . . . . . . . . . . . . . 110
5.3 Hyperparameters and Validation Sets . . . . . . . . . . . . . . . . 120
5.4 Estimators, Bias and Variance . . . . . . . . . . . . . . . . . . . . 122
5.5 Maximum Likelihood Estimation . . . . . . . . . . . . . . . . . . . 131
5.6 Bayesian Statistics . . . . . . . . . . . . . . . . . . . . . . . . . . . 135
5.7 Supervised Learning Algorithms . . . . . . . . . . . . . . . . . . . 139
5.8 Unsupervised Learning Algorithms . . . . . . . . . . . . . . . . . . 145
5.9 Stochastic Gradient Descent . . . . . . . . . . . . . . . . . . . . . 150
5.10 Building a Machine Learning Algorithm . . . . . . . . . . . . . . 152
5.11 Challenges Motivating Deep Learning . . . . . . . . . . . . . . . . 154
II Deep Networks: Modern Practices . . . . . . . . . . . . . . . . . . . 165
6 Deep Feedforward Networks . . . . . . . . . . . . . . . . . . . . . . . 167
6.1 Example: Learning XOR . . . . . . . . . . . . . . . . . . . . . . . 170
6.2 Gradient-Based Learning . . . . . . . . . . . . . . . . . . . . . . . 176
6.3 Hidden Units . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 190
6.4 Architecture Design . . . . . . . . . . . . . . . . . . . . . . . . . . 196
6.5 Back-Propagation and Other Differentiation Algorithms . . . . . . 203
6.6 Historical Notes . . . . . . . . . . . . . . . . . . . . . . . . . . . . 224
7 Regularization for Deep Learning . . . . . . . . . . . . . . . . . . . . 228
7.1 Parameter Norm Penalties . . . . . . . . . . . . . . . . . . . . . . 230
7.2 Norm Penalties as Constrained Optimization . . . . . . . . . . . . 237
7.3 Regularization and Under-Constrained Problems . . . . . . . . . . 239
7.4 Dataset Augmentation . . . . . . . . . . . . . . . . . . . . . . . . 240
7.5 Noise Robustness . . . . . . . . . . . . . . . . . . . . . . . . . . . 242
7.6 Semi-Supervised Learning . . . . . . . . . . . . . . . . . . . . . . . 244
7.7 Multi-Task Learning . . . . . . . . . . . . . . . . . . . . . . . . . . 245
7.8 Early Stopping . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 246
7.9 Parameter Tying and Parameter Sharing . . . . . . . . . . . . . . 251
7.10 Sparse Representations . . . . . . . . . . . . . . . . . . . . . . . . 253
7.11 Bagging and Other Ensemble Methods . . . . . . . . . . . . . . . 255
7.12 Dropout . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 257
7.13 Adversarial Training . . . . . . . . . . . . . . . . . . . . . . . . . 267
7.14 Tangent Distance, Tangent Prop, and Manifold Tangent Classifier 268
8 Optimization for Training Deep Models . . . . . . . . . . . . . . . . 274
8.1 How Learning Differs from Pure Optimization . . . . . . . . . . . 275
8.2 Challenges in Neural Network Optimization . . . . . . . . . . . . . 282
8.3 Basic Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . 294
8.4 Parameter Initialization Strategies . . . . . . . . . . . . . . . . . . 301
8.5 Algorithms with Adaptive Learning Rates . . . . . . . . . . . . . . 306
8.6 Approximate Second-Order Methods . . . . . . . . . . . . . . . . . 310
8.7 Optimization Strategies and Meta-Algorithms . . . . . . . . . . . 317
9 Convolutional Networks . . . . . . . . . . . . . . . . . . . . . . . . . 330
9.1 The Convolution Operation . . . . . . . . . . . . . . . . . . . . . . 331
9.2 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 335
9.3 Pooling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 339
9.4 Convolution and Pooling as an Infinitely Strong Prior . . . . . . . 345
9.5 Variants of the Basic Convolution Function . . . . . . . . . . . . . 347
9.6 Structured Outputs . . . . . . . . . . . . . . . . . . . . . . . . . . 358
9.7 Data Types . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 360
9.8 Efficient Convolution Algorithms . . . . . . . . . . . . . . . . . . . 362
9.9 Random or Unsupervised Features . . . . . . . . . . . . . . . . . . 363
9.10 The Neuroscientific Basis for Convolutional Networks . . . . . . . 364
9.11 Convolutional Networks and the History of Deep Learning . . . . 371
10 Sequence Modeling: Recurrent and Recursive Nets . . . . . . . . . . 373
10.1 Unfolding Computational Graphs . . . . . . . . . . . . . . . . . . 375
10.2 Recurrent Neural Networks . . . . . . . . . . . . . . . . . . . . . 378
10.3 Bidirectional RNNs . . . . . . . . . . . . . . . . . . . . . . . . . . 395
10.4 Encoder-Decoder Sequence-to-Sequence Architectures . . . . . . . 396
10.5 Deep Recurrent Networks . . . . . . . . . . . . . . . . . . . . . . 398
10.6 Recursive Neural Networks . . . . . . . . . . . . . . . . . . . . . . 400
10.7 The Challenge of Long-Term Dependencies . . . . . . . . . . . . . 402
10.8 Echo State Networks . . . . . . . . . . . . . . . . . . . . . . . . . 405
10.9 Leaky Units and Other Strategies for Multiple Time Scales . . . . 408
10.10 The Long Short-Term Memory and Other Gated RNNs . . . . . . 410
10.11 Optimization for Long-Term Dependencies . . . . . . . . . . . . . 414
10.12 Explicit Memory . . . . . . . . . . . . . . . . . . . . . . . . . . . 418
11 Practical Methodology . . . . . . . . . . . . . . . . . . . . . . . . . 423
11.1 Performance Metrics . . . . . . . . . . . . . . . . . . . . . . . . . 424
11.2 Default Baseline Models . . . . . . . . . . . . . . . . . . . . . . . 427
11.3 Determining Whether to Gather More Data . . . . . . . . . . . . 428
11.4 Selecting Hyperparameters . . . . . . . . . . . . . . . . . . . . . . 429
11.5 Debugging Strategies . . . . . . . . . . . . . . . . . . . . . . . . . 438
11.6 Example: Multi-Digit Number Recognition . . . . . . . . . . . . . 442
12 Applications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 445
12.1 Large Scale Deep Learning . . . . . . . . . . . . . . . . . . . . . . 445
12.2 Computer Vision . . . . . . . . . . . . . . . . . . . . . . . . . . . 454
12.3 Speech Recognition . . . . . . . . . . . . . . . . . . . . . . . . . . 460
12.4 Natural Language Processing . . . . . . . . . . . . . . . . . . . . 463
12.5 Other Applications . . . . . . . . . . . . . . . . . . . . . . . . . . 479
III Deep Learning Research . . . . . . . . . . . . . . . . . . . . . . . . 488
13 Linear Factor Models . . . . . . . . . . . . . . . . . . . . . . . . . . 491
13.1 Probabilistic PCA and Factor Analysis . . . . . . . . . . . . . . . 492
13.2 Independent Component Analysis (ICA) . . . . . . . . . . . . . . 493
13.3 Slow Feature Analysis . . . . . . . . . . . . . . . . . . . . . . . . 495
13.4 Sparse Coding . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 498
13.5 Manifold Interpretation of PCA . . . . . . . . . . . . . . . . . . . 501
14 Autoencoders . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 504
14.1 Undercomplete Autoencoders . . . . . . . . . . . . . . . . . . . . 505
14.2 Regularized Autoencoders . . . . . . . . . . . . . . . . . . . . . . 506
14.3 Representational Power, Layer Size and Depth . . . . . . . . . . . 510
14.4 Stochastic Encoders and Decoders . . . . . . . . . . . . . . . . . . 511
14.5 Denoising Autoencoders . . . . . . . . . . . . . . . . . . . . . . . 512
14.6 Learning Manifolds with Autoencoders . . . . . . . . . . . . . . . 517
14.7 Contractive Autoencoders . . . . . . . . . . . . . . . . . . . . . . 523
14.8 Predictive Sparse Decomposition . . . . . . . . . . . . . . . . . . 525
14.9 Applications of Autoencoders . . . . . . . . . . . . . . . . . . . . 526
15 Representation Learning . . . . . . . . . . . . . . . . . . . . . . . . 528
15.1 Greedy Layer-Wise Unsupervised Pretraining . . . . . . . . . . . 530
15.2 Transfer Learning and Domain Adaptation . . . . . . . . . . . . . 538
15.3 Semi-Supervised Disentangling of Causal Factors . . . . . . . . . 543
15.4 Distributed Representation . . . . . . . . . . . . . . . . . . . . . . 548
15.5 Exponential Gains from Depth . . . . . . . . . . . . . . . . . . . 555
15.6 Providing Clues to Discover Underlying Causes . . . . . . . . . . 556
16 Structured Probabilistic Models for Deep Learning . . . . . . . . . . 560
16.1 The Challenge of Unstructured Modeling . . . . . . . . . . . . . . 561
16.2 Using Graphs to Describe Model Structure . . . . . . . . . . . . . 565
16.3 Sampling from Graphical Models . . . . . . . . . . . . . . . . . . 582
16.4 Advantages of Structured Modeling . . . . . . . . . . . . . . . . . 584
16.5 Learning about Dependencies . . . . . . . . . . . . . . . . . . . . 584
16.6 Inference and Approximate Inference . . . . . . . . . . . . . . . . 585
16.7 The Deep Learning Approach to Structured Probabilistic Models 586
17 Monte Carlo Methods . . . . . . . . . . . . . . . . . . . . . . . . . . 592
17.1 Sampling and Monte Carlo Methods . . . . . . . . . . . . . . . . 592
17.2 Importance Sampling . . . . . . . . . . . . . . . . . . . . . . . . . 594
17.3 Markov Chain Monte Carlo Methods . . . . . . . . . . . . . . . . 597
17.4 Gibbs Sampling . . . . . . . . . . . . . . . . . . . . . . . . . . . . 601
17.5 The Challenge of Mixing between Separated Modes . . . . . . . . 601
18 Confronting the Partition Function . . . . . . . . . . . . . . . . . . 607
18.1 The Log-Likelihood Gradient . . . . . . . . . . . . . . . . . . . . 608
18.2 Stochastic Maximum Likelihood and Contrastive Divergence . . . 609
18.3 Pseudolikelihood . . . . . . . . . . . . . . . . . . . . . . . . . . . 617
18.4 Score Matching and Ratio Matching . . . . . . . . . . . . . . . . 619
18.5 Denoising Score Matching . . . . . . . . . . . . . . . . . . . . . . 621
18.6 Noise-Contrastive Estimation . . . . . . . . . . . . . . . . . . . . 622
18.7 Estimating the Partition Function . . . . . . . . . . . . . . . . . . 625
19 Approximate Inference . . . . . . . . . . . . . . . . . . . . . . . . . 633
19.1 Inference as Optimization . . . . . . . . . . . . . . . . . . . . . . 635
19.2 Expectation Maximization . . . . . . . . . . . . . . . . . . . . . . 636
19.3 MAP Inference and Sparse Coding . . . . . . . . . . . . . . . . . 637
19.4 Variational Inference and Learning . . . . . . . . . . . . . . . . . 640
19.5 Learned Approximate Inference . . . . . . . . . . . . . . . . . . . 653
20 Deep Generative Models . . . . . . . . . . . . . . . . . . . . . . . . 656
20.1 Boltzmann Machines . . . . . . . . . . . . . . . . . . . . . . . . . 656
20.2 Restricted Boltzmann Machines . . . . . . . . . . . . . . . . . . . 658
20.3 Deep Belief Networks . . . . . . . . . . . . . . . . . . . . . . . . . 662
20.4 Deep Boltzmann Machines . . . . . . . . . . . . . . . . . . . . . . 665
20.5 Boltzmann Machines for Real-Valued Data . . . . . . . . . . . . . 678
20.6 Convolutional Boltzmann Machines . . . . . . . . . . . . . . . . . 685
20.7 Boltzmann Machines for Structured or Sequential Outputs . . . . 687
20.8 Other Boltzmann Machines
. . . . . . . . . . . . . . . . . . . . . 688
20.9 Back-Propagation through Random Operations . . . . . . . . . . 689
20.10 Directed Generative Nets . . . . . . . . . . . . . . . . . . . . . . . 694
20.11 Drawing Samples from Autoencoders . . . . . . . . . . . . . . . . 712
20.12 Generative Stochastic Networks . . . . . . . . . . . . . . . . . . . 716
20.13 Other Generation Schemes . . . . . . . . . . . . . . . . . . . . . . 717
20.14 Evaluating Generative Models . . . . . . . . . . . . . . . . . . . . 719
20.15 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 721
Bibliography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 723
Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 780
Website
www.deeplearningbook.org
This book is accompanied by the above website. The website provides a
variety of supplementary material, including exercises, lecture slides, corrections of
mistakes, and other resources that should be useful to both readers and instructors.