Deep Learning
Ian Goodfellow
Yoshua Bengio
Aaron Courville
Contents
Website . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . vii
Acknowledgments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . viii
Notation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xi
1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.1 Who Should Read This Book? . . . . . . . . . . . . . . . . . . . . 8
1.2 Historical Trends in Deep Learning . . . . . . . . . . . . . . . . . 11
I Applied Math and Machine Learning Basics . . . . . . . . . . . . . . 29
2 Linear Algebra . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
2.1 Scalars, Vectors, Matrices and Tensors . . . . . . . . . . . . . . . . 31
2.2 Multiplying Matrices and Vectors . . . . . . . . . . . . . . . . . . 34
2.3 Identity and Inverse Matrices . . . . . . . . . . . . . . . . . . . . . 36
2.4 Linear Dependence and Span . . . . . . . . . . . . . . . . . . . . . 37
2.5 Norms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
2.6 Special Kinds of Matrices and Vectors . . . . . . . . . . . . . . . . 40
2.7 Eigendecomposition . . . . . . . . . . . . . . . . . . . . . . . . . . 42
2.8 Singular Value Decomposition . . . . . . . . . . . . . . . . . . . . 44
2.9 The Moore-Penrose Pseudoinverse . . . . . . . . . . . . . . . . . . 45
2.10 The Trace Operator . . . . . . . . . . . . . . . . . . . . . . . . . . 46
2.11 The Determinant . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
2.12 Example: Principal Components Analysis . . . . . . . . . . . . . 48
3 Probability and Information Theory . . . . . . . . . . . . . . . . . . . 53
3.1 Why Probability? . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
3.2 Random Variables . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
3.3 Probability Distributions . . . . . . . . . . . . . . . . . . . . . . . 56
3.4 Marginal Probability . . . . . . . . . . . . . . . . . . . . . . . . . 58
3.5 Conditional Probability . . . . . . . . . . . . . . . . . . . . . . . . 59
3.6 The Chain Rule of Conditional Probabilities . . . . . . . . . . . . 59
3.7 Independence and Conditional Independence . . . . . . . . . . . . 60
3.8 Expectation, Variance and Covariance . . . . . . . . . . . . . . . . 60
3.9 Common Probability Distributions . . . . . . . . . . . . . . . . . . 62
3.10 Useful Properties of Common Functions . . . . . . . . . . . . . . 67
3.11 Bayes’ Rule . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
3.12 Technical Details of Continuous Variables . . . . . . . . . . . . . 71
3.13 Information Theory . . . . . . . . . . . . . . . . . . . . . . . . . . 72
3.14 Structured Probabilistic Models . . . . . . . . . . . . . . . . . . . 75
4 Numerical Computation . . . . . . . . . . . . . . . . . . . . . . . . . 80
4.1 Overflow and Underflow . . . . . . . . . . . . . . . . . . . . . . . . 80
4.2 Poor Conditioning . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
4.3 Gradient-Based Optimization . . . . . . . . . . . . . . . . . . . . . 82
4.4 Constrained Optimization . . . . . . . . . . . . . . . . . . . . . . . 93
4.5 Example: Linear Least Squares . . . . . . . . . . . . . . . . . . . . 96
5 Machine Learning Basics . . . . . . . . . . . . . . . . . . . . . . . . . 98
5.1 Learning Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . 99
5.2 Capacity, Overfitting and Underfitting . . . . . . . . . . . . . . . . 110
5.3 Hyperparameters and Validation Sets . . . . . . . . . . . . . . . . 120
5.4 Estimators, Bias and Variance . . . . . . . . . . . . . . . . . . . . 122
5.5 Maximum Likelihood Estimation . . . . . . . . . . . . . . . . . . . 131
5.6 Bayesian Statistics . . . . . . . . . . . . . . . . . . . . . . . . . . . 135
5.7 Supervised Learning Algorithms . . . . . . . . . . . . . . . . . . . 139
5.8 Unsupervised Learning Algorithms . . . . . . . . . . . . . . . . . . 145
5.9 Stochastic Gradient Descent . . . . . . . . . . . . . . . . . . . . . 150
5.10 Building a Machine Learning Algorithm . . . . . . . . . . . . . . 152
5.11 Challenges Motivating Deep Learning . . . . . . . . . . . . . . . . 154
II Deep Networks: Modern Practices . . . . . . . . . . . . . . . . . . . 165
6 Deep Feedforward Networks . . . . . . . . . . . . . . . . . . . . . . . 167
6.1 Example: Learning XOR . . . . . . . . . . . . . . . . . . . . . . . 170
6.2 Gradient-Based Learning . . . . . . . . . . . . . . . . . . . . . . . 176
6.3 Hidden Units . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 190
6.4 Architecture Design . . . . . . . . . . . . . . . . . . . . . . . . . . 196
6.5 Back-Propagation and Other Differentiation Algorithms . . . . . . 203
6.6 Historical Notes . . . . . . . . . . . . . . . . . . . . . . . . . . . . 224
7 Regularization for Deep Learning . . . . . . . . . . . . . . . . . . . . 228
7.1 Parameter Norm Penalties . . . . . . . . . . . . . . . . . . . . . . 230
7.2 Norm Penalties as Constrained Optimization . . . . . . . . . . . . 237
7.3 Regularization and Under-Constrained Problems . . . . . . . . . . 239
7.4 Dataset Augmentation . . . . . . . . . . . . . . . . . . . . . . . . 240
7.5 Noise Robustness . . . . . . . . . . . . . . . . . . . . . . . . . . . 242
7.6 Semi-Supervised Learning . . . . . . . . . . . . . . . . . . . . . . . 244
7.7 Multi-Task Learning . . . . . . . . . . . . . . . . . . . . . . . . . . 245
7.8 Early Stopping . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 246
7.9 Parameter Tying and Parameter Sharing . . . . . . . . . . . . . . 251
7.10 Sparse Representations . . . . . . . . . . . . . . . . . . . . . . . . 253
7.11 Bagging and Other Ensemble Methods . . . . . . . . . . . . . . . 255
7.12 Dropout . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 257
7.13 Adversarial Training . . . . . . . . . . . . . . . . . . . . . . . . . 267
7.14 Tangent Distance, Tangent Prop, and Manifold Tangent Classifier 268
8 Optimization for Training Deep Models . . . . . . . . . . . . . . . . 274
8.1 How Learning Differs from Pure Optimization . . . . . . . . . . . 275
8.2 Challenges in Neural Network Optimization . . . . . . . . . . . . . 282
8.3 Basic Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . 294
8.4 Parameter Initialization Strategies . . . . . . . . . . . . . . . . . . 301
8.5 Algorithms with Adaptive Learning Rates . . . . . . . . . . . . . . 306
8.6 Approximate Second-Order Methods . . . . . . . . . . . . . . . . . 310
8.7 Optimization Strategies and Meta-Algorithms . . . . . . . . . . . 317
9 Convolutional Networks . . . . . . . . . . . . . . . . . . . . . . . . . 330
9.1 The Convolution Operation . . . . . . . . . . . . . . . . . . . . . . 331
9.2 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 335
9.3 Pooling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 339
9.4 Convolution and Pooling as an Infinitely Strong Prior . . . . . . . 345
9.5 Variants of the Basic Convolution Function . . . . . . . . . . . . . 347
9.6 Structured Outputs . . . . . . . . . . . . . . . . . . . . . . . . . . 358
9.7 Data Types . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 360
9.8 Efficient Convolution Algorithms . . . . . . . . . . . . . . . . . . . 362
9.9 Random or Unsupervised Features . . . . . . . . . . . . . . . . . . 363
9.10 The Neuroscientific Basis for Convolutional Networks . . . . . . . 364
9.11 Convolutional Networks and the History of Deep Learning . . . . 371
10 Sequence Modeling: Recurrent and Recursive Nets . . . . . . . . . . 373
10.1 Unfolding Computational Graphs . . . . . . . . . . . . . . . . . . 375
10.2 Recurrent Neural Networks . . . . . . . . . . . . . . . . . . . . . 378
10.3 Bidirectional RNNs . . . . . . . . . . . . . . . . . . . . . . . . . . 395
10.4 Encoder-Decoder Sequence-to-Sequence Architectures . . . . . . . 396
10.5 Deep Recurrent Networks . . . . . . . . . . . . . . . . . . . . . . 398
10.6 Recursive Neural Networks . . . . . . . . . . . . . . . . . . . . . . 400
10.7 The Challenge of Long-Term Dependencies . . . . . . . . . . . . . 402
10.8 Echo State Networks . . . . . . . . . . . . . . . . . . . . . . . . . 405
10.9 Leaky Units and Other Strategies for Multiple Time Scales . . . . 408
10.10 The Long Short-Term Memory and Other Gated RNNs . . . . . . 410
10.11 Optimization for Long-Term Dependencies . . . . . . . . . . . . . 414
10.12 Explicit Memory . . . . . . . . . . . . . . . . . . . . . . . . . . . 418
11 Practical Methodology . . . . . . . . . . . . . . . . . . . . . . . . . 423
11.1 Performance Metrics . . . . . . . . . . . . . . . . . . . . . . . . . 424
11.2 Default Baseline Models . . . . . . . . . . . . . . . . . . . . . . . 427
11.3 Determining Whether to Gather More Data . . . . . . . . . . . . 428
11.4 Selecting Hyperparameters . . . . . . . . . . . . . . . . . . . . . . 429
11.5 Debugging Strategies . . . . . . . . . . . . . . . . . . . . . . . . . 438
11.6 Example: Multi-Digit Number Recognition . . . . . . . . . . . . . 442
12 Applications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 445
12.1 Large Scale Deep Learning . . . . . . . . . . . . . . . . . . . . . . 445
12.2 Computer Vision . . . . . . . . . . . . . . . . . . . . . . . . . . . 454
12.3 Speech Recognition . . . . . . . . . . . . . . . . . . . . . . . . . . 460
12.4 Natural Language Processing . . . . . . . . . . . . . . . . . . . . 463
12.5 Other Applications . . . . . . . . . . . . . . . . . . . . . . . . . . 479
III Deep Learning Research . . . . . . . . . . . . . . . . . . . . . . . . 488
13 Linear Factor Models . . . . . . . . . . . . . . . . . . . . . . . . . . 491
13.1 Probabilistic PCA and Factor Analysis . . . . . . . . . . . . . . . 492
13.2 Independent Component Analysis (ICA) . . . . . . . . . . . . . . 493
13.3 Slow Feature Analysis . . . . . . . . . . . . . . . . . . . . . . . . 495
13.4 Sparse Coding . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 498
13.5 Manifold Interpretation of PCA . . . . . . . . . . . . . . . . . . . 501
14 Autoencoders . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 504
14.1 Undercomplete Autoencoders . . . . . . . . . . . . . . . . . . . . 505
14.2 Regularized Autoencoders . . . . . . . . . . . . . . . . . . . . . . 506
14.3 Representational Power, Layer Size and Depth . . . . . . . . . . . 510
14.4 Stochastic Encoders and Decoders . . . . . . . . . . . . . . . . . . 511
14.5 Denoising Autoencoders . . . . . . . . . . . . . . . . . . . . . . . 512
14.6 Learning Manifolds with Autoencoders . . . . . . . . . . . . . . . 517
14.7 Contractive Autoencoders . . . . . . . . . . . . . . . . . . . . . . 523
14.8 Predictive Sparse Decomposition . . . . . . . . . . . . . . . . . . 525
14.9 Applications of Autoencoders . . . . . . . . . . . . . . . . . . . . 526
15 Representation Learning . . . . . . . . . . . . . . . . . . . . . . . . 528
15.1 Greedy Layer-Wise Unsupervised Pretraining . . . . . . . . . . . 530
15.2 Transfer Learning and Domain Adaptation . . . . . . . . . . . . . 538
15.3 Semi-Supervised Disentangling of Causal Factors . . . . . . . . . 543
15.4 Distributed Representation . . . . . . . . . . . . . . . . . . . . . . 548
15.5 Exponential Gains from Depth . . . . . . . . . . . . . . . . . . . 555
15.6 Providing Clues to Discover Underlying Causes . . . . . . . . . . 556
16 Structured Probabilistic Models for Deep Learning . . . . . . . . . . 560
16.1 The Challenge of Unstructured Modeling . . . . . . . . . . . . . . 561
16.2 Using Graphs to Describe Model Structure . . . . . . . . . . . . . 565
16.3 Sampling from Graphical Models . . . . . . . . . . . . . . . . . . 582
16.4 Advantages of Structured Modeling . . . . . . . . . . . . . . . . . 584
16.5 Learning about Dependencies . . . . . . . . . . . . . . . . . . . . 584
16.6 Inference and Approximate Inference . . . . . . . . . . . . . . . . 585
16.7 The Deep Learning Approach to Structured Probabilistic Models 586
17 Monte Carlo Methods . . . . . . . . . . . . . . . . . . . . . . . . . . 592
17.1 Sampling and Monte Carlo Methods . . . . . . . . . . . . . . . . 592
17.2 Importance Sampling . . . . . . . . . . . . . . . . . . . . . . . . . 594
17.3 Markov Chain Monte Carlo Methods . . . . . . . . . . . . . . . . 597
17.4 Gibbs Sampling . . . . . . . . . . . . . . . . . . . . . . . . . . . . 601
17.5 The Challenge of Mixing between Separated Modes . . . . . . . . 601
18 Confronting the Partition Function . . . . . . . . . . . . . . . . . . 607
18.1 The Log-Likelihood Gradient . . . . . . . . . . . . . . . . . . . . 608
18.2 Stochastic Maximum Likelihood and Contrastive Divergence . . . 609
18.3 Pseudolikelihood . . . . . . . . . . . . . . . . . . . . . . . . . . . 617
18.4 Score Matching and Ratio Matching . . . . . . . . . . . . . . . . 619
18.5 Denoising Score Matching . . . . . . . . . . . . . . . . . . . . . . 621
18.6 Noise-Contrastive Estimation . . . . . . . . . . . . . . . . . . . . 622
18.7 Estimating the Partition Function . . . . . . . . . . . . . . . . . . 625
19 Approximate Inference . . . . . . . . . . . . . . . . . . . . . . . . . 633
19.1 Inference as Optimization . . . . . . . . . . . . . . . . . . . . . . 635
19.2 Expectation Maximization . . . . . . . . . . . . . . . . . . . . . . 636
19.3 MAP Inference and Sparse Coding . . . . . . . . . . . . . . . . . 637
19.4 Variational Inference and Learning . . . . . . . . . . . . . . . . . 640
19.5 Learned Approximate Inference . . . . . . . . . . . . . . . . . . . 653
20 Deep Generative Models . . . . . . . . . . . . . . . . . . . . . . . . 656
20.1 Boltzmann Machines . . . . . . . . . . . . . . . . . . . . . . . . . 656
20.2 Restricted Boltzmann Machines . . . . . . . . . . . . . . . . . . . 658
20.3 Deep Belief Networks . . . . . . . . . . . . . . . . . . . . . . . . . 662
20.4 Deep Boltzmann Machines . . . . . . . . . . . . . . . . . . . . . . 665
20.5 Boltzmann Machines for Real-Valued Data . . . . . . . . . . . . . 678
20.6 Convolutional Boltzmann Machines . . . . . . . . . . . . . . . . . 685
20.7 Boltzmann Machines for Structured or Sequential Outputs . . . . 687
20.8 Other Boltzmann Machines
. . . . . . . . . . . . . . . . . . . . . 688
20.9 Back-Propagation through Random Operations . . . . . . . . . . 689
20.10 Directed Generative Nets . . . . . . . . . . . . . . . . . . . . . . . 694
20.11 Drawing Samples from Autoencoders . . . . . . . . . . . . . . . . 712
20.12 Generative Stochastic Networks . . . . . . . . . . . . . . . . . . . 716
20.13 Other Generation Schemes . . . . . . . . . . . . . . . . . . . . . . 717
20.14 Evaluating Generative Models . . . . . . . . . . . . . . . . . . . . 719
20.15 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 721
Bibliography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 723
Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 780
Website
www.deeplearningbook.org
This book is accompanied by the above website. The website provides a
variety of supplementary material, including exercises, lecture slides, corrections of
mistakes, and other resources that should be useful to both readers and instructors.