Neural Networks: Tricks of the Trade.pdf

Published: 2022-06-13 | Uploaded by: admin | Category: Manuals | File size: 11.68M | Format: pdf

Preview: pages 1 to 8 of 753 (page images omitted).

Title

Preface

Table of Contents

Introduction

Speeding Learning

Regularization Techniques to Improve Generalization

Efficient BackProp

Introduction

Learning and Generalization

Standard Backpropagation

A Few Practical Tricks

Stochastic versus Batch Learning

Shuffling the Examples

Normalizing the Inputs

The Sigmoid

Choosing Target Values

Initializing the Weights

Choosing Learning Rates

Radial Basis Functions vs Sigmoid Units

Convergence of Gradient Descent

A Little Theory

Examples

Input Transformations and Error Surface Transformations Revisited

Classical Second Order Optimization Methods

Newton Algorithm

Conjugate Gradient

Quasi-Newton (BFGS)

Gauss-Newton and Levenberg Marquardt

Tricks to Compute the Hessian Information in Multilayer Networks

Finite Difference

Square Jacobian Approximation for the Gauss-Newton and Levenberg-Marquardt Algorithms

Backpropagating Second Derivatives

Backpropagating the Diagonal Hessian in Neural Nets

Computing the Product of the Hessian and a Vector

Analysis of the Hessian in Multi-layer Networks

Applying Second Order Methods to Multilayer Networks

A Stochastic Diagonal Levenberg Marquardt Method

Computing the Principal Eigenvalue/Vector of the Hessian

Discussion and Conclusion

References

Regularization Techniques to Improve Generalization

Early Stopping — But When?

Early Stopping Is Not Quite as Simple

Why Early Stopping?

The Basic Early Stopping Technique

The Uglyness of Reality

How to Do Early Stopping Best

Some Classes of Stopping Criteria

The Trick: Criterion Selection Rules

Where and How Well Does This Trick Work?

Concrete Questions

Experimental Setup

Experiment Results

Discussion: Answers to the Questions

Generalization of These Results

Why This Works

References

A Simple Trick for Estimating the Weight Decay Parameter

Introduction

Ill-Posed Problems, Regularization, and Such Things...

Ill-Posed Problems

Regularization

Bias and Variance

Bayesian Framework

Weight Decay

Early Stopping

Estimating

Search Estimates

Two Early Stopping Estimates

Experiments

Data Sets

Experimental Procedure

Quality of the Estimates

Weight Decay versus Early Stopping Committees

Conclusions

References

Controlling the Hyperparameter Search in MacKay’s Bayesian Neural Network Framework

Introduction

Hyperparameter Updates

Difficulties with Using the Update Formulas

Control Strategies

Choosing When to Update Hyperparameters

Dealing with Out-of-Bounds Estimates of Numbers of Well-Determined Parameters

Further Generally Applicable Strategies

Experimental Setup

Targets for “Good” Performance

Network Architecture and Training

Effectiveness of Control Strategies

Relationship of Test Set Error to Evidence

Conclusion

References

Adaptive Regularization in Neural Network Modeling

Introduction

Training and Generalization

Adapting Regularization Parameters

Numerical Experiments

Potentials and Limitations in the Approach

Classification

Time Series Prediction

Conclusions

References

Large Ensemble Averaging

Introduction

Extrapolation to Large-Ensemble Averages

Application to the Sunspots Problem

Best Result

Theoretical Analysis

References

Improving Network Models and Algorithmic Tricks

Square Unit Augmented, Radially Extended, Multilayer Perceptrons

Introduction and Motivation

The Trick: A SQUARE-MLP

Example Applications

Hill-Plateau Function Approximation

Two-Spirals Classification

Vowel Classification

Theoretical Justification

Intuitive and Topological Justification

Conclusions

References

A Dozen Tricks with Multitask Learning

Introduction to Multitask Learning in Backprop Nets

Single and Multitask Learning of Task 1

Results

Discussion

Tricks for Using Multitask Learning in the Real World

Using the Future to Predict the Present

Multiple Metrics

Multiple Output Representations

Time Series Prediction

Using Non-operational Features

Using Extra Tasks to Focus Attention

Hints: Tasks Hand-Crafted by a Domain Expert

Handling other Categories in Classification

Sequential Transfer

Similar Tasks With Different Data Distributions

Learning with Hierarchical Data

Some Inputs Work Better as Outputs

Getting the Most Out of MTL

Use Large Hidden Layers

Do Early Stopping for Each Task Separately

Use Different Learning Rates for Different Tasks

Use a Private Hidden Layer for the Main Task

Chapter Summary

References

Solving the Ill-Conditioning in Neural Network Learning

Introduction

The Learning Process

Learning Methodology

Condition of the Learning Problem

What Causes the Singularities

Definition of Minimum

Local Minima are Caused by BackPropagation

A New Neural Network Structure

Influence on the Approximation Error E

M and the Universal Approximation Theorems

Example

Applications

Conclusion

References

Centering Neural Network Gradient Factors

Introduction

Centered Backpropagation

Activity Propagation

Weight Modification

Error Backpropagation

Implementation Techniques

A Priori Methods

Adaptive Methods

Empirical Results

Setup of Experiments

Symmetry Detection Problem

Vowel Recognition Problem

Discussion

References

Avoiding Roundoff Error in Backpropagating Derivatives

Introduction

Roundoff Error in Sigmoid Units

Sum-Squared Error Computations

Single Logistic-Output Cross-Entropy Computations

Other Approaches to Avoiding Zero-Derivatives with the Logistic Function

Softmax and Cross-Entropy Computations

Roundoff Error in Tanh Units

Why Bother?

References

Representing and Incorporating Prior Knowledge in Neural Network Training

Transformation Invariance in Pattern Recognition – Tangent Distance and Tangent Propagation

Introduction

Memory Based Algorithms

Learned-Function Algorithms

Tangent Distance

Implementation

Some Illustrative Results

How to Make Tangent Distance Work

Tangent Propagation

Local Rule

Results

How to Make Tangent Prop Work

Tangent Vectors

Lie Groups and Lie Algebras

Tangent Vectors

Important Transformations in Image Processing

Conclusion

References

Combining Neural Networks and Context-Driven Search for On-line, Printed Handwriting Recognition in the Newton

Introduction

System Overview

Tentative Segmentation

Character Classification

Representation

Architecture

Normalizing Output Error

Negative Training

Stroke Warping

Frequency Balancing

Error Emphasis

Annealing

Quantized Weights

Context-Driven Search

Lexical Context

Geometric Context

Integration with Word Segmentation

Discussion

Future Extensions

References

Neural Network Classification and Prior Class Probabilities

Introduction

The Trick

Prior Scaling

Probabilistic Sampling

Post Scaling

Equalizing Class Membership

Experimental Results

Performance Measures

ECG Classification Problem

Explanation

Convergence and Representation Issues

Overlapping Distributions

Limitations

A Posteriori Proofs

Conclusions

References

Applying Divide and Conquer to Large Scale Pattern Recognition Tasks

Introduction

Hierarchical Classification

Decomposition of Posterior Probabilities

Hierarchical Interpretation

Estimation of Conditional Node Posteriors

Classifier Tree Design

Optimality

Prior Knowledge

Confusion Matrices

Agglomerative Clustering

Application to Speech Recognition

Statistical Speech Recognition

Emission and Transition Modeling

Phonetic Context Modeling

Connectionist Acoustic Modeling

ACID Clustering

Training Hierarchies of Neural Networks on Large Datasets

Conclusions

References

Tricks for Time Series

Forecasting the Economy with Neural Nets: A Survey of Challenges and Solutions

Challenges of Macroeconomic Forecasting

A Survey of Neural Network Solutions

Smoothing Regularizers for Better Generalization

Model Selection and Interpretation

Improving Forecasts via Architecture and Input Selection

Architecture Selection via the Prediction Risk

Estimation of Prediction Risk

Algebraic Estimates of Prediction Risk

NCV: Cross-Validation for Nonlinear Models

Pruning Inputs via Directed Search and Sensitivity Analysis

Empirical Example

Gaining Economic Understanding through Model Visualization

Discussion

References

How to Train Neural Networks

Introduction

Preprocessing

Architectures

Net Internal Preprocessing by a Diagonal Connector

Net Internal Preprocessing by a Bottleneck Network

Squared Inputs

Interaction Layer

Averaging

Regularization by Random Targets

An Integrated Network Architecture for Forecasting Problems

Cost Functions

Robust Estimation with LnCosh

Robust Estimation with CDEN

Error Bar Estimation with CDEN

Data Meets Structure

The Observer-Observation Dilemma

Learning Reviewed

Parameter Noise as an Implicit Penalty Function

Cleaning Reviewed

Data Noise Reviewed

Cleaning with Noise

A Unifying Approach: The Separation of Structure and Noise

Architectural Optimization

Node-Pruning

Weight-Pruning

The Training Procedure

Training Paradigms: Early vs. Late Stopping

Setup Steps

Learning: Generation of Structural Hypothesis

Pruning: Falsification of the generated Structure

Final Stopping Criteria of the Training

Diagram of the Training Procedure

Experiments

Conclusion

References

Big Learning in Deep Neural Networks

Big Learning and Deep Neural Networks

Stochastic Gradient Descent Tricks

Introduction

What Is Stochastic Gradient Descent?

Gradient Descent

Stochastic Gradient Descent

The Convergence of Stochastic Gradient Descent

When to Use Stochastic Gradient Descent?

The Trade-Offs of Large Scale Learning

Asymptotic Analysis of the Large-Scale Case

General Recommendations

Preparing the Data

Monitoring and Debugging

Linear Models with L2 Regularization

Sparsity

Learning Rates

Averaged Stochastic Gradient Descent

Experiments

Conclusion

References

Practical Recommendations for Gradient-Based Training of Deep Architectures

Introduction

Deep Learning and Greedy Layer-Wise Pretraining

Denoising and Contractive Auto-encoders

Online Learning and Optimization of Generalization Error

Gradients

Gradient Descent and Learning Rate

Gradient Computation and Automatic Differentiation

Hyper-parameters

Neural Network Hyper-parameters

Hyper-parameters of the Model and Training Criterion

Manual Search and Grid Search

Random Sampling of Hyper-parameters

Debugging and Analysis

Gradient Checking and Controlled Overfitting

Visualizations and Statistics

Other Recommendations

Multi-core Machines, BLAS and GPUs

Sparse High-Dimensional Inputs

Symbolic Variables, Embeddings, Multi-task Learning and Multi-relational Learning

Open Questions

On the Added Difficulty of Training Deeper Architectures

Adaptive Learning Rates and Second-Order Methods

Conclusion

References

Training Deep and Recurrent Networks with Hessian-Free Optimization

Introduction

Feedforward Neural Networks

Recurrent Neural Networks

Hessian-Free Optimization Basics

Exact Multiplication by the Hessian

The Generalized Gauss-Newton Matrix

Multiplying by the Gauss-Newton Matrix

Typical Losses

Dealing with Non-convex Losses

Implementation Details

Efficiency via Parallelism

Verifying the Correctness of G Products

Damping

Tikhonov Damping

Problems with Tikhonov Damping

Scale-Sensitive Damping

Structural Damping

The Levenberg-Marquardt Heuristic

Trust-Region Methods

CG Truncation as Damping

Line Searching

Convergence of CG

Initializing CG

Preconditioning

The Effects of Preconditioning

Designing a Good Preconditioner

The Empirical Fisher Diagonal

An Unbiased Estimator for the Diagonal of G

Minibatching

Higher Quality Gradient Estimates

Minibatch Overfitting and Methods to Combat It

Tricks and Recipes

Summary

References

Implementing Neural Networks Efficiently

Efficient Environment

Scripting Language

Multi-purpose Efficient N-Dimensional Tensor Object

Modular Neural Networks

Additional Torch7 Packages

Efficient Runtime Execution

Float or Double Representations

Memory Allocation Control

BLAS/LAPACK Interfaces

SIMD Instructions

Ordering Memory Accesses

OpenMP Support

CUDA Support

Benchmarks

Efficient Optimization Heuristics

Conclusion

References

Better Representations: Invariant, Disentangled and Reusable

Learning Feature Representations with K-Means

Introduction

Data, Pre-processing and Initialization

Pre-processing

Initialization

Comparison to Sparse Feature Learning

Application to Image Recognition

Parameters

Encoders

Local Receptive Fields and Multiple Layers

Deep Networks

Conclusion

References

Deep Big Multilayer Perceptrons for Digit Recognition

Introduction

Data

Architectures

Deforming Images to Get More Training Instances

Forming a Committee

Using the GPU to Train Deep MLPs

Single MLP

Committee of MLP

Discussion

References

A Practical Guide to Training Restricted Boltzmann Machines

Introduction

An Overview of Restricted Boltzmann Machines and Contrastive Divergence

How to Collect Statistics When Using Contrastive Divergence

Updating the Hidden States

Updating the Visible States

Collecting the Statistics Needed for Learning

A Recipe for Getting the Learning Signal for CD1

The Size of a Mini-batch

A Recipe for Dividing the Training Set into Mini-batches

Monitoring the Progress of Learning

A Recipe for Using the Reconstruction Error

Monitoring the Overfitting

A Recipe for Monitoring the Overfitting

The Learning Rate

A Recipe for Setting the Learning Rates for Weights and Biases

The Initial Values of the Weights and Biases

A Recipe for Setting the Initial Values of the Weights and Biases

Momentum

A Recipe for Using Momentum

Weight-Decay

A Recipe for Using Weight-Decay

Encouraging Sparse Hidden Activities

A Recipe for Sparsity

The Number of Hidden Units

A Recipe for Choosing the Number of Hidden Units

Different Types of Unit

Softmax and Multinomial Units

Gaussian Visible Units

Gaussian Visible and Hidden Units

Binomial Units

Rectified Linear Units

Varieties of Contrastive Divergence

Displaying What Is Happening during Learning

Using RBM's for Discrimination

Computing the Free Energy of a Visible Vector

Dealing with Missing Values

References

Deep Boltzmann Machines and the Centering Trick

Introduction

Boltzmann Machines

Deep Boltzmann Machines

Training Boltzmann Machines

The Centering Trick

Understanding the Centering Trick

Evaluating Boltzmann Machines

Discriminative Analysis

Generative Analysis

Experiments

Conclusion

References

Deep Learning via Semi-supervised Embedding

Introduction

Semi-supervised Embedding

Embedding Algorithms

Semi-supervised Algorithms

Semi-supervised Embedding for Deep Learning

Labeling Unlabeled Data as Neighbors (Building the Graph)

When Do We Expect This Approach to Work?

Why Is This Approach Good?

Experimental Evaluation

Small-Scale Experiments

MNIST Experiments

Deeper MNIST Experiments

Semantic Role Labeling

Object Recognition Using Unlabeled Video

Conclusion

References

Identifying Dynamical Systems for Forecasting and Control

A Practical Guide to Applying Echo State Networks

Introduction

The Basic Model

Producing a Reservoir

Function of the Reservoir

Global Parameters of the Reservoir

Practical Approach to Reservoir Production

Pointers to Reservoir Extensions

Training Readouts

Ridge Regression

Regularization

Large Datasets

Direct Pseudoinverse Solution

Initial Transient

Regression Weighting

Readouts for Classification

Online Learning

Pointers to Readouts Extensions

Dealing with Output Feedbacks

Output Feedbacks

Teacher Forcing

Online Learning with Real Feedbacks

Summary and Implementations

References

Forecasting with Recurrent Neural Networks: 12 Tricks

Introduction

Tricks for Recurrent Neural Networks

Conclusion and Outlook

References

Solving Partially Observable Reinforcement Learning Problems with Recurrent Neural Networks

Introduction

Background

The Trick of Modeling a Markovian State Space Using a Recurrent Neural Network

Improving the Generalization Capabilities with Respect to Actions

A Recipe to Improve the Modeling of State Transitions

Scaling of Inputs and Targets

Block Validation

Removal of Invalid Data Patterns

Learning Settings

Double Rest Learning

A Recipe to Generate an Efficient State Estimation Function

Application of a Neural State Estimator

The Markov Decision Process Extraction Network

Reward Function Design Influences the Performance of a State Estimator

Choosing the Forecast Horizon of a State Estimator

The Trick of Addressing Long Term Dependencies

A Recipe to Find a Good Shortcut Length

Experiments on Long Term Dependencies Problems

Conclusion

References

10 Steps and Some Tricks to Set up Neural Reinforcement Controllers

Overview

The Reinforcement Learning Framework

Learning in Markovian Decision Processes

Q-Learning with Function Approximation

Characteristics of the Control Task

Modeling the Learning Task

State Information

Actions

Choice of Control Interval t

The Terminal Goal State and The Non-terminal Goal State Setting

Choice of X+

Choice of X-

Choice of Immediate and Final Costs

Discounting

Choice of X0

Choice of the Maximal Episode Length N

Tricks

Scaling the Input Values

The X++-Trick

Artificial Training Transitions

Growing Batch

Training the Neural Q-Function

Exploration

Delays

Experiments

The Control Task

Modeling as a Learning Task

Applied Tricks

Measuring Quality

Results on the Simulated Cart Pole

Results on the Real Cart Pole

Conclusion

References

Author Index

Subject Index

Lecture Notes in Computer Science 7700
Commenced Publication in 1973
Founding and Former Series Editors: Gerhard Goos, Juris Hartmanis, and Jan van Leeuwen

Editorial Board
David Hutchison, Lancaster University, UK
Takeo Kanade, Carnegie Mellon University, Pittsburgh, PA, USA
Josef Kittler, University of Surrey, Guildford, UK
Jon M. Kleinberg, Cornell University, Ithaca, NY, USA
Alfred Kobsa, University of California, Irvine, CA, USA
Friedemann Mattern, ETH Zurich, Switzerland
John C. Mitchell, Stanford University, CA, USA
Moni Naor, Weizmann Institute of Science, Rehovot, Israel
Oscar Nierstrasz, University of Bern, Switzerland
C. Pandu Rangan, Indian Institute of Technology, Madras, India
Bernhard Steffen, TU Dortmund University, Germany
Madhu Sudan, Microsoft Research, Cambridge, MA, USA
Demetri Terzopoulos, University of California, Los Angeles, CA, USA
Doug Tygar, University of California, Berkeley, CA, USA
Gerhard Weikum, Max Planck Institute for Informatics, Saarbruecken, Germany

Grégoire Montavon, Geneviève B. Orr, Klaus-Robert Müller (Eds.)
Neural Networks: Tricks of the Trade
Second Edition

Volume Editors

Grégoire Montavon
Technische Universität Berlin, Department of Computer Science
Franklinstr. 28/29, 10587 Berlin, Germany
E-mail: gregoire.montavon@tu-berlin.de

Geneviève B. Orr
Willamette University, Department of Computer Science
900 State Street, Salem, OR 97301, USA
E-mail: gorr@willamette.edu

Klaus-Robert Müller
Technische Universität Berlin, Department of Computer Science
Franklinstr. 28/29, 10587 Berlin, Germany
and
Korea University, Department of Brain and Cognitive Engineering
Anam-dong, Seongbuk-gu, Seoul 136-713, Korea
E-mail: klaus-robert.mueller@tu-berlin.de

ISSN 0302-9743
e-ISSN 1611-3349
ISBN 978-3-642-35288-1
e-ISBN 978-3-642-35289-8
DOI 10.1007/978-3-642-35289-8
Springer Heidelberg Dordrecht London New York
Library of Congress Control Number: 2012952591
CR Subject Classification (1998): F.1, I.2.6, I.5.1, C.1.3, F.2, J.3
LNCS Sublibrary: SL 1 – Theoretical Computer Science and General Issues

© Springer-Verlag Berlin Heidelberg 1998, 2012

This work is subject to copyright. All rights are reserved, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, re-use of illustrations, recitation, broadcasting, reproduction on microfilms or in any other way, and storage in data banks. Duplication of this publication or parts thereof is permitted only under the provisions of the German Copyright Law of September 9, 1965, in its current version, and permission for use must always be obtained from Springer. Violations are liable to prosecution under the German Copyright Law.

The use of general descriptive names, registered names, trademarks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use.

Typesetting: Camera-ready by author, data conversion by Scientific Publishing Services, Chennai, India
Printed on acid-free paper
Springer is part of Springer Science+Business Media (www.springer.com)

Preface to the Second Edition

There have been substantial changes in the field of neural networks since the first edition of this book in 1998. Some of them have been driven by external factors such as the increase of available data and computing power. The Internet made public massive amounts of labeled and unlabeled data. The ever-increasing raw mass of user-generated and sensed data is made easily accessible by databases and Web crawlers. Nowadays, anyone having an Internet connection can parse the 4,000,000+ articles available on Wikipedia and construct a dataset out of them. Anyone can capture a Web TV stream and obtain days of video content to test their learning algorithm.

Another development is the amount of available computing power, which has continued to rise at a steady rate owing to progress in hardware design and engineering. While the number of cycles per second of processors has plateaued due to physical limitations, the slow-down has been offset by the emergence of processing parallelism, best exemplified by the massively parallel graphics processing units (GPUs). Nowadays, everybody can buy a GPU board (usually already available in consumer-grade laptops), install free GPU software, and run computation-intensive simulations at low cost.

These developments have raised the following question: Can we make use of this large computing power to make sense of these increasingly complex datasets? Neural networks are a promising approach, as they have the intrinsic modeling capacity and flexibility to represent the solution. Their intrinsically distributed nature allows one to leverage the massively parallel computing resources.

During the last two decades, the focus of neural network research and the practice of training neural networks underwent important changes. Learning in deep architectures (or “deep learning”) has to a certain degree displaced the once more prevalent regularization issues, or more precisely, changed the practice of regularizing neural networks. Use of unlabeled data via unsupervised layer-wise pretraining or deep unsupervised embeddings is now often preferred over traditional regularization schemes such as weight decay or restricted connectivity. This new paradigm has started to spread over a large number of applications such as image recognition, speech recognition, natural language processing, complex systems, neuroscience, and computational physics.

The second edition of the book reloads the first edition with more tricks. These tricks arose from 14 years of theory and experimentation (from 1998 to 2012) by some of the world’s most prominent neural networks researchers. These tricks can make a substantial difference (in terms of speed, ease of implementation, and accuracy) when it comes to putting algorithms to work on real problems. Tricks may not necessarily have solid theoretical foundations or formal validation. As Yoshua Bengio states in Chap. 19, “the wisdom distilled here should be taken as a guideline, to be tried and challenged, not as a practice set in stone” [1].

The second part of the new edition starts with tricks to optimize neural networks faster and make more efficient use of the potentially infinite stream of data presented to them. Chapter 18 [2] shows that a simple stochastic gradient descent (learning one example at a time) is suited for training most neural networks. Chapter 19 [1] introduces a large number of tricks and recommendations for training feed-forward neural networks and choosing the multiple hyper-parameters. When the representation built by the neural network is highly sensitive to small parameter changes, for example, in recurrent neural networks, second-order methods based on mini-batches such as those presented in Chap. 20 [9] can be a better choice. The seemingly simple optimization procedures presented in these chapters require their fair share of tricks in order to work optimally. The software Torch7 presented in Chap. 21 [5] provides a fast and modular implementation of these neural networks.

The novel second part of this volume continues with tricks to incorporate invariance into the model. In the context of image recognition, Chap. 22 [4] shows that translation invariance can be achieved by learning a k-means representation of image patches and spatially pooling the k-means activations. Chapter 23 [3] shows that invariance can be injected directly in the input space in the form of elastic distortions. Unlabeled data are ubiquitous and using them to capture regularities in data is an important component of many learning algorithms. For example, we can learn an unsupervised model of data as a first step, as discussed in Chaps. 24 [7] and 25 [10], and feed the unsupervised representation to a supervised classifier. Chapter 26 [12] shows that similar improvements can be obtained by learning an unsupervised embedding in the deep layers of a neural network, with added flexibility.

The book concludes with the application of neural networks to modeling time series and optimal control systems. Modeling time series can be done using a very simple technique discussed in Chap. 27 [8] that consists of fitting a linear model on top of a “reservoir” that implements a rich set of time series primitives. Chapter 28 [13] offers an alternative to the previous method by directly identifying the underlying dynamical system that generates the time series data. Chapter 29 [6] presents how these system identification techniques can be used to identify a Markov decision process from the observation of a control system (a sequence of states and actions in the reinforcement learning terminology). Chapter 30 [11] concludes by showing how the control system can be dynamically improved by fitting a neural network as the control system explores the space of states and actions.

The book intends to provide a timely snapshot of tricks, theory, and algorithms that are of use. Our hope is that some of the chapters of the new second edition will become our companions when doing experimental work, eventually becoming classics, as some of the papers of the first edition have become. Eventually in some years, there may be an urge to reload again...

September 2012
Grégoire and Klaus

Acknowledgments. This work was supported by the World Class University Program through the National Research Foundation of Korea funded by the Ministry of Education, Science, and Technology, under Grant R31-10008. The editors also acknowledge partial support by DFG (MU 987/17-1).

References

[1] Bengio, Y.: Practical Recommendations for Gradient-based Training of Deep Architectures. In: Montavon, G., Orr, G.B., Müller, K.-R. (eds.) NN: Tricks of the Trade, 2nd edn. LNCS, vol. 7700, pp. 437–478. Springer, Heidelberg (2012)
[2] Bottou, L.: Stochastic Gradient Descent Tricks. In: Montavon, G., Orr, G.B., Müller, K.-R. (eds.) NN: Tricks of the Trade, 2nd edn. LNCS, vol. 7700, pp. 421–436. Springer, Heidelberg (2012)
[3] Ciresan, D.C., Meier, U., Gambardella, L.M., Schmidhuber, J.: Deep Big Multilayer Perceptrons for Digit Recognition. In: Montavon, G., Orr, G.B., Müller, K.-R. (eds.) NN: Tricks of the Trade, 2nd edn. LNCS, vol. 7700, pp. 581–598. Springer, Heidelberg (2012)
[4] Coates, A., Ng, A.Y.: Learning Feature Representations with k-means. In: Montavon, G., Orr, G.B., Müller, K.-R. (eds.) NN: Tricks of the Trade, 2nd edn. LNCS, vol. 7700, pp. 561–580. Springer, Heidelberg (2012)
[5] Collobert, R., Kavukcuoglu, K., Farabet, C.: Implementing Neural Networks Efficiently. In: Montavon, G., Orr, G.B., Müller, K.-R. (eds.) NN: Tricks of the Trade, 2nd edn. LNCS, vol. 7700, pp. 537–557. Springer, Heidelberg (2012)
[6] Duell, S., Udluft, S., Sterzing, V.: Solving Partially Observable Reinforcement Learning Problems with Recurrent Neural Networks. In: Montavon, G., Orr, G.B., Müller, K.-R. (eds.) NN: Tricks of the Trade, 2nd edn. LNCS, vol. 7700, pp. 709–733. Springer, Heidelberg (2012)
[7] Hinton, G.E.: A Practical Guide to Training Restricted Boltzmann Machines. In: Montavon, G., Orr, G.B., Müller, K.-R. (eds.) NN: Tricks of the Trade, 2nd edn. LNCS, vol. 7700, pp. 599–619. Springer, Heidelberg (2012)
[8] Lukoševičius, M.: A Practical Guide to Applying Echo State Networks. In: Montavon, G., Orr, G.B., Müller, K.-R. (eds.) NN: Tricks of the Trade, 2nd edn. LNCS, vol. 7700, pp. 659–686. Springer, Heidelberg (2012)
[9] Martens, J., Sutskever, I.: Training Deep and Recurrent Networks with Hessian-free Optimization. In: Montavon, G., Orr, G.B., Müller, K.-R. (eds.) NN: Tricks of the Trade, 2nd edn. LNCS, vol. 7700, pp. 479–535. Springer, Heidelberg (2012)
[10] Montavon, G., Müller, K.-R.: Deep Boltzmann Machines and the Centering Trick. In: Montavon, G., Orr, G.B., Müller, K.-R. (eds.) NN: Tricks of the Trade, 2nd edn. LNCS, vol. 7700, pp. 621–637. Springer, Heidelberg (2012)
[11] Riedmiller, M.: 10 Steps and Some Tricks to Set Up Neural Reinforcement Controllers. In: Montavon, G., Orr, G.B., Müller, K.-R. (eds.) NN: Tricks of the Trade, 2nd edn. LNCS, vol. 7700, pp. 735–757. Springer, Heidelberg (2012)
[12] Weston, J., Ratle, F., Collobert, R.: Deep Learning Via Semi-supervised Embedding. In: Montavon, G., Orr, G.B., Müller, K.-R. (eds.) NN: Tricks of the Trade, 2nd edn. LNCS, vol. 7700, pp. 639–655. Springer, Heidelberg (2012)
[13] Zimmermann, H.-G., Tietz, C., Grothmann, R.: Forecasting with Recurrent Neural Networks: 12 Tricks. In: Montavon, G., Orr, G.B., Müller, K.-R. (eds.) NN: Tricks of the Trade, 2nd edn. LNCS, vol. 7700, pp. 687–707. Springer, Heidelberg (2012)

Table of Contents

Introduction ..... 1

Speeding Learning

Preface ..... 7
1. Efficient BackProp ..... 9
   Yann LeCun, Léon Bottou, Genevieve B. Orr, and Klaus-Robert Müller

Regularization Techniques to Improve Generalization

Preface ..... 49
2. Early Stopping – But When? ..... 53
   Lutz Prechelt
3. A Simple Trick for Estimating the Weight Decay Parameter ..... 69
   Thorsteinn S. Rögnvaldsson
4. Controlling the Hyperparameter Search in MacKay’s Bayesian Neural Network Framework ..... 91
   Tony Plate
5. Adaptive Regularization in Neural Network Modeling ..... 111
   Jan Larsen, Claus Svarer, Lars Nonboe Andersen, and Lars Kai Hansen
6. Large Ensemble Averaging ..... 131
   David Horn, Ury Naftaly, and Nathan Intrator

Improving Network Models and Algorithmic Tricks

Preface ..... 139
7. Square Unit Augmented, Radially Extended, Multilayer Perceptrons ..... 143
   Gary William Flake
8. A Dozen Tricks with Multitask Learning ..... 163
   Rich Caruana
9. Solving the Ill-Conditioning in Neural Network Learning ..... 191
   Patrick van der Smagt and Gerd Hirzinger
10. Centering Neural Network Gradient Factors ..... 205
    Nicol N. Schraudolph
11. Avoiding Roundoff Error in Backpropagating Derivatives ..... 225
    Tony Plate

Representing and Incorporating Prior Knowledge in Neural Network Training

Preface ..... 231
12. Transformation Invariance in Pattern Recognition – Tangent Distance and Tangent Propagation ..... 235
    Patrice Y. Simard, Yann A. LeCun, John S. Denker, and Bernard Victorri
13. Combining Neural Networks and Context-Driven Search for On-line, Printed Handwriting Recognition in the Newton ..... 271
    Larry S. Yaeger, Brandyn J. Webb, and Richard F. Lyon
14. Neural Network Classification and Prior Class Probabilities ..... 295
    Steve Lawrence, Ian Burns, Andrew Back, Ah Chung Tsoi, and C. Lee Giles
15. Applying Divide and Conquer to Large Scale Pattern Recognition Tasks ..... 311
    Jürgen Fritsch and Michael Finke

Tricks for Time Series

Preface ..... 339
16. Forecasting the Economy with Neural Nets: A Survey of Challenges and Solutions ..... 343
    John Moody
17. How to Train Neural Networks ..... 369
    Ralph Neuneier and Hans Georg Zimmermann

Big Learning in Deep Neural Networks

Preface ..... 419
18. Stochastic Gradient Descent Tricks ..... 421
    Léon Bottou
19. Practical Recommendations for Gradient-Based Training of Deep Architectures ..... 437
    Yoshua Bengio
20. Training Deep and Recurrent Networks with Hessian-Free Optimization ..... 479
    James Martens and Ilya Sutskever
21. Implementing Neural Networks Efficiently ..... 537
    Ronan Collobert, Koray Kavukcuoglu, and Clément Farabet
