Supervised Sequence Labelling with Recurrent Neural Networks

Alex Graves
Contents

List of Tables
List of Figures
List of Algorithms

1 Introduction
  1.1 Structure of the Book

2 Supervised Sequence Labelling
  2.1 Supervised Learning
  2.2 Pattern Classification
    2.2.1 Probabilistic Classification
    2.2.2 Training Probabilistic Classifiers
    2.2.3 Generative and Discriminative Methods
  2.3 Sequence Labelling
    2.3.1 Sequence Classification
    2.3.2 Segment Classification
    2.3.3 Temporal Classification

3 Neural Networks
  3.1 Multilayer Perceptrons
    3.1.1 Forward Pass
    3.1.2 Output Layers
    3.1.3 Loss Functions
    3.1.4 Backward Pass
  3.2 Recurrent Neural Networks
    3.2.1 Forward Pass
    3.2.2 Backward Pass
    3.2.3 Unfolding
    3.2.4 Bidirectional Networks
    3.2.5 Sequential Jacobian
  3.3 Network Training
    3.3.1 Gradient Descent Algorithms
    3.3.2 Generalisation
    3.3.3 Input Representation
    3.3.4 Weight Initialisation

4 Long Short-Term Memory
  4.1 Network Architecture
  4.2 Influence of Preprocessing
  4.3 Gradient Calculation
  4.4 Architectural Variants
  4.5 Bidirectional Long Short-Term Memory
  4.6 Network Equations
    4.6.1 Forward Pass
    4.6.2 Backward Pass

5 A Comparison of Network Architectures
  5.1 Experimental Setup
  5.2 Network Architectures
    5.2.1 Computational Complexity
    5.2.2 Range of Context
    5.2.3 Output Layers
  5.3 Network Training
    5.3.1 Retraining
  5.4 Results
    5.4.1 Previous Work
    5.4.2 Effect of Increased Context
    5.4.3 Weighted Error

6 Hidden Markov Model Hybrids
  6.1 Background
  6.2 Experiment: Phoneme Recognition
    6.2.1 Experimental Setup
    6.2.2 Results

7 Connectionist Temporal Classification
  7.1 Background
  7.2 From Outputs to Labellings
    7.2.1 Role of the Blank Labels
    7.2.2 Bidirectional and Unidirectional Networks
  7.3 Forward-Backward Algorithm
    7.3.1 Log Scale
  7.4 Loss Function
    7.4.1 Loss Gradient
  7.5 Decoding
    7.5.1 Best Path Decoding
    7.5.2 Prefix Search Decoding
    7.5.3 Constrained Decoding
  7.6 Experiments
    7.6.1 Phoneme Recognition 1
    7.6.2 Phoneme Recognition 2
    7.6.3 Keyword Spotting
    7.6.4 Online Handwriting Recognition
    7.6.5 Offline Handwriting Recognition
  7.7 Discussion

8 Multidimensional Networks
  8.1 Background
  8.2 Network Architecture
    8.2.1 Multidirectional Networks
    8.2.2 Multidimensional Long Short-Term Memory
  8.3 Experiments
    8.3.1 Air Freight Data
    8.3.2 MNIST Data
    8.3.3 Analysis

9 Hierarchical Subsampling Networks
  9.1 Network Architecture
    9.1.1 Subsampling Window Sizes
    9.1.2 Hidden Layer Sizes
    9.1.3 Number of Levels
    9.1.4 Multidimensional Networks
    9.1.5 Output Layers
    9.1.6 Complete System
  9.2 Experiments
    9.2.1 Offline Arabic Handwriting Recognition
    9.2.2 Online Arabic Handwriting Recognition
    9.2.3 French Handwriting Recognition
    9.2.4 Farsi/Arabic Character Classification
    9.2.5 Phoneme Recognition

Bibliography
Acknowledgements
List of Tables

5.1 Framewise phoneme classification results on TIMIT
5.2 Comparison of BLSTM with previous network
6.1 Phoneme recognition results on TIMIT
7.1 Phoneme recognition results on TIMIT with 61 phonemes
7.2 Folding the 61 phonemes in TIMIT onto 39 categories
7.3 Phoneme recognition results on TIMIT with 39 phonemes
7.4 Keyword spotting results on Verbmobil
7.5 Character recognition results on IAM-OnDB
7.6 Word recognition on IAM-OnDB
7.7 Word recognition results on IAM-DB
8.1 Classification results on MNIST
9.1 Networks for offline Arabic handwriting recognition
9.2 Offline Arabic handwriting recognition competition results
9.3 Networks for online Arabic handwriting recognition
9.4 Online Arabic handwriting recognition competition results
9.5 Network for French handwriting recognition
9.6 French handwriting recognition competition results
9.7 Networks for Farsi/Arabic handwriting recognition
9.8 Farsi/Arabic handwriting recognition competition results
9.9 Networks for phoneme recognition on TIMIT
9.10 Phoneme recognition results on TIMIT
List of Figures

2.1 Sequence labelling
2.2 Three classes of sequence labelling task
2.3 Importance of context in segment classification
3.1 A multilayer perceptron
3.2 Neural network activation functions
3.3 A recurrent neural network
3.4 An unfolded recurrent network
3.5 An unfolded bidirectional network
3.6 Sequential Jacobian for a bidirectional network
3.7 Overfitting on training data
3.8 Different Kinds of Input Perturbation
4.1 The vanishing gradient problem for RNNs
4.2 LSTM memory block with one cell
4.3 An LSTM network
4.4 Preservation of gradient information by LSTM
5.1 Various networks classifying an excerpt from TIMIT
5.2 Framewise phoneme classification results on TIMIT
5.3 Learning curves on TIMIT
5.4 BLSTM network classifying the utterance “one oh five”
7.1 CTC and framewise classification
7.2 Unidirectional and Bidirectional CTC Networks Phonetically Transcribing an Excerpt from TIMIT
7.3 CTC forward-backward algorithm
7.4 Evolution of the CTC error signal during training
7.5 Problem with best path decoding
7.6 Prefix search decoding
7.7 CTC outputs for keyword spotting on Verbmobil
7.8 Sequential Jacobian for keyword spotting on Verbmobil
7.9 BLSTM-CTC network labelling an excerpt from IAM-OnDB
7.10 BLSTM-CTC Sequential Jacobian from IAM-OnDB with raw inputs
7.11 BLSTM-CTC Sequential Jacobian from IAM-OnDB with preprocessed inputs
8.1 MDRNN forward pass
8.2 MDRNN backward pass
8.3 Sequence ordering of 2D data
8.4 Context available to a unidirectional two dimensional RNN
8.5 Axes used by the hidden layers in a multidirectional MDRNN
8.6 Context available to a multidirectional MDRNN
8.7 Frame from the Air Freight database
8.8 MNIST image before and after deformation
8.9 MDRNN applied to an image from the Air Freight database
8.10 Sequential Jacobian of an MDRNN for an image from MNIST
9.1 Information flow through an HSRNN
9.2 An unfolded HSRNN
9.3 Information flow through a multidirectional HSRNN
9.4 HSRNN applied to offline Arabic handwriting recognition
9.5 Offline Arabic word images
9.6 Offline Arabic error curves
9.7 Online Arabic input sequences
9.8 French word images
9.9 Farsi character images
9.10 Three representations of a TIMIT utterance
List of Algorithms

3.1 BRNN Forward Pass
3.2 BRNN Backward Pass
3.3 Online Learning with Gradient Descent
3.4 Online Learning with Gradient Descent and Weight Noise
7.1 Prefix Search Decoding
7.2 CTC Token Passing
8.1 MDRNN Forward Pass
8.2 MDRNN Backward Pass
8.3 Multidirectional MDRNN Forward Pass
8.4 Multidirectional MDRNN Backward Pass