Foreword
Preface
Contents
Acronyms
Symbols
1 Introduction
1.1 Automatic Speech Recognition: A Bridge for Better Communication
1.1.1 Human–Human Communication
1.1.2 Human–Machine Communication
1.2 Basic Architecture of ASR Systems
1.3 Book Organization
1.3.1 Part I: Conventional Acoustic Models
1.3.2 Part II: Deep Neural Networks
1.3.3 Part III: DNN-HMM Hybrid Systems for ASR
1.3.4 Part IV: Representation Learning in Deep Neural Networks
1.3.5 Part V: Advanced Deep Models
References
Part I Conventional Acoustic Models
2 Gaussian Mixture Models
2.1 Random Variables
2.2 Gaussian and Gaussian-Mixture Random Variables
2.3 Parameter Estimation
2.4 Mixture of Gaussians as a Model for the Distribution of Speech Features
References
3 Hidden Markov Models and the Variants
3.1 Introduction
3.2 Markov Chains
3.3 Hidden Markov Sequences and Models
3.3.1 Characterization of a Hidden Markov Model
3.3.2 Simulation of a Hidden Markov Model
3.3.3 Likelihood Evaluation of a Hidden Markov Model
3.3.4 An Algorithm for Efficient Likelihood Evaluation
3.3.5 Proofs of the Forward and Backward Recursions
3.4 EM Algorithm and Its Application to Learning HMM Parameters
3.4.1 Introduction to EM Algorithm
3.4.2 Applying EM to Learning the HMM: Baum-Welch Algorithm
3.5 Viterbi Algorithm for Decoding HMM State Sequences
3.5.1 Dynamic Programming and Viterbi Algorithm
3.5.2 Dynamic Programming for Decoding HMM States
3.6 The HMM and Variants for Generative Speech Modeling and Recognition
3.6.1 GMM-HMMs for Speech Modeling and Recognition
3.6.2 Trajectory and Hidden Dynamic Models for Speech Modeling and Recognition
3.6.3 The Speech Recognition Problem Using Generative Models of HMM and Its Variants
References
Part II Deep Neural Networks
4 Deep Neural Networks
4.1 The Deep Neural Network Architecture
4.2 Parameter Estimation with Error Backpropagation
4.2.1 Training Criteria
4.2.2 Training Algorithms
4.3 Practical Considerations
4.3.1 Data Preprocessing
4.3.2 Model Initialization
4.3.3 Weight Decay
4.3.4 Dropout
4.3.5 Batch Size Selection
4.3.6 Sample Randomization
4.3.7 Momentum
4.3.8 Learning Rate and Stopping Criterion
4.3.9 Network Architecture
4.3.10 Reproducibility and Restartability
References
5 Advanced Model Initialization Techniques
5.1 Restricted Boltzmann Machines
5.1.1 Properties of RBMs
5.1.2 RBM Parameter Learning
5.2 Deep Belief Network Pretraining
5.3 Pretraining with Denoising Autoencoder
5.4 Discriminative Pretraining
5.5 Hybrid Pretraining
5.6 Dropout Pretraining
References
Part III Deep Neural Network-Hidden Markov Model Hybrid Systems for Automatic Speech Recognition
6 Deep Neural Network-Hidden Markov Model Hybrid Systems
6.1 DNN-HMM Hybrid Systems
6.1.1 Architecture
6.1.2 Decoding with CD-DNN-HMM
6.1.3 Training Procedure for CD-DNN-HMMs
6.1.4 Effects of Contextual Window
6.2 Key Components in the CD-DNN-HMM and Their Analysis
6.2.1 Datasets and Baselines for Comparisons and Analysis
6.2.2 Modeling Monophone States or Senones
6.2.3 Deeper Is Better
6.2.4 Exploit Neighboring Frames
6.2.5 Pretraining
6.2.6 Better Alignment Helps
6.2.7 Tuning Transition Probability
6.3 Kullback-Leibler Divergence-Based HMM
References
7 Training and Decoding Speedup
7.1 Training Speedup
7.1.1 Pipelined Backpropagation Using Multiple GPUs
7.1.2 Asynchronous SGD
7.1.3 Augmented Lagrangian Methods and Alternating Directions Method of Multipliers
7.1.4 Reduce Model Size
7.1.5 Other Approaches
7.2 Decoding Speedup
7.2.1 Parallel Computation
7.2.2 Sparse Network
7.2.3 Low-Rank Approximation
7.2.4 Teach Small DNN with Large DNN
7.2.5 Multiframe DNN
References
8 Deep Neural Network Sequence-Discriminative Training
8.1 Sequence-Discriminative Training Criteria
8.1.1 Maximum Mutual Information
8.1.2 Boosted MMI
8.1.3 MPE/sMBR
8.1.4 A Unified Formulation
8.2 Practical Considerations
8.2.1 Lattice Generation
8.2.2 Lattice Compensation
8.2.3 Frame Smoothing
8.2.4 Learning Rate Adjustment
8.2.5 Training Criterion Selection
8.2.6 Other Considerations
8.3 Noise Contrastive Estimation
8.3.1 Casting Probability Density Estimation Problem as a Classifier Design Problem
8.3.2 Extension to Unnormalized Models
8.3.3 Apply NCE in DNN Training
References
Part IV Representation Learning in Deep Neural Networks
9 Feature Representation Learning in Deep Neural Networks
9.1 Joint Learning of Feature Representation and Classifier
9.2 Feature Hierarchy
9.3 Flexibility in Using Arbitrary Input Features
9.4 Robustness of Features
9.4.1 Robust to Speaker Variations
9.4.2 Robust to Environment Variations
9.5 Robustness Across All Conditions
9.5.1 Robustness Across Noise Levels
9.5.2 Robustness Across Speaking Rates
9.6 Lack of Generalization Over Large Distortions
References
10 Fuse Deep Neural Network and Gaussian Mixture Model Systems
10.1 Use DNN-Derived Features in GMM-HMM Systems
10.1.1 GMM-HMM with Tandem and Bottleneck Features
10.1.2 DNN-HMM Hybrid System Versus GMM-HMM System with DNN-Derived Features
10.2 Fuse Recognition Results
10.2.1 ROVER
10.2.2 SCARF
10.2.3 MBR Lattice Combination
10.3 Fuse Frame-Level Acoustic Scores
10.4 Multistream Speech Recognition
References
11 Adaptation of Deep Neural Networks
11.1 The Adaptation Problem for Deep Neural Networks
11.2 Linear Transformations
11.2.1 Linear Input Networks
11.2.2 Linear Output Networks
11.3 Linear Hidden Networks
11.4 Conservative Training
11.4.1 L2 Regularization
11.4.2 KL-Divergence Regularization
11.4.3 Reducing Per-Speaker Footprint
11.5 Subspace Methods
11.5.1 Subspace Construction Through Principal Component Analysis
11.5.2 Noise-Aware, Speaker-Aware, and Device-Aware Training
11.5.3 Tensor
11.6 Effectiveness of DNN Speaker Adaptation
11.6.1 KL-Divergence Regularization Approach
11.6.2 Speaker-Aware Training
References
Part V Advanced Deep Models
12 Representation Sharing and Transfer in Deep Neural Networks
12.1 Multitask and Transfer Learning
12.1.1 Multitask Learning
12.1.2 Transfer Learning
12.2 Multilingual and Crosslingual Speech Recognition
12.2.1 Tandem/Bottleneck-Based Crosslingual Speech Recognition
12.2.2 Shared-Hidden-Layer Multilingual DNN
12.2.3 Crosslingual Model Transfer
12.3 Multiobjective Training of Deep Neural Networks for Speech Recognition
12.3.1 Robust Speech Recognition with Multitask Learning
12.3.2 Improved Phone Recognition with Multitask Learning
12.3.3 Recognizing Both Phonemes and Graphemes
12.4 Robust Speech Recognition Exploiting Audio-Visual Information
References
13 Recurrent Neural Networks and Related Models
13.1 Introduction
13.2 State-Space Formulation of the Basic Recurrent Neural Network
13.3 The Backpropagation-Through-Time Learning Algorithm
13.3.1 Objective Function for Minimization
13.3.2 Recursive Computation of Error Terms
13.3.3 Update of RNN Weights
13.4 A Primal-Dual Technique for Learning Recurrent Neural Networks
13.4.1 Difficulties in Learning RNNs
13.4.2 Echo-State Property and Its Sufficient Condition
13.4.3 Learning RNNs as a Constrained Optimization Problem
13.4.4 A Primal-Dual Method for Learning RNNs
13.5 Recurrent Neural Networks Incorporating LSTM Cells
13.5.1 Motivations and Applications
13.5.2 The Architecture of LSTM Cells
13.5.3 Training the LSTM-RNN
13.6 Analyzing Recurrent Neural Networks: A Contrastive Approach
13.6.1 Direction of Information Flow: Top-Down versus Bottom-Up
13.6.2 The Nature of Representations: Localist or Distributed
13.6.3 Interpretability: Inferring Latent Layers versus End-to-End Learning
13.6.4 Parameterization: Parsimonious Conditionals versus Massive Weight Matrices
13.6.5 Methods of Model Learning: Variational Inference versus Gradient Descent
13.6.6 Recognition Accuracy Comparisons
13.7 Discussions
References
14 Computational Network
14.1 Computational Network
14.2 Forward Computation
14.3 Model Training
14.4 Typical Computation Nodes
14.4.1 Computation Node Types with No Operand
14.4.2 Computation Node Types with One Operand
14.4.3 Computation Node Types with Two Operands
14.4.4 Computation Node Types for Computing Statistics
14.5 Convolutional Neural Network
14.6 Recurrent Connections
14.6.1 Sample by Sample Processing Only Within Loops
14.6.2 Processing Multiple Utterances Simultaneously
14.6.3 Building Arbitrary Recurrent Neural Networks
References
15 Summary and Future Directions
15.1 Road Map
15.1.1 Debut of DNNs for ASR
15.1.2 Speedup of DNN Training and Decoding
15.1.3 Sequence Discriminative Training
15.1.4 Feature Processing
15.1.5 Adaptation
15.1.6 Multitask and Transfer Learning
15.1.7 Convolutional Neural Networks
15.1.8 Recurrent Neural Networks and LSTM
15.1.9 Other Deep Models
15.2 State of the Art and Future Directions
15.2.1 State of the Art: A Brief Analysis
15.2.2 Future Directions
References
Index
Signals and Communication Technology

Dong Yu · Li Deng

Automatic Speech Recognition: A Deep Learning Approach
Signals and Communication Technology
More information about this series at http://www.springer.com/series/4748
Dong Yu
Microsoft Research
Bothell, USA

Li Deng
Microsoft Research
Redmond, WA, USA

ISSN 1860-4862          ISSN 1860-4870 (electronic)
ISBN 978-1-4471-5778-6  ISBN 978-1-4471-5779-3 (eBook)
DOI 10.1007/978-1-4471-5779-3
Library of Congress Control Number: 2014951663

Springer London Heidelberg New York Dordrecht

© Springer-Verlag London 2015

This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed. Exempted from this legal reservation are brief excerpts in connection with reviews or scholarly analysis or material supplied specifically for the purpose of being entered and executed on a computer system, for exclusive use by the purchaser of the work. Duplication of this publication or parts thereof is permitted only under the provisions of the Copyright Law of the Publisher's location, in its current version, and permission for use must always be obtained from Springer. Permissions for use may be obtained through RightsLink at the Copyright Clearance Center. Violations are liable to prosecution under the respective Copyright Law.

The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use.

While the advice and information in this book are believed to be true and accurate at the date of publication, neither the authors nor the editors nor the publisher can accept any legal responsibility for any errors or omissions that may be made. The publisher makes no warranty, express or implied, with respect to the material contained herein.

Printed on acid-free paper

Springer is part of Springer Science+Business Media (www.springer.com)
To my wife and parents
Dong Yu

To Lih-Yuan, Lloyd, Craig, Lyle, Arie, and Axel
Li Deng
Foreword

This is the first book on automatic speech recognition (ASR) that focuses on the deep learning approach, and in particular on deep neural network (DNN) technology. This landmark book represents a major milestone in the journey of DNN technology, which has achieved overwhelming success in ASR over the past few years. Following the authors' recent book "Deep Learning: Methods and Applications", this new book digs deeply and exclusively into ASR technology and applications, which were covered only relatively lightly in the previous book alongside numerous other applications of deep learning. Importantly, this book provides the background material on ASR and the technical details of DNNs, including rigorous mathematical descriptions and software implementation, making it invaluable for ASR experts as well as advanced students.

One unique aspect of this book is that it broadens the view of deep learning from DNNs, as commonly adopted in ASR by now, to also encompass deep generative models, which have the advantage of naturally embedding domain knowledge and problem constraints. The background material does justice to the incredible richness of the deep and dynamic generative models of speech developed by ASR researchers since the early 1990s, without losing sight of the principles that unify them with the recent rapid development of deep discriminative models such as DNNs. The comprehensive comparisons of the relative strengths of these two very different types of deep models, using the example of recurrent neural networks versus hidden dynamic models, are particularly insightful, opening an exciting and promising direction for new developments of deep learning in ASR as well as in other signal and information processing applications.

From a historical perspective, four generations of ASR technology have recently been analyzed. The fourth-generation technology is embodied in the deep learning methods elaborated in this book, especially when DNNs are seamlessly integrated with deep generative models that would enable extended knowledge processing in a most natural fashion.

All in all, this beautifully produced book is likely to become a definitive reference for ASR practitioners in the deep learning era of fourth-generation ASR. The book masterfully covers the basic concepts required to understand the ASR field as a whole, and it also details in depth the powerful deep learning methods that have shattered the field in the past two years. Readers of this book will become articulate in the new state of the art of ASR established by DNN technology, and will be poised to build new ASR systems that may match or exceed human performance.

Sadaoki Furui
President, Toyota Technological Institute at Chicago
Professor, Tokyo Institute of Technology