Foreword
Preface
Contents
Acronyms
Symbols
1 Introduction
1.1 Automatic Speech Recognition: A Bridge for Better Communication
1.1.1 Human–Human Communication
1.1.2 Human–Machine Communication
1.2 Basic Architecture of ASR Systems
1.3 Book Organization
1.3.1 Part I: Conventional Acoustic Models
1.3.2 Part II: Deep Neural Networks
1.3.3 Part III: DNN-HMM Hybrid Systems for ASR
1.3.4 Part IV: Representation Learning in Deep Neural Networks
1.3.5 Part V: Advanced Deep Models
References
Part I Conventional Acoustic Models
2 Gaussian Mixture Models
2.1 Random Variables
2.2 Gaussian and Gaussian-Mixture Random Variables
2.3 Parameter Estimation
2.4 Mixture of Gaussians as a Model for the Distribution of Speech Features
References
3 Hidden Markov Models and the Variants
3.1 Introduction
3.2 Markov Chains
3.3 Hidden Markov Sequences and Models
3.3.1 Characterization of a Hidden Markov Model
3.3.2 Simulation of a Hidden Markov Model
3.3.3 Likelihood Evaluation of a Hidden Markov Model
3.3.4 An Algorithm for Efficient Likelihood Evaluation
3.3.5 Proofs of the Forward and Backward Recursions
3.4 EM Algorithm and Its Application to Learning HMM Parameters
3.4.1 Introduction to EM Algorithm
3.4.2 Applying EM to Learning the HMM: Baum-Welch Algorithm
3.5 Viterbi Algorithm for Decoding HMM State Sequences
3.5.1 Dynamic Programming and Viterbi Algorithm
3.5.2 Dynamic Programming for Decoding HMM States
3.6 The HMM and Variants for Generative Speech Modeling and Recognition
3.6.1 GMM-HMMs for Speech Modeling and Recognition
3.6.2 Trajectory and Hidden Dynamic Models for Speech Modeling and Recognition
3.6.3 The Speech Recognition Problem Using Generative Models of HMM and Its Variants
References
Part II Deep Neural Networks
4 Deep Neural Networks
4.1 The Deep Neural Network Architecture
4.2 Parameter Estimation with Error Backpropagation
4.2.1 Training Criteria
4.2.2 Training Algorithms
4.3 Practical Considerations
4.3.1 Data Preprocessing
4.3.2 Model Initialization
4.3.3 Weight Decay
4.3.4 Dropout
4.3.5 Batch Size Selection
4.3.6 Sample Randomization
4.3.7 Momentum
4.3.8 Learning Rate and Stopping Criterion
4.3.9 Network Architecture
4.3.10 Reproducibility and Restartability
References
5 Advanced Model Initialization Techniques
5.1 Restricted Boltzmann Machines
5.1.1 Properties of RBMs
5.1.2 RBM Parameter Learning
5.2 Deep Belief Network Pretraining
5.3 Pretraining with Denoising Autoencoder
5.4 Discriminative Pretraining
5.5 Hybrid Pretraining
5.6 Dropout Pretraining
References
Part III Deep Neural Network-Hidden Markov Model Hybrid Systems for Automatic Speech Recognition
6 Deep Neural Network-Hidden Markov Model Hybrid Systems
6.1 DNN-HMM Hybrid Systems
6.1.1 Architecture
6.1.2 Decoding with CD-DNN-HMM
6.1.3 Training Procedure for CD-DNN-HMMs
6.1.4 Effects of Contextual Window
6.2 Key Components in the CD-DNN-HMM and Their Analysis
6.2.1 Datasets and Baselines for Comparisons and Analysis
6.2.2 Modeling Monophone States or Senones
6.2.3 Deeper Is Better
6.2.4 Exploit Neighboring Frames
6.2.5 Pretraining
6.2.6 Better Alignment Helps
6.2.7 Tuning Transition Probability
6.3 Kullback-Leibler Divergence-Based HMM
References
7 Training and Decoding Speedup
7.1 Training Speedup
7.1.1 Pipelined Backpropagation Using Multiple GPUs
7.1.2 Asynchronous SGD
7.1.3 Augmented Lagrangian Methods and Alternating Directions Method of Multipliers
7.1.4 Reduce Model Size
7.1.5 Other Approaches
7.2 Decoding Speedup
7.2.1 Parallel Computation
7.2.2 Sparse Network
7.2.3 Low-Rank Approximation
7.2.4 Teach Small DNN with Large DNN
7.2.5 Multiframe DNN
References
8 Deep Neural Network Sequence-Discriminative Training
8.1 Sequence-Discriminative Training Criteria
8.1.1 Maximum Mutual Information
8.1.2 Boosted MMI
8.1.3 MPE/sMBR
8.1.4 A Unified Formulation
8.2 Practical Considerations
8.2.1 Lattice Generation
8.2.2 Lattice Compensation
8.2.3 Frame Smoothing
8.2.4 Learning Rate Adjustment
8.2.5 Training Criterion Selection
8.2.6 Other Considerations
8.3 Noise Contrastive Estimation
8.3.1 Casting Probability Density Estimation Problem as a Classifier Design Problem
8.3.2 Extension to Unnormalized Models
8.3.3 Apply NCE in DNN Training
References
Part IV Representation Learning in Deep Neural Networks
9 Feature Representation Learning in Deep Neural Networks
9.1 Joint Learning of Feature Representation and Classifier
9.2 Feature Hierarchy
9.3 Flexibility in Using Arbitrary Input Features
9.4 Robustness of Features
9.4.1 Robust to Speaker Variations
9.4.2 Robust to Environment Variations
9.5 Robustness Across All Conditions
9.5.1 Robustness Across Noise Levels
9.5.2 Robustness Across Speaking Rates
9.6 Lack of Generalization Over Large Distortions
References
10 Fuse Deep Neural Network and Gaussian Mixture Model Systems
10.1 Use DNN-Derived Features in GMM-HMM Systems
10.1.1 GMM-HMM with Tandem and Bottleneck Features
10.1.2 DNN-HMM Hybrid System Versus GMM-HMM System with DNN-Derived Features
10.2 Fuse Recognition Results
10.2.1 ROVER
10.2.2 SCARF
10.2.3 MBR Lattice Combination
10.3 Fuse Frame-Level Acoustic Scores
10.4 Multistream Speech Recognition
References
11 Adaptation of Deep Neural Networks
11.1 The Adaptation Problem for Deep Neural Networks
11.2 Linear Transformations
11.2.1 Linear Input Networks
11.2.2 Linear Output Networks
11.3 Linear Hidden Networks
11.4 Conservative Training
11.4.1 L2 Regularization
11.4.2 KL-Divergence Regularization
11.4.3 Reducing Per-Speaker Footprint
11.5 Subspace Methods
11.5.1 Subspace Construction Through Principal Component Analysis
11.5.2 Noise-Aware, Speaker-Aware, and Device-Aware Training
11.5.3 Tensor
11.6 Effectiveness of DNN Speaker Adaptation
11.6.1 KL-Divergence Regularization Approach
11.6.2 Speaker-Aware Training
References
Part V Advanced Deep Models
12 Representation Sharing and Transfer in Deep Neural Networks
12.1 Multitask and Transfer Learning
12.1.1 Multitask Learning
12.1.2 Transfer Learning
12.2 Multilingual and Crosslingual Speech Recognition
12.2.1 Tandem/Bottleneck-Based Crosslingual Speech Recognition
12.2.2 Shared-Hidden-Layer Multilingual DNN
12.2.3 Crosslingual Model Transfer
12.3 Multiobjective Training of Deep Neural Networks for Speech Recognition
12.3.1 Robust Speech Recognition with Multitask Learning
12.3.2 Improved Phone Recognition with Multitask Learning
12.3.3 Recognizing Both Phonemes and Graphemes
12.4 Robust Speech Recognition Exploiting Audio-Visual Information
References
13 Recurrent Neural Networks and Related Models
13.1 Introduction
13.2 State-Space Formulation of the Basic Recurrent Neural Network
13.3 The Backpropagation-Through-Time Learning Algorithm
13.3.1 Objective Function for Minimization
13.3.2 Recursive Computation of Error Terms
13.3.3 Update of RNN Weights
13.4 A Primal-Dual Technique for Learning Recurrent Neural Networks
13.4.1 Difficulties in Learning RNNs
13.4.2 Echo-State Property and Its Sufficient Condition
13.4.3 Learning RNNs as a Constrained Optimization Problem
13.4.4 A Primal-Dual Method for Learning RNNs
13.5 Recurrent Neural Networks Incorporating LSTM Cells
13.5.1 Motivations and Applications
13.5.2 The Architecture of LSTM Cells
13.5.3 Training the LSTM-RNN
13.6 Analyzing Recurrent Neural Networks: A Contrastive Approach
13.6.1 Direction of Information Flow: Top-Down versus Bottom-Up
13.6.2 The Nature of Representations: Localist or Distributed
13.6.3 Interpretability: Inferring Latent Layers versus End-to-End Learning
13.6.4 Parameterization: Parsimonious Conditionals versus Massive Weight Matrices
13.6.5 Methods of Model Learning: Variational Inference versus Gradient Descent
13.6.6 Recognition Accuracy Comparisons
13.7 Discussions
References
14 Computational Network
14.1 Computational Network
14.2 Forward Computation
14.3 Model Training
14.4 Typical Computation Nodes
14.4.1 Computation Node Types with No Operand
14.4.2 Computation Node Types with One Operand
14.4.3 Computation Node Types with Two Operands
14.4.4 Computation Node Types for Computing Statistics
14.5 Convolutional Neural Network
14.6 Recurrent Connections
14.6.1 Sample by Sample Processing Only Within Loops
14.6.2 Processing Multiple Utterances Simultaneously
14.6.3 Building Arbitrary Recurrent Neural Networks
References
15 Summary and Future Directions
15.1 Road Map
15.1.1 Debut of DNNs for ASR
15.1.2 Speedup of DNN Training and Decoding
15.1.3 Sequence Discriminative Training
15.1.4 Feature Processing
15.1.5 Adaptation
15.1.6 Multitask and Transfer Learning
15.1.7 Convolutional Neural Networks
15.1.8 Recurrent Neural Networks and LSTM
15.1.9 Other Deep Models
15.2 State of the Art and Future Directions
15.2.1 State of the Art: A Brief Analysis
15.2.2 Future Directions
References
Index
Signals and Communication Technology

Dong Yu · Li Deng

Automatic Speech Recognition: A Deep Learning Approach
Signals and Communication Technology
More information about this series at http://www.springer.com/series/4748
Dong Yu
Microsoft Research
Bothell, USA

Li Deng
Microsoft Research
Redmond, WA, USA

ISSN 1860-4862          ISSN 1860-4870 (electronic)
ISBN 978-1-4471-5778-6  ISBN 978-1-4471-5779-3 (eBook)
DOI 10.1007/978-1-4471-5779-3
Library of Congress Control Number: 2014951663

Springer London Heidelberg New York Dordrecht

© Springer-Verlag London 2015

This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed. Exempted from this legal reservation are brief excerpts in connection with reviews or scholarly analysis or material supplied specifically for the purpose of being entered and executed on a computer system, for exclusive use by the purchaser of the work. Duplication of this publication or parts thereof is permitted only under the provisions of the Copyright Law of the Publisher's location, in its current version, and permission for use must always be obtained from Springer. Permissions for use may be obtained through RightsLink at the Copyright Clearance Center. Violations are liable to prosecution under the respective Copyright Law.

The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use.

While the advice and information in this book are believed to be true and accurate at the date of publication, neither the authors nor the editors nor the publisher can accept any legal responsibility for any errors or omissions that may be made. The publisher makes no warranty, express or implied, with respect to the material contained herein.

Printed on acid-free paper

Springer is part of Springer Science+Business Media (www.springer.com)
To my wife and parents
Dong Yu

To Lih-Yuan, Lloyd, Craig, Lyle, Arie, and Axel
Li Deng
Foreword

This is the first book on automatic speech recognition (ASR) that focuses on the deep learning approach, and in particular on deep neural network (DNN) technology. This landmark book represents a major milestone in the journey of DNN technology, which has achieved overwhelming success in ASR over the past few years. Following the authors' recent book "Deep Learning: Methods and Applications", this new book digs deeply and exclusively into ASR technology and applications, which were covered only relatively lightly in the previous book alongside numerous other applications of deep learning. Importantly, this book provides the background material on ASR and the technical details of DNNs, including rigorous mathematical descriptions and software implementation, making it invaluable for ASR experts as well as advanced students.

One unique aspect of this book is that it broadens the view of deep learning from DNNs, as commonly adopted in ASR by now, to also encompass deep generative models, which have the advantage of naturally embedding domain knowledge and problem constraints. The background material does justice to the incredible richness of the deep and dynamic generative models of speech developed by ASR researchers since the early 1990s, without losing sight of the principles that unify them with the recent rapid development of deep discriminative models such as DNNs. The comprehensive comparisons of the relative strengths of these two very different types of deep models, using the example of recurrent neural networks versus hidden dynamic models, are particularly insightful, opening an exciting and promising direction for new developments of deep learning in ASR as well as in other signal and information processing applications.

From a historical perspective, four generations of ASR technology have recently been analyzed. The fourth-generation technology is embodied in the deep learning methods elaborated in this book, especially when DNNs are seamlessly integrated with deep generative models that would enable extended knowledge processing in a most natural fashion.

All in all, this beautifully produced book is likely to become a definitive reference for ASR practitioners in the deep learning era of fourth-generation ASR. The book masterfully covers the basic concepts required to understand the ASR field as a whole, and it also details in depth the powerful deep learning methods that have shattered the field in the past two years. Readers of this book will become articulate in the new state of the art of ASR established by DNN technology, and will be poised to build new ASR systems that may match or exceed human performance.

Sadaoki Furui
President, Toyota Technological Institute at Chicago
Professor, Tokyo Institute of Technology