Deep Learning for NLP and Speech Recognition
Foreword
Preface
Why This Book?
Who Is This Book for?
What Does This Book Cover?
Acknowledgments
Contents
Notation
Part I Machine Learning, NLP, and Speech Introduction
1 Introduction
1.1 Machine Learning
1.1.1 Supervised Learning
1.1.2 Unsupervised Learning
1.1.3 Semi-Supervised Learning and Active Learning
1.1.4 Transfer Learning and Multitask Learning
1.1.5 Reinforcement Learning
1.2 History
1.2.1 Deep Learning: A Brief History
1.2.2 Natural Language Processing: A Brief History
1.2.3 Automatic Speech Recognition: A Brief History
1.3 Tools, Libraries, Datasets, and Resources for the Practitioners
1.3.1 Deep Learning
1.3.2 Natural Language Processing
1.3.3 Speech Recognition
1.3.3.1 Frameworks
1.3.3.2 Audio Processing
1.3.3.3 Additional Tools and Libraries
1.3.4 Books
1.3.5 Online Courses and Resources
1.3.6 Datasets
1.4 Case Studies and Implementation Details
References
2 Basics of Machine Learning
2.1 Introduction
2.2 Supervised Learning: Framework and Formal Definitions
2.2.1 Input Space and Samples
2.2.2 Target Function and Labels
2.2.3 Training and Prediction
2.3 The Learning Process
2.4 Machine Learning Theory
2.4.1 Generalization–Approximation Trade-Off via the Vapnik–Chervonenkis Analysis
2.4.2 Generalization–Approximation Trade-Off via the Bias–Variance Analysis
2.4.3 Model Performance and Evaluation Metrics
2.4.3.1 Classification Evaluation Metrics
2.4.3.2 Regression Evaluation Metrics
2.4.4 Model Validation
2.4.5 Model Estimation and Comparisons
2.4.6 Practical Tips for Machine Learning
2.5 Linear Algorithms
2.5.1 Linear Regression
2.5.1.1 Discussion Points
2.5.2 Perceptron
2.5.2.1 Discussion Points
2.5.3 Regularization
2.5.3.1 Ridge Regularization: L2 Norm
2.5.3.2 Lasso Regularization: L1 Norm
2.5.4 Logistic Regression
2.5.4.1 Gradient Descent
2.5.4.2 Stochastic Gradient Descent
2.5.5 Generative Classifiers
2.5.5.1 Naive Bayes
2.5.5.2 Linear Discriminant Analysis
2.5.6 Practical Tips for Linear Algorithms
2.6 Non-linear Algorithms
2.6.1 Support Vector Machines
2.6.2 Other Non-linear Algorithms
2.7 Feature Transformation, Selection, and Dimensionality Reduction
2.7.1 Feature Transformation
2.7.1.1 Centering or Zero Mean
2.7.1.2 Unit Range
2.7.1.3 Standardization
2.7.1.4 Discretization
2.7.2 Feature Selection and Reduction
2.7.2.1 Principal Component Analysis
2.8 Sequence Data and Modeling
2.8.1 Discrete Time Markov Chains
2.8.2 Generative Approach: Hidden Markov Models
2.8.3 Discriminative Approach: Conditional Random Fields
2.8.3.1 Feature Functions
2.8.3.2 CRF Distribution
2.8.3.3 CRF Training
2.9 Case Study
2.9.1 Software Tools and Libraries
2.9.2 Exploratory Data Analysis (EDA)
2.9.3 Model Training and Hyperparameter Search
2.9.3.1 Feature Transformation and Reduction Impact
2.9.3.2 Hyperparameter Search and Validation
2.9.3.3 Learning Curves
2.9.4 Final Training and Testing Models
2.9.5 Exercises for Readers and Practitioners
References
3 Text and Speech Basics
3.1 Introduction
3.1.1 Computational Linguistics
3.1.2 Natural Language
3.1.3 Model of Language
3.2 Morphological Analysis
3.2.1 Stemming
3.2.2 Lemmatization
3.3 Lexical Representations
3.3.1 Tokens
3.3.2 Stop Words
3.3.3 N-Grams
3.3.4 Documents
3.3.4.1 Document-Term Matrix
3.3.4.2 Bag-of-Words
3.3.4.3 TFIDF
3.4 Syntactic Representations
3.4.1 Part-of-Speech
3.4.1.1 Rules Based
3.4.1.2 Hidden Markov Models
3.4.2 Dependency Parsing
3.4.2.1 Context-Free Grammars
3.4.2.2 Chunking
3.4.2.3 Treebanks
3.5 Semantic Representations
3.5.1 Named Entity Recognition
3.5.2 Relation Extraction
3.5.3 Event Extraction
3.5.4 Semantic Role Labeling
3.6 Discourse Representations
3.6.1 Cohesion
3.6.2 Coherence
3.6.3 Anaphora/Cataphora
3.6.4 Local and Global Coreference
3.7 Language Models
3.7.1 N-Gram Model
3.7.2 Laplace Smoothing
3.7.3 Out-of-Vocabulary
3.7.4 Perplexity
3.8 Text Classification
3.8.1 Machine Learning Approach
3.8.2 Sentiment Analysis
3.8.2.1 Emotional State Model
3.8.2.2 Subjectivity and Objectivity Detection
3.8.3 Entailment
3.9 Text Clustering
3.9.1 Lexical Chains
3.9.2 Topic Modeling
3.9.2.1 LSA
3.9.2.2 LDA
3.10 Machine Translation
3.10.1 Dictionary Based
3.10.2 Statistical Translation
3.11 Question Answering
3.11.1 Information Retrieval Based
3.11.2 Knowledge-Based QA
3.11.3 Automated Reasoning
3.12 Automatic Summarization
3.12.1 Extraction Based
3.12.2 Abstraction Based
3.13 Automated Speech Recognition
3.13.1 Acoustic Model
3.13.1.1 Spectrograms
3.13.1.2 MFCC
3.14 Case Study
3.14.1 Software Tools and Libraries
3.14.2 EDA
3.14.3 Text Clustering
3.14.4 Topic Modeling
3.14.4.1 LSA
3.14.4.2 LDA
3.14.5 Text Classification
3.14.6 Exercises for Readers and Practitioners
References
Part II Deep Learning Basics
4 Basics of Deep Learning
4.1 Introduction
4.2 Perceptron Algorithm Explained
4.2.1 Bias
4.2.2 Linear and Non-linear Separability
4.3 Multilayer Perceptron (Neural Networks)
4.3.1 Training an MLP
4.3.2 Forward Propagation
4.3.3 Error Computation
4.3.4 Backpropagation
4.3.5 Parameter Update
4.3.6 Universal Approximation Theorem
4.4 Deep Learning
4.4.1 Activation Functions
4.4.1.1 Sigmoid
4.4.1.2 Tanh
4.4.1.3 ReLU
4.4.1.4 Other Activation Functions
4.4.1.5 Softmax
4.4.1.6 Hierarchical Softmax
4.4.2 Loss Functions
4.4.2.1 Mean Squared (L2) Error
4.4.2.2 Mean Absolute (L1) Error
4.4.2.3 Negative Log Likelihood
4.4.2.4 Hinge Loss
4.4.2.5 Kullback–Leibler (KL) Loss
4.4.3 Optimization Methods
4.4.3.1 Stochastic Gradient Descent
4.4.3.2 Momentum
4.4.3.3 Adagrad
4.4.3.4 RMS-Prop
4.4.3.5 ADAM
4.5 Model Training
4.5.1 Early Stopping
4.5.2 Vanishing/Exploding Gradients
4.5.3 Full-Batch and Mini-Batch Gradient Descent
4.5.4 Regularization
4.5.4.1 L2 Regularization: Weight Decay
4.5.4.2 L1 Regularization
4.5.4.3 Dropout
4.5.4.4 Multitask Learning
4.5.4.5 Parameter Sharing
4.5.4.6 Batch Normalization
4.5.5 Hyperparameter Selection
4.5.5.1 Manual Tuning
4.5.5.2 Automated Tuning
4.5.6 Data Availability and Quality
4.5.6.1 Data Augmentation
4.5.6.2 Bagging
4.5.6.3 Adversarial Training
4.5.7 Discussion
4.5.7.1 Computation and Memory Constraints
4.6 Unsupervised Deep Learning
4.6.1 Energy-Based Models
4.6.2 Restricted Boltzmann Machines
4.6.3 Deep Belief Networks
4.6.4 Autoencoders
4.6.4.1 Undercomplete Autoencoders
4.6.4.2 Denoising Autoencoders
4.6.4.3 Sparse Autoencoders
4.6.4.4 Variational Autoencoders
4.6.5 Sparse Coding
4.6.6 Generative Adversarial Networks
4.7 Framework Considerations
4.7.1 Layer Abstraction
4.7.2 Computational Graphs
4.7.3 Reverse-Mode Automatic Differentiation
4.7.4 Static Computational Graphs
4.7.5 Dynamic Computational Graphs
4.8 Case Study
4.8.1 Software Tools and Libraries
4.8.2 Exploratory Data Analysis (EDA)
4.8.3 Supervised Learning
4.8.4 Unsupervised Learning
4.8.5 Classifying with Unsupervised Features
4.8.6 Results
4.8.7 Exercises for Readers and Practitioners
References
5 Distributed Representations
5.1 Introduction
5.2 Distributional Semantics
5.2.1 Vector Space Model
5.2.1.1 Curse of Dimensionality
5.2.2 Word Representations
5.2.2.1 Co-occurrence
5.2.2.2 LSA
5.2.3 Neural Language Models
5.2.3.1 Bengio
5.2.3.2 Collobert and Weston
5.2.4 word2vec
5.2.4.1 CBOW
5.2.4.2 Skip-Gram
5.2.4.3 Hierarchical Softmax
5.2.4.4 Negative Sampling
5.2.4.5 Phrase Representations
5.2.4.6 word2vec CBOW: Forward and Backward Propagation
5.2.4.7 word2vec Skip-gram: Forward and Backward Propagation
5.2.5 GloVe
5.2.6 Spectral Word Embeddings
5.2.7 Multilingual Word Embeddings
5.3 Limitations of Word Embeddings
5.3.1 Out of Vocabulary
5.3.2 Antonymy
5.3.3 Polysemy
5.3.3.1 Clustering-Weighted Context Embeddings
5.3.3.2 Sense2vec
5.3.4 Biased Embeddings
5.3.5 Other Limitations
5.4 Beyond Word Embeddings
5.4.1 Subword Embeddings
5.4.2 Word Vector Quantization
5.4.3 Sentence Embeddings
5.4.4 Concept Embeddings
5.4.5 Retrofitting with Semantic Lexicons
5.4.6 Gaussian Embeddings
5.4.6.1 Word2Gauss
5.4.6.2 Bayesian Skip-Gram
5.4.7 Hyperbolic Embeddings
5.5 Applications
5.5.1 Classification
5.5.2 Document Clustering
5.5.3 Language Modeling
5.5.4 Text Anomaly Detection
5.5.5 Contextualized Embeddings
5.6 Case Study
5.6.1 Software Tools and Libraries
5.6.2 Exploratory Data Analysis
5.6.3 Learning Word Embeddings
5.6.3.1 Word2Vec
5.6.3.2 Negative Sampling
5.6.3.3 Training the Model
5.6.3.4 Visualize Embeddings
5.6.3.5 Using the Gensim Package
5.6.3.6 Similarity
5.6.3.7 GloVe Embeddings
5.6.3.8 Co-occurrence Matrix
5.6.3.9 GloVe Training
5.6.3.10 GloVe Vector Similarity
5.6.3.11 Using the GloVe Package
5.6.4 Document Clustering
5.6.4.1 Document Vectors
5.6.5 Word Sense Disambiguation
5.6.5.1 Supervised Disambiguation Annotations
5.6.5.2 Training with word2vec
5.6.6 Exercises for Readers and Practitioners
References
6 Convolutional Neural Networks
6.1 Introduction
6.2 Basic Building Blocks of CNN
6.2.1 Convolution and Correlation in Linear Time-Invariant Systems
6.2.1.1 Linear Time-Invariant Systems
6.2.1.2 The Convolution Operator and Its Properties
6.2.1.3 Cross-Correlation and Its Properties
6.2.2 Local Connectivity or Sparse Interactions
6.2.3 Parameter Sharing
6.2.4 Spatial Arrangement
6.2.5 Detector Using Nonlinearity
6.2.6 Pooling and Subsampling
6.2.6.1 Max Pooling
6.2.6.2 Average Pooling
6.2.6.3 L2-Norm Pooling
6.2.6.4 Stochastic Pooling
6.2.6.5 Spectral Pooling
6.3 Forward and Backpropagation in CNN
6.3.1 Gradient with Respect to the Weights ∂E/∂W
6.3.2 Gradient with Respect to the Inputs ∂E/∂X
6.3.3 Max Pooling Layer
6.4 Text Inputs and CNNs
6.4.1 Word Embeddings and CNN
6.4.2 Character-Based Representation and CNN
6.5 Classic CNN Architectures
6.5.1 LeNet-5
6.5.2 AlexNet
6.5.3 VGG-16
6.6 Modern CNN Architectures
6.6.1 Stacked or Hierarchical CNN
6.6.2 Dilated CNN
6.6.3 Inception Networks
6.6.4 Other CNN Structures
6.7 Applications of CNN in NLP
6.7.1 Text Classification and Categorization
6.7.2 Text Clustering and Topic Mining
6.7.3 Syntactic Parsing
6.7.4 Information Extraction
6.7.5 Machine Translation
6.7.6 Summarizations
6.7.7 Question and Answers
6.8 Fast Algorithms for Convolutions
6.8.1 Convolution Theorem and Fast Fourier Transform
6.8.2 Fast Filtering Algorithm
6.9 Case Study
6.9.1 Software Tools and Libraries
6.9.2 Exploratory Data Analysis
6.9.3 Data Preprocessing and Data Splits
6.9.4 CNN Model Experiments
6.9.5 Understanding and Improving the Models
6.9.6 Exercises for Readers and Practitioners
6.10 Discussion
References
7 Recurrent Neural Networks
7.1 Introduction
7.2 Basic Building Blocks of RNNs
7.2.1 Recurrence and Memory
7.2.2 PyTorch Example
7.3 RNNs and Properties
7.3.1 Forward and Backpropagation in RNNs
7.3.1.1 Output Weights (V)
7.3.1.2 Recurrent Weights (W)
7.3.1.3 Input Weights (U)
7.3.1.4 Aggregate Gradient
7.3.2 Vanishing Gradient Problem and Regularization
7.3.2.1 Long Short-Term Memory
7.3.2.2 Gated Recurrent Unit
7.3.2.3 Gradient Clipping
7.3.2.4 BPTT Sequence Length
7.3.2.5 Recurrent Dropout
7.4 Deep RNN Architectures
7.4.1 Deep RNNs
7.4.2 Residual LSTM
7.4.3 Recurrent Highway Networks
7.4.4 Bidirectional RNNs
7.4.5 SRU and Quasi-RNN
7.4.6 Recursive Neural Networks
7.5 Extensions of Recurrent Networks
7.5.1 Sequence-to-Sequence
7.5.2 Attention
7.5.3 Pointer Networks
7.5.4 Transformer Networks
7.6 Applications of RNNs in NLP
7.6.1 Text Classification
7.6.2 Part-of-Speech Tagging and Named Entity Recognition
7.6.3 Dependency Parsing
7.6.4 Topic Modeling and Summarization
7.6.5 Question Answering
7.6.6 Multi-Modal
7.6.7 Language Models
7.6.7.1 Perplexity
7.6.7.2 Recurrent Variational Autoencoder
7.6.8 Neural Machine Translation
7.6.8.1 BLEU
7.6.9 Prediction/Sampling Output
7.6.9.1 Greedy Search
7.6.9.2 Random Sampling and Temperature Sampling
7.6.9.3 Optimizing Output: Beam Search Decoding
7.7 Case Study
7.7.1 Software Tools and Libraries
7.7.2 Exploratory Data Analysis
7.7.2.1 Sequence Length Filtering
7.7.2.2 Vocabulary Inspection
7.7.3 Model Training
7.7.3.1 RNN Baseline
7.7.3.2 RNN, LSTM, and GRU Comparison
7.7.3.3 RNN, LSTM, and GRU Layer Depth Comparison
7.7.3.4 Bidirectional RNN, LSTM, and GRU Comparison
7.7.3.5 Deep Bidirectional Comparison
7.7.3.6 Transformer Network
7.7.3.7 Comparison of Experiments
7.7.4 Results
7.7.5 Exercises for Readers and Practitioners
7.8 Discussion
7.8.1 Memorization or Generalization
7.8.2 Future of RNNs
References
8 Automatic Speech Recognition
8.1 Introduction
8.2 Acoustic Features
8.2.1 Speech Production
8.2.2 Raw Waveform
8.2.3 MFCC
8.2.3.1 Pre-emphasis
8.2.3.2 Framing
8.2.3.3 Windowing
8.2.3.4 Fast Fourier Transform
8.2.3.5 Mel Filter Bank
8.2.3.6 Discrete Cosine Transform
8.2.3.7 Delta Energy and Delta Spectrum
8.2.4 Other Feature Types
8.2.4.1 Automatically Learned
8.3 Phones
8.4 Statistical Speech Recognition
8.4.1 Acoustic Model: P(X|W)
8.4.1.1 Lexicon Model: P(S|W)
8.4.2 Language Model: P(W)
8.4.3 HMM Decoding
8.5 Error Metrics
8.6 DNN/HMM Hybrid Model
8.7 Case Study
8.7.1 Dataset: Common Voice
8.7.2 Software Tools and Libraries
8.7.3 Sphinx
8.7.3.1 Data Preparation
8.7.3.2 Model Training
8.7.4 Kaldi
8.7.4.1 Data Preparation
8.7.4.2 Model Training
8.7.5 Results
8.7.6 Exercises for Readers and Practitioners
References
Part III Advanced Deep Learning Techniques for Text and Speech
9 Attention and Memory Augmented Networks
9.1 Introduction
9.2 Attention Mechanism
9.2.1 The Need for Attention Mechanism
9.2.2 Soft Attention
9.2.3 Scores-Based Attention
9.2.4 Soft vs. Hard Attention
9.2.5 Local vs. Global Attention
9.2.6 Self-Attention
9.2.7 Key-Value Attention
9.2.8 Multi-Head Self-Attention
9.2.9 Hierarchical Attention
9.2.10 Applications of Attention Mechanism in Text and Speech
9.3 Memory Augmented Networks
9.3.1 Memory Networks
9.3.2 End-to-End Memory Networks
9.3.2.1 Single Layer MemN2N
9.3.2.2 Input and Query
9.3.2.3 Controller and Memory
9.3.2.4 Controller and Output
9.3.2.5 Final Prediction and Learning
9.3.2.6 Multiple Layers
9.3.3 Neural Turing Machines
9.3.3.1 Read Operations
9.3.3.2 Write Operations
9.3.3.3 Addressing Mechanism
9.3.4 Differentiable Neural Computer
9.3.4.1 Input and Outputs
9.3.4.2 Memory Reads and Writes
9.3.4.3 Selective Attention
9.3.5 Dynamic Memory Networks
9.3.5.1 Input Module
9.3.5.2 Question Module
9.3.5.3 Episodic Memory Module
9.3.5.4 Answer Module
9.3.5.5 Training
9.3.6 Neural Stack, Queues, and Deques
9.3.6.1 Neural Stack
9.3.6.2 Recurrent Networks, Controller, and Training
9.3.7 Recurrent Entity Networks
9.3.7.1 Input Encoder
9.3.7.2 Dynamic Memory
9.3.7.3 Output Module and Training
9.3.8 Applications of Memory Augmented Networks in Text and Speech
9.4 Case Study
9.4.1 Attention-Based NMT
9.4.2 Exploratory Data Analysis
9.4.2.1 Software Tools and Libraries
9.4.2.2 Model Training
9.4.2.3 Bahdanau Attention
9.4.2.4 Results
9.4.3 Question and Answering
9.4.3.1 Software Tools and Libraries
9.4.3.2 Exploratory Data Analysis
9.4.3.3 LSTM Baseline
9.4.3.4 End-to-End Memory Network
9.4.4 Dynamic Memory Network
9.4.4.1 Differentiable Neural Computer
9.4.4.2 Recurrent Entity Network
9.4.5 Exercises for Readers and Practitioners
References
10 Transfer Learning: Scenarios, Self-Taught Learning, and Multitask Learning
10.1 Introduction
10.2 Transfer Learning: Definition, Scenarios, and Categorization
10.2.1 Definition
10.2.2 Transfer Learning Scenarios
10.2.3 Transfer Learning Categories
10.3 Self-Taught Learning
10.3.1 Techniques
10.3.1.1 Unsupervised Pre-training and Supervised Fine-Tuning
10.3.2 Theory
10.3.3 Applications in NLP
10.3.4 Applications in Speech
10.4 Multitask Learning
10.4.1 Techniques
10.4.1.1 Multilinear Relationship Network
10.4.1.2 Fully Adaptive Feature Sharing Network
10.4.1.3 Cross-Stitch Networks
10.4.1.4 A Joint Many-Task Network
10.4.1.5 Sluice Networks
10.4.2 Theory
10.4.3 Applications in NLP
10.4.4 Applications in Speech Recognition
10.5 Case Study
10.5.1 Software Tools and Libraries
10.5.2 Exploratory Data Analysis
10.5.3 Multitask Learning Experiments and Analysis
10.5.4 Exercises for Readers and Practitioners
References
11 Transfer Learning: Domain Adaptation
11.1 Introduction
11.1.1 Techniques
11.1.1.1 Stacked Autoencoders
11.1.1.2 Deep Interpolation Between Source and Target
11.1.1.3 Deep Domain Confusion
11.1.1.4 Deep Adaptation Network
11.1.1.5 Domain-Invariant Representation
11.1.1.6 Domain Confusion and Invariant Representation
11.1.1.7 Domain-Adversarial Neural Network
11.1.1.8 Adversarial Discriminative Domain Adaptation
11.1.1.9 Coupled Generative Adversarial Networks
11.1.1.10 Cycle Generative Adversarial Networks
11.1.1.11 Domain Separation Networks
11.1.2 Theory
11.1.2.1 Siamese Networks Based Domain Adaptations
11.1.2.2 Optimal Transport
11.1.3 Applications in NLP
11.1.4 Applications in Speech Recognition
11.2 Zero-Shot, One-Shot, and Few-Shot Learning
11.2.1 Zero-Shot Learning
11.2.1.1 Techniques
11.2.2 One-Shot Learning
11.2.2.1 Techniques
11.2.3 Few-Shot Learning
11.2.3.1 Techniques
11.2.4 Theory
11.2.5 Applications in NLP and Speech Recognition
11.3 Case Study
11.3.1 Software Tools and Libraries
11.3.2 Exploratory Data Analysis
11.3.3 Domain Adaptation Experiments
11.3.3.1 Preprocessing
11.3.3.2 Experiments
11.3.3.3 Results and Analysis
11.3.4 Exercises for Readers and Practitioners
References
12 End-to-End Speech Recognition
12.1 Introduction
12.2 Connectionist Temporal Classification (CTC)
12.2.1 End-to-End Phoneme Recognition
12.2.2 Deep Speech
12.2.2.1 GPU Parallelism
12.2.3 Deep Speech 2
12.2.4 Wav2Letter
12.2.5 Extensions of CTC
12.2.5.1 Gram-CTC
12.2.5.2 RNN Transducer
12.3 Seq-to-Seq
12.3.0.1 Content-Based Attention
12.3.0.2 Location-Aware Attention
12.3.1 Early Seq-to-Seq ASR
12.3.2 Listen, Attend, and Spell (LAS)
12.4 Multitask Learning
12.5 End-to-End Decoding
12.5.1 Language Models for ASR
12.5.1.1 N-gram
12.5.1.2 RNN Language Models
12.5.2 CTC Decoding
12.5.3 Attention Decoding
12.5.3.1 Shallow Fusion
12.5.4 Combined Language Model Training
12.5.4.1 Deep Fusion
12.5.4.2 Cold Fusion
12.5.5 Combined CTC–Attention Decoding
12.5.5.1 Rescoring
12.5.6 One-Pass Decoding
12.6 Speech Embeddings and Unsupervised Speech Recognition
12.6.1 Speech Embeddings
12.6.2 Unspeech
12.6.3 Audio Word2Vec
12.7 Case Study
12.7.1 Software Tools and Libraries
12.7.2 Deep Speech 2
12.7.2.1 Data Preparation
12.7.2.2 Acoustic Model Training
12.7.3 Language Model Training
12.7.4 ESPnet
12.7.4.1 Data Preparation
12.7.4.2 Model Training
12.7.5 Results
12.7.6 Exercises for Readers and Practitioners
References
13 Deep Reinforcement Learning for Text and Speech
13.1 Introduction
13.2 RL Fundamentals
13.2.1 Markov Decision Processes
13.2.2 Value, Q, and Advantage Functions
13.2.3 Bellman Equations
13.2.4 Optimality
13.2.5 Dynamic Programming Methods
13.2.5.1 Policy Evaluation
13.2.5.2 Policy Improvement
13.2.5.3 Value Iteration
13.2.5.4 Bootstrapping
13.2.5.5 Asynchronous DP
13.2.6 Monte Carlo
13.2.6.1 Importance Sampling
13.2.7 Temporal Difference Learning
13.2.7.1 SARSA
13.2.8 Policy Gradient
13.2.9 Q-Learning
13.2.10 Actor-Critic
13.2.10.1 Advantage Actor Critic A2C
13.2.10.2 Asynchronous Advantage Actor Critic A3C
13.3 Deep Reinforcement Learning Algorithms
13.3.1 Why RL for Seq2seq
13.3.2 Deep Policy Gradient
13.3.3 Deep Q-Learning
13.3.3.1 DQN
13.3.3.2 Double DQN
13.3.3.3 Dueling Networks
13.3.4 Deep Advantage Actor-Critic
13.4 DRL for Text
13.4.1 Information Extraction
13.4.1.1 Entity Extraction
13.4.1.2 Relation Extraction
13.4.1.3 Action Extraction
13.4.1.4 Joint Entity/Relation Extraction
13.4.2 Text Classification
13.4.3 Dialogue Systems
13.4.4 Text Summarization
13.4.5 Machine Translation
13.5 DRL for Speech
13.5.1 Automatic Speech Recognition
13.5.2 Speech Enhancement and Noise Suppression
13.6 Case Study
13.6.1 Software Tools and Libraries
13.6.2 Text Summarization
13.6.3 Exploratory Data Analysis
13.6.3.1 Seq2Seq Model
13.6.3.2 Policy Gradient
13.6.3.3 DDQN
13.6.4 Exercises for Readers and Practitioners
References
Future Outlook
End-to-End Architecture Prevalence
Transition to AI-Centric
Specialized Hardware
Transition Away from Supervised Learning
Explainable AI
Model Development and Deployment Process
Democratization of AI
NLP Trends
Speech Trends
Closing Remarks
Index
Uday Kamath · John Liu · James Whitaker
Deep Learning for NLP and Speech Recognition
Uday Kamath
Digital Reasoning Systems Inc.
McLean, VA, USA

John Liu
Intelluron Corporation
Nashville, TN, USA

James Whitaker
Digital Reasoning Systems Inc.
McLean, VA, USA

ISBN 978-3-030-14595-8    ISBN 978-3-030-14596-5 (eBook)
https://doi.org/10.1007/978-3-030-14596-5

© Springer Nature Switzerland AG 2019

This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed. The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use.

The publisher, the authors, and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, express or implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This Springer imprint is published by the registered company Springer Nature Switzerland AG.
The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland
To my parents Krishna and Bharathi, my wife Pratibha, the kids Aaroh and Brandy, my family and friends for their support.
–Uday Kamath

To Catherine, Gabrielle Kaili-May, Eugene and Tina for inspiring me always.
–John Liu

To my mother Nancy for her constant support, my family, and my friends who have blessed my life with love.
–James Whitaker
Foreword

The publication of this book is perfectly timed. Existing books on deep learning either focus on theoretical aspects or are largely manuals for tools. But this book presents an unprecedented analysis and comparison of deep learning techniques for natural language and speech processing, closing the substantial gap between theory and practice. Each chapter discusses the theory underpinning the topics, and an exceptional collection of 13 case studies in different application areas is presented. They include classification via distributed representation, summarization, machine translation, sentiment analysis, transfer learning, multitask NLP, end-to-end speech, and question answering. Each case study includes the implementation and comparison of state-of-the-art techniques, and the accompanying website provides source code and data. This is extraordinarily valuable for practitioners, who can experiment firsthand with the methods and deepen their understanding by applying them to real-world scenarios.

This book offers comprehensive coverage of deep learning, from its foundations to advanced and recent topics, including word embedding, convolutional neural networks, recurrent neural networks, attention mechanisms, memory-augmented networks, multitask learning, domain adaptation, and reinforcement learning. The book is a great resource for practitioners and researchers both in industry and academia, and the discussed case studies and associated material can serve as inspiration for a variety of projects and hands-on assignments in a classroom setting.

Fairfax, VA, USA
February 2019
Carlotta Domeniconi, PhD
Associate Professor at GMU

Natural language and speech processing applications such as virtual assistants and smart speakers play an important and ever-growing role in our lives. At the same time, amid an increasing number of publications, it is becoming harder to identify the most promising approaches. As the Chief Analytics Officer at Digital Reasoning and with a PhD in Big Data Machine Learning, Uday has access to both the practical and research aspects of this rapidly growing field. Having authored Mastering Java Machine Learning, he is uniquely suited to break down both practical and cutting-edge approaches. This book combines both theoretical and practical aspects of machine learning in a rare blend. It consists of an introduction that makes it accessible to people starting in the field, an overview of state-of-the-art methods that should be interesting even to people working in research, and a selection of hands-on examples that ground the material in real-world applications and demonstrate its usefulness to industry practitioners.

London, UK
February 2019
Sebastian Ruder, PhD
Research Scientist at DeepMind

A few years ago, I picked up a few textbooks to study topics related to artificial intelligence, such as natural language processing and computer vision. My memory of reading these textbooks largely consisted of staring helplessly out of the window. Whenever I attempted to implement the described concepts and math, I wouldn't know where to start. This is fairly common in books written for academic purposes; they mockingly leave the actual implementation "as an exercise to the reader." There are a few exceptional books that try to bridge this gap, written by people who know the importance of going beyond the math all the way to a working system. This book is one of those exceptions: with its discussions, case studies, code snippets, and comprehensive references, it delightfully bridges the gap between learning and doing.

I especially like the use of Python and open-source tools out there. It's an opinionated take on implementing machine learning systems; one might ask the following question: "Why not X," where X could be Java, C++, or Matlab? However, I find solace in the fact that it's the most popular opinion, which gives the readers an immense support structure as they implement their own ideas. In the modern Internet-connected world, joining a popular ecosystem is equivalent to having thousands of humans connecting together to help each other, from Stack Overflow posts solving an error message to GitHub repositories implementing high-quality systems. To give you perspective, I've seen the other side, supporting a niche community of enthusiasts in machine learning using the programming language Lua for several years. It was a daily struggle to do new things, even basic things such as making a bar chart, precisely because our community of people was a few orders of magnitude smaller than Python's.

Overall, I hope the reader enjoys a modern, practical take on deep learning systems, leveraging open-source machine learning systems heavily, and being taught a lot of "tricks of the trade" by the incredibly talented authors, one of whom I've known for years and have seen build robust speech recognition systems.

New York, NY, USA
February 2019
Soumith Chintala, PhD
Research Engineer at Facebook AI Research (FAIR)
Preface

Why This Book?

With the widespread adoption of deep learning, natural language processing (NLP), and speech applications in various domains such as finance, healthcare, and government, and across our daily lives, there is a growing need for one comprehensive resource that maps deep learning techniques to NLP and speech and provides insights into using the tools and libraries for real-world applications. Many books focus on deep learning theory or deep learning for NLP-specific tasks, while others are cookbooks for tools and libraries. But the constant flux of new algorithms, tools, frameworks, and libraries in a rapidly evolving landscape means that there are few available texts that contain explanations of recent deep learning methods and state-of-the-art approaches applicable to NLP and speech, as well as real-world case studies with code to provide hands-on experience. As an example, you would find it difficult to find a single source that explains the impact of neural attention techniques applied to a real-world NLP task such as machine translation across a range of approaches, from the basic to the state-of-the-art. Likewise, it would be difficult to find a source that includes accompanying code based on well-known libraries with comparisons and analysis across these techniques. This book provides the following, all in one place:

• A comprehensive resource that builds up from elementary deep learning, text, and speech principles to advanced state-of-the-art neural architectures
• A ready reference for deep learning techniques applicable to common NLP and speech recognition applications
• A useful resource on successful architectures and algorithms with essential mathematical insights explained in detail
• An in-depth reference and comparison of the latest end-to-end neural speech processing approaches