Foreword
Preface
Why This Book?
Who Is This Book For?
What Does This Book Cover?
Acknowledgments
Contents
Notation
Part I Machine Learning, NLP, and Speech Introduction
1 Introduction
1.1 Machine Learning
1.1.1 Supervised Learning
1.1.2 Unsupervised Learning
1.1.3 Semi-Supervised Learning and Active Learning
1.1.4 Transfer Learning and Multitask Learning
1.1.5 Reinforcement Learning
1.2 History
1.2.1 Deep Learning: A Brief History
1.2.2 Natural Language Processing: A Brief History
1.2.3 Automatic Speech Recognition: A Brief History
1.3 Tools, Libraries, Datasets, and Resources for Practitioners
1.3.1 Deep Learning
1.3.2 Natural Language Processing
1.3.3 Speech Recognition
1.3.3.1 Frameworks
1.3.3.2 Audio Processing
1.3.3.3 Additional Tools and Libraries
1.3.4 Books
1.3.5 Online Courses and Resources
1.3.6 Datasets
1.4 Case Studies and Implementation Details
References
2 Basics of Machine Learning
2.1 Introduction
2.2 Supervised Learning: Framework and Formal Definitions
2.2.1 Input Space and Samples
2.2.2 Target Function and Labels
2.2.3 Training and Prediction
2.3 The Learning Process
2.4 Machine Learning Theory
2.4.1 Generalization–Approximation Trade-Off via the Vapnik–Chervonenkis Analysis
2.4.2 Generalization–Approximation Trade-Off via the Bias–Variance Analysis
2.4.3 Model Performance and Evaluation Metrics
2.4.3.1 Classification Evaluation Metrics
2.4.3.2 Regression Evaluation Metrics
2.4.4 Model Validation
2.4.5 Model Estimation and Comparisons
2.4.6 Practical Tips for Machine Learning
2.5 Linear Algorithms
2.5.1 Linear Regression
2.5.1.1 Discussion Points
2.5.2 Perceptron
2.5.2.1 Discussion Points
2.5.3 Regularization
2.5.3.1 Ridge Regularization: L2 Norm
2.5.3.2 Lasso Regularization: L1 Norm
2.5.4 Logistic Regression
2.5.4.1 Gradient Descent
2.5.4.2 Stochastic Gradient Descent
2.5.5 Generative Classifiers
2.5.5.1 Naive Bayes
2.5.5.2 Linear Discriminant Analysis
2.5.6 Practical Tips for Linear Algorithms
2.6 Non-linear Algorithms
2.6.1 Support Vector Machines
2.6.2 Other Non-linear Algorithms
2.7 Feature Transformation, Selection, and Dimensionality Reduction
2.7.1 Feature Transformation
2.7.1.1 Centering or Zero Mean
2.7.1.2 Unit Range
2.7.1.3 Standardization
2.7.1.4 Discretization
2.7.2 Feature Selection and Reduction
2.7.2.1 Principal Component Analysis
2.8 Sequence Data and Modeling
2.8.1 Discrete-Time Markov Chains
2.8.2 Generative Approach: Hidden Markov Models
2.8.3 Discriminative Approach: Conditional Random Fields
2.8.3.1 Feature Functions
2.8.3.2 CRF Distribution
2.8.3.3 CRF Training
2.9 Case Study
2.9.1 Software Tools and Libraries
2.9.2 Exploratory Data Analysis (EDA)
2.9.3 Model Training and Hyperparameter Search
2.9.3.1 Feature Transformation and Reduction Impact
2.9.3.2 Hyperparameter Search and Validation
2.9.3.3 Learning Curves
2.9.4 Final Training and Testing Models
2.9.5 Exercises for Readers and Practitioners
References
3 Text and Speech Basics
3.1 Introduction
3.1.1 Computational Linguistics
3.1.2 Natural Language
3.1.3 Model of Language
3.2 Morphological Analysis
3.2.1 Stemming
3.2.2 Lemmatization
3.3 Lexical Representations
3.3.1 Tokens
3.3.2 Stop Words
3.3.3 N-Grams
3.3.4 Documents
3.3.4.1 Document-Term Matrix
3.3.4.2 Bag-of-Words
3.3.4.3 TF-IDF
3.4 Syntactic Representations
3.4.1 Part-of-Speech
3.4.1.1 Rule-Based
3.4.1.2 Hidden Markov Models
3.4.2 Dependency Parsing
3.4.2.1 Context-Free Grammars
3.4.2.2 Chunking
3.4.2.3 Treebanks
3.5 Semantic Representations
3.5.1 Named Entity Recognition
3.5.2 Relation Extraction
3.5.3 Event Extraction
3.5.4 Semantic Role Labeling
3.6 Discourse Representations
3.6.1 Cohesion
3.6.2 Coherence
3.6.3 Anaphora/Cataphora
3.6.4 Local and Global Coreference
3.7 Language Models
3.7.1 N-Gram Model
3.7.2 Laplace Smoothing
3.7.3 Out-of-Vocabulary
3.7.4 Perplexity
3.8 Text Classification
3.8.1 Machine Learning Approach
3.8.2 Sentiment Analysis
3.8.2.1 Emotional State Model
3.8.2.2 Subjectivity and Objectivity Detection
3.8.3 Entailment
3.9 Text Clustering
3.9.1 Lexical Chains
3.9.2 Topic Modeling
3.9.2.1 LSA
3.9.2.2 LDA
3.10 Machine Translation
3.10.1 Dictionary-Based
3.10.2 Statistical Translation
3.11 Question Answering
3.11.1 Information Retrieval-Based
3.11.2 Knowledge-Based QA
3.11.3 Automated Reasoning
3.12 Automatic Summarization
3.12.1 Extraction-Based
3.12.2 Abstraction-Based
3.13 Automatic Speech Recognition
3.13.1 Acoustic Model
3.13.1.1 Spectrograms
3.13.1.2 MFCC
3.14 Case Study
3.14.1 Software Tools and Libraries
3.14.2 EDA
3.14.3 Text Clustering
3.14.4 Topic Modeling
3.14.4.1 LSA
3.14.4.2 LDA
3.14.5 Text Classification
3.14.6 Exercises for Readers and Practitioners
References
Part II Deep Learning Basics
4 Basics of Deep Learning
4.1 Introduction
4.2 Perceptron Algorithm Explained
4.2.1 Bias
4.2.2 Linear and Non-linear Separability
4.3 Multilayer Perceptron (Neural Networks)
4.3.1 Training an MLP
4.3.2 Forward Propagation
4.3.3 Error Computation
4.3.4 Backpropagation
4.3.5 Parameter Update
4.3.6 Universal Approximation Theorem
4.4 Deep Learning
4.4.1 Activation Functions
4.4.1.1 Sigmoid
4.4.1.2 Tanh
4.4.1.3 ReLU
4.4.1.4 Other Activation Functions
4.4.1.5 Softmax
4.4.1.6 Hierarchical Softmax
4.4.2 Loss Functions
4.4.2.1 Mean Squared (L2) Error
4.4.2.2 Mean Absolute (L1) Error
4.4.2.3 Negative Log Likelihood
4.4.2.4 Hinge Loss
4.4.2.5 Kullback–Leibler (KL) Loss
4.4.3 Optimization Methods
4.4.3.1 Stochastic Gradient Descent
4.4.3.2 Momentum
4.4.3.3 Adagrad
4.4.3.4 RMS-Prop
4.4.3.5 ADAM
4.5 Model Training
4.5.1 Early Stopping
4.5.2 Vanishing/Exploding Gradients
4.5.3 Full-Batch and Mini-Batch Gradient Descent
4.5.4 Regularization
4.5.4.1 L2 Regularization: Weight Decay
4.5.4.2 L1 Regularization
4.5.4.3 Dropout
4.5.4.4 Multitask Learning
4.5.4.5 Parameter Sharing
4.5.4.6 Batch Normalization
4.5.5 Hyperparameter Selection
4.5.5.1 Manual Tuning
4.5.5.2 Automated Tuning
4.5.6 Data Availability and Quality
4.5.6.1 Data Augmentation
4.5.6.2 Bagging
4.5.6.3 Adversarial Training
4.5.7 Discussion
4.5.7.1 Computation and Memory Constraints
4.6 Unsupervised Deep Learning
4.6.1 Energy-Based Models
4.6.2 Restricted Boltzmann Machines
4.6.3 Deep Belief Networks
4.6.4 Autoencoders
4.6.4.1 Undercomplete Autoencoders
4.6.4.2 Denoising Autoencoders
4.6.4.3 Sparse Autoencoders
4.6.4.4 Variational Autoencoders
4.6.5 Sparse Coding
4.6.6 Generative Adversarial Networks
4.7 Framework Considerations
4.7.1 Layer Abstraction
4.7.2 Computational Graphs
4.7.3 Reverse-Mode Automatic Differentiation
4.7.4 Static Computational Graphs
4.7.5 Dynamic Computational Graphs
4.8 Case Study
4.8.1 Software Tools and Libraries
4.8.2 Exploratory Data Analysis (EDA)
4.8.3 Supervised Learning
4.8.4 Unsupervised Learning
4.8.5 Classifying with Unsupervised Features
4.8.6 Results
4.8.7 Exercises for Readers and Practitioners
References
5 Distributed Representations
5.1 Introduction
5.2 Distributional Semantics
5.2.1 Vector Space Model
5.2.1.1 Curse of Dimensionality
5.2.2 Word Representations
5.2.2.1 Co-occurrence
5.2.2.2 LSA
5.2.3 Neural Language Models
5.2.3.1 Bengio
5.2.3.2 Collobert and Weston
5.2.4 word2vec
5.2.4.1 CBOW
5.2.4.2 Skip-Gram
5.2.4.3 Hierarchical Softmax
5.2.4.4 Negative Sampling
5.2.4.5 Phrase Representations
5.2.4.6 word2vec CBOW: Forward and Backward Propagation
5.2.4.7 word2vec Skip-Gram: Forward and Backward Propagation
5.2.5 GloVe
5.2.6 Spectral Word Embeddings
5.2.7 Multilingual Word Embeddings
5.3 Limitations of Word Embeddings
5.3.1 Out-of-Vocabulary
5.3.2 Antonymy
5.3.3 Polysemy
5.3.3.1 Clustering-Weighted Context Embeddings
5.3.3.2 Sense2vec
5.3.4 Biased Embeddings
5.3.5 Other Limitations
5.4 Beyond Word Embeddings
5.4.1 Subword Embeddings
5.4.2 Word Vector Quantization
5.4.3 Sentence Embeddings
5.4.4 Concept Embeddings
5.4.5 Retrofitting with Semantic Lexicons
5.4.6 Gaussian Embeddings
5.4.6.1 Word2Gauss
5.4.6.2 Bayesian Skip-Gram
5.4.7 Hyperbolic Embeddings
5.5 Applications
5.5.1 Classification
5.5.2 Document Clustering
5.5.3 Language Modeling
5.5.4 Text Anomaly Detection
5.5.5 Contextualized Embeddings
5.6 Case Study
5.6.1 Software Tools and Libraries
5.6.2 Exploratory Data Analysis
5.6.3 Learning Word Embeddings
5.6.3.1 Word2Vec
5.6.3.2 Negative Sampling
5.6.3.3 Training the Model
5.6.3.4 Visualize Embeddings
5.6.3.5 Using the Gensim Package
5.6.3.6 Similarity
5.6.3.7 GloVe Embeddings
5.6.3.8 Co-occurrence Matrix
5.6.3.9 GloVe Training
5.6.3.10 GloVe Vector Similarity
5.6.3.11 Using the GloVe Package
5.6.4 Document Clustering
5.6.4.1 Document Vectors
5.6.5 Word Sense Disambiguation
5.6.5.1 Supervised Disambiguation Annotations
5.6.5.2 Training with word2vec
5.6.6 Exercises for Readers and Practitioners
References
6 Convolutional Neural Networks
6.1 Introduction
6.2 Basic Building Blocks of CNNs
6.2.1 Convolution and Correlation in Linear Time-Invariant Systems
6.2.1.1 Linear Time-Invariant Systems
6.2.1.2 The Convolution Operator and Its Properties
6.2.1.3 Cross-Correlation and Its Properties
6.2.2 Local Connectivity or Sparse Interactions
6.2.3 Parameter Sharing
6.2.4 Spatial Arrangement
6.2.5 Detector Using Nonlinearity
6.2.6 Pooling and Subsampling
6.2.6.1 Max Pooling
6.2.6.2 Average Pooling
6.2.6.3 L2-Norm Pooling
6.2.6.4 Stochastic Pooling
6.2.6.5 Spectral Pooling
6.3 Forward and Backpropagation in CNN
6.3.1 Gradient with Respect to the Weights ∂E/∂W
6.3.2 Gradient with Respect to the Inputs ∂E/∂X
6.3.3 Max Pooling Layer
6.4 Text Inputs and CNNs
6.4.1 Word Embeddings and CNNs
6.4.2 Character-Based Representations and CNNs
6.5 Classic CNN Architectures
6.5.1 LeNet-5
6.5.2 AlexNet
6.5.3 VGG-16
6.6 Modern CNN Architectures
6.6.1 Stacked or Hierarchical CNN
6.6.2 Dilated CNN
6.6.3 Inception Networks
6.6.4 Other CNN Structures
6.7 Applications of CNNs in NLP
6.7.1 Text Classification and Categorization
6.7.2 Text Clustering and Topic Mining
6.7.3 Syntactic Parsing
6.7.4 Information Extraction
6.7.5 Machine Translation
6.7.6 Summarization
6.7.7 Question Answering
6.8 Fast Algorithms for Convolutions
6.8.1 Convolution Theorem and Fast Fourier Transform
6.8.2 Fast Filtering Algorithm
6.9 Case Study
6.9.1 Software Tools and Libraries
6.9.2 Exploratory Data Analysis
6.9.3 Data Preprocessing and Data Splits
6.9.4 CNN Model Experiments
6.9.5 Understanding and Improving the Models
6.9.6 Exercises for Readers and Practitioners
6.10 Discussion
References
7 Recurrent Neural Networks
7.1 Introduction
7.2 Basic Building Blocks of RNNs
7.2.1 Recurrence and Memory
7.2.2 PyTorch Example
7.3 RNNs and Properties
7.3.1 Forward and Backpropagation in RNNs
7.3.1.1 Output Weights (V)
7.3.1.2 Recurrent Weights (W)
7.3.1.3 Input Weights (U)
7.3.1.4 Aggregate Gradient
7.3.2 Vanishing Gradient Problem and Regularization
7.3.2.1 Long Short-Term Memory
7.3.2.2 Gated Recurrent Unit
7.3.2.3 Gradient Clipping
7.3.2.4 BPTT Sequence Length
7.3.2.5 Recurrent Dropout
7.4 Deep RNN Architectures
7.4.1 Deep RNNs
7.4.2 Residual LSTM
7.4.3 Recurrent Highway Networks
7.4.4 Bidirectional RNNs
7.4.5 SRU and Quasi-RNN
7.4.6 Recursive Neural Networks
7.5 Extensions of Recurrent Networks
7.5.1 Sequence-to-Sequence
7.5.2 Attention
7.5.3 Pointer Networks
7.5.4 Transformer Networks
7.6 Applications of RNNs in NLP
7.6.1 Text Classification
7.6.2 Part-of-Speech Tagging and Named Entity Recognition
7.6.3 Dependency Parsing
7.6.4 Topic Modeling and Summarization
7.6.5 Question Answering
7.6.6 Multi-Modal
7.6.7 Language Models
7.6.7.1 Perplexity
7.6.7.2 Recurrent Variational Autoencoder
7.6.8 Neural Machine Translation
7.6.8.1 BLEU
7.6.9 Prediction/Sampling Output
7.6.9.1 Greedy Search
7.6.9.2 Random Sampling and Temperature Sampling
7.6.9.3 Optimizing Output: Beam Search Decoding
7.7 Case Study
7.7.1 Software Tools and Libraries
7.7.2 Exploratory Data Analysis
7.7.2.1 Sequence Length Filtering
7.7.2.2 Vocabulary Inspection
7.7.3 Model Training
7.7.3.1 RNN Baseline
7.7.3.2 RNN, LSTM, and GRU Comparison
7.7.3.3 RNN, LSTM, and GRU Layer Depth Comparison
7.7.3.4 Bidirectional RNN, LSTM, and GRU Comparison
7.7.3.5 Deep Bidirectional Comparison
7.7.3.6 Transformer Network
7.7.3.7 Comparison of Experiments
7.7.4 Results
7.7.5 Exercises for Readers and Practitioners
7.8 Discussion
7.8.1 Memorization or Generalization
7.8.2 Future of RNNs
References
8 Automatic Speech Recognition
8.1 Introduction
8.2 Acoustic Features
8.2.1 Speech Production
8.2.2 Raw Waveform
8.2.3 MFCC
8.2.3.1 Pre-emphasis
8.2.3.2 Framing
8.2.3.3 Windowing
8.2.3.4 Fast Fourier Transform
8.2.3.5 Mel Filter Bank
8.2.3.6 Discrete Cosine Transform
8.2.3.7 Delta Energy and Delta Spectrum
8.2.4 Other Feature Types
8.2.4.1 Automatically Learned
8.3 Phones
8.4 Statistical Speech Recognition
8.4.1 Acoustic Model: P(X|W)
8.4.1.1 Lexicon Model: P(S|W)
8.4.2 Language Model: P(W)
8.4.3 HMM Decoding
8.5 Error Metrics
8.6 DNN/HMM Hybrid Model
8.7 Case Study
8.7.1 Dataset: Common Voice
8.7.2 Software Tools and Libraries
8.7.3 Sphinx
8.7.3.1 Data Preparation
8.7.3.2 Model Training
8.7.4 Kaldi
8.7.4.1 Data Preparation
8.7.4.2 Model Training
8.7.5 Results
8.7.6 Exercises for Readers and Practitioners
References
Part III Advanced Deep Learning Techniques for Text and Speech
9 Attention and Memory Augmented Networks
9.1 Introduction
9.2 Attention Mechanism
9.2.1 The Need for an Attention Mechanism
9.2.2 Soft Attention
9.2.3 Score-Based Attention
9.2.4 Soft vs. Hard Attention
9.2.5 Local vs. Global Attention
9.2.6 Self-Attention
9.2.7 Key-Value Attention
9.2.8 Multi-Head Self-Attention
9.2.9 Hierarchical Attention
9.2.10 Applications of Attention Mechanisms in Text and Speech
9.3 Memory Augmented Networks
9.3.1 Memory Networks
9.3.2 End-to-End Memory Networks
9.3.2.1 Single-Layer MemN2N
9.3.2.2 Input and Query
9.3.2.3 Controller and Memory
9.3.2.4 Controller and Output
9.3.2.5 Final Prediction and Learning
9.3.2.6 Multiple Layers
9.3.3 Neural Turing Machines
9.3.3.1 Read Operations
9.3.3.2 Write Operations
9.3.3.3 Addressing Mechanism
9.3.4 Differentiable Neural Computer
9.3.4.1 Input and Outputs
9.3.4.2 Memory Reads and Writes
9.3.4.3 Selective Attention
9.3.5 Dynamic Memory Networks
9.3.5.1 Input Module
9.3.5.2 Question Module
9.3.5.3 Episodic Memory Module
9.3.5.4 Answer Module
9.3.5.5 Training
9.3.6 Neural Stacks, Queues, and Deques
9.3.6.1 Neural Stack
9.3.6.2 Recurrent Networks, Controller, and Training
9.3.7 Recurrent Entity Networks
9.3.7.1 Input Encoder
9.3.7.2 Dynamic Memory
9.3.7.3 Output Module and Training
9.3.8 Applications of Memory Augmented Networks in Text and Speech
9.4 Case Study
9.4.1 Attention-Based NMT
9.4.2 Exploratory Data Analysis
9.4.2.1 Software Tools and Libraries
9.4.2.2 Model Training
9.4.2.3 Bahdanau Attention
9.4.2.4 Results
9.4.3 Question Answering
9.4.3.1 Software Tools and Libraries
9.4.3.2 Exploratory Data Analysis
9.4.3.3 LSTM Baseline
9.4.3.4 End-to-End Memory Network
9.4.4 Dynamic Memory Network
9.4.4.1 Differentiable Neural Computer
9.4.4.2 Recurrent Entity Network
9.4.5 Exercises for Readers and Practitioners
References
10 Transfer Learning: Scenarios, Self-Taught Learning, and Multitask Learning
10.1 Introduction
10.2 Transfer Learning: Definition, Scenarios, and Categorization
10.2.1 Definition
10.2.2 Transfer Learning Scenarios
10.2.3 Transfer Learning Categories
10.3 Self-Taught Learning
10.3.1 Techniques
10.3.1.1 Unsupervised Pre-Training and Supervised Fine-Tuning
10.3.2 Theory
10.3.3 Applications in NLP
10.3.4 Applications in Speech
10.4 Multitask Learning
10.4.1 Techniques
10.4.1.1 Multilinear Relationship Network
10.4.1.2 Fully Adaptive Feature Sharing Network
10.4.1.3 Cross-Stitch Networks
10.4.1.4 A Joint Many-Task Network
10.4.1.5 Sluice Networks
10.4.2 Theory
10.4.3 Applications in NLP
10.4.4 Applications in Speech Recognition
10.5 Case Study
10.5.1 Software Tools and Libraries
10.5.2 Exploratory Data Analysis
10.5.3 Multitask Learning Experiments and Analysis
10.5.4 Exercises for Readers and Practitioners
References
11 Transfer Learning: Domain Adaptation
11.1 Introduction
11.1.1 Techniques
11.1.1.1 Stacked Autoencoders
11.1.1.2 Deep Interpolation Between Source and Target
11.1.1.3 Deep Domain Confusion
11.1.1.4 Deep Adaptation Network
11.1.1.5 Domain-Invariant Representation
11.1.1.6 Domain Confusion and Invariant Representation
11.1.1.7 Domain-Adversarial Neural Network
11.1.1.8 Adversarial Discriminative Domain Adaptation
11.1.1.9 Coupled Generative Adversarial Networks
11.1.1.10 Cycle Generative Adversarial Networks
11.1.1.11 Domain Separation Networks
11.1.2 Theory
11.1.2.1 Siamese Networks Based Domain Adaptations
11.1.2.2 Optimal Transport
11.1.3 Applications in NLP
11.1.4 Applications in Speech Recognition
11.2 Zero-Shot, One-Shot, and Few-Shot Learning
11.2.1 Zero-Shot Learning
11.2.1.1 Techniques
11.2.2 One-Shot Learning
11.2.2.1 Techniques
11.2.3 Few-Shot Learning
11.2.3.1 Techniques
11.2.4 Theory
11.2.5 Applications in NLP and Speech Recognition
11.3 Case Study
11.3.1 Software Tools and Libraries
11.3.2 Exploratory Data Analysis
11.3.3 Domain Adaptation Experiments
11.3.3.1 Preprocessing
11.3.3.2 Experiments
11.3.3.3 Results and Analysis
11.3.4 Exercises for Readers and Practitioners
References
12 End-to-End Speech Recognition
12.1 Introduction
12.2 Connectionist Temporal Classification (CTC)
12.2.1 End-to-End Phoneme Recognition
12.2.2 Deep Speech
12.2.2.1 GPU Parallelism
12.2.3 Deep Speech 2
12.2.4 Wav2Letter
12.2.5 Extensions of CTC
12.2.5.1 Gram-CTC
12.2.5.2 RNN Transducer
12.3 Seq-to-Seq
12.3.0.1 Content-Based Attention
12.3.0.2 Location-Aware Attention
12.3.1 Early Seq-to-Seq ASR
12.3.2 Listen, Attend, and Spell (LAS)
12.4 Multitask Learning
12.5 End-to-End Decoding
12.5.1 Language Models for ASR
12.5.1.1 N-gram
12.5.1.2 RNN Language Models
12.5.2 CTC Decoding
12.5.3 Attention Decoding
12.5.3.1 Shallow Fusion
12.5.4 Combined Language Model Training
12.5.4.1 Deep Fusion
12.5.4.2 Cold Fusion
12.5.5 Combined CTC–Attention Decoding
12.5.5.1 Rescoring
12.5.6 One-Pass Decoding
12.6 Speech Embeddings and Unsupervised Speech Recognition
12.6.1 Speech Embeddings
12.6.2 Unspeech
12.6.3 Audio Word2Vec
12.7 Case Study
12.7.1 Software Tools and Libraries
12.7.2 Deep Speech 2
12.7.2.1 Data Preparation
12.7.2.2 Acoustic Model Training
12.7.3 Language Model Training
12.7.4 ESPnet
12.7.4.1 Data Preparation
12.7.4.2 Model Training
12.7.5 Results
12.7.6 Exercises for Readers and Practitioners
References
13 Deep Reinforcement Learning for Text and Speech
13.1 Introduction
13.2 RL Fundamentals
13.2.1 Markov Decision Processes
13.2.2 Value, Q, and Advantage Functions
13.2.3 Bellman Equations
13.2.4 Optimality
13.2.5 Dynamic Programming Methods
13.2.5.1 Policy Evaluation
13.2.5.2 Policy Improvement
13.2.5.3 Value Iteration
13.2.5.4 Bootstrapping
13.2.5.5 Asynchronous DP
13.2.6 Monte Carlo
13.2.6.1 Importance Sampling
13.2.7 Temporal Difference Learning
13.2.7.1 SARSA
13.2.8 Policy Gradient
13.2.9 Q-Learning
13.2.10 Actor-Critic
13.2.10.1 Advantage Actor-Critic (A2C)
13.2.10.2 Asynchronous Advantage Actor-Critic (A3C)
13.3 Deep Reinforcement Learning Algorithms
13.3.1 Why RL for Seq2Seq
13.3.2 Deep Policy Gradient
13.3.3 Deep Q-Learning
13.3.3.1 DQN
13.3.3.2 Double DQN
13.3.3.3 Dueling Networks
13.3.4 Deep Advantage Actor-Critic
13.4 DRL for Text
13.4.1 Information Extraction
13.4.1.1 Entity Extraction
13.4.1.2 Relation Extraction
13.4.1.3 Action Extraction
13.4.1.4 Joint Entity/Relation Extraction
13.4.2 Text Classification
13.4.3 Dialogue Systems
13.4.4 Text Summarization
13.4.5 Machine Translation
13.5 DRL for Speech
13.5.1 Automatic Speech Recognition
13.5.2 Speech Enhancement and Noise Suppression
13.6 Case Study
13.6.1 Software Tools and Libraries
13.6.2 Text Summarization
13.6.3 Exploratory Data Analysis
13.6.3.1 Seq2Seq Model
13.6.3.2 Policy Gradient
13.6.3.3 DDQN
13.6.4 Exercises for Readers and Practitioners
References
Future Outlook
End-to-End Architecture Prevalence
Transition to AI-Centric
Specialized Hardware
Transition Away from Supervised Learning
Explainable AI
Model Development and Deployment Process
Democratization of AI
NLP Trends
Speech Trends
Closing Remarks
Index