Spoken Language Processing: A Guide to Theory, Algorithm and System Development

Cover
Table of Contents
Foreword
Preface
Introduction
1.1. MOTIVATIONS
1.1.1. Spoken Language Interface
1.1.2. Speech-to-Speech Translation
1.1.3. Knowledge Partners
1.2. SPOKEN LANGUAGE SYSTEM ARCHITECTURE
1.2.1. Automatic Speech Recognition
1.2.2. Text-to-Speech Conversion
1.2.3. Spoken Language Understanding
1.3. BOOK ORGANIZATION
1.3.1. Part I: Fundamental Theory
1.3.2. Part II: Speech Processing
1.3.3. Part III: Speech Recognition
1.3.4. Part IV: Text-to-Speech Systems
1.3.5. Part V: Spoken Language Systems
1.4. TARGET AUDIENCES
1.5. HISTORICAL PERSPECTIVE AND FURTHER READING
REFERENCES
PART I: FUNDAMENTAL THEORY
Spoken Language Structure
2.1. SOUND AND HUMAN SPEECH SYSTEMS
2.1.1. Sound
2.1.2. Speech Production
2.1.3. Speech Perception
2.2. PHONETICS AND PHONOLOGY
2.2.1. Phonemes
2.2.2. The Allophone: Sound and Context
2.2.3. Speech Rate and Coarticulation
2.3. SYLLABLES AND WORDS
2.3.1. Syllables
2.3.2. Words
2.4. SYNTAX AND SEMANTICS
2.4.1. Syntactic Constituents
2.4.2. Semantic Roles
2.4.3. Lexical Semantics
2.4.4. Logical Form
2.5. HISTORICAL PERSPECTIVE AND FURTHER READING
REFERENCES
Probability, Statistics, and Information Theory
3.1. PROBABILITY THEORY
3.1.1. Conditional Probability And Bayes' Rule
3.1.2. Random Variables
3.1.3. Mean and Variance
3.1.4. Covariance and Correlation
3.1.5. Random Vectors and Multivariate Distributions
3.1.6. Some Useful Distributions
3.1.7. Gaussian Distributions
3.2. ESTIMATION THEORY
3.2.1. Minimum/Least Mean Squared Error Estimation
3.2.2. Maximum Likelihood Estimation
3.2.3. Bayesian Estimation and MAP Estimation
3.3. SIGNIFICANCE TESTING
3.3.1. Level of Significance
3.3.2. Normal Test (Z-Test)
3.3.3. χ² Goodness-of-Fit Test
3.3.4. Matched-Pairs Test
3.4. INFORMATION THEORY
3.4.1. Entropy
3.4.2. Conditional Entropy
3.4.3. The Source Coding Theorem
3.4.4. Mutual Information and Channel Coding
3.5. HISTORICAL PERSPECTIVE AND FURTHER READING
REFERENCES
Pattern Recognition
4.1. BAYES DECISION THEORY
4.1.1. Minimum-Error-Rate Decision Rules
4.1.2. Discriminant Functions
4.2. HOW TO CONSTRUCT CLASSIFIERS
4.2.1. Gaussian Classifiers
4.2.2. The Curse of Dimensionality
4.2.3. Estimating the Error Rate
4.2.4. Comparing Classifiers
4.3. DISCRIMINATIVE TRAINING
4.3.1. Maximum Mutual Information Estimation
4.3.2. Minimum-Error-Rate Estimation
4.3.3. Neural Networks
4.4. UNSUPERVISED ESTIMATION METHODS
4.4.1. Vector Quantization
4.4.2. The EM Algorithm
4.4.3. Multivariate Gaussian Mixture Density Estimation
4.5. CLASSIFICATION AND REGRESSION TREES
4.5.1. Choice of Question Set
4.5.2. Splitting Criteria
4.5.3. Growing the Tree
4.5.4. Missing Values and Conflict Resolution
4.5.5. Complex Questions
4.5.6. The Right- Sized Tree
4.6. HISTORICAL PERSPECTIVE AND FURTHER READING
REFERENCES
PART II: SPEECH PROCESSING
Digital Signal Processing
5.1. DIGITAL SIGNALS AND SYSTEMS
5.1.1. Sinusoidal Signals
5.1.2. Other Digital Signals
5.1.3. Digital Systems
5.2. CONTINUOUS-FREQUENCY TRANSFORMS
5.2.1. The Fourier Transform
5.2.2. Z-Transform
5.2.3. Z-Transforms of Elementary Functions
5.2.4. Properties of the Z and Fourier Transform
5.3. DISCRETE-FREQUENCY TRANSFORMS
5.3.1. The Discrete Fourier Transform (DFT)
5.3.2. Fourier Transforms of Periodic Signals
5.3.3. The Fast Fourier Transform (FFT)
5.3.4. Circular Convolution
5.3.5. The Discrete Cosine Transform (DCT)
5.4. DIGITAL FILTERS AND WINDOWS
5.4.1. The Ideal Low-Pass Filter
5.4.2. Window Functions
5.4.3. FIR Filters
5.4.4. IIR Filters
5.5. DIGITAL PROCESSING OF ANALOG SIGNALS
5.5.1. Fourier Transform of Analog Signals
5.5.2. The Sampling Theorem
5.5.3. Analog-to-Digital Conversion
5.5.4. Digital-to-Analog Conversion
5.6. MULTIRATE SIGNAL PROCESSING
5.6.1. Decimation
5.6.2. Interpolation
5.6.3. Resampling
5.7. FILTERBANKS
5.7.1. Two-Band Conjugate Quadrature Filters
5.7.2. Multiresolution Filterbanks
5.7.3. The FFT as a Filterbank
5.7.4. Modulated Lapped Transforms
5.8. STOCHASTIC PROCESSES
5.8.1. Statistics of Stochastic Processes
5.8.2. Stationary Processes
5.8.3. LTI Systems with Stochastic Inputs
5.8.4. Power Spectral Density
5.8.5. Noise
5.9. HISTORICAL PERSPECTIVE AND FURTHER READING
REFERENCES
Speech Signal Representations
6.1. SHORT-TIME FOURIER ANALYSIS
6.1.1. Spectrograms
6.1.2. Pitch-Synchronous Analysis
6.2. ACOUSTICAL MODEL OF SPEECH PRODUCTION
6.2.1. Glottal Excitation
6.2.2. Lossless Tube Concatenation
6.2.3. Source-Filter Models of Speech Production
6.3. LINEAR PREDICTIVE CODING
6.3.1. The Orthogonality Principle
6.3.2. Solution of the LPC Equations
6.3.3. Spectral Analysis via LPC
6.3.4. The Prediction Error
6.3.5. Equivalent Representations
6.4. CEPSTRAL PROCESSING
6.4.1. The Real and Complex Cepstrum
6.4.2. Cepstrum of Pole-Zero Filters
6.4.3. Cepstrum of Periodic Signals
6.4.4. Cepstrum of Speech Signals
6.4.5. Source-Filter Separation via the Cepstrum
6.5. PERCEPTUALLY-MOTIVATED REPRESENTATIONS
6.5.1. The Bilinear Transform
6.5.2. Mel-Frequency Cepstrum
6.5.3. Perceptual Linear Prediction (PLP)
6.6. FORMANT FREQUENCIES
6.6.1. Statistical Formant Tracking
6.7. THE ROLE OF PITCH
6.7.1. Autocorrelation Method
6.7.2. Normalized Cross-Correlation Method
6.7.3. Signal Conditioning
6.7.4. Pitch Tracking
6.8. HISTORICAL PERSPECTIVE AND FURTHER READING
REFERENCES
Speech Coding
7.1. SPEECH CODER ATTRIBUTES
7.2. SCALAR WAVEFORM CODERS
7.2.1. Linear Pulse Code Modulation (PCM)
7.2.2. µ-law and A-law PCM
7.2.3. Adaptive PCM
7.2.4. Differential Quantization
7.3. SCALAR FREQUENCY DOMAIN CODERS
7.3.1. Benefits of Masking
7.3.2. Transform Coders
7.3.3. Consumer Audio
7.3.4. Digital Audio Broadcasting (DAB)
7.4. CODE EXCITED LINEAR PREDICTION (CELP)
7.4.1. LPC Vocoder
7.4.2. Analysis by Synthesis
7.4.3. Pitch Prediction: Adaptive Codebook
7.4.4. Perceptual Weighting and Postfiltering
7.4.5. Parameter Quantization
7.4.6. CELP Standards
7.5. LOW-BIT RATE SPEECH CODERS
7.5.1. Mixed-Excitation LPC Vocoder
7.5.2. Harmonic Coding
7.5.3. Waveform Interpolation
7.6. HISTORICAL PERSPECTIVE AND FURTHER READING
REFERENCES
PART III: SPEECH RECOGNITION
Hidden Markov Models
8.1. THE MARKOV CHAIN
8.2. DEFINITION OF THE HIDDEN MARKOV MODEL
8.2.1. Dynamic Programming and DTW
8.2.2. How to Evaluate an HMM – The Forward Algorithm
8.2.3. How to Decode an HMM - The Viterbi Algorithm
8.2.4. How to Estimate HMM Parameters – Baum-Welch Algorithm
8.3. CONTINUOUS AND SEMI-CONTINUOUS HMMS
8.3.1. Continuous Mixture Density HMMs
8.3.2. Semi-continuous HMMs
8.4. PRACTICAL ISSUES IN USING HMMS
8.4.1. Initial Estimates
8.4.2. Model Topology
8.4.3. Training Criteria
8.4.4. Deleted Interpolation
8.4.5. Parameter Smoothing
8.4.6. Probability Representations
8.5. HMM LIMITATIONS
8.5.1. Duration Modeling
8.5.2. First-Order Assumption
8.5.3. Conditional Independence Assumption
8.6. HISTORICAL PERSPECTIVE AND FURTHER READING
REFERENCES
Acoustic Modeling
9.1. VARIABILITY IN THE SPEECH SIGNAL
9.1.1. Context Variability
9.1.2. Style Variability
9.1.3. Speaker Variability
9.1.4. Environment Variability
9.2. HOW TO MEASURE SPEECH RECOGNITION ERRORS
9.3. SIGNAL PROCESSING—EXTRACTING FEATURES
9.3.1. Signal Acquisition
9.3.2. End-Point Detection
9.3.3. MFCC and Its Dynamic Features
9.3.4. Feature Transformation
9.4. PHONETIC MODELING—SELECTING APPROPRIATE UNITS
9.4.1. Comparison of Different Units
9.4.2. Context Dependency
9.4.3. Clustered Acoustic-Phonetic Units
9.4.4. Lexical Baseforms
9.5. ACOUSTIC MODELING—SCORING ACOUSTIC FEATURES
9.5.1. Choice of HMM Output Distributions
9.5.2. Isolated vs. Continuous Speech Training
9.6. ADAPTIVE TECHNIQUES—MINIMIZING MISMATCHES
9.6.1. Maximum a Posteriori (MAP)
9.6.2. Maximum Likelihood Linear Regression (MLLR)
9.6.3. MLLR and MAP Comparison
9.6.4. Clustered Models
9.7. CONFIDENCE MEASURES: MEASURING THE RELIABILITY
9.7.1. Filler Models
9.7.2. Transformation Models
9.7.3. Combination Models
9.8. OTHER TECHNIQUES
9.8.1. Neural Networks
9.8.2. Segment Models
9.9. CASE STUDY: WHISPER
9.10. HISTORICAL PERSPECTIVE AND FURTHER READING
REFERENCES
Environmental Robustness
10.1. THE ACOUSTICAL ENVIRONMENT
10.1.1. Additive Noise
10.1.2. Reverberation
10.1.3. A Model of the Environment
10.2. ACOUSTICAL TRANSDUCERS
10.2.2. Directionality Patterns
10.2.3. Other Transduction Categories
10.3. ADAPTIVE ECHO CANCELLATION (AEC)
10.3.1. The LMS Algorithm
10.3.2. Convergence Properties of the LMS Algorithm
10.3.3. Normalized LMS Algorithm
10.3.4. Transform-Domain LMS Algorithm
10.3.5. The RLS Algorithm
10.4. MULTIMICROPHONE SPEECH ENHANCEMENT
10.4.1. Microphone Arrays
10.4.2. Blind Source Separation
10.5. ENVIRONMENT COMPENSATION PREPROCESSING
10.5.1. Spectral Subtraction
10.5.2. Frequency-Domain MMSE from Stereo Data
10.5.3. Wiener Filtering
10.5.4. Cepstral Mean Normalization (CMN)
10.5.5. Real-Time Cepstral Normalization
10.5.6. The Use of Gaussian Mixture Models
10.6. ENVIRONMENTAL MODEL ADAPTATION
10.6.1. Retraining on Corrupted Speech
10.6.2. Model Adaptation
10.6.3. Parallel Model Combination
10.6.4. Vector Taylor Series
10.6.5. Retraining on Compensated Features
10.7. MODELING NONSTATIONARY NOISE
10.8. HISTORICAL PERSPECTIVE AND FURTHER READING
REFERENCES
Language Modeling
11.1. FORMAL LANGUAGE THEORY
11.1.1. Chomsky Hierarchy
11.1.2. Chart Parsing for Context-Free Grammars
11.2. STOCHASTIC LANGUAGE MODELS
11.2.1. Probabilistic Context-Free Grammars
11.2.2. N-gram Language Models
11.3. COMPLEXITY MEASURE OF LANGUAGE MODELS
11.4. N-GRAM SMOOTHING
11.4.1. Deleted Interpolation Smoothing
11.4.2. Backoff Smoothing
11.4.3. Class n-grams
11.4.4. Performance of n-gram Smoothing
11.5. ADAPTIVE LANGUAGE MODELS
11.5.1. Cache Language Models
11.5.2. Topic-Adaptive Models
11.5.3. Maximum Entropy Models
11.6. PRACTICAL ISSUES
11.6.1. Vocabulary Selection
11.6.2. N-gram Pruning
11.6.3. CFG vs. n-gram Models
11.7. HISTORICAL PERSPECTIVE AND FURTHER READING
REFERENCES
Basic Search Algorithms
12.1. BASIC SEARCH ALGORITHMS
12.1.1. General Graph Searching Procedures
12.1.2. Blind Graph Search Algorithms
12.1.3. Heuristic Graph Search
12.2. SEARCH ALGORITHMS FOR SPEECH RECOGNITION
12.2.1. Decoder Basics
12.2.2. Combining Acoustic And Language Models
12.2.3. Isolated Word Recognition
12.2.4. Continuous Speech Recognition
12.3. LANGUAGE MODEL STATES
12.3.1. Search Space with FSM and CFG
12.3.2. Search Space with the Unigram
12.3.3. Search Space with Bigrams
12.3.4. Search Space with Trigrams
12.3.5. How to Handle Silences Between Words
12.4. TIME-SYNCHRONOUS VITERBI BEAM SEARCH
12.4.1. The Use of Beam
12.4.2. Viterbi Beam Search
12.5. STACK DECODING (A* SEARCH)
12.5.1. Admissible Heuristics for Remaining Path
12.5.2. When to Extend New Words
12.5.3. Fast Match
12.5.4. Stack Pruning
12.5.5. Multistack Search
12.6. HISTORICAL PERSPECTIVE AND FURTHER READING
REFERENCES
Large Vocabulary Search Algorithms
13.1. EFFICIENT MANIPULATION OF A TREE LEXICON
13.1.1. Lexical Tree
13.1.2. Multiple Copies of Pronunciation Trees
13.1.3. Factored Language Probabilities
13.1.4. Optimization of Lexical Trees
13.1.5. Exploiting Subtree Polymorphism
13.1.6. Context-Dependent Units and Inter-Word Triphones
13.2. OTHER EFFICIENT SEARCH TECHNIQUES
13.2.1. Using Entire HMM as a State in Search
13.2.2. Different Layers of Beams
13.2.3. Fast Match
13.3. N-BEST AND MULTIPASS SEARCH STRATEGIES
13.3.1. N-Best Lists and Word Lattices
13.3.2. The Exact N-best Algorithm
13.3.3. Word-Dependent N-Best and Word-Lattice Algorithm
13.3.4. The Forward-Backward Search Algorithm
13.3.5. One- Pass vs. Multipass Search
13.4. SEARCH- ALGORITHM EVALUATION
13.5. CASE STUDY—MICROSOFT WHISPER
13.5.1. The CFG Search Architecture
13.5.2. The N- Gram Search Architecture
13.6. HISTORICAL PERSPECTIVE AND FURTHER READING
REFERENCES
PART IV: TEXT-TO-SPEECH SYSTEMS
Text and Phonetic Analysis
14.1. MODULES AND DATA FLOW
14.1.1. Modules
14.1.2. Data Flows
14.1.3. Localization Issues
14.2. LEXICON
14.3. DOCUMENT STRUCTURE DETECTION
14.3.1. Chapter and Section Headers
14.3.2. Lists
14.3.3. Paragraphs
14.3.4. Sentences
14.3.5. E-mail
14.3.6. Web Pages
14.3.7. Dialog Turns and Speech Acts
14.4. TEXT NORMALIZATION
14.4.1. Abbreviations and Acronyms
14.4.2. Number Formats
14.4.3. Domain- Specific Tags
14.4.4. Miscellaneous Formats
14.5. LINGUISTIC ANALYSIS
14.6. HOMOGRAPH DISAMBIGUATION
14.7. MORPHOLOGICAL ANALYSIS
14.8. LETTER-TO-SOUND CONVERSION
14.9. EVALUATION
14.10. CASE STUDY: FESTIVAL
14.10.1. Lexicon
14.10.2. Text Analysis
14.10.3. Phonetic Analysis
14.11. HISTORICAL PERSPECTIVE AND FURTHER READING
REFERENCES
Prosody
15.1. THE ROLE OF UNDERSTANDING
15.2. PROSODY GENERATION SCHEMATIC
15.3. SPEAKING STYLE
15.3.1. Character
15.3.2. Emotion
15.4. SYMBOLIC PROSODY
15.4.1. Pauses
15.4.2. Prosodic Phrases
15.4.3. Accent
15.4.4. Tone
15.4.5. Tune
15.4.6. Prosodic Transcription Systems
15.5. DURATION ASSIGNMENT
15.5.1. Rule- Based Methods
15.5.2. CART- Based Durations
15.6. PITCH GENERATION
15.6.1. Attributes of Pitch Contours
15.6.2. Baseline F0 Contour Generation
15.6.3. Parametric F0 Generation
15.6.4. Corpus- Based F0 Generation
15.7. PROSODY MARKUP LANGUAGES
15.8. PROSODY EVALUATION
15.9. HISTORICAL PERSPECTIVE AND FURTHER READING
REFERENCES
Speech Synthesis
16.1. ATTRIBUTES OF SPEECH SYNTHESIS
16.2. FORMANT SPEECH SYNTHESIS
16.2.1. Waveform Generation from Formant Values
16.2.2. Formant Generation by Rule
16.2.3. Data- Driven Formant Generation
16.2.4. Articulatory Synthesis
16.3. CONCATENATIVE SPEECH SYNTHESIS
16.3.1. Choice of Unit
16.3.2. Optimal Unit String: The Decoding Process
16.3.3. Unit Inventory Design
16.4. PROSODIC MODIFICATION OF SPEECH
16.4.1. Synchronous Overlap and Add (SOLA)
16.4.2. Pitch Synchronous Overlap and Add (PSOLA)
16.4.3. Spectral Behavior of PSOLA
16.4.4. Synthesis Epoch Calculation
16.4.5. Pitch-Scale Modification Epoch Calculation
16.4.6. Time-Scale Modification Epoch Calculation
16.4.7. Pitch-Scale Time-Scale Epoch Calculation
16.4.8. Waveform Mapping
16.4.9. Epoch Detection
16.4.10. Problems with PSOLA
16.5. SOURCE-FILTER MODELS FOR PROSODY MODIFICATION
16.5.1. Prosody Modification of the LPC Residual
16.5.2. Mixed Excitation Models
16.5.3. Voice Effects
16.6. EVALUATION OF TTS SYSTEMS
16.6.1. Intelligibility Tests
16.6.2. Overall Quality Tests
16.6.3. Preference Tests
16.6.4. Functional Tests
16.6.5. Automated Tests
16.7. HISTORICAL PERSPECTIVE AND FURTHER READING
REFERENCES
PART V: SPOKEN LANGUAGE SYSTEMS
Spoken Language Understanding
17.1. WRITTEN VS. SPOKEN LANGUAGES
17.1.1. Style
17.1.2. Disfluency
17.1.3. Communicative Prosody
17.2. DIALOG STRUCTURE
17.2.1. Units of Dialog
17.2.2. Dialog (Speech) Acts
17.2.3. Dialog Control
17.3. SEMANTIC REPRESENTATION
17.3.1. Semantic Frames
17.3.2. Conceptual Graphs
17.4. SENTENCE INTERPRETATION
17.4.1. Robust Parsing
17.4.2. Statistical Pattern Matching
17.5. DISCOURSE ANALYSIS
17.5.1. Resolution of Relative Expressions
17.5.2. Automatic Inference and Inconsistency Detection
17.6. DIALOG MANAGEMENT
17.6.1. Dialog Grammars
17.6.2. Plan- Based Systems
17.6.3. Dialog Behavior
17.7. RESPONSE GENERATION AND RENDITION
17.7.1. Response Content Generation
17.7.2. Concept-to-Speech Rendition
17.7.3. Other Renditions
17.8. EVALUATION
17.8.1. Evaluation in the ATIS Task
17.8.2. PARADISE Framework
17.9. CASE STUDY—DR. WHO
17.9.1. Semantic Representation
17.9.2. Semantic Parser (Sentence Interpretation)
17.9.3. Discourse Analysis
17.9.4. Dialog Manager
17.10. HISTORICAL PERSPECTIVE AND FURTHER READING
REFERENCES
Applications and User Interfaces
18.1. APPLICATION ARCHITECTURE
18.2. TYPICAL APPLICATIONS
18.2.1. Computer Command and Control
18.2.2. Telephony Applications
18.2.3. Dictation
18.2.4. Accessibility
18.2.5. Handheld Devices
18.2.6. Automobile Applications
18.2.7. Speaker Recognition
18.3. SPEECH INTERFACE DESIGN
18.3.1. General Principles
18.3.2. Handling Errors
18.3.3. Other Considerations
18.3.4. Dialog Flow
18.4. INTERNATIONALIZATION
18.5. CASE STUDY—MIPAD
18.5.1. Specifying the Application
18.5.2. Rapid Prototyping
18.5.3. Evaluation
18.5.4. Iterations
18.6. HISTORICAL PERSPECTIVE AND FURTHER READING
REFERENCES
Index