Cover
Table of Contents
Foreword
Preface
Introduction
1.1. MOTIVATIONS
1.1.1. Spoken Language Interface
1.1.2. Speech-to-Speech Translation
1.1.3. Knowledge Partners
1.2. SPOKEN LANGUAGE SYSTEM ARCHITECTURE
1.2.1. Automatic Speech Recognition
1.2.2. Text-to-Speech Conversion
1.2.3. Spoken Language Understanding
1.3. BOOK ORGANIZATION
1.3.1. Part I: Fundamental Theory
1.3.2. Part II: Speech Processing
1.3.3. Part III: Speech Recognition
1.3.4. Part IV: Text-to-Speech Systems
1.3.5. Part V: Spoken Language Systems
1.4. TARGET AUDIENCES
1.5. HISTORICAL PERSPECTIVE AND FURTHER READING
REFERENCES
PART I: FUNDAMENTAL THEORY
Spoken Language Structure
2.1. SOUND AND HUMAN SPEECH SYSTEMS
2.1.1. Sound
2.1.2. Speech Production
2.1.3. Speech Perception
2.2. PHONETICS AND PHONOLOGY
2.2.1. Phonemes
2.2.2. The Allophone: Sound and Context
2.2.3. Speech Rate and Coarticulation
2.3. SYLLABLES AND WORDS
2.3.1. Syllables
2.3.2. Words
2.4. SYNTAX AND SEMANTICS
2.4.1. Syntactic Constituents
2.4.2. Semantic Roles
2.4.3. Lexical Semantics
2.4.4. Logical Form
2.5. HISTORICAL PERSPECTIVE AND FURTHER READING
REFERENCES
Probability, Statistics, and Information Theory
3.1. PROBABILITY THEORY
3.1.1. Conditional Probability and Bayes' Rule
3.1.2. Random Variables
3.1.3. Mean and Variance
3.1.4. Covariance and Correlation
3.1.5. Random Vectors and Multivariate Distributions
3.1.6. Some Useful Distributions
3.1.7. Gaussian Distributions
3.2. ESTIMATION THEORY
3.2.1. Minimum/Least Mean Squared Error Estimation
3.2.2. Maximum Likelihood Estimation
3.2.3. Bayesian Estimation and MAP Estimation
3.3. SIGNIFICANCE TESTING
3.3.1. Level of Significance
3.3.2. Normal Test (Z-Test)
3.3.3. Goodness-of-Fit Test
3.3.4. Matched-Pairs Test
3.4. INFORMATION THEORY
3.4.1. Entropy
3.4.2. Conditional Entropy
3.4.3. The Source Coding Theorem
3.4.4. Mutual Information and Channel Coding
3.5. HISTORICAL PERSPECTIVE AND FURTHER READING
REFERENCES
Pattern Recognition
4.1. BAYES DECISION THEORY
4.1.1. Minimum-Error-Rate Decision Rules
4.1.2. Discriminant Functions
4.2. HOW TO CONSTRUCT CLASSIFIERS
4.2.1. Gaussian Classifiers
4.2.2. The Curse of Dimensionality
4.2.3. Estimating the Error Rate
4.2.4. Comparing Classifiers
4.3. DISCRIMINATIVE TRAINING
4.3.1. Maximum Mutual Information Estimation
4.3.2. Minimum-Error-Rate Estimation
4.3.3. Neural Networks
4.4. UNSUPERVISED ESTIMATION METHODS
4.4.1. Vector Quantization
4.4.2. The EM Algorithm
4.4.3. Multivariate Gaussian Mixture Density Estimation
4.5. CLASSIFICATION AND REGRESSION TREES
4.5.1. Choice of Question Set
4.5.2. Splitting Criteria
4.5.3. Growing the Tree
4.5.4. Missing Values and Conflict Resolution
4.5.5. Complex Questions
4.5.6. The Right-Sized Tree
4.6. HISTORICAL PERSPECTIVE AND FURTHER READING
REFERENCES
PART II: SPEECH PROCESSING
Digital Signal Processing
5.1. DIGITAL SIGNALS AND SYSTEMS
5.1.1. Sinusoidal Signals
5.1.2. Other Digital Signals
5.1.3. Digital Systems
5.2. CONTINUOUS-FREQUENCY TRANSFORMS
5.2.1. The Fourier Transform
5.2.2. Z-Transform
5.2.3. Z-Transforms of Elementary Functions
5.2.4. Properties of the Z and Fourier Transform
5.3. DISCRETE-FREQUENCY TRANSFORMS
5.3.1. The Discrete Fourier Transform (DFT)
5.3.2. Fourier Transforms of Periodic Signals
5.3.3. The Fast Fourier Transform (FFT)
5.3.4. Circular Convolution
5.3.5. The Discrete Cosine Transform (DCT)
5.4. DIGITAL FILTERS AND WINDOWS
5.4.1. The Ideal Low-Pass Filter
5.4.2. Window Functions
5.4.3. FIR Filters
5.4.4. IIR Filters
5.5. DIGITAL PROCESSING OF ANALOG SIGNALS
5.5.1. Fourier Transform of Analog Signals
5.5.2. The Sampling Theorem
5.5.3. Analog-to-Digital Conversion
5.5.4. Digital-to-Analog Conversion
5.6. MULTIRATE SIGNAL PROCESSING
5.6.1. Decimation
5.6.2. Interpolation
5.6.3. Resampling
5.7. FILTERBANKS
5.7.1. Two-Band Conjugate Quadrature Filters
5.7.2. Multiresolution Filterbanks
5.7.3. The FFT as a Filterbank
5.7.4. Modulated Lapped Transforms
5.8. STOCHASTIC PROCESSES
5.8.1. Statistics of Stochastic Processes
5.8.2. Stationary Processes
5.8.3. LTI Systems with Stochastic Inputs
5.8.4. Power Spectral Density
5.8.5. Noise
5.9. HISTORICAL PERSPECTIVE AND FURTHER READING
REFERENCES
Speech Signal Representations
6.1. SHORT-TIME FOURIER ANALYSIS
6.1.1. Spectrograms
6.1.2. Pitch-Synchronous Analysis
6.2. ACOUSTICAL MODEL OF SPEECH PRODUCTION
6.2.1. Glottal Excitation
6.2.2. Lossless Tube Concatenation
6.2.3. Source-Filter Models of Speech Production
6.3. LINEAR PREDICTIVE CODING
6.3.1. The Orthogonality Principle
6.3.2. Solution of the LPC Equations
6.3.3. Spectral Analysis via LPC
6.3.4. The Prediction Error
6.3.5. Equivalent Representations
6.4. CEPSTRAL PROCESSING
6.4.1. The Real and Complex Cepstrum
6.4.2. Cepstrum of Pole-Zero Filters
6.4.3. Cepstrum of Periodic Signals
6.4.4. Cepstrum of Speech Signals
6.4.5. Source-Filter Separation via the Cepstrum
6.5. PERCEPTUALLY-MOTIVATED REPRESENTATIONS
6.5.1. The Bilinear Transform
6.5.2. Mel-Frequency Cepstrum
6.5.3. Perceptual Linear Prediction (PLP)
6.6. FORMANT FREQUENCIES
6.6.1. Statistical Formant Tracking
6.7. THE ROLE OF PITCH
6.7.1. Autocorrelation Method
6.7.2. Normalized Cross-Correlation Method
6.7.3. Signal Conditioning
6.7.4. Pitch Tracking
6.8. HISTORICAL PERSPECTIVE AND FURTHER READING
REFERENCES
Speech Coding
7.1. SPEECH CODERS' ATTRIBUTES
7.2. SCALAR WAVEFORM CODERS
7.2.1. Linear Pulse Code Modulation (PCM)
7.2.2. µ-law and A-law PCM
7.2.3. Adaptive PCM
7.2.4. Differential Quantization
7.3. SCALAR FREQUENCY DOMAIN CODERS
7.3.1. Benefits of Masking
7.3.2. Transform Coders
7.3.3. Consumer Audio
7.3.4. Digital Audio Broadcasting (DAB)
7.4. CODE EXCITED LINEAR PREDICTION (CELP)
7.4.1. LPC Vocoder
7.4.2. Analysis by Synthesis
7.4.3. Pitch Prediction: Adaptive Codebook
7.4.4. Perceptual Weighting and Postfiltering
7.4.5. Parameter Quantization
7.4.6. CELP Standards
7.5. LOW-BIT RATE SPEECH CODERS
7.5.1. Mixed-Excitation LPC Vocoder
7.5.2. Harmonic Coding
7.5.3. Waveform Interpolation
7.6. HISTORICAL PERSPECTIVE AND FURTHER READING
REFERENCES
PART III: SPEECH RECOGNITION
Hidden Markov Models
8.1. THE MARKOV CHAIN
8.2. DEFINITION OF THE HIDDEN MARKOV MODEL
8.2.1. Dynamic Programming and DTW
8.2.2. How to Evaluate an HMM – The Forward Algorithm
8.2.3. How to Decode an HMM – The Viterbi Algorithm
8.2.4. How to Estimate HMM Parameters – Baum-Welch Algorithm
8.3. CONTINUOUS AND SEMICONTINUOUS HMMS
8.3.1. Continuous Mixture Density HMMs
8.3.2. Semicontinuous HMMs
8.4. PRACTICAL ISSUES IN USING HMMS
8.4.1. Initial Estimates
8.4.2. Model Topology
8.4.3. Training Criteria
8.4.4. Deleted Interpolation
8.4.5. Parameter Smoothing
8.4.6. Probability Representations
8.5. HMM LIMITATIONS
8.5.1. Duration Modeling
8.5.2. First-Order Assumption
8.5.3. Conditional Independence Assumption
8.6. HISTORICAL PERSPECTIVE AND FURTHER READING
REFERENCES
Acoustic Modeling
9.1. VARIABILITY IN THE SPEECH SIGNAL
9.1.1. Context Variability
9.1.2. Style Variability
9.1.3. Speaker Variability
9.1.4. Environment Variability
9.2. HOW TO MEASURE SPEECH RECOGNITION ERRORS
9.3. SIGNAL PROCESSING—EXTRACTING FEATURES
9.3.1. Signal Acquisition
9.3.2. End-Point Detection
9.3.3. MFCC and Its Dynamic Features
9.3.4. Feature Transformation
9.4. PHONETIC MODELING—SELECTING APPROPRIATE UNITS
9.4.1. Comparison of Different Units
9.4.2. Context Dependency
9.4.3. Clustered Acoustic-Phonetic Units
9.4.4. Lexical Baseforms
9.5. ACOUSTIC MODELING—SCORING ACOUSTIC FEATURES
9.5.1. Choice of HMM Output Distributions
9.5.2. Isolated vs. Continuous Speech Training
9.6. ADAPTIVE TECHNIQUES—MINIMIZING MISMATCHES
9.6.1. Maximum a Posteriori (MAP)
9.6.2. Maximum Likelihood Linear Regression (MLLR)
9.6.3. MLLR and MAP Comparison
9.6.4. Clustered Models
9.7. CONFIDENCE MEASURES: MEASURING RELIABILITY
9.7.1. Filler Models
9.7.2. Transformation Models
9.7.3. Combination Models
9.8. OTHER TECHNIQUES
9.8.1. Neural Networks
9.8.2. Segment Models
9.9. CASE STUDY: WHISPER
9.10. HISTORICAL PERSPECTIVE AND FURTHER READING
REFERENCES
Environmental Robustness
10.1. THE ACOUSTICAL ENVIRONMENT
10.1.1. Additive Noise
10.1.2. Reverberation
10.1.3. A Model of the Environment
10.2. ACOUSTICAL TRANSDUCERS
10.2.2. Directionality Patterns
10.2.3. Other Transduction Categories
10.3. ADAPTIVE ECHO CANCELLATION (AEC)
10.3.1. The LMS Algorithm
10.3.2. Convergence Properties of the LMS Algorithm
10.3.3. Normalized LMS Algorithm
10.3.4. Transform-Domain LMS Algorithm
10.3.5. The RLS Algorithm
10.4. MULTIMICROPHONE SPEECH ENHANCEMENT
10.4.1. Microphone Arrays
10.4.2. Blind Source Separation
10.5. ENVIRONMENT COMPENSATION PREPROCESSING
10.5.1. Spectral Subtraction
10.5.2. Frequency-Domain MMSE from Stereo Data
10.5.3. Wiener Filtering
10.5.4. Cepstral Mean Normalization (CMN)
10.5.5. Real-Time Cepstral Normalization
10.5.6. The Use of Gaussian Mixture Models
10.6. ENVIRONMENTAL MODEL ADAPTATION
10.6.1. Retraining on Corrupted Speech
10.6.2. Model Adaptation
10.6.3. Parallel Model Combination
10.6.4. Vector Taylor Series
10.6.5. Retraining on Compensated Features
10.7. MODELING NONSTATIONARY NOISE
10.8. HISTORICAL PERSPECTIVE AND FURTHER READING
REFERENCES
Language Modeling
11.1. FORMAL LANGUAGE THEORY
11.1.1. Chomsky Hierarchy
11.1.2. Chart Parsing for Context-Free Grammars
11.2. STOCHASTIC LANGUAGE MODELS
11.2.1. Probabilistic Context-Free Grammars
11.2.2. N-gram Language Models
11.3. COMPLEXITY MEASURE OF LANGUAGE MODELS
11.4. N-GRAM SMOOTHING
11.4.1. Deleted Interpolation Smoothing
11.4.2. Backoff Smoothing
11.4.3. Class n-grams
11.4.4. Performance of n-gram Smoothing
11.5. ADAPTIVE LANGUAGE MODELS
11.5.1. Cache Language Models
11.5.2. Topic- Adaptive Models
11.5.3. Maximum Entropy Models
11.6. PRACTICAL ISSUES
11.6.1. Vocabulary Selection
11.6.2. N-gram Pruning
11.6.3. CFG vs. n-gram Models
11.7. HISTORICAL PERSPECTIVE AND FURTHER READING
REFERENCES
Basic Search Algorithms
12.1. BASIC SEARCH ALGORITHMS
12.1.1. General Graph Searching Procedures
12.1.2. Blind Graph Search Algorithms
12.1.3. Heuristic Graph Search
12.2. SEARCH ALGORITHMS FOR SPEECH RECOGNITION
12.2.1. Decoder Basics
12.2.2. Combining Acoustic and Language Models
12.2.3. Isolated Word Recognition
12.2.4. Continuous Speech Recognition
12.3. LANGUAGE MODEL STATES
12.3.1. Search Space with FSM and CFG
12.3.2. Search Space with the Unigram
12.3.3. Search Space with Bigrams
12.3.4. Search Space with Trigrams
12.3.5. How to Handle Silences Between Words
12.4. TIME-SYNCHRONOUS VITERBI BEAM SEARCH
12.4.1. The Use of Beam
12.4.2. Viterbi Beam Search
12.5. STACK DECODING (A* SEARCH)
12.5.1. Admissible Heuristics for Remaining Path
12.5.2. When to Extend New Words
12.5.3. Fast Match
12.5.4. Stack Pruning
12.5.5. Multistack Search
12.6. HISTORICAL PERSPECTIVE AND FURTHER READING
REFERENCES
Large Vocabulary Search Algorithms
13.1. EFFICIENT MANIPULATION OF A TREE LEXICON
13.1.1. Lexical Tree
13.1.2. Multiple Copies of Pronunciation Trees
13.1.3. Factored Language Probabilities
13.1.4. Optimization of Lexical Trees
13.1.5. Exploiting Subtree Polymorphism
13.1.6. Context-Dependent Units and Inter-Word Triphones
13.2. OTHER EFFICIENT SEARCH TECHNIQUES
13.2.1. Using Entire HMM as a State in Search
13.2.2. Different Layers of Beams
13.2.3. Fast Match
13.3. N-BEST AND MULTIPASS SEARCH STRATEGIES
13.3.1. N-Best Lists and Word Lattices
13.3.2. The Exact N-Best Algorithm
13.3.3. Word-Dependent N-Best and Word-Lattice Algorithm
13.3.4. The Forward-Backward Search Algorithm
13.3.5. One-Pass vs. Multipass Search
13.4. SEARCH-ALGORITHM EVALUATION
13.5. CASE STUDY—MICROSOFT WHISPER
13.5.1. The CFG Search Architecture
13.5.2. The N-Gram Search Architecture
13.6. HISTORICAL PERSPECTIVE AND FURTHER READING
REFERENCES
PART IV: TEXT-TO-SPEECH SYSTEMS
Text and Phonetic Analysis
14.1. MODULES AND DATA FLOW
14.1.1. Modules
14.1.2. Data Flows
14.1.3. Localization Issues
14.2. LEXICON
14.3. DOCUMENT STRUCTURE DETECTION
14.3.1. Chapter and Section Headers
14.3.2. Lists
14.3.3. Paragraphs
14.3.4. Sentences
14.3.5. E-mail
14.3.6. Web Pages
14.3.7. Dialog Turns and Speech Acts
14.4. TEXT NORMALIZATION
14.4.1. Abbreviations and Acronyms
14.4.2. Number Formats
14.4.3. Domain-Specific Tags
14.4.4. Miscellaneous Formats
14.5. LINGUISTIC ANALYSIS
14.6. HOMOGRAPH DISAMBIGUATION
14.7. MORPHOLOGICAL ANALYSIS
14.8. LETTER-TO-SOUND CONVERSION
14.9. EVALUATION
14.10. CASE STUDY: FESTIVAL
14.10.1. Lexicon
14.10.2. Text Analysis
14.10.3. Phonetic Analysis
14.11. HISTORICAL PERSPECTIVE AND FURTHER READING
REFERENCES
Prosody
15.1. THE ROLE OF UNDERSTANDING
15.2. PROSODY GENERATION SCHEMATIC
15.3. SPEAKING STYLE
15.3.1. Character
15.3.2. Emotion
15.4. SYMBOLIC PROSODY
15.4.1. Pauses
15.4.2. Prosodic Phrases
15.4.3. Accent
15.4.4. Tone
15.4.5. Tune
15.4.6. Prosodic Transcription Systems
15.5. DURATION ASSIGNMENT
15.5.1. Rule-Based Methods
15.5.2. CART-Based Durations
15.6. PITCH GENERATION
15.6.1. Attributes of Pitch Contours
15.6.2. Baseline F0 Contour Generation
15.6.3. Parametric F0 Generation
15.6.4. Corpus-Based F0 Generation
15.7. PROSODY MARKUP LANGUAGES
15.8. PROSODY EVALUATION
15.9. HISTORICAL PERSPECTIVE AND FURTHER READING
REFERENCES
Speech Synthesis
16.1. ATTRIBUTES OF SPEECH SYNTHESIS
16.2. FORMANT SPEECH SYNTHESIS
16.2.1. Waveform Generation from Formant Values
16.2.2. Formant Generation by Rule
16.2.3. Data-Driven Formant Generation
16.2.4. Articulatory Synthesis
16.3. CONCATENATIVE SPEECH SYNTHESIS
16.3.1. Choice of Unit
16.3.2. Optimal Unit String: The Decoding Process
16.3.3. Unit Inventory Design
16.4. PROSODIC MODIFICATION OF SPEECH
16.4.1. Synchronous Overlap and Add (SOLA)
16.4.2. Pitch Synchronous Overlap and Add (PSOLA)
16.4.3. Spectral Behavior of PSOLA
16.4.4. Synthesis Epoch Calculation
16.4.5. Pitch-Scale Modification Epoch Calculation
16.4.6. Time-Scale Modification Epoch Calculation
16.4.7. Pitch-Scale and Time-Scale Epoch Calculation
16.4.8. Waveform Mapping
16.4.9. Epoch Detection
16.4.10. Problems with PSOLA
16.5. SOURCE-FILTER MODELS FOR PROSODY MODIFICATION
16.5.1. Prosody Modification of the LPC Residual
16.5.2. Mixed Excitation Models
16.5.3. Voice Effects
16.6. EVALUATION OF TTS SYSTEMS
16.6.1. Intelligibility Tests
16.6.2. Overall Quality Tests
16.6.3. Preference Tests
16.6.4. Functional Tests
16.6.5. Automated Tests
16.7. HISTORICAL PERSPECTIVE AND FURTHER READING
REFERENCES
PART V: SPOKEN LANGUAGE SYSTEMS
Spoken Language Understanding
17.1. WRITTEN VS. SPOKEN LANGUAGES
17.1.1. Style
17.1.2. Disfluency
17.1.3. Communicative Prosody
17.2. DIALOG STRUCTURE
17.2.1. Units of Dialog
17.2.2. Dialog (Speech) Acts
17.2.3. Dialog Control
17.3. SEMANTIC REPRESENTATION
17.3.1. Semantic Frames
17.3.2. Conceptual Graphs
17.4. SENTENCE INTERPRETATION
17.4.1. Robust Parsing
17.4.2. Statistical Pattern Matching
17.5. DISCOURSE ANALYSIS
17.5.1. Resolution of Relative Expressions
17.5.2. Automatic Inference and Inconsistency Detection
17.6. DIALOG MANAGEMENT
17.6.1. Dialog Grammars
17.6.2. Plan-Based Systems
17.6.3. Dialog Behavior
17.7. RESPONSE GENERATION AND RENDITION
17.7.1. Response Content Generation
17.7.2. Concept-to-Speech Rendition
17.7.3. Other Renditions
17.8. EVALUATION
17.8.1. Evaluation in the ATIS Task
17.8.2. PARADISE Framework
17.9. CASE STUDY—DR. WHO
17.9.1. Semantic Representation
17.9.2. Semantic Parser (Sentence Interpretation)
17.9.3. Discourse Analysis
17.9.4. Dialog Manager
17.10. HISTORICAL PERSPECTIVE AND FURTHER READING
REFERENCES
Applications and User Interfaces
18.1. APPLICATION ARCHITECTURE
18.2. TYPICAL APPLICATIONS
18.2.1. Computer Command and Control
18.2.2. Telephony Applications
18.2.3. Dictation
18.2.4. Accessibility
18.2.5. Handheld Devices
18.2.6. Automobile Applications
18.2.7. Speaker Recognition
18.3. SPEECH INTERFACE DESIGN
18.3.1. General Principles
18.3.2. Handling Errors
18.3.3. Other Considerations
18.3.4. Dialog Flow
18.4. INTERNATIONALIZATION
18.5. CASE STUDY—MIPAD
18.5.1. Specifying the Application
18.5.2. Rapid Prototyping
18.5.3. Evaluation
18.5.4. Iterations
18.6. HISTORICAL PERSPECTIVE AND FURTHER READING
REFERENCES
Index