Adaptive_Beamforming_and_Postfiltering_2008.pdf
Title Pages
Preface
List of Editors
List of Authors
Contents
List of Abbreviations
1 Introduction to Speech Processing
1.1 A Brief History of Speech Processing
1.2 Applications of Speech Processing
1.3 Organization of the Handbook
References
A Production, Perception, and Modeling of Speech
2 Physiological Processes of Speech Production
2.1 Overview of Speech Apparatus
2.2 Voice Production Mechanisms
2.2.1 Regulation of Respiration
2.2.2 Structure of the Larynx
2.2.3 Vocal Fold and its Oscillation
2.2.4 Regulation of Fundamental Frequency (F0)
2.2.5 Methods for Measuring Voice Production
2.3 Articulatory Mechanisms
2.3.1 Articulatory Organs
2.3.2 Vocal Tract and Nasal Cavity
2.3.3 Aspects of Articulation in Relation to Voicing
2.3.4 Articulators' Mobility and Coarticulation
2.3.5 Instruments for Observing Articulatory Dynamics
2.4 Summary
References
3 Nonlinear Cochlear Signal Processing and Masking in Speech Perception
3.1 Basics
3.1.1 Function of the Inner Ear
3.1.2 History of Cochlear Modeling
3.2 The Nonlinear Cochlea
3.2.1 Cochlear Modeling
3.2.2 Outer-Hair-Cell Transduction
3.2.3 Micromechanics
3.3 Neural Masking
3.3.1 Basic Definitions
3.3.2 Empirical Models
3.3.3 Models of the JND
3.3.4 A Direct Estimate of the Loudness JND
3.3.5 Determination of the Loudness SNR
3.3.6 Weber-Fraction Formula
3.4 Discussion and Summary
3.4.1 Model Validation
3.4.2 The Noise Model
References
4 Perception of Speech and Sound
4.1 Basic Psychoacoustic Quantities
4.1.1 Mapping of Intensity into Loudness
4.1.2 Pitch
4.1.3 Temporal Analysis and Modulation Perception
4.1.4 Binaural Hearing
4.1.5 Binaural Noise Suppression
4.2 Acoustical Information Required for Speech Perception
4.2.1 Speech Intelligibility and Speech Reception Threshold (SRT)
4.2.2 Measurement Methods
4.2.3 Factors Influencing Speech Intelligibility
4.2.4 Prediction Methods
4.3 Speech Feature Perception
4.3.1 Formant Features
4.3.2 Phonetic and Distinctive Feature Sets
4.3.3 Internal Representation Approach and Higher-Order Temporal-Spectral Features
4.3.4 Man-Machine Comparison
References
5 Speech Quality Assessment
5.1 Degradation Factors Affecting Speech Quality
5.2 Subjective Tests
5.2.1 Single Metric (Integral Speech Quality)
5.2.2 Multidimensional Metric (Diagnostic Speech-Quality)
5.2.3 Assessment of Specific Quality Dimensions
5.2.4 Test Implementation
5.2.5 Discussion of Subjective Tests
5.3 Objective Measures
5.3.1 Intrusive Listening Quality Measures
5.3.2 Non-Intrusive Listening Quality Measures
5.3.3 Objective Measures for Assessment of Conversational Quality
5.3.4 Discussion of Objective Measures
5.4 Conclusions
References
B Signal Processing for Speech
6 Wiener and Adaptive Filters
6.1 Overview
6.2 Signal Models
6.2.1 SISO Model
6.2.2 SIMO Model
6.2.3 MISO Model
6.2.4 MIMO Model
6.3 Derivation of the Wiener Filter
6.4 Impulse Response Tail Effect
6.5 Condition Number
6.5.1 Decomposition of the Correlation Matrix
6.5.2 Condition Number with the Frobenius Norm
6.5.3 Fast Computation of the Condition Number
6.6 Adaptive Algorithms
6.6.1 Deterministic Algorithm
6.6.2 Stochastic Algorithm
6.6.3 Variable-Step-Size NLMS Algorithm
6.6.4 Proportionate NLMS Algorithms
6.6.5 Sign Algorithms
6.7 MIMO Wiener Filter
6.7.1 Conditioning of the Covariance Matrix
6.8 Conclusions
References
7 Linear Prediction
7.1 Fundamentals
7.2 Forward Linear Prediction
7.3 Backward Linear Prediction
7.4 Levinson-Durbin Algorithm
7.5 Lattice Predictor
7.6 Spectral Representation
7.7 Linear Interpolation
7.8 Line Spectrum Pair Representation
7.9 Multichannel Linear Prediction
7.10 Conclusions
References
8 The Kalman Filter
8.1 Derivation of the Kalman Filter
8.1.1 The Minimum Mean Square Linear Optimal Estimator
8.1.2 The Estimation Error: Necessary and Sufficient Conditions for Optimality
8.1.3 The Kalman Filter
8.2 Examples: Estimation of Parametric Stochastic Process from Noisy Observations
8.2.1 Autoregressive (AR) Process
8.2.2 Moving-Average (MA) Process
8.2.3 Autoregressive Moving-Average (ARMA) Process
8.2.4 The Case of Temporally Correlated Noise
8.3 Extensions of the Kalman Filter
8.3.1 The Kalman Predictor
8.3.2 The Kalman Smoother
8.3.3 The Extended Kalman Filter
8.4 The Application of the Kalman Filter to Speech Processing
8.4.1 Literature Survey
8.4.2 Speech Enhancement
8.4.3 Speaker Tracking
8.5 Summary
References
9 Homomorphic Systems and Cepstrum Analysis of Speech
9.1 Definitions
9.1.1 Definition of the Cepstrum
9.1.2 Homomorphic Systems
9.1.3 Numerical Computation of Cepstra
9.2 Z-Transform Analysis
9.3 Discrete-Time Model for Speech Production
9.4 The Cepstrum of Speech
9.4.1 Short-Time Cepstrum of Speech
9.4.2 Homomorphic Filtering of Speech
9.5 Relation to LPC
9.5.1 LPC Versus Cepstrum Smoothing
9.5.2 Cepstrum from LPC Model
9.5.3 Minimum Phase and Recursive Computation
9.6 Application to Pitch Detection
9.7 Applications to Analysis/Synthesis Coding
9.7.1 Homomorphic Vocoder
9.7.2 Homomorphic Formant Vocoder
9.7.3 Analysis-by-Synthesis Vocoder
9.8 Applications to Speech Pattern Recognition
9.8.1 Compensation for Linear Filtering
9.8.2 Weighted Distance Measures
9.8.3 Group Delay Spectrum
9.8.4 Mel-Frequency Cepstrum Coefficients (MFCC)
9.9 Summary
References
10 Pitch and Voicing Determination of Speech with an Extension Toward Music Signals
10.1 Pitch in Time-Variant Quasiperiodic Acoustic Signals
10.1.1 Basic Definitions
10.1.2 Why is the Problem Difficult?
10.1.3 Categorizing the Methods
10.2 Short-Term Analysis PDAs
10.2.1 Correlation and Distance Function
10.2.2 Cepstrum and Other Double-Transform Methods
10.2.3 Frequency-Domain Methods: Harmonic Analysis
10.2.4 Active Modeling
10.2.5 Least Squares and Other Statistical Methods
10.2.6 Concluding Remarks
10.3 Selected Time-Domain Methods
10.3.1 Temporal Structure Investigation
10.3.2 Fundamental Harmonic Processing
10.3.3 Temporal Structure Simplification
10.3.4 Cascaded Solutions
10.4 A Short Look into Voicing Determination
10.4.1 Simultaneous Pitch and Voicing Determination
10.4.2 Pattern-Recognition VDAs
10.5 Evaluation and Postprocessing
10.5.1 Developing Reference PDAs with Instrumental Help
10.5.2 Error Analysis
10.5.3 Evaluation of PDAs and VDAs - Some Results
10.5.4 Postprocessing and Pitch Tracking
10.6 Applications in Speech and Music
10.7 Some New Challenges and Developments
10.7.1 Detecting the Instant of Glottal Closure
10.7.2 Multiple Pitch Determination
10.7.3 Instantaneousness Versus Reliability
10.8 Concluding Remarks
References
11 Formant Estimation and Tracking
11.1 Historical
11.2 Vocal Tract Resonances
11.3 Speech Production
11.4 Acoustics of the Vocal Tract
11.4.1 Two-Tube Models for Vowels
11.4.2 Three-Tube Models for Nasals and Fricatives
11.4.3 Obstruents
11.4.4 Coarticulation
11.5 Short-Time Speech Analysis
11.5.1 Vowels
11.5.2 Nasals
11.5.3 Fricatives and Stops
11.6 Formant Estimation
11.6.1 Continuity Constraints
11.6.2 Use of Phase Shift
11.6.3 Smoothing
11.7 Summary
References
12 The STFT, Sinusoidal Models, and Speech Modification
12.1 The Short-Time Fourier Transform
12.1.1 The STFT as a Sliding-Window Transform
12.1.2 The STFT as a Modulated Filter Bank
12.1.3 Original Formulation of the STFT
12.1.4 The Time Reference of the STFT
12.1.5 The STFT as a Heterodyne Filter Bank
12.1.6 Reconstruction Methods and Signal Models
12.1.7 Examples
12.1.8 Limitations of the STFT
12.2 Sinusoidal Models
12.2.1 Parametric Extension of the STFT
12.2.2 The Sinusoidal Signal Model
12.2.3 Sinusoidal Analysis and Synthesis
12.2.4 Signal Modeling by Matching Pursuit
12.2.5 Sinusoidal Matching Pursuits
12.2.6 Sinusoidal Analysis
12.2.7 Sinusoidal Synthesis
12.3 Speech Modification
12.3.1 Comparing the STFT and Sinusoidal Models
12.3.2 Linear Filtering
12.3.3 Enhancement
12.3.4 Time-Scale Modification
12.3.5 Pitch Modification
12.3.6 Cross-Synthesis and Other Modifications
12.3.7 Audio Coding with Decode-Side Modification
References
13 Adaptive Blind Multichannel Identification
13.1 Overview
13.2 Signal Model and Problem Formulation
13.3 Identifiability and Principle
13.4 Constrained Time-Domain Multichannel LMS and Newton Algorithms
13.4.1 Unit-Norm Constrained Multichannel LMS Algorithm
13.4.2 Unit-Norm Constrained Multichannel Newton Algorithm
13.5 Unconstrained Multichannel LMS Algorithm with Optimal Step-Size Control
13.6 Frequency-Domain Blind Multichannel Identification Algorithms
13.6.1 Frequency-Domain Multichannel LMS Algorithm
13.6.2 Frequency-Domain Normalized Multichannel LMS Algorithm
13.7 Adaptive Multichannel Exponentiated Gradient Algorithm
13.8 Summary
References
C Speech Coding
14 Principles of Speech Coding
14.1 The Objective of Speech Coding
14.2 Speech Coder Attributes
14.2.1 Rate
14.2.2 Quality
14.2.3 Robustness to Channel Imperfections
14.2.4 Delay
14.2.5 Computational and Memory Requirements
14.3 A Universal Coder for Speech
14.3.1 Speech Segment as Random Vector
14.3.2 Encoding Random Speech Vectors
14.3.3 A Model of Quantization
14.3.4 Coding Speech with a Model Family
14.4 Coding with Autoregressive Models
14.4.1 Spectral-Domain Index of Resolvability
14.4.2 A Criterion for Model Selection
14.4.3 Bit Allocation for the Model
14.4.4 Remarks on Practical Coding
14.5 Distortion Measures and Coding Architecture
14.5.1 Squared Error
14.5.2 Masking Models and Squared Error
14.5.3 Auditory Models and Squared Error
14.5.4 Distortion Measure and Coding Architecture
14.6 Summary
References
15 Voice over IP: Speech Transmission over Packet Networks
15.1 Voice Communication
15.1.1 Limitations of PSTN
15.1.2 The Promise of VoIP
15.2 Properties of the Network
15.2.1 Network Protocols
15.2.2 Network Characteristics
15.2.3 Typical Network Characteristics
15.2.4 Quality-of-Service Techniques
15.3 Outline of a VoIP System
15.3.1 Echo Cancelation
15.3.2 Speech Codec
15.3.3 Jitter Buffer
15.3.4 Packet Loss Recovery
15.3.5 Joint Design of Jitter Buffer and Packet Loss Concealment
15.3.6 Auxiliary Speech Processing Components
15.3.7 Measuring the Quality of a VoIP System
15.4 Robust Encoding
15.4.1 Forward Error Correction
15.4.2 Multiple Description Coding
15.5 Packet Loss Concealment
15.5.1 Nonparametric Concealment
15.5.2 Parametric Concealment
15.6 Conclusion
References
16 Low-Bit-Rate Speech Coding
16.1 Speech Coding
16.2 Fundamentals: Parametric Modeling of Speech Signals
16.2.1 Speech Production
16.2.2 Human Speech Perception
16.2.3 Vocoders
16.3 Flexible Parametric Models
16.3.1 Mixed Excitation Linear Prediction (MELP)
16.3.2 Sinusoidal Coding
16.3.3 Waveform Interpolation
16.3.4 Comparison and Contrast of Modeling Approaches
16.4 Efficient Quantization of Model Parameters
16.4.1 Vector Quantization
16.4.2 Exploiting Temporal Properties
16.4.3 LPC Filter Quantization
16.5 Low-Rate Speech Coding Standards
16.5.1 MIL-STD 3005
16.5.2 The NATO STANAG 4591
16.5.3 Satellite Communications
16.5.4 ITU 4 kb/s Standardization
16.6 Summary
References
17 Analysis-by-Synthesis Speech Coding
17.1 Overview
17.2 Basic Concepts of Analysis-by-Synthesis Coding
17.2.1 Definition of Analysis-by-Synthesis
17.2.2 From Conventional Predictive Waveform Coding to a Speech Synthesis Model
17.2.3 Basic Principle of Analysis by Synthesis
17.2.4 Generic Analysis-by-Synthesis Encoder Structure
17.2.5 Reasons for the Coding Efficiency of Analysis by Synthesis
17.3 Overview of Prominent Analysis-by-Synthesis Speech Coders
17.4 Multipulse Linear Predictive Coding (MPLPC)
17.5 Regular-Pulse Excitation with Long-Term Prediction (RPE-LTP)
17.6 The Original Code Excited Linear Prediction (CELP) Coder
17.7 US Federal Standard FS1016 CELP
17.8 Vector Sum Excited Linear Prediction (VSELP)
17.9 Low-Delay CELP (LD-CELP)
17.10 Pitch Synchronous Innovation CELP (PSI-CELP)
17.11 Algebraic CELP (ACELP)
17.11.1 ACELP Background
17.11.2 ACELP Efficient Search Methods
17.11.3 ACELP in Standards
17.12 Conjugate Structure CELP (CS-CELP) and CS-ACELP
17.13 Relaxed CELP (RCELP) - Generalized Analysis by Synthesis
17.13.1 Generalized Analysis by Synthesis Applied to the Pitch Parameters
17.13.2 RCELP in Standards
17.14 eX-CELP
17.14.1 eX-CELP in Standards
17.15 iLBC
17.16 TSNFC
17.16.1 Excitation VQ in TSNFC
17.16.2 TSNFC in Standards
17.17 Embedded CELP
17.18 Summary of Analysis-by-Synthesis Speech Coders
17.19 Conclusion
References
18 Perceptual Audio Coding of Speech Signals
18.1 History of Audio Coding
18.2 Fundamentals of Perceptual Audio Coding
18.2.1 General Background
18.2.2 Coder Structure
18.2.3 Perceptual Audio Coding Versus Speech Coding
18.3 Some Successful Standardized Audio Coders
18.3.1 MPEG-1
18.3.2 MPEG-2
18.3.3 MPEG-2 Advanced Audio Coding
18.3.4 MPEG-4 Advanced Audio Coding
18.3.5 Progress in Coding Performance
18.4 Perceptual Audio Coding for Real-Time Communication
18.4.1 Delay Sources in Perceptual Audio Coding
18.4.2 MPEG-4 Low-Delay AAC
18.4.3 ITU-T G.722.1-C
18.4.4 Ultra-Low-Delay Perceptual Audio Coding
18.5 Hybrid/Crossover Coders
18.5.1 MPEG-4 Scalable Speech/Audio Coding
18.5.2 ITU-T G.729.1
18.5.3 AMR-WB+
18.5.4 ARDOR
18.6 Summary
References
D Text-to-Speech Synthesis
19 Basic Principles of Speech Synthesis
19.1 The Basic Components of a TTS System
19.1.1 TTS Frontend
19.1.2 TTS Backend
19.2 Speech Representations and Signal Processing for Concatenative Synthesis
19.2.1 Time-Domain Pitch Synchronous Overlap Add (TD-PSOLA)
19.2.2 LPC-Based Synthesis
19.2.3 Sinusoidal Synthesis
19.3 Speech Signal Transformation Principles
19.3.1 Prosody Transformation Principles
19.3.2 Principle Methods for Changing Speaker Characteristics and Speaking Style
19.4 Speech Synthesis Evaluation
19.5 Conclusions
References
20 Rule-Based Speech Synthesis
20.1 Background
20.2 Terminal Analog
20.2.1 Formant Synthesizers
20.2.2 Higher-Level Parameters
20.2.3 Voice Source Models
20.3 Controlling the Synthesizer
20.3.1 Rule Compilers for Speech Synthesis
20.3.2 Data-Driven Parametric Synthesis
20.4 Special Applications of Rule-Based Parametric Synthesis
20.5 Concluding Remarks
References
21 Corpus-Based Speech Synthesis
21.1 Basics
21.2 Concatenative Synthesis with a Fixed Inventory
21.2.1 Diphone-Based Synthesis
21.2.2 Modifying Prosody
21.2.3 Smoothing Joints
21.2.4 Up from Diphones
21.3 Unit-Selection-Based Synthesis
21.3.1 Selecting Units
21.3.2 Target Cost
21.3.3 Concatenation Cost
21.3.4 Speech Corpus
21.3.5 Computational Cost
21.4 Statistical Parametric Synthesis
21.4.1 HMM-Based Synthesis Framework
21.4.2 The State of the Art and Perspectives
21.5 Conclusion
References
22 Linguistic Processing for Speech Synthesis
22.1 Why Linguistic Processing is Hard
22.2 Fundamentals: Writing Systems and the Graphical Representation of Language
22.3 Problems to be Solved and Methods to Solve Them
22.3.1 Text Preprocessing
22.3.2 Morphological Analysis and Word Pronunciation
22.3.3 Syntactic Analysis, Accenting, and Phrasing
22.3.4 Sense Disambiguation: Dealing with Ambiguity in Written Language
22.4 Architectures for Multilingual Linguistic Processing
22.5 Document-Level Processing
22.6 Future Prospects
References
23 Prosodic Processing
23.1 Overview
23.1.1 What Is Prosody?
23.1.2 Prosody in Human-Human Communication
23.2 Historical Overview
23.2.1 Rule-Based Approaches in Formant Synthesis
23.2.2 Statistical Approaches in Diphone Synthesis
23.2.3 Using as-is Prosody in Unit Selection Synthesis
23.3 Fundamental Challenges
23.3.1 Challenge to Unit Selection: Combinatorics of Language
23.3.2 Challenge to Target Prosody-Based Approaches: Multitude of Interrelated Acoustic Prosodic Features
23.4 A Survey of Current Approaches
23.4.1 Timing
23.4.2 Intonation
23.5 Future Approaches
23.5.1 Hybrid Approaches
23.6 Conclusions
References
24 Voice Transformation
24.1 Background
24.2 Source-Filter Theory and Harmonic Models
24.2.1 Harmonic Model
24.2.2 Analysis Based on the Harmonic Model
24.2.3 Synthesis Based on the Harmonic Model
24.3 Definitions
24.3.1 Source Modifications
24.3.2 Filter Modifications
24.3.3 Combining Source and Filter Modifications
24.4 Source Modifications
24.4.1 Time-Scale Modification
24.4.2 Pitch Modification
24.4.3 Joint Pitch and Time-Scale Modification
24.4.4 Energy Modification
24.4.5 Generating the Source Modified Speech Signal
24.5 Filter Modifications
24.5.1 The Gaussian Mixture Model
24.6 Conversion Functions
24.7 Voice Conversion
24.8 Quality Issues in Voice Transformations
24.9 Summary
References
25 Expressive/Affective Speech Synthesis
25.1 Overview
25.2 Characteristics of Affective Speech
25.2.1 Intentions and Emotions
25.2.2 Message and Filters
25.2.3 Coding and Expression
25.3 The Communicative Functionality of Speech
25.3.1 Multiple Layers of Prosodic Information
25.3.2 Text Data versus Speech Synthesis
25.4 Approaches to Synthesizing Expressive Speech
25.4.1 Emotion in Expressive Speech Synthesis
25.5 Modeling Human Speech
25.5.1 Discourse-Act Labeling
25.5.2 Expressive Speech and Emotion
25.5.3 Concatenative Synthesis Using Expressive Speech Samples
25.6 Conclusion
References
E Speech Recognition
26 Historical Perspective of the Field of ASR/NLU
26.1 ASR Methodologies
26.1.1 Issues in Speech Recognition
26.2 Important Milestones in Speech Recognition History
26.3 Generation 1 - The Early History of Speech Recognition
26.4 Generation 2 - The First Working Systems for Speech Recognition
26.5 Generation 3 - The Pattern Recognition Approach to Speech Recognition
26.5.1 The ARPA SUR Project
26.5.2 Research Outside of the ARPA Community
26.6 Generation 4 - The Era of the Statistical Model
26.6.1 DARPA Programs in Generation 4
26.7 Generation 5 - The Future
26.8 Summary
References
27 HMMs and Related Speech Recognition Technologies
27.1 Basic Framework
27.2 Architecture of an HMM-Based Recognizer
27.2.1 Feature Extraction
27.2.2 HMM Acoustic Models
27.2.3 N-Gram Language Models
27.2.4 Decoding and Lattice Generation
27.3 HMM-Based Acoustic Modeling
27.3.1 Discriminative Training
27.3.2 Covariance Modeling
27.4 Normalization
27.4.1 Mean and Variance Normalization
27.4.2 Gaussianization
27.4.3 Vocal-Tract-Length Normalization
27.5 Adaptation
27.5.1 Maximum A Posteriori (MAP) Adaptation
27.5.2 ML-Based Linear Transforms
27.5.3 Adaptive Training
27.6 Multipass Recognition Architectures
27.7 Conclusions
References
28 Speech Recognition with Weighted Finite-State Transducers
28.1 Definitions
28.2 Overview
28.2.1 Weighted Acceptors
28.2.2 Weighted Transducers
28.2.3 Composition
28.2.4 Determinization
28.2.5 Minimization
28.2.6 Speech Recognition Transducers
28.3 Algorithms
28.3.1 Preliminaries
28.3.2 Composition
28.3.3 Determinization
28.3.4 Weight Pushing
28.3.5 Minimization
28.4 Applications to Speech Recognition
28.4.1 Speech Recognition Transducers
28.4.2 Transducer Standardization
28.5 Conclusion
References
29 A Machine Learning Framework for Spoken-Dialog Classification
29.1 Motivation
29.2 Introduction to Kernel Methods
29.3 Rational Kernels
29.4 Algorithms
29.5 Experiments
29.6 Theoretical Results for Rational Kernels
29.7 Conclusion
References
30 Towards Superhuman Speech Recognition
30.1 Current Status
30.2 A Multidomain Conversational Test Set
30.3 Listening Experiments
30.3.1 Baseline Listening Tests
30.3.2 Listening Tests to Determine Knowledge Source Contributions
30.4 Recognition Experiments
30.4.1 Preliminary Recognition Results
30.4.2 Results on the Multidomain Test Set
30.4.3 System Redesign
30.4.4 Coda
30.5 Speculation
30.5.1 Proposed Human Listening Experiments
30.5.2 Promising Incremental Approaches
30.5.3 Promising Disruptive Approaches
References
31 Natural Language Understanding
31.1 Overview of NLU Applications
31.1.1 Context Dependence
31.1.2 Semantic Representation
31.2 Natural Language Parsing
31.2.1 Decision Tree Parsers
31.3 Practical Implementation
31.3.1 Classing
31.3.2 Labeling
31.4 Speech Mining
31.4.1 Word Tagging
31.5 Conclusion
References
32 Transcription and Distillation of Spontaneous Speech
32.1 Background
32.2 Overview of Research Activities on Spontaneous Speech
32.2.1 Classification of Spontaneous Speech
32.2.2 Major Projects and Corpora of Spontaneous Speech
32.2.3 Issues in Design of Spontaneous Speech Corpora
32.2.4 Corpus of Spontaneous Japanese (CSJ)
32.3 Analysis for Spontaneous Speech Recognition
32.3.1 Observation in Spectral Analysis
32.3.2 Analysis of Speaking Rate
32.3.3 Analysis of Factors Affecting ASR Accuracy
32.4 Approaches to Spontaneous Speech Recognition
32.4.1 Effect of Corpus Size
32.4.2 Acoustic Modeling
32.4.3 Models Considering Speaking Rate
32.4.4 Pronunciation Variation Modeling
32.4.5 Language Model
32.4.6 Adaptation of Acoustic Model
32.4.7 Adaptation of Language Model
32.5 Metadata and Structure Extraction of Spontaneous Speech
32.5.1 Sentence Boundary Detection
32.5.2 Disfluency Detection
32.5.3 Detection of Topic and Discourse Boundaries
32.6 Speech Summarization
32.6.1 Categories of Speech Summarization
32.6.2 Key Sentence Extraction
32.6.3 Summary Generation
32.7 Conclusions
References
33 Environmental Robustness
33.1 Noise Robust Speech Recognition
33.1.1 Standard Noise-Robust ASR Tasks
33.1.2 The Acoustic Mismatch Problem
33.1.3 Reducing Acoustic Mismatch
33.2 Model Retraining and Adaptation
33.2.1 Retraining on Corrupted Speech
33.2.2 Single-Utterance Retraining
33.2.3 Model Adaptation
33.3 Feature Transformation and Normalization
33.3.1 Feature Moment Normalization
33.3.2 Voice Activity Detection
33.3.3 Cepstral Time Smoothing
33.3.4 SPLICE - Normalization Learned from Stereo Data
33.4 A Model of the Environment
33.5 Structured Model Adaptation
33.5.1 Analysis of Noisy Speech Features
33.5.2 Log-Normal Parallel Model Combination
33.5.3 Vector Taylor-Series Model Adaptation
33.5.4 Comparison of VTS and Log-Normal PMC
33.5.5 Strategies for Highly Nonstationary Noises
33.6 Structured Feature Enhancement
33.6.1 Spectral Subtraction
33.6.2 Vector Taylor-Series Speech Enhancement
33.7 Unifying Model and Feature Techniques
33.7.1 Noise Adaptive Training
33.7.2 Uncertainty Decoding and Missing Feature Techniques
33.8 Conclusion
References
34 The Business of Speech Technologies
34.1 Introduction
34.1.1 Economic Value of Network-Based Speech Services
34.1.2 Economic Value of Device-Based Speech Applications
34.1.3 Technology Overview
34.2 Network-Based Speech Services
34.2.1 The Industry
34.2.2 The Service Paradigm and Historical View of Service Deployments
34.2.3 Paradigm Shift from Directed-Dialog- to Open-Dialog-Based Services
34.2.4 Technical Challenges that Lay Ahead for Network-Based Services
34.3 Device-Based Speech Applications
34.3.1 The Industry
34.3.2 The Device-Based Speech Application Marketplace
34.3.3 Technical Challenges that Enabled Mass Deployment
34.3.4 History of Device-Based ASR
34.3.5 Modern Use of ASR
34.3.6 Government Applications of Speech Recognition
34.4 Vision/Predictions of Future Services - Fueling the Trends
34.4.1 Multimodal-Based Speech Services
34.4.2 Increased Automation of Service Development Process
34.4.3 Complex Problem Solving
34.4.4 Speech Mining
34.4.5 Mobile Devices
34.5 Conclusion
References
35 Spoken Dialogue Systems
35.1 Technology Components and System Development
35.1.1 System Architecture
35.1.2 Spoken Input Processing
35.1.3 Spoken Output Processing
35.1.4 Dialogue Management
35.2 Development Issues
35.2.1 Data Collection
35.2.2 Evaluation
35.3 Historical Perspectives
35.3.1 Large-Scale Government Programs
35.3.2 Some Example Systems
35.4 New Directions
35.4.1 User Simulation
35.4.2 Machine Learning and Dialogue Management
35.4.3 Portability
35.4.4 Multimodal, Multidomain, and Multilingual Application Development
35.5 Concluding Remarks
References
F Speaker Recognition
36 Overview of Speaker Recognition
36.1 Speaker Recognition
36.1.1 Personal Identity Characteristics
36.1.2 Speaker Recognition Definitions
36.1.3 Bases for Speaker Recognition
36.1.4 Extracting Speaker Characteristics from the Speech Signal
36.1.5 Applications
36.2 Measuring Speaker Features
36.2.1 Acoustic Measurements
36.2.2 Linguistic Measurements
36.3 Constructing Speaker Models
36.3.1 Nonparametric Approaches
36.3.2 Parametric Approaches
36.4 Adaptation
36.5 Decision and Performance
36.5.1 Decision Rules
36.5.2 Threshold Setting and Score Normalization
36.5.3 Errors and DET Curves
36.6 Selected Applications for Automatic Speaker Recognition
36.6.1 Indexing Multispeaker Data
36.6.2 Forensics
36.6.3 Customization: SCANmail
36.7 Summary
References
37 Text-Dependent Speaker Recognition
37.1 Brief Overview
37.1.1 Features
37.1.2 Acoustic Modeling
37.1.3 Likelihood Ratio Score
37.1.4 Speaker Model Training
37.1.5 Score Normalization and Fusion
37.1.6 Speaker Model Adaptation
37.2 Text-Dependent Challenges
37.2.1 Technological Challenges
37.2.2 Commercial Deployment Challenges
37.3 Selected Results
37.3.1 Feature Extraction
37.3.2 Accuracy Dependence on Lexicon
37.3.3 Background Model Design
37.3.4 T-Norm in the Context of Text-Dependent Speaker Recognition
37.3.5 Adaptation of Speaker Models
37.3.6 Protection Against Recordings
37.3.7 Automatic Impostor Trials Generation
37.4 Concluding Remarks
References
38 Text-Independent Speaker Recognition
38.1 Introduction
38.2 Likelihood Ratio Detector
38.3 Features
38.3.1 Spectral Features
38.3.2 High-Level Features
38.4 Classifiers
38.4.1 Adapted Gaussian Mixture Models
38.4.2 Support Vector Machines
38.4.3 High-Level Feature Classifiers
38.4.4 System Fusion
38.5 Performance Assessment
38.5.1 Task and Corpus
38.5.2 Systems
38.5.3 Results
38.5.4 Computational Considerations
38.6 Summary
References
G Language Recognition
39 Principles of Spoken Language Recognition
39.1 Spoken Language
39.2 Language Recognition Principles
39.3 Phone Recognition Followed by Language Modeling (PRLM)
39.4 Vector-Space Characterization (VSC)
39.5 Spoken Language Verification
39.6 Discriminative Classifier Design
39.7 Summary
References
40 Spoken Language Characterization
40.1 Language versus Dialect
40.2 Spoken Language Collections
40.3 Spoken Language Characteristics
40.4 Human Language Identification
40.5 Text as a Source of Information on Spoken Languages
40.6 Summary
References
41 Automatic Language Recognition Via Spectral and Token Based Approaches
41.1 Automatic Language Recognition
41.2 Spectral Based Methods
41.2.1 Shifted Delta Cepstral Features
41.2.2 Classifiers
41.3 Token-Based Methods
41.3.1 Tokens
41.3.2 Classifiers
41.4 System Fusion
41.4.1 Methods
41.4.2 Output Scores
41.5 Performance Assessment
41.5.1 Task and Corpus
41.5.2 Systems
41.5.3 Results
41.5.4 Computational Considerations
41.6 Summary
References
42 Vector-Based Spoken Language Classification
42.1 Vector Space Characterization
42.2 Unit Selection and Modeling
42.2.1 Augmented Phoneme Inventory (API)
42.2.2 Acoustic Segment Model (ASM)
42.2.3 Comparison of Unit Selection
42.3 Front-End: Voice Tokenization and Spoken Document Vectorization
42.4 Back-End: Vector-Based Classifier Design
42.4.1 Ensemble Classifier Design
42.4.2 Ensemble Decision Strategy
42.4.3 Generalized VSC-Based Classification
42.5 Language Classification Experiments and Discussion
42.5.1 Experimental Setup
42.5.2 Language Identification
42.5.3 Language Verification
42.5.4 Overall Performance Comparison
42.6 Summary
References
H Speech Enhancement
43 Fundamentals of Noise Reduction
43.1 Noise
43.2 Signal Model and Problem Formulation
43.3 Evaluation of Noise Reduction
43.3.1 Signal-to-Noise Ratio
43.3.2 Noise-Reduction Factor and Gain Function
43.3.3 Speech-Distortion Index and Attenuation Frequency Distortion
43.4 Noise Reduction via Filtering Techniques
43.4.1 Time-Domain Wiener Filter
43.4.2 A Suboptimal Filter
43.4.3 Subspace Method
43.4.4 Frequency-Domain Wiener Filter
43.4.5 Short-Time Parametric Wiener Filter
43.5 Noise Reduction via Spectral Restoration
43.5.1 MMSE Spectral Estimator
43.5.2 MMSE Spectral Amplitude and Phase Estimators
43.5.3 Maximum A Posteriori (MAP) Spectral Estimator
43.5.4 Maximum-Likelihood Spectral Amplitude Estimator
43.5.5 Maximum-Likelihood Spectral Power Estimator
43.5.6 MAP Spectral Amplitude Estimator
43.6 Speech-Model-Based Noise Reduction
43.6.1 Harmonic-Model-Based Noise Reduction
43.6.2 Linear-Prediction-Based Noise Reduction
43.6.3 Hidden-Markov-Model-Based Noise Reduction
43.7 Summary
References
44 Spectral Enhancement Methods
44.1 Spectral Enhancement
44.2 Problem Formulation
44.3 Statistical Models
44.4 Signal Estimation
44.4.1 MMSE Spectral Estimation
44.4.2 MMSE Log-Spectral Amplitude Estimation
44.5 Signal Presence Probability Estimation
44.6 A Priori SNR Estimation
44.6.1 Decision-Directed Estimation
44.6.2 Causal Recursive Estimation
44.6.3 Relation Between Causal Recursive Estimation and Decision-Directed Estimation
44.6.4 Noncausal Recursive Estimation
44.7 Noise Spectrum Estimation
44.7.1 Time-Varying Recursive Averaging
44.7.2 Minima-Controlled Estimation
44.8 Summary of a Spectral Enhancement Algorithm
44.9 Selection of Spectral Enhancement Algorithms
44.9.1 Choice of a Statistical Model and Fidelity Criterion
44.9.2 Choice of an A Priori SNR Estimator
44.9.3 Choice of a Noise Estimator
44.10 Conclusions
References
45 Adaptive Echo Cancelation for Voice Signals
45.1 Network Echoes
45.1.1 Network Echo Canceler
45.1.2 Adaptation Algorithms
45.2 Single-Channel Acoustic Echo Cancelation
45.2.1 The Subband Canceler
45.2.2 RLS for Subband Echo Cancelers
45.2.3 The Delayless Subband Structure
45.2.4 Frequency-Domain Adaptation
45.2.5 The Two-Echo-Path Model
45.2.6 Variable-Step Algorithm for Acoustic Echo Cancelers
45.2.7 Cancelers for Nonlinear Echo Paths
45.3 Multichannel Acoustic Echo Cancelation
45.3.1 Nonuniqueness of the Misalignment Vector
45.3.2 Solutions for the Nonuniqueness Problem
45.4 Summary
References
46 Dereverberation
46.1 Background and Overview
46.1.1 Why Speech Dereverberation?
46.1.2 Room Acoustics and Reverberation Evaluation
46.1.3 Classification of Speech Dereverberation Methods
46.2 Signal Model and Problem Formulation
46.3 Source Model-Based Speech Dereverberation
46.3.1 Speech Models
46.3.2 LP Residual Enhancement Methods
46.3.3 Harmonic Filtering
46.3.4 Speech Dereverberation Using Probabilistic Models
46.4 Separation of Speech and Reverberation via Homomorphic Transformation
46.4.1 Cepstral Liftering
46.4.2 Cepstral Mean Subtraction and High-Pass Filtering of Cepstral Frame Coefficients
46.5 Channel Inversion and Equalization
46.5.1 Single-Channel Systems
46.5.2 Multichannel Systems
46.6 Summary
References
47 Adaptive Beamforming and Postfiltering
47.1 Problem Formulation
47.2 Adaptive Beamforming
47.2.1 Frequency-Domain Frost Algorithm
47.2.2 Frequency-Domain Generalized Sidelobe Canceller
47.2.3 Time-Domain Generalized Sidelobe Canceller
47.3 Fixed Beamformer and Blocking Matrix
47.3.1 Using Acoustical Transfer Functions
47.3.2 Using Delay-Only Filters
47.3.3 Using Relative Transfer Functions
47.4 Identification of the Acoustical Transfer Function
47.4.1 Signal Subspace Method
47.4.2 Time Difference of Arrival
47.4.3 Relative Transfer Function Estimation
47.5 Robustness and Distortion Weighting
47.6 Multichannel Postfiltering
47.6.1 MMSE Postfiltering
47.6.2 Log-Spectral Amplitude Postfiltering
47.7 Performance Analysis
47.7.1 The Power Spectral Density of the Beamformer Output
47.7.2 Signal Distortion
47.7.3 Stationary Noise Reduction
47.8 Experimental Results
47.9 Summary
47.A Appendix: Derivation of the Expected Noise Reduction for a Coherent Noise Field
47.B Appendix: Equivalence Between Maximum SNR and LCMV Beamformers
References
48 Feedback Control in Hearing Aids
48.1 Problem Statement
48.1.1 Acoustic Feedback in Hearing Aids
48.1.2 Feedforward Suppression Versus Feedback Cancellation
48.1.3 Performance of a Feedback Canceller
48.2 Standard Adaptive Feedback Canceller
48.2.1 Adaptation of the CAF
48.2.2 Bias of the CAF
48.2.3 Reducing the Bias of the CAF
48.3 Feedback Cancellation Based on Prior Knowledge of the Acoustic Feedback Path
48.3.1 Constrained Adaptation (C-CAF)
48.3.2 Bandlimited Adaptation (BL-CAF)
48.4 Feedback Cancellation Based on Closed-Loop System Identification
48.4.1 Closed-Loop System Setup
48.4.2 Direct Method
48.4.3 Desired Signal Model
48.4.4 Indirect and Joint Input-Output Method
48.5 Comparison
48.5.1 Steady-State Performance
48.5.2 Tracking Performance
48.5.3 Measurement of the Actual Maximum Stable Gain
48.6 Conclusions
References
49 Active Noise Control
49.1 Broadband Feedforward Active Noise Control
49.1.1 Filtered-X LMS Algorithm
49.1.2 Analysis of the FXLMS Algorithm
49.1.3 Leaky FXLMS Algorithm
49.1.4 Feedback Effects and Solutions
49.2 Narrowband Feedforward Active Noise Control
49.2.1 Introduction
49.2.2 Waveform Synthesis Method
49.2.3 Adaptive Notch Filters
49.2.4 Multiple-Frequency ANC
49.2.5 Active Noise Equalization
49.3 Feedback Active Noise Control
49.4 Multichannel ANC
49.4.1 Principles
49.4.2 Multichannel FXLMS Algorithms
49.4.3 Frequency-Domain Convergence Analysis
49.4.4 Multichannel IIR Algorithm
49.4.5 Multichannel Adaptive Feedback ANC Systems
49.5 Summary
References
I Multichannel Speech Processing
50 Microphone Arrays
50.1 Microphone Array Beamforming
50.1.1 Delay-and-Sum Beamforming
50.1.2 Filter-and-Sum Beamforming
50.1.3 Arrays with Directional Elements
50.2 Constant-Beamwidth Microphone Array System
50.3 Constrained Optimization of the Directional Gain
50.4 Differential Microphone Arrays
50.5 Eigenbeamforming Arrays
50.5.1 Spherical Array
50.5.2 Eigenbeamformer
50.5.3 Modal Beamformer
50.6 Adaptive Array Systems
50.6.1 Constrained Broadband Arrays
50.7 Conclusions
References
51 Time Delay Estimation and Source Localization
51.1 Technology Taxonomy
51.2 Time Delay Estimation
51.2.1 Problem Formulation and Signal Models
51.2.2 The Family of the Generalized Cross-Correlation Methods
51.2.3 Adaptive Eigenvalue Decomposition Algorithm
51.2.4 Adaptive Blind Multichannel Identification Based Methods
51.2.5 Multichannel Spatial Prediction and Interpolation Methods
51.2.6 Multichannel Cross-Correlation Coefficient Algorithm
51.2.7 Minimum-Entropy Method
51.3 Source Localization
51.3.1 Problem Formulation
51.3.2 Measurement Model and Cramér-Rao Lower Bound
51.3.3 Maximum-Likelihood Estimator
51.3.4 Least-Squares Estimators
51.3.5 Least-Squares Error Criteria
51.3.6 Spherical Intersection (SX) Estimator
51.3.7 Spherical Interpolation (SI) Estimator
51.3.8 Linear-Correction Least-Squares Estimator
51.4 Summary
References
52 Convolutive Blind Source Separation Methods
52.1 The Mixing Model
52.1.1 Special Cases
52.1.2 Convolutive Model in the Frequency Domain
52.1.3 Block-Based Model
52.2 The Separation Model
52.2.1 Feedforward Structure
52.2.2 Relation Between Source and Separated Signals
52.2.3 Feedback Structure
52.2.4 Example: The TITO System
52.3 Identification
52.4 Separation Principle
52.4.1 Higher-Order Statistics
52.4.2 Second-Order Statistics
52.4.3 Sparseness in the Time/Frequency Domain
52.4.4 Priors from Auditory Scene Analysis and Psychoacoustics
52.5 Time Versus Frequency Domain
52.5.1 Frequency Permutations
52.5.2 Time-Frequency Algorithms
52.5.3 Circularity Problem
52.5.4 Subband Filtering
52.6 The Permutation Ambiguity
52.6.1 Consistency of the Filter Coefficients
52.6.2 Consistency of the Spectrum of the Recovered Signals
52.6.3 Global Permutations
52.7 Results
52.8 Conclusion
References
53 Sound Field Reproduction
53.1 Sound Field Synthesis
53.2 Mathematical Representation of Sound Fields
53.2.1 Coordinate Systems
53.2.2 Wave Equation
53.2.3 Plane Waves
53.2.4 General Wave Fields in Two Dimensions
53.2.5 Spherical Waves
53.2.6 The Kirchhoff-Helmholtz Integral
53.3 Stereophony
53.3.1 Sine Law
53.3.2 Tangent Law
53.3.3 Application of Amplitude Panning
53.3.4 The Sweet Spot
53.4 Vector-Based Amplitude Panning
53.4.1 Two-Dimensional Vector-Based Amplitude Panning
53.4.2 Three-Dimensional Vector-Based Amplitude Panning
53.4.3 Perception of Vector-Based Amplitude Panning
53.5 Ambisonics
53.5.1 Two-Dimensional Ambisonics
53.5.2 Three-Dimensional Ambisonics
53.5.3 Extensions of Ambisonics
53.6 Wave Field Synthesis
53.6.1 Description of Acoustical Scenes by the Kirchhoff-Helmholtz Integral
53.6.2 Monopole and Dipole Sources
53.6.3 Reduction to Two Spatial Dimensions
53.6.4 Spatial Sampling
53.6.5 Determination of the Loudspeaker Driving Signals
References
Acknowledgements
About the Authors
Detailed Contents
Subject Index
47 Adaptive Beamforming and Postfiltering
S. Gannot, I. Cohen

In this chapter, we explore many of the basic concepts of array processing, with an emphasis on adaptive beamforming for speech enhancement applications. We begin in Sect. 47.1 by formulating the problem of a microphone array in a noisy and reverberant environment. In Sect. 47.2, we derive the frequency-domain linearly constrained minimum-variance (LCMV) beamformer and its generalized sidelobe canceller (GSC) variant. The GSC components are explored in Sect. 47.3, and several commonly used special cases of these blocks are presented. As the GSC structure necessitates an estimate of the speech-related acoustical transfer functions (ATFs), several alternative system identification methods are addressed in Sect. 47.4. Beamformers often suffer from sensitivity to signal mismatch. We analyze this phenomenon in Sect. 47.5 and explore several cures for this problem. Although the GSC beamformer yields a significant improvement in speech quality, when the noise field is spatially incoherent or diffuse the noise reduction is insufficient and additional postfiltering is normally required. In Sect. 47.6, we present multimicrophone postfilters, based on either minimum mean-squared error (MMSE) or log-spectral amplitude estimation criteria. An interesting relation between the GSC and the Wiener filter is derived in this section as well. In Sect. 47.7, we analyze the performance of the transfer-function GSC (TF-GSC), and in Sect. 47.8 we demonstrate the advantage of multichannel postfiltering over single-channel postfiltering in nonstationary noise conditions.
Over the last four decades, array processing has become a well-established discipline, see e.g., [47.1–14]. In the mid 1980s, array processing and beamforming methods were adopted by the speech community to deal with data received by microphone arrays. Since then, beamforming techniques for microphone arrays have been used in many applications, such as speaker separation, speaker localization, speech dereverberation, acoustic echo cancellation, and speech enhancement.

Adaptive beamforming for speech signals requires particular consideration of problems that are specific to speech signals and to the acoustic environment. The speech signal is wideband, highly nonstationary, and has a very wide dynamic range. An acoustic enclosure is usually modeled as a filter with a very long impulse response, due to multiple reflections from the room walls. In a typical office, the length of the filters may reach several thousand taps. Furthermore, the impulse response is often time varying, due to movements of the speaker and of objects in the room.

The term beamforming refers to the design of a spatiotemporal filter. Broadband arrays comprise a set of filters, applied to each received microphone signal, followed by a summation operation. The main objective of the beamformer is to extract a desired signal, impinging on the array from a specific position, out of noisy measurements thereof. Usually, the interference signals occupy the same frequency band as the desired signal, rendering temporal-only filtering useless. The simplest structure is the delay-and-sum beamformer, which first compensates for the relative delay between distinct microphone signals and then sums the steered signals to form a single output (a minimal code sketch of this structure appears at the end of this overview). This beamformer, which is still widely used, can be very effective in mitigating noncoherent, i.e., spatially white, noise sources, provided that the number of microphones is relatively high. However, if the noise source is coherent, the noise reduction (NR) is strongly dependent on the direction of arrival of the noise signal. Consequently, the performance of the delay-and-sum beamformer in reverberant environments is often insufficient. Jan and Flanagan [47.15, 16] and Rabinkin et al. [47.17] extended the delay-and-sum concept by introducing the filter-and-sum beamformer. This structure, designed for multipath environments, namely reverberant enclosures, replaces the simple delay compensator with a matched filter.

The array beam pattern can generally be designed to have a specified response, by properly setting the values of the multichannel filters' weights. However, the application of data-independent design methods is very limited in dynamic acoustical environments. Statistically optimal beamformers are designed based on the statistical properties of the desired and interference signals. In general, they aim at enhancing the desired signal, while rejecting the interference signal. Several criteria can be applied in the design of the beamformer, e.g., maximum signal-to-noise ratio (MSNR), minimum mean-squared error (MMSE), and linearly constrained minimum variance (LCMV). A summary of several design criteria can be found in [47.5, 7]. Beamforming methods use the signals' statistics (at least second-order statistics), which are usually not available and must be estimated from the data.
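To make the delay-and-sum structure mentioned above concrete, here is a minimal time-domain sketch. It is our illustration, not part of the chapter; the function name, the array shapes, and the restriction to integer-sample delays are simplifying assumptions.

```python
import numpy as np

def delay_and_sum(mics, delays):
    """Delay-and-sum beamformer with integer-sample steering delays.

    mics   : (M, T) array, one row per microphone signal.
    delays : length-M list of ints; channel m is advanced by delays[m]
             samples so that the desired-source components align before
             averaging (zero padding at the edges).
    """
    M, T = mics.shape
    out = np.zeros(T)
    for m, d in enumerate(delays):
        shifted = np.zeros(T)
        if d >= 0:
            shifted[:T - d] = mics[m, d:]   # advance by d samples
        else:
            shifted[-d:] = mics[m, :T + d]  # delay by |d| samples
        out += shifted
    return out / M
```

For spatially white noise, averaging M aligned channels ideally raises the SNR by a factor of M, which is why the structure needs a relatively large number of microphones, as noted above; the filter-and-sum generalization replaces each pure delay with a filter.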
Moreover, the acoustical environment is time varying, due to talker and object movements and abrupt changes in the noise characteristics (e.g., passing cars). Hence, adaptation mechanisms are required. An adaptive counterpart of each of the prespecified design criteria can be derived. Early contributions to the field of adaptive beamformer design can be attributed to Sondhi and Elko [47.18], to Kaneda and Ohga [47.19], and to Van Compernolle [47.20]. Kellermann [47.21] addressed the problem of joint echo cancellation and NR by incorporating echo cancellers into the beamformer design. Nordholm et al. [47.22, 23] used microphone arrays in a car environment, and designed a beamformer employing calibration signals to enhance the obtained performance. Martin [47.24] analyzed beamforming techniques for small microphone arrays. Many other applications of microphone arrays, such as hearing aids, blind source separation (BSS), and dereverberation, are addressed elsewhere in this handbook.

The minimization of the mean-squared error (MSE) in the context of array processing leads to the well-known multichannel Wiener filter [47.25]. Doclo and Moonen [47.26–28] proposed an efficient implementation of the Wiener filter based on the generalized singular-value decomposition (GSVD) of the microphone data matrix. This method yields an optimal estimate (in the MMSE sense) of the desired signal component of one of the microphone signals. The authors further proposed efficient schemes for recursive update of the GSVD. An optional, adaptive noise cancellation postfiltering stage is proposed as well. In that scheme, in addition to the optimal estimate of the desired speech signal, an optimal noise channel is also estimated. This estimated noise component can be used as a reference noise signal (similar to the one used in [47.25]) to further enhance the speech signal. Spriet et al. [47.29] proposed a subband implementation of the GSVD-based scheme, and Rombouts and
Moonen [47.30, 31] proposed to apply the efficient QR decomposition to the problem at hand.

In many adaptive array schemes the acoustical transfer function (ATF) relating the speech source and the microphone should be known in advance, or at least estimated from the received data (note that in the case of delay-only propagation, the acoustical transfer function reduces to a steering vector, consisting of phase-only components). In contrast, the multichannel Wiener filter is based solely on estimates of the second-order statistics of the recorded noisy signal and of the noise signal (estimated during noise-only segments), and does not make any a priori assumptions about the signal model. Unfortunately, as pointed out by Chen et al. [47.32], the Wiener filter, which is optimal in the MMSE sense, cannot guarantee an undistorted speech signal at its output. This drawback can, however, be mitigated by modifying the MMSE criterion to control the amount of imposed speech distortion. A method that employs this modification is presented in [47.33, 34]. It is also shown there that the ATF information (only a simple delay-only case is presented in those contributions) can be incorporated into the Wiener filter scheme (called the spatially preprocessed Wiener filter), resulting in improved performance. The Wiener filter and its application to speech enhancement are addressed in separate chapters of this handbook (Chaps. 6 and 43).

In this chapter, we concentrate on a different adaptive structure, based on the LCMV criterion. The LCMV beamformer, proposed by Frost [47.35], aims at minimizing the output power under linear constraints on the response of the array towards the desired speech signal. Frost proposed an adaptive scheme, which is based on a constrained least-mean-square (LMS)-type adaptation (for the LMS algorithm please refer to [47.25]). To avoid this constrained adaptation, Griffiths and Jim [47.36] proposed the GSC structure, which separates the output power minimization and the application of the constraint. The GSC structure is based on the assumption that the different sensors receive a delayed version of the desired signal, and therefore we refer to it as the delay generalized sidelobe canceller (D-GSC). The GSC structure was rederived in the frequency domain, and extended to deal with the more-complicated general ATF case, by Affes and Grenier [47.37] and later by Gannot et al. [47.38]. This frequency-domain version, which takes into account the reverberant nature of the enclosure, was nicknamed the transfer-function generalized sidelobe canceller (TF-GSC). The GSC comprises three blocks: a fixed beamformer (FBF), which aligns the desired signal components; a blocking matrix (BM), which blocks the desired speech components, resulting in reference noise signals; and a multichannel adaptive noise canceller (ANC), which eliminates noise components that leak through the sidelobes of the FBF.

Nordholm and Leung [47.39] analyze the limits of the obtainable NR of the GSC in an isotropic noise field. Bitzer et al. address the problem in [47.40, 41] and [47.42]. In [47.40], the authors derive an expression for the NR as a function of the noise field and evaluate the degradation as a function of the reverberation time (T60). The special two-microphone case is treated in [47.41].
The additional NR due to the ANC branch of the GSC, implemented by a closed-form Wiener filter rather than the adaptive Widrow least-mean-square (LMS) procedure, is presented in [47.42]. The frequency-band nested subarrays structure is presented, and its NR theoretically analyzed, by Marro et al. [47.43]. A more-complex dual GSC structure employing calibration signals was suggested and analyzed by Nordholm et al. [47.44]. Huarng and Yeh [47.45] addressed the distortion issue by evaluating the desired signal leakage into the reference noise branch of the GSC structure. However, the delay-only ATF assumption is imposed, and the expected degradation due to pointing errors alone is evaluated. The performance degradation due to constraining the Wiener filters to a finite impulse response (FIR) structure is demonstrated by Nordholm et al. in [47.46]. The resulting performance limits of the GSC structure strongly depend on the cross-correlation between the sensors' signals induced by the noise field, as shown in the above references and by Cox [47.47].

47.1 Problem Formulation

Consider an array of M sensors in a noisy and reverberant environment. The received signals generally include three components. The first is a desired speech signal, the second is some stationary interference signal, and the third is some nonstationary (transient) noise component. Our goal is to reconstruct the speech component from the received signals. Let $s(t)$ denote the desired source signal, let $a_m(t)$ represent the room impulse response (RIR) from the desired source to the $m$-th sensor, and let $n_m(t)$ denote the noise component at the $m$-th sensor.
The observed signal at the $m$-th sensor ($m = 1, \ldots, M$) is given by

$$ z_m(t) = a_m(t) * s(t) + n_m(t) = a_m(t) * s(t) + n^{\mathrm{s}}_m(t) + n^{\mathrm{t}}_m(t) \,, \tag{47.1} $$

where $n^{\mathrm{s}}_m(t)$ and $n^{\mathrm{t}}_m(t)$ represent the stationary and nonstationary noise components at the $m$-th sensor, respectively, and $*$ denotes convolution. We assume that both noise components may comprise a coherent (directional) noise component and a diffuse noise component. The observed signals are divided in time into overlapping frames by the application of a window function and analyzed using the short-time Fourier transform (STFT). Assuming time-invariant transfer functions, we have in the time-frequency domain

$$ Z_m(k,\ell) \approx A_m(k)S(k,\ell) + N_m(k,\ell) \approx A_m(k)S(k,\ell) + N^{\mathrm{s}}_m(k,\ell) + N^{\mathrm{t}}_m(k,\ell) \,, \tag{47.2} $$

where $\ell$ is the frame index and $k = 1, 2, \ldots, K$ represents the frequency bin index. (The equality in (47.2) is only justified for segments that are longer than the RIR length. Since RIRs tend to be very long, the conditions allowing this representation to hold cannot be exactly met. We assume, however, that the STFT relation is a reasonable approximation.) $Z_m(k,\ell)$, $S(k,\ell)$, $N_m(k,\ell)$, $N^{\mathrm{s}}_m(k,\ell)$, and $N^{\mathrm{t}}_m(k,\ell)$ are the STFTs of the respective signals. $A_m(k)$ is the ATF relating the speech source with the $m$-th sensor. The vector formulation of the equation set (47.2) is

$$ Z(k,\ell) = A(k)S(k,\ell) + N(k,\ell) = A(k)S(k,\ell) + N^{\mathrm{s}}(k,\ell) + N^{\mathrm{t}}(k,\ell) \,, \tag{47.3} $$

where

$$ \begin{aligned} Z(k,\ell) &= \left[ Z_1(k,\ell) \; Z_2(k,\ell) \; \cdots \; Z_M(k,\ell) \right]^{\mathrm{T}} , \\ A(k) &= \left[ A_1(k) \; A_2(k) \; \cdots \; A_M(k) \right]^{\mathrm{T}} , \\ N(k,\ell) &= \left[ N_1(k,\ell) \; N_2(k,\ell) \; \cdots \; N_M(k,\ell) \right]^{\mathrm{T}} , \\ N^{\mathrm{s}}(k,\ell) &= \left[ N^{\mathrm{s}}_1(k,\ell) \; N^{\mathrm{s}}_2(k,\ell) \; \cdots \; N^{\mathrm{s}}_M(k,\ell) \right]^{\mathrm{T}} , \\ N^{\mathrm{t}}(k,\ell) &= \left[ N^{\mathrm{t}}_1(k,\ell) \; N^{\mathrm{t}}_2(k,\ell) \; \cdots \; N^{\mathrm{t}}_M(k,\ell) \right]^{\mathrm{T}} . \end{aligned} $$

47.2 Adaptive Beamforming

Frost [47.35] proposed a beamformer that relies on the assumption that the ATFs between the desired source and the array of sensors can be uniquely determined by gain and delay values. In this section, we follow Frost's approach in the STFT domain and derive a beamforming algorithm for the arbitrary-ATF case. We first obtain a closed form of the LCMV beamformer, and subsequently derive an adaptive solution. The outcome is a constrained LMS-type algorithm. We proceed, following the seminal work of Griffiths and Jim [47.36], with the formulation of an unconstrained adaptive solution, namely the transfer-function generalized sidelobe canceller (TF-GSC). We initially assume that the ATFs are known. Later, in Sect. 47.4, we present several alternatives for estimating the ATFs.

47.2.1 Frequency-Domain Frost Algorithm

Optimal Solution
Let $W_m(k,\ell)$, $m = 1, \ldots, M$, denote a set of $M$ filters, and define

$$ W^{\mathrm{H}}(k,\ell) = \left[ W^*_1(k,\ell) \; W^*_2(k,\ell) \; \cdots \; W^*_M(k,\ell) \right] , $$

where the superscript $\mathrm{H}$ denotes conjugation transpose. A filter-and-sum beamformer, depicted in Fig. 47.1, is realized by filtering each sensor signal by $W^*_m(k,\ell)$ and summing the outputs (a code sketch of this operation follows the figure):

$$ Y(k,\ell) = W^{\mathrm{H}}(k,\ell)Z(k,\ell) = W^{\mathrm{H}}(k,\ell)A(k)S(k,\ell) + W^{\mathrm{H}}(k,\ell)N^{\mathrm{s}}(k,\ell) + W^{\mathrm{H}}(k,\ell)N^{\mathrm{t}}(k,\ell) = Y^{\mathrm{s}}(k,\ell) + Y^{\mathrm{n,s}}(k,\ell) + Y^{\mathrm{n,t}}(k,\ell) \,. \tag{47.4} $$

Fig. 47.1 Filter-and-sum beamformer: each input $Z_m(k,\ell)$ is filtered by $W^*_m(k,\ell)$ and the results are summed to form $Y(k,\ell)$
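As a companion to (47.4), the following sketch (ours; the array shapes and function name are assumptions) applies a fixed set of STFT-domain weights to a multichannel spectrogram.

```python
import numpy as np

def filter_and_sum(Z, W):
    """STFT-domain filter-and-sum beamformer, Y(k,l) = W^H(k) Z(k,l).

    Z : (M, K, L) complex STFTs of the M microphone signals
        (K frequency bins, L frames).
    W : (M, K) complex weights, here held fixed over the frames.
    Returns the (K, L) complex STFT of the beamformer output.
    """
    # conjugate W because (47.4) applies the Hermitian transpose
    return np.einsum('mk,mkl->kl', np.conj(W), Z)
```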
Here $Y^{\mathrm{s}}(k,\ell)$ is the signal component, and $Y^{\mathrm{n,s}}(k,\ell)$ and $Y^{\mathrm{n,t}}(k,\ell)$ are the stationary and nonstationary noise components, respectively. The output power of the beamformer is given by

$$ E\{Y(k,\ell)Y^*(k,\ell)\} = E\{W^{\mathrm{H}}(k,\ell)Z(k,\ell)Z^{\mathrm{H}}(k,\ell)W(k,\ell)\} = W^{\mathrm{H}}(k,\ell)\Phi_{ZZ}(k,\ell)W(k,\ell) \,, $$

where $\Phi_{ZZ}(k,\ell) = E\{Z(k,\ell)Z^{\mathrm{H}}(k,\ell)\}$ is the power spectral density (PSD) matrix of the received signals. We want to minimize the output power subject to the following constraint on $Y^{\mathrm{s}}(k,\ell)$:

$$ Y^{\mathrm{s}}(k,\ell) = W^{\mathrm{H}}(k,\ell)A(k)S(k,\ell) = F(k,\ell)S(k,\ell) \,, $$

where $F(k,\ell)$ is some prespecified filter, usually a simple delay. Without loss of generality we assume hereinafter that $F(k,\ell) = 1$. Hence, the minimization problem can be stated as

$$ \min_{W} W^{\mathrm{H}}(k,\ell)\Phi_{ZZ}(k,\ell)W(k,\ell) \quad \text{subject to} \quad W^{\mathrm{H}}(k,\ell)A(k) = 1 \,. \tag{47.5} $$

The minimization problem (47.5) is illustrated in Fig. 47.2. The point where the equipower contours are tangent to the constraint plane is the optimum vector of beamforming filters. The perpendicular $F(k)$ from the origin to the constraint plane will be calculated in the next section.

To solve (47.5) we first define the complex Lagrangian

$$ \mathcal{L}(W) = W^{\mathrm{H}}(k,\ell)\Phi_{ZZ}(k,\ell)W(k,\ell) + \lambda \left[ W^{\mathrm{H}}(k,\ell)A(k) - 1 \right] + \lambda^* \left[ A^{\mathrm{H}}(k)W(k,\ell) - 1 \right] , \tag{47.6} $$

where $\lambda$ is a Lagrange multiplier. Setting the derivative with respect to $W^*$ to zero [47.48] yields

$$ \nabla_{W^*} \mathcal{L}(W) = \Phi_{ZZ}(k,\ell)W(k,\ell) + \lambda A(k) = 0 \,. $$

Now, recalling the constraint in (47.5), we obtain the LCMV optimal filter

$$ W_{\mathrm{LCMV}}(k,\ell) = \frac{\Phi_{ZZ}^{-1}(k,\ell)A(k)}{A^{\mathrm{H}}(k)\Phi_{ZZ}^{-1}(k,\ell)A(k)} \,. \tag{47.7} $$

This closed-form solution is difficult to implement and is not suitable for time-varying environments. Therefore, we often have to resort to an adaptive solution, which is derived in the sequel. A numerical sketch of (47.7) is given at the end of this section.

Fig. 47.2 Constrained minimization: equipower contours of $W^{\mathrm{H}}(k,\ell)\Phi_{ZZ}(k,\ell)W(k,\ell)$, the constraint plane $A^{\mathrm{H}}(k)W(k,\ell) = 1$, the optimal vector $W_{\mathrm{LCMV}}(k,\ell)$, and the perpendicular $F(k) = A(k)/\|A(k)\|^2$

It is interesting to show the equivalence between the LCMV solution (47.7) and the MSNR beamformer [47.7], which is obtained from

$$ \max_{W} \frac{\left| W^{\mathrm{H}}(k,\ell)A(k) \right|^2}{W^{\mathrm{H}}(k,\ell)\Phi_{NN}(k,\ell)W(k,\ell)} \,. \tag{47.8} $$

The well-known solution to (47.8) is the (colored-noise) matched filter $W(k,\ell) \propto \Phi_{NN}^{-1}(k,\ell)A(k)$. If the array response is constrained to fulfil $W^{\mathrm{H}}(k,\ell)A(k) = 1$, i.e., no distortion in the desired direction, we have

$$ W_{\mathrm{MSNR}}(k,\ell) = \frac{\Phi_{NN}^{-1}(k,\ell)A(k)}{A^{\mathrm{H}}(k)\Phi_{NN}^{-1}(k,\ell)A(k)} \,. \tag{47.9} $$

Using (47.3) it can be verified that

$$ \Phi_{ZZ}(k,\ell) = \phi_{ss}(k,\ell)A(k)A^{\mathrm{H}}(k) + \Phi_{N^{\mathrm{s}}N^{\mathrm{s}}}(k,\ell) + \Phi_{N^{\mathrm{t}}N^{\mathrm{t}}}(k,\ell) = \phi_{ss}(k,\ell)A(k)A^{\mathrm{H}}(k) + \Phi_{NN}(k,\ell) \,, \tag{47.10} $$

where $\Phi_{NN}(k,\ell) = \Phi_{N^{\mathrm{s}}N^{\mathrm{s}}}(k,\ell) + \Phi_{N^{\mathrm{t}}N^{\mathrm{t}}}(k,\ell)$ is the overall noise PSD matrix. Using the matrix inversion lemma, it is shown in Appendix 47.B that

$$ W_{\mathrm{LCMV}}(k,\ell) = \frac{\Phi_{NN}^{-1}(k,\ell)A(k)}{A^{\mathrm{H}}(k)\Phi_{NN}^{-1}(k,\ell)A(k)} \,. \tag{47.11} $$

This solution is identical to the solution of the MSNR beamformer. While both methods are shown to be equal provided that the ATFs $A(k)$ are known, their behavior in the case of unknown ATFs is different. Analysis of these differences is given by Cox [47.47].
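The closed-form solution (47.7) is straightforward to evaluate per frequency bin once the PSD matrix has been estimated (e.g., by recursive averaging of $Z(k,\ell)Z^{\mathrm{H}}(k,\ell)$). A minimal numpy sketch follows; the diagonal loading and all names are our assumptions, added for numerical robustness rather than taken from the chapter.

```python
import numpy as np

def lcmv_weights(Phi, A, diag_load=1e-6):
    """LCMV weights for one frequency bin, per (47.7):
    W = Phi^{-1} A / (A^H Phi^{-1} A).

    Phi : (M, M) Hermitian PSD matrix of the received signals;
          passing the noise PSD matrix instead yields (47.11).
    A   : (M,) ATF vector of the desired source.
    """
    M = Phi.shape[0]
    Phi_inv_A = np.linalg.solve(Phi + diag_load * np.eye(M), A)
    return Phi_inv_A / (np.conj(A) @ Phi_inv_A)
```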
Note also that, due to the nonstationary noise component, the term $\Phi_{NN}(k,\ell)$ depends on the frame index. This time dependence is one of the major factors leading to performance degradation in beamforming. We address this problem by introducing the multichannel postfilter in Sect. 47.6.

Adaptive Solution
Consider the following steepest-descent adaptive algorithm:

$$ W(k,\ell+1) = W(k,\ell) - \mu \nabla_{W^*} \mathcal{L}(k,\ell) = W(k,\ell) - \mu \left[ \Phi_{ZZ}(k,\ell)W(k,\ell) + \lambda A(k) \right] . $$

Imposing the look-direction constraint on $W(k,\ell+1)$ yields

$$ 1 = A^{\mathrm{H}}(k)W(k,\ell+1) = A^{\mathrm{H}}(k)W(k,\ell) - \mu A^{\mathrm{H}}(k)\Phi_{ZZ}(k,\ell)W(k,\ell) - \mu A^{\mathrm{H}}(k)A(k)\lambda \,. $$

Solving for the Lagrange multiplier and applying further rearrangement of terms yields

$$ W(k,\ell+1) = P(k)W(k,\ell) - \mu P(k)\Phi_{ZZ}(k,\ell)W(k,\ell) + F(k) \,, \tag{47.12} $$

where

$$ P(k) = I - \frac{A(k)A^{\mathrm{H}}(k)}{\|A(k)\|^2} \tag{47.13} $$

and

$$ F(k) = \frac{A(k)}{\|A(k)\|^2} \,. \tag{47.14} $$

Further simplification can be obtained by replacing $\Phi_{ZZ}(k,\ell)$ by its instantaneous estimate, $Z(k,\ell)Z^{\mathrm{H}}(k,\ell)$, and recalling (47.4). We finally obtain

$$ W(k,\ell+1) = P(k)\left[ W(k,\ell) - \mu Z(k,\ell)Y^*(k,\ell) \right] + F(k) \,. $$

The entire algorithm is summarized in Table 47.1; a code sketch of one update follows the table.

Table 47.1 Frequency-domain Frost algorithm

$W(k, \ell = 0) = F(k)$
For $\ell = 0, 1, \ldots$: $\;W(k,\ell+1) = P(k)\left[ W(k,\ell) - \mu Z(k,\ell)Y^*(k,\ell) \right] + F(k)$
[$P(k)$ and $F(k)$ are defined by (47.13) and (47.14)]
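The following per-bin sketch of one iteration of Table 47.1 is ours; it assumes $A(k)$ is known, and in practice the step size $\mu$ must be kept small enough for stability.

```python
import numpy as np

def frost_step(W, Z, A, mu):
    """One frequency-domain Frost iteration (Table 47.1):
    W <- P [W - mu Z conj(Y)] + F, with P, F from (47.13)-(47.14).

    W : (M,) current weights;  Z : (M,) current STFT snapshot
    A : (M,) ATF vector;       mu: step size
    """
    norm2 = np.real(np.conj(A) @ A)                       # ||A||^2
    F = A / norm2                                         # (47.14)
    P = np.eye(len(A)) - np.outer(A, np.conj(A)) / norm2  # (47.13)
    Y = np.conj(W) @ Z                                    # output W^H Z
    return P @ (W - mu * Z * np.conj(Y)) + F

# Initialization per Table 47.1: W = F(k); then iterate over frames.
```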
The outputs of the BM will be denoted, for reasons that will become clear in the sequel, the noise reference signals $U(k,\ell)$, defined as

  U(k,\ell) = H^H(k)\,Z(k,\ell) ,   (47.18)

where $U(k,\ell) = [U_2(k,\ell)\; U_3(k,\ell)\; \ldots\; U_M(k,\ell)]^T$. The vector $G(k,\ell)$ is a $\mathrm{rank}\{H(k)\} \times 1$ vector of adjustable filters. We assume hereinafter that $\mathrm{rank}\{H(k)\} = M - 1$. Hence, the set of filters is defined as

  G(k,\ell) = [G_2(k,\ell)\; G_3(k,\ell)\; \ldots\; G_M(k,\ell)]^T .   (47.19)

By the geometrical interpretation of Frost's algorithm,

  W_0(k,\ell) = F(k) = \frac{A(k)}{\|A(k)\|^2} .   (47.20)

(Recall that $F(k)$ is the perpendicular from the origin to the constraint hyperplane $\Lambda(k)$.) Now, using (47.4), (47.15), and (47.16) we obtain

  Y(k,\ell) = Y_{\mathrm{FBF}}(k,\ell) - Y_{\mathrm{ANC}}(k,\ell) ,   (47.21)

where

  Y_{\mathrm{FBF}}(k,\ell) = W_0^H(k,\ell)\,Z(k,\ell) , \quad Y_{\mathrm{ANC}}(k,\ell) = G^H(k,\ell)\,H^H(k)\,Z(k,\ell) .   (47.22)

The output of the constrained beamformer is thus the difference of two terms, both operating on the input signal $Z(k,\ell)$. The first term, $Y_{\mathrm{FBF}}(k,\ell)$, uses only fixed components (which depend on the ATFs), so it can be viewed as an FBF. The FBF coherently sums the desired speech components, while in general it sums the noise components destructively. Hence, the signal-to-noise ratio (SNR) at the FBF output is expected to be higher than the input SNR. However, this result cannot be guaranteed. We elaborate on this issue in the performance analysis in Sect. 47.7.

We now examine the second term, $Y_{\mathrm{ANC}}(k,\ell)$. Note that

  U(k,\ell) = H^H(k)\,Z(k,\ell) = H^H(k)\,[A(k)S(k,\ell) + N_s(k,\ell) + N_t(k,\ell)] = H^H(k)\,[N_s(k,\ell) + N_t(k,\ell)] .   (47.23)

The last transition is due to (47.17). It is worth mentioning that, when a perfect BM is applied, $U(k,\ell)$ indeed contains only noise components. In general, however, $H^H(k)A(k) \neq 0$, hence desired speech components may leak into the noise reference signals. If the speech component is indeed completely eliminated (blocked) by $H(k)$, $Y_{\mathrm{ANC}}(k,\ell)$ becomes a pure noise term. The residual noise term in $Y_{\mathrm{FBF}}(k,\ell)$ can then be reduced by properly adjusting the filters $G(k,\ell)$ using the minimum-output-power criterion. This minimization problem is in fact the classical multichannel noise cancellation problem, for which an adaptive LMS solution was proposed by Widrow [47.25].

To summarize, the beamformer comprises three parts: an FBF $W_0$, which aligns the desired signal components; a BM $H(k)$, which blocks the desired speech components, yielding the reference noise signals $U(k,\ell)$; and a multichannel ANC $G(k,\ell)$, which eliminates the stationary noise that leaks through the sidelobes of the FBF.

Noise Canceller Adaptation
The reference noise signals are filtered by the ANC and subtracted from the output of the FBF, yielding

  Y(k,\ell) = [W_0^H(k,\ell) - G^H(k,\ell)\,H^H(k)]\,Z(k,\ell) .   (47.24)

Let three hypotheses $H_{0s}$, $H_{0t}$, and $H_1$ indicate, respectively, the absence of transients, the presence of an interfering transient, and the presence of a desired-source transient at the beamformer output. The optimal solution for the filters $G(k,\ell)$ is obtained by minimizing the power of the beamformer output during the stationary-noise frames (i.e., when $H_{0s}$ is true) [47.2]. We note, however, that no adaptation should be carried out during abrupt changes in the characteristics of the noise signal (e.g., a passing car). When the noise source position is constant and the noise statistics are slowly varying, the ANC filters can track the changes.
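Under the same illustrative assumptions (hypothetical helper names, synthetic single-bin data), the three-part structure of (47.18) through (47.24) can be sketched as follows; with a perfect BM the references $U(k,\ell)$ carry no speech, as (47.23) predicts:

```python
import numpy as np

def gsc_output(Z, A, H, G):
    """GSC output (47.24) for one time-frequency bin:
    Y = (W0^H - G^H H^H) Z, with the FBF W0 = A / ||A||^2 from (47.20)."""
    W0 = A / np.linalg.norm(A) ** 2
    Y_fbf = np.vdot(W0, Z)            # Y_FBF = W0^H Z, (47.21)-(47.22)
    U = H.conj().T @ Z                # noise references U = H^H Z, (47.18)
    Y_anc = np.vdot(G, U)             # Y_ANC = G^H U, (47.22)
    return Y_fbf - Y_anc, U

rng = np.random.default_rng(2)
M = 4
A = rng.standard_normal(M) + 1j * rng.standard_normal(M)
H = np.linalg.svd(A.reshape(M, 1), full_matrices=True)[0][:, 1:]  # BM as in the previous sketch
G = np.zeros(M - 1, dtype=complex)   # ANC filters before adaptation

S = 1.0 + 0.5j                        # desired-source STFT coefficient S(k, l)
N = 0.1 * (rng.standard_normal(M) + 1j * rng.standard_normal(M))
Z = A * S + N                         # received signals: speech through the ATFs plus noise

Y, U = gsc_output(Z, A, H, G)
print(np.allclose(H.conj().T @ (A * S), 0.0))  # perfect BM: no speech in U, as in (47.23)
print(Y, "vs.", S)                    # with G = 0, Y is the FBF output: S plus residual noise
```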
Let $\Phi_{N_sN_s}(k,\ell) = E\{N_s(k,\ell)N_s^H(k,\ell)\}$ denote the PSD matrix of the input stationary noise. Then the power of the stationary noise at the beamformer output is minimized by solving the unconstrained optimization problem

  \min_G \{[W_0(k,\ell) - H(k)G(k,\ell)]^H\, \Phi_{N_sN_s}(k,\ell)\, [W_0(k,\ell) - H(k)G(k,\ell)]\} .   (47.25)

A multichannel Wiener solution is given by (see also [47.42, 46])

  G(k,\ell) = [H^H(k)\,\Phi_{N_sN_s}(k,\ell)\,H(k)]^{-1}\, H^H(k)\,\Phi_{N_sN_s}(k,\ell)\,W_0(k,\ell) .   (47.26)
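A direct NumPy transcription of (47.26), again with synthetic stand-ins for $\Phi_{N_sN_s}$, $W_0$, and $H$; the name anc_wiener is a hypothetical helper. The closing check confirms that the Wiener filters cannot increase the stationary-noise power (47.25) relative to $G = 0$:

```python
import numpy as np

def anc_wiener(Phi_Ns, H, W0):
    """Multichannel Wiener ANC filters (47.26):
    G = (H^H Phi_Ns H)^{-1} H^H Phi_Ns W0."""
    HPhi = H.conj().T @ Phi_Ns
    return np.linalg.solve(HPhi @ H, HPhi @ W0)

rng = np.random.default_rng(3)
M = 4
A = rng.standard_normal(M) + 1j * rng.standard_normal(M)
W0 = A / np.linalg.norm(A) ** 2                                   # FBF, (47.20)
H = np.linalg.svd(A.reshape(M, 1), full_matrices=True)[0][:, 1:]  # BM, as before

# Hermitian positive-definite stand-in for the stationary-noise PSD matrix.
B = rng.standard_normal((M, M)) + 1j * rng.standard_normal((M, M))
Phi_Ns = B @ B.conj().T + np.eye(M)

G = anc_wiener(Phi_Ns, H, W0)

def residual_power(G):
    """Stationary-noise power at the beamformer output, the cost in (47.25)."""
    W = W0 - H @ G
    return np.real(np.vdot(W, Phi_Ns @ W))

print(residual_power(G) <= residual_power(np.zeros(M - 1)))  # True: Wiener G never does worse
```

In practice this closed form is unavailable, since $\Phi_{N_sN_s}(k,\ell)$ is unknown and time varying; the chapter therefore replaces it with the NLMS recursion described next.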
In practice, this optimization problem is solved using the normalized LMS algorithm [47.2]:

  G(k,\ell+1) =
  \begin{cases}
    G(k,\ell) + \dfrac{\mu_g}{P_{\mathrm{est}}(k,\ell)}\, U(k,\ell)\,Y^*(k,\ell) , & H_{0s} \text{ true} ,\\[2mm]
    G(k,\ell) , & \text{otherwise} ,
  \end{cases}   (47.27)

where

  P_{\mathrm{est}}(k,\ell) = \alpha_p\, P_{\mathrm{est}}(k,\ell-1) + (1-\alpha_p)\,\|U(k,\ell)\|^2   (47.28)

represents the power of the noise reference signals, $\mu_g$ is a step size that regulates the convergence rate, and $\alpha_p$ is a smoothing parameter in the PSD estimation process.

To allow for the use of the STFT, we further assume that the ANC filters $g_m$ have a time-varying finite impulse response (FIR) structure:

  g_m^T(t) = [\,g_{m,-K_L}(t)\; \ldots\; g_{m,K_R}(t)\,] .   (47.29)

Note that the impulse responses are taken to be noncausal, to allow for relative delays between the FBF and the ANC branches. In order to fulfill the FIR structure constraint in (47.29), the filter update becomes

  \tilde{G}(k,\ell+1) = G(k,\ell) + \mu\, \frac{U(k,\ell)\,Y^*(k,\ell)}{P_{\mathrm{est}}(k,\ell)} , \quad
  G(k,\ell+1) \stackrel{\mathrm{FIR}}{\longleftarrow} \tilde{G}(k,\ell+1) .   (47.30)

The operator $\stackrel{\mathrm{FIR}}{\longleftarrow}$ comprises three stages, applied per filter: transformation of $\tilde{G}_m(k,\ell+1)$ to the time domain, truncation of the resulting impulse response to the interval $[-K_L, K_R]$ (i.e., imposing the FIR constraint), and transformation back to the frequency domain. The various filtering operations involved in the algorithm (multiplications in the transform domain) are realized using the overlap-and-save method [47.50, 51].

The resulting algorithm is merely an extension of the original Griffiths and Jim algorithm to the arbitrary-ATF case. Figure 47.3 depicts a block diagram of the algorithm, and the steps involved in the computation are summarized in Table 47.2. The matched beamformer $W_0(k)$ and the BM $H(k)$ are assumed to be known at this stage.

[Fig. 47.3 Linearly constrained adaptive beamformer: the inputs $Z_1(k,\ell), \ldots, Z_M(k,\ell)$ feed the FBF $W_0^H$, producing $Y_{\mathrm{FBF}}(k,\ell)$, and the BM $H^H$, producing the noise references $U_2(k,\ell), \ldots, U_M(k,\ell)$; these pass through the ANC filters $G_2(k,\ell), \ldots, G_M(k,\ell)$, are summed to $Y_{\mathrm{ANC}}(k,\ell)$, and are subtracted to yield $Y(k,\ell)$.]

47.2.3 Time-Domain Generalized Sidelobe Canceller

The most commonly used GSC structure is the classical time-domain counterpart of the algorithm, proposed