I Introduction
II Acoustic Models Exploiting Variable-length Contextual Information
II-A Recurrent Neural Networks
II-B Convolutional Neural Networks
III Acoustic Models with End-to-end Optimization
III-A Automatically Learned Audio Feature Representation
III-B Connectionist Temporal Classification
III-C Attention-based Sequence-to-Sequence Models
IV Acoustic Model Robustness
IV-A Acoustic Model Adaptation
IV-B Speech Enhancement and Separation
IV-C Robust Training
V Acoustic Models with Efficient Decoding
VI Future Directions
References
Recent Progresses in Deep Learning based Acoustic Models (Updated)

Dong Yu and Jinyu Li
Tencent AI Lab, USA; Microsoft AI and Research, USA
dongyu@ieee.org, jinyli@microsoft.com

arXiv:1804.09298v2 [eess.AS] 27 Apr 2018

Abstract—In this paper, we summarize recent progress made in deep learning based acoustic models and the motivation and insights behind the surveyed techniques. We first discuss acoustic models that can effectively exploit variable-length contextual information, such as recurrent neural networks (RNNs), convolutional neural networks (CNNs), and their various combinations with other models. We then describe acoustic models that are optimized end-to-end, with emphasis on feature representations learned jointly with the rest of the system, the connectionist temporal classification (CTC) criterion, and the attention-based sequence-to-sequence model. We further illustrate robustness issues in speech recognition systems, and discuss acoustic model adaptation, speech enhancement and separation, and robust training strategies. We also cover modeling techniques that lead to more efficient decoding and discuss possible future directions in acoustic model research.

Index Terms—Speech Processing, Deep Learning, LSTM, CNN, Speech Recognition, Speech Separation, Permutation Invariant Training, End-to-End, CTC, Attention Model, Speech Adaptation

1 This is an updated version, with the latest literature up to ICASSP 2018, of the paper: Dong Yu and Jinyu Li, "Recent Progresses in Deep Learning based Acoustic Models," IEEE/CAA Journal of Automatica Sinica, vol. 4, no. 3, 2017.

I. INTRODUCTION

In the past several years, there has been significant progress in automatic speech recognition (ASR) [1]–[21]. This progress has led to ASR systems that surpass the threshold for adoption in many real-world scenarios and has enabled services such as Google Now, Microsoft Cortana, and Amazon Alexa. Many of these achievements are powered by deep learning (DL) techniques. Readers are referred to Yu and Deng 2014 [22] for a comprehensive summary and detailed description of the technology advancements in ASR made before 2015.

In this paper, we survey new developments from the past two years with an emphasis on acoustic models, and we discuss the motivations behind, and core ideas of, each surveyed work. More specifically, in Section II we illustrate improved DL/HMM hybrid acoustic models that employ deep recurrent neural networks (RNNs) and deep convolutional neural networks (CNNs). These hybrid models can better exploit contextual information than deep neural networks (DNNs) and thus lead to new state-of-the-art recognition accuracy. In Section III we describe acoustic models that are designed and optimized end-to-end with few or no non-learnable components. We first discuss models in which audio waveforms are directly used as the input features, so that the feature representation layer is automatically learned instead of manually designed. We then depict models that are optimized using the connectionist temporal classification (CTC) criterion, which allows for direct sequence-to-sequence optimization. Following that, we analyze models that are built with the attention-based sequence-to-sequence framework. We devote Section IV to techniques that can improve robustness, with a focus on adaptation techniques, speech enhancement and separation techniques, and robust training.
In Section V we describe acoustic models that support efficient decoding techniques such as frame skipping, teacher-student training based model compression, and quantization during training. We propose core problems to work on and potential future directions for solving them in Section VI.

II. ACOUSTIC MODELS EXPLOITING VARIABLE-LENGTH CONTEXTUAL INFORMATION

The DL/HMM hybrid model [1]–[5] is the first deep learning architecture that succeeded in ASR and is still the dominant model used in industry. Several years ago, most hybrid systems were DNN based. As reported in [3], one of the important factors that led to the superior performance of the DNN/HMM hybrid system is its ability to exploit contextual information. In most systems, a window of 9 to 13 frames (a left/right context of 4-6 frames) of features is used as the input to the DNN to exploit information from neighboring frames and improve accuracy. However, the optimal length of contextual information may vary for different phones and speaking speeds. This indicates that using a fixed-length context window, as in the DNN/HMM hybrid system, may not be the best way to exploit contextual information. In recent years, new models have been proposed that can exploit variable-length contextual information more effectively. The two most important ones are based on deep RNNs and CNNs.

A. Recurrent Neural Networks

Feed-forward DNNs only consider information in a fixed-length sliding window of frames and thus cannot exploit long-range correlations in the speech signal. RNNs, on the other hand, can encode the sequence history in their internal states, and thus have the potential to predict phonemes based on all the speech features observed up to the current frame. Unfortunately, simple RNNs, depending on the largest eigenvalue of the state-update matrix, may have gradients that either increase or decrease exponentially over time. Hence, basic RNNs are difficult to train and in practice can only model short-range effects.
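To make the eigenvalue argument concrete, the following NumPy sketch (an illustration written for this survey, with arbitrary recurrence matrices, not an experiment from the cited work) propagates the Jacobian of a linear recurrence h_t = W h_{t-1} over time and shows that its norm shrinks or grows exponentially depending on the largest eigenvalue of W.

```python
import numpy as np

def backprop_norms(W, steps=50):
    """Norm of the Jacobian of h_T w.r.t. h_0 for the linear recurrence h_t = W h_{t-1}."""
    J = np.eye(W.shape[0])
    norms = []
    for _ in range(steps):
        J = W @ J            # the chain rule multiplies by the same matrix each step
        norms.append(np.linalg.norm(J))
    return norms

rng = np.random.default_rng(0)
W = rng.standard_normal((32, 32))
radius = max(abs(np.linalg.eigvals(W)))        # largest eigenvalue magnitude

W_shrink = W * (0.9 / radius)   # spectral radius below 1 -> vanishing gradients
W_grow   = W * (1.1 / radius)   # spectral radius above 1 -> exploding gradients

print("vanishing:", backprop_norms(W_shrink)[::10])
print("exploding:", backprop_norms(W_grow)[::10])
```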
Long short-term memory (LSTM) RNNs [23] were developed to overcome these problems. LSTM-RNNs use input, output, and forget gates to control the information flow so that gradients can be propagated in a stable fashion over relatively long spans of time. These networks have been shown to outperform DNNs on a variety of ASR tasks [8], [24]–[27]. Note that there is another popular RNN model, the gated recurrent unit (GRU), which is simpler than the LSTM but is also able to model long- and short-term correlations. Although the GRU has been shown effective in several machine learning tasks [28], it is not widely used in ASR.

At time step t, the computation of the LSTM units can be described in vector form as:

i_t = \sigma(W_{ix} x_t + W_{ih} h_{t-1} + p_i \odot c_{t-1} + b_i)            (1a)
f_t = \sigma(W_{fx} x_t + W_{fh} h_{t-1} + p_f \odot c_{t-1} + b_f)            (1b)
c_t = f_t \odot c_{t-1} + i_t \odot \phi(W_{cx} x_t + W_{ch} h_{t-1} + b_c)    (1c)
o_t = \sigma(W_{ox} x_t + W_{oh} h_{t-1} + p_o \odot c_t + b_o)                (1d)
h_t = o_t \odot \phi(c_t)                                                      (1e)

where x_t is the input vector, and i_t, o_t, f_t are the activations of the input, output, and forget gates, respectively. The W_{.x} and W_{.h} terms are the weight matrices for the inputs x_t and the recurrent inputs h_{t-1}, respectively. The p_i, p_o, p_f are parameter vectors associated with peephole connections. The functions \sigma and \phi are the logistic sigmoid and hyperbolic tangent nonlinearities, respectively, and \odot denotes element-wise multiplication of vectors.
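As a concrete reference for Eqs. (1a)-(1e), below is a minimal NumPy sketch of a single peephole LSTM step. It is written for this survey as an illustration; the toy dimensions and random initialization are assumptions, not the configuration of any cited system.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def lstm_step(x_t, h_prev, c_prev, p):
    """One peephole LSTM step implementing Eqs. (1a)-(1e); p holds the parameters."""
    i = sigmoid(p["Wix"] @ x_t + p["Wih"] @ h_prev + p["pi"] * c_prev + p["bi"])   # (1a)
    f = sigmoid(p["Wfx"] @ x_t + p["Wfh"] @ h_prev + p["pf"] * c_prev + p["bf"])   # (1b)
    c = f * c_prev + i * np.tanh(p["Wcx"] @ x_t + p["Wch"] @ h_prev + p["bc"])     # (1c)
    o = sigmoid(p["Wox"] @ x_t + p["Woh"] @ h_prev + p["po"] * c + p["bo"])        # (1d)
    h = o * np.tanh(c)                                                             # (1e)
    return h, c

# toy dimensions: 40-dim input (e.g., log Mel features), 8 hidden units
rng, n_in, n_hid = np.random.default_rng(0), 40, 8
p = {name: rng.standard_normal((n_hid, n_in if name.endswith("x") else n_hid)) * 0.1
     for name in ["Wix", "Wih", "Wfx", "Wfh", "Wcx", "Wch", "Wox", "Woh"]}
p.update({name: np.zeros(n_hid) for name in ["pi", "pf", "po", "bi", "bf", "bc", "bo"]})

h, c = np.zeros(n_hid), np.zeros(n_hid)
for x_t in rng.standard_normal((5, n_in)):   # run 5 frames through the cell
    h, c = lstm_step(x_t, h, c, p)
print(h.shape)  # (8,)
```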
It is popular to stack LSTM layers to get better modeling power [8]. However, an LSTM-RNN with too many vanilla LSTM layers is very hard to train, and the gradient vanishing issue reappears if the network goes too deep. This issue can be solved by using either highway LSTMs or residual LSTMs. In the highway LSTM [29], memory cells of adjacent layers are connected by gated direct links, which provide a path for information to flow between layers without decay. It therefore alleviates the gradient vanishing issue and enables the training of much deeper LSTM-RNNs.

The residual LSTM [30], [31] uses shortcut connections between LSTM layers, and hence also provides a way to alleviate the gradient vanishing problem. Different from the highway LSTM, which uses gates to guide the information flow, the residual LSTM is more straightforward with its direct shortcut path, similar to the residual CNN [32], which recently achieved great success in image classification.

Typically, log Mel-filter-bank features are used as the input to the neural-network-based acoustic model [33], [34]. Switching two filter-bank bins will not affect the performance of a DNN or LSTM. However, this is not the case when a human reads a spectrogram: a human relies on patterns that evolve over both time and frequency to predict phonemes. This inspired the proposal of the 2-D, time-frequency (TF) LSTM [35]–[37], which jointly scans the speech input over the time and frequency axes to model spectro-temporal warping, and then uses the output activations as the input to the traditional time LSTM. The joint time-frequency modeling provides better normalized features for the upper-layer time LSTMs. This has been verified to be effective and robust to distortion at both Microsoft and Google on large-scale tasks (e.g., Google Home [38]). Note that the 2D-LSTM processes the time and frequency axes sequentially, which increases computational complexity; in [39], several solutions were proposed to reduce this cost.

The highway LSTM has gates along both the temporal and spatial (depth) directions, while the TF-LSTM has gates along both the temporal and spectral directions. It is desirable to have a general LSTM structure that works along all of these directions. The grid LSTM [40] is such a general structure, which arranges the LSTM memory cells into a multidimensional grid. It can be considered a unified way of using LSTMs for temporal, spectral, and spatial computation. The grid LSTM has been studied for temporal and spatial computation in [41] and for temporal and spectral computation in [37].

Although bi-directional LSTMs (BLSTMs) perform better than uni-directional LSTMs by using both past and future context [8], [42], they are not suitable for real-time systems since recognition can happen only after the whole utterance has been observed. For this reason, models that bridge between uni-directional LSTMs and BLSTMs have been proposed, such as the latency-controlled BLSTM (LC-BLSTM) [29] and the row-convolution BLSTM (RC-BLSTM). In these models, the forward LSTM is kept as is, but the backward LSTM is replaced by either a backward LSTM with at most N frames of lookahead, as in the LC-BLSTM case, or a row-convolution operation that integrates the information in the N frames of lookahead. By carefully choosing N we can balance recognition accuracy against latency. Recently, the LC-BLSTM was improved in [43] to speed up evaluation and enable real-time online speech recognition by using a better network topology to initialize the BLSTM memory cell states.
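A rough sketch of the latency-control idea follows, under stated assumptions: the recurrence below is a trivial stand-in for a real LSTM pass (not the exact LC-BLSTM of [29]), and the chunk and lookahead sizes are arbitrary. The point is that the backward direction only sees N lookahead frames beyond each chunk, so latency is bounded by the chunk size plus N rather than by the utterance length.

```python
import numpy as np

def run_recurrence(frames):
    """Stand-in for an LSTM pass (a real system would run Eqs. (1a)-(1e) here)."""
    h, out = np.zeros(frames.shape[1]), []
    for x in frames:
        h = np.tanh(0.5 * h + 0.5 * x)
        out.append(h.copy())
    return np.stack(out)

def lc_blstm_like(frames, chunk=20, lookahead=10):
    """Latency-controlled bidirectional pass: the backward direction only sees
    `lookahead` future frames past each chunk instead of the whole utterance."""
    T = len(frames)
    outputs = []
    for start in range(0, T, chunk):
        end = min(start + chunk, T)
        ctx_end = min(end + lookahead, T)              # limited future context
        fwd = run_recurrence(frames[:end])[start:end]  # recomputed from the start for simplicity
        bwd = run_recurrence(frames[start:ctx_end][::-1])[::-1][: end - start]
        outputs.append(np.concatenate([fwd, bwd], axis=1))
    return np.concatenate(outputs, axis=0)

y = lc_blstm_like(np.random.default_rng(0).standard_normal((100, 40)))
print(y.shape)  # (100, 80): forward + backward features per frame
```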
B. Convolutional Neural Networks

Another model that can effectively exploit variable-length contextual information is the convolutional neural network (CNN) [44], at the center of which is the convolution operation (or layer). The input to the convolution operation is usually a three-dimensional tensor (row, column, channel) for speech recognition, but can be a lower- or higher-dimensional tensor for other applications. Each channel of the input and output of the convolution operation can be considered a view of the same data. In most setups, all channels have the same size (height, width).

The filters in the convolution operation are called kernels, which are four-dimensional tensors (kernel height, kernel width, input channel, output channel) in our case. There are in total C_x x C_v kernels, where C_x is the number of input channels and C_v is the number of output channels. The kernels are applied to local regions called receptive fields in the input image along all channels. The value after the convolution operation is

v_{ij}(K, X) = \sum_n vec(K_n) \cdot vec(X_{ijn})    (2)

for each output channel and input slice (i, j) (the i-th step along the vertical direction and the j-th step along the horizontal direction), where K_n of size (H_k, W_k) is the kernel matrix associated with input channel n and the output channel, and has the same size as the input image patch X_{ijn} of channel n; vec(\cdot) is the vector formed by stacking all the columns of the matrix, and \cdot is the inner product of two vectors. Each output pixel is thus a weighted sum of all pixels across all channels in an input patch. Since each input pixel can be considered a weak pattern detector, each output pixel is a boosted detector exploiting all the information in the input patch.

The kernel is shared across all input patches and moves along the input image with strides S_r and S_c in the vertical and horizontal directions, respectively. When the strides are larger than 1, the convolution operation subsamples, in addition to convolving, the input image and leads to a lower-resolution image that is less sensitive to small pattern shifts inside the input patch. Translational invariance can be further improved when aggregation operations, typically max-pooling or average-pooling, are applied after the convolution operation. The aggregation operations often go together with subsampling to reduce resolution. Due to this built-in translational invariance, CNNs can exploit variable-length contextual information along both the frequency and time axes. Obviously, if only one convolution layer is used, the translational variability the system can tolerate is limited. To allow for more powerful exploitation of variable-length contextual information, convolution operations (or layers) can be stacked.
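To make Eq. (2) concrete, here is a small NumPy sketch (illustrative only; the spectrogram and kernel sizes are assumptions) that computes one output channel of a strided 2-D convolution by taking the inner product between the kernel and each receptive field.

```python
import numpy as np

def conv2d_single_out(X, K, stride=(1, 1)):
    """Eq. (2): X is (H, W, C_in), K is (Hk, Wk, C_in) for one output channel."""
    H, W, _ = X.shape
    Hk, Wk, _ = K.shape
    Sr, Sc = stride
    out = []
    for i in range(0, H - Hk + 1, Sr):            # vertical steps
        row = []
        for j in range(0, W - Wk + 1, Sc):        # horizontal steps
            patch = X[i:i + Hk, j:j + Wk, :]      # receptive field across all channels
            row.append(np.sum(K * patch))         # sum_n vec(K_n) . vec(X_ijn)
        out.append(row)
    return np.array(out)

rng = np.random.default_rng(0)
spectrogram = rng.standard_normal((40, 100, 1))   # (freq bins, frames, channels)
kernel = rng.standard_normal((5, 5, 1))
print(conv2d_single_out(spectrogram, kernel, stride=(2, 2)).shape)  # (18, 48)
```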
The time delay neural network (TDNN) [45] was the first model that exploited multiple CNN layers for ASR. In this model, convolution operations are applied to both the time and frequency axes. However, the early TDNNs were neural-network-only solutions that did not integrate with HMMs and were hard to use in large vocabulary continuous speech recognition (LVCSR).

After the successful application of DNNs to LVCSR, CNNs were reintroduced under the DL/HMM hybrid model architecture [5], [7], [11], [14], [17], [46]–[48]. Because HMMs in the hybrid model already have a strong ability to handle the variable-length utterance problem in ASR, CNNs were initially reintroduced to deal with variability along the frequency axis only [5], [7], [46], [47]. The goal was to improve robustness against vocal tract length variability between different speakers. Only one to two CNN layers were used in these early models, stacked with additional fully-connected DNN layers. These models showed around 5% relative recognition error rate reduction compared to DNN/HMM systems [7]. Later, additional RNN layers, e.g., LSTMs, were integrated into the model to form the so-called CNN-LSTM-DNN (CLDNN) [10] and CNN-DNN-LSTM (CDL) architectures. The RNNs in these models help to exploit variable-length contextual information, since the CNNs in these models only deal with frequency-axis variability. CLDNN and CDL both achieved additional accuracy improvements over CNN-DNN models.

Researchers quickly realized that dealing with variable-length utterances is different from exploiting variable-length contextual information. TDNNs, which convolve along both the frequency and time axes and thus exploit variable-length contextual information, attracted new attention, this time under the DL/HMM hybrid architecture [13], [49] and with variations such as row convolution [15] and the feedforward sequential memory network (FSMN) [16]. Similar to the original TDNNs, these models stack several CNN layers along the frequency and time axes, with a focus on the time axis, to account for speaking-rate variation. But unlike the original TDNNs, the TDNN/HMM hybrid systems can recognize large vocabulary continuous speech very effectively.

More recently, primarily motivated by successes in image recognition, various architectures of deep CNNs [14], [17], [48], [50] have been proposed and evaluated for ASR. The premise is that spectrograms can be seen as images with special patterns from which experienced people can tell what has been said. In deep CNNs, each higher layer is a weighted sum of nonlinear transformations of a window of lower layers and thus covers longer contexts and operates on more abstract patterns. Lower CNN layers capture local, simple patterns, while higher CNN layers detect broader, more abstract, and more complicated patterns. Small kernels combined with many layers allow deep CNNs to exploit longer-range dependency information along both the time and frequency axes more effectively. Empirically, deep CNNs are comparable to BLSTMs [19], which in turn outperform unidirectional LSTMs. However, unlike BLSTMs, which suffer from long latency (decoding can only start after the whole utterance has been received) and cannot be deployed in real-time systems, deep CNNs have limited latency and are better suited for real-time systems if the computational cost can be controlled.

Training and evaluation of deep CNNs is very time consuming, especially if each window of frames is treated independently, in which case there is significant duplication of computation. To speed up the computation we can treat the whole utterance as a single input image and thus reuse the intermediate computation results. Better still, the deep CNN can be designed so that the stride at each layer is long enough to cover the whole kernel, similar to CNNs with layer-wise context expansion and attention (LACE) [17]. Such a model, called the dilated CNN [48], exploits longer-range information with fewer layers and can significantly reduce the computational cost. The dilated CNN has outperformed other deep CNN models on the Switchboard task [48]. Note that deep CNNs can be used together with RNNs and under frameworks such as connectionist temporal classification (CTC), which we will discuss in Section III-B.
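The following NumPy sketch illustrates the basic dilated-convolution idea behind such models (it is not the LACE or Switchboard configuration of [17], [48]; the kernel size and dilation schedule are arbitrary assumptions): with dilation d, a kernel of length K reads samples d steps apart, so stacking a few layers with growing dilation covers a long context with few layers and parameters.

```python
import numpy as np

def dilated_conv1d(x, w, dilation=1):
    """y[t] = sum_k w[k] * x[t + k * dilation], over valid positions only."""
    K = len(w)
    span = (K - 1) * dilation + 1                  # receptive field of this layer
    return np.array([np.dot(w, x[t:t + span:dilation])
                     for t in range(len(x) - span + 1)])

rng = np.random.default_rng(0)
x = rng.standard_normal(200)                       # e.g., one feature trajectory
w = rng.standard_normal(3)

y = x
receptive = 1
for d in (1, 2, 4, 8):                             # exponentially growing dilation
    y = dilated_conv1d(y, w, dilation=d)
    receptive += (len(w) - 1) * d
print(len(y), "outputs; receptive field of", receptive, "input frames")
```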
III. ACOUSTIC MODELS WITH END-TO-END OPTIMIZATION

The models discussed in the previous section are DNN/HMM hybrid models in which the two components, the DNN and the HMM, are usually optimized separately. However, speech recognition is a sequential recognition problem. It is not surprising that better recognition accuracy may be achieved if all components of a model are jointly optimized, and better still if the model can remove all manually designed components such as the basic feature representation and the lexicon design.

A. Automatically Learned Audio Feature Representation

It is always arguable whether the manually designed log Mel-filter-bank feature is optimal for speech recognition. Inspired by end-to-end processing in the machine learning community, there have been efforts [51]–[54] to replace Mel-filter-bank extraction by directly learning filters with a network that processes the raw speech waveform and is trained jointly with the recognizer network. Among these efforts, the CLDNN [10] on raw waveforms [54] seems the most promising, as it obtained a slight gain over the log Mel-filter-bank feature while the other works did not. More importantly, it serves as a good foundation for multichannel processing with raw waveforms.

The most critical issue in raw waveform processing is using a representation that is invariant to small phase shifts, because raw waveforms are perceptually identical if the only difference is a small phase shift. To achieve phase invariance, a time-convolution layer is applied to the raw waveform, and pooling is then done over the entire time length of the time-convolved output signal. This process reduces the temporal variation (hence is phase invariant) and is very similar to Gammatone filterbank extraction. The pooled outputs can be considered filter-bank outputs, on which the standard CLDNN [10] is applied. Interestingly, the same idea was recently applied to the anti-spoofing speaker verification task with significant gains [55].
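A simplified NumPy sketch of this phase-invariant front-end follows. The filter lengths, filter count, and log compression are illustrative assumptions, not the exact configuration of [54]: each learned filter is convolved with a window of raw waveform, and pooling over the whole window yields one filter-bank-like value per filter, which changes little under a small shift of the input.

```python
import numpy as np

def learned_filterbank(window, filters):
    """Time convolution of one waveform window with each filter, then max-pooling
    over the entire time axis, which makes the output robust to small shifts."""
    feats = []
    for f in filters:
        response = np.convolve(window, f, mode="valid")   # time-convolution layer
        feats.append(np.max(np.abs(response)))            # pool over the whole window
    return np.log(np.array(feats) + 1e-6)                 # log compression, as for fbanks

rng = np.random.default_rng(0)
filters = rng.standard_normal((40, 400)) * 0.01           # 40 filters of 25 ms at 16 kHz
window = rng.standard_normal(560)                          # one 35 ms analysis window

feat = learned_filterbank(window, filters)
feat_shifted = learned_filterbank(np.roll(window, 5), filters)
print(feat.shape, np.max(np.abs(feat - feat_shifted)))    # compare before/after a 5-sample shift
```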
With the application of deep learning models, ASR systems now perform very well in close-talking scenarios. Research interest has shifted to far-field ASR, which needs to handle both additive noise and reverberation. The currently dominant approach still uses traditional beamforming to process the waveforms from multiple microphones and then feeds the beamformed signal into the acoustic model [56]. Efforts have also been made to use deep learning models to perform beamforming and to jointly train the beamforming and recognizer networks [57]–[60]. In [57], the beamforming and recognizer networks were trained in sequence: first the beamforming network, then the recognizer network on the beamformed signal, and finally both networks jointly. In [58]–[60], both networks were jointly trained in a more end-to-end fashion by extending the aforementioned CLDNN to raw waveforms. In the first layer, multiple time-convolution filters are used to map the raw waveforms from multiple microphones into a single time-frequency representation [58]. The output is then passed to the upper-layer CLDNN for phoneme classification. Later, the joint network was improved by factorizing the spatial and spectral selectivity of the bottom-layer network into a spatial filtering layer and a spectral filtering layer. The factored network brings accuracy improvements at the cost of increased computation, which was later reduced by converting the time-domain convolution into a frequency-domain product [61]. Later still, the CNN layer was replaced by a 2D-LSTM layer [37] to improve robustness, and Google Home was built with this end-to-end system [38].

B. Connectionist Temporal Classification

Speech recognition is a sequence-to-sequence task, which maps the input waveform to a final word sequence or an intermediate phoneme sequence. What acoustic modeling cares about is the word or phoneme output sequence, rather than the frame-by-frame labeling that the traditional cross-entropy (CE) training criterion targets. The connectionist temporal classification (CTC) approach [9], [62], [63] was therefore introduced to map the speech input frames into an output label sequence. Because the number of output labels is smaller than the number of input speech frames, CTC paths are introduced to force the output to have the same length as the input speech frames by adding a blank as an additional label and allowing repetitions of labels.

Denote by x the speech input sequence, by l the original label sequence, and by B^{-1}(l) the set of all CTC paths mapped from l. The CTC loss function is then defined as the negative log probability of the correct labels:

L_{CTC} = -\ln P(l|x),    (3)

where

P(l|x) = \sum_{z \in B^{-1}(l)} P(z|x).    (4)

With the conditional independence assumption, P(z|x) can be decomposed into a product of per-frame posteriors:

P(z|x) = \prod_{t=1}^{T} P(z_t|x).    (5)
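To connect Eqs. (3)-(5) to code, below is a compact NumPy sketch of the CTC forward (alpha) recursion over the blank-augmented label sequence. It is a didactic implementation written for this survey, not the code used in [9], [62], [63], and it works in the probability domain without the log-space scaling a production system would need.

```python
import numpy as np

def ctc_loss(posteriors, labels, blank=0):
    """posteriors: (T, K) per-frame P(z_t|x); labels: target sequence without blanks.
    Returns -ln P(l|x) from Eqs. (3)-(5) via the forward (alpha) recursion."""
    T = posteriors.shape[0]
    ext = [blank]                       # extended sequence: blanks around every label
    for s in labels:
        ext += [s, blank]
    S = len(ext)

    alpha = np.zeros((T, S))
    alpha[0, 0] = posteriors[0, ext[0]]
    alpha[0, 1] = posteriors[0, ext[1]]
    for t in range(1, T):
        for s in range(S):
            a = alpha[t - 1, s]
            if s > 0:
                a += alpha[t - 1, s - 1]
            # the skip transition is allowed unless the previous non-blank label is identical
            if s > 1 and ext[s] != blank and ext[s] != ext[s - 2]:
                a += alpha[t - 1, s - 2]
            alpha[t, s] = a * posteriors[t, ext[s]]

    p_l_given_x = alpha[T - 1, S - 1] + alpha[T - 1, S - 2]   # sum over valid end states
    return -np.log(p_l_given_x)

rng = np.random.default_rng(0)
post = rng.random((30, 5))
post /= post.sum(axis=1, keepdims=True)    # 30 frames, 5 units (unit 0 is the blank)
print(ctc_loss(post, labels=[2, 3, 3, 1]))
```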
In [62], CTC with context-dependent phone output units was shown to outperform CTC with monophones [9], and to perform on par with an LSTM model trained with the cross-entropy criterion when the training data is large enough. One attractive characteristic of CTC is that, because the output unit is usually a phoneme or a unit larger than a phoneme, it can take a larger input time step than the single 10 ms frame shift. For example, in [62], three 10 ms frames are stacked together as the input to the CTC model. By doing so, acoustic score evaluation and decoding happen every 30 ms, three times faster than traditional systems operating on a 10 ms frame shift.

The most attractive characteristic of CTC is that it provides a path toward end-to-end optimization of acoustic models. In the Deep Speech [15], [64] and EESEN [65], [66] work, end-to-end speech recognition systems were explored that directly predict characters instead of phonemes, removing the need for the lexicons and decision trees that are building blocks in [9], [62], [63]. This is one step toward removing expert knowledge when building an ASR system. Another advantage of character-based CTC is that it is more robust to accented speech, since the graphoneme sequence of a word is less affected by accents than its phoneme pronunciation [67]. Other output units that are larger than characters but smaller than words have also been studied [68].

It is a design challenge to determine the basic output unit to use for CTC prediction. In all the aforementioned works, the decomposition of a target word sequence into a sequence of basic units is fixed. However, a pre-determined fixed decomposition is not necessarily optimal. In [69], gram-CTC was proposed to automatically learn the most suitable decomposition of target sequences. Gram-CTC is based on characters, but allows a variable number of characters (i.e., a gram) to be output at each time step. This not only boosts the modeling flexibility but also improves the final ASR system accuracy.

As the goal of ASR is to generate a word sequence from the speech waveform, the word is the most natural output unit for network modeling. In [62], CTC with word output targets was explored, but the accuracy was far from that of the phoneme-based CTC system. In [18], it was shown that by using 100k words as the output targets and training the model with 125k hours of data, the CTC system with word units can beat the CTC system with phoneme units. Figure 1 gives an example of the posterior output of word CTC. In the figure, the units with the maximum posterior values are blanks and silences at most time steps, and all other posterior spikes come from word units. Hence, the ASR task becomes very simple: the output word sequence is constructed by taking the words corresponding to the posterior spikes. No language model or complex decoding process is involved.

Fig. 1. An example of word CTC.

A big challenge in word-based CTC is the out-of-vocabulary (OOV) issue. In [18], [62], [70], only the most frequent words in the training set were used as targets, whereas the remaining words were simply tagged as OOVs. These OOV words can neither be modeled nor recognized during evaluation. To solve this OOV issue in word-based CTC, a hybrid CTC was proposed [71], which uses the output from a word-based CTC as the primary ASR result and consults a character-based CTC at the segment level where the word-based CTC emits an OOV token. In [72], a spell-and-recognize model was used to learn to first spell a word and then recognize it; whenever an OOV is detected, the decoder consults the character sequence from the speller. In [71], [72], the displayed hypothesis is more meaningful to users than an OOV token. However, neither method improves the overall recognition accuracy much, due to the two-stage (OOV detection and then character-sequence consulting) process. In [73], a better solution was proposed: at the training stage, OOV words are decomposed into mixed-unit sequences of frequent words and characters. During testing, the system uses greedy decoding to generate hypotheses in a single step, without the two-stage processing. Combined with attention modeling for CTC [74], such an end-to-end model, without using any LM or complex decoder, can significantly outperform a traditional context-dependent phoneme CTC system that has a strong LM and decoder.
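The greedy (best-path) decoding mentioned above can be sketched in a few lines; this is an illustration only, and real systems operate on word or mixed-unit posteriors as in [18], [73]. The idea is to pick the highest-scoring unit per frame, collapse repeats, and drop blanks.

```python
import numpy as np

def ctc_greedy_decode(posteriors, blank=0):
    """Best-path decoding: argmax per frame, collapse repeats, remove blanks."""
    best = np.argmax(posteriors, axis=1)          # most likely unit at each frame
    output, prev = [], None
    for unit in best:
        if unit != prev and unit != blank:        # keep only new, non-blank spikes
            output.append(int(unit))
        prev = unit
    return output

# toy posteriors over 8 frames and 4 units (0 = blank); spikes at units 2 and 3
post = np.full((8, 4), 0.05)
post[:, 0] = 0.85
post[2, :] = [0.05, 0.05, 0.85, 0.05]
post[3, :] = [0.05, 0.05, 0.85, 0.05]
post[6, :] = [0.05, 0.05, 0.05, 0.85]
print(ctc_greedy_decode(post))                    # [2, 3]
```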
Compared to traditional cross-entropy training of LSTMs, CTC is harder to train. First, network initialization is very important: in [9], the LSTM network for CTC training was initialized from an LSTM network trained with the cross-entropy criterion. This can be circumvented by using a very large amount of training data, which also helps to prevent overfitting [62]. If the CTC network is randomly initialized and presented with very difficult samples, it tends to be very hard to train. In [15], a learning strategy called SortaGrad was proposed, which presents the CTC network with shorter utterances (easy samples) first and longer utterances (hard samples) later during the first training epoch; in later epochs the utterances are presented in random order. This significantly improves the convergence of CTC training.

The spike patterns in Figure 1 are general to CTC modeling with any basic unit. Therefore, at the time steps where the blank symbol dominates, it may be redundant to search, as no information is provided. Based on this observation, phone synchronous decoding [75] was proposed, which skips the search at blank-dominated time steps during CTC decoding. A 2-3 times speedup was obtained without accuracy loss.

Note that the spikes in CTC usually occur with a delay relative to the ground-truth location of the symbol. Such a delay introduces latency during runtime decoding, which is undesirable for systems with real-time requirements. Therefore, a delay-constrained training scheme was proposed in [63] that restricts the search paths used in the forward-backward process during CTC training to those in which the delay between the CTC labels and the ground-truth alignment does not exceed a threshold. This constraint degrades CTC training slightly, but the loss was recovered after sequence training.
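A minimal sketch of the blank-skipping idea behind the phone synchronous decoding [75] discussed above (illustrative only; a real decoder prunes search states rather than simply dropping frames, and the 0.9 threshold is an arbitrary assumption): frames whose blank posterior dominates are removed before any search is run.

```python
import numpy as np

def drop_blank_dominated_frames(posteriors, blank=0, threshold=0.9):
    """Keep only frames where the blank posterior is below the threshold,
    so the (expensive) search only runs on the informative spike frames."""
    keep = posteriors[:, blank] < threshold
    return posteriors[keep], np.flatnonzero(keep)

rng = np.random.default_rng(0)
post = np.full((100, 50), 0.001)
post[:, 0] = 1.0 - 0.001 * 49                      # blank dominates almost everywhere
for t, unit in [(10, 7), (11, 7), (42, 23), (80, 5)]:
    post[t] = 0.002
    post[t, unit] = 1.0 - 0.002 * 49               # a few non-blank spikes

reduced, frame_ids = drop_blank_dominated_frames(post)
print(len(post), "->", len(reduced), "frames searched at", frame_ids.tolist())
```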
The frame-independence assumption in CTC is its most criticized aspect. There have been several attempts to improve CTC modeling by relaxing or removing this assumption. In [74], attention modeling was directly integrated into the CTC framework by using time-convolution features, non-uniform attention, implicit language modeling, and component attention. Such an attention CTC model relaxes the frame-independence assumption by working on the model's hidden layers, without changing the CTC objective function and training process, hence retaining the simplicity of CTC modeling. On the other hand, the RNN transducer [76] and the RNN aligner [77] extend CTC modeling by changing the objective function and the training process to remove the frame-independence assumption. Specifically, the RNN transducer has been shown to be very effective [78], [79], as it incorporates the acoustic model in its encoder, the language model in its prediction network, and the decoding process in its joint network.

Inspired by the CTC work, lattice-free maximum mutual information (LFMMI) [80] was recently proposed to train deep networks from scratch without initializing from cross-entropy networks. This single-step training has a great advantage over the currently popular two-step scheme of cross-entropy training followed by sequence training. Many ingredients were developed to make LFMMI work, including a topology in which the first frame of a phoneme has a different label from the remaining frames, a phoneme n-gram language model used to create the denominator graph, a time constraint similar to the delay constraint used in CTC, several regularization methods to reduce overfitting, and stacking multiple input frames as CTC does. LFMMI has been proven effective on tasks of different scales and with different underlying models.

Although many models have been proposed in recent years, there is clearly a main line of acoustic model development from the DNN to the LSTM (temporal modeling) and then to CTC (end-to-end modeling). Although some models can achieve performance similar to CTC when modeling phonemes, they may not fit the trend of end-to-end modeling very well, as they still require expert knowledge to design and need components such as a language model and a lexicon.

C. Attention-based Sequence-to-Sequence Models

The attention-based sequence-to-sequence model is another end-to-end model [81], [82]. It is rooted in the successful attention model in machine learning [83], [84], which extends the encoder-decoder framework [85] with an attention decoder. The attention model calculates the probability at step u as

P(l|x) = \prod_u P(l_u | x, l_{1:u-1}),    (6)

with

P(l_u | x, l_{1:u-1}) = AttentionDecoder(h, l_{1:u-1}),    (7)
h = Encoder(x).    (8)

The training criterion is to minimize -\ln P(l|x). The flowchart of the attention-based model is given in Figure 2. Different from the encoder in [85], which only takes the hidden vector of the last time step, the encoder in Eq. (8) transforms the whole speech input sequence x into a high-level hidden vector sequence h = (h_1, h_2, ..., h_L), L <= T. Then, at each step of generating an output label l_u, an attention mechanism in Eq. (7) selects/weights the hidden vector sequence h so that the most relevant hidden vectors are used for the prediction. Comparing Eq. (6) with Eq. (4), we can see that the attention-based model does not make the frame-independence assumption imposed by CTC, which is the advantage of the attention model.

Fig. 2. The flowchart of attention-based model.

The AttentionDecoder network has three components: a multinomial distribution generator (9), an RNN decoder (10), and an attention network (11)-(16), as follows:

l_u = Generate(l_{u-1}, s_u, c_u),    (9)
s_u = Recurrent(s_{u-1}, l_{u-1}, c_u),    (10)
c_u = Annotate(\alpha_u, h) = \sum_{t=1}^{T} \alpha_{u,t} h_t,    (11)
\alpha_u = Attend(s_{u-1}, \alpha_{u-1}, h).    (12)

Here, l_u \in U^K; h_t, c_u \in R^n; \alpha_u \in U^T; and for simplicity s_u \in R^n. Generate(.) is a feedforward network with a softmax operation generating the probability of the target output p(l_u | l_{u-1}, s_u, c_u). Recurrent(.) is an RNN decoder operating on the output time axis indexed by u, with hidden state s_u. Annotate(.) computes the context vector c_u (also called the soft alignment) using the attention probability vector \alpha_u. Attend(.) computes the attention weight \alpha_{u,t} using a single-layer feedforward network as

e_{u,t} = Score(s_{u-1}, \alpha_{u-1}, h_t),    (13)
\alpha_{u,t} = \frac{\exp(e_{u,t})}{\sum_{t'=1}^{T} \exp(e_{u,t'})},    (14)

where e_{u,t} \in R. Score(.) can be either content-based or hybrid. It is computed as

e_{u,t} = v^T \tanh(U s_{u-1} + W h_t + b),  (content)
e_{u,t} = v^T \tanh(U s_{u-1} + W h_t + v f_{u,t} + b),  (hybrid)    (15)

where

f_u = f * \alpha_{u-1}.    (16)

The operation * denotes convolution, and U, W, v, f, b are trainable attention parameters.
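A NumPy sketch of one attention step corresponding to Eqs. (11)-(14), using the content-based score from Eq. (15); the dimensions and random initialization are illustrative assumptions made for this survey.

```python
import numpy as np

def softmax(e):
    e = e - np.max(e)
    return np.exp(e) / np.sum(np.exp(e))

def attend_content(s_prev, h, U, W, v, b):
    """Eqs. (13)-(14): content-based scores, then a softmax over encoder time steps."""
    e = np.array([v @ np.tanh(U @ s_prev + W @ h_t + b) for h_t in h])
    return softmax(e)                                   # alpha_u over the encoder steps

rng = np.random.default_rng(0)
T_enc, n = 50, 16                                       # encoder steps, state size
h = rng.standard_normal((T_enc, n))                     # h = Encoder(x), Eq. (8)
s_prev = rng.standard_normal(n)                         # decoder state s_{u-1}
U, W = rng.standard_normal((n, n)), rng.standard_normal((n, n))
v, b = rng.standard_normal(n), np.zeros(n)

alpha = attend_content(s_prev, h, U, W, v, b)           # Eq. (12)/(14)
c_u = alpha @ h                                         # Eq. (11): context vector
print(alpha.shape, c_u.shape, alpha.sum())              # (50,) (16,) 1.0
```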
The attention-based model is even harder to train than the CTC model, and plenty of tricks have to be applied. For example, the vanilla attention-based model is very expensive to train if all the hidden vectors at all time steps are used in Eq. (7). Therefore, a windowing method is used in [81] to reduce the number of candidates used in the attention decoder. In [82], a pyramid structure is used in the encoder network so that only L high-level hidden vectors are generated instead of T hidden vectors from all the input time steps.

Due to the high complexity and slow speed of training, the majority of attention-based work has been done at Google [82], in contrast to the CTC work reported from many sites. Recently, however, Google significantly advanced the research on attention-based models by
• modeling with word-piece units [86], which are more stable and helpful to LM modeling;
• including scheduled sampling [87], which feeds the previously predicted label instead of the ground truth during training so that training and testing are consistent;
• using multi-head attention [88], so that each head can generate a different attention distribution and play a different role;
• applying label smoothing [89] to prevent the model from making over-confident predictions;
• integrating an external LM trained with more text data [90];
• using minimum word error rate sequence discriminative training [91].
With all these improvements, the final attention-based end-to-end system clearly outperformed the traditional hybrid system [92].

Different from CTC, a challenge for the attention-based model is that attention is performed over the whole input utterance, which means it cannot run in a streaming fashion even if the encoder can be run in a streaming mode. The frame-independence assumption in CTC is its most criticized aspect because speech frames are correlated, while the attention-based model has the drawbacks of lacking a monotonic left-to-right alignment and of slow convergence. In [93], attention training is combined with CTC training in a multi-task learning setup that uses the CTC objective function as an auxiliary function. Such a training strategy greatly improves the convergence of the attention-based model and mitigates the alignment issue. In [94], this was further advanced by jointly decoding with the scores from both the attention-based model and the CTC model.

IV. ACOUSTIC MODEL ROBUSTNESS

Current state-of-the-art systems can achieve remarkable recognition accuracy when the test and training sets match, especially when both are collected under quiet, close-talk conditions. However, performance dramatically degrades under mismatched or complicated environments, such as higher noise conditions (including music or interfering talkers) or speech with strong accents [95], [96]. The solutions to this problem include adaptation, speech enhancement, and robust modeling.

A. Acoustic Model Adaptation

In this section, we use speaker adaptation as an example scenario to describe acoustic model adaptation technologies. The same techniques can easily be applied to adaptation to new environments, tasks, etc. Typically, speaker-independent (SI) models are trained from a large dataset with the objective of working well for all speakers, and speaker adaptation can significantly boost the performance for an individual speaker [97], [98]. However, we typically have limited adaptation data, and unsupervised adaptation is the mainstream given the prohibitive transcription cost. The current research focus is therefore unsupervised adaptation with a limited amount of speaker-dependent data, which can be addressed with better adaptation criteria and model topologies. Since the adapted models are speaker dependent (SD), the size of the SD parameters is critical if we want to scale to millions of speakers. This requires solutions that minimize the SD model footprint while maintaining the adaptation benefits. Given the limited amount of adaptation data, the SD model should not stray far from the SI model.
To this end, [99] adds Kullback-Leibler divergence (KLD) regularization to the adaptation criterion to prevent the adapted model from straying too far from the SI model. This KLD adaptation criterion has proven very effective when dealing with limited adaptation data. Most state-of-the-art SI models use senones (tied triphone states) as the output units. When only a limited amount of adaptation data is available, only a small number of senones are observed; adaptation then overfits the data distribution of these senones and cannot generalize well. In [100], a multi-task learning (MTL) framework was proposed that adds auxiliary monophone classification as a second task in addition to the primary senone classification task. As a result, the network adaptation backs off to improving monophone classification accuracy when senones are not observed, which increases the generalization ability.

In contrast to adjusting the adaptation criterion, most works focus on how to use a very small number of parameters to represent speaker characteristics. One solution is the singular value decomposition (SVD) bottleneck adaptation [101], which produces low-footprint SD models by making use of the SVD-restructured topology [102]. A linear transformation is applied to each bottleneck layer by adding a k × k SD matrix. The advantage of this approach is that only a few small matrices need to be updated for each speaker, as k is the low-rank value of the SVD reconstruction and is usually very small. This dramatically reduces the deployment cost for speaker personalization while producing a more reliable estimate of the adapted model [101].

Further work has been done to reduce the size of the k × k SD matrices. For example, when the adaptation data is very limited, the k × k matrix can be reduced to a diagonal matrix, as in learning hidden unit contributions (LHUC) [103], [104] and sigmoid adaptation [105]. This is a tradeoff between modeling capacity and generalization: LHUC and sigmoid adaptation have far fewer adaptation parameters than SVD adaptation, but they may not yield a similar accuracy improvement when the amount of adaptation data increases. The observation that the k × k SD matrices are usually diagonally dominant inspired the low-rank plus diagonal (LRPD) decomposition, which decomposes each k × k SD matrix into a diagonal matrix plus the product of two low-rank matrices. By varying the low-rank values, the LRPD matrix generalizes both the full-rank and the diagonal adaptation matrices, and hence can automatically utilize the adaptation data well instead of making a fixed tradeoff between model capacity and generalization.
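A small NumPy sketch of the LRPD idea (the dimensions are arbitrary, and this illustrates only the decomposition, not the training procedure of the cited work): the k × k speaker-dependent transform is represented as a diagonal matrix plus the product of two rank-r matrices, shrinking the per-speaker footprint from k^2 to k + 2kr parameters.

```python
import numpy as np

def lrpd_transform(d, A, B):
    """LRPD speaker-dependent transform: W_sd = diag(d) + A @ B."""
    return np.diag(d) + A @ B

k, r = 256, 8                                   # bottleneck size, low rank
rng = np.random.default_rng(0)

d = np.ones(k)                                  # start from identity (no adaptation)
A = rng.standard_normal((k, r)) * 0.01          # low-rank factors: k x r and r x k
B = rng.standard_normal((r, k)) * 0.01

W_sd = lrpd_transform(d, A, B)
bottleneck = rng.standard_normal(k)             # activation at an SVD bottleneck layer
adapted = W_sd @ bottleneck                     # speaker-adapted activation

full_rank, lrpd = k * k, k + 2 * k * r
print(f"per-speaker parameters: full-rank {full_rank}, LRPD {lrpd}")  # 65536 vs 4352
```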
The subspace methods are another family of methods that aim to find a low-dimensional subspace of the transformations, so that each transformation can be specified by a small number of parameters. One popular method in this category is the use of auxiliary features, such as i-vectors [106], [107], speaker codes [108], and noise estimates [109], which are concatenated with the standard acoustic features. It can be shown that the augmentation of auxiliary features is equivalent to confining the adapted bias vectors to a speaker subspace [110]. Furthermore, networks can be used to transform speaker features such as i-vectors into a bias that offsets the speech features into a speaker-normalized space [111]. In addition to augmenting features in the input space, acoustic-factor features can also be appended to any layer of the deep network [112].

Other subspace methods include cluster adaptive training (CAT) [113], [114] and the factorized hidden layer (FHL) [115], [116], where the transformations are confined to a speaker subspace. Similar to eigenvoice [117] or cluster adaptive training [118] in the Gaussian mixture model era, CAT [113], [114] in DNN training constructs multiple DNNs to form the bases of a canonical parametric space. During adaptation, an interpolation vector associated with a target speaker or environment is estimated online to combine the multiple DNN bases into a single adapted DNN. Because only the combination vector is estimated, the adaptation needs only a very small amount of data and is therefore fast; however, this again trades off model capacity. In contrast to online estimation of the combination vector, [119], [120] directly use the posterior vectors of the acoustic context to enable fast unsupervised adaptation. The acoustic context factor can be the speaker, the gender, or acoustic environments such as noise and reverberation. The posterior calculation can be either independent of [119] or dependent on [120] the recognizer network.

An issue in the CAT-style methods is that the bases are full-rank matrices, which require a very large amount of training data; therefore, the number of bases in CAT is usually constrained to a few [113], [114]. A solution is FHL [115], [116], which constrains the bases to be rank-1 matrices. In this way, the training data required for each basis is significantly reduced, enabling the use of a larger number of bases. FHL also initializes the combination vector from the i-vector for speaker adaptation, which gives the adaptation a very good starting point. In [121], LRPD was extended into a subspace-based approach to further reduce the speaker-specific footprint, in a way very similar to FHL.

B. Speech Enhancement and Separation

It is well known that current ASR systems perform poorly when the speech is corrupted by heavy noise or interfering speech [122], [123]. Although human listeners also suffer from poor audio signals, their performance degradation is significantly smaller than that of ASR systems.

In recent years, much work has been done to enhance speech under these conditions. Although the majority of the work focuses on single-channel speech enhancement and separation, the same techniques can be extended to multi-channel signals. In the monaural speech enhancement and separation tasks, it is assumed that a linearly mixed single-microphone signal y[n] = \sum_{s=1}^{S} x_s[n] is observed, and the goal is to recover the S audio source streams x_s[n], s = 1, ..., S. If there are only two audio sources, one for speech and one for noise (or music, etc.), and the goal is to recover the speech source, the task is often called speech enhancement. If there are multiple speech sources, the task is often referred to as speech separation. Enhancement and separation are usually carried out in the time-frequency domain, in which the task can be cast as recovering the short-time Fourier transform (STFT) of the source signals, X_s(t, f) for each time frame t and frequency bin f, given the STFT of the mixed speech

Y(t, f) = \sum_{s=1}^{S} X_s(t, f).
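A short NumPy sketch of this monaural setup (synthetic signals, illustrative only; the frame and hop sizes are assumptions): two sources are mixed into a single channel, and because the STFT is linear, the mixture spectrum is the sum of the source spectra, which is exactly the time-frequency formulation above.

```python
import numpy as np

def stft(x, frame=256, hop=128):
    """Very small STFT: Hann-windowed frames followed by an FFT."""
    win = np.hanning(frame)
    frames = [x[t:t + frame] * win for t in range(0, len(x) - frame + 1, hop)]
    return np.fft.rfft(np.stack(frames), axis=1)        # (time frames, freq bins)

rng = np.random.default_rng(0)
n = np.arange(16000)                                     # 1 s at 16 kHz
x1 = np.sin(2 * np.pi * 440 * n / 16000)                 # stand-in for speech
x2 = 0.5 * rng.standard_normal(len(n))                   # noise / second source
y = x1 + x2                                              # y[n] = sum_s x_s[n]

Y, X1, X2 = stft(y), stft(x1), stft(x2)
print(Y.shape, np.max(np.abs(Y - (X1 + X2))))            # STFT linearity: Y = X1 + X2
```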
Obviously, given only the mixed spectrum Y(t, f), the problem of recovering X_s(t, f) is under-determined (ill-posed), as there are an infinite number of possible X_s(t, f) combinations that lead to the same Y(t, f). To overcome this, the system has to learn a model from a training set S that contains parallel sets of mixtures Y(t, f) and their constituent target sources X_s(t, f), s = 1, ..., S [20], [21], [124]–[129].

Over the decades, many attempts have been made to attack this problem. Before the deep learning era, the most popular techniques included computational auditory scene analysis (CASA) [130]–[132], non-negative matrix factorization (NMF) [133]–[135], and model-based approaches [136]–[138] such as the factorial GMM-HMM [139]. Unfortunately, these techniques led to only very limited success.

Recently, researchers have developed many deep learning techniques for speech enhancement and separation. The core of these techniques is to cast the enhancement or separation problem as a supervised learning problem. More specifically, the deep learning models are optimized to predict the source belonging to the target class, usually for each time-frequency bin, given pairs of (usually artificially) mixed speech and the source streams. Compared to the original unsupervised setup, this is a significant step forward and has led to great progress in speech enhancement. This simple strategy, however, is still not satisfactory, as it only works for separating audio streams with very different characteristics, such as separating speech from (often challenging) background noise (or music), or the speech of a specific speaker from that of other speakers [127]. It does not work well for speaker-independent multi-talker speech separation.

The difficulty in speaker-independent multi-talker speech separation comes from the label ambiguity, or permutation, problem. Because the audio sources are symmetric given the mixture (i.e., x1 + x2 equals x2 + x1, and both x1 and x2 have the same characteristics), there is no pre-determined way to assign the correct source target to the corresponding output layer during supervised training. As a result, the model cannot be well trained to separate speech.

Fortunately, several techniques have been proposed to address the label ambiguity problem [20], [21], [123], [128], [129], [140]. In Weng et al. [123], the instantaneous energy was used to solve the label ambiguity problem, and a two-speaker joint decoder with a speaker-switching penalty was used to separate and trace the speakers. This work achieved the best result on the dataset used in the 2006 monaural speech separation