Feedforward Sequential Memory Neural Networks
without Recurrent Feedback
Shiliang Zhang1, Hui Jiang2, Si Wei3, Lirong Dai1
1National Engineering Laboratory for Speech and Language Information Processing
University of Science and Technology of China, Hefei, Anhui, China
2Department of Electrical Engineering and Computer Science
York University, 4700 Keele Street, Toronto, Ontario, M3J 1P3, Canada
3IFLYTEK Research, Hefei, Anhui, China
zsl2008@mail.ustc.edu.cn, hj@cse.yorku.ca, siwei@iflytek.com, lrdai@ustc.edu.cn
Abstract
We introduce a new structure for memory neural networks, called feedforward sequential
memory networks (FSMN), which can learn long-term dependencies without using recurrent
feedback. The proposed FSMN is a standard feedforward neural network equipped with
learnable sequential memory blocks in its hidden layers. In this work, we apply FSMN to
several language modeling (LM) tasks. Experimental results show that the memory blocks
in FSMN can learn effective representations of long histories, and that FSMN based
language models significantly outperform not only feedforward neural network (FNN)
based LMs but also the popular recurrent neural network (RNN) LMs.
1 Introduction
When machine learning methods are applied to model sequential data such as text, speech and
video, it is very important to take advantage of long-term dependencies. Traditional approaches
capture the long-term structure within the sequential data using recurrent feedback, as in regular
recurrent neural networks (RNNs) or LSTM-based models [1, 2, 3, 4]. RNNs can learn and carry
out complicated transformations of data over extended periods of time and store the memory in the
weights of the network; they are therefore gaining more and more popularity in sequential data
modeling tasks. More recently, and in contrast to RNNs, there has also been a surge in constructing
neural computing models with various forms of explicit memory units [5, 6, 7, 8]. For example,
the memory networks proposed in [6] employ a memory component that can be read from and
written to. The neural Turing machines (NTM) proposed in [5] augment neural networks with
external memory resources and can learn to sort a small set of numbers as well as perform other
symbolic manipulation tasks.
In this work, we propose a simpler structure for memory neural networks, namely feedforward
sequential memory networks (FSMN), which can learn long-term dependencies in sequential data
without using recurrent feedback. FSMN extends standard feedforward neural networks by
introducing memory blocks in the hidden layers. Unlike RNNs, the overall FSMN remains a
feedforward structure, so it can be learned in much more efficient and stable ways than RNNs and
LSTMs. We evaluate the performance of FSMN on two language modeling tasks: the Penn Treebank
(PTB) and the English wiki data in the Large Text Compression Benchmark (LTCB). In both tasks,
FSMN based language models outperform not only standard FNN-LMs but also the popular
recurrent neural network (RNN) LMs by a significant margin.
Figure 1: Illustration of feedforward sequential memory networks and comparison with RNNs: (a) a feedforward sequential memory neural network with a memory block in its second hidden layer; (b) the memory block in FSMN viewed as an FIR filter; (c) the recurrent layer in an RNN viewed as an IIR filter.
2 Our Approach
2.1 Feedforward Sequential Memory Neural Networks
A feedforward sequential memory network (FSMN) is a standard feedforward neural network with
one or more memory blocks in its hidden layers. For instance, Figure 1 (a) shows an FSMN with one
memory block added to its second hidden layer. Given a sequence X = {x1, x2, ···, xT}, where each
xt ∈ R^{D×1} represents the input data at time instance t, the corresponding hidden layer outputs are
denoted as H = {h1, h2, ···, hT}, with each ht ∈ R^{D×1}. As shown in Figure 1 (b), we use a
tapped-delay structure to encode ht and its previous N histories into a fixed-sized representation in
the memory block:
$$
\tilde{h}_t = f\left( \sum_{i=0}^{N} a_i \cdot h_{t-i} \right) \qquad (1)
$$
where the coefficients form an (N+1)-dimensional learnable vector a = {a0, a1, a2, ···, aN}, and f(·) is
the activation function (sigmoid or ReLU). Furthermore, as shown in Figure 1 (a), ˜ht may be fed
into the next hidden layer in the same way as ht.
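To make eq. (1) concrete, the following is a minimal NumPy sketch of the memory block (not the authors' implementation), assuming ReLU as the activation f and stacking the hidden outputs h_1, ..., h_T as rows of a matrix; the function name and the zero-padding at the sentence start are illustrative choices.

```python
import numpy as np

def memory_block(H, a, f=lambda x: np.maximum(0.0, x)):
    """Eq. (1): h_tilde[t] = f(sum_{i=0..N} a[i] * h[t-i]).

    H : (T, D) array with hidden outputs h_1..h_T as rows.
    a : (N+1,) array of learnable filter coefficients a_0..a_N.
    """
    T, _ = H.shape
    N = len(a) - 1
    H_tilde = np.zeros_like(H)
    for t in range(T):
        for i in range(min(N, t) + 1):   # stay inside the sentence boundary
            H_tilde[t] += a[i] * H[t - i]
    return f(H_tilde)
```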
From the viewpoint of signal processing, the memory block in FSMN can be regarded as a high-order
finite impulse response (FIR) filter, while the recurrent layer in RNNs, namely ˜ht = f(ht + W · ˜ht−1),
may be viewed as a first-order infinite impulse response (IIR) filter; see Figure 1 (c). Obviously, the
vector a may be regarded as the coefficients of an N-th order FIR filter. IIR filters are more compact
than FIR filters, but they may be difficult to handle: in some cases IIR filters become unstable,
whereas FIR filters are always stable. Moreover, learning IIR-filter-like RNNs requires the so-called
back-propagation through time (BPTT), which significantly increases the computational complexity
of learning and also causes the vanishing and exploding gradient problems [10]. In contrast, the
proposed FIR-filter-like FSMNs can be efficiently learned using the standard back-propagation
procedure. Therefore, the learning of FSMNs is more stable and efficient than that of RNNs.
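For contrast, here is a sketch of the first-order IIR view of a recurrent layer, ˜ht = f(ht + W · ˜ht−1), under the same conventions as the sketch above; the recursion through prev is what forces BPTT, whereas the FIR memory block only touches a fixed window of past outputs.

```python
import numpy as np

def recurrent_layer(H, W, f=lambda x: np.maximum(0.0, x)):
    """First-order IIR view: h_tilde[t] = f(h[t] + W @ h_tilde[t-1])."""
    T, D = H.shape
    H_tilde = np.zeros_like(H)
    prev = np.zeros(D)
    for t in range(T):
        prev = f(H[t] + W @ prev)   # depends on the entire past through prev
        H_tilde[t] = prev
    return H_tilde
```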
2.2 Implementing FSMNs for language models
The goal of language modeling is to predict the next word in a text sequence given all previous
words. We now explain how to implement FSMNs for this task. An FSMN is a standard feedforward
neural network (FNN) except for the additional memory blocks. We will show that the memory block
can be efficiently implemented with sentence-by-sentence matrix multiplications, which are suitable
for the mini-batch based stochastic gradient descent (SGD) method running on GPUs.
Suppose the N-th order FIR filter coefficients in the memory block are denoted as a =
{a0, a1, a2, ···, aN}. For a given sentence X consisting of T words, we may construct a T × T
square matrix M as follows:

$$
M = \begin{bmatrix}
a_0 & a_1 & \cdots & a_N & 0 & \cdots & \cdots & 0 \\
0 & a_0 & a_1 & \cdots & a_N & 0 & \cdots & 0 \\
\vdots & & \ddots & \ddots & & \ddots & & \vdots \\
\vdots & & & a_0 & & \cdots & & a_N \\
\vdots & & & & \ddots & & & \vdots \\
0 & \cdots & \cdots & & & & & a_0
\end{bmatrix}
$$

Therefore, the sequential memory operations in eq. (1) for the whole sequence can be computed with
one matrix multiplication: $\tilde{H} = f(H \cdot M)$.

Similarly, we may extend this idea to a mini-batch composed of K sentences, L = {X_1, X_2, ···, X_K},
and compute the sequential memory representations for all sentences in the mini-batch as follows:

$$
\tilde{H} = f\left( [H_1, H_2, \cdots, H_K] \cdot
\begin{bmatrix}
M_1 & & & \\
 & M_2 & & \\
 & & \ddots & \\
 & & & M_K
\end{bmatrix} \right) = f(\bar{H} \cdot \bar{M}) \qquad (2)
$$
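The sketch below illustrates this matrix formulation, assuming the hidden vectors are stacked as columns of H (shape D × T) so that M carries a_i on its i-th superdiagonal; the helper names are illustrative, and scipy.linalg.block_diag is used only as a convenience for assembling M̄.

```python
import numpy as np
from scipy.linalg import block_diag

def relu(x):
    return np.maximum(0.0, x)

def build_memory_matrix(T, a):
    """T x T matrix M with a[i] on its i-th superdiagonal, so that
    (H @ M)[:, t] = sum_i a[i] * H[:, t - i]."""
    M = np.zeros((T, T))
    for i, coeff in enumerate(a):
        M += coeff * np.eye(T, k=i)
    return M

# One sentence: H holds the hidden vectors h_t as columns (shape D x T).
D, N = 4, 3
a = np.random.randn(N + 1)
H = np.random.randn(D, 7)
H_tilde = relu(H @ build_memory_matrix(7, a))          # eq. (1) for the whole sentence

# Mini-batch of K sentences: concatenate the H_k and use a block-diagonal M_bar.
lengths = [7, 5, 6]
Hs = [np.random.randn(D, Tk) for Tk in lengths]
M_bar = block_diag(*[build_memory_matrix(Tk, a) for Tk in lengths])
H_bar = np.concatenate(Hs, axis=1)
H_tilde_bar = relu(H_bar @ M_bar)                      # eq. (2)
```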
In the backward pass, in addition to the weights of the network, we also need to calculate the gradient
of M̄, which is then used to update the filter coefficients a. This gradient can be calculated with the
standard back-propagation (BP) algorithm. Therefore, all computation in FSMNs can be formulated
as large matrix multiplications, which can be conducted efficiently on GPUs. As a result, FSMN
based LMs have the same computational complexity in training as standard NN LMs, which is
much lower than that of RNN-LMs.
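As a sketch of this step (under the same column convention as the example above), the gradient of each filter coefficient a_i can be accumulated from the gradient of M by summing its i-th superdiagonal:

```python
import numpy as np

def filter_coefficient_grad(grad_M, N):
    """Accumulate dL/da from dL/dM.

    Since M[t - i, t] = a[i], the gradient of a[i] is the sum of dL/dM over
    the i-th superdiagonal. grad_M itself follows from standard BP for
    H_tilde = f(H @ M):  dL/dM = H.T @ (dL/dH_tilde * f'(H @ M)).
    """
    return np.array([np.trace(grad_M, offset=i) for i in range(N + 1)])
```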
3 Experiments
We have evaluated FSMNs on two benchmark LM tasks: i) the Penn Treebank (PTB) corpus of about
1M words, following the same setup as [11]; ii) the Large Text Compression Benchmark (LTCB)
[12]. For LTCB, we use the enwik9 dataset, which is composed of the first 10^9 bytes of enwiki-
20060303-pages-articles.xml. We split it into three parts: training (153M), validation (8.9M) and
test (8.9M) sets. We limit the vocabulary size to 80k for LTCB and replace all out-of-vocabulary
words with <UNK>.
For FSMNs, the hidden units employ the rectified linear activation function, i.e., f(x) = max(0, x).
The networks are initialized with the normalized initialization in [13], without any pre-training.
We use SGD with mini-batch sizes of 200 and 500 for the PTB and LTCB tasks, respectively. The
initial learning rates are 0.4 for the weights and 0.002 for the filter coefficients; they are kept fixed
as long as the perplexity on the validation set decreases by at least 1. After that, we continue training
for six more epochs, halving the learning rate after each epoch. For the PTB task, we also use
momentum (0.9) and weight decay (0.00004) to avoid overfitting.
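For illustration, here is a sketch of this learning-rate schedule, where train_epoch and eval_ppl are hypothetical callbacks standing in for one epoch of SGD training and a validation-set perplexity evaluation:

```python
def train_with_schedule(train_epoch, eval_ppl, lr_weights=0.4, lr_coeffs=0.002):
    """Keep the initial learning rates while validation perplexity improves by
    at least 1; then train six more epochs, halving the rates after each one."""
    prev_ppl = float("inf")
    while True:                                  # phase 1: fixed learning rates
        train_epoch(lr_weights, lr_coeffs)
        ppl = eval_ppl()
        if prev_ppl - ppl < 1.0:
            break
        prev_ppl = ppl
    for _ in range(6):                           # phase 2: halve after each epoch
        train_epoch(lr_weights, lr_coeffs)
        lr_weights, lr_coeffs = lr_weights / 2.0, lr_coeffs / 2.0
    return eval_ppl()
```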
3.1 Results
We have first evaluated the performance of FSMN-LMs on the PTB task. We have trained an FSMN
with an input context window of two, where the previous two words are used to predict the next
word. The FSMN contains a linear projection layer (of 200 nodes), two hidden layers (of 400 nodes
per layer) and a memory block in the first hidden layer. For the memory block, we use a 20th-order
FIR filter. In Table 1, we have summarized the perplexities on the PTB test set for various language
models.
For the LTCB task, we have trained several baseline systems: i) two n-gram LMs (3-gram and
5-gram) using the modified Kneser-Ney smoothing without count cutoffs; ii) several traditional
feedforward NNLMs with different model sizes and input context windows (bigram, trigram, 4-
gram and 5-gram); iii) an RNN-LM with one hidden layer of 600 nodes trained using the toolkit
in [15]; and iv) a 2nd-order FOFE-based FNNLM [14] with different hidden layer sizes.
Table 1: Perplexities on PTB for various LMs.

Model                 Test PPL
KN 5-gram [11]        141
RNNLM [11]            123
LSTM [2]              117
MemN2N [7]            111
trigram FNNLM [14]    131
6-gram FNNLM [14]     113
FOFE-FNNLM [14]       108
FSMN-LM               102
Table 2: Perplexities on LTCB for various language models. (M) denotes a hidden layer with a memory block.

Model         Architecture                     Test PPL
KN 3-gram     -                                156
KN 5-gram     -                                132
RNN-LM        [1*600]-80k                      112
FNNLM         [2*200]-3*600-80k                155
FNNLM         [2*200]-1200-2*600-80k           154
FOFE-FNNLM    [2*200]-3*600-80k                104
FOFE-FNNLM    [2*200]-1200-2*600-80k           107
FSMN-LM       [2*200]-600(M)-600-600-80k       95
FSMN-LM       [2*200]-600-600(M)-600-80k       96
FSMN-LM       [2*200]-600(M)-600(M)-600-80k    92
Moreover, we have examined our FSMN based LMs with different architectures: we have trained a
3-hidden-layer FSMN with a memory block in the first hidden layer, in the second hidden layer, or
in both. The order of the FIR filter is 30 in these experiments. In Table 2, we have summarized the
perplexities on the LTCB test set for various models.
The experimental results in Table 1 and Table 2 show that FSMN based LMs significantly outperform
the baseline higher-order feedforward neural network (FNN) LMs and FOFE-based FNN LMs, as
well as the popular RNN-based LMs.
4 Conclusions and Future Work
In this work, we have proposed a novel neural network architecture, namely feedforward sequential
memory networks (FSMN), which use FIR-filter-like memory blocks in the hidden layers of standard
feedforward neural networks. Experimental results on language modeling tasks have shown that
FSMN can effectively learn long-term history. In future work, we will apply this model to other
sequential data modeling tasks, such as acoustic modeling in speech recognition.
References
[1] S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735-1780, 1997.
[2] A. Graves. Generating sequences with recurrent neural networks. arXiv:1308.0850, 2013.
[3] J. Chung, C. Gulcehre, K. H. Cho, et al. Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv:1412.3555, 2014.
[4] T. Mikolov, A. Joulin, S. Chopra, et al. Learning longer memory in recurrent neural networks. arXiv:1412.7753, 2014.
[5] A. Graves, G. Wayne, and I. Danihelka. Neural Turing machines. arXiv:1410.5401, 2014.
[6] J. Weston, S. Chopra, and A. Bordes. Memory networks. arXiv:1410.3916, 2014.
[7] S. Sukhbaatar, A. Szlam, and J. Weston. End-to-end memory networks. arXiv:1503.08895, 2015.
[8] A. Joulin and T. Mikolov. Inferring algorithmic patterns with stack-augmented recurrent nets. arXiv:1503.01007, 2015.
[9] A. Bordes, N. Usunier, S. Chopra, and J. Weston. Large-scale simple question answering with memory networks. arXiv:1506.02075, 2015.
[10] Y. Bengio, P. Simard, and P. Frasconi. Learning long-term dependencies with gradient descent is difficult. IEEE Transactions on Neural Networks, 5(2):157-166, 1994.
[11] T. Mikolov, S. Kombrink, L. Burget, J. Cernocky, and S. Khudanpur. Extensions of recurrent neural network language model. In Proc. of ICASSP, pages 5528-5531, 2011.
[12] M. Mahoney. Large Text Compression Benchmark. http://mattmahoney.net/dc/textdata.html, 2011.
[13] X. Glorot and Y. Bengio. Understanding the difficulty of training deep feedforward neural networks. In Proc. of AISTATS, 2010.
[14] S. Zhang, H. Jiang, M. Xu, J. Hou, and L. Dai. The fixed-size ordinally-forgetting encoding method for neural network language models. In Proc. of ACL, pages 495-500, 2015.
[15] T. Mikolov, M. Karafiát, L. Burget, J. Cernocký, and S. Khudanpur. Recurrent neural network based language model. In Proc. of Interspeech, pages 1045-1048, 2010.