Feedforward Sequential Memory Neural Networks
without Recurrent Feedback
Shiliang Zhang1, Hui Jiang2, Si Wei3, Lirong Dai1
1National Engineering Laboratory for Speech and Language Information Processing
University of Science and Technology of China, Hefei, Anhui, China
2Department of Electrical Engineering and Computer Science
York University, 4700 Keele Street, Toronto, Ontario, M3J 1P3, Canada
3IFLYTEK Research, Hefei, Anhui, China
zsl2008@mail.ustc.edu.cn, hj@cse.yorku.ca, siwei@iflytek.com, lrdai@ustc.edu.cn
Abstract
We introduce a new structure for memory neural networks, called feedforward sequential
memory networks (FSMN), which can learn long-term dependencies without using recurrent
feedback. The proposed FSMN is a standard feedforward neural network equipped with
learnable sequential memory blocks in its hidden layers. In this work, we apply FSMN to
several language modeling (LM) tasks. Experimental results show that the memory blocks
in FSMN can learn effective representations of long histories, and that FSMN based
language models significantly outperform not only feedforward neural network (FNN)
based LMs but also the popular recurrent neural network (RNN) LMs.
1 Introduction
When machine learning methods are applied to model sequential data such as text, speech and
video, it is very important to take advantage of long-term dependencies. Traditional approaches
capture the long-term structure within the sequential data using recurrent feedback, as in regular
recurrent neural networks (RNNs) or LSTM-based models [1, 2, 3, 4]. RNNs can learn and carry
out complicated transformations of data over extended periods of time and store the memory in the
weights of the network; they are therefore gaining more and more popularity in sequential data
modeling tasks. More recently, and in contrast to RNNs, there has also been a surge in constructing
neural computing models with various forms of explicit memory units [5, 6, 7, 8]. For example,
the memory networks proposed in [6] employ a memory component that can be read from and
written to. The neural Turing machines (NTM) proposed in [5] augment neural networks with
external memory resources and can learn to sort a small set of numbers as well as perform other
symbolic manipulation tasks.
In this work, we propose a simpler structure for memory neural networks, namely feedforward
sequential memory networks (FSMN), which can learn long-term dependencies in sequential data
without using recurrent feedback. FSMN extends standard feedforward neural networks by
introducing memory blocks in the hidden layers. Unlike RNNs, the overall FSMN remains a
feedforward structure, so it can be learned in much more efficient and stable ways than RNNs and
LSTMs. We evaluate the performance of FSMN on two language modeling tasks: the Penn Treebank
(PTB) and the English wiki data in the Large Text Compression Benchmark (LTCB). In both tasks,
FSMN based language models outperform not only standard FNN-LMs but also the popular
recurrent neural network (RNN) LMs by a significant margin.
Figure 1: Illustration of feedforward sequential memory networks and comparison with RNNs: (a) a feedforward sequential memory neural network with a memory block in its second hidden layer; (b) the memory block in FSMN viewed as an FIR filter; (c) the recurrent layer in an RNN viewed as an IIR filter.
2 Our Approach
2.1 Feedforward Sequential Memory Neural Networks
A feedforward sequential memory network (FSMN) is a standard feedforward neural network with
one or more memory blocks in its hidden layers. For instance, Figure 1 (a) shows an FSMN with one
memory block added to its second hidden layer. Given a sequence X = {x1, x2, ···, xT}, where each
xt ∈ R^{D×1} represents the input data at time instance t, the corresponding hidden layer outputs are
denoted as H = {h1, h2, ···, hT}, with each ht ∈ R^{D×1}. As shown in Figure 1 (b), we use a
tapped-delay structure to encode ht and its previous N histories into a fixed-sized representation in
the memory block:
$$
\tilde{h}_t = f\left( \sum_{i=0}^{N} a_i \cdot h_{t-i} \right) \qquad (1)
$$
where the coefficients form an (N+1)-dimensional learnable vector a = {a0, a1, a2, ···, aN}, and f(·) is
the activation function (sigmoid or ReLU). Furthermore, as shown in Figure 1 (a), ˜ht may be fed
into the next hidden layer in the same way as ht.
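To make eq. (1) concrete, the following is a minimal NumPy sketch of the memory block (not the authors' implementation), assuming ReLU as the activation f and stacking the hidden outputs h_1, ..., h_T as rows of a matrix; the function name and the zero-padding at the sentence start are illustrative choices.

```python
import numpy as np

def memory_block(H, a, f=lambda x: np.maximum(0.0, x)):
    """Eq. (1): h_tilde[t] = f(sum_{i=0..N} a[i] * h[t-i]).

    H : (T, D) array with hidden outputs h_1..h_T as rows.
    a : (N+1,) array of learnable filter coefficients a_0..a_N.
    """
    T, _ = H.shape
    N = len(a) - 1
    H_tilde = np.zeros_like(H)
    for t in range(T):
        for i in range(min(N, t) + 1):   # stay inside the sentence boundary
            H_tilde[t] += a[i] * H[t - i]
    return f(H_tilde)
```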
From the viewpoint of signal processing, the memory block in FSMN can be regarded as a high-order
finite impulse response (FIR) filter, while the recurrent layer in RNNs, namely ˜ht = f(ht + W · ˜ht−1),
may be viewed as a first-order infinite impulse response (IIR) filter; see Figure 1 (c). Obviously, the
vector a may be regarded as the coefficients of an N-th order FIR filter. IIR filters are more compact
than FIR filters, but they may be difficult to handle: in some cases IIR filters become unstable,
whereas FIR filters are always stable. Moreover, learning IIR-filter-like RNNs requires the so-called
back-propagation through time (BPTT), which significantly increases the computational complexity
of learning and also causes the vanishing and exploding gradient problems [10]. In contrast, the
proposed FIR-filter-like FSMNs can be efficiently learned using the standard back-propagation
procedure. Therefore, the learning of FSMNs is more stable and efficient than that of RNNs.
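For contrast, here is a sketch of the first-order IIR view of a recurrent layer, ˜ht = f(ht + W · ˜ht−1), under the same conventions as the sketch above; the recursion through prev is what forces BPTT, whereas the FIR memory block only touches a fixed window of past outputs.

```python
import numpy as np

def recurrent_layer(H, W, f=lambda x: np.maximum(0.0, x)):
    """First-order IIR view: h_tilde[t] = f(h[t] + W @ h_tilde[t-1])."""
    T, D = H.shape
    H_tilde = np.zeros_like(H)
    prev = np.zeros(D)
    for t in range(T):
        prev = f(H[t] + W @ prev)   # depends on the entire past through prev
        H_tilde[t] = prev
    return H_tilde
```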
2.2 Implementing FSMNs for language models
The goal of language modeling is to predict the next word in a text sequence given all previous
words. We now explain how to implement FSMNs for this task. An FSMN is a standard feedforward
neural network (FNN) except for the additional memory blocks. We will show that the memory block
can be efficiently implemented with sentence-by-sentence matrix multiplications, which are suitable
for the mini-batch based stochastic gradient descent (SGD) method running on GPUs.
Suppose the N-th order FIR filter coefficients in the memory block are denoted as a =
{a0, a1, a2, ···, aN}. For a given sentence X consisting of T words, we may construct a T × T
square matrix M as follows:

$$
M = \begin{bmatrix}
a_0 & a_1 & \cdots & a_N & 0 & \cdots & \cdots & 0 \\
0 & a_0 & a_1 & \cdots & a_N & 0 & \cdots & 0 \\
\vdots & & \ddots & \ddots & & \ddots & & \vdots \\
\vdots & & & a_0 & & \cdots & & a_N \\
\vdots & & & & \ddots & & & \vdots \\
0 & \cdots & \cdots & & & & & a_0
\end{bmatrix}
$$

Therefore, the sequential memory operations in eq. (1) for the whole sequence can be computed with
one matrix multiplication: $\tilde{H} = f(H \cdot M)$.

Similarly, we may extend this idea to a mini-batch composed of K sentences, L = {X_1, X_2, ···, X_K},
and compute the sequential memory representations for all sentences in the mini-batch as follows:

$$
\tilde{H} = f\left( [H_1, H_2, \cdots, H_K] \cdot
\begin{bmatrix}
M_1 & & & \\
 & M_2 & & \\
 & & \ddots & \\
 & & & M_K
\end{bmatrix} \right) = f(\bar{H} \cdot \bar{M}) \qquad (2)
$$
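The sketch below illustrates this matrix formulation, assuming the hidden vectors are stacked as columns of H (shape D × T) so that M carries a_i on its i-th superdiagonal; the helper names are illustrative, and scipy.linalg.block_diag is used only as a convenience for assembling M̄.

```python
import numpy as np
from scipy.linalg import block_diag

def relu(x):
    return np.maximum(0.0, x)

def build_memory_matrix(T, a):
    """T x T matrix M with a[i] on its i-th superdiagonal, so that
    (H @ M)[:, t] = sum_i a[i] * H[:, t - i]."""
    M = np.zeros((T, T))
    for i, coeff in enumerate(a):
        M += coeff * np.eye(T, k=i)
    return M

# One sentence: H holds the hidden vectors h_t as columns (shape D x T).
D, N = 4, 3
a = np.random.randn(N + 1)
H = np.random.randn(D, 7)
H_tilde = relu(H @ build_memory_matrix(7, a))          # eq. (1) for the whole sentence

# Mini-batch of K sentences: concatenate the H_k and use a block-diagonal M_bar.
lengths = [7, 5, 6]
Hs = [np.random.randn(D, Tk) for Tk in lengths]
M_bar = block_diag(*[build_memory_matrix(Tk, a) for Tk in lengths])
H_bar = np.concatenate(Hs, axis=1)
H_tilde_bar = relu(H_bar @ M_bar)                      # eq. (2)
```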
In the backward pass, in addition to the weights of the network, we also need to calculate the gradient
of M̄, which is then used to update the filter coefficients a. This gradient can be calculated with the
standard back-propagation (BP) algorithm. Therefore, all computation in FSMNs can be formulated
as large matrix multiplications, which can be conducted efficiently on GPUs. As a result, FSMN
based LMs have the same computational complexity in training as standard NN LMs, which is
much lower than that of RNN-LMs.
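As a sketch of this step (under the same column convention as the example above), the gradient of each filter coefficient a_i can be accumulated from the gradient of M by summing its i-th superdiagonal:

```python
import numpy as np

def filter_coefficient_grad(grad_M, N):
    """Accumulate dL/da from dL/dM.

    Since M[t - i, t] = a[i], the gradient of a[i] is the sum of dL/dM over
    the i-th superdiagonal. grad_M itself follows from standard BP for
    H_tilde = f(H @ M):  dL/dM = H.T @ (dL/dH_tilde * f'(H @ M)).
    """
    return np.array([np.trace(grad_M, offset=i) for i in range(N + 1)])
```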
3 Experiments
We have evaluated FSMNs on two benchmark LM tasks: i) the Penn Treebank (PTB) corpus of about
1M words, following the same setup as [11]; ii) the Large Text Compression Benchmark (LTCB)
[12]. For LTCB, we use the enwik9 dataset, which is composed of the first 10^9 bytes of enwiki-
20060303-pages-articles.xml. We split it into three parts: training (153M), validation (8.9M) and
test (8.9M) sets. We limit the vocabulary size to 80k for LTCB and replace all out-of-vocabulary
words with <UNK>.
For FSMNs, the hidden units employ the rectified linear activation function, i.e., f(x) = max(0, x).
The networks are initialized with the normalized initialization in [13], without any pre-training.
We use SGD with mini-batch sizes of 200 and 500 for the PTB and LTCB tasks, respectively. The
initial learning rates are 0.4 for the weights and 0.002 for the filter coefficients; they are kept fixed
as long as the perplexity on the validation set decreases by at least 1. After that, we continue training
for six more epochs, halving the learning rate after each epoch. For the PTB task, we also use
momentum (0.9) and weight decay (0.00004) to avoid overfitting.
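For illustration, here is a sketch of this learning-rate schedule, where train_epoch and eval_ppl are hypothetical callbacks standing in for one epoch of SGD training and a validation-set perplexity evaluation:

```python
def train_with_schedule(train_epoch, eval_ppl, lr_weights=0.4, lr_coeffs=0.002):
    """Keep the initial learning rates while validation perplexity improves by
    at least 1; then train six more epochs, halving the rates after each one."""
    prev_ppl = float("inf")
    while True:                                  # phase 1: fixed learning rates
        train_epoch(lr_weights, lr_coeffs)
        ppl = eval_ppl()
        if prev_ppl - ppl < 1.0:
            break
        prev_ppl = ppl
    for _ in range(6):                           # phase 2: halve after each epoch
        train_epoch(lr_weights, lr_coeffs)
        lr_weights, lr_coeffs = lr_weights / 2.0, lr_coeffs / 2.0
    return eval_ppl()
```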
3.1 Results
We have first evaluated the performance of FSMN-LMs on the PTB task. We have trained an FSMN
with an input context window of two, where the previous two words are used to predict the next
word. The FSMN contains a linear projection layer (of 200 nodes), two hidden layers (of 400 nodes
per layer) and a memory block in the first hidden layer. For the memory block, we use a 20th-order
FIR filter. In Table 1, we have summarized the perplexities on the PTB test set for various language
models.
For the LTCB task, we have trained several baseline systems: i) two n-gram LMs (3-gram and
5-gram) using the modified Kneser-Ney smoothing without count cutoffs; ii) several traditional
feedforward NNLMs with different model sizes and input context windows (bigram, trigram, 4-
gram and 5-gram); iii) an RNN-LM with one hidden layer of 600 nodes trained using the toolkit
in [15]; and iv) a 2nd-order FOFE-based FNNLM [14] with different hidden layer sizes.
Table 1: Perplexities on PTB for various LMs.

Model                 Test PPL
KN 5-gram [11]        141
RNNLM [11]            123
LSTM [2]              117
MemN2N [7]            111
trigram FNNLM [14]    131
6-gram FNNLM [14]     113
FOFE-FNNLM [14]       108
FSMN-LM               102
Table 2: Perplexities on LTCB for various language models. (M) denotes a hidden layer with a memory block.

Model         Architecture                     Test PPL
KN 3-gram     -                                156
KN 5-gram     -                                132
RNN-LM        [1*600]-80k                      112
FNNLM         [2*200]-3*600-80k                155
FNNLM         [2*200]-1200-2*600-80k           154
FOFE-FNNLM    [2*200]-3*600-80k                104
FOFE-FNNLM    [2*200]-1200-2*600-80k           107
FSMN-LM       [2*200]-600(M)-600-600-80k       95
FSMN-LM       [2*200]-600-600(M)-600-80k       96
FSMN-LM       [2*200]-600(M)-600(M)-600-80k    92
Moreover, we have examined our FSMN based LMs with different architectures: we have trained a
3-hidden-layer FSMN with a memory block in the first hidden layer, in the second hidden layer, or
in both. The order of the FIR filter is 30 in these experiments. In Table 2, we have summarized the
perplexities on the LTCB test set for various models.
The experimental results in Table 1 and Table 2 show that FSMN based LMs significantly outperform
the baseline higher-order feedforward neural network (FNN) LMs and FOFE-based FNN LMs, as
well as the popular RNN-based LMs.
4 Conclusions and Future Work
In this work, we have proposed a novel neural network architecture, namely feedforward sequential
memory networks (FSMN), which use FIR-filter-like memory blocks in the hidden layers of standard
feedforward neural networks. Experimental results on language modeling tasks have shown that
FSMN can effectively learn long-term history. In future work, we will apply this model to other
sequential data modeling tasks, such as acoustic modeling in speech recognition.
References
[1] S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735-1780, 1997.
[2] A. Graves. Generating sequences with recurrent neural networks. arXiv:1308.0850, 2013.
[3] J. Chung, C. Gulcehre, K. H. Cho, et al. Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv:1412.3555, 2014.
[4] T. Mikolov, A. Joulin, S. Chopra, et al. Learning longer memory in recurrent neural networks. arXiv:1412.7753, 2014.
[5] A. Graves, G. Wayne, and I. Danihelka. Neural Turing machines. arXiv:1410.5401, 2014.
[6] J. Weston, S. Chopra, and A. Bordes. Memory networks. arXiv:1410.3916, 2014.
[7] S. Sukhbaatar, A. Szlam, and J. Weston. End-to-end memory networks. arXiv:1503.08895, 2015.
[8] A. Joulin and T. Mikolov. Inferring algorithmic patterns with stack-augmented recurrent nets. arXiv:1503.01007, 2015.
[9] A. Bordes, N. Usunier, S. Chopra, and J. Weston. Large-scale simple question answering with memory networks. arXiv:1506.02075, 2015.
[10] Y. Bengio, P. Simard, and P. Frasconi. Learning long-term dependencies with gradient descent is difficult. IEEE Transactions on Neural Networks, 5(2):157-166, 1994.
[11] T. Mikolov, S. Kombrink, L. Burget, J. Cernocky, and S. Khudanpur. Extensions of recurrent neural network language model. In Proc. of ICASSP, pages 5528-5531, 2011.
[12] M. Mahoney. Large Text Compression Benchmark. http://mattmahoney.net/dc/textdata.html, 2011.
[13] X. Glorot and Y. Bengio. Understanding the difficulty of training deep feedforward neural networks. In Proc. of AISTATS, 2010.
[14] S. Zhang, H. Jiang, M. Xu, J. Hou, and L. Dai. The fixed-size ordinally-forgetting encoding method for neural network language models. In Proc. of ACL, pages 495-500, 2015.
[15] T. Mikolov, M. Karafiát, L. Burget, J. Cernocký, and S. Khudanpur. Recurrent neural network based language model. In Proc. of Interspeech, pages 1045-1048, 2010.