ARTICLE IN PRESS
Neurocomputing 000 (2017) 1–8
Contents lists available at ScienceDirect
Neurocomputing
journal homepage: www.elsevier.com/locate/neucom
Joint entity and relation extraction based on a hybrid neural network
Suncong Zheng a, Yuexing Hao a, Dongyuan Lu b, Hongyun Bao a,∗, Jiaming Xu a, Hongwei Hao a, Bo Xu a,c

a Digital Content Technology Research Center, Institute of Automation, Chinese Academy of Sciences, China
b The School of Information Technology and Management, University of International Business and Economics, China
c Center for Excellence in Brain Science and Intelligence Technology, Chinese Academy of Sciences, China

Abstract
Entity and relation extraction is a task that combines detecting entity mentions and recognizing the entities' semantic relationships from unstructured text. We propose a hybrid neural network model to extract entities and their relationships without any handcrafted features. The hybrid neural network contains a novel bidirectional encoder-decoder LSTM module (BiLSTM-ED) for entity extraction and a CNN module for relation classification. The contextual information of entities obtained by BiLSTM-ED is further passed to the CNN module to improve relation classification. We conduct experiments on the public dataset ACE05 (Automatic Content Extraction program) to verify the effectiveness of our method. The proposed method achieves state-of-the-art results on the entity and relation extraction task.
© 2017 Elsevier B.V. All rights reserved.
Article history:
Received 15 July 2016
Revised 20 December 2016
Accepted 25 December 2016
Available online xxx
Keywords:
Neural network
Information extraction
Tagging
Classification
1. Introduction
Entity and relation extraction aims to detect entity mentions and recognize their semantic relationships from text. It is an important issue in knowledge extraction and plays a vital role in the automatic construction of knowledge bases.

Traditional systems treat this task as a pipeline of two separate tasks, i.e., named entity recognition (NER) [1] and relation classification (RC) [2]. This separated framework makes the task easy to deal with, and each component can be designed flexibly, but it pays little attention to the relevance of the two sub-tasks. A joint learning framework is an effective approach to correlate NER and RC, and it can also avoid cascading errors [3]. However, most existing joint methods are feature-based structured systems [3–7]. They need complicated feature engineering and rely heavily on supervised NLP toolkits, which might also lead to error propagation. To reduce the manual work in feature extraction, Miwa and Bansal [8] recently presented a neural network-based method for end-to-end entity and relation extraction. However, when detecting entities, they use an NN structure to predict the entity tags, which neglects the long-distance relationships between tags.
Based on the above analysis, we propose a hybrid neural network model to address these problems, which contains a named entity recognition (NER) module and a relation classification (RC) module. NER and RC share the same bidirectional LSTM encoding layer, which is used to encode each input word by taking into account the context on both sides of the word. Although a bidirectional LSTM can capture long-distance interactions between words, each output entity tag is predicted independently. Hence, we also adopt an LSTM structure to explicitly model tag interactions; compared with the NN decoding manner [8], it can capture the long-distance relationships between tags. As for relation classification, the sub-sentence between two entities has been proven to effectively reflect the entities' relationship [9,10]. Moreover, the bidirectional LSTM encoding layer can obtain the entities' contextual information, which also benefits the identification of relationships between entities. Hence, we adopt a CNN model, which has achieved great success in relation extraction, to extract relations based on the encoding information of the entities and the sub-sentence information.

Compared with classical pipeline methods, our model considers the relevance of the NER and RC modules; compared with existing joint learning methods, it also considers the long-distance relationships between entity tags and requires no complicated feature engineering. We conduct experiments on the public dataset ACE05 (Automatic Content Extraction program)1. Our method achieves state-of-the-art results on the entity and relation extraction task. Besides, we also analyze the performance of the two modules alone.

∗ Corresponding author. E-mail address: hongyun.bao@ia.ac.cn (H. Bao).
http://dx.doi.org/10.1016/j.neucom.2016.12.075
0925-2312/© 2017 Elsevier B.V. All rights reserved.
1 http://www.itl.nist.gov/iad/mig//tests/ace/

On the entity detection task, our
NER module achieves a 2% improvement compared with different kinds of LSTM structures, which verifies the effectiveness of the NER module. On the task of relation classification, the results show that the entities' contextual information, obtained in the encoding procedure, can improve the accuracy of relation classification.
The remainder of the paper is structured as follows. In Section 2, we review related work on named entity recognition, relation classification and the neural networks used in this paper. Section 3 presents our hybrid neural network in detail. In Section 4, we describe the experimental setup and present the experimental results. Finally, we analyze the model in Section 5 and conclude in Section 6.
2. Related work
Entity and relation extraction is an important step in constructing a knowledge base, which benefits many NLP tasks [11] and social media analysis tasks [12,13]. There are two main frameworks for extracting entities and their relationships: the pipeline method and the joint learning model. The pipeline method treats this task as a pipeline of two separate tasks, i.e., named entity recognition (NER) [14–17] and relation classification (RC) [2,9,10,18,19]. The joint model extracts entities and relations simultaneously. Hence, the problem we focus on in this paper is related to named entity recognition, relation classification and joint entity and relation extraction. The methods we use are related to long short-term memory networks (LSTM) and convolutional neural networks (CNN).
2.1. Named entity recognition
Named entity recognition is a classic task in NLP. Most existing NER models are traditional linear statistical models, such as Hidden Markov Models (HMM) and Conditional Random Fields (CRF) [14,20]. Their performance relies heavily on hand-crafted features extracted by NLP tools and external knowledge resources. Recently, several neural network architectures have been successfully applied to NER, which is regarded as a sequential token tagging task. Collobert et al. [21] used a CNN with a CRF on top, together with word embeddings. Recurrent Neural Networks (RNN) have since shown better performance than other neural networks in many sequence-to-sequence tasks. Chiu and Nichols [15] proposed a hybrid model that learns both character-level and word-level features; they decoded each tag independently based on a linear layer and a log-softmax layer. [16,17,22] proposed a BiLSTM with a CRF on top for joint tag decoding. Miwa and Bansal [8] proposed a BiLSTM for encoding and an incremental NN structure to decode tags jointly. These RNN models all use BiLSTM as the encoding model, but their decoding manners differ.
2.2. Relation classification
Relation classification is a widely studied task in the NLP community, and various approaches have been proposed to accomplish it. Existing methods for relation classification can be divided into handcrafted feature based methods [2,23], neural network based methods [19,24–27] and other valuable methods [25,28].
The handcrafted feature based methods focus on using different natural language processing (NLP) tools and knowledge resources to obtain effective handcrafted features. Kambhatla [23] employed a Maximum Entropy model to combine diverse lexical, syntactic and semantic features derived from the text; it is an early work on relation classification, but the features used are not comprehensive. Rink [2] designed 16 kinds of features extracted using many supervised NLP toolkits and resources, including POS, WordNet, dependency parses, etc. It achieved the best result at SemEval-2010 Task 8 among handcrafted-feature-based methods. However, it relied heavily on other NLP tools and also required a lot of work to design and extract features.
In recent years, deep neural models have made significant progress on the task of relation classification. These models can learn effective relation features from the given sentence without complicated feature engineering. The most common neural-network based models applied in this task are Convolutional Neural Networks (CNN) [18,19,27,29,30] and sequential neural networks such as Recurrent Neural Networks (RNN) [31], Recursive Neural Networks (RecNN) [24,32] and Long Short Term Memory
Networks (LSTM) [26,33]. There also exist other valuable methods such as kernel-based methods [28,34] and the compositional model [25]. Nguyen et al. [28] explore the use of innovative kernels based on syntactic and semantic structures for the task, and Sun and Han [34] propose a new tree kernel, called the feature-enriched tree kernel (FTK), for relation extraction. The compositional model
FCM [25] learns representations for the substructures of an annotated sentence. Compared to existing compositional models, FCM can easily handle arbitrary types of input and global information for composition.
2.3. Joint entity and relation extraction
Although the pipeline method offers flexibility in system design, it neglects the relevance of the sub-tasks and may also lead to error propagation [3]. Most existing joint methods are feature-based structured systems [3,4,35–37], which need complicated feature engineering. [35,36] proposed joint models that use the optimal results of the subtasks and seek a globally optimal solution. Singh et al. [37] proposed a single joint graphical model that represents the various dependencies between the subtasks. Li and Ji [3] proposed the first model to incrementally predict entities and relations using a single joint model, a structured perceptron with efficient beam search. Miwa and Sasaki [4] introduced a table to represent the entity and relation structures in sentences, and proposed a history-based beam-search structured learning model. Recently, Miwa and Bansal [8] used an LSTM-based model to extract entities and relations, which reduces the manual work.
2.4. LSTM and CNN models in NLP
The methods used in this paper are based on two neural network models: Convolutional Neural Networks (CNN) and Long Short-Term Memory (LSTM). CNNs were originally invented for computer vision [38] and are widely used to extract image features [39,40]. In recent years, CNNs have been successfully applied to different NLP tasks and have shown their effectiveness in extracting sentence semantics and keyword information [27,41–43]. The Long Short-Term Memory (LSTM) model is a specific kind of recurrent neural network (RNN). LSTM replaces the hidden vector of a recurrent neural network with memory blocks that are equipped with gates, and it can keep long-term memory by training proper gating weights [44,45]. LSTMs have also shown powerful capacity on many NLP tasks such as machine translation [46], sentence representation [47] and relation extraction [26].
In this paper, we propose a hybrid neural network that jointly learns entities and their relationships. Compared with handcrafted-feature-based methods, it can learn the relevant features from the given sentences without complicated feature engineering. Compared with the other neural network based method [8], our method considers the long-distance relationships between entity tags.
Fig. 1. The framework of the hybrid neural network for jointly extracting entities
and relations.
3. Our method
The framework of hybrid neural network is shown in Fig. 1 . The
first layer of hybrid neural network is a bidirectional LSTM en-
coding layer, which is shared by both named entity recognition
(NER) module and relation classification (RC) module. There are
two “channels” after encoding layer, one links to the NER mod-
ule which is a LSTM decoding layer, the other feeds into a CNN
layer to extract the relations. In following parts, we describe these
components in detail.
3.1. Bidirectional LSTM encoding layer
The Bi-LSTM encoding layer contains a word embedding layer, a forward LSTM layer, a backward LSTM layer and a concatenation layer. The word embedding layer converts each word from its 1-hot representation to an embedding vector. Hence, a sequence of words can be represented as W = {w_1, w_2, ..., w_n}, where w_t ∈ R^d is the d-dimensional word vector corresponding to the t-th word in the sentence and n is the length of the given sentence. After the word embedding layer, there are two parallel LSTM layers: the forward LSTM layer and the backward LSTM layer. For each word w_t, the forward layer encodes w_t by considering the contextual information from w_1 to w_t, which is marked as →h_t. In a similar way, the backward layer encodes w_t based on the contextual information from w_n to w_t, which is marked as ←h_t.

The LSTM architecture consists of a set of recurrently connected subnets, known as memory blocks. Each time-step in the forward hidden layer and the backward hidden layer is an LSTM memory block. A block contains one or more self-connected memory cells and three multiplicative units (the input, output and forget gates) that provide continuous analogues of write, read and reset operations for the cells [45]. Fig. 2 provides an illustration of an LSTM memory block with a single cell. At each time-step, an LSTM memory block computes the current hidden vector h_t based on the previous hidden vector h_{t−1}, the previous cell vector c_{t−1} and the current input word embedding w_t, which can be shortly denoted as →h_t = lstm(→h_{t−1}, →c_{t−1}, w_t) and ←h_t = lstm(←h_{t+1}, ←c_{t+1}, w_t). The detailed operation of the LSTM can be defined as follows:

i_t = δ(W_{xi} x_t + W_{hi} h_{t−1} + W_{ci} c_{t−1} + b_i),  (1)
f_t = δ(W_{xf} x_t + W_{hf} h_{t−1} + W_{cf} c_{t−1} + b_f),  (2)
z_t = tanh(W_{xc} x_t + W_{hc} h_{t−1} + b_c),  (3)
c_t = f_t · c_{t−1} + i_t · z_t,  (4)
o_t = δ(W_{xo} x_t + W_{ho} h_{t−1} + W_{co} c_t + b_o),  (5)
h_t = o_t · tanh(c_t),  (6)

Fig. 2. LSTM memory block with one cell.

where i, f and o are the input gate, forget gate and output gate respectively, b is the bias term, c is the cell memory, · denotes element-wise multiplication and W_{(·)} are the parameters. Finally, we concatenate →h_t and ←h_t to represent word t's encoded information, which is denoted as h_t = [→h_t, ←h_t].
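The bidirectional encoding of Eqs. (1)–(6) can be sketched in NumPy as follows. This is a minimal illustration, not the authors' implementation: the class and function names, weight shapes, diagonal (element-wise) peephole connections and random initialization are all assumptions made for the sketch.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class LSTMCell:
    """One LSTM memory block implementing Eqs. (1)-(6) (peephole variant)."""
    def __init__(self, d_in, d_hid, seed=0):
        rng = np.random.default_rng(seed)
        init = lambda *s: rng.normal(0.0, 0.1, s)
        # gates with peephole connections to the cell state
        self.W_xi, self.W_hi, self.W_ci, self.b_i = init(d_hid, d_in), init(d_hid, d_hid), init(d_hid), np.zeros(d_hid)
        self.W_xf, self.W_hf, self.W_cf, self.b_f = init(d_hid, d_in), init(d_hid, d_hid), init(d_hid), np.zeros(d_hid)
        self.W_xc, self.W_hc, self.b_c = init(d_hid, d_in), init(d_hid, d_hid), np.zeros(d_hid)
        self.W_xo, self.W_ho, self.W_co, self.b_o = init(d_hid, d_in), init(d_hid, d_hid), init(d_hid), np.zeros(d_hid)

    def step(self, x_t, h_prev, c_prev):
        i = sigmoid(self.W_xi @ x_t + self.W_hi @ h_prev + self.W_ci * c_prev + self.b_i)  # Eq. (1)
        f = sigmoid(self.W_xf @ x_t + self.W_hf @ h_prev + self.W_cf * c_prev + self.b_f)  # Eq. (2)
        z = np.tanh(self.W_xc @ x_t + self.W_hc @ h_prev + self.b_c)                       # Eq. (3)
        c = f * c_prev + i * z                                                             # Eq. (4)
        o = sigmoid(self.W_xo @ x_t + self.W_ho @ h_prev + self.W_co * c + self.b_o)       # Eq. (5)
        h = o * np.tanh(c)                                                                 # Eq. (6)
        return h, c

def bilstm_encode(W, fwd, bwd):
    """W: (n, d) word embeddings -> (n, 2*d_hid) concatenations [->h_t, <-h_t]."""
    n, d_hid = W.shape[0], fwd.b_i.shape[0]
    h, c = np.zeros(d_hid), np.zeros(d_hid)
    hs_f = []
    for t in range(n):                # forward pass: w_1 .. w_n
        h, c = fwd.step(W[t], h, c)
        hs_f.append(h)
    h, c = np.zeros(d_hid), np.zeros(d_hid)
    hs_b = [None] * n
    for t in range(n - 1, -1, -1):    # backward pass: w_n .. w_1
        h, c = bwd.step(W[t], h, c)
        hs_b[t] = h
    return np.stack([np.concatenate([hf, hb]) for hf, hb in zip(hs_f, hs_b)])
```

For a sentence of n = 5 words with d = 4 and 3 hidden units per direction, `bilstm_encode` returns a (5, 6) matrix whose t-th row is the encoded representation h_t.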
3.2. Named entity recognition (NER) module
Each word will be assigned an entity tag. The tags are com-
monly used encoding scheme: BILOS (Begin, Inside, Last, Outside,
Single) [22,48] . Each tag contains the position information of a
word in the entity. We also adopt a LSTM structure to explicitly
model tag interactions. When detecting the entity tag of word t ,
the inputs of decoding layer are: h t obtained from Bi-LSTM encod-
ing layer, former tag predicted vector T t−1
,
and the former hidden
state of decoding LSTM s t−1 . Each unit of the decoding LSTM is the
same as the encoding lstm memory block except for the input gate,
which can be rewritten as:
=
i t
+
+
+
δ(W xi h t
W ti T t−1
W hi s t−1
)
b i
,
(7)
where the tag predicted vector T is transformed from the hidden
state s as follows:
The final softmax layer computes normalized entity tag probabili-
ties based on the tag predicted vector T t :
=
+
.
b ts
W ts s t
T t
=
+
,
b y
W y T t
y t
(8)
(9)
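As a concrete reading of Eqs. (7)–(9), together with the softmax normalization that follows, the decoding step can be sketched as below. The LSTM transition (whose input gate mixes h_t, T_{t−1} and s_{t−1} as in Eq. (7)) is abstracted into a simple tanh recurrence for brevity; all shapes, names and weights are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def softmax(y):
    e = np.exp(y - y.max())  # subtract max for numerical stability
    return e / e.sum()

def decode_step(h_t, T_prev, s_prev, params):
    """One tag-decoding step: Eqs. (8)-(9) plus softmax, with the full LSTM
    unit replaced by a plain recurrence as a stand-in."""
    W_sh, W_st, W_ss, W_ts, b_ts, W_y, b_y = params
    s_t = np.tanh(W_sh @ h_t + W_st @ T_prev + W_ss @ s_prev)  # stand-in for the LSTM unit
    T_t = W_ts @ s_t + b_ts                                    # Eq. (8): tag predicted vector
    y_t = W_y @ T_t + b_y                                      # Eq. (9)
    p_t = softmax(y_t)                                         # normalized tag probabilities
    return T_t, s_t, p_t
```

Feeding T_t and s_t back into the next call gives greedy left-to-right tag decoding, which is how the tag interactions propagate along the sentence.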
Fig. 3. The convolutional module for relation classification.
p_t^i = exp(y_t^i) / Σ_{j=1}^{nt} exp(y_t^j),  (10)
Fig. 4. The distribution of dataset based on the distance between two entities. The
horizontal axis is the distance between two entities and the vertical axis represents
the number of samples corresponding to distance.
where W_y is the softmax matrix and nt is the total number of entity tags. Because T is similar to a tag embedding and the LSTM is capable of learning long-term dependencies, this manner can model the tag interactions.
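For concreteness, the BILOS scheme can be illustrated with a small helper that converts entity spans into per-word tags; this helper and the entity type label are hypothetical, written only to show what the tag sequence looks like.

```python
def spans_to_bilos(n_words, spans):
    """Convert entity spans [(start, end, type), ...] (end inclusive)
    into BILOS tags: Begin, Inside, Last, Outside, Single."""
    tags = ["O"] * n_words
    for start, end, etype in spans:
        if start == end:
            tags[start] = f"S-{etype}"        # single-word entity
        else:
            tags[start] = f"B-{etype}"        # first word of the entity
            for i in range(start + 1, end):
                tags[i] = f"I-{etype}"        # interior words
            tags[end] = f"L-{etype}"          # last word of the entity
    return tags

# "From New York America" with entities "New York" and "America"
# ("GPE" is a hypothetical type label used only for this example)
print(spans_to_bilos(4, [(1, 2, "GPE"), (3, 3, "GPE")]))
# → ['O', 'B-GPE', 'L-GPE', 'S-GPE']
```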
3.3. Relation classification (RC) module

When recognizing entities' semantic relationships, we merge the encoding information of the entities and the sub-sentence between the entities, then feed them into the CNN model [49]. This can be represented as:

R = CNN([h_{e1}, w_{e1}, w_{e1+1}, ..., w_{e2}, h_{e2}]),  (11)

where R is the relation label, h_e is the encoding information of an entity and w is a word embedding. Note that an entity may contain two or more words; in that case, we sum up the words' encoding vectors to represent the whole entity. CNN denotes the convolutional operations shown in Fig. 3.

In the convolution layer, we use W_c^{(i)} ∈ R^{k×d} to represent the i-th convolution filter and br^{(i)} ∈ R the corresponding bias term, where k is the context window size of the filter. Filter W_c^{(i)} slides through the input sequence S = [h_{e1}, w_{e1}, w_{e1+1}, ..., w_{e2}, h_{e2}] to get the latent features z^{(i)}. The sliding process can be represented as:

z_l^{(i)} = σ(W_c^{(i)} ∗ s_{l:l+k−1} + br^{(i)}),  (12)

where z_l^{(i)} is the feature extracted by filter W_c^{(i)} from word s_l to word s_{l+k−1}. Hence, the latent features of the given sequence S are denoted as z^{(i)} = [z_1^{(i)}, ..., z_{L−k+1}^{(i)}]. We then apply the max-pooling operation to reserve the most prominent feature of filter W_c^{(i)} and denote it as:

z_max^{(i)} = max{z^{(i)}} = max{z_1^{(i)}, ..., z_{L−k+1}^{(i)}}.  (13)

We use multiple filters to extract multiple features. Therefore, the relation features of the given sequence are represented as R_s = [z_max^{(1)}, ..., z_max^{(nr)}], where nr is the number of filters. After that, we use a softmax layer [50] with dropout [51] to classify the relations based on the relation features R_s, which is defined as:

y_r = W_R · (R_s ◦ r) + b_R,  (14)

p_r^i = exp(y_r^i) / Σ_{j=1}^{nc} exp(y_r^j),  (15)

where W_R ∈ R^{nr×nc} is the softmax matrix, nc is the total number of relation classes, ◦ denotes the element-wise multiplication operator and r ∈ R^{nr} is a binary mask vector drawn from a Bernoulli distribution with probability ρ. Dropout guards against overfitting, which makes the model more robust. In Formula (15), p_r^i means the probability that the sentence describes relation i.

3.4. Training and implementation

We train our models to maximize the log-likelihood of the data, and the optimization method we use is RMSprop, proposed by Hinton in [52]. The objective function of the NER module can be defined as:

L_ner = max Σ_{j=1}^{|D|} Σ_{t=1}^{L_j} log(p_t^{(j)} = y_t^{(j)} | x_j, θ_ner),  (16)

where |D| is the size of the dataset, L_j is the length of sentence x_j, y_t^{(j)} is the label of word t in sentence x_j and p_t^{(j)} is the normalized entity tag probability defined in Formula (10). The RC module's objective function is:

L_rc = max Σ_{j=1}^{|D|} log(p_r^{(j)} = y_r^{(j)} | x_j, θ_rc),  (17)

where p_r^{(j)} is defined in Formula (15).

We first train the NER module to recognize the entities and obtain their encoding information, then further train the RC module to classify relations based on the encoding information and the entity combinations.

Specially, we find that if there is a relationship between two entities, their distance is almost always smaller than about 20 words, as shown in Fig. 4. Hence, when determining the relationship between two entities, we also make use of this property: if the distance between two entities is larger than L_max, we assume there is no relationship between them. L_max is around 20 in the ACE05 dataset, based on the statistical results of Fig. 4.
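The convolution, max-pooling and masked softmax of Eqs. (12)–(15) can be sketched in NumPy as follows. This is a minimal sketch, not the trained model: filter values, the input sequence and the use of a sigmoid for σ are placeholder assumptions.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cnn_relation_features(S, filters, biases, k):
    """S: (L, d) input sequence [h_e1, w_e1, ..., w_e2, h_e2].
    Returns R_s: one max-pooled feature per filter (Eqs. (12)-(13))."""
    L = S.shape[0]
    feats = []
    for W_c, br in zip(filters, biases):
        # Eq. (12): slide the (k, d) filter over each window of k rows
        z = [sigmoid(np.sum(W_c * S[l:l + k]) + br) for l in range(L - k + 1)]
        feats.append(max(z))  # Eq. (13): max pooling over positions
    return np.array(feats)

def classify_relation(R_s, W_R, b_R, drop_mask=None):
    """Masked softmax classifier, Eqs. (14)-(15). drop_mask plays the role
    of the Bernoulli vector r; at test time it is all ones."""
    if drop_mask is None:
        drop_mask = np.ones_like(R_s)
    y_r = W_R @ (R_s * drop_mask) + b_R   # Eq. (14)
    e = np.exp(y_r - y_r.max())
    return e / e.sum()                    # Eq. (15)
```

With nr = 5 filters and nc = 13 relation classes (the count used in Section 4), `classify_relation` returns a 13-way probability distribution over relations.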
4. Experiment
4.1. Experimental setting
Datasets. We use the public dataset ACE05 for entity and relation extraction, which contains 6 coarse-grained relation types and an additional "other" relation to denote non-entity or non-relation classes. The 6 coarse-grained relation types are "ART (artifact)", "G-A (Gen-affiliation)", "O-A (Org-affiliation)", "P-W (PART-WHOLE)", "P-S (person-social)" and "PHYS (physical)". The same relation type with opposite directions is considered as two classes. For example, "PART-WHOLE(e1,e2)" and "PART-WHOLE(e2,e1)" are different relations: "PART-WHOLE(e1,e2)" means that e1 is a part of e2, and "PART-WHOLE(e2,e1)" means e1 contains e2. Hence, there are 13 relation classes in total. The data pre-processing and settings in experiments are the same as [3].

Table 1
Hyper parameters of the hybrid neural network.
Parameter | Parameter description                       | Parameter value
d         | Dimension of word embedding                 | 300
ne        | Number of hidden units in the encode layer  | 300
nd        | Number of hidden units in the decode layer  | 300
k         | Context window size of the CNN module       | 3
nr        | Filter number of the CNN                    | 100
ρ         | Ratio of dropout                            | 0.3

Table 2
Comparisons with the baselines on the ACE05 test set.
Model                  | P (%) | R (%) | F1 (%)
Pipeline (CRF+ME) [3]  | 65.1  | 38.1  | 48.0
Joint w/Global [3]     | 65.4  | 39.8  | 49.5
SPTree [8]             | 65.8  | 42.9  | 51.9
Our method             | 61.9  | 45.0  | 52.1
Baselines. The baselines are recent methods on the ACE05 dataset, including a classical pipeline model [3], a joint feature-based model called Joint w/Global [3], and an end-to-end NN-based model SPTree [8].
• Pipeline (CRF+ME) [3] trains a linear-chain Conditional Random Fields model [53] for entity mention extraction and a Maximum Entropy model [54] for relation extraction. It is a classical pipeline method for the task.
• Joint w/Global [3] incrementally extracts entity mentions together with relations using a single model. It develops a number of new and effective global features as soft constraints to capture the interdependency among entity mentions and relations.
• SPTree [8] is a novel end-to-end relation extraction model that represents both word sequence and dependency tree structures using bidirectional sequential and bidirectional tree-structured LSTM-RNNs.
Metrics. To compare our model with the baselines, we use Precision (P), Recall (R) and F-Measure (F1) on the task of joint entity and relation extraction. A relation instance is regarded as correct when its relation type and the head offsets of the two corresponding entities are both correct.
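As a concrete reading of this criterion, a small scoring helper (hypothetical, not from the paper) might count a predicted relation as correct only when the type and both head offsets match exactly:

```python
def prf1(gold, pred):
    """gold, pred: sets of (relation_type, head_offset_e1, head_offset_e2).
    A prediction is correct only if all three fields match exactly."""
    correct = len(gold & pred)
    p = correct / len(pred) if pred else 0.0
    r = correct / len(gold) if gold else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

# Hypothetical example: one of two predictions matches a gold triple
gold = {("PHYS", 3, 7), ("P-W", 1, 4), ("ART", 9, 12)}
pred = {("PHYS", 3, 7), ("P-W", 1, 5)}
p, r, f1 = prf1(gold, pred)  # precision 0.5, recall 1/3, F1 0.4
```

Note that "P-W" with a wrong second head offset counts as fully wrong, which is exactly why recall dominates the F1 differences in Table 2.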
Hyper parameters. In this paper, we propose a hybrid neural network to extract entities and their relations. The hyper parameters used in the model are summarized in Table 1.
4.2. Results
The predicted results on the test set are shown in Table 2. Our method achieves an F1 of 52.1%, the best result among the existing methods. This illustrates the effectiveness of the proposed hybrid neural network on the task of jointly extracting entities and their relationships.

Besides, the Joint w/Global [3] approach outperforms the pipelined method, and the neural network based methods (SPTree [8] and our model) obtain higher F1 results than the feature-based methods [3]. This shows that a neural network model combined with a joint learning manner is a feasible way to extract entities and their relationships.

Notably, the precision results of these methods are similar, and the differences are mainly concentrated in the recall results. Our method balances precision and recall, which leads to a better F1 result.
5. Analysis and discussions
5.1. Analysis of named entity recognition module
The NER module contains a bidirectional LSTM encoding layer and an LSTM decoding layer; we use BiLSTM-ED to denote this structure. To further illustrate the effectiveness of BiLSTM-ED on the task of entity extraction, we compare it with several variations and other effective sequence labeling models. The contrast methods are:
• Forward-LSTM uses a unidirectional LSTM to encode the input sentence from w_1 to w_n, then also applies an LSTM structure to decode the entity tags.
• Backward-LSTM works in a similar manner to Forward-LSTM, except that the encoding order is from w_n to w_1.
• BiLSTM-NN uses a bi-directional LSTM to encode the input sentence and a feed-forward neural network (NN) to predict the entity tags. It neglects the relationships between tags.
• BiLSTM-NN-2 [8] uses a bi-directional LSTM to encode the input sentence and a novel feed-forward NN that considers adjacent tag information instead of the long-distance relationships between tags.
• CRF [53] is a classic and effective sequence labeling model. We use CRF as a strong comparison method; the features used in the CRF are the same as those in [3].
We use the standard F1 to evaluate the performance of these methods, and treat an entity as correct when its type and the region of its head are both correct. Table 3 shows the results of the above methods on the task of named entity recognition. Compared with Forward-LSTM and Backward-LSTM, the bi-directional LSTM encoding manner yields significant improvements: Bi-LSTM encoding considers the whole sentence rather than one direction only, hence it achieves much higher accuracy in the tagging task. BiLSTM-NN-2 is better than BiLSTM-NN, which shows the need to consider the relationships between tags. Besides, BiLSTM-ED is better than BiLSTM-NN-2, which means that considering the long-distance relationships between tags is better than only considering adjacent tag information. We also compare BiLSTM-ED with the well-known CRF sequence model; the result again shows the effectiveness of BiLSTM-ED.
5.2. Analysis of relation classification module
In the relation classification module, we use two kinds of infor-
mation: the sub-sentence between entities and the encoding infor-
mation of entities obtained from bidirectional LSTM layer. In order
to illustrate the effectiveness of these information we considered,
Table 3
Comparisons with the different methods on the task of entity detection.
Methods       | P (%) | R (%) | F1 (%)
Forward-LSTM  | 63.8  | 59.2  | 60.0
Backward-LSTM | 65.3  | 60.0  | 61.0
CRF           | 83.2  | 73.6  | 78.1
BiLSTM-NN     | 83.3  | 83.0  | 82.2
BiLSTM-NN-2   | 85.5  | 81.2  | 83.3
BiLSTM-ED     | 85.2  | 85.4  | 84.2
Table 4
Comparisons of different information on the task of relation classification.
Methods   | P (%) | R (%) | F1 (%)
Full-CNN  | 30.8  | 34.9  | 32.7
Sub-CNN   | 57.7  | 51.9  | 54.6
Sub-CNN-H | 58.3  | 54.8  | 56.5
Fig. 5. The F 1 results of different L max values. The horizontal axis is the distance
between two entities and its range is from 5 to 30. The vertical axis represents F 1
value on the relation classification tasks. In order to exclude the effect of encoding
information, we use Sub-CNN to obtain the F 1 results.
we compare our method with its different variations. We first use the NER module to detect the entities in a sentence, then use the correctly recognized entities of this first step to test the RC module. We report the effect of the different information on the relation classification task in Table 4. Full-CNN uses the whole sentence to recognize the relationships of entities. Sub-CNN only uses the sub-sentence between the two entities. Sub-CNN-H uses both the sub-sentence and the encoding information of the entities obtained from the bidirectional encoding layer. Comparing Full-CNN with Sub-CNN, the result shows that the sub-sentence achieves about a +20% improvement. This matches the analysis of [9] that most relationships can be reflected by the sub-sentence between the given two entities instead of the full sentence. When the encoding information of the entities is added to Sub-CNN, Sub-CNN-H further improves the accuracy of relation classification. This verifies that the entities' contextual information also benefits the identification of relationships between entities.
5.3. The effect of two entities’ distance
From Fig. 4 , we know that the data distribution shows the long
tail property when the horizontal axis is the distance between two
entities. Hence, we set a threshold L max to filter the data. If two
entities’ distance is larger than L max , we treat that these two enti-
ties have no relationship. In order to analyze the effect of thresh-
old L max , we use Sub-CNN to predict entities relationships based
on different L max values. The effect is shown in Fig. 5 . The smaller
of L max is , the more data will be filtered. So if L max is too small, it
may filter the right data and make the F 1 results decline. If L max is
too large, it cannot filter the noisy data which may also hurt the
final results. Fig. 5 shows that when L max is between 10 and 25, it
can perform well. The range also matches the statistical results of
Fig. 4 .
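The L_max filtering heuristic can be sketched as a simple candidate-pair generator. This is an illustrative sketch under the stated distance property, not the authors' code; the span representation and the gap computation are assumptions.

```python
def candidate_pairs(entities, l_max=20):
    """entities: list of (start, end) spans in word offsets, end inclusive.
    Keep only ordered pairs whose word gap is at most l_max; pairs farther
    apart are assumed to have no relation (the L_max heuristic)."""
    pairs = []
    for i, (s1, e1) in enumerate(entities):
        for j, (s2, e2) in enumerate(entities):
            if i == j:
                continue
            # number of words between the two spans
            gap = s2 - e1 if s2 > e1 else s1 - e2
            if gap <= l_max:
                pairs.append((i, j))
    return pairs

ents = [(0, 1), (5, 5), (40, 41)]
print(candidate_pairs(ents, l_max=20))
# → [(0, 1), (1, 0)]  (the distant third entity is excluded)
```

Pruning pairs this way shrinks the RC module's candidate set while, per Fig. 4, discarding very few true relations.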
5.4. Error analysis
To analyze the errors of our method, we visualized the model’s
predicted results on relation classification task as Fig. 6 shows.
The diagonal region indicates the correct prediction results and the
other regions reflect the distribution of error samples. The high-
lighted diagonal region means that our method can perform well
on each relation class except for the relation “P-S”. Because the test
dataset contains a few samples whose relation labels are “P-S”, the
predicted distribution of “P-S” cannot fully reflect the true situa-
tion. Besides, “P-S” means the relationship of “person-social”. The
entity “person” and entity “social” always are pronoun words in
the dataset, so it is hard to recognize “P-S” relationship based on
these pronoun words.
Further more, from Fig. 6 , we also can see that the distribu-
tion of predicted relation is relatively dispersed on the first row
of “OTHER”, which means that most of the specific relation classes
can be predicted as the “OTHER”. Namely, we cannot identify some
relationships and it directly leads to relatively low recall. From the
first column of “OTHER”, we can see that if there is no relation-
ship between the two entities the model can be effectively dis-
criminated.
Apart from the class "OTHER", the other problem is that the same relation type with opposite directions is easy to mix up, such as P-W(e2e1) and P-W(e1e2), ART(e1e2) and ART(e2e1), and O-A(e1e2) and O-A(e2e1). The reason is that the same relation type usually has a similar description even when the two entities appear in opposite order.
Fig. 6. The distribution of the predicted results for each relation class. The horizontal axis is the target relation, and each target relation corresponds to a column of predicted relations. Point (X, Y) is the ratio of samples whose target relation is X and whose predicted relation is Y. The values in each column sum to 1.
Please cite this article as: S. Zheng et al., Joint entity and relation extraction based on a hybrid neural network, Neurocomputing (2017), http://dx.doi.org/10.1016/j.neucom.2016.12.075
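The column-wise normalization described in the caption of Fig. 6 can be reproduced with a few lines (the small count matrix below is made up for illustration; the real matrix has one row and column per relation class):

```python
import numpy as np

def column_normalize(confusion):
    """Normalize a confusion matrix so each column (target class) sums to 1.

    confusion[y, x] = count of samples whose target class is x and whose
    predicted class is y, matching the (X, Y) layout in the Fig. 6 caption.
    """
    confusion = np.asarray(confusion, dtype=float)
    col_sums = confusion.sum(axis=0, keepdims=True)
    col_sums[col_sums == 0] = 1.0  # avoid division by zero for empty columns
    return confusion / col_sums

# rows = predicted class, columns = target class
counts = [[8, 1],
          [2, 9]]
print(column_normalize(counts))
```

Here the first column becomes (0.8, 0.2): 80% of the samples of the first target class are predicted correctly and 20% are confused with the second class.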
6. Conclusion
Entity and relation extraction is an important issue in knowledge extraction and plays a vital role in the automatic construction of knowledge bases. In this paper, we propose a hybrid neural network model to extract entities and their semantic relationships without any handcrafted features. Compared with other neural-network-based methods, our method considers the long-distance relationships between entity tags. The experimental results on the public dataset ACE05 (Automatic Content Extraction program) verify the effectiveness of our method.
In the future, we will explore how to better link the two modules of the neural network so that the model can perform better. Besides, we also need to address the problem of missing some relationships and try to improve the recall.
Acknowledgment
We thank Qi Li and Miwa for dataset details and helpful discussions. We also thank Qi Li for providing the partition of the dataset so that we could conduct comparison experiments in a fair environment. This work is supported by the National High Technology Research and Development Program of China (863 Program) (Grant No. 2015AA015402), the Hundred Talents Program of the Chinese Academy of Sciences (No. Y3S4011D31), and the National Natural Science Foundation of China (Grant No. 71402178).
References
[1] D. Nadeau, S. Sekine, A survey of named entity recognition and classification, Lingvisticae Investigationes 30 (1) (2007) 3–26.
[2] B. Rink, UTD: classifying semantic relations by combining lexical and semantic resources, in: Proceedings of the 5th International Workshop on Semantic Evaluation, 2010, pp. 256–259.
[3] Q. Li, H. Ji, Incremental joint extraction of entity mentions and relations, in: Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics, 2014, pp. 402–412.
[4] M. Miwa, Y. Sasaki, Modeling joint entity and relation extraction with table representation, in: Proceedings of the Conference on Empirical Methods in Natural Language Processing, 2014, pp. 1858–1869.
[5] Y.S. Chan, D. Roth, Exploiting syntactico-semantic structures for relation extraction, in: Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics, 2011, pp. 551–560.
[6] X. Yu, W. Lam, Jointly identifying entities and extracting relations in encyclopedia text via a graphical model approach, in: Proceedings of the COLING International Conference, 2010, pp. 1399–1407.
[7] L. Li, J. Zhang, L. Jin, R. Guo, D. Huang, A distributed meta-learning system for Chinese entity relation extraction, Neurocomputing 149 (2015) 1135–1142.
[8] M. Miwa, M. Bansal, End-to-end relation extraction using LSTMs on sequences and tree structures, in: Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, 2016.
[9] C.N. dos Santos, B. Xiang, B. Zhou, Classifying relations by ranking with convolutional neural networks, in: Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing, vol. 1, 2015, pp. 626–634.
[10] Y. Xu, L. Mou, G. Li, Y. Chen, H. Peng, Z. Jin, Classifying relations via long short term memory networks along shortest dependency paths, in: Proceedings of the Conference on Empirical Methods in Natural Language Processing, 2015.
[11] L. Zou, R. Huang, H. Wang, J.X. Yu, W. He, D. Zhao, Natural language question answering over RDF: a graph data driven approach, in: Proceedings of the 2014 ACM SIGMOD International Conference on Management of Data, ACM, 2014, pp. 313–324.
[12] J. Sang, C. Xu, J. Liu, User-aware image tag refinement via ternary semantic analysis, IEEE Trans. Multimed. 14 (3) (2012) 883–895.
[13] J. Sang, C. Xu, Right buddy makes the difference: an early exploration of social relation analysis in multimedia applications, in: Proceedings of the 20th ACM International Conference on Multimedia, ACM, 2012, pp. 19–28.
[14] G. Luo, X. Huang, C.-Y. Lin, Z. Nie, Joint entity recognition and disambiguation, in: Proceedings of the Conference on Empirical Methods in Natural Language Processing, 2015, pp. 879–888.
[15] J.P. Chiu, E. Nichols, Named entity recognition with bidirectional LSTM-CNNs, arXiv:1511.08308 (2015).
[16] Z. Huang, W. Xu, K. Yu, Bidirectional LSTM-CRF models for sequence tagging, arXiv:1508.01991 (2015).
[17] G. Lample, M. Ballesteros, S. Subramanian, K. Kawakami, C. Dyer, Neural architectures for named entity recognition, in: Proceedings of the Annual Conference of the North American Chapter of the Association for Computational Linguistics, 2016.
[18] K. Xu, Y. Feng, S. Huang, D. Zhao, Semantic relation classification via convolutional neural networks with simple negative sampling, arXiv:1506.07650 (2015).
[19] D. Zeng, K. Liu, G. Zhou, J. Zhao, Relation classification via convolutional deep neural network, in: Proceedings of the 25th COLING International Conference, 2014, pp. 2335–2344.
[20] A. Passos, V. Kumar, A. McCallum, Lexicon infused phrase embeddings for named entity resolution, in: Proceedings of the International Conference on Computational Linguistics, 2014, pp. 78–86.
[21] R. Collobert, J. Weston, L. Bottou, M. Karlen, K. Kavukcuoglu, P. Kuksa, Natural language processing (almost) from scratch, J. Mach. Learn. Res. 12 (2011) 2493–2537.
[22] X. Ma, E. Hovy, End-to-end sequence labeling via bi-directional LSTM-CNNs-CRF, arXiv:1603.01354 (2016).
[23] N. Kambhatla, Combining lexical, syntactic, and semantic features with maximum entropy models for extracting relations, in: Proceedings of the ACL International Conference, 2004, p. 22.
[24] R. Socher, B. Huval, C.D. Manning, A.Y. Ng, Semantic compositionality through recursive matrix-vector spaces, in: Proceedings of the EMNLP International Conference, 2012, pp. 1201–1211.
[25] M. Yu, M. Gormley, M. Dredze, Factor-based compositional embedding models, in: Proceedings of the NIPS Workshop on Learning Semantics, 2014.
[26] Y. Xu, L. Mou, G. Li, Y. Chen, H. Peng, Z. Jin, Classifying relations via long short term memory networks along shortest dependency paths, in: Proceedings of the EMNLP International Conference, 2015.
[27] C.N. dos Santos, B. Xiang, B. Zhou, Classifying relations by ranking with convolutional neural networks, in: Proceedings of the 53rd ACL International Conference, vol. 1, 2015, pp. 626–634.
[28] T.-V.T. Nguyen, A. Moschitti, G. Riccardi, Convolution kernels on constituent, dependency and sequential structures for relation extraction, in: Proceedings of the EMNLP International Conference, 2009, pp. 1378–1387.
[29] P. Qin, W. Xu, J. Guo, An empirical convolutional neural network approach for semantic relation classification, Neurocomputing 190 (2016) 1–9.
[30] S. Zheng, J. Xu, P. Zhou, H. Bao, Z. Qi, B. Xu, A neural network framework for relation extraction: learning entity semantic and relation pattern, Knowl. Based Syst. 114 (2016) 12–23.
[31] D. Zhang, D. Wang, Relation classification via recurrent neural network, arXiv:1508.01006 (2015).
[32] J. Ebrahimi, D. Dou, Chain based RNN for relation classification, in: Proceedings of the NAACL International Conference, 2015, pp. 1244–1249.
[33] S. Zhang, D. Zheng, X. Hu, M. Yang, Bidirectional long short-term memory networks for relation classification, in: Proceedings of the Pacific Asia Conference on Language, Information and Computation, 2015, pp. 73–78.
[34] L. Sun, X. Han, A feature-enriched tree kernel for relation extraction, in: Proceedings of the 52nd ACL International Conference, 2014, pp. 61–67.
[35] D. Roth, W.-t. Yih, Global inference for entity and relation identification via a linear programming formulation, in: Introduction to Statistical Relational Learning, 2007, pp. 553–580.
[36] B. Yang, C. Cardie, Joint inference for fine-grained opinion extraction, in: Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics, 2013, pp. 1640–1649.
[37] S. Singh, S. Riedel, B. Martin, J. Zheng, A. McCallum, Joint inference of entities, relations, and coreference, in: Proceedings of the 2013 Workshop on Automated Knowledge Base Construction, ACM, 2013, pp. 1–6.
[38] Y. LeCun, L. Bottou, Y. Bengio, P. Haffner, Gradient-based learning applied to document recognition, Proc. IEEE 86 (11) (1998) 2278–2324.
[39] J. Yu, X. Yang, F. Gao, D. Tao, Deep multimodal distance metric learning using click constraints for image ranking, IEEE Trans. Cybern. (2016), doi:10.1109/TCYB.2016.2591583.
[40] J. Yu, B. Zhang, Z. Kuang, D. Lin, J. Fan, Image privacy protection by identifying sensitive objects via deep multi-task learning, IEEE Trans. Inf. Forensics Secur. (2016).
[41] Y. Kim, Convolutional neural networks for sentence classification, in: Proceedings of the EMNLP International Conference, 2014.
[42] N. Kalchbrenner, E. Grefenstette, P. Blunsom, A convolutional neural network for modelling sentences, in: Proceedings of the 52nd ACL International Conference, 2014.
[43] P. Wang, B. Xu, J. Xu, G. Tian, C.-L. Liu, H. Hao, Semantic expansion using word embedding clustering and convolutional neural network for improving short text classification, Neurocomputing 174 (2016) 806–814.
[44] X. Zhu, P. Sobihani, H. Guo, Long short-term memory over recursive structures, in: Proceedings of the 32nd International Conference on Machine Learning (ICML-15), 2015, pp. 1604–1612.
[45] A. Graves, Supervised Sequence Labelling, Springer, 2012.
[46] M.-T. Luong, I. Sutskever, Q.V. Le, O. Vinyals, W. Zaremba, Addressing the rare word problem in neural machine translation, in: Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing, 2015, pp. 11–19.
[47] R. Kiros, Y. Zhu, R.R. Salakhutdinov, R. Zemel, R. Urtasun, A. Torralba, S. Fidler, Skip-thought vectors, in: Proceedings of the Advances in Neural Information Processing Systems, 2015, pp. 3276–3284.
[48] L. Ratinov, D. Roth, Design challenges and misconceptions in named entity recognition, in: Proceedings of the Thirteenth Conference on Computational Natural Language Learning, Association for Computational Linguistics, 2009, pp. 147–155.
[49] N. Kalchbrenner, E. Grefenstette, P. Blunsom, A convolutional neural network for modelling sentences, in: Proceedings of the Conference on Empirical Methods in Natural Language Processing, 2014.
[50] K. Duan, S.S. Keerthi, W. Chu, S.K. Shevade, A.N. Poo, Multi-category classification by soft-max combination of binary classifiers, in: Multiple Classifier Systems, Springer, 2003, pp. 125–134.
[51] G.E. Dahl, T.N. Sainath, G.E. Hinton, Improving deep neural networks for LVCSR using rectified linear units and dropout, in: Proceedings of the ICASSP, 2013, pp. 8609–8613.
[52] T. Tieleman, G. Hinton, Lecture 6.5-RMSProp, COURSERA: Neural Networks for Machine Learning (2012).
[53] J. Lafferty, A. McCallum, F. Pereira, Conditional random fields: probabilistic models for segmenting and labeling sequence data, in: Proceedings of the Eighteenth International Conference on Machine Learning, ICML, vol. 1, 2001, pp. 282–289.
[54] S.J. Phillips, R.P. Anderson, R.E. Schapire, Maximum entropy modeling of species geographic distributions, Ecol. Modell. 190 (3) (2006) 231–259.
Suncong Zheng is a Ph.D. candidate at the Institute of Automation, Chinese Academy of Sciences. He received his B.S. degree from Tianjin University, China, in 2012. His research interests include information extraction and web/text mining.
Yuexing Hao is an M.S. candidate at the Institute of Automation, Chinese Academy of Sciences. She received her B.S. degree from the University of Science and Technology Beijing, China, in 2014. Her research interests include information extraction and web/text mining.
Dongyuan Lu is a Lecturer with the School of Information
Technology and Management, University of International
Business and Economics, Beijing, China. She received the
B.S. degree from Beijing Normal University, Beijing, China,
in 2007, and the Ph.D. degree from the Institute of Au-
tomation, Chinese Academy of Sciences, Beijing, China, in
2012. She then continued her research work at the National University of Singapore as a research fellow for two years.
Her research interests include social media analysis, in-
formation retrieval, and data mining.
Hongyun Bao is an assistant researcher at the Institute of Automation, Chinese Academy of Sciences. She received her B.S. degree from the School of Mathematical Sciences, Capital Normal University, China, in 2008, and her Ph.D. degree from the Chinese Academy of Sciences in 2013. Her research interests include information extraction and web/text mining.
Jiaming Xu is an assistant researcher at the Institute of Automation, Chinese Academy of Sciences. He received his M.S. degree from the University of Science and Technology Beijing, China, in 2012, and his Ph.D. degree from the Chinese Academy of Sciences in 2016. His research interests include information extraction and web/text mining.
Hongwei Hao is the deputy director of Interactive Digital
Media Technology Research Center, Institute of Automa-
tion, Chinese Academy of Sciences. His research interests
include semantic computation, pattern recognition, machine learning, and image processing. He has published over 50 papers in Chinese journals, international journals, and conferences.
Bo Xu is a Professor at the Institute of Automation, Chinese Academy of Sciences. He received his B.S. degree from Zhejiang University in 1988. He joined speech recognition research in 1988 and received his Master's and Doctoral degrees in the field in 1992 and 1997, respectively. He is now the President of CASIA and serves on the committee of the National High-Tech Program in the fields of Chinese information processing, multimedia, and virtuality. He has published more than 100 papers in major journals and proceedings, including IEEE Transactions.